
WO2021000497A1 - Retrieval method and apparatus, and computer device and storage medium - Google Patents

Retrieval method and apparatus, and computer device and storage medium

Info

Publication number
WO2021000497A1
WO2021000497A1 · PCT application PCT/CN2019/118254 (priority CN2019118254W)
Authority
WO
WIPO (PCT)
Prior art keywords
model
text
feature
recognized
language
Prior art date
Application number
PCT/CN2019/118254
Other languages
French (fr)
Chinese (zh)
Inventor
王建华 (Wang Jianhua)
马琳 (Ma Lin)
张晓东 (Zhang Xiaodong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021000497A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1822 Parsing for meaning understanding
    • G10L 15/26 Speech to text systems
    • G10L 2015/088 Word spotting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to a retrieval method, device, computer equipment and storage medium.
  • a retrieval method, device, computer equipment, and storage medium are provided.
  • a retrieval method including:
  • the second feature data is an analysis result of sentiment analysis on the recognized text
  • the target text is obtained; wherein the word preprocessing includes word segmentation, removal of stop words, and word filtering;
  • the first feature data, the second feature data, and the target text are input into a text classification model; the text classification model obtains a successfully matched first logical rule according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information;
  • the target retrieval content is obtained by searching according to the key information.
  • a retrieval device including
  • the voice acquisition module is used to acquire the voice to be recognized
  • a speech recognition module configured to input the to-be-recognized speech into a trained speech recognition model for recognition to obtain recognized text;
  • the key information confirmation module is used to input the recognized text into the trained semantic analysis model and the sentiment analysis model to obtain the first feature data and the second feature data respectively, wherein the first feature data is the analysis result of semantic analysis on the recognized text and the second feature data is the analysis result of sentiment analysis on the recognized text; it is also used to obtain the target text after performing word preprocessing on the recognized text, wherein the word preprocessing includes word segmentation, removal of stop words, and word filtering; it is also used to input the first feature data, the second feature data, and the target text into a text classification model, which obtains a successfully matched first logical rule according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information; and
  • the retrieval module is used to retrieve the target retrieval content according to the key information.
  • a computer device including a memory and one or more processors is provided; the memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to execute the following steps:
  • the second feature data is an analysis result of sentiment analysis on the recognized text
  • the target text is obtained; wherein the word preprocessing includes word segmentation, removal of stop words, and word filtering;
  • the first feature data, the second feature data, and the target text are input into a text classification model; the text classification model obtains a successfully matched first logical rule according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information;
  • the target retrieval content is obtained by searching according to the key information.
  • One or more non-volatile storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the second feature data is an analysis result of sentiment analysis on the recognized text
  • the target text is obtained; wherein the word preprocessing includes word segmentation, removal of stop words, and word filtering;
  • the first feature data, the second feature data, and the target text are input into a text classification model; the text classification model obtains a successfully matched first logical rule according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information;
  • the target retrieval content is obtained by searching according to the key information.
  • Fig. 1 is an application scenario diagram of the retrieval method according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a retrieval method according to one or more embodiments.
  • Fig. 3 is a schematic flow diagram of speech recognition according to one or more embodiments.
  • Fig. 4 is a schematic diagram of a process of speech recognition according to one or more embodiments.
  • Fig. 5 is a schematic flowchart of training steps of a model to be trained according to one or more embodiments.
  • Fig. 6 is a schematic flowchart of a training step of a speech recognition model according to one or more embodiments.
  • Fig. 7 is a block diagram of a retrieval device according to one or more embodiments.
  • Figure 8 is a block diagram of a computer device according to one or more embodiments.
  • Fig. 1 is a diagram of the application environment of the retrieval method in an embodiment.
  • the application environment includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 communicate through a network.
  • the communication network may be a wireless or wired communication network, such as an IP network or a cellular mobile communication network; the number of terminals and servers is not limited.
  • the terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 obtains the voice to be recognized.
  • the terminal 110 inputs the voice to be recognized into the trained voice recognition model for recognition, and the recognized text is obtained.
  • the terminal 110 inputs the recognized text into the trained semantic analysis model and the sentiment analysis model to obtain the first feature data and the second feature data respectively.
  • the terminal 110 performs word preprocessing on the recognized text to obtain the target text, and inputs the first feature data, the second feature data, and the target text into the text classification model.
  • the text classification model obtains the successfully matched first logical rule according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information; the terminal 110 then searches according to the key information to obtain the target retrieval content.
  • the steps of processing the voice on the terminal 110 and finally obtaining the target retrieval content can also be performed on the server 120. Specifically, after the terminal 110 obtains the voice to be recognized, it sends the voice to be recognized to the server 120, where the voice to be recognized is processed on the server 120 to obtain the target retrieval content, and the server 120 returns the target retrieval content to the terminal.
  • a retrieval method is provided. Taking the method applied to the terminal in FIG. 1 as an example, the method includes the following steps:
  • Step 210 Obtain a voice to be recognized.
  • the terminal records the user voice, and uses the user voice as the voice to be recognized.
  • the voice to be recognized is voice data expressed by the user in a relatively colloquial manner.
  • the voice data is used when the user performs retrieval in the enterprise application system, freeing the user's hands for human-computer interaction and automatically retrieving the content the user wants.
  • the operation of triggering the terminal to record the user's voice may be triggered by the user, such as by clicking a control on the terminal, or may be automatic, such as the terminal recording speech whenever it detects a person's voice.
  • the enterprise application system can refer to a pure software system running in the enterprise, or an application system composed of three levels: a standardized management mode, a knowledge-based business model, and an integrated software system, such as an OA collaborative office system, the Ping An CSTS system, or a fingertip office system.
  • Step 220 Input the to-be-recognized speech into the trained speech recognition model for recognition, and obtain the recognized text.
  • the terminal inputs the to-be-recognized speech into a trained speech recognition model for recognition, and obtains the recognized text.
  • the speech recognition model is a speech recognition algorithm that converts speech into text: it recognizes the text content in the speech and outputs the recognized text.
  • Step 230 Input the recognized text into the trained semantic analysis model and sentiment analysis model to obtain first feature data and second feature data respectively; wherein, the first feature data is an analysis of semantic analysis of the recognized text Result; the second feature data is an analysis result of sentiment analysis on the recognized text.
  • the terminal inputs the recognized text into the trained semantic analysis model to obtain the first feature data.
  • the semantic analysis model is a semantic analysis algorithm that analyzes and processes the content of the recognized text based on the contextual relations between words in the recognized text.
  • the first feature data refers to the analysis result of the semantic analysis of the recognized text.
  • the same word often represents different meanings in different contexts. Therefore, it is necessary to combine the meanings of the words adjacent to each word in context to judge and analyze the word, and to determine the meaning of the word in its semantic context.
  • the tasks of semantic analysis are different for different language units.
  • the basic tasks of semantic analysis are word sense disambiguation (WSD) at the word level, semantic role labeling (SRL) at the sentence level, and reference disambiguation, also known as coreference resolution, at the text level.
  • the terminal inputs the recognized text into the trained sentiment analysis model to obtain the second feature data.
  • the sentiment analysis model refers to a sentiment analysis algorithm that judges the emotional coloring of the text, or its attitude of praise or criticism, based on analysis of the recognized text.
  • Sentiment analysis is also called tendency analysis: analyzing a subjective text to judge the speaker's emotional coloring or attitude of praise or criticism.
  • the second feature data refers to the analysis result of sentiment analysis on the recognized text.
  • Step 240 After performing word preprocessing on the recognized text, the target text is obtained, where the word preprocessing includes word segmentation, removal of stop words, and word filtering.
  • Word preprocessing refers to preliminary processing of the recognized text; after word preprocessing, the target text is obtained, which makes subsequent processing more accurate.
  • Word preprocessing may consist of performing word segmentation on the recognized text, removing stop words, and filtering words.
  • Word segmentation refers to segmenting the recognized text into words, and removing stop words means removing words in the recognized text that carry no substantive meaning, such as function words with no special meaning (e.g. the particle 吗 "ma").
  • Word filtering is a way of managing keywords in the recognized text, and is used to filter out undesirable information.
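  • The preprocessing step above can be sketched as follows; the stop-word set, the filter list, and whitespace-based segmentation are illustrative placeholders (a real system would use a proper segmenter, especially for Chinese), not details from this application:

```python
# Hypothetical sketch of word preprocessing: segmentation, stop-word
# removal, and word filtering. Lists here are illustrative only.

STOP_WORDS = {"the", "a", "an", "of", "is"}   # illustrative stop-word list
FILTERED_WORDS = {"badword"}                  # illustrative filter list

def preprocess(recognized_text: str) -> list[str]:
    """Segment the text, drop stop words, then filter unwanted words."""
    tokens = recognized_text.lower().split()          # naive segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t for t in tokens if t not in FILTERED_WORDS]
    return tokens

print(preprocess("The turnover of the fourth quarter"))
# → ['turnover', 'fourth', 'quarter']
```

The output of this step would be the "target text" that the text classification model consumes.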
  • Step 250 Input the first feature data, the second feature data, and the target text into the text classification model.
  • the text classification model obtains the successfully matched first logical rule according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information.
  • the terminal inputs the first feature data, the second feature data, and the target text into the text classification model.
  • the text classification model refers to an algorithm that classifies the target text according to the first feature data and the second feature data.
  • the text classification model obtains a first logical rule that is successfully matched according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information. That is, through the results of semantic analysis and sentiment analysis, the target text is classified and extracted to obtain key information for retrieval.
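  • The matching-and-classification step just described could look roughly like the following sketch; the rule table, its fields, and the token vocabulary are invented for illustration, since the application does not define a concrete rule format:

```python
# Hedged sketch: match (semantic, sentiment) feature data against a
# table of logical rules, then use the matched rule to extract key
# information from the target text. All names are hypothetical.

RULES = [
    # (semantic label, sentiment label) -> which token categories are key
    {"semantic": "finance_query", "sentiment": "neutral", "keep": {"metric", "time"}},
    {"semantic": "complaint", "sentiment": "negative", "keep": {"product"}},
]

def classify(first_feature, second_feature, target_tokens, vocabulary):
    """Find the first rule matching both features; keep matching tokens."""
    for rule in RULES:
        if rule["semantic"] == first_feature and rule["sentiment"] == second_feature:
            return [t for t in target_tokens if vocabulary.get(t) in rule["keep"]]
    return []   # no rule matched successfully

vocab = {"turnover": "metric", "quarter": "time", "fourth": "time"}
print(classify("finance_query", "neutral", ["turnover", "fourth", "quarter"], vocab))
# → ['turnover', 'fourth', 'quarter']
```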
  • Step 260 Retrieve based on the key information to obtain the target retrieval content.
  • the terminal retrieves the target retrieval content according to the key information.
  • speech recognition and natural language processing (NLP) technologies are introduced into the existing retrieval function of the enterprise Internet application system: the user's speech is entered, speech recognition and natural language processing are performed, and the search is completed automatically according to the key information finally obtained, avoiding manual, frequent, and complex information retrieval and greatly improving retrieval efficiency.
  • Natural language processing is an important technology that embodies language intelligence. It is an important branch of artificial intelligence that helps analyze, understand or generate natural language, realize the natural communication between humans and machines, and also help communication between people.
  • the entered user voice refers to any type of voice.
  • a series of information most likely to be needed by the user is retrieved based on any type of voice of the user, which improves the accuracy of retrieval.
  • the types of voice include standardized and colloquial terms.
  • For example, the input voice can be the user saying, in standardized language, "Please check the turnover of the fourth quarter of 2018", or the user saying, in a colloquial expression, "How much money did we make this quarter?" Whether the speech is standardized or colloquial, speech recognition and natural language processing can be performed on it.
  • the key information obtained through text classification model matching and classification is "turnover" and "current quarter", retrieval is performed automatically based on the key information, and the target retrieval content the user needs is finally obtained, such as the specific operating income and operating income sources for each quarter.
  • the speech to be recognized is input into the trained speech recognition model for recognition to obtain the recognized text
  • the recognized text is input into the trained semantic analysis model and sentiment analysis model to obtain the first feature data and the second feature data.
  • after word preprocessing is performed on the recognized text, the target text is obtained.
  • the first feature data, the second feature data, and the target text are input into the text classification model.
  • the text classification model obtains the successfully matched first logical rule according to the first feature data and the second feature data, classifies the target text according to the first logical rule to obtain key information, and retrieval is then performed according to the key information to obtain the target retrieval content.
  • the recognized text is obtained, and then natural language processing is performed on the recognized text through the semantic analysis model, the sentiment analysis model and the text classification model to obtain the key information for retrieval , And finally get the target retrieval content based on the key information.
  • By replacing the traditional keyword input with voice input, the user's input time is saved.
  • the accuracy and comprehensiveness of the key information can be ensured, and retrieval based on the key information can then be performed automatically to accurately retrieve the corresponding target retrieval content, improving the efficiency of information retrieval.
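  • The overall flow summarized above (speech → recognized text → features plus preprocessed text → classification → key information → retrieval) can be sketched end to end with stub models; every function below is an illustrative placeholder, not the application's trained models:

```python
# Illustrative end-to-end sketch of the retrieval flow. Each model is a
# stand-in stub passed in as a callable.

def retrieve(speech, asr, semantic, sentiment, preprocess, classify, search):
    recognized = asr(speech)              # speech recognition model
    first = semantic(recognized)          # first feature data
    second = sentiment(recognized)        # second feature data
    target = preprocess(recognized)       # target text after preprocessing
    key_info = classify(first, second, target)
    return search(key_info)               # target retrieval content

result = retrieve(
    b"...",                               # raw audio stand-in
    asr=lambda s: "turnover this quarter",
    semantic=lambda t: "finance_query",
    sentiment=lambda t: "neutral",
    preprocess=lambda t: t.split(),
    classify=lambda f1, f2, toks: [t for t in toks if t != "this"],
    search=lambda keys: f"results for {' '.join(keys)}",
)
print(result)
# → results for turnover quarter
```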
  • the speech recognition model includes an acoustic model and a language model.
  • step 220 includes:
  • Step 221 Perform signal processing and feature extraction on the audio signal of the voice to be recognized to obtain a feature sequence.
  • Step 222 Input the characteristic sequence into the trained acoustic model and the trained language model to obtain acoustic model scores and language model scores, respectively.
  • Step 223 Perform a decoding search on the acoustic model score and the language model score to obtain the recognized text.
  • the terminal performs signal processing and feature extraction on the audio signal of the voice to be recognized to obtain a feature sequence.
  • the audio signals have characteristic parameters, such as frequency, period, energy, etc. Therefore, signal processing and feature extraction on the voice audio signals can obtain a characteristic sequence.
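  • As a minimal illustration of signal processing and feature extraction, the sketch below frames the audio samples and computes one log-energy feature per frame; real systems use richer features such as MFCCs, and the frame length here is arbitrary:

```python
# Minimal sketch of feature extraction: frame the audio samples and
# compute a per-frame log-energy value. Frame size is illustrative.
import math

def feature_sequence(samples, frame_len=4):
    """Split samples into frames and return one log-energy per frame."""
    features = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame)
        features.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return features

# Quieter first frame, louder second frame:
print(feature_sequence([0.0, 0.1, 0.2, 0.1, 0.5, 0.4, 0.3, 0.2]))
```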
  • the feature sequence includes multiple voice features of the voice to be recognized.
  • the terminal inputs the characteristic sequence into the trained acoustic model and the trained language model to obtain acoustic model scores and language model scores, respectively.
  • the language model score refers to an evaluation, produced by the language model, of how well a candidate recognition result fits the language.
  • the acoustic model score is generated by the acoustic model, which integrates acoustics and phonetics, according to the input feature sequence.
  • the terminal performs a decoding search on the acoustic model score and the language model score to obtain the recognized text.
  • the decoding search refers to the process of matching preset words according to the feature sequence and the score of the feature sequence to obtain the recognized text.
  • a feature sequence is obtained, the acoustic model score and the language model score are obtained, and then the recognized text is obtained through decoding search, so as to realize accurate conversion of speech to text.
  • step 223 further includes:
  • Step 223A obtain the preset hypothesis word sequence.
  • Step 223B Calculate the acoustic model score of the preset hypothesis word sequence according to the feature vector in the feature sequence to obtain acoustic model groups.
  • Step 223C Calculate the language model score of the preset hypothesis word sequence according to the feature vector in the feature sequence to obtain language model groups.
  • Step 223D According to the grouping of the acoustic model and the grouping of the language model, the overall score of the hypothetical word in the preset hypothesis word sequence is calculated, and the hypothetical word with the highest overall score is used as the recognized text.
  • the terminal obtains a preset hypothetical word sequence
  • the preset hypothetical word sequence is a number of preset hypothetical words.
  • the acoustic model grouping refers to comparing the hypothesis words in the hypothesis word sequence against the feature vectors in the feature sequence to obtain the acoustic score set of the hypothesis words.
  • the language model grouping refers to comparing the hypothesis words in the hypothesis word sequence against the feature vectors in the feature sequence to obtain the language score set of the hypothesis words.
  • the overall score of each hypothesis word in the preset hypothesis word sequence is calculated according to the acoustic score set and the language score set, and the hypothesis word with the highest overall score is selected as the recognized text.
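  • The selection of the highest-scoring hypothesis word can be sketched as follows; the log-likelihood values and the language-model weight are made-up illustrations, not values from the application:

```python
# Sketch of hypothesis selection: combine each hypothesis word's
# acoustic score and language score, then pick the highest total.

def best_hypothesis(acoustic_scores, language_scores, lm_weight=0.5):
    """Combine per-hypothesis scores and return the top-scoring word."""
    totals = {
        word: acoustic_scores[word] + lm_weight * language_scores[word]
        for word in acoustic_scores
    }
    return max(totals, key=totals.get)

acoustic = {"turnover": -3.2, "turnip": -5.8}   # illustrative log scores
language = {"turnover": -1.1, "turnip": -7.4}
print(best_hypothesis(acoustic, language))
# → turnover
```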
  • the model to be trained includes the semantic analysis model, the sentiment analysis model, and the text classification model, as shown in FIG. 5, the method further includes:
  • Step 310 Obtain a training sample set.
  • the training sample set includes granular data samples, language data samples, and modal data samples.
  • the granular data samples include granular data features, language data features, and modal data features.
  • Step 320 Obtain the text to be trained, and input the text to be trained into the initial model to be trained to obtain the initial text.
  • Step 330 Adjust the parameters of the initial model to be trained according to the initial text, granular data features, language data features, and modal data features, until convergence conditions are met, to obtain the semantic analysis model, the sentiment analysis model, and the text classification model.
  • the training sample set refers to the big data samples used to train semantic analysis models, sentiment analysis models, and text classification models. Big data samples can be obtained through crawlers or purchased.
  • the training sample set includes granular data samples, language data samples and modal data samples.
  • the granular data sample is detailed and comprehensive multi-granular monolingual data.
  • Multilingual data is information data representing different languages, such as Chinese, English, Korean, Japanese, and dialects of different regions.
  • Multi-modal data is data that represents multiple manifestations of the same thing, similar to the way humans perceive and learn information. From the perspective of a machine, it is equivalent to descriptions of the same thing by different sensors, such as camera, X-ray, and infrared images of the same target in the same scene.
  • the sample to be trained is the sample used for training.
  • the sample to be trained can be a human sentence, or a novel, a paper, or even a large amount of industry data.
  • the speech recognition model includes an acoustic model and a language model, as shown in FIG. 6, and the method further includes:
  • Step 341 Obtain training samples, where the training samples include language features and acoustic features.
  • Step 342 Obtain the training speech to be recognized, and input the training speech to be recognized into the initial language model and the initial acoustic model to obtain the initial language score and the initial acoustic score.
  • Step 343 Adjust the parameters of the initial language model according to the language features and the initial language score, and adjust the parameters of the initial acoustic model according to the acoustic features and the initial acoustic score, until both the initial language model and the initial acoustic model meet the convergence conditions, to obtain the speech recognition model.
  • the training sample refers to the sample data used for speech training, and the training sample includes language features and acoustic features.
  • Linguistic features refer to the features used to distinguish different languages. For example, Chinese has the characteristics of Chinese, and English has the characteristics of English, etc., just as the human ear can recognize different languages according to the characteristics of different national languages.
  • Acoustic feature refers to the feature obtained by combining acoustics and pronunciation.
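  • The "adjust parameters until the convergence condition is met" loop described in the training steps above can be illustrated, under heavy simplification, with a single-parameter model fitted by gradient descent; the model, data, learning rate, and tolerance are toy stand-ins, not the application's actual acoustic or language model training:

```python
# Toy sketch of iterative parameter adjustment with a convergence
# condition: fit y = w * x by gradient descent on squared error.

def train(samples, lr=0.1, tol=1e-6, max_iters=1000):
    """Adjust w until the update step is below tol (convergence)."""
    w = 0.0
    for _ in range(max_iters):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        step = lr * grad
        w -= step
        if abs(step) < tol:        # convergence condition met
            break
    return w

print(round(train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]), 3))
# → 2.0
```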
  • a retrieval device including: a voice acquisition module 510, a voice recognition module 520, a key information confirmation module 530, and a retrieval module 540, wherein:
  • the voice acquisition module 510 is used to acquire the voice to be recognized.
  • the voice recognition module 520 is configured to input the to-be-recognized voice into a trained voice recognition model for recognition to obtain recognized text.
  • the key information confirmation module 530 is configured to input the recognized text into the trained semantic analysis model and the sentiment analysis model to obtain the first feature data and the second feature data respectively, wherein the first feature data is the analysis result of semantic analysis on the recognized text and the second feature data is the analysis result of sentiment analysis on the recognized text; it is also used to obtain the target text after performing word preprocessing on the recognized text, wherein the word preprocessing includes word segmentation, removal of stop words, and word filtering; it is also used to input the first feature data, the second feature data, and the target text into a text classification model, which obtains the successfully matched first logical rule according to the first feature data and the second feature data, and classifies the target text according to the first logical rule to obtain key information.
  • the retrieval module is used to retrieve the target retrieval content according to the key information.
  • the speech recognition model includes an acoustic model and a language model
  • the speech recognition module 520 includes:
  • the feature sequence extraction unit is used to perform signal processing and feature extraction on the audio signal of the voice data to obtain a feature sequence.
  • the score confirmation unit is used to input the feature sequence into the trained acoustic model and the trained language model to obtain the acoustic model score and the language model score respectively.
  • the recognized-text acquisition unit is used to perform a decoding search on the acoustic model score and the language model score to obtain the recognized text.
  • the recognized text obtaining unit further includes:
  • the preset hypothesis word sequence obtaining unit is used to obtain the preset hypothesis word sequence.
  • the score calculation unit is configured to calculate the acoustic model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain the acoustic model grouping, and to calculate the language model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain the language model grouping.
  • the recognition text confirmation unit is configured to calculate the overall score of hypothetical words in the preset hypothesis word sequence according to the grouping of the acoustic model and the grouping of the language model, and use the hypothesis word with the highest overall score as the recognized text.
  • Each module in the above retrieval device can be implemented in whole or in part by software, hardware and a combination thereof.
  • the foregoing modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 8.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to implement a retrieval method.
  • the display screen of the computer device can be a liquid crystal display or an electronic ink display.
  • the input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad set on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 8 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine some components, or have a different arrangement of components.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer readable instructions.
  • when the computer readable instructions are executed by the one or more processors, the steps of the retrieval method provided in any embodiment of the present application are implemented.
  • One or more non-volatile storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the retrieval method provided in any of the embodiments of the present application.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

A retrieval method, comprising: performing speech recognition on a user's colloquial speech, taken as the speech to be recognized, to obtain recognized text; performing natural language processing on the recognized text by means of a semantic analysis model, a sentiment analysis model and a text classification model to obtain key information used for retrieval; and finally obtaining target retrieval content according to the key information.

Description

Retrieval method, apparatus, computer device and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 201910594101.9, entitled "Retrieval method, apparatus, computer device and storage medium" and filed with the Chinese Patent Office on July 3, 2019, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to a retrieval method, apparatus, computer device and storage medium.
Background
The rapid development of computer technology and Internet systems has given rise to application systems serving many purposes across industries and positions. At present, when information retrieval is involved in an application system, traditional retrieval methods require the user to make selections and manually fill in keywords in order to retrieve the corresponding content. However, as the number of Internet users, the complexity of business scenarios in daily work, the timeliness requirements of data and the sheer volume of data continue to grow, the retrieval workload of traditional retrieval methods increases accordingly, and traditional information retrieval greatly slows down work efficiency.
Summary
According to various embodiments disclosed in the present application, a retrieval method, apparatus, computer device and storage medium are provided.
A retrieval method, including:
acquiring speech to be recognized;
inputting the speech to be recognized into a trained speech recognition model for recognition to obtain recognized text;
inputting the recognized text into a trained semantic analysis model and a trained sentiment analysis model to obtain first feature data and second feature data respectively, the first feature data being an analysis result of semantic analysis performed on the recognized text and the second feature data being an analysis result of sentiment analysis performed on the recognized text;
performing word preprocessing on the recognized text to obtain target text, the word preprocessing including word segmentation, stop-word removal and word filtering;
inputting the first feature data, the second feature data and the target text into a text classification model, the text classification model obtaining a successfully matched first logic rule according to the first feature data and the second feature data, and classifying the target text according to the first logic rule to obtain key information; and
performing retrieval according to the key information to obtain target retrieval content.
A retrieval device, including:
a speech acquisition module, configured to acquire speech to be recognized;
a speech recognition module, configured to input the speech to be recognized into a trained speech recognition model for recognition to obtain recognized text;
a key information confirmation module, configured to input the recognized text into a trained semantic analysis model and a trained sentiment analysis model to obtain first feature data and second feature data respectively, the first feature data being an analysis result of semantic analysis performed on the recognized text and the second feature data being an analysis result of sentiment analysis performed on the recognized text; further configured to perform word preprocessing on the recognized text to obtain target text, the word preprocessing including word segmentation, stop-word removal and word filtering; and further configured to input the first feature data, the second feature data and the target text into a text classification model, the text classification model obtaining a successfully matched first logic rule according to the first feature data and the second feature data, and classifying the target text according to the first logic rule to obtain key information; and
a retrieval module, configured to perform retrieval according to the key information to obtain target retrieval content.
A computer device, including a memory and one or more processors, the memory storing computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
acquiring speech to be recognized;
inputting the speech to be recognized into a trained speech recognition model for recognition to obtain recognized text;
inputting the recognized text into a trained semantic analysis model and a trained sentiment analysis model to obtain first feature data and second feature data respectively, the first feature data being an analysis result of semantic analysis performed on the recognized text and the second feature data being an analysis result of sentiment analysis performed on the recognized text;
performing word preprocessing on the recognized text to obtain target text, the word preprocessing including word segmentation, stop-word removal and word filtering;
inputting the first feature data, the second feature data and the target text into a text classification model, the text classification model obtaining a successfully matched first logic rule according to the first feature data and the second feature data, and classifying the target text according to the first logic rule to obtain key information; and
performing retrieval according to the key information to obtain target retrieval content.
One or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring speech to be recognized;
inputting the speech to be recognized into a trained speech recognition model for recognition to obtain recognized text;
inputting the recognized text into a trained semantic analysis model and a trained sentiment analysis model to obtain first feature data and second feature data respectively, the first feature data being an analysis result of semantic analysis performed on the recognized text and the second feature data being an analysis result of sentiment analysis performed on the recognized text;
performing word preprocessing on the recognized text to obtain target text, the word preprocessing including word segmentation, stop-word removal and word filtering;
inputting the first feature data, the second feature data and the target text into a text classification model, the text classification model obtaining a successfully matched first logic rule according to the first feature data and the second feature data, and classifying the target text according to the first logic rule to obtain key information; and
performing retrieval according to the key information to obtain target retrieval content.
The details of one or more embodiments of the application are set forth in the following drawings and description. Other features and advantages of this application will become apparent from the description, the drawings and the claims.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is an application scenario diagram of the retrieval method according to one or more embodiments.
Fig. 2 is a schematic flowchart of the retrieval method according to one or more embodiments.
Fig. 3 is a schematic flowchart of speech recognition according to one or more embodiments.
Fig. 4 is a schematic flowchart of speech recognition according to one or more embodiments.
Fig. 5 is a schematic flowchart of the training steps of a model to be trained according to one or more embodiments.
Fig. 6 is a schematic flowchart of the training steps of a speech recognition model according to one or more embodiments.
Fig. 7 is a block diagram of a retrieval device according to one or more embodiments.
Fig. 8 is a block diagram of a computer device according to one or more embodiments.
Detailed description
In order to make the technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
The retrieval method provided in this application can be applied in the application environment shown in Fig. 1. Fig. 1 is a diagram of the application environment in which the retrieval method runs in an embodiment. As shown in Fig. 1, the application environment includes a terminal 110 and a server 120. The terminal 110 and the server 120 communicate through a network; the communication network may be a wireless or wired network, such as an IP network or a cellular mobile communication network, and the number of terminals and servers is not limited.
The terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 acquires the speech to be recognized and inputs it into a trained speech recognition model for recognition to obtain recognized text. The terminal 110 then inputs the recognized text into a trained semantic analysis model and a trained sentiment analysis model to obtain first feature data and second feature data respectively, performs word preprocessing on the recognized text to obtain target text, and inputs the first feature data, the second feature data and the target text into a text classification model. The text classification model obtains a successfully matched first logic rule according to the first feature data and the second feature data, and classifies the target text according to the first logic rule to obtain key information. Finally, the terminal 110 performs retrieval according to the key information to obtain target retrieval content.
In one embodiment, the above steps of processing the speech on the terminal 110 and finally obtaining the target retrieval content may also be performed on the server 120. Specifically, after acquiring the speech to be recognized, the terminal 110 sends it to the server 120, the server 120 processes the speech to obtain the target retrieval content, and the server 120 then returns the target retrieval content to the terminal.
In one embodiment, as shown in Fig. 2, a retrieval method is provided. Taking its application to the terminal in Fig. 1 as an example, the method includes the following steps:
Step 210: acquire the speech to be recognized.
Specifically, the terminal records the user's voice and uses it as the speech to be recognized. The speech to be recognized is relatively colloquial voice data from the user; when the user needs retrieval while using an enterprise application system, this voice data frees the user's hands, enables human-computer interaction, and allows the desired content to be retrieved automatically. The operation that triggers the terminal to record the user's voice may be triggered by the user, for example by clicking a control on the terminal, or performed automatically by the terminal, for example by automatically recording when a human voice is detected. The enterprise application system may be a pure software system running inside an enterprise, or an application system composed of three levels (a standardized management mode, a knowledge-based business model and an integrated software system), such as an OA collaborative office system, the Ping An CSTS system, or a fingertip office system.
Step 220: input the speech to be recognized into the trained speech recognition model for recognition to obtain the recognized text.
Specifically, the terminal inputs the speech to be recognized into the trained speech recognition model for recognition and obtains the recognized text. The speech recognition model mainly converts speech into text: it is a speech recognition algorithm that recognizes the textual content in the speech and produces the recognized text.
Step 230: input the recognized text into the trained semantic analysis model and sentiment analysis model to obtain the first feature data and the second feature data respectively, the first feature data being an analysis result of semantic analysis performed on the recognized text and the second feature data being an analysis result of sentiment analysis performed on the recognized text.
Specifically, the terminal inputs the recognized text into the trained semantic analysis model to obtain the first feature data. The semantic analysis model is a semantic analysis algorithm that builds tasks from the context words in the recognized text to analyze and process its content; the first feature data refers to the analysis result of this semantic analysis. In different semantic contexts the same word often carries different meanings, so each word must be judged and analyzed in combination with the meanings of its neighboring context words, to determine the sense of the word that fits the given context. For different language units, the tasks of semantic analysis differ: at the word level the basic task is word sense disambiguation (WSD), at the sentence level it is semantic role labeling (SRL), and at the discourse level it is reference disambiguation, also called coreference resolution.
Specifically, the terminal inputs the recognized text into the trained sentiment analysis model to obtain the second feature data. The sentiment analysis model is a sentiment analysis algorithm that judges the emotional coloring or the positive/negative attitude of the text based on the recognized text. Sentiment analysis, also called tendency analysis, analyzes a subjective text to judge the speaker's emotional coloring or attitude; the second feature data refers to the analysis result of this sentiment analysis.
Step 240: perform word preprocessing on the recognized text to obtain the target text, the word preprocessing including word segmentation, stop-word removal and word filtering.
Specifically, the terminal performs word preprocessing on the recognized text to obtain the target text. Word preprocessing refers to a preliminary processing of the recognized text; the target text obtained through it is more accurate in subsequent processing. In one embodiment, word preprocessing may consist of word segmentation, stop-word removal and word filtering on the recognized text. Word segmentation splits the recognized text into words; stop-word removal discards words without substantive meaning, for example particles such as "的, 吗, 呢"; word filtering is a way of managing keywords in the recognized text and is used to filter out undesirable information.
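The word preprocessing described above can be sketched as follows. This is a minimal illustration only: the stop-word list, the filter list and the whitespace tokenization (standing in for a real word segmenter) are assumptions made for the example and are not specified by this application.

```python
# Hypothetical stop-word and filter lists for illustration
# (in Chinese text these would contain particles such as "的, 吗, 呢").
STOP_WORDS = {"the", "a", "of", "please"}
FILTER_WORDS = {"badword"}  # "undesirable information" to filter out

def preprocess(recognized_text):
    # Word segmentation: whitespace split stands in for a real segmenter.
    tokens = recognized_text.lower().split()
    # Stop-word removal: drop words without substantive meaning.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Word filtering: drop undesirable words.
    tokens = [t for t in tokens if t not in FILTER_WORDS]
    return tokens

print(preprocess("Please check the turnover of the fourth quarter"))
# → ['check', 'turnover', 'fourth', 'quarter']
```

The three list comprehensions correspond one-to-one to the three preprocessing operations named in step 240.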
Step 250: input the first feature data, the second feature data and the target text into the text classification model; the text classification model obtains a successfully matched first logic rule according to the first feature data and the second feature data, and classifies the target text according to the first logic rule to obtain the key information.
Specifically, the terminal inputs the first feature data, the second feature data and the target text into the text classification model. The text classification model is an algorithm that classifies the target text according to the first feature data and the second feature data: it obtains a successfully matched first logic rule from the two kinds of feature data, and classifies the target text according to that rule to obtain the key information. In other words, the target text is classified and extracted according to the semantic analysis result and the sentiment analysis result, yielding the key information used for retrieval.
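A hedged sketch of this rule-matching step is shown below. The rule table, the feature labels and the category lexicon are all invented for illustration; the application does not specify the form of its logic rules.

```python
# Hypothetical rule table: a (semantic label, sentiment label) pair that
# matches the first and second feature data selects the word categories
# to extract from the target text.
RULES = [
    (("finance_query", "neutral"), {"metric", "time"}),
    (("complaint", "negative"), {"product"}),
]

# Hypothetical lexicon mapping words to categories.
LEXICON = {"turnover": "metric", "quarter": "time", "revenue": "metric"}

def classify(first_feature, second_feature, target_tokens):
    # Find the first logic rule matched by the two feature data.
    for condition, categories in RULES:
        if condition == (first_feature, second_feature):
            # Classify the target text: keep words in the selected categories.
            return [t for t in target_tokens if LEXICON.get(t) in categories]
    return target_tokens  # no rule matched: keep everything

print(classify("finance_query", "neutral", ["check", "turnover", "quarter"]))
# → ['turnover', 'quarter']
```

The surviving tokens play the role of the key information passed on to step 260.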
Step 260: perform retrieval according to the key information to obtain the target retrieval content.
Specifically, the terminal performs retrieval according to the key information to obtain the target retrieval content. In one embodiment, speech recognition and natural language processing (NLP) technologies are introduced into the existing retrieval function of an enterprise Internet application system: the user's speech is recorded, speech recognition and natural language processing are performed, and the search is completed automatically according to the key information finally obtained, avoiding frequent manual retrieval of complex information and greatly improving retrieval efficiency.
NLP (Natural Language Processing) is a sub-field of artificial intelligence (AI) and plays a role in the overall artificial intelligence system. Natural language processing is an important technology embodying language intelligence; as an important branch of artificial intelligence, it helps analyze, understand or generate natural language, realizes natural communication between humans and machines, and also helps communication between people.
The recorded user speech may be of any type; based on any type of user speech, the series of information the user most likely needs is retrieved, which improves the accuracy of retrieval. The types of speech include standardized expressions and colloquial expressions. In one embodiment, for example, the recorded speech may be the user speaking in standardized language, such as "Please check the turnover of the fourth quarter of 2018", or in a colloquial way, such as "How much money did we make this quarter?". Whether the speech is standardized or colloquial, speech recognition and natural language processing can be performed on it; through matching and classification by the text classification model, the key information obtained is "turnover and the current quarter", retrieval is then performed automatically according to this key information, and the target retrieval content the user needs is finally obtained, such as "the specific operating income of each quarter and the sources of that income".
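The final retrieval step can be sketched as a simple match of the key information against stored records. The record store and the matching strategy (all keywords must appear) are illustrative assumptions; a real system would query the application system's search backend.

```python
# Hypothetical record store standing in for the application system's data.
RECORDS = [
    "Q4 2018 turnover: 1.2M, main source: insurance products",
    "Q4 2018 headcount report",
    "Q3 2018 turnover: 0.9M",
]

def retrieve(key_info):
    # Return records that contain every piece of key information.
    return [r for r in RECORDS if all(k in r for k in key_info)]

print(retrieve(["turnover", "Q4"]))
# → ['Q4 2018 turnover: 1.2M, main source: insurance products']
```

With key information such as "turnover" and the quarter, only the matching record is returned as the target retrieval content.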
In this embodiment, the speech to be recognized is acquired and input into the trained speech recognition model for recognition to obtain the recognized text; the recognized text is input into the trained semantic analysis model and sentiment analysis model to obtain the first feature data and the second feature data; word preprocessing is performed on the recognized text to obtain the target text; the first feature data, the second feature data and the target text are input into the text classification model, which obtains a successfully matched first logic rule according to the first feature data and the second feature data and classifies the target text according to the first logic rule to obtain the key information; and retrieval is performed according to the key information to obtain the target retrieval content. The user's colloquial speech is taken as the speech to be recognized and speech recognition is performed on it to obtain the recognized text; natural language processing is then performed on the recognized text through the semantic analysis model, the sentiment analysis model and the text classification model to obtain the key information used for retrieval; finally, the target retrieval content is obtained from the key information. Replacing traditional keyword input with voice input saves the user's input time; natural language processing ensures the accuracy and comprehensiveness of the key information; and automatic retrieval based on the key information accurately retrieves the corresponding target retrieval content, improving the efficiency of information retrieval.
In one embodiment, the speech recognition model includes an acoustic model and a language model. As shown in Fig. 3, step 220 includes:
Step 221: perform signal processing and feature extraction on the audio signal of the speech to be recognized to obtain a feature sequence.
Step 222: input the feature sequence into the trained acoustic model and the trained language model to obtain an acoustic model score and a language model score respectively.
Step 223: perform a decoding search on the acoustic model score and the language model score to obtain the recognized text.
Specifically, the terminal performs signal processing and feature extraction on the audio signal of the speech to be recognized to obtain the feature sequence. It can be understood that the audio signals of different speech differ: an audio signal has characteristic parameters such as frequency, period and energy, so signal processing and feature extraction on the speech audio signal yield a feature sequence. The feature sequence contains multiple speech features of the speech to be recognized.
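As a minimal sketch of this step, the signal can be cut into frames and one feature computed per frame (here short-time energy, one of the characteristic parameters mentioned above). Real systems typically extract richer features such as filter-bank or MFCC vectors; the frame length below is an arbitrary example.

```python
# Frame the signal and compute short-time energy per frame, yielding a
# toy "feature sequence" (one scalar feature per frame).
def feature_sequence(signal, frame_len=4):
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    return [sum(x * x for x in f) for f in frames]

print(feature_sequence([0.0, 1.0, 0.0, -1.0, 2.0, 0.0, 0.0, 0.0]))
# → [2.0, 4.0]
```

Each element of the returned list corresponds to one frame of the audio; a real feature sequence would hold a vector per frame rather than a single energy value.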
Specifically, the terminal inputs the feature sequence into the trained acoustic model and the trained language model to obtain the acoustic model score and the language model score respectively. The language model score is used to evaluate the quality of candidate word sequences and to analyze the recognition result of speech recognition. The acoustic model score is the score generated from the input feature sequence by the acoustic model, which integrates acoustics and pronunciation.
Specifically, the terminal performs a decoding search on the acoustic model score and the language model score to obtain the recognized text. The decoding search refers to the process of matching preset words according to the feature sequence and its scores to obtain the recognized text.
In this embodiment, signal processing and feature extraction are performed on the speech to be recognized to obtain the feature sequence, the acoustic model score and the language model score are obtained, and the recognized text is then obtained through the decoding search, realizing accurate conversion from speech to text.
在其中一个实施例中,如图4所示,步骤223还包括:In one of the embodiments, as shown in FIG. 4, step 223 further includes:
步骤223A,获取预设假设词序列。 Step 223A, obtain the preset hypothesis word sequence.
步骤223B，根据所述特征序列中的特征向量计算所述预设假设词序列的所述声学模型得分，得到声学模型得分组。 Step 223B: Calculate the acoustic model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain an acoustic model score group.
步骤223C，根据所述特征序列中的特征向量计算所述预设假设词序列的所述语言模型得分，得到语言模型得分组。 Step 223C: Calculate the language model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain a language model score group.
步骤223D，根据所述声学模型得分组和语言模型得分组，计算所述预设假设词序列中假设词的总体得分，将所述总体得分最高的假设词作为所述识别文本。 Step 223D: Calculate the overall score of each hypothesis word in the preset hypothesis word sequence according to the acoustic model score group and the language model score group, and use the hypothesis word with the highest overall score as the recognized text.
具体地，终端获取预设假设词序列，预设假设词序列是预先设置的若干假设词。声学模型得分组是指假设词序列中的假设词与特征序列中的特征向量进行对比计算，得到的假设词的声学得分集合。语言模型得分组是指假设词序列中的假设词与特征序列中的特征向量进行对比计算，得到的假设词的语言得分集合。根据声学得分集合和语言得分集合计算所述预设假设词序列中每一个假设词的总体得分，并选择总体得分最高的假设词作为识别文本。Specifically, the terminal obtains a preset hypothesis word sequence, which is a set of preset hypothesis words. The acoustic model score group is the set of acoustic scores obtained by comparing each hypothesis word in the hypothesis word sequence against the feature vectors in the feature sequence; the language model score group is the set of language scores obtained in the same way. The overall score of each hypothesis word in the preset hypothesis word sequence is then calculated from the acoustic score set and the language score set, and the hypothesis word with the highest overall score is selected as the recognized text.
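The overall-score computation described above can be pictured with the following sketch. The log-domain addition of the two scores and the language-model weight are illustrative assumptions, not details taken from this disclosure:

```python
import math

def pick_best_hypothesis(hypotheses, acoustic_scores, language_scores,
                         lm_weight=1.0):
    """Toy decoding search: combine per-hypothesis acoustic and language
    model scores (log domain) and return the hypothesis with the
    highest overall score. The weighting scheme is assumed."""
    best_word, best_total = None, -math.inf
    for word in hypotheses:
        total = acoustic_scores[word] + lm_weight * language_scores[word]
        if total > best_total:
            best_word, best_total = word, total
    return best_word, best_total

word, score = pick_best_hypothesis(
    ["hello", "hollow"],
    acoustic_scores={"hello": -3.2, "hollow": -3.5},
    language_scores={"hello": -1.0, "hollow": -4.0},
)
```

Here the acoustically similar hypothesis with the better language score wins, which is the point of combining the two score groups before selecting the recognized text.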
在其中一个实施例中,待训练模型包括所述语义分析模型、所述情感分析模型和所述文本分类模型,如图5示,方法还包括:In one of the embodiments, the model to be trained includes the semantic analysis model, the sentiment analysis model, and the text classification model, as shown in FIG. 5, the method further includes:
步骤310,获取训练样本集,所述训练样本集包括粒度数据样本、语言数据样本和模态数据样本,所述粒度数据样本包括粒度数据特征、语言数据特征、模态数据特征。Step 310: Obtain a training sample set. The training sample set includes granular data samples, language data samples, and modal data samples. The granular data samples include granular data features, language data features, and modal data features.
步骤320,获取待训练文本,将待训练文本输入初始待训练模型,得到初始文本。Step 320: Obtain the text to be trained, and input the text to be trained into the initial model to be trained to obtain the initial text.
步骤330,根据初始文本、粒度数据特征、语言数据特征和模态数据特征对初始待训练模型进行参数调整,直到满足收敛条件,得到所述语义分析模型、所述情感分析模型、所述文本分类模型。Step 330: Adjust the parameters of the initial model to be trained according to the initial text, granular data features, language data features, and modal data features, until convergence conditions are met, to obtain the semantic analysis model, the sentiment analysis model, and the text classification model.
训练样本集是指用于训练语义分析模型、情感分析模型和文本分类模型的大数据样本，大数据样本可以通过爬虫或购买得到。训练样本集包括粒度数据样本、语言数据样本和模态数据样本。粒度数据样本是详细全面的多粒度单语数据。多语言数据是代表不同语言的信息数据，比如中文、英文、韩语、日语、不同地区方言等。多模态数据是表示同一个事物的多种表现形态的数据，类似于人类感知学习的信息形式，站在机器的角度上说相当于不同传感器对同一事物的描述，比如相机、X光、红外线对同一个场景同一个目标照出的图片。The training sample set refers to the big data samples used to train the semantic analysis model, the sentiment analysis model, and the text classification model; big data samples can be obtained through crawlers or purchase. The training sample set includes granular data samples, language data samples, and modal data samples. A granular data sample is detailed and comprehensive multi-granularity monolingual data. Multilingual data is information data representing different languages, such as Chinese, English, Korean, Japanese, and dialects of different regions. Multi-modal data is data representing multiple manifestations of the same thing, similar to the information forms of human perceptual learning; from the perspective of a machine, it is equivalent to descriptions of the same thing by different sensors, for example pictures of the same target in the same scene taken by a camera, X-ray, and infrared.
待训练样本是用于训练的样本,待训练样本可以是人类的一句话,或者一篇小说,一篇论文,乃至大量的行业数据。通过不断训练调整初始待训练模型的参数,直到满足收敛条件,得到语义分析模型、情感分析模型和文本分类模型。The sample to be trained is the sample used for training. The sample to be trained can be a human sentence, or a novel, a paper, or even a large amount of industry data. Through continuous training and adjusting the parameters of the initial model to be trained, until the convergence condition is met, a semantic analysis model, a sentiment analysis model and a text classification model are obtained.
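A minimal sketch of "adjust the parameters of the initial model until the convergence condition is met" follows. The gradient-descent update and the tolerance-based stopping rule are illustrative assumptions, and the quadratic objective stands in for the actual semantic analysis, sentiment analysis, and text classification models:

```python
def train_until_converged(params, grad_fn, lr=0.1, tol=1e-6, max_iter=1000):
    """Generic parameter-adjustment loop: repeatedly update each
    parameter against its gradient and stop once the update becomes
    smaller than a tolerance (the 'convergence condition')."""
    for _ in range(max_iter):
        grads = [grad_fn(p) for p in params]
        new_params = [p - lr * g for p, g in zip(params, grads)]
        # Convergence condition: largest parameter change below tol.
        if max(abs(n - p) for n, p in zip(new_params, params)) < tol:
            return new_params
        params = new_params
    return params

# Minimize f(p) = (p - 3)^2 for each parameter; gradient is 2(p - 3).
result = train_until_converged([0.0, 10.0], grad_fn=lambda p: 2 * (p - 3))
```

Both parameters converge to the minimizer at 3, illustrating the loop structure; the real training step would instead compare the initial text against the granular, language, and modal data features.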
在其中一个实施例中,所述语音识别模型包括声学模型和语言模型,如图6示,方法还包括:In one of the embodiments, the speech recognition model includes an acoustic model and a language model, as shown in FIG. 6, and the method further includes:
步骤341,获取训练样本,所述训练样本包括语言特征和声学特征。Step 341: Obtain training samples, where the training samples include language features and acoustic features.
步骤342，获取待识别训练语音，将待识别训练语音输入初始语言模型，得到初始语言得分；将待识别训练语音输入初始声学模型，得到初始声学得分。Step 342: Obtain the training speech to be recognized, input it into the initial language model to obtain an initial language score, and input it into the initial acoustic model to obtain an initial acoustic score.
步骤343，根据语言特征、初始语言得分对初始语言模型进行参数调整，根据声学特征、初始声学得分对初始声学模型进行参数调整，直到初始语言模型和初始声学模型都满足收敛条件，得到语音识别模型。Step 343: Adjust the parameters of the initial language model according to the language features and the initial language score, and adjust the parameters of the initial acoustic model according to the acoustic features and the initial acoustic score, until both the initial language model and the initial acoustic model meet the convergence condition, to obtain the speech recognition model.
训练样本是指用来语音训练的样本数据,训练样本包括语言特征和声学特征。语言特征是指用来区分不同的语言的特征,比如中文具有中文的特征,英文具有英文的特征等等,就像人耳能够根据不同国家语言的特色能够识别出不同的语言一样。声学特征是指将声学和发音学结合所得到的特征。The training sample refers to the sample data used for speech training, and the training sample includes language features and acoustic features. Linguistic features refer to the features used to distinguish different languages. For example, Chinese has the characteristics of Chinese, and English has the characteristics of English, etc., just as the human ear can recognize different languages according to the characteristics of different national languages. Acoustic feature refers to the feature obtained by combining acoustics and pronunciation.
应该理解的是，虽然图2-6的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，图2-6中的至少一部分步骤可以包括多个子步骤或者多个阶段，这些子步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些子步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the steps in the flowcharts of FIGS. 2-6 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIGS. 2-6 may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily completed at the same moment but can be executed at different moments, and their execution order is not necessarily sequential, as they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
在其中一个实施例中，如图7所示，提供了一种检索装置，包括：语音获取模块510、语音识别模块520、关键信息确认模块530和检索模块540，其中：In one of the embodiments, as shown in FIG. 7, a retrieval apparatus is provided, including: a voice acquisition module 510, a voice recognition module 520, a key information confirmation module 530, and a retrieval module 540, wherein:
语音获取模块510,用于获取待识别语音。The voice acquisition module 510 is used to acquire the voice to be recognized.
语音识别模块520,用于将所述待识别语音输入已训练的语音识别模型中进行识别,得到识别文本。The voice recognition module 520 is configured to input the to-be-recognized voice into a trained voice recognition model for recognition to obtain recognized text.
关键信息确认模块530，用于将所述识别文本输入已训练的语义分析模型和情感分析模型中，分别得到第一特征数据和第二特征数据，其中，所述第一特征数据为对所述识别文本进行语义分析的分析结果；所述第二特征数据为对所述识别文本进行情感分析的分析结果；还用于对所述识别文本进行词语预处理后，得到目标文本，其中，所述词语预处理包括分词、去除停用词、词语过滤；还用于将所述第一特征数据、第二特征数据、目标文本输入文本分类模型中，所述文本分类模型根据所述第一特征数据和第二特征数据得到匹配成功的第一逻辑规则，根据所述第一逻辑规则对所述目标文本进行分类处理，得到关键信息。The key information confirmation module 530 is configured to input the recognized text into the trained semantic analysis model and the trained sentiment analysis model to obtain first feature data and second feature data, respectively, where the first feature data is the analysis result of performing semantic analysis on the recognized text and the second feature data is the analysis result of performing sentiment analysis on the recognized text; it is further configured to perform word preprocessing on the recognized text to obtain a target text, where the word preprocessing includes word segmentation, stop word removal, and word filtering; and it is further configured to input the first feature data, the second feature data, and the target text into a text classification model, which obtains a successfully matched first logic rule according to the first feature data and the second feature data and classifies the target text according to the first logic rule to obtain key information.
检索模块540，用于根据所述关键信息进行检索得到目标检索内容。The retrieval module 540 is configured to perform retrieval according to the key information to obtain the target retrieval content.
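The rule matching performed by the key information confirmation module can be pictured with the following sketch. The rule structure (a semantic label, a sentiment label, and a keyword set) is an assumed simplification of the first logic rule described above, not the patent's actual data layout:

```python
def classify_key_info(first_feature, second_feature, target_words, rules):
    """Find the first logic rule whose semantic/sentiment conditions
    match the feature data, then use that rule's keyword set to pick
    key information out of the preprocessed target text."""
    for rule in rules:
        if (rule["semantic"] == first_feature
                and rule["sentiment"] == second_feature):
            # Classify: keep only target-text words the rule selects.
            return [w for w in target_words if w in rule["keywords"]]
    return []

rules = [
    {"semantic": "query", "sentiment": "neutral",
     "keywords": {"insurance", "premium"}},
    {"semantic": "complaint", "sentiment": "negative",
     "keywords": {"refund", "delay"}},
]
key_info = classify_key_info(
    "query", "neutral", ["check", "insurance", "premium", "please"], rules)
```

The matched rule's keyword filter yields the key information that the retrieval module would then use to obtain the target retrieval content.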
在其中一个实施例中，所述语音识别模型包括声学模型和语言模型，语音识别模块520包括：In one of the embodiments, the speech recognition model includes an acoustic model and a language model, and the speech recognition module 520 includes:
特征序列提取单元,用于对所述语音数据的音频信号进行信号处理和特征提取,得到特征序列。The feature sequence extraction unit is used to perform signal processing and feature extraction on the audio signal of the voice data to obtain a feature sequence.
得分确认单元,用于将所述特征序列输入已训练的声学模型和已训练的语言模型中,分别得到声学模型得分和语言模型得分。The score confirmation unit is used to input the feature sequence into the trained acoustic model and the trained language model to obtain the acoustic model score and the language model score respectively.
识别文本获取单元，用于对所述声学模型得分和所述语言模型得分进行解码搜索，得到所述识别文本。The recognized text acquisition unit is configured to perform a decoding search on the acoustic model score and the language model score to obtain the recognized text.
在其中一个实施例中,所述识别文本获取单元还包括:In one of the embodiments, the recognized text obtaining unit further includes:
预设假设词序列获取单元,用于获取预设假设词序列。The preset hypothesis word sequence obtaining unit is used to obtain the preset hypothesis word sequence.
得分计算单元，用于根据所述特征序列中的特征向量计算所述预设假设词序列的所述声学模型得分，得到声学模型得分组；还用于根据所述特征序列中的特征向量计算所述预设假设词序列的所述语言模型得分，得到语言模型得分组。The score calculation unit is configured to calculate the acoustic model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain an acoustic model score group, and to calculate the language model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain a language model score group.
识别文本确认单元，用于根据所述声学模型得分组和语言模型得分组，计算所述预设假设词序列中假设词的总体得分，将所述总体得分最高的假设词作为所述识别文本。The recognized text confirmation unit is configured to calculate the overall score of each hypothesis word in the preset hypothesis word sequence according to the acoustic model score group and the language model score group, and use the hypothesis word with the highest overall score as the recognized text.
关于检索装置的具体限定可以参见上文中对于检索方法的限定,在此不再赘述。上述检索装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。For the specific limitation of the retrieval device, please refer to the above limitation on the retrieval method, which will not be repeated here. Each module in the above retrieval device can be implemented in whole or in part by software, hardware and a combination thereof. The foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
在其中一个实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图8所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种检索方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。In one of the embodiments, a computer device is provided. The computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 8. The computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. The processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer readable instructions. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by the processor to implement a retrieval method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, or it can be a button, a trackball or a touchpad set on the housing of the computer equipment , It can also be an external keyboard, touchpad, or mouse.
本领域技术人员可以理解，图8中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
一种计算机设备,包括存储器和一个或多个处理器,存储器中存储有计算机可读指令,计算机可读指令被处理器执行时实现本申请任意一个实施例中提供的检索方法的步骤。A computer device includes a memory and one or more processors. The memory stores computer readable instructions. When the computer readable instructions are executed by the processor, the steps of the retrieval method provided in any embodiment of the present application are implemented.
一个或多个存储有计算机可读指令的非易失性存储介质，计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器实现本申请任意一个实施例中提供的检索方法的步骤。One or more non-volatile storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to implement the steps of the retrieval method provided in any embodiment of the present application.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机可读指令来指令相关的硬件来完成，所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中，该计算机可读指令在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用，均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限，RAM以多种形式可得，诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
以上实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above embodiments can be combined arbitrarily. For conciseness of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combinations of these technical features, they should be considered within the scope of this specification.
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only express several implementation manners of the present application, and the description is relatively specific and detailed, but it should not be understood as a limitation on the scope of the invention patent. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims (18)

  1. 一种检索方法,包括:A retrieval method including:
    获取待识别语音;Obtain the voice to be recognized;
    将所述待识别语音输入已训练的语音识别模型中进行识别,得到识别文本;Input the to-be-recognized speech into a trained speech recognition model for recognition to obtain a recognized text;
    将所述识别文本输入已训练的语义分析模型和情感分析模型中，分别得到第一特征数据和第二特征数据；其中，所述第一特征数据为对所述识别文本进行语义分析的分析结果；所述第二特征数据为对所述识别文本进行情感分析的分析结果；Inputting the recognized text into the trained semantic analysis model and the trained sentiment analysis model to obtain first feature data and second feature data, respectively, where the first feature data is the analysis result of performing semantic analysis on the recognized text and the second feature data is the analysis result of performing sentiment analysis on the recognized text;
    对所述识别文本进行词语预处理后，得到目标文本；其中，所述词语预处理包括分词、去除停用词、词语过滤；After performing word preprocessing on the recognized text, a target text is obtained, where the word preprocessing includes word segmentation, stop word removal, and word filtering;
    将所述第一特征数据、第二特征数据、目标文本输入文本分类模型中，所述文本分类模型根据所述第一特征数据和第二特征数据得到匹配成功的第一逻辑规则，根据所述第一逻辑规则对所述目标文本进行分类处理，得到关键信息；及Inputting the first feature data, the second feature data, and the target text into a text classification model, where the text classification model obtains a successfully matched first logic rule according to the first feature data and the second feature data, and classifies the target text according to the first logic rule to obtain key information; and
    根据所述关键信息进行检索得到目标检索内容。The target retrieval content is obtained by searching according to the key information.
  2. 根据权利要求1所述的方法，其特征在于，所述语音识别模型包括声学模型和语言模型，所述将所述待识别语音输入已训练的语音识别模型中进行识别，得到识别文本的步骤，包括：The method according to claim 1, wherein the speech recognition model includes an acoustic model and a language model, and the step of inputting the speech to be recognized into a trained speech recognition model for recognition to obtain recognized text includes:
    对所述待识别语音的音频信号进行信号处理和特征提取,得到特征序列;Signal processing and feature extraction on the audio signal of the voice to be recognized to obtain a feature sequence;
    将所述特征序列输入已训练的声学模型和已训练的语言模型中,分别得到声学模型得分和语言模型得分;及Input the feature sequence into the trained acoustic model and the trained language model to obtain acoustic model scores and language model scores respectively; and
    对所述声学模型得分和所述语言模型得分进行解码搜索，得到所述识别文本。A decoding search is performed on the acoustic model score and the language model score to obtain the recognized text.
  3. 根据权利要求2所述的方法,其特征在于,所述对所述声学模型得分和所述语音模型得分进行解码搜索,得到识别文本的步骤,包括:The method according to claim 2, wherein the step of decoding and searching the acoustic model score and the speech model score to obtain the recognized text comprises:
    获取预设假设词序列;Obtain the presupposition word sequence;
    根据所述特征序列中的特征向量计算所述预设假设词序列的所述声学模型得分，得到声学模型得分组；Calculating the acoustic model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain an acoustic model score group;
    根据所述特征序列中的特征向量计算所述预设假设词序列的所述语言模型得分，得到语言模型得分组；及Calculating the language model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain a language model score group; and
    根据所述声学模型得分组和语言模型得分组，计算所述预设假设词序列中假设词的总体得分，将所述总体得分最高的假设词作为所述识别文本。Calculating the overall score of each hypothesis word in the preset hypothesis word sequence according to the acoustic model score group and the language model score group, and using the hypothesis word with the highest overall score as the recognized text.
  4. 根据权利要求1所述的方法,其特征在于,待训练模型包括所述语义分析模型、所述情感分析模型和所述文本分类模型,所述待训练模型的训练步骤,包括:The method according to claim 1, wherein the model to be trained comprises the semantic analysis model, the sentiment analysis model and the text classification model, and the training step of the model to be trained comprises:
    获取训练样本集,所述训练样本集包括粒度数据样本、语言数据样本和模态数据样本,所述粒度数据样本包括粒度数据特征、语言数据特征、模态数据特征;Acquiring a training sample set, the training sample set including granular data samples, language data samples and modal data samples, the granular data samples including granular data features, language data features, and modal data features;
    获取待训练文本,将待训练文本输入初始待训练模型,得到初始文本;及Obtain the text to be trained, and input the text to be trained into the initial model to be trained to obtain the initial text; and
    根据所述初始文本、所述粒度数据特征、所述语言数据特征、所述模态数据特征对所述初始待训练模型进行参数调整，直到满足收敛条件，得到所述语义分析模型、所述情感分析模型、所述文本分类模型。Adjusting the parameters of the initial model to be trained according to the initial text, the granular data features, the language data features, and the modal data features until the convergence condition is met, to obtain the semantic analysis model, the sentiment analysis model, and the text classification model.
  5. 根据权利要求1所述的方法,其特征在于,所述语音识别模型包括声学模型和语言模型,所述语音识别模型的训练步骤包括:The method according to claim 1, wherein the speech recognition model includes an acoustic model and a language model, and the training step of the speech recognition model comprises:
    获取训练样本,所述训练样本包括语言特征和声学特征;Acquiring training samples, the training samples including language features and acoustic features;
    获取待识别训练语音,将待识别训练语音输入初始语言模型,得到初始语言得分;Obtain the training speech to be recognized, and input the training speech to be recognized into the initial language model to obtain the initial language score;
    获取待识别训练语音,将待识别训练语音输入初始声学模型,得到初始声学得分;Obtain the training speech to be recognized, and input the training speech to be recognized into the initial acoustic model to obtain the initial acoustic score;
    根据所述语言特征、所述初始语言得分对所述初始语言模型进行参数调整，根据所述声学特征、所述初始声学得分对所述初始声学模型进行参数调整，直到所述初始语言模型和所述初始声学模型都满足收敛条件，得到所述语音识别模型。Adjusting the parameters of the initial language model according to the language features and the initial language score, and adjusting the parameters of the initial acoustic model according to the acoustic features and the initial acoustic score, until both the initial language model and the initial acoustic model meet the convergence condition, to obtain the speech recognition model.
  6. 一种检索装置,包括:A retrieval device, including:
    语音获取模块,用于获取待识别语音;The voice acquisition module is used to acquire the voice to be recognized;
    语音识别模块,用于将所述待识别语音输入已训练的语音识别模型中进行识别,得到识别文本;A speech recognition module, configured to input the to-be-recognized speech into a trained speech recognition model for recognition to obtain recognized text;
    关键信息确认模块，用于将所述识别文本输入已训练的语义分析模型和情感分析模型中，分别得到第一特征数据和第二特征数据，其中，所述第一特征数据为对所述识别文本进行语义分析的分析结果；所述第二特征数据为对所述识别文本进行情感分析的分析结果；还用于对所述识别文本进行词语预处理后，得到目标文本；其中，所述词语预处理包括分词、去除停用词、词语过滤；还用于将所述第一特征数据、第二特征数据、目标文本输入文本分类模型中，所述文本分类模型根据所述第一特征数据和第二特征数据得到匹配成功的第一逻辑规则，根据所述第一逻辑规则对所述目标文本进行分类处理，得到关键信息；The key information confirmation module is configured to input the recognized text into the trained semantic analysis model and the trained sentiment analysis model to obtain first feature data and second feature data, respectively, where the first feature data is the analysis result of performing semantic analysis on the recognized text and the second feature data is the analysis result of performing sentiment analysis on the recognized text; it is further configured to perform word preprocessing on the recognized text to obtain a target text, where the word preprocessing includes word segmentation, stop word removal, and word filtering; and it is further configured to input the first feature data, the second feature data, and the target text into a text classification model, which obtains a successfully matched first logic rule according to the first feature data and the second feature data and classifies the target text according to the first logic rule to obtain key information;
    检索模块,用于根据所述关键信息进行检索得到目标检索内容。The retrieval module is used to retrieve the target retrieval content according to the key information.
  7. 根据权利要求6所述的装置,其特征在于,所述语音识别模型包括声学模型和语言模型,所述语音识别模块包括:The device according to claim 6, wherein the speech recognition model comprises an acoustic model and a language model, and the speech recognition module comprises:
    特征序列提取单元,用于对所述语音数据的音频信号进行信号处理和特征提取,得到特征序列;The feature sequence extraction unit is configured to perform signal processing and feature extraction on the audio signal of the voice data to obtain a feature sequence;
    得分确认单元,用于将所述特征序列输入已训练的声学模型和已训练的语言模型中,分别得到声学模型得分和语言模型得分;The score confirmation unit is used to input the feature sequence into the trained acoustic model and the trained language model to obtain the acoustic model score and the language model score respectively;
    识别文本获取单元，用于对所述声学模型得分和所述语言模型得分进行解码搜索，得到所述识别文本。The recognized text acquisition unit is configured to perform a decoding search on the acoustic model score and the language model score to obtain the recognized text.
  8. 根据权利要求7所述的装置,其特征在于,所述识别文本获取单元包括:The device according to claim 7, wherein the recognition text obtaining unit comprises:
    预设假设词序列获取单元,用于获取预设假设词序列;The preset hypothesis word sequence obtaining unit is used to obtain the preset hypothesis word sequence;
    得分计算单元，用于根据所述特征序列中的特征向量计算所述预设假设词序列的所述声学模型得分，得到声学模型得分组；还用于根据所述特征序列中的特征向量计算所述预设假设词序列的所述语言模型得分，得到语言模型得分组；The score calculation unit is configured to calculate the acoustic model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain an acoustic model score group, and to calculate the language model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain a language model score group;
    识别文本确认单元，用于根据所述声学模型得分组和语言模型得分组，计算所述预设假设词序列中假设词的总体得分，将所述总体得分最高的假设词作为所述识别文本。The recognized text confirmation unit is configured to calculate the overall score of each hypothesis word in the preset hypothesis word sequence according to the acoustic model score group and the language model score group, and use the hypothesis word with the highest overall score as the recognized text.
  9. 一种计算机设备，包括存储器及一个或多个处理器，所述存储器中储存有计算机可读指令，所述计算机可读指令被所述一个或多个处理器执行时，使得所述一个或多个处理器执行以下步骤：A computer device, including a memory and one or more processors, where the memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    获取待识别语音;Obtain the voice to be recognized;
    将所述待识别语音输入已训练的语音识别模型中进行识别,得到识别文本;Input the to-be-recognized speech into a trained speech recognition model for recognition to obtain a recognized text;
    将所述识别文本输入已训练的语义分析模型和情感分析模型中，分别得到第一特征数据和第二特征数据；其中，所述第一特征数据为对所述识别文本进行语义分析的分析结果；所述第二特征数据为对所述识别文本进行情感分析的分析结果；Inputting the recognized text into the trained semantic analysis model and the trained sentiment analysis model to obtain first feature data and second feature data, respectively, where the first feature data is the analysis result of performing semantic analysis on the recognized text and the second feature data is the analysis result of performing sentiment analysis on the recognized text;
    对所述识别文本进行词语预处理后，得到目标文本；其中，所述词语预处理包括分词、去除停用词、词语过滤；After performing word preprocessing on the recognized text, a target text is obtained, where the word preprocessing includes word segmentation, stop word removal, and word filtering;
    将所述第一特征数据、第二特征数据、目标文本输入文本分类模型中，所述文本分类模型根据所述第一特征数据和第二特征数据得到匹配成功的第一逻辑规则，根据所述第一逻辑规则对所述目标文本进行分类处理，得到关键信息；及Inputting the first feature data, the second feature data, and the target text into a text classification model, where the text classification model obtains a successfully matched first logic rule according to the first feature data and the second feature data, and classifies the target text according to the first logic rule to obtain key information; and
    根据所述关键信息进行检索得到目标检索内容。The target retrieval content is obtained by searching according to the key information.
  10. 根据权利要求9所述的计算机设备，其特征在于，所述语音识别模型包括声学模型和语言模型，所述将所述待识别语音输入已训练的语音识别模型中进行识别，得到识别文本的步骤，包括：The computer device according to claim 9, wherein the speech recognition model includes an acoustic model and a language model, and the step of inputting the speech to be recognized into a trained speech recognition model for recognition to obtain recognized text includes:
    对所述待识别语音的音频信号进行信号处理和特征提取,得到特征序列;Signal processing and feature extraction on the audio signal of the voice to be recognized to obtain a feature sequence;
    将所述特征序列输入已训练的声学模型和已训练的语言模型中,分别得到声学模型得分和语言模型得分;及Input the feature sequence into the trained acoustic model and the trained language model to obtain acoustic model scores and language model scores respectively; and
    对所述声学模型得分和所述语言模型得分进行解码搜索，得到所述识别文本。A decoding search is performed on the acoustic model score and the language model score to obtain the recognized text.
  11. 根据权利要求10所述的计算机设备,其特征在于,所述对所述声学模型得分和所述语音模型得分进行解码搜索,得到识别文本的步骤,包括:The computer device according to claim 10, wherein the step of decoding and searching the acoustic model score and the speech model score to obtain the recognized text comprises:
    获取预设假设词序列;Obtain the presupposition word sequence;
    根据所述特征序列中的特征向量计算所述预设假设词序列的所述声学模型得分，得到声学模型得分组；Calculating the acoustic model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain an acoustic model score group;
    根据所述特征序列中的特征向量计算所述预设假设词序列的所述语言模型得分，得到语言模型得分组；及Calculating the language model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain a language model score group; and
    根据所述声学模型得分组和语言模型得分组，计算所述预设假设词序列中假设词的总体得分，将所述总体得分最高的假设词作为所述识别文本。Calculating the overall score of each hypothesis word in the preset hypothesis word sequence according to the acoustic model score group and the language model score group, and using the hypothesis word with the highest overall score as the recognized text.
  12. The computer device according to claim 9, wherein the models to be trained comprise the semantic analysis model, the sentiment analysis model, and the text classification model, and the step of training the models to be trained comprises:
    acquiring a training sample set, the training sample set comprising granularity data samples, language data samples, and modality data samples, the granularity data samples comprising granularity data features, language data features, and modality data features;
    acquiring text to be trained, and inputting the text to be trained into an initial model to be trained to obtain initial text; and
    adjusting parameters of the initial model to be trained according to the initial text, the granularity data features, the language data features, and the modality data features until a convergence condition is met, to obtain the semantic analysis model, the sentiment analysis model, and the text classification model.
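The "adjust parameters until a convergence condition is met" loop of claim 12 is a standard iterative fit; a minimal sketch follows. The squared-error loss, learning rate, and tolerance are illustrative assumptions only; the application does not specify the optimizer or the convergence condition.

```python
def train_until_converged(initial_param, target, lr=0.1, tol=1e-6,
                          max_steps=10000):
    """Repeat: measure the discrepancy between model output and the
    training features, adjust the parameter, and stop once the
    convergence condition (loss below tol) holds."""
    param = initial_param
    for _ in range(max_steps):
        loss = (param - target) ** 2   # discrepancy with training features
        if loss < tol:                 # convergence condition met
            break
        grad = 2 * (param - target)
        param -= lr * grad             # parameter adjustment
    return param

print(round(train_until_converged(0.0, 3.0), 3))
```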
  13. The computer device according to claim 9, wherein the speech recognition model comprises an acoustic model and a language model, and the step of training the speech recognition model comprises:
    acquiring training samples, the training samples comprising language features and acoustic features;
    acquiring training speech to be recognized, and inputting the training speech to be recognized into an initial language model to obtain an initial language score;
    acquiring the training speech to be recognized, and inputting the training speech to be recognized into an initial acoustic model to obtain an initial acoustic score; and
    adjusting parameters of the initial language model according to the language features and the initial language score, and adjusting parameters of the initial acoustic model according to the acoustic features and the initial acoustic score, until both the initial language model and the initial acoustic model meet the convergence condition, to obtain the speech recognition model.
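The distinctive point of claim 13 is that training continues until *both* the initial acoustic model and the initial language model satisfy the convergence condition. A sketch of that joint stopping criterion, with each model reduced to a single illustrative parameter and a squared-error loss (assumptions, not details from the claims):

```python
def train_two_models(am_param, lm_param, am_target, lm_target,
                     lr=0.2, tol=1e-8):
    """Keep adjusting whichever model has not yet converged; stop only
    when BOTH models meet the convergence condition."""
    def loss(p, t):
        return (p - t) ** 2

    while loss(am_param, am_target) >= tol or loss(lm_param, lm_target) >= tol:
        if loss(am_param, am_target) >= tol:      # acoustic model update
            am_param -= lr * 2 * (am_param - am_target)
        if loss(lm_param, lm_target) >= tol:      # language model update
            lm_param -= lr * 2 * (lm_param - lm_target)
    return am_param, lm_param

print(train_two_models(0.0, 0.0, 1.0, 2.0))
```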
  14. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring speech to be recognized;
    inputting the speech to be recognized into a trained speech recognition model for recognition to obtain recognized text;
    inputting the recognized text into a trained semantic analysis model and a trained sentiment analysis model to obtain first feature data and second feature data respectively, wherein the first feature data is an analysis result of performing semantic analysis on the recognized text, and the second feature data is an analysis result of performing sentiment analysis on the recognized text;
    performing word preprocessing on the recognized text to obtain target text, wherein the word preprocessing comprises word segmentation, stop word removal, and word filtering;
    inputting the first feature data, the second feature data, and the target text into a text classification model, the text classification model obtaining a successfully matched first logic rule according to the first feature data and the second feature data, and classifying the target text according to the first logic rule to obtain key information; and
    performing retrieval according to the key information to obtain target retrieval content.
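The word-preprocessing step in claim 14 (word segmentation, stop word removal, word filtering) can be sketched as below. A whitespace split stands in for a real Chinese word segmenter, and the stop-word list and minimum-length filter are illustrative assumptions; the claims do not name a concrete segmenter or filter rule.

```python
STOP_WORDS = {"the", "a", "of", "to"}   # illustrative stop-word list

def preprocess(recognized_text, min_len=2):
    """Word preprocessing: segmentation, stop-word removal, word filtering."""
    tokens = recognized_text.lower().split()             # segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    tokens = [t for t in tokens if len(t) >= min_len]    # word filtering
    return tokens

print(preprocess("The balance of my account"))
```

The resulting target text is what the text classification model classifies (together with the semantic and sentiment feature data) to extract the key information used for retrieval.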
  15. The storage medium according to claim 14, wherein the speech recognition model comprises an acoustic model and a language model, and the step of inputting the speech to be recognized into the trained speech recognition model for recognition to obtain the recognized text comprises:
    performing signal processing and feature extraction on an audio signal of the speech to be recognized to obtain a feature sequence;
    inputting the feature sequence into the trained acoustic model and the trained language model to obtain an acoustic model score and a language model score respectively; and
    performing a decoding search on the acoustic model score and the language model score to obtain the recognized text.
  16. The storage medium according to claim 15, wherein the step of performing a decoding search on the acoustic model score and the language model score to obtain the recognized text comprises:
    obtaining a preset hypothesis word sequence;
    calculating the acoustic model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain an acoustic model score group;
    calculating the language model score of the preset hypothesis word sequence according to the feature vectors in the feature sequence to obtain a language model score group; and
    calculating an overall score for each hypothesis word in the preset hypothesis word sequence according to the acoustic model score group and the language model score group, and taking the hypothesis word with the highest overall score as the recognized text.
  17. The storage medium according to claim 14, wherein the models to be trained comprise the semantic analysis model, the sentiment analysis model, and the text classification model, and the step of training the models to be trained comprises:
    acquiring a training sample set, the training sample set comprising granularity data samples, language data samples, and modality data samples, the granularity data samples comprising granularity data features, language data features, and modality data features;
    acquiring text to be trained, and inputting the text to be trained into an initial model to be trained to obtain initial text; and
    adjusting parameters of the initial model to be trained according to the initial text, the granularity data features, the language data features, and the modality data features until a convergence condition is met, to obtain the semantic analysis model, the sentiment analysis model, and the text classification model.
  18. The storage medium according to claim 14, wherein the speech recognition model comprises an acoustic model and a language model, and the step of training the speech recognition model comprises:
    acquiring training samples, the training samples comprising language features and acoustic features;
    acquiring training speech to be recognized, and inputting the training speech to be recognized into an initial language model to obtain an initial language score;
    acquiring the training speech to be recognized, and inputting the training speech to be recognized into an initial acoustic model to obtain an initial acoustic score; and
    adjusting parameters of the initial language model according to the language features and the initial language score, and adjusting parameters of the initial acoustic model according to the acoustic features and the initial acoustic score, until both the initial language model and the initial acoustic model meet the convergence condition, to obtain the speech recognition model.
PCT/CN2019/118254 2019-07-03 2019-11-14 Retrieval method and apparatus, and computer device and storage medium WO2021000497A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910594101.9 2019-07-03
CN201910594101.9A CN110444198B (en) 2019-07-03 2019-07-03 Retrieval method, retrieval device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021000497A1

Family

ID=68428519

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118254 WO2021000497A1 (en) 2019-07-03 2019-11-14 Retrieval method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110444198B (en)
WO (1) WO2021000497A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297353A (en) * 2021-06-16 2021-08-24 深圳前海微众银行股份有限公司 Text matching method, device, equipment and storage medium
CN113593535A (en) * 2021-06-30 2021-11-02 青岛海尔科技有限公司 Voice data processing method and device, storage medium and electronic device
CN113704447A (en) * 2021-03-03 2021-11-26 腾讯科技(深圳)有限公司 Text information identification method and related device
CN113724698A (en) * 2021-09-01 2021-11-30 马上消费金融股份有限公司 Training method, device and equipment of speech recognition model and storage medium
CN113761894A (en) * 2021-01-18 2021-12-07 北京沃东天骏信息技术有限公司 Target word removing method, model training method, device, electronic equipment and medium
CN113971203A (en) * 2021-10-26 2022-01-25 福建云知声智能科技有限公司 Information processing method, information processing apparatus, storage medium, and electronic apparatus
CN114299918A (en) * 2021-12-22 2022-04-08 标贝(北京)科技有限公司 Acoustic model training and speech synthesis method, device and system, and storage medium
CN114333790A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Data processing method, device, equipment, storage medium and program product
CN114387678A (en) * 2022-01-11 2022-04-22 凌云美嘉(西安)智能科技有限公司 Method and apparatus for evaluating language readability using non-verbal body symbols
CN115035984A (en) * 2022-06-17 2022-09-09 上海暖禾脑科学技术有限公司 Method and system for assessing level of psychological consultant in real-time patient persuasion
CN117540917A (en) * 2023-11-14 2024-02-09 大能手教育科技(北京)有限公司 Training platform aided training method, device, equipment and medium
CN117594060A (en) * 2023-10-31 2024-02-23 北京邮电大学 Audio signal content analysis method, device, equipment and storage medium
CN117877525A (en) * 2024-03-13 2024-04-12 广州汇智通信技术有限公司 Audio retrieval method and device based on variable granularity characteristics

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110444198B (en) * 2019-07-03 2023-05-30 平安科技(深圳)有限公司 Retrieval method, retrieval device, computer equipment and storage medium
CN110866410B (en) * 2019-11-15 2023-07-25 深圳市赛为智能股份有限公司 Multilingual conversion method, multilingual conversion device, computer device, and storage medium
CN112069796B (en) * 2020-09-03 2023-08-04 阳光保险集团股份有限公司 Voice quality inspection method and device, electronic equipment and storage medium
CN112600834B (en) * 2020-12-10 2023-03-24 同盾控股有限公司 Content security identification method and device, storage medium and electronic equipment
CN112466278B (en) * 2020-12-16 2022-02-18 北京百度网讯科技有限公司 Voice recognition method and device and electronic equipment
CN113314106A (en) * 2021-05-19 2021-08-27 国网辽宁省电力有限公司 Electric power information query and regulation function calling method based on voice and intention recognition
CN114360533A (en) * 2021-12-20 2022-04-15 日立楼宇技术(广州)有限公司 Interaction method and system based on machine learning, elevator equipment and medium
CN114547474A (en) * 2022-04-21 2022-05-27 北京泰迪熊移动科技有限公司 Data searching method, system, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005275601A (en) * 2004-03-23 2005-10-06 Fujitsu Ltd Information retrieval system by voice
CN105095406A (en) * 2015-07-09 2015-11-25 百度在线网络技术(北京)有限公司 Method and apparatus for voice search based on user feature
CN105260416A (en) * 2015-09-25 2016-01-20 百度在线网络技术(北京)有限公司 Voice recognition based searching method and apparatus
CN106095799A (en) * 2016-05-30 2016-11-09 广州多益网络股份有限公司 The storage of a kind of voice, search method and device
US20180301145A1 (en) * 2010-09-17 2018-10-18 Nuance Communications, Inc. System and Method for Using Prosody for Voice-Enabled Search
CN108961887A (en) * 2018-07-24 2018-12-07 广东小天才科技有限公司 Voice search control method and family education equipment
CN110444198A (en) * 2019-07-03 2019-11-12 平安科技(深圳)有限公司 Search method, device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8374865B1 (en) * 2012-04-26 2013-02-12 Google Inc. Sampling training data for an automatic speech recognition system based on a benchmark classification distribution
CN104143329B (en) * 2013-08-19 2015-10-21 腾讯科技(深圳)有限公司 Carry out method and the device of voice keyword retrieval
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761894A (en) * 2021-01-18 2021-12-07 北京沃东天骏信息技术有限公司 Target word removing method, model training method, device, electronic equipment and medium
CN113704447B (en) * 2021-03-03 2024-05-03 腾讯科技(深圳)有限公司 Text information identification method and related device
CN113704447A (en) * 2021-03-03 2021-11-26 腾讯科技(深圳)有限公司 Text information identification method and related device
CN113297353A (en) * 2021-06-16 2021-08-24 深圳前海微众银行股份有限公司 Text matching method, device, equipment and storage medium
CN113593535A (en) * 2021-06-30 2021-11-02 青岛海尔科技有限公司 Voice data processing method and device, storage medium and electronic device
CN113593535B (en) * 2021-06-30 2024-05-24 青岛海尔科技有限公司 Voice data processing method and device, storage medium and electronic device
CN113724698A (en) * 2021-09-01 2021-11-30 马上消费金融股份有限公司 Training method, device and equipment of speech recognition model and storage medium
CN113724698B (en) * 2021-09-01 2024-01-30 马上消费金融股份有限公司 Training method, device, equipment and storage medium of voice recognition model
CN113971203A (en) * 2021-10-26 2022-01-25 福建云知声智能科技有限公司 Information processing method, information processing apparatus, storage medium, and electronic apparatus
CN114333790A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Data processing method, device, equipment, storage medium and program product
CN114299918A (en) * 2021-12-22 2022-04-08 标贝(北京)科技有限公司 Acoustic model training and speech synthesis method, device and system, and storage medium
CN114387678A (en) * 2022-01-11 2022-04-22 凌云美嘉(西安)智能科技有限公司 Method and apparatus for evaluating language readability using non-verbal body symbols
CN115035984A (en) * 2022-06-17 2022-09-09 上海暖禾脑科学技术有限公司 Method and system for assessing level of psychological consultant in real-time patient persuasion
CN117594060A (en) * 2023-10-31 2024-02-23 北京邮电大学 Audio signal content analysis method, device, equipment and storage medium
CN117540917A (en) * 2023-11-14 2024-02-09 大能手教育科技(北京)有限公司 Training platform aided training method, device, equipment and medium
CN117540917B (en) * 2023-11-14 2024-05-28 大能手教育科技(北京)有限公司 Training platform aided training method, device, equipment and medium
CN117877525A (en) * 2024-03-13 2024-04-12 广州汇智通信技术有限公司 Audio retrieval method and device based on variable granularity characteristics

Also Published As

Publication number Publication date
CN110444198A (en) 2019-11-12
CN110444198B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2021000497A1 (en) Retrieval method and apparatus, and computer device and storage medium
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
Ramet et al. Context-aware attention mechanism for speech emotion recognition
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN111984766A (en) Missing semantic completion method and device
CN114580382A (en) Text error correction method and device
CN110427610A (en) Text analyzing method, apparatus, computer installation and computer storage medium
An et al. Lexical and Acoustic Deep Learning Model for Personality Recognition.
CN109509470A (en) Voice interactive method, device, computer readable storage medium and terminal device
CN113343108B (en) Recommended information processing method, device, equipment and storage medium
US12100388B2 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
CN112307770A (en) Sensitive information detection method and device, electronic equipment and storage medium
WO2021129410A1 (en) Method and device for text processing
WO2021129411A1 (en) Text processing method and device
CN114398902A (en) Chinese semantic extraction method and related equipment based on artificial intelligence
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN119047494B (en) Neural network text translation enhancement method and system in multilingual cross-language environment
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN114218356B (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN113158052B (en) Chat content recommendation method, chat content recommendation device, computer equipment and storage medium
CN115174285A (en) Conference record generation method and device and electronic equipment
KR20210085694A (en) Apparatus for image captioning and method thereof
CN116579333A (en) Keyword extraction method, device, computer equipment and storage medium
CN115358216A (en) Text error correction method and device, electronic terminal and storage medium
WO2023137903A1 (en) Reply statement determination method and apparatus based on rough semantics, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19936275

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19936275

Country of ref document: EP

Kind code of ref document: A1