
CN114817465A - Entity error correction method and intelligent device for multi-language semantic understanding - Google Patents


Info

Publication number
CN114817465A
CN114817465A (application CN202210394592.4A)
Authority
CN
China
Prior art keywords
entity
corrected
candidate
entities
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210394592.4A
Other languages
Chinese (zh)
Inventor
胡胜元
曹晚霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Electronic Technology Wuhan Co ltd
Original Assignee
Hisense Electronic Technology Wuhan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co ltd filed Critical Hisense Electronic Technology Wuhan Co ltd
Priority to CN202210394592.4A priority Critical patent/CN114817465A/en
Publication of CN114817465A publication Critical patent/CN114817465A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3343 - Query execution using phonetics
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G06F40/20 - Natural language analysis
    • G06F40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide an entity error correction method and intelligent device for multi-language semantic understanding. The method comprises the following steps: after the entity to be corrected is obtained, the entity to be corrected is encoded using a sound-shape code algorithm; candidate entities matching the encoded entity to be corrected are searched in a sound-shape code database; and the candidate entities are screened according to a knowledge graph to obtain a result entity, where the knowledge graph describes the association relationships among the candidate entities, and the result entity is a candidate entity having an association relationship with other candidate entities. The entity error correction method and intelligent device for multi-language semantic understanding provide a unified framework for multi-language semantic understanding, can span the influence of different languages in the absence of large-scale training data, and realize entity error correction for texts in different languages, thereby improving the accuracy of semantic understanding and entity recognition, improving the performance of multi-language speech recognition products, and further improving the user experience.

Description

Entity error correction method and intelligent device for multi-language semantic understanding
Technical Field
The application relates to the technical field of voice interaction, in particular to an entity error correction method and intelligent equipment for multi-language semantic understanding.
Background
With the development of intelligent voice interaction technology, voice interaction functions gradually become standard configurations of intelligent terminal products. The user can utilize the voice interaction function to realize the voice control of the intelligent terminal product, and carry out a series of operations such as video watching, music listening, weather checking, television control and the like.
The process of voice control of smart terminal products is typically that a voice recognition model recognizes the voice input by a user as text. And then, the semantic understanding model analyzes the lexical syntax and the semantics of the text, so as to understand the intention of the user. And finally, the control end controls the intelligent terminal product to carry out corresponding operation according to the understanding result.
In practical applications, both the speech recognition model and the semantic understanding model produce errors, and the errors accumulated across the two stages directly affect the speech recognition quality of the whole system. Therefore, existing speech recognition systems are configured with error correction models to correct the entities in user request statements and thus improve the quality of the overall system. At present, multi-language error correction mainly focuses on text grammar correction, and deep learning models are usually trained on large-scale text data.
However, since the entities in multi-language speech recognition are usually short texts, these short text entities are insensitive to character order and carry no semantic information such as grammar, so it is difficult to correct them based on text grammar. In addition, the training data sets used by current multilingual speech recognition are inconsistent with the application scenarios of voice assistants, and collecting large amounts of text data in real scenarios for model training is difficult. Current multi-language speech recognition products therefore cannot satisfy users' needs well. Hence, there is a need for a real-time entity error correction method that can span the influence of different languages in the absence of large-scale training data and is applicable across multiple languages.
Disclosure of Invention
The present application provides an entity error correction method and intelligent device for multi-language semantic understanding, which are used to solve the problem that entities in multi-language speech recognition are usually short texts that are insensitive to character order and carry no semantic information such as grammar, making grammar-based error correction difficult. In addition, the training data sets used by current multilingual speech recognition are inconsistent with the application scenarios of voice assistants, and collecting large amounts of text data in real scenarios for model training is difficult.
In a first aspect, an embodiment of the present application provides an entity error correction method for multi-language semantic understanding, where the method includes: acquiring an entity to be corrected, wherein the entity to be corrected is an entity obtained after semantic analysis processing is carried out on a request statement input by a user;
coding the entity to be corrected by using a sound-shape code algorithm;
searching a candidate entity matched with the coded entity to be corrected in a sound-shape code database;
and screening the candidate entities according to a knowledge graph to obtain a result entity, wherein the knowledge graph describes the association relationship between the candidate entities, and the result entity is the candidate entity having the association relationship with other candidate entities.
In a second aspect, an embodiment of the present application provides an intelligent device for error correction of an entity for multi-language semantic understanding, where the intelligent device includes:
an entity to be corrected obtaining unit configured to perform: acquiring an entity to be corrected, wherein the entity to be corrected is an entity obtained after semantic analysis processing is carried out on a request statement input by a user;
an encoding unit configured to perform: coding the entity to be corrected by using a sound-shape code algorithm;
a candidate entity lookup unit for performing: searching a candidate entity matched with the coded entity to be corrected in a sound-shape code database;
a screening unit to perform: and screening the candidate entities according to a knowledge graph to obtain a result entity, wherein the knowledge graph describes the association relationship between the candidate entities, and the result entity is the candidate entity having the association relationship with other candidate entities.
The technical scheme provided by the application comprises the following beneficial effects: and after the entity to be corrected is obtained, coding the entity to be corrected by utilizing a sound-shape code algorithm. And searching candidate entities matched with the coded entity to be corrected in the sound-shape code database. And screening the candidate entities according to the knowledge graph to obtain a result entity. The knowledge graph describes the incidence relation among the candidate entities, and the result entity is a candidate entity having incidence relation with other candidate entities. The entity error correction method and the intelligent device for multi-language semantic understanding provide a unified framework for multi-language semantic understanding, can span the influence of different languages under the condition of lacking large-scale training data, and realize entity error correction of texts in different languages, so that the accuracy of semantic understanding and entity recognition is improved, the performance of multi-language voice recognition products is improved, and the use experience of users is further improved.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments will be briefly described below; it is obvious that, for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 illustrates a schematic diagram of the principles of voice interaction, in accordance with some embodiments;
FIG. 2 illustrates a block diagram of an entity error correction module, according to some embodiments;
FIG. 3 illustrates a flow diagram of an entity error correction method for multi-lingual semantic understanding, in accordance with some embodiments;
FIG. 4 illustrates a Chinese sound-shape code encoding flow diagram according to some embodiments;
FIG. 5 illustrates a specific example framework diagram of an entity error correction method according to some embodiments;
FIG. 6 illustrates yet another entity error correction method flow diagram in accordance with some embodiments;
FIG. 7 illustrates a further entity error correction method flow diagram according to some embodiments;
FIG. 8 illustrates a diagram of a multi-lingual sound-shape code encoding algorithm, in accordance with some embodiments;
FIG. 9 illustrates yet another entity error correction method flow diagram in accordance with some embodiments;
FIG. 10 illustrates a smart device framework diagram for entity error correction for multi-lingual semantic understanding, in accordance with some embodiments;
FIG. 11 illustrates a flowchart of a multilingual voice assistant application in accordance with some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the functionality associated with that element.
For clarity of explanation of the embodiments of the present application, a speech recognition network architecture provided by the embodiments of the present application is described below with reference to fig. 1.
Referring to fig. 1, fig. 1 is a schematic diagram of a voice recognition network architecture according to an embodiment of the present application. In fig. 1, the smart device is configured to receive input information and output a processing result of the information. The voice recognition service equipment is electronic equipment with voice recognition service deployed, the semantic service equipment is electronic equipment with semantic service deployed, and the business service equipment is electronic equipment with business service deployed. The electronic device may include a server, a computer, and the like, and the speech recognition service, the semantic service (also referred to as a semantic engine), and the business service are web services that can be deployed on the electronic device, wherein the speech recognition service is used for recognizing audio as text, the semantic service is used for semantic parsing of the text, and the business service is used for providing specific services such as a weather query service, a music query service, and the like. In one embodiment, in the architecture shown in fig. 1, there may be multiple entity service devices deployed with different business services, and one or more function services may also be aggregated in one or more entity service devices.
In some embodiments, the following describes an example of a process for processing information input to a smart device based on the architecture shown in fig. 1, where the information input to the smart device is an example of a query statement input by voice, the process may include the following three processes:
and (3) voice recognition: the intelligent device can upload the audio of the query sentence to the voice recognition service device after receiving the query sentence input by voice, so that the voice recognition service device can recognize the audio as a text through the voice recognition service and then return the text to the intelligent device. In one embodiment, before uploading the audio of the query statement to the speech recognition service device, the smart device may perform denoising processing on the audio of the query statement, where the denoising processing may include removing echo and environmental noise.
Semantic understanding: the intelligent device uploads the text of the query sentence identified by the voice identification service to the semantic service device, and the semantic service device performs semantic analysis on the text through semantic service to obtain the service field, intention and the like of the text.
Semantic response: and the semantic service equipment issues a query instruction to corresponding business service equipment according to the semantic analysis result of the text of the query statement so as to obtain the query result given by the business service. The intelligent device can obtain the query result from the semantic service device and output the query result. As an embodiment, the semantic service device may further send a semantic parsing result of the query statement to the intelligent device, so that the intelligent device outputs a feedback statement in the semantic parsing result.
It should be noted that the architecture shown in fig. 1 is only an example, and does not limit the scope of the present application. In the embodiment of the present application, other architectures may also be adopted to implement similar functions, for example: all or part of the three processes can be completed by the intelligent device, and are not described herein.
In some embodiments, the intelligent device shown in fig. 1 may be a display device, such as an intelligent television, the functions of the speech recognition service device may be implemented by cooperation of a sound collector and a controller provided on the display device, and the functions of the semantic service device and the business service device may be implemented by the controller of the display device or by a server of the display device.
In some embodiments, the intelligent device supports voice interaction functions, which can be configured as a standard for intelligent terminal products. The user can utilize the voice interaction function to realize the voice control of the intelligent terminal product, and carry out a series of operations such as video watching, music listening, weather checking, television control and the like.
The process of voice-controlling a smart terminal product is generally that a voice recognition module recognizes a voice input by a user as text. And then, the semantic analysis module analyzes the lexical syntax and the semantics of the text, so that the intention of the user is understood. And finally, the control end controls the intelligent terminal product to carry out corresponding operation according to the understanding result.
In practical application, errors exist in a speech recognition model and a semantic understanding model, and the accumulated errors in the two stages directly influence the speech recognition quality of the whole system. Thus, in some embodiments, the speech recognition system is configured with an error correction model to correct the entities in the user request statements to improve the quality of the overall speech recognition system.
For example, a Chinese user requests "I want to watch a certain movie XX by Wu", where "a certain movie XX" is an entity; the error correction module of the speech system can correct "a certain movie XX" so as to better reach the actual intention of the user and improve the user experience. It should be noted that the entity here refers to a named entity, i.e., an entity with a specific meaning, mainly including person names, place names, organization names, proper nouns, and the like.
In some embodiments, the error correction model is based primarily on two approaches: rule-based methods and machine learning-based methods. The rule-based method refers to correcting homophones, fuzzy sound, and the like according to the pronunciation rules of the language and combining the use habits of users. The machine learning-based method is that a plurality of decision models are constructed, a large amount of error-corrected texts are used for training the models, and finally, the decision models give out uniform error correction results.
For the error correction method of multiple languages, the error correction method in Chinese cannot be simply multiplexed. There are mainly the following problems: chinese is pictographic characters, most of the multi-languages are phonograms, and the character meaning logic of the phonograms is fundamentally different from that of Chinese; pronunciation habits of different languages are different, and a correction candidate set based on rules cannot be effectively constructed; a unified framework is needed to provide multiplexing functions for the recognition of different languages.
In some embodiments, the multi-language correction is mainly a text grammar correction research, and deep learning model training is usually performed by adopting large-scale text data. However, since the entities of the multi-lingual speech recognition are usually short texts, the short text entities are not sensitive to the character order, and have no semantic information such as grammar, and the like, it is difficult to correct errors based on the grammar. In addition, the training data used by the multi-language speech recognition is not consistent with the application scenes of the speech assistant, and the difficulty of model training by collecting a large amount of text data under real scenes is also high. Products aiming at multi-language voice recognition cannot well meet the use requirements of users. Therefore, there is a need for a real-time entity error correction method that can span different language impact in the absence of large-scale training data, and is versatile in multiple languages.
In order to solve the problems, the method provides a unified framework for the multi-language semantic understanding, and can span the influence of different languages under the condition of lacking large-scale training data to realize the entity error correction of texts in different languages, so that the accuracy of semantic understanding and entity recognition is improved, the performance of a multi-language voice recognition product is improved, and the use experience of a user is further improved.
Before the process flow of the method of the present application is explained, the technical terms involved in the present application are explained:
the multilingual semantic understanding model adopts a large-scale multi-language corpus data pre-training model LaBSE (multilingual embedded vector model) to perform coding analysis on a request text, and performs intention judgment and entity identification on the request text. For example, if the user's request is "search for XX by Tom", the semantic analysis result is output after passing through the multilingual semantic analysis model: "intent": "video. search", "actor": "Tom", "title": "XX", where intent is the judged intent is search, and actor and title are both entities.
The Metaphone algorithm is a sound-shape code algorithm that encodes text according to English pronunciation, and has been improved to Metaphone3 in recent years. It is an encoding algorithm targeting English pronunciation rules, and its main purpose is to encode texts with similar pronunciations into the same key value. The input text is encoded according to preset rules based on the pronunciation of English words, treating vowel letters and consonant letters differently. For example, the consonant skeleton of the text "volume up" is "vlmp", and thus the entire text is encoded as "FLMP".
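To make the consonant-skeleton idea concrete, the following Python sketch implements a toy encoder in the spirit of Metaphone; it is not the real Metaphone/Metaphone3 rule set, and its merge table (e.g., v mapping to F) covers only the substitution needed for the example above.

    # Toy consonant-skeleton encoder illustrating the Metaphone idea.
    # NOT the real Metaphone/Metaphone3 rules; the merge table is illustrative.
    VOWELS = set("aeiou")
    MERGE = {"v": "F", "f": "F"}  # letters with similar sounds share a code

    def toy_phonetic_code(text: str) -> str:
        code = []
        for ch in text.lower():
            if not ch.isalpha() or ch in VOWELS:
                continue  # drop vowels, spaces and punctuation
            code.append(MERGE.get(ch, ch.upper()))
        return "".join(code)

    print(toy_phonetic_code("volume up"))  # -> "FLMP"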
Although the Metaphone algorithm was developed for English, its basic idea rests on the pronunciation rules of a language, so the algorithm can be extended to other phonographic languages similar to English. For example, the Brazilian Portuguese (pt-br) version of the Metaphone algorithm has become a database solution for local city governments in Brazil, and Metaphone algorithms for French, Spanish, Russian, and the like have been open-sourced on GitHub (a hosting platform for open-source and proprietary software projects).
The method is mainly based on an entity error correction module in the multilingual voice assistant, and error correction processing is performed on entities identified by the semantic understanding model. As shown in the schematic diagram of the entity error correction module in fig. 2, the input of the entity error correction module is the parsing result of the multi-language semantic understanding model for the user request statement, and the parsing result includes an intention and an entity. The entity error correction module mainly corrects errors of input entities. The entity error correction module mainly comprises a sound-shape code recall sub-module and a knowledge map retrieval sub-module. Based on the entity error correction module shown in fig. 2, the flowchart of the entity error correction method for multi-language semantic understanding shown in fig. 3 is schematic, and the method includes the following steps:
step S101, acquiring an entity to be corrected, wherein the entity to be corrected is an entity obtained after semantic analysis processing is performed on a request statement input by a user.
The voice text is obtained by analyzing the voice signal input by the user. Specifically, the user inputs a voice signal within a distance range in which the terminal device receives the signal. The terminal device may collect a voice signal input by a user through a microphone and then recognize a voice text from the voice signal. The voice text can be recognized by the voice recognition server in the embodiment of the application. And performing semantic analysis processing on the voice text by a semantic server. It should be noted that the semantic server performs semantic analysis processing on the voice text, and may use the aforementioned multi-language semantic understanding model to analyze the voice text to obtain the intention and the entity. The specific process of semantic analysis processing on the speech text can adopt the existing technology, which is not described in detail in this application.
And S102, coding the entity to be corrected by using a sound-shape code algorithm.
This step mainly encodes the parsed entity, i.e., the entity to be corrected, using a sound-shape code algorithm. Since the input of the intelligent voice assistant is the user's voice, text errors can only arise from speech recognition, so the pronunciation of the text is an important basis for error correction. The present application encodes the entity to be corrected based on the existing Metaphone algorithm. As mentioned above, Metaphone is a sound-shape code algorithm for English, which uses English pronunciation rules to encode texts with similar pronunciations into the same sound-shape code. Although designed for English, the Metaphone algorithm can also be extended to similar phonographic languages (e.g., Spanish, Russian, etc.). Therefore, the entity to be corrected can be encoded with Metaphone algorithms adapted to different languages.
Besides the Metaphone algorithm, sound-shape code algorithms for other phonographic languages can also be developed based on their pronunciation rules, such as encoding based on pinyin rules in Chinese and on the fifty sounds in Japanese. Fig. 4 shows an example of Chinese sound-shape code encoding: the text to be encoded is first converted into the corresponding pinyin by the PyPinyin tool, and the initials and finals in the pinyin are encoded separately. The encoding process takes pronunciation similarity into account; for example, the initials n and l, or the finals "an" and "ang", can be encoded into the same code.
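A minimal Python sketch of this pinyin-based scheme follows; it assumes the third-party pypinyin package, and the initial/final merge tables (n with l, "an" with "ang") are illustrative only, not the patent's full table.

    # Sketch of a Chinese sound-shape encoder built on pinyin (assumes pypinyin).
    from pypinyin import lazy_pinyin

    INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
    INITIAL_MERGE = {"n": "L", "l": "L"}      # n / l are easily confused
    FINAL_MERGE = {"an": "AN", "ang": "AN"}   # an / ang share one code

    def encode_syllable(syllable: str) -> str:
        initial = next((i for i in INITIALS if syllable.startswith(i)), "")
        final = syllable[len(initial):]
        return (INITIAL_MERGE.get(initial, initial.upper())
                + FINAL_MERGE.get(final, final.upper()))

    def chinese_sound_shape_code(text: str) -> str:
        return "".join(encode_syllable(s) for s in lazy_pinyin(text))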
Step S103, searching candidate entities matched with the coded entity to be corrected in a sound-shape code database.
A large number of text entities are stored in the sound and shape code database in advance, and the entities in the sound and shape code database are generated through the same encoding process, so that the entities in the sound and shape code database are also in an encoding form. The step matches the coding form of the entity to be corrected with the coding form data in the sound-shape code database, so as to search the candidate entity matched with the entity to be corrected. It should be noted that, here, matching between the entity to be error corrected and the candidate entity may be that the encoding of the entity to be error corrected is the same as that of the candidate entity, or the encoding of the entity to be error corrected includes encoding of the candidate entity, and the application is not limited to the matching form.
And step S104, screening the candidate entities according to a knowledge graph to obtain a result entity, wherein the knowledge graph describes the association relationship among various entities, and the result entity is the candidate entity having the association relationship with other candidate entities.
In the step, in order to obtain a final error correction result, specifically, an entity to be error corrected and a candidate entity are put into a knowledge graph for query. The entities in the phono-configurational code database are substantially the same as the entities in the knowledge-graph. The knowledge graph refers to a knowledge base used for enhancing the function of a search engine, and aims to describe various entities or concepts existing in the real world and the relations of the entities or the concepts, and the entities or the concepts are represented by nodes, and the edges are formed by attributes or relations. All data in the knowledge graph can be expressed and stored by using RDF (Resource Description Framework).
For example, complex relationships between entities are represented in the form of triples such as [ entity, attribute, attribute value ] and [ entity 1, relation, entity 2 ]. For example, [ XX (movie name), lead actor, Wu ], [ Wu, place of birth, Beijing ], and the like. It should be noted that the knowledge graph in the embodiments of the present application may also be represented and stored in other forms, which are not specifically limited here.
The candidate entities are screened to obtain a result entity, and an effective intention is finally output according to the result entity. Specifically, the results output by knowledge-graph retrieval are packaged into a uniform format so that the downstream terminal can conveniently execute commands. The entities of some intentions also need to be converted into a uniform format; for example, certain fixed settings of a television need to be converted into the television's command language, and time formats in different languages are converted into a numeric format the terminal can parse, such as the English "2 hours" into { h: 2, m: 0, s: 0 }.
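As one hedged illustration of the time normalization mentioned above, the sketch below converts an English duration such as "2 hours" into the { h, m, s } numeric format; the function name and the exact output schema are assumptions for illustration.

    # Sketch of normalizing an English duration entity into { h, m, s }.
    import re

    def normalize_duration(text: str) -> dict:
        m = re.match(r"\s*(\d+)\s*(hours?|minutes?|seconds?)", text, re.IGNORECASE)
        if not m:
            raise ValueError(f"unrecognized duration: {text!r}")
        result = {"h": 0, "m": 0, "s": 0}
        result[m.group(2)[0].lower()] = int(m.group(1))  # 'h', 'm' or 's'
        return result

    print(normalize_duration("2hours"))  # -> {'h': 2, 'm': 0, 's': 0}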
The entity error correction method of steps S101 to S104 is explained by a specific example as shown in fig. 5:
Firstly, the entities to be corrected [ Title: Barnaby ] and [ Actor: David ] are input. The two entities to be corrected are encoded with the sound-shape code algorithm to obtain the codes [ Title: FTPRNPRJ ] and [ Actor: TFTKRLN ], respectively. Candidate entities are then searched in the sound-shape code database; specifically, entities with similar codes are recalled: [ Title: Barnaby, Bridge, Barrage ] and [ Actor: David ]. Three Title entities and one Actor entity are recalled at this time. All recalled entities are then screened according to the knowledge graph.
Specifically, the knowledge graph is a knowledge base describing various entities or concepts existing in the real world and their relationships. The movie starring the actor David is Barrage, represented in the knowledge graph as [ David, main character (lead actor), Barrage ]. Therefore, the result entities [ Title: Barrage ] and [ Actor: David ] are obtained at this time: the entity to be corrected [ Title: Barnaby ] is corrected to the result entity [ Title: Barrage ]. Finally, an effective intention is output according to the obtained result entity: search for the movie Barrage starring the actor David.
The entity error correction method for multi-language semantic understanding provides a unified framework for multi-language semantic understanding, can span the influence of different languages under the condition of lacking large-scale training data, and realizes entity error correction of texts in different languages, so that the accuracy of semantic understanding and entity recognition is improved, the performance of multi-language voice recognition products is improved, and the use experience of users is further improved.
In some embodiments, corresponding sound-shape code databases may be set up in advance for different languages. Before searching the sound-shape code database for candidate entities matching the entity to be corrected, the language type of the entity to be corrected is determined, and the candidate entities are then searched in the corresponding sound-shape code database. For example, for the entity "spider" to be corrected, the language type is first determined to be English, so candidate entities matching it can be searched in the English sound-shape code database. Distinguishing languages in this way improves lookup efficiency when recalling candidate entities from the sound-shape code database.
In some embodiments, corresponding sound-shape code databases may be set up in advance for different services. Before searching the sound-shape code database for candidate entities matching the entity to be corrected, the service type of the entity to be corrected is determined, and the candidate entities are then searched in the corresponding sound-shape code database. In specific service requirements, often only some types of entities need error correction, such as video names, person names, and channel names in a television voice service.
As shown in fig. 6, when constructing a sound-shape code database for an actual service, the collected entities of different types are encoded with the sound-shape code of a single language, the encoded result is used as the key, and the text of the original entity is stored in the sound-shape code database S as the value. By way of example, S_channel refers to the sound-shape code database associated with channels, S_title to that of video titles, and S_actor to that of person names.
And recalling candidate entities E with similar pronunciations for the entity text to be corrected as follows:
E_phonetic = S(F_phonetic(text))
For example, when the entity text to be corrected is Barnaby with entity type title, the key value encoded by the sound-shape code algorithm F is FTPRNPRJ, and it is then recalled in the database: S_title(FTPRNPRJ) recalls the three entities { Barnaby, Barrage, Bridge } stored in the knowledge base, and these three recalled entities serve as the candidate set for subsequent correction. Dividing the sound-shape code databases by service type in this way improves lookup efficiency when recalling candidate entities.
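A minimal sketch of the recall E_phonetic = S(F_phonetic(text)) follows, with a simplified stand-in for the encoder F; the sample entities and the dictionary layout are assumptions, not the production databases.

    # Sketch of per-type sound-shape databases and the recall step.
    from collections import defaultdict

    def F(text: str) -> str:
        # Simplified stand-in for the per-language sound-shape encoder F_phonetic.
        return "".join(c.upper() for c in text.lower()
                       if c.isalpha() and c not in "aeiou")

    def build_database(entities):
        S = defaultdict(list)          # key: sound-shape code, value: entity texts
        for text in entities:
            S[F(text)].append(text)
        return S

    S_title = build_database(["sofa", "Barrage", "Red Dog"])

    def recall(text, S):
        return S.get(F(text), [])      # E_phonetic = S(F_phonetic(text))

    print(recall("soufa", S_title))    # -> ['sofa'] (both encode to "SF")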
In practical applications of a multi-language recognition system, the number of recalled candidate entities may be too large or zero, and the recalled error correction candidates need to be post-processed to improve system quality. For a large multi-language recognition system, the number of collected candidate entities is on the order of tens of millions, so recalling candidates with pronunciations similar to the entity to be corrected from the sound-shape code database may return a huge number of candidates. Too many candidates burden the computation of subsequent error correction queries, wasting computing resources and time. The recalled candidate entities therefore require quantitative pruning.
Specifically, if the number of candidate entities recalled from the sound-shape code database is greater than a number threshold N, the candidates are pruned with a sorting method based on edit distance. The Minimum Edit Distance (MED) is the minimum number of edit operations required to transform one text into another and measures the difference between two texts. When the number of recalled candidate entities exceeds the preset threshold N, the edit distance between each candidate entity and the entity to be corrected is calculated, the candidates are sorted by edit distance, and the TopN candidates with the smallest edit distances are selected for subsequent correction.
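The following sketch shows this pruning step: a textbook minimum-edit-distance implementation plus a TopN cut. The tie-breaking behavior and the threshold value are not specified by the patent.

    # Sketch of TopN pruning by Minimum Edit Distance (MED).
    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def prune_candidates(candidates, target, n):
        # Keep the N candidates closest to the entity to be corrected.
        return sorted(candidates, key=lambda c: edit_distance(c, target))[:n]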
In some embodiments, if the number of candidate entities recalled from the sound-shape code database is zero, substrings are intercepted from the entity to be corrected. The substrings are then encoded with the sound-shape code algorithm, and candidate entities matching the encoded substrings are searched in the sound-shape code database.
For example, the multi-language semantic understanding model may identify the title entity in "search for film Red Dog" as "film Red Dog". Since the encoding of this entity to be corrected contains the encoding of "film", the number of candidate entities it recalls from the sound-shape code database is zero. At this time, the substring "Red Dog" of the entity to be corrected "film Red Dog" can be intercepted, encoded, and retrieved from the database S. The substring retrieval is implemented recursively: the longest substring that can be matched is found, and its candidates are taken as the final sound-shape code recall result.
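A sketch of this fallback, building on the recall() helper sketched earlier, could look as follows; trying word spans from longest to shortest is one way to realize "the longest substring which can be matched", though the patent does not fix the exact search order.

    # Sketch of the substring fallback: longest matching word span wins.
    def recall_longest_substring(text, S):
        words = text.split()
        for length in range(len(words), 0, -1):      # longest spans first
            for start in range(len(words) - length + 1):
                sub = " ".join(words[start:start + length])
                candidates = recall(sub, S)          # recall() as sketched above
                if candidates:
                    return sub, candidates
        return text, []

    print(recall_longest_substring("film Red Dog", S_title))
    # -> ('Red Dog', ['Red Dog'])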
The above embodiments all deal with the case where one user request statement contains only one language; in practical applications, a single user request may contain multiple languages. There are two main reasons for this. First, users of different languages have regional associations; for example, Chinese and Japanese users are geographically related, and Malay-speaking users often carry some Chinese in their requests because of the large Chinese-speaking population. Second, the wide use of English and the abundance of English media resources mean that many users issue service requests carrying English even when the voice assistant's language is not English.
For example, a French user may issue a request such as "rechercher le film spider" (where "spider" is English). When an existing speech recognition model (ASR) faces such mixed multi-language input, its accuracy drops, and the pronunciation of the non-native language tends to be misrecognized as text in the native language. For example, a Malay user's spoken request "buka sofa" may be recognized by the ASR model as the text "buka soufa". A single-language sound-shape code recall cannot handle such scenarios, and the multilingual voice assistant system has low accuracy in the face of these problems.
In order to solve the above problem, this embodiment improves the entity error correction method on the basis of the above embodiments. Specifically, when no valid entity (an entity capable of generating a valid intention) is recalled from the sound-shape code database after the entity to be corrected is encoded, a second recall is made using the sound-shape code algorithm of another candidate language. There are two ways of determining the candidate languages: determining the language types to which the text may belong based on the characters of the text, and inferring, from prior knowledge, other languages the user may use.
Illustratively, the multi-language sound-shape code recall flow is shown in fig. 7. When the French user request statement is "rechercher le film man", the named entity text identified is "man". The candidate set E of the entity text "man" after the French sound-shape code recall is [ mania ]. The recalled entity text "mania" cannot generate a valid intent, so a second recall is made using the sound-shape code algorithm F of another language.
First, based on the entity text characters, the language to which "man" may belong is determined to be English. Meanwhile, based on the language habits of French users, the neighboring languages of French are presumed from prior knowledge to be English, Spanish, Italian, Portuguese, and so on (Spanish, Italian, Portuguese, English, etc. are in the same language family as French or have regional relevance to it). Combining these two sources of information, the language used for the secondary recall is English. The candidate entity E obtained after the second recall with the English sound-shape code algorithm F_en (in fig. 7, F_fr denotes the French sound-shape code algorithm, S_fr the French sound-shape code database, and S_en the English sound-shape code database) is [ man ]. At this point a valid intent (e.g., search for the film "man") can be generated from the recalled entity text "man".
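Under these assumptions, the two-stage recall might be sketched as follows; the neighbor-language table encodes the prior knowledge described above and is illustrative only.

    # Sketch of secondary recall in candidate languages.
    NEIGHBOR_LANGUAGES = {"fr": ["en", "es", "it", "pt"]}  # assumed priors

    def multilingual_recall(text, user_lang, encoders, databases):
        # encoders[lang]: that language's sound-shape encoder F_lang
        # databases[lang]: the matching sound-shape database S_lang
        for lang in [user_lang] + NEIGHBOR_LANGUAGES.get(user_lang, []):
            candidates = databases[lang].get(encoders[lang](text), [])
            if candidates:
                return lang, candidates
        return user_lang, []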
The improvement in the above embodiment is essentially a way of combining sound-shape code algorithms of different languages; however, the ASR model's errors in the output text are diverse and uncertain when facing mixed multi-language requests, so the improvement above cannot cover all cases of ASR recognition errors.
To solve the above problems, a further improvement is to consider multi-language pronunciation when designing the single-language sound-shape code algorithm, i.e., to give texts of different languages with similar pronunciations similar encodings. For example, the code of the English "soufa" is similar to the code of the Chinese "sofa". When recalling candidate entities, similarity computation is used to recall candidates with similar pronunciations in other languages. Large quantities of text data with similar pronunciations across languages can also be collected to train a semantic understanding model, which may be a LaBSE model.
The network structure of the coding model can adopt a Transformer model, and after the text pairs with similar pronunciation under different languages are coded by the model in a contrast learning mode, the feature vectors of the text pairs with similar pronunciation under different languages are closer in the coding space, while the feature vectors of the text pairs with larger pronunciation difference under different languages are farther in the coding space. For example, a feature vector of a first entity in English and a feature vector of a second entity in Chinese exist in the coding space. If the pronunciations of the first entity and the second entity are similar, the distance between the feature vectors of the first entity and the second entity in the coding space is shorter. If the pronunciation difference between the first entity and the second entity is large, the distance between the feature vectors of the first entity and the second entity in the coding space is long.
As shown in fig. 8, an example deep learning network is provided, where an entity text "soufa" in english is similar in pronunciation to an entity text "sofa" in chinese, and when encoding is performed by respective encoding algorithms, encoding parameters can be shared, so that the obtained encoded feature vectors are closer in distance in an encoding space. In searching for a candidate entity, if the screened candidate entity "soufa" cannot output a valid intention, the candidate entity "soufa" may be directly replaced with a candidate entity "sofa" that is closer in distance. Therefore, the accuracy of semantic understanding and entity recognition can be further improved, the system computing time can be shortened, and the efficiency of semantic understanding is improved.
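A minimal sketch of this replacement step follows, assuming the shared encoder has already produced feature vectors for the query and the stored entities; cosine similarity is one reasonable distance, though the patent does not name the metric.

    # Sketch: replace an unusable candidate with the nearest entity in coding space.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    def nearest_entity(query_vec, entity_vecs):
        # entity_vecs: {entity_text: feature vector from the shared encoder}
        return max(entity_vecs, key=lambda e: cosine(query_vec, entity_vecs[e]))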
In some embodiments, in screening candidate entities according to the knowledge-graph, if there are pre-stored entities in the knowledge-graph that match the candidate entities, the pre-stored entities are determined to be result entities. If there is no pre-stored entity in the knowledge-graph that matches the candidate entity, but there is a pre-stored entity that matches the substrings in the candidate entity, then the pre-stored entity is determined to be the resulting entity. It should be noted that the matching between the pre-stored entity and the candidate entity may be completely the same, or may have an inclusion relationship, which is not limited in the present application.
Illustratively, if the candidate entity is "film man" and there is no pre-stored entity in the knowledge graph matching the whole candidate entity, but there is a pre-stored entity matching the substring "man", then the pre-stored entity "man" is determined to be the final result entity.
In some embodiments, the association relationships between candidate entities can be queried to eliminate unsuitable candidates. Illustratively, the English request "search for man by Tom" outputs two types of entities after passing through the semantic understanding model: the movie name "man" and the actor name "Tom". Entities of each type may recall a number of candidate movie names and actor names after sound-shape code encoding; for example, "man" may recall a number of movies related to "man" but not to the actor Tom. Candidate entities associated with both "man" and "Tom" are screened through knowledge-graph queries, i.e., only entities related to both "man" and "Tom" are retained.
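A sketch of this joint screening over an in-memory triple set follows; the triples and the relation name main_character are taken from the examples above and are illustrative.

    # Sketch of knowledge-graph screening: keep (title, actor) pairs linked by a triple.
    TRIPLES = {("David", "main_character", "Barrage"),
               ("Tom", "main_character", "man")}

    def screen(title_candidates, actor_candidates):
        return [(t, a) for t in title_candidates for a in actor_candidates
                if (a, "main_character", t) in TRIPLES]

    print(screen(["Barnaby", "Barrage", "Bridge"], ["David"]))
    # -> [('Barrage', 'David')]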
In some embodiments, multiple result entities may remain even after the screening processes in the above embodiments, which would generate multiple effective intentions and hinder responding to the user request. Therefore, TF-IDF (Term Frequency-Inverse Document Frequency) can be used to score and rank the result entities, and the highest-scoring one is selected as the final result entity. Other knowledge attributes of the entities may also be referenced, such as a movie's rating, year, and popularity, favoring entities with newer years, higher ratings, and higher popularity. In this way the user request can be responded to effectively, and the final response better matches the user's expectation.
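The attribute-based part of this ranking might be sketched as below; the weights and attribute names are assumptions, and the TF-IDF term-statistics component is omitted here.

    # Sketch of scoring result entities by knowledge attributes.
    def attribute_score(attrs: dict) -> float:
        # Illustrative weights favouring hot, well-rated, recent entities.
        return (0.5 * attrs.get("heat", 0.0)
                + 0.3 * attrs.get("rating", 0.0)
                + 0.2 * attrs.get("year", 0.0))

    def pick_best(result_entities: dict) -> str:
        # result_entities: {entity_text: attribute dict from the knowledge graph}
        return max(result_entities, key=lambda e: attribute_score(result_entities[e]))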
In some embodiments, an instruction set to be matched may be constructed for common voice assistant instructions; when the user's request meets a preset condition (for example, its length is less than a preset length), the original request text is directly encoded with the sound-shape code and matched against the instruction set. If a matching result exists, the pre-stored semantic result is sent directly to the downstream task. The instruction set to be matched constructed here is itself a sound-shape code database. It should be noted that the matching here may require the two codes to be identical or may allow an inclusion relationship between them; the application does not limit this.
Illustratively, in the embodiment shown in fig. 9, when the language is English and the user's request is "pos", the request should be "pause" according to the experience of the multilingual voice assistant, but the speech-to-text recognition is incorrect and recognizes it as "pos". At this moment, the multi-language semantic understanding model cannot effectively determine the intention of "pos" and cannot issue a command to the terminal. Encoding "pos" with the sound-shape code algorithm yields the code "PS", and searching the instruction set to be matched finds "pause", whose code is also "PS". Further screening in the knowledge graph would follow; however, in this embodiment, because the candidate entity is found in the instruction set to be matched, "pause" can be used directly as the result entity without knowledge-graph screening. The pause instruction associated with "pause" is then retrieved from the instruction set and issued to the terminal for execution. This further improves the accuracy of semantic understanding and entity recognition, shortens system computing time, and improves the efficiency of semantic understanding.
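A sketch of this fast path, reusing the simplified encoder F from the recall sketch, is given below; the instruction-set entries and the length condition are assumptions.

    # Sketch of the short-request fast path against the instruction set.
    INSTRUCTION_SET = {"PS": ("pause", {"intent": "player.pause"})}  # assumed entry

    def match_instruction(request: str, max_len: int = 8):
        if len(request) >= max_len:
            return None            # long requests go through the full pipeline
        return INSTRUCTION_SET.get(F(request))  # F() as sketched earlier

    print(match_instruction("pos"))  # -> ('pause', {'intent': 'player.pause'})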
An embodiment of the present application provides an intelligent device for error correction of an entity for multi-language semantic understanding, which is used to execute the embodiment corresponding to fig. 2, and as shown in fig. 10, the intelligent device provided by the present application at least includes:
an entity to be corrected obtaining unit U1001, configured to perform: acquiring an entity to be corrected, wherein the entity to be corrected is an entity obtained after semantic analysis processing is carried out on a request statement input by a user;
an encoding unit U1002 configured to perform: coding the entity to be corrected by using a sound-shape code algorithm;
a candidate entity search unit U1003 configured to perform: searching a candidate entity matched with the coded entity to be corrected in a sound-shape code database;
a screening unit U1004 for performing: and screening the candidate entities according to a knowledge graph to obtain a result entity, wherein the knowledge graph describes the association relationship between the candidate entities, and the result entity is the candidate entity having the association relationship with other candidate entities.
Based on the entity error correction method of the above embodiment, the present application also provides a multilingual voice assistant for multilingual semantic understanding, such as the multilingual voice assistant application flowchart shown in fig. 11, where the content in the dashed box is the entity error correction method provided by the present application. The voice-to-text model receives the voice requested by the user, converts the voice requested by the user into text, inputs the multi-language semantic understanding model and outputs intention, wherein the intention comprises an entity.
If the output intention requires entity error correction, the entity is encoded with the sound-shape code algorithm and candidate entities are recalled from the sound-shape code database. The knowledge graph is then queried to obtain the result entity. Finally, the knowledge-graph retrieval result is output, packaged into a uniform format so that the downstream terminal can conveniently execute the command.
And if the output intention does not need to carry out entity error correction, carrying out sound-shape code matching on the entity text and the text in the instruction set to be matched. And if the entity text is matched with the text sound-shape codes in the instruction set to be matched, packaging the text in the instruction set to be matched into a uniform format and then outputting the uniform format. And if the entity text is not matched with the text sound-shape codes in the instruction set to be matched, packaging the entity text into a uniform format and outputting the uniform format.
What has been described above includes examples of implementations of the invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but it is to be appreciated that many further combinations and permutations of the subject innovation are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Moreover, the foregoing description of illustrated implementations of the present application, including what is described in the "abstract," is not intended to be exhaustive or to limit the disclosed implementations to the precise forms disclosed. While specific implementations and examples are described herein for illustrative purposes, various modifications are possible which are considered within the scope of such implementations and examples, as those skilled in the relevant art will recognize.
Moreover, the word "exemplary" or "exemplary" is used herein to mean "serving as an example, instance, or illustration". Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "exemplary" or "exemplary" is intended to present concepts in a concrete fashion.

Claims (10)

1. An entity error correction method for multi-language semantic understanding, comprising:
acquiring an entity to be corrected, wherein the entity to be corrected is an entity obtained by performing semantic analysis processing on a request statement input by a user;
coding the entity to be corrected by using a sound-shape code algorithm;
searching a candidate entity matched with the coded entity to be corrected in a sound-shape code database;
and screening the candidate entities according to a knowledge graph to obtain a result entity, wherein the knowledge graph describes the association relationship between the candidate entities, and the result entity is the candidate entity having the association relationship with other candidate entities.
2. The entity error correction method for multilingual semantic understanding according to claim 1, wherein the candidate entities matching the encoded entity to be corrected are searched in the sound-shape code database through the following specific steps:
acquiring the language type of the entity to be corrected;
calling the sound and shape code database according to the language type of the entity to be corrected;
and searching the candidate entity matched with the coded entity to be corrected in the sound-shape code database.
3. The entity error correction method for multi-language semantic understanding according to claim 1, wherein searching the sound-shape code database for the candidate entity matching the encoded entity to be corrected comprises:
acquiring the service type of the entity to be corrected;
invoking the sound-shape code database corresponding to the service type of the entity to be corrected;
and searching the invoked sound-shape code database for the candidate entity matching the encoded entity to be corrected.
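As a non-authoritative sketch of the routing described in claims 2 and 3, a lookup keyed jointly by language type and service type; the keys and tables are invented for the example.

# Select the sound-shape code database by language type and/or service type
# before the recall. All keys and tables here are hypothetical.
DATABASES: dict[tuple[str, str], dict[str, list[str]]] = {
    ("en", "media"):  {"interstellar": ["Interstellar"]},
    ("en", "system"): {"volumeup": ["volume up"]},
    ("zh", "media"):  {},  # e.g. a Chinese-language media table
}

def recall(code: str, language: str, service: str) -> list[str]:
    table = DATABASES.get((language, service), {})
    return table.get(code, [])

print(recall("interstellar", "en", "media"))   # -> ['Interstellar']
print(recall("interstellar", "zh", "media"))   # -> []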
4. The entity error correction method for multi-language semantic understanding according to claim 1, wherein searching the sound-shape code database for the candidate entity matching the encoded entity to be corrected comprises:
when the number of candidate entities recalled from the sound-shape code database according to the encoded entity to be corrected is zero, intercepting sub-strings from the entity to be corrected;
and encoding the sub-strings using the sound-shape code algorithm, and searching the sound-shape code database for candidate entities matching the encoded sub-strings.
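A sketch of the sub-string fallback in claim 4, under the assumption that sub-strings are taken at word boundaries and tried longest-first; the claim itself does not fix a windowing strategy.

# When zero candidates are recalled for the full entity, encode sub-strings
# of it and retry. The windowing strategy below is an assumption.
def recall_with_substrings(entity: str, db: dict[str, list[str]]) -> list[str]:
    def code(s: str) -> str:
        return "".join(ch for ch in s.casefold() if ch.isalnum())
    hits = db.get(code(entity), [])
    if hits:
        return hits
    words = entity.split()
    # Try progressively shorter word-level sub-strings, longest first.
    for size in range(len(words) - 1, 0, -1):
        for start in range(len(words) - size + 1):
            sub = " ".join(words[start:start + size])
            hits = db.get(code(sub), [])
            if hits:
                return hits
    return []

db = {"interstellar": ["Interstellar"]}
print(recall_with_substrings("play interstellar please", db))  # -> ['Interstellar']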
5. The entity error correction method for multi-language semantic understanding according to claim 1, wherein searching the sound-shape code database for the candidate entity matching the encoded entity to be corrected comprises:
when the number of candidate entities recalled from the sound-shape code database according to the encoded entity to be corrected is greater than a number threshold N, calculating the edit distance between each candidate entity and the entity to be corrected;
and sorting all the candidate entities by edit distance in ascending order, and determining the first N candidate entities as the final candidate entities.
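A sketch of the top-N filtering in claim 5, using a plain Levenshtein edit distance; the threshold N = 2 is arbitrary here.

# Rank recalled candidates by edit distance to the entity to be corrected and
# keep the N closest. Pure-Python Levenshtein distance.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def top_n(entity: str, candidates: list[str], n: int = 2) -> list[str]:
    return sorted(candidates, key=lambda c: edit_distance(entity, c))[:n]

print(top_n("intersteller", ["Interstellar", "Interstate", "Inside Out"]))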
6. The entity error correction method for multi-language semantic understanding according to claim 1, wherein an encoded entity has a feature vector, and the distance in the encoding space between the feature vector of a first entity and the feature vector of a second entity matches the degree of pronunciation similarity between the first entity and the second entity, the language type of the first entity being different from the language type of the second entity.
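To make claim 6 concrete, a toy example in which fabricated 3-dimensional feature vectors stand in for the learned encoding, so that a title and its Chinese transliteration lie close together while an unrelated title lies far away.

# Fabricated vectors purely for illustration: distance in encoding space
# tracks pronunciation similarity across languages.
import math

def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

VECTORS = {
    "Titanic":  [0.90, 0.10, 0.40],   # English title
    "泰坦尼克":  [0.88, 0.12, 0.41],   # Chinese transliteration, near-same sound
    "Avatar":   [0.20, 0.70, 0.90],   # unrelated pronunciation
}

print(euclidean(VECTORS["Titanic"], VECTORS["泰坦尼克"]))  # small distance
print(euclidean(VECTORS["Titanic"], VECTORS["Avatar"]))    # large distance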
7. The entity error correction method for multi-language semantic understanding according to claim 1, wherein screening the candidate entities according to the knowledge graph to obtain the result entity comprises:
when a pre-stored entity in the knowledge graph matches the candidate entity, determining the pre-stored entity as the result entity;
and when no pre-stored entity in the knowledge graph matches the candidate entity but a pre-stored entity in the knowledge graph matches a sub-string of the candidate entity, determining that pre-stored entity as the result entity.
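A sketch of the matching order in claim 7: an exact pre-stored match is preferred, and a pre-stored entity matching a sub-string of the candidate is accepted as a fallback; the entity set is hypothetical.

# Prefer an exact pre-stored match; otherwise accept a pre-stored entity that
# occurs as a sub-string of the candidate.
from typing import Optional

def resolve(candidate: str, prestored: set) -> Optional[str]:
    if candidate in prestored:
        return candidate
    for entity in prestored:
        if entity and entity in candidate:   # pre-stored entity as sub-string
            return entity
    return None

kg_entities = {"Interstellar", "Inception"}
print(resolve("Interstellar", kg_entities))          # exact match
print(resolve("Interstellar 2014 HD", kg_entities))  # sub-string match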
8. The entity error correction method for multi-language semantic understanding according to claim 1, wherein the sound-shape code database is a preset common instruction set, and the candidate entity is an entity used to generate a common instruction.
9. An intelligent device for entity error correction for multi-language semantic understanding, comprising:
an entity-to-be-corrected acquiring unit configured to perform: acquiring an entity to be corrected, wherein the entity to be corrected is obtained by performing semantic analysis on a request sentence input by a user;
an encoding unit configured to perform: encoding the entity to be corrected using a sound-shape code algorithm;
a candidate entity searching unit configured to perform: searching a sound-shape code database for a candidate entity matching the encoded entity to be corrected;
and a screening unit configured to perform: screening the candidate entities according to a knowledge graph to obtain a result entity, wherein the knowledge graph describes association relationships among the candidate entities, and the result entity is a candidate entity having an association relationship with other candidate entities.
10. The intelligent device for entity error correction for multi-language semantic understanding according to claim 9, wherein searching the sound-shape code database for the candidate entity matching the encoded entity to be corrected comprises:
acquiring the language type of the entity to be corrected;
invoking the sound-shape code database corresponding to the language type of the entity to be corrected;
and searching the invoked sound-shape code database for the candidate entity matching the encoded entity to be corrected.
CN202210394592.4A 2022-04-14 2022-04-14 Entity error correction method and intelligent device for multi-language semantic understanding Pending CN114817465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394592.4A CN114817465A (en) 2022-04-14 2022-04-14 Entity error correction method and intelligent device for multi-language semantic understanding

Publications (1)

Publication Number Publication Date
CN114817465A 2022-07-29

Family

ID=82535775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394592.4A Pending CN114817465A (en) 2022-04-14 2022-04-14 Entity error correction method and intelligent device for multi-language semantic understanding

Country Status (1)

Country Link
CN (1) CN114817465A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291571A (en) * 2020-01-17 2020-06-16 华为技术有限公司 Semantic error correction method, electronic device and storage medium
CN114155846A (en) * 2020-08-18 2022-03-08 海信视像科技股份有限公司 Semantic slot extraction method and display device
CN112861844A (en) * 2021-03-30 2021-05-28 中国工商银行股份有限公司 Service data processing method and device and server
CN113343026A (en) * 2021-06-17 2021-09-03 中国科学技术大学 Method for generating summary of content of online video course
CN113591457A (en) * 2021-07-30 2021-11-02 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN114036930A (en) * 2021-10-28 2022-02-11 北京明略昭辉科技有限公司 Text error correction method, device, equipment and computer readable medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021312A1 (en) * 2022-07-28 2024-02-01 华南理工大学 Automatic programming method based on human-computer interaction
CN116151226A (en) * 2022-12-19 2023-05-23 四川师范大学 A machine learning-based error correction method, device and medium for deaf-mute sign language
CN116151226B (en) * 2022-12-19 2024-02-23 四川师范大学 Machine learning-based deaf-mute sign language error correction method, equipment and medium
CN116341543A (en) * 2023-05-31 2023-06-27 安徽商信政通信息技术股份有限公司 Method, system, equipment and storage medium for identifying and correcting personal names
CN116341543B (en) * 2023-05-31 2023-09-19 安徽商信政通信息技术股份有限公司 Method, system, equipment and storage medium for identifying and correcting personal names
CN117057343A (en) * 2023-10-10 2023-11-14 腾讯科技(深圳)有限公司 Road event identification method, device, equipment and storage medium
CN117057343B (en) * 2023-10-10 2023-12-12 腾讯科技(深圳)有限公司 Road event identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107315737B (en) Semantic logic processing method and system
WO2018157703A1 (en) Natural language semantic extraction method and device, and computer storage medium
KR102565274B1 (en) Automatic interpretation method and apparatus, and machine translation method and apparatus
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
US8165877B2 (en) Confidence measure generation for speech related searching
CN114547329A (en) Method for establishing pre-training language model, semantic analysis method and device
CN106570180B (en) Voice search method and device based on artificial intelligence
CN114817465A (en) Entity error correction method and intelligent device for multi-language semantic understanding
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
CN114580382A (en) Text error correction method and device
JP5167546B2 (en) Sentence search method, sentence search device, computer program, recording medium, and document storage device
CN104166462A (en) Input method and system for characters
CN103678684A (en) Chinese word segmentation method based on navigation information retrieval
KR102267561B1 (en) Apparatus and method for comprehending speech
CN107148624A (en) Method of preprocessing text and preprocessing system for performing the method
US20230004830A1 (en) AI-Based Cognitive Cloud Service
US20050125224A1 (en) Method and apparatus for fusion of recognition results from multiple types of data sources
CN109800430B (en) Semantic understanding method and system
CN118035473A (en) Multi-mode file retrieval method, system and medium based on large language model
US11984113B2 (en) Method and server for training a neural network to generate a textual output sequence
Dinarelli Spoken language understanding: from spoken utterances to semantic structures
CN111090720B (en) Hot word adding method and device
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
US11900072B1 (en) Quick lookup for speech translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination