CN109272995A - Audio recognition method, device and electronic equipment - Google Patents
- Publication number: CN109272995A
- Application number: CN201811126924.0A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
Abstract
Embodiments of the invention disclose a speech recognition method, a speech recognition apparatus, and an electronic device. The method includes: acquiring collected voice information input by a user; and recognizing the voice information according to at least one language model matched with the user to obtain a speech recognition result. Because the voice information is recognized by a designated language model matched with the user, rather than by an undifferentiated general-purpose model, the accuracy of recognition is improved, the recognition result can meet the individual needs of the user, and recognition efficiency is increased. This solves the technical problem in the related art that an undifferentiated general language model fails to recognize, or wrongly recognizes, certain input, and improves the user experience.
Description
Technical Field
The embodiment of the invention relates to the technical field of speech recognition, and in particular to a speech recognition method, a speech recognition apparatus, and an electronic device.
Background
With the development of speech recognition technology, voice wakeup is applied in many fields, such as robots, mobile terminals, wearable devices, smart home devices, and vehicle-mounted devices. However, in the related art, speech recognition technology can only recognize common words; highly specialized or rarely used words are not recognized, or are recognized incorrectly.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speech recognition method, a speech recognition apparatus, and an electronic device, which can solve the above technical problem.
In order to solve the above problems, embodiments of the present invention mainly provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a speech recognition method, where the method includes:
acquiring collected voice information input by a user;
and recognizing the voice information according to at least one language model matched with the user to obtain a voice recognition result.
In a second aspect, an embodiment of the present invention further provides a speech recognition apparatus, where the apparatus includes:
the voice acquisition module is used for acquiring the collected voice information input by the user;
and the voice recognition module is used for recognizing the voice information according to at least one language model matched with the user to obtain a voice recognition result.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor;
and at least one memory, bus connected with the processor; wherein,
the processor and the memory complete mutual communication through the bus;
the processor is used to call program instructions in the memory to perform the speech recognition method.
In a fourth aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform a speech recognition method.
With the above technical solutions, the solutions provided by the embodiments of the invention have at least the following advantages. The voice information is recognized using at least one language model matched with the user, so recognition is performed by a designated language model rather than an undifferentiated general-purpose one. This improves the accuracy of recognizing the voice information, ensures that the recognition result meets the individual needs of the user, and increases recognition accuracy and efficiency. It thereby solves the technical problem in the related art that recognition with an undifferentiated general language model fails or produces wrong results, and improves the user experience.
The foregoing is only an overview of the technical solutions of the embodiments of the invention. To make these technical means clearer, so that the embodiments can be implemented according to this description, and to make the above and other objects, features, and advantages more readily understandable, a detailed description of the embodiments is provided below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flow chart illustrating a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a process for determining a language model matching a user according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another speech recognition apparatus provided in an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
As shown in fig. 1, the present invention provides a speech recognition method, comprising the steps of:
and step S101, acquiring the collected voice information input by the user.
The method provided by the embodiment of the invention is executed by an electronic device; when the electronic device is in a working state, it acquires the collected voice information input by the user.
Specifically, the electronic device may include a terminal device or a cloud server.
In a specific application, the voice information input by the user may be collected by the terminal device itself, or by a voice collection device connected to the terminal device, such as a microphone. The connection between the voice collection device and the terminal device may be a data cable, or a wireless connection such as Bluetooth. The collected voice information is transmitted to the terminal device, so that the terminal device or the cloud server obtains the voice information input by the user.
And S102, recognizing the voice information according to at least one language model matched with the user to obtain a voice recognition result.
Specifically, the language model matched with the user may be a specific language model capable of recognizing the domain-specific terms of the field to which the user belongs, such as a language model for computer-domain terms, a language model for legal-domain terms, or a language model for financial-domain terms. Performing speech recognition with a specific language model matched with the user can increase both the speed and the accuracy of recognition.
Specifically, there may be one or more language models that match the user.
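The recognition step above can be sketched as follows. This is an illustrative sketch only: the patent does not disclose an implementation, and a real recognizer operates on audio with an acoustic model and a decoder, not on text. `DomainLanguageModel`, its `score` method, and the toy vocabularies are hypothetical names invented for this example; the sketch only shows how a result can be chosen when several user-matched models are available.

```python
# Illustrative sketch only: a real recognizer works on audio via an acoustic
# model and decoder; here plain text stands in for the decoded hypothesis.
class DomainLanguageModel:
    def __init__(self, domain, vocabulary):
        self.domain = domain
        self.vocabulary = set(vocabulary)

    def score(self, text):
        # Toy confidence: fraction of words covered by this domain's vocabulary.
        words = text.split()
        if not words:
            return 0.0
        return sum(w in self.vocabulary for w in words) / len(words)

def recognize(text, matched_models):
    """Return the domain and score of the user-matched model that fits best."""
    best = max(matched_models, key=lambda m: m.score(text))
    return best.domain, best.score(text)

legal = DomainLanguageModel("legal", ["tort", "plaintiff", "statute"])
computer = DomainLanguageModel("computer", ["compiler", "kernel", "thread"])

domain, confidence = recognize("the plaintiff cited a statute", [legal, computer])
```

Here the toy "confidence" is just vocabulary coverage; in practice each matched model would rescore decoder hypotheses and the highest-scoring hypothesis would be kept.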
The speech recognition method provided by the embodiment of the invention recognizes the speech information using at least one language model matched with the user. Because such a model can specifically recognize the professional terms of the field to which the user belongs, recognition is performed by a designated, user-matched language model. This improves recognition accuracy, ensures that the recognition result meets the user's intent, and increases recognition speed. It solves the technical problem in the related art that recognition with an undifferentiated general language model yields results that do not meet the user's needs, or fails entirely, and improves the user experience.
In practical application, the electronic device applying the voice recognition method provided by the invention can be a terminal device and can also be a cloud server.
When the electronic device is a cloud server, the voice information input by the user is collected by the terminal device, or by an audio collection device connected to it, and sent to the cloud server. The cloud server recognizes the voice information according to at least one language model matched with the user and feeds the recognition result back to the terminal device, which processes the result to complete the recognition of the voice information.
When the electronic device is a terminal device (such as a mobile phone, a smart speaker, a tablet computer, a notebook computer, a PC, or a wearable device), the terminal device, for example a mobile phone in a powered-on state, acquires the voice information input by the user and collected by its microphone, and recognizes the voice information according to at least one language model matched with the user to obtain a speech recognition result.
In some examples, prior to recognizing the speech information according to at least one language model that matches the user, the method further comprises: at least one language model matching the user is determined.
For the embodiment of the invention, after at least one language model matched with the user is determined, the determined model or models are used to recognize the voice information. The language model used for recognition is thus designated in advance, which improves the accuracy of recognizing the voice information.
Specifically, in some embodiments, one implementation of determining at least one language model matching the user may include step S201 (not shown) or step S202.
Step S201, based on the received selection instruction of the user for the language model, at least one language model matched with the user is determined.
The embodiment of the invention determines the language model matched with the user from the received selection instruction of the user for the language model. Specifically, the user's selection instruction may select one language model or several language models.
Specifically, the selection instruction may be issued by the user through a human-computer interaction interface provided by the terminal device. For example, the terminal device presents on the interface identifiers for the language models of several domain-specific vocabularies, such as legal, medical, computer, and financial. Suppose the identifiers selected by the user are legal, computer, and medical; the interface generates the corresponding selection instruction, and the terminal device determines the matched language models according to the received instruction. The finally determined models are the language model of legal-field terms, the language model of computer-field terms, and the language model of medical-field terms.
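Step S201 can be sketched as a mapping from the identifiers selected on the interface to language models. The registry contents and function names below are invented for illustration; the patent does not specify how the selection instruction is represented.

```python
# Hypothetical registry of domain language models; names invented for illustration.
MODEL_REGISTRY = {
    "legal": "language model of legal-field terms",
    "medical": "language model of medical-field terms",
    "computer": "language model of computer-field terms",
    "financial": "language model of financial-field terms",
}

def handle_selection_instruction(selected_identifiers):
    """Map the identifiers in a selection instruction to matched language models."""
    unknown = [i for i in selected_identifiers if i not in MODEL_REGISTRY]
    if unknown:
        raise ValueError(f"no language model registered for: {unknown}")
    return [MODEL_REGISTRY[i] for i in selected_identifiers]

matched = handle_selection_instruction(["legal", "computer", "medical"])
```

Rejecting unknown identifiers is a design choice of this sketch; an interface that only offers registered identifiers would never hit that branch.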
The following description takes as an example the case in which the electronic device that determines the at least one language model matched with the user is the terminal device.
In practical applications, the terminal device may be a mobile phone, a tablet, a notebook computer, a wearable device, or another intelligent device such as a smart watch or a smart speaker. The language models may all be stored locally on the terminal device, or stored in the cloud. Accordingly, there are two ways to determine a language model matched with the user: in one, based on the received selection instruction of the user for the language model, the selected language models stored in the cloud are downloaded to the terminal device and used as the matched models to recognize the voice information; in the other, the matched language models are determined, based on the received selection instruction, among the language models already stored locally on the terminal device.
As shown in fig. 2, step S202 includes steps S2021 and S2022.
Step S2021, obtain the user's history input record and/or user attributes.
In the embodiment of the invention, the history input record of the user is the content the user has input over some past period. That content may come from text input or voice input, and in practice may have been entered in a search engine, in an input method, or with another software tool. For example, in the last month the content entered by the user in a given input method (such as the Sogou input method or the Baidu input method) mainly concerned editing code and writing novels and papers.
In the embodiment of the invention, the user attributes may include the user's occupation, specialty, interests, and the like, where the user's interests may be selected by the user from preset interest categories. In particular, user interests may include stars, sports, movies, prose, law, technology, and so on.
Step S2022, determining at least one field corresponding to the user according to the historical input record and/or the user attribute, and taking the language model corresponding to the at least one field as the language model matched with the user.
The embodiment of the invention determines at least one field corresponding to the user from the user's history input record and/or user attributes, then determines the language model or models corresponding to the determined field or fields, thereby determining at least one language model matched with the user.
In a specific application, only the history input record of the user, only the user attributes, or both may be acquired in order to determine the language model matched with the user.
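Steps S2021 and S2022 can be sketched as follows, assuming a simple keyword-counting heuristic. The keyword lists, the attribute-to-domain mapping, and the frequency threshold are all invented for illustration; the patent does not specify how fields are inferred from the history input record or user attributes.

```python
from collections import Counter

# Invented keyword lists and attribute mapping; the patent does not specify them.
DOMAIN_KEYWORDS = {
    "computer": {"code", "compiler", "debug"},
    "literature": {"novel", "chapter", "prose"},
    "legal": {"contract", "statute", "lawsuit"},
}
ATTRIBUTE_DOMAINS = {"programmer": "computer", "lawyer": "legal", "writer": "literature"}

def domains_for_user(history_words=(), user_attributes=(), min_hits=2):
    """Sketch of step S2022: infer fields from history and/or attributes."""
    hits = Counter()
    for word in history_words:
        for domain, keywords in DOMAIN_KEYWORDS.items():
            if word in keywords:
                hits[domain] += 1
    # A field counts if its keywords appear often enough in the history,
    # or if a user attribute maps to it directly.
    domains = {d for d, n in hits.items() if n >= min_hits}
    domains |= {ATTRIBUTE_DOMAINS[a] for a in user_attributes if a in ATTRIBUTE_DOMAINS}
    return sorted(domains)

history = ["debug", "code", "novel", "compiler", "chapter"]
fields = domains_for_user(history_words=history, user_attributes=["lawyer"])
```

The language models corresponding to the returned fields would then be taken as the models matched with the user, as step S2022 describes.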
In practical application, the step S201, the step S202, or both the step S201 and the step S202 may be referred to for determining at least one language model matching the user.
In some examples, after determining at least one language model matching the user with reference to step S201, step S202, or both step S201 and step S202, the method further includes: and updating the language model matched with the user based on the user-defined word stock.
According to the embodiment of the invention, the at least one language model matched with the user is updated with the user-defined word bank, so that during speech recognition the updated language model can quickly produce results according to the user-defined entries.
Specifically, the user-defined word bank may be a user-defined speech word bank, which maps customized voice inputs to results, or a user-defined vocabulary bank, which lists custom words.
For example, suppose the user-defined word bank is a user-defined speech word bank in which the customized voice "A" corresponds to the result "new message", the customized voice "B" corresponds to the result "DNA", and the customized voice "C" corresponds to the result "science and technology create destiny". After the language model matched with the user has been updated with this word bank, when the updated model detects that the voice input by the user is "A", it can directly output "new message" without further recognition, which speeds up recognition.
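The lookup-before-recognition behavior described above can be sketched as follows. The bank entries mirror the example in the text; the fallback recognizer is a hypothetical stand-in for full recognition with the matched language models.

```python
# Entries mirror the example above ("A" -> "new message", etc.).
CUSTOM_SPEECH_BANK = {
    "A": "new message",
    "B": "DNA",
    "C": "science and technology create destiny",
}

def recognize_with_custom_bank(voice_input, fallback_recognizer):
    """Output a custom-bank hit directly; otherwise run full recognition."""
    if voice_input in CUSTOM_SPEECH_BANK:
        return CUSTOM_SPEECH_BANK[voice_input]
    return fallback_recognizer(voice_input)

hit = recognize_with_custom_bank("A", fallback_recognizer=lambda v: f"<recognized {v}>")
miss = recognize_with_custom_bank("hello", fallback_recognizer=lambda v: f"<recognized {v}>")
```

A bank hit skips the fallback entirely, which is the efficiency gain the text claims for customized inputs.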
For example, when the user-defined word bank is a vocabulary bank, suppose the language model matched with the user is the language model of legal-field terms, and the user's vocabulary bank contains the commonly used medical term "Alzheimer's disease" and a commonly used financial term. These terms are added to the legal-field language model, so that in later speech recognition the updated model can quickly recognize these terms from fields outside law while still recognizing legal-field terms.
Specifically, in some embodiments, another implementation of determining at least one language model that matches a user includes: step S301, step S302, and step S303.
In step S301 (not shown), a user-defined word library is obtained.
Step S302 (not shown in the figure), generating a personalized language model of the user based on the user-defined word stock;
step S303 (not shown), determining the personalized language model of the user as the language model matching the user.
According to the embodiment of the invention, the personalized language model suitable for the user is generated according to the user-defined word stock, and the personalized language model is determined as the language model matched with the user.
In some embodiments, after determining at least one language model matching the user with reference to step S201, step S202, or both step S201 and step S202, the personalized language model may also be determined as the language model matching the user with reference to steps S301 to S303 in this embodiment.
For example, after the language model of legal-field terms, the language model of computer-field terms, and the language model of financial-field terms are determined with reference to step S201, step S202, or both, the personalized language model is simultaneously determined as a language model matched with the user. Then, when speech recognition is performed with the language models of the different fields' professional terms, the personalized language model can also be used to recognize quickly and obtain a recognition result. In practical application, when the determined language models matched with the user include both domain language models and a personalized language model, the personalized language model may be given a higher priority than the other matched models, so that it is used preferentially during the recognition process.
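The priority rule described above can be sketched as follows. All function and variable names are invented for illustration: the personalized model is tried first, and the domain language models are consulted only if it returns no result.

```python
def recognize_with_priority(voice_input, personalized_model, domain_models):
    """Try the higher-priority personalized model first, then the domain models."""
    result = personalized_model(voice_input)
    if result is not None:
        return result, "personalized"
    for name, model in domain_models:
        result = model(voice_input)
        if result is not None:
            return result, name
    return None, None

# Stand-in models: the personalized model only knows the custom input "A".
personalized = lambda v: "custom hit" if v == "A" else None
legal_model = lambda v: f"legal:{v}"

text, source = recognize_with_priority("A", personalized, [("legal", legal_model)])
other, other_source = recognize_with_priority("statute", personalized, [("legal", legal_model)])
```

Ordering the models this way realizes the stated priority without any explicit priority numbers; a weighted combination of model scores would be an alternative design.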
The following describes embodiments of the present invention by taking an example of applying the method provided by the embodiments of the present invention to a search engine.
When the search engine runs in the background or foreground, the mobile phone acquires the voice information input by the user and recognizes it with at least one locally downloaded language model matched with the user; the recognition result is output to the input window of the mobile phone's search engine, and the search engine is controlled to search with the result and display the search results. Alternatively, the mobile phone acquires the voice information input by the user and sends it to a server, where at least one pre-stored language model matched with the user recognizes the voice information to obtain a recognition result; the server then searches based on the recognition result and feeds the corresponding search results back to the mobile phone.
In order to further explain the speech recognition method provided by the present invention, the following describes an embodiment of the present invention by taking an example of applying the method provided by the embodiment of the present invention to an input method application program.
When a user downloads and installs an input method, the input method application may present the language models of word banks for different fields so that the user can select among them; the input method then downloads the corresponding language models from the server according to the user's selection and stores them locally as the language models matched with the user. Alternatively, the input method application downloads all language models locally during installation and, during installation, presents the identifiers of the different fields' language models for the user to select, so that the language models corresponding to the user's selection are recognized preferentially in later speech recognition. After a recognition result is obtained, it is displayed on the editing interface to complete the input.
Example two
As shown in fig. 3, which is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention, the speech recognition apparatus 30 according to an embodiment of the present invention may include: a voice acquisition module 301 and a voice recognition module 302.
The voice acquiring module 301 is configured to acquire acquired voice information input by a user;
the speech recognition module 302 is configured to recognize the speech information according to at least one language model matched with the user, so as to obtain a speech recognition result.
The speech recognition apparatus provided by the embodiment of the invention recognizes voice information using at least one language model matched with the user. Recognizing the voice information through a designated, user-matched language model improves recognition accuracy, ensures that the recognition result meets the personalized needs of the user, and increases recognition efficiency. It solves the technical problem that the undifferentiated general language model used in the related art fails to recognize, or wrongly recognizes, certain input, and improves the user experience.
The speech recognition apparatus of this embodiment can execute the speech recognition method provided in the first embodiment of the present invention, and the implementation principles thereof are similar, and are not described herein again.
Further, as shown in fig. 4, the apparatus 30 further includes: a first model determining module 303, a custom thesaurus obtaining module 304, a personalized model generating module 305 and a second model determining module 306.
The first model determining module 303 is configured to determine at least one language model matching the user before recognizing the speech information according to the at least one language model matching the user;
a custom thesaurus obtaining module 304, configured to obtain a user-defined thesaurus;
a personalized model generation module 305 for generating a personalized language model of the user based on the user-defined word stock,
the second model determining module 306 is configured to determine the personalized language model of the user as a language model matching the user before recognizing the speech information according to at least one language model matching the user.
In some embodiments, the first model determining module 303 comprises: a first determining unit 3031 (not shown in the figure), wherein the first determining unit 3031 is configured to determine at least one language model matching the user based on a received instruction of the user for selecting the language model.
In some embodiments, the first model determining module 303 comprises: a user data unit 3032 (not shown in the figure) and a second determination unit 3033 (not shown in the figure), wherein,
a user data unit 3032, configured to obtain a history input record and/or user attributes of the user;
a second determining unit 3033, configured to determine at least one domain corresponding to the user according to the history input record and/or the user attribute, and use a language model corresponding to the at least one domain as a language model matched with the user.
Further, as shown in fig. 4, the apparatus 30 further includes: model re-determination module 307.
The model re-determination module 307 is configured to update the language model matched with the user based on the user-defined word bank.
The speech recognition apparatus provided by this embodiment can update the language model matched with the user based on the user-defined word bank, so that speech information is recognized according to the user-defined word bank and recognition efficiency is improved.
In a specific application, the speech recognition apparatus provided by the embodiment of the invention may omit the custom thesaurus obtaining module and the personalized model generating module, and still update the language model matched with the user by obtaining the user-defined word bank.
The speech recognition device of the embodiment of the present invention can execute the speech recognition method provided in the first embodiment, which is similar to the first embodiment in the implementation principle, and is not described herein again.
EXAMPLE III
An embodiment of the present invention provides an electronic device, as shown in fig. 5, an electronic device 600 shown in fig. 5 includes: a processor 6001 and a memory 6003. Processor 6001 and memory 6003 are coupled, such as via bus 6002. Further, the electronic device 600 may also include a transceiver 6006. It should be noted that the transceiver 6006 is not limited to one in practical application, and the structure of the electronic device 600 is not limited to the embodiment of the present invention.
The processor 6001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 6001 may also be a combination that performs a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 6002 may include a path that conveys information between the aforementioned components. The bus 6002 may be a PCI bus, an EISA bus, or the like. The bus 6002 can be divided into an address bus, a data bus, a control bus, and so forth. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The memory 6003 may be, but is not limited to, a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 6003 stores application program code implementing the aspects of the present invention, and its execution is controlled by the processor 6001. The processor 6001 is configured to execute the application program code stored in the memory 6003 to implement the speech recognition apparatus provided by the embodiments shown in figs. 3 and 4.
The electronic device applying the speech recognition method provided by the embodiments of the present invention recognizes voice information using at least one language model matched to the user. Recognizing voice information through a user-matched language model achieves recognition through a designated language model: recognition accuracy is improved, the recognition result can meet the user's personalized needs, and recognition efficiency is increased. This solves the technical problem in the related art that an undifferentiated universal language model fails to recognize, or even misrecognizes, such input, and improves the user experience.
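The recognition flow described above (rescoring candidate transcripts with every language model matched to the user and keeping the best result) can be sketched as follows. This is an illustrative reconstruction only; the hypothesis format, the scoring interface, and all names are assumptions rather than the patent's implementation.

```python
# Illustrative sketch only: rescore decoder hypotheses with each
# user-matched language model and keep the highest-scoring transcript.
# The data shapes and model interfaces below are assumptions.

def recognize(hypotheses, user_models):
    """hypotheses: list of (transcript, acoustic_log_score) pairs.
    user_models: dict mapping model name -> function returning a
    language-model log score for a transcript."""
    best = None
    for transcript, acoustic in hypotheses:
        for name, lm_score in user_models.items():
            total = acoustic + lm_score(transcript)
            if best is None or total > best[0]:
                best = (total, transcript, name)
    return best

# Toy usage: a domain model matched to the user boosts domain vocabulary,
# so the domain-correct transcript wins over the general alternative.
hypotheses = [("take two tablets", -1.0), ("make two tab lets", -0.9)]
user_models = {
    "general": lambda t: -5.0,
    "medical": lambda t: -1.0 if "tablets" in t else -10.0,
}
score, transcript, model = recognize(hypotheses, user_models)
```

In this toy run the user-matched "medical" model outscores the general one on the domain transcript, which is the behavior the embodiments attribute to recognizing with a designated, user-matched model.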
Embodiment Four
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the speech recognition method shown in any one of the above method embodiments.
Compared with the prior art, the non-transitory computer-readable storage medium provided by the embodiments of the present invention recognizes voice information using at least one language model matched to the user. Recognizing voice information through a user-matched language model achieves recognition through a designated language model: recognition accuracy is improved, the recognition result can meet the user's personalized needs, and recognition efficiency is increased. This solves the technical problem in the related art that an undifferentiated universal language model fails to recognize, or even misrecognizes, such input, and improves the user experience.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times, and which need not be performed sequentially but may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.
Claims (10)
1. A speech recognition method, comprising:
acquiring collected voice information input by a user;
and recognizing the voice information according to at least one language model matched with the user to obtain a voice recognition result.
2. The method of claim 1, wherein, before the recognizing the voice information according to at least one language model matched with the user, the method further comprises:
at least one language model matching the user is determined.
3. The method of claim 2, wherein determining at least one language model that matches the user comprises:
determining at least one language model matched with the user based on a received selection instruction of the user for a language model.
4. The method of claim 2, wherein determining at least one language model that matches the user comprises:
acquiring a historical input record and/or user attributes of the user;
and determining at least one domain corresponding to the user according to the historical input record and/or the user attributes, and taking a language model corresponding to the at least one domain as the language model matched with the user.
5. The method of claim 2, wherein determining at least one language model that matches the user comprises:
acquiring a word bank customized by the user;
generating a personalized language model of the user based on the user-customized word bank;
and determining the personalized language model of the user as the language model matched with the user.
6. The method of claim 1, further comprising:
updating the language model matched with the user based on the user-customized word bank.
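The method claims above can be read as a pipeline: determine the user's domain or domains from input history and/or user attributes (claim 4), then select the corresponding language models. A minimal sketch of that domain-matching step follows; the keyword tables, model names, and fallback are invented for illustration and are not drawn from the patent.

```python
# Hypothetical domain-matching step: infer the user's domain(s) from
# input history and profile attributes, then map each domain to a
# language model. All keyword tables and model names are invented.

DOMAIN_KEYWORDS = {
    "medical": {"dosage", "prescription", "symptom"},
    "legal": {"contract", "plaintiff", "statute"},
}
DOMAIN_MODELS = {"medical": "lm_medical_v1", "legal": "lm_legal_v1"}

def match_models(history, attributes, fallback="lm_general"):
    """history: list of past input strings; attributes: set of profile tags."""
    words = set(" ".join(history).lower().split()) | set(attributes)
    domains = [d for d in DOMAIN_KEYWORDS if words & DOMAIN_KEYWORDS[d]]
    return [DOMAIN_MODELS[d] for d in domains] or [fallback]

models = match_models(["please review the contract draft"], set())
```

A user whose history mentions legal vocabulary is matched to the legal-domain model; a user matching no domain falls back to a general model, mirroring the undifferentiated universal model the claims improve upon.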
7. A speech recognition apparatus, comprising:
a voice acquisition module configured to acquire collected voice information input by a user;
and a voice recognition module configured to recognize the voice information according to at least one language model matched with the user to obtain a voice recognition result.
8. The apparatus of claim 7, further comprising:
a model determination module configured to determine the at least one language model matched with the user before the voice recognition module recognizes the voice information.
9. An electronic device, comprising:
at least one processor;
at least one memory and a bus connected with the processor; wherein
the processor and the memory communicate with each other through the bus;
the processor is configured to invoke program instructions in the memory to perform the speech recognition method of any of claims 1 to 6.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the speech recognition method of any one of claims 1 to 6.
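Claims 5, 6, and 10 describe generating a personalized language model from a user-customized word bank and updating it as the bank changes. A toy unigram sketch of that idea follows; the boost weight, the nominal vocabulary size, the smoothing scheme, and the class interface are all illustrative assumptions, not the patent's design.

```python
# Toy personalized unigram model built from a user's custom word bank.
# All constants (boost weight, nominal vocabulary size) are illustrative.
import math

class PersonalizedLM:
    def __init__(self, word_bank, boost=5.0):
        # claim 5: generate the personalized model from the word bank
        self.counts = {w: boost for w in word_bank}
        self.total = sum(self.counts.values())

    def update(self, new_words, boost=5.0):
        # claim 6: update the matched model when the word bank changes
        for w in new_words:
            self.counts[w] = self.counts.get(w, 0.0) + boost
            self.total += boost

    def log_prob(self, word):
        # add-one smoothing over a nominal 10,000-word vocabulary
        return math.log((self.counts.get(word, 0.0) + 1.0)
                        / (self.total + 10_000))

lm = PersonalizedLM(["tensorflow", "kubernetes"])
before = lm.log_prob("terraform")
lm.update(["terraform"])          # user adds a word to the bank
after = lm.log_prob("terraform")
```

Words from the user's bank receive a higher probability than unseen words, which is one simple way a recognizer could favor a user's personalized vocabulary during decoding.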
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811126924.0A CN109272995A (en) | 2018-09-26 | 2018-09-26 | Audio recognition method, device and electronic equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811126924.0A CN109272995A (en) | 2018-09-26 | 2018-09-26 | Audio recognition method, device and electronic equipment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN109272995A true CN109272995A (en) | 2019-01-25 |
Family
ID=65197827
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811126924.0A Pending CN109272995A (en) | 2018-09-26 | 2018-09-26 | Audio recognition method, device and electronic equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109272995A (en) |
Cited By (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109903750A (en) * | 2019-02-21 | 2019-06-18 | 科大讯飞股份有限公司 | A kind of audio recognition method and device |
| CN110288993A (en) * | 2019-06-26 | 2019-09-27 | 广州探迹科技有限公司 | A kind of individualized intelligent voice interactive method and device based on container technique |
| CN110428816A (en) * | 2019-02-26 | 2019-11-08 | 北京蓦然认知科技有限公司 | A kind of method and device voice cell bank training and shared |
| CN111402864A (en) * | 2020-03-19 | 2020-07-10 | 北京声智科技有限公司 | Voice processing method and electronic equipment |
| CN111986651A (en) * | 2020-09-02 | 2020-11-24 | 上海优扬新媒信息技术有限公司 | Man-machine interaction method and device and intelligent interaction terminal |
| CN112992127A (en) * | 2019-12-12 | 2021-06-18 | 杭州海康威视数字技术股份有限公司 | Voice recognition method and device |
| WO2021129439A1 (en) * | 2019-12-28 | 2021-07-01 | 科大讯飞股份有限公司 | Voice recognition method and related product |
| CN113138674A (en) * | 2020-01-19 | 2021-07-20 | 北京搜狗科技发展有限公司 | Input method and related device |
| CN113763925A (en) * | 2021-05-26 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
| CN116318716A (en) * | 2023-02-17 | 2023-06-23 | 支付宝(杭州)信息技术有限公司 | Authentication method, device, storage medium, electronic equipment and product |
| CN117198289A (en) * | 2023-09-28 | 2023-12-08 | 阿波罗智联(北京)科技有限公司 | Voice interaction method, device, equipment, medium and product |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102541505A (en) * | 2011-01-04 | 2012-07-04 | 中国移动通信集团公司 | Voice input method and system thereof |
| CN105489221A (en) * | 2015-12-02 | 2016-04-13 | 北京云知声信息技术有限公司 | Voice recognition method and device |
| CN108288467A (en) * | 2017-06-07 | 2018-07-17 | 腾讯科技(深圳)有限公司 | A kind of audio recognition method, device and speech recognition engine |
| US10032463B1 (en) * | 2015-12-29 | 2018-07-24 | Amazon Technologies, Inc. | Speech processing with learned representation of user interaction history |
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102541505A (en) * | 2011-01-04 | 2012-07-04 | 中国移动通信集团公司 | Voice input method and system thereof |
| CN105489221A (en) * | 2015-12-02 | 2016-04-13 | 北京云知声信息技术有限公司 | Voice recognition method and device |
| US10032463B1 (en) * | 2015-12-29 | 2018-07-24 | Amazon Technologies, Inc. | Speech processing with learned representation of user interaction history |
| CN108288467A (en) * | 2017-06-07 | 2018-07-17 | 腾讯科技(深圳)有限公司 | A kind of audio recognition method, device and speech recognition engine |
Cited By (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109903750A (en) * | 2019-02-21 | 2019-06-18 | 科大讯飞股份有限公司 | A kind of audio recognition method and device |
| CN109903750B (en) * | 2019-02-21 | 2022-01-04 | 科大讯飞股份有限公司 | Voice recognition method and device |
| CN110428816A (en) * | 2019-02-26 | 2019-11-08 | 北京蓦然认知科技有限公司 | A kind of method and device voice cell bank training and shared |
| CN110428816B (en) * | 2019-02-26 | 2022-06-03 | 杭州蓦然认知科技有限公司 | Method and device for training and sharing voice cell bank |
| CN110288993A (en) * | 2019-06-26 | 2019-09-27 | 广州探迹科技有限公司 | A kind of individualized intelligent voice interactive method and device based on container technique |
| CN112992127A (en) * | 2019-12-12 | 2021-06-18 | 杭州海康威视数字技术股份有限公司 | Voice recognition method and device |
| CN112992127B (en) * | 2019-12-12 | 2024-05-07 | 杭州海康威视数字技术股份有限公司 | A method and device for speech recognition |
| US12451128B2 (en) | 2019-12-28 | 2025-10-21 | Iflytek Co., Ltd. | Voice recognition method and related product |
| WO2021129439A1 (en) * | 2019-12-28 | 2021-07-01 | 科大讯飞股份有限公司 | Voice recognition method and related product |
| CN113138674A (en) * | 2020-01-19 | 2021-07-20 | 北京搜狗科技发展有限公司 | Input method and related device |
| CN111402864A (en) * | 2020-03-19 | 2020-07-10 | 北京声智科技有限公司 | Voice processing method and electronic equipment |
| CN111986651B (en) * | 2020-09-02 | 2023-09-29 | 度小满科技(北京)有限公司 | Man-machine interaction method and device and intelligent interaction terminal |
| CN111986651A (en) * | 2020-09-02 | 2020-11-24 | 上海优扬新媒信息技术有限公司 | Man-machine interaction method and device and intelligent interaction terminal |
| CN113763925B (en) * | 2021-05-26 | 2024-03-12 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, computer equipment and storage medium |
| CN113763925A (en) * | 2021-05-26 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
| CN116318716A (en) * | 2023-02-17 | 2023-06-23 | 支付宝(杭州)信息技术有限公司 | Authentication method, device, storage medium, electronic equipment and product |
| CN117198289A (en) * | 2023-09-28 | 2023-12-08 | 阿波罗智联(北京)科技有限公司 | Voice interaction method, device, equipment, medium and product |
| CN117198289B (en) * | 2023-09-28 | 2024-05-10 | 阿波罗智联(北京)科技有限公司 | Voice interaction method, device, equipment, medium and product |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109272995A (en) | Audio recognition method, device and electronic equipment | |
| CN110263142B (en) | Method and apparatus for outputting information | |
| KR102733920B1 (en) | System and method for natural language processing | |
| CN106155686A (en) | Interface creating method, device and system | |
| CN108416003A (en) | A kind of picture classification method and device, terminal, storage medium | |
| CN109360571A (en) | Processing method and processing device, storage medium, the computer equipment of credit information | |
| CN107656996B (en) | Man-machine interaction method and device based on artificial intelligence | |
| KR20200080400A (en) | Method for providing sententce based on persona and electronic device for supporting the same | |
| CN103678304A (en) | Method and device for pushing specific content for predetermined webpage | |
| CN108959247B (en) | Data processing method, server and computer readable medium | |
| CN109119079B (en) | Voice input processing method and device | |
| KR102787542B1 (en) | Electronic device providing variation utterance text and operating method thereof | |
| CN107589828A (en) | The man-machine interaction method and system of knowledge based collection of illustrative plates | |
| CN104866308A (en) | Scenario image generation method and apparatus | |
| CN110890088A (en) | Voice information feedback method and device, computer equipment and storage medium | |
| CN110473543B (en) | A kind of speech recognition method and device | |
| CN111401044A (en) | Title generation method and device, terminal equipment and storage medium | |
| CN106898352A (en) | Sound control method and electronic equipment | |
| US12118985B2 (en) | Electronic device and method for providing on-device artificial intelligence service | |
| CN118428336A (en) | Method, device, electronic device and storage medium for generating low-code applications | |
| US20060155546A1 (en) | Method and system for controlling input modalities in a multimodal dialog system | |
| CN111444321A (en) | Question answering method, device, electronic equipment and storage medium | |
| CN113923517B (en) | Background music generation method and device and electronic equipment | |
| CN117352132A (en) | Psychological coaching method, device, equipment and storage medium | |
| CN111680514B (en) | Information processing and model training method, device, equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190125