CN103677729B - Voice input method and system
- Publication number: CN103677729B (application CN201310701517.9A)
- Authority: CN (China)
- Prior art keywords: recognition, candidate, texts, voice, score
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a voice input method and system. The method comprises the steps of: collecting voice data and sending the voice data to a server; receiving the first M candidate recognition texts with the highest first recognition scores that the server recognized for the voice data, together with their recognition information, wherein the recognition information comprises the first recognition scores; calculating second recognition scores of the first M candidate recognition texts from the personalized text data of the current user; calculating third recognition scores of the first M candidate recognition texts from the first recognition scores and the second recognition scores; calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores; and displaying the first N candidate recognition texts according to the confidence degrees. Displaying multiple candidate results makes selection convenient for the user and improves the rate of successful recognition, and re-ranking with the user's personalized data matches the user's habits as closely as possible, so that recognition accuracy is improved.
Description
Technical Field
The invention relates to the technical field of input methods, and in particular to a voice input method and a voice input system.
Background
At present, the rapid development of the mobile internet has driven the wide adoption of mobile intelligent devices such as mobile phones, tablet computers and wearable devices, and the voice input method, as one of the most convenient and natural forms of man-machine interaction on mobile devices, is gradually being accepted by the vast majority of users.
Although the performance of the voice input method has improved greatly with the development of voice recognition technology, the result fed back to the client may not be what the user actually said, owing to factors such as model precision, noise and accent. For example, when the user says "what festival comes after National Day", the result with the highest recognition score may be "what festival comes before and after National Day".
Secondly, the language model currently used is a general language model, learned in advance from a large amount of text from many sources and many users, and it does not suit the personalized needs of an individual user. For example, a user may often use a place name (rendered here as "quanza") that is homophonous with a more common word; when the user says "I want to go to quanza", the general language model favors the homophone with the higher overall usage rate, so the recognition result is the wrong "quanza", and the result does not change no matter how many times the input is repeated, which does not match the user's expectation.
In addition, a user makes occasional input errors (such as wrongly written characters) when typing on the keyboard, which affects the accuracy of a user model built from such text to a certain extent. For example, a user may often type "you drive in to a dining bar" on the keyboard, where "drive in" is a mis-typing of "drive up"; if the user model is built from this kind of text, the accuracy of the model suffers.
Disclosure of Invention
The invention provides a voice input method, which aims to solve the problem of low voice recognition accuracy.
Correspondingly, the invention also provides a voice input system to ensure the implementation and application of the above method.
In order to solve the above problems, the present invention discloses a voice input method, comprising:
collecting voice data and sending the voice data to a server;
receiving the first M candidate recognition texts with the highest first recognition scores corresponding to the voice data recognized by the server and recognition information thereof, wherein the recognition information comprises the first recognition scores;
calculating second recognition scores of the first M candidate recognition texts by adopting the personalized text data of the current user;
calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores;
calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
and displaying the first N candidate recognition texts according to the confidence degrees.
Preferably, the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, equipment text data, and voice recognition texts with a confidence degree higher than a preset threshold value.
Preferably, the candidate recognition text comprises a plurality of voice candidate words, and the recognition information further comprises occurrence probabilities of the plurality of voice candidate words;
The step of calculating the second recognition scores of the first M candidate recognition texts by adopting the personalized text data of the current user includes:
performing word segmentation on the first M candidate recognition texts to obtain first participles;
respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of a current user, and the second participles have word frequency;
respectively searching for the occurrence probabilities of the first participles by using the second participles; the occurrence probability is the ratio of a first word frequency to a second word frequency, wherein the first word frequency is the number of times the second participle corresponding to the current first participle appears after the second participles corresponding to the one or more first participles preceding it, and the second word frequency is the total word frequency of the second participles corresponding to the one or more first participles preceding the current first participle;
multiplying the occurrence probabilities of the first participles to obtain the connection probability of the candidate recognition text;
and calculating a second recognition score of the candidate recognition text by respectively adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
Preferably, the second recognition score of the candidate recognition text is calculated using the following formula:
u = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of each of the plurality of speech candidate words, P(w_1 w_2 ... w_n) is the connection probability of the candidate recognition text, λ is the weight, and WP is a word insertion penalty parameter.
Preferably, the third recognition score is calculated using the following formula:
MS(i) = α * s_i + β * u_i
wherein MS(i) is the third recognition score of the i-th candidate recognition text, s_i is the first recognition score of the i-th candidate recognition text, u_i is the second recognition score of the i-th candidate recognition text, and α and β are non-negative numbers.
Preferably, the confidence is a ratio of the third recognition score of the current candidate recognized text to the sum of the third recognition scores of the first N candidate recognized texts.
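For illustration, a minimal Python sketch of the client-side re-ranking and confidence computation described above follows; the names rerank_candidates and user_lm_score and the values of alpha, beta and n are illustrative assumptions rather than part of the claimed method, and positive combined scores are assumed so that the ratio-based confidence is well defined.

```python
from typing import Callable, List, Tuple

def rerank_candidates(
    candidates: List[Tuple[str, float]],    # (text, first recognition score) from the server
    user_lm_score: Callable[[str], float],  # text -> second recognition score (user LM)
    alpha: float = 0.7,                     # weight of the first recognition score
    beta: float = 0.3,                      # weight of the second recognition score
    n: int = 3,                             # number of candidates to display
) -> List[Tuple[str, float]]:
    """Combine the two scores into a third score, keep the top N, and
    turn each third score into a confidence (its share of the top-N sum)."""
    scored = [(text, alpha * s_i + beta * user_lm_score(text))  # MS(i) = α·s_i + β·u_i
              for text, s_i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top_n = scored[:n]
    total = sum(ms for _, ms in top_n)
    return [(text, ms / total) for text, ms in top_n]
```

A display layer would then show the returned texts in descending order of confidence.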
The invention also discloses a voice input method, which comprises the following steps:
receiving voice data sent by a client;
recognizing a plurality of candidate recognition texts corresponding to the voice data and recognition information thereof, wherein the recognition information comprises a first recognition score;
sending the first M candidate recognition texts with the highest first recognition scores and the recognition information thereof to a client; the client is used for calculating second recognition scores of the candidate recognition texts by adopting the personalized text data of the current user, calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores, and displaying the first N candidate recognition texts according to the confidence degrees.
Preferably, the step of recognizing the plurality of candidate recognition texts corresponding to the voice data and the recognition information thereof includes:
extracting acoustic features of multi-frame voice signals in the voice data;
respectively adopting the acoustic features to identify a plurality of voice candidate words corresponding to the multi-frame voice information;
respectively calculating the occurrence probability of the plurality of voice candidate words;
calculating connection probability among the voice candidate words;
combining the plurality of voice candidate words into a plurality of candidate recognition texts corresponding to the voice data;
and calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the plurality of voice candidate words and the connection probabilities among the plurality of voice candidate words.
Preferably, the occurrence probabilities of the plurality of speech candidate words are calculated by the following formula:
p_k = P(x_k|w_k)
wherein x_k is the acoustic feature of the multi-frame speech signal and w_k is the corresponding language candidate word.
Preferably, the connection probability among the plurality of speech candidate words is calculated by the following formula:
P(w_1 w_2 ... w_n) = P(w_1) * P(w_2|w_1) * P(w_3|w_1 w_2) * ... * P(w_n|w_1 w_2 ... w_(n-1))
wherein w_1, w_2, ..., w_n are the language candidate words.
Preferably, the first recognition score is calculated by the following formula:
S = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of the speech candidate word, P(w_1 w_2 ... w_n) is the connection probability among the speech candidate words, λ is the weight, and WP is the word insertion penalty parameter.
Preferably, the recognition information further includes the occurrence probabilities of the plurality of speech candidate words.
The invention also discloses a voice input method, which comprises the following steps:
receiving voice data sent by a client;
recognizing a plurality of candidate recognition texts and recognition information thereof from the voice data, wherein the recognition information comprises a first recognition score;
calculating second recognition scores of the first M candidate recognition texts with the highest first recognition scores by adopting the personalized text data of the current user;
calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores;
calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
sending the first N candidate recognition texts and confidence degrees thereof to a client; and the client is used for displaying the first N candidate recognition texts according to the confidence degrees.
The invention also discloses a voice input method, which comprises the following steps:
collecting voice data and sending the voice data to a server;
receiving the first M candidate recognition texts with the highest first recognition scores in each voice subdata returned by the server and the recognition information thereof, wherein the voice subdata are the plurality of voice subdata into which the server has cut the voice data, and the recognition information comprises the first recognition scores;
calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user;
calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition score and the second recognition score;
respectively calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata;
and respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
Preferably, the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, equipment text data, and voice recognition texts with a confidence degree higher than a preset threshold value.
Preferably, the candidate recognition text comprises a plurality of voice candidate words, and the recognition information further comprises occurrence probabilities of the plurality of voice candidate words;
the step of calculating the second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user comprises the following steps:
respectively performing word segmentation on the first M candidate recognition texts corresponding to each voice subdata to obtain first participles;
respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of a current user, and the second participles have word frequency;
respectively searching for the occurrence probabilities of the first participles by using the second participles; the occurrence probability is the ratio of a first word frequency to a second word frequency, wherein the first word frequency is the number of times the second participle corresponding to the current first participle appears after the second participles corresponding to the one or more first participles preceding it, and the second word frequency is the total word frequency of the second participles corresponding to the one or more first participles preceding the current first participle;
respectively multiplying the occurrence probabilities of the first participles to obtain the connection probability of each candidate recognition text;
and calculating a second recognition score of the candidate recognition text by adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
Preferably, the second recognition score of the candidate recognition text is calculated using the following formula:
u = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of each of the plurality of speech candidate words, P(w_1 w_2 ... w_n) is the connection probability of the candidate recognition text, λ is the weight, and WP is a word insertion penalty parameter.
Preferably, the third recognition score is calculated using the following formula:
MS(i) = α * s_i + β * u_i
wherein MS(i) is the third recognition score of the i-th candidate recognition text, s_i is the first recognition score of the i-th candidate recognition text, u_i is the second recognition score of the i-th candidate recognition text, and α and β are non-negative numbers.
Preferably, the confidence is a ratio of the third recognition score of the current candidate recognized text to the sum of the third recognition scores of the first N candidate recognized texts.
Preferably, the method further comprises the following steps:
when the candidate recognition text with the highest confidence degree is triggered, displaying the other candidate recognition texts, wherein the other candidate recognition texts are the candidate recognition texts, among the first N candidate recognition texts, other than the one with the highest confidence degree.
The invention also discloses a voice input method, which comprises the following steps:
receiving voice data sent by a client;
segmenting the voice data into a plurality of voice subdata;
respectively recognizing a plurality of candidate recognition texts and recognition information thereof corresponding to each voice subdata, wherein the recognition information comprises a first recognition score;
sending, for each voice subdata, the first M candidate recognition texts with the highest first recognition scores and the recognition information thereof to a client; the client is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user, calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition scores and the second recognition scores, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata, and respectively displaying the candidate recognition text with the highest confidence degree in each voice subdata.
Preferably, the step of respectively recognizing the candidate recognition texts corresponding to each voice subdata and the recognition information thereof includes:
respectively extracting the acoustic characteristics of multi-frame voice signals in each voice subdata;
respectively adopting the acoustic features to identify a plurality of voice candidate words corresponding to the multi-frame voice information;
respectively calculating the occurrence probability of the plurality of voice candidate words;
respectively calculating connection probabilities among the voice candidate words;
combining the plurality of voice candidate words into a plurality of candidate recognition texts corresponding to the voice data respectively;
and calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the plurality of voice candidate words and the connection probabilities among the plurality of voice candidate words.
Preferably, the occurrence probabilities of the plurality of speech candidate words are calculated by the following formula:
p_k = P(x_k|w_k)
wherein x_k is the acoustic feature of the multi-frame speech signal and w_k is the corresponding language candidate word.
Preferably, the connection probability among the plurality of speech candidate words is calculated by the following formula:
P(w_1 w_2 ... w_n) = P(w_1) * P(w_2|w_1) * P(w_3|w_1 w_2) * ... * P(w_n|w_1 w_2 ... w_(n-1))
wherein w_1, w_2, ..., w_n are the language candidate words.
Preferably, the first recognition score is calculated by the following formula:
S = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of the speech candidate word, P(w_1 w_2 ... w_n) is the connection probability among the speech candidate words, λ is the weight, and WP is the word insertion penalty parameter.
Preferably, the recognition information further includes the occurrence probabilities of the plurality of speech candidate words.
The invention also discloses a voice input method, which comprises the following steps:
receiving voice data sent by a client;
segmenting the voice data into a plurality of voice subdata;
respectively recognizing a plurality of candidate recognition texts of each voice subdata and the recognition information thereof, wherein the recognition information comprises a first recognition score;
respectively adopting the personalized text data of the current user to calculate second recognition scores of the first M candidate recognition texts with the highest first recognition score corresponding to each voice subdata;
calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition score and the second recognition score;
respectively calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata;
sending the first N candidate recognition texts with the highest third recognition score of each voice subdata and the confidence degrees of the candidate recognition texts to the client; and the client is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
The invention also discloses a voice input system, comprising:
the first voice data acquisition module is used for acquiring voice data;
the first voice data sending module is used for sending the voice data to a server;
a first receiving module, configured to receive the first M candidate recognition texts with the highest first recognition score corresponding to the voice data recognized by the server and recognition information thereof, where the recognition information includes the first recognition score;
the first score calculating module is used for calculating second recognition scores of the first M candidate recognition texts by adopting the personalized text data of the current user;
the second score calculation module is used for calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores;
the first confidence coefficient calculation module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores;
and the first display module is used for displaying the first N candidate recognition texts according to the confidence degrees.
Preferably, the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, equipment text data, and voice recognition texts with a confidence degree higher than a preset threshold value.
Preferably, the candidate recognition text comprises a plurality of speech candidate words, and the recognition information further comprises the occurrence probabilities of the plurality of speech candidate words;
the first score calculation module comprises:
the first word segmentation submodule is used for performing word segmentation on the first M candidate recognition texts to obtain first participles;
the first mapping sub-module is used for respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of a current user, and the second participles have word frequency;
the first searching submodule is used for respectively searching for the occurrence probabilities of the first participles by using the second participles, wherein the occurrence probability is the ratio of a first word frequency to a second word frequency; the first word frequency is the number of times the second participle corresponding to the current first participle appears after the second participles corresponding to the one or more first participles preceding it, and the second word frequency is the total word frequency of the second participles corresponding to the one or more first participles preceding the current first participle;
the first connection probability obtaining submodule is used for multiplying the occurrence probabilities of the first participles to obtain the connection probability of the candidate recognition text;
and the first candidate recognition text score calculating sub-module is used for calculating a second recognition score of the candidate recognition text by respectively adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
Preferably, the second recognition score of the candidate recognition text is calculated using the following formula:
u = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of each of the plurality of speech candidate words, P(w_1 w_2 ... w_n) is the connection probability of the candidate recognition text, λ is the weight, and WP is a word insertion penalty parameter.
Preferably, the third recognition score is calculated using the following formula:
MS(i) = α * s_i + β * u_i
wherein MS(i) is the third recognition score of the i-th candidate recognition text, s_i is the first recognition score of the i-th candidate recognition text, u_i is the second recognition score of the i-th candidate recognition text, and α and β are non-negative numbers.
Preferably, the confidence is a ratio of the third recognition score of the current candidate recognized text to the sum of the third recognition scores of the first N candidate recognized texts.
The invention also discloses a voice input system, comprising:
the first voice data receiving module is used for receiving voice data sent by the client;
the first recognition module is used for recognizing a plurality of candidate recognition texts corresponding to the voice data and recognition information thereof, wherein the recognition information comprises a first recognition score;
the first sending module is used for sending the first M candidate recognition texts with the highest first recognition scores and the recognition information thereof to the client; the client is used for calculating second recognition scores of the candidate recognition texts by adopting the personalized text data of the current user, calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores, and displaying the first N candidate recognition texts according to the confidence degrees.
Preferably, the first identification module comprises:
the first voice signal extraction submodule is used for extracting the acoustic characteristics of multi-frame voice signals in the voice data;
the first voice candidate word recognition sub-module is used for recognizing a plurality of voice candidate words corresponding to the multi-frame voice information by respectively adopting the acoustic characteristics;
the first occurrence probability calculation submodule is used for calculating the occurrence probability of the voice candidate words respectively;
the first connection probability calculation submodule is used for calculating the connection probability among the voice candidate words;
a first candidate recognition text combination sub-module, configured to combine the multiple speech candidate words into multiple candidate recognition texts corresponding to the speech data;
and the third score calculating submodule is used for calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the multiple voice candidate words and the connection probabilities among the multiple voice candidate words.
Preferably, the occurrence probabilities of the plurality of speech candidate words are calculated by the following formula:
p_k = P(x_k|w_k)
wherein x_k is the acoustic feature of the multi-frame speech signal and w_k is the corresponding language candidate word.
Preferably, the connection probability among the plurality of speech candidate words is calculated by the following formula:
P(w_1 w_2 ... w_n) = P(w_1) * P(w_2|w_1) * P(w_3|w_1 w_2) * ... * P(w_n|w_1 w_2 ... w_(n-1))
wherein w_1, w_2, ..., w_n are the language candidate words.
Preferably, the first recognition score is calculated by the following formula:
S = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of the speech candidate word, P(w_1 w_2 ... w_n) is the connection probability among the speech candidate words, λ is the weight, and WP is the word insertion penalty parameter.
Preferably, the recognition information further includes the occurrence probabilities of the plurality of speech candidate words.
The invention also discloses a voice input system, which is characterized by comprising the following components:
the second voice data receiving module is used for receiving voice data sent by the client;
the second recognition module is used for recognizing a plurality of candidate recognition texts and recognition information thereof from the voice data; the identification information comprises a first identification score;
the fourth score calculating module is used for calculating second identification scores of the first M candidate identification texts with the highest first identification scores by adopting the personalized text data of the current user;
a fifth score calculation module, configured to calculate third recognition scores of the top M candidate recognition texts by using the first recognition score and the second recognition score;
the second confidence coefficient module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores;
the second sending module is used for sending the first N candidate recognition texts and the confidence degrees thereof to the client; and the client is used for displaying the first N candidate recognition texts according to the confidence degrees.
The invention also discloses a voice input system, comprising:
the second voice data acquisition module is used for acquiring voice data;
the second voice data sending module is used for sending the voice data to a server;
the second receiving module is used for receiving the first M candidate recognition texts with the highest first recognition scores in each voice subdata returned by the server and the recognition information thereof, wherein the voice subdata are the plurality of voice subdata into which the server has cut the voice data, and the recognition information comprises the first recognition scores;
the sixth score calculating module is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user;
a seventh score calculating module, configured to calculate, by adopting the first recognition scores and the second recognition scores, third recognition scores of the first M candidate recognition texts corresponding to each voice subdata;
the third confidence coefficient calculation module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores of each voice subdata respectively;
and the second display module is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
Preferably, the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, equipment text data, and voice recognition texts with a confidence degree higher than a preset threshold value.
Preferably, the candidate recognition text comprises a plurality of voice candidate words, and the recognition information further comprises occurrence probabilities of the plurality of voice candidate words;
the sixth score calculating module includes:
the second word segmentation submodule is used for respectively performing word segmentation on the first M candidate recognition texts corresponding to each voice subdata to obtain first participles;
the second mapping sub-module is used for respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of the current user, and the second participles have word frequency;
the second searching submodule is used for respectively searching for the occurrence probabilities of the first participles by using the second participles, wherein the occurrence probability is the ratio of a first word frequency to a second word frequency; the first word frequency is the number of times the second participle corresponding to the current first participle appears after the second participles corresponding to the one or more first participles preceding it, and the second word frequency is the total word frequency of the second participles corresponding to the one or more first participles preceding the current first participle;
the second connection probability obtaining submodule is used for respectively multiplying the occurrence probabilities of the first participles to obtain the connection probability of each candidate recognition text;
and the second candidate recognition text score calculating sub-module is used for calculating a second recognition score of the candidate recognition text by adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
Preferably, the second recognition score of the candidate recognition text is calculated using the following formula:
u = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of each of the plurality of speech candidate words, P(w_1 w_2 ... w_n) is the connection probability of the candidate recognition text, λ is the weight, and WP is a word insertion penalty parameter.
Preferably, the third recognition score is calculated using the following formula:
MS(i) = α * s_i + β * u_i
wherein MS(i) is the third recognition score of the i-th candidate recognition text, s_i is the first recognition score of the i-th candidate recognition text, u_i is the second recognition score of the i-th candidate recognition text, and α and β are non-negative numbers.
Preferably, the confidence is a ratio of the third recognition score of the current candidate recognized text to the sum of the third recognition scores of the first N candidate recognized texts.
Preferably, the method further comprises the following steps:
the third display module is used for displaying the other candidate recognition texts when the candidate recognition text with the highest confidence degree is triggered, wherein the other candidate recognition texts are the candidate recognition texts, among the first N candidate recognition texts, other than the one with the highest confidence degree.
The invention also discloses a voice input system, comprising:
the third voice data receiving module is used for receiving voice data sent by the client;
the first voice data segmentation module is used for segmenting the voice data into a plurality of voice subdata;
the second identification module is used for respectively recognizing a plurality of candidate recognition texts and recognition information thereof corresponding to each voice subdata, wherein the recognition information comprises a first recognition score;
the third sending module is used for sending, for each voice subdata, the first M candidate recognition texts with the highest first recognition scores and the recognition information thereof to the client; the client is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user, calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition scores and the second recognition scores, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata, and respectively displaying the candidate recognition text with the highest confidence degree in each voice subdata.
Preferably, the second identification module comprises:
the second voice signal extraction submodule is used for respectively extracting the acoustic characteristics of the multi-frame voice signals in each voice subdata;
the second voice candidate word recognition sub-module is used for recognizing a plurality of voice candidate words corresponding to the multi-frame voice information by respectively adopting the acoustic characteristics;
the second occurrence probability calculation submodule is used for calculating the occurrence probability of the voice candidate words respectively;
the second connection probability calculation submodule is used for calculating connection probabilities among the voice candidate words respectively;
a second candidate recognition text combination sub-module, configured to combine the multiple speech candidate words into multiple candidate recognition texts corresponding to the speech data, respectively;
and the eighth score calculating submodule is used for calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the plurality of voice candidate words and the connection probabilities among the plurality of voice candidate words.
Preferably, the occurrence probabilities of the plurality of speech candidate words are calculated by the following formula:
p_k = P(x_k|w_k)
wherein x_k is the acoustic feature of the multi-frame speech signal and w_k is the corresponding language candidate word.
Preferably, the connection probability among the plurality of speech candidate words is calculated by the following formula:
P(w_1 w_2 ... w_n) = P(w_1) * P(w_2|w_1) * P(w_3|w_1 w_2) * ... * P(w_n|w_1 w_2 ... w_(n-1))
wherein w_1, w_2, ..., w_n are the language candidate words.
Preferably, the first recognition score is calculated by the following formula:
S = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of the speech candidate word, P(w_1 w_2 ... w_n) is the connection probability among the speech candidate words, λ is the weight, and WP is the word insertion penalty parameter.
Preferably, the recognition information further includes the occurrence probabilities of the plurality of speech candidate words.
The invention also discloses a voice input system, comprising:
the fourth voice data receiving module is used for receiving the voice data sent by the client;
the second voice data segmentation module is used for segmenting the voice data into a plurality of voice subdata;
the third identification module is used for respectively recognizing a plurality of candidate recognition texts of each voice subdata and the recognition information thereof, wherein the recognition information comprises a first recognition score;
the ninth score calculating module is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user;
a tenth score calculation module, configured to calculate, by adopting the first recognition scores and the second recognition scores, third recognition scores of the first M candidate recognition texts with the highest first recognition scores corresponding to each voice subdata;
the fourth confidence coefficient calculation module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores of each voice subdata respectively;
the fourth sending module is used for sending the first N candidate recognition texts with the highest third recognition score of each voice subdata and the confidence degrees of the candidate recognition texts to the client; and the client is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
Compared with the prior art, the invention has the following advantages:
the invention can obtain the recognition result of the voice data, namely the candidate recognition text and the recognition score thereof, then adopt the personalized data of the user to reorder and calculate the confidence level, finally display the first N results according to the confidence level, display the multiple candidate results, facilitate the selection of the user, improve the accuracy of the recognition success, adopt the personalized data of the user to reorder, can ensure to meet the habit of the user as much as possible and ensure that the recognition precision is improved.
The invention can train the user language model and re-rank the candidate recognition texts using various kinds of user personalized text data, such as input behavior text data, a user-defined word bank, equipment text data, and voice recognition texts whose confidence degree is higher than a preset threshold. Such user personalized text data has high coverage and strong practicability, and its weight can change over time, further improving the recognition accuracy of the candidate texts.
Drawings
FIG. 1 is a flow chart of the steps of an embodiment 1 of a speech input method of the present invention;
FIG. 2 is an exemplary flow chart of one embodiment of a method of speech input of the present invention;
FIG. 3 is a schematic diagram of a user model training of the present invention;
FIG. 4 is a flow chart of the reordering of recognition results of the present invention;
FIG. 5 is a display diagram of a candidate recognition text according to the present invention;
FIG. 6 is a flow chart of the steps of a speech input method embodiment 2 of the present invention;
FIG. 7 is a flow chart of the steps of a speech input method embodiment 3 of the present invention;
FIG. 8 is an exemplary flow chart of one voice input method embodiment of the present invention;
FIG. 9 is a flow chart of the steps of a speech input method embodiment 4 of the present invention;
FIG. 10 is a flow chart of the steps of a speech input method embodiment 5 of the present invention;
FIG. 11 is a flow chart of the steps of a speech input method embodiment 6 of the present invention;
FIG. 12 is a block diagram of a speech input system embodiment 1 according to the present invention;
FIG. 13 is a block diagram of a speech input system embodiment 2 of the present invention;
FIG. 14 is a block diagram of a speech input system embodiment 3 of the present invention;
FIG. 15 is a block diagram of the structure of an embodiment 4 of a speech input system of the present invention;
FIG. 16 is a block diagram of the structure of an embodiment 5 of a speech input system of the present invention;
FIG. 17 is a block diagram of a speech input system embodiment 6 of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of embodiment 1 of a speech input method of the present invention is shown, which may specifically include the following steps:
Step 101, collecting voice data;
The voice data may be a recording of the user's voice, collected at the client through an audio input device such as a microphone.
Step 102, sending the voice data to a server;
the client may establish a Wireless connection with the server through a Wireless network such as WIFI (Wireless Fidelity, short-range Wireless transmission technology), bluetooth, Wireless network communication (for example, GPRS (general packet radio service), 3G (third generation mobile communication technology), 4G (fourth generation mobile communication technology, etc.), etc., or may establish a wired connection with the server through a wired network such as a network cable and USB (Universal Serial Bus), etc., and transmit voice data.
Step 103, receiving the first M candidate recognition texts with the highest first recognition scores corresponding to the voice data recognized by the server and recognition information thereof, wherein the recognition information includes the first recognition scores;
As shown in fig. 2, the embodiment of the present invention may deploy a client and a corresponding server at the same time. The client collects the voice data input by the user (corresponding to "voice collection" in fig. 2) and sends it to the server. The speech recognition system deployed on the server side recognizes the received voice data under the guidance of an acoustic model (AM) and a general language model (G-LM) to obtain a plurality of candidate recognition texts and their recognition information (corresponding to "speech recognition" in fig. 2), wherein the recognition information includes a first recognition score, and the first M (M is a positive integer) candidate recognition texts with the highest first recognition scores and their recognition information are fed back to the client (corresponding to returning the top M best results, as shown in fig. 2).
Step 104, calculating second recognition scores of the first M candidate recognition texts by adopting the personalized text data of the current user;
By applying the embodiment of the invention, personalized text data of the user (corresponding to the user corpus collection shown in fig. 2) can be collected in advance and a user language model (U-LM) can be trained; the user language model can then be used to calculate the second recognition scores of the first M candidate recognition texts.
As a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
input behavior text data, a user-defined word bank, equipment text data, and voice recognition texts with a confidence degree higher than a preset threshold value.
As shown in fig. 3, the input behavior text data may include information such as the user's pinyin input corpus, which records the user's ordinary typing content and behavior information such as backspace and space usage. In practical application, the weight of the input behavior text data gradually decreases over time t; here the weight refers to the contribution that information from different time periods in the input behavior text data makes to model training: the earlier the information was recorded, the smaller its contribution, and the relationship between weight and time can be modeled with a suitable mathematical function.
The custom thesaurus can record information such as custom words and sentences generated or set by a user when the user uses the input method tool.
The device text data may be text data in a device (e.g., computer, cell phone, tablet, etc.) that the user is using or using, such as an address book, a music playlist, an APP (application) list, and so on.
The speech recognition text may be the recognition text of the user's voice input, sourced from the corpus of the voice input method. In practical application, a speech recognition text is used for training the user language model only when its confidence CM is greater than a preset threshold Thresh; otherwise it is discarded. Because confidence is used, the speech recognition text may be recognition text produced by the embodiment of the present invention, or recognition text from another scheme that returns a confidence, as long as the meaning of the confidence is consistent with the embodiment of the present invention; for example, the confidence lies in the range 0 to 1, and the closer it is to 1, the more credible the recognition text. Preferably, the preset threshold Thresh may take a value of 0.85-0.9. The weight of the speech recognition text may also gradually decrease over time t.
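A minimal sketch of this filtering rule follows; the exponential decay and the constants THRESH and HALF_LIFE_DAYS are illustrative assumptions, since the text only states that low-confidence texts are discarded and that the weight decreases over time.

```python
import math

THRESH = 0.9           # preset confidence threshold (the text suggests 0.85-0.9)
HALF_LIFE_DAYS = 90.0  # illustrative decay constant; the patent fixes no functional form

def training_weight(confidence: float, age_days: float) -> float:
    """Weight of one recognized utterance in user language model training.

    Returns 0.0 (discard) when the confidence CM does not exceed Thresh,
    otherwise a weight that halves every HALF_LIFE_DAYS days.
    """
    if confidence <= THRESH:
        return 0.0
    return math.exp(-math.log(2.0) * age_days / HALF_LIFE_DAYS)
```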
A user language model is obtained by training on the personalized text data of the user. The trained model can be an N-Gram model (a language model commonly used in large-vocabulary continuous speech recognition), a neural-network-based language model, and the like, and the user language model can be learned periodically or when the client is idle.
Taking N-Gram as an example as applied to the embodiment of the present invention, the personalized text data of the user can be segmented into words and word frequencies can be counted; specifically, the word frequencies may include the total word frequency with which each word appears in the user's personalized text data and the word frequency with which it appears before one or more other words.
In specific implementations, word segmentation may be performed based on character string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy, and if a character string is found in the dictionary, the match succeeds (a word is recognized). In practical word segmentation systems, mechanical word segmentation is used as an initial segmentation means, and various other kinds of linguistic information are used to further improve segmentation accuracy.
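The mechanical, string-matching segmentation described here is commonly realized as forward maximum matching; below is a minimal sketch of that strategy (one of several possible matching strategies), with the dictionary contents and the maximum entry length as illustrative inputs.

```python
def forward_max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Greedy left-to-right segmentation: at each position, take the longest
    dictionary entry that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words

# e.g. forward_max_match("ABCD", {"AB", "CD"}) -> ["AB", "CD"]
```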
The invention can train the user language model and re-rank the candidate recognition texts using various kinds of user personalized text data, such as input behavior text data, a user-defined word bank, equipment text data, and voice recognition texts whose confidence degree is higher than a preset threshold. Such user personalized text data has high coverage and strong practicability, and its weight can change over time, further improving the recognition accuracy of the candidate texts.
In a preferred embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition information may further include occurrence probabilities of the plurality of speech candidate words; the step 104 may specifically include the following sub-steps:
Substep S11, performing word segmentation on the first M candidate recognition texts to obtain first participles;
It should be noted that the method for segmenting the candidate recognition texts must be consistent with the method used to segment the personalized text data of the user. For example, if the user's personalized text data was segmented based on character string matching, the candidate recognition texts must be segmented based on the same character string matching.
Substep S12, respectively mapping the first participles into preset second participles, where the second participles are participles of personalized text data of a current user, and the second participles have a word frequency;
The second participles may be the participles obtained when the user language model was trained on the personalized text data of the current user; the first participles and the second participles are substantially the same kind of units.
Substep S13, finding the occurrence probability of the first participle by using the second participle respectively; the occurrence probability may be a ratio of a first word frequency number to a second word frequency number, where the first word frequency number may be a second participle corresponding to a current first participle, and a word frequency number that appears behind a second participle corresponding to one or more first participles before the current first participle, and the second word frequency number may be a total word frequency number of the second participles corresponding to the one or more first participles before the current first participle;
Substep S14, multiplying the occurrence probabilities of the first participles to obtain the connection probability of the candidate recognition text;
the N-Gram model may be based on a markov assumption that the occurrence of a word depends only on a limited word or words that occur before it. For a sentence T, it can be assumed that T is formed by a sequence of words W1,W2,W3…, Wn, then this sentence T consists of W1,W2,W3,…,WnThe connection probability of a connection composition is P (t) = P (W)1W2W3…Wn)=P(W1)P(W2|W1)P(W3|W1W2)…P(Wn|W1W2…Wn-1)。
If the occurrence of a word depends only on the one word that appears before it, the model is called a bigram; namely, P(T) = P(W1 W2 W3 ... Wn) = P(W1) P(W2|W1) P(W3|W1 W2) ... P(Wn|W1 W2 ... Wn-1) ≈ P(W1) P(W2|W1) P(W3|W2) ... P(Wn|Wn-1).
If the occurrence of a word depends only on the two words that appear before it, the model is called a trigram. Bigram and trigram models are the main forms in practical applications of the N-Gram model; models of order four and above are rarely used, because training them requires a much larger corpus, data sparsity becomes severe, the time complexity is high, and the precision improves little.
The following description takes the candidate recognition text "I want to eat Chinese food" as an example:
The candidate recognition text "I want to eat Chinese food" is segmented to obtain the first participles "I", "want", "to", "eat", "Chinese", "food"; in the user language model, the preset second participles and their word frequencies are shown in Table 1 and Table 2.
TABLE 1 Total word frequencies of the second participles
Second participle | Total word frequency |
I | 3437 |
want | 1215 |
to | 3256 |
eat | 938 |
Chinese | 213 |
food | 1506 |
lunch | 459 |
TABLE 2 Word frequency with which the row participle appears immediately before the column participle
 | I | want | to | eat | Chinese | food | lunch |
I | 8 | 1087 | 0 | 13 | 0 | 0 | 0 |
want | 3 | 0 | 786 | 0 | 6 | 8 | 6 |
to | 3 | 0 | 10 | 860 | 3 | 0 | 12 |
eat | 0 | 0 | 2 | 0 | 19 | 2 | 52 |
Chinese | 2 | 0 | 0 | 0 | 0 | 120 | 1 |
food | 19 | 0 | 17 | 0 | 0 | 0 | 0 |
lunch | 4 | 0 | 0 | 0 | 0 | 1 | 0 |
For example, the value 1087 in the row for "I" and the column for "want" indicates that, in the personalized text data of the user, the word frequency of "I" appearing before "want" is 1087.
The connection probability of the candidate recognition text "I want to eat Chinese food" is then:
P(I want to eat Chinese food)
=P(I)*P(want|I)*P(to|want)*P(eat|to)*P(Chinese|eat)*P(food|Chinese)
=0.25*(1087/3437)*(786/1215)*(860/3256)*(19/938)*(120/213)
=0.000154171
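The worked example above can be reproduced directly from the two tables; the sketch below does so, with P(I) = 0.25 taken from the example (the text does not show how the unigram probability of the first word is obtained).

```python
# Counts copied from Table 1 (totals) and Table 2 (row word followed by column word).
total = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
         "Chinese": 213, "food": 1506, "lunch": 459}
follow = {("I", "want"): 1087, ("want", "to"): 786, ("to", "eat"): 860,
          ("eat", "Chinese"): 19, ("Chinese", "food"): 120}

def connection_probability(words, p_first=0.25):
    """Bigram connection probability P(W1) * P(W2|W1) * ... * P(Wn|Wn-1)."""
    p = p_first
    for prev, cur in zip(words, words[1:]):
        p *= follow.get((prev, cur), 0) / total[prev]
    return p

print(connection_probability(["I", "want", "to", "eat", "Chinese", "food"]))
# -> about 0.000154, matching the value computed above
```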
Substep S15, calculating a second recognition score of the candidate recognition text by using the occurrence probabilities of the plurality of speech candidate words and the connection probability of the candidate recognition text.
In a specific implementation, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:
u = Σ_k log P(x_k|w_k) + λ * log P(w_1 w_2 ... w_n) + n * WP
wherein P(x_k|w_k) is the occurrence probability of each of the plurality of speech candidate words, P(w_1 w_2 ... w_n) is the connection probability of the candidate recognition text, λ is the weight, and WP is a word insertion penalty parameter.
WP represents an insertion penalty to minimize insertion errors.
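Read as code, the log-linear combination implied by this description looks as follows; because the original formula image is not reproduced in the text, the exact form (summed log occurrence probabilities, a λ-weighted log connection probability, and a per-word insertion penalty) is an assumption consistent with the stated ingredients, and the values of lam and wp are illustrative.

```python
import math

def recognition_score(word_probs, connection_prob, lam=0.9, wp=-0.5):
    """Score = sum of log occurrence probabilities of the speech candidate
    words + lam * log connection probability + wp per word."""
    acoustic = sum(math.log(p) for p in word_probs)
    return acoustic + lam * math.log(connection_prob) + wp * len(word_probs)

# The same form serves for the first recognition score (general LM) and the
# second recognition score (user LM); only the connection probability differs.
```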
Step 105, calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition score and the second recognition score;
In a particular implementation, the third recognition score may be calculated using the following formula:
MS(i) = α * s_i + β * u_i
wherein MS(i) is the third recognition score of the i-th candidate recognition text, s_i is the first recognition score of the i-th candidate recognition text, u_i is the second recognition score of the i-th candidate recognition text, and α and β are non-negative numbers.
Further, α + β = 1. The values of α and β need to be adjusted according to the actual model accuracy: initially α > β, and β may gradually increase as the accuracy of the user language model U-LM improves.
Step 106, calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
As shown in fig. 4, the recognition results may be re-ranked at the client using the user language model to obtain the first N candidate recognition texts with the highest third recognition scores (corresponding to the re-ranking of the M results shown in fig. 2). N is a positive integer, and N ≤ M.
As a preferable example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
Of course, there are various methods for calculating the confidence degree; the method above is only an example, and when implementing the embodiment of the present invention, other confidence calculation methods may be set according to the actual situation, which is not limited by the embodiment of the present invention.
It should be noted that, in the given example, ranking by the third recognition score and ranking by the confidence degree are consistent; however, if the confidence degree is calculated based on lattice word posterior probabilities, the two rankings may be inconsistent.
Step 107, displaying the first N candidate recognition texts according to the confidence degrees.
In a specific implementation, the embodiment of the present invention may display the first N candidate recognition texts in any form according to the confidence degrees (corresponding to the "top-N optimal results on screen" shown in fig. 2), and the embodiment of the present invention is not limited in this respect.
Preferably, candidate recognition texts with higher confidence may be presented with higher priority. For example, as shown in fig. 5, three optimal candidate recognition texts are displayed in the current client interface: the candidate recognition text "Sogou mobile phone input method" has the highest confidence and is displayed first, the candidate with the second-highest confidence is displayed in the middle, and the candidate with the lowest confidence is displayed at the bottom.
The invention can obtain the recognition results of the voice data, namely the candidate recognition texts and their recognition scores, reorder them with the personalized data of the user and calculate their confidences, and finally display the first N results according to the confidences. Displaying multiple candidate results facilitates the user's selection and improves the rate of successful recognition, while reordering with the personalized data of the user ensures that the results match the user's habits as closely as possible, so that the recognition accuracy is improved.
Referring to fig. 6, a flowchart illustrating steps of embodiment 2 of the speech input method of the present invention is shown, which may specifically include the following steps:
step 601, receiving voice data sent by a client;
corresponding to step 101, the server according to the embodiment of the present invention receives the voice data sent by the client.
Step 602, recognizing a plurality of candidate recognition texts corresponding to the voice data and recognition information thereof, wherein the recognition information includes a first recognition score;
speech recognition technology, also known as Automatic Speech Recognition (ASR), has the task of converting the vocabulary content of human speech into text that a computer can read. Speech recognition is a comprehensive technology that involves many subject fields, such as vocalization and auditory mechanisms, signal processing, probability theory and information theory, pattern recognition, artificial intelligence and the like.
In practical application, a speech recognition system deployed at the server side can obtain, for the received speech data, a plurality of candidate recognition texts and their recognition scores under the guidance of an Acoustic Model (AM) and a General Language Model (G-LM).
An Acoustic Model (AM) is the bottom-most part of an automatic speech recognition system and also its most critical component; the quality of acoustic modeling directly and fundamentally affects the recognition performance and robustness of the speech recognition system. The acoustic model uses probability statistics to model the basic speech units that carry the acoustic information and to describe their statistical characteristics. Through acoustic modeling, the similarity between the feature vector sequence of the speech and each pronunciation template can be measured effectively, and the acoustic content of the speech can be judged. The speech content of a speaker is composed of basic speech units, which may be sentences, phrases, words, syllables, sub-syllables or phonemes.
Due to the time-varying nature of speech signals, noise and other instability factors, high recognition accuracy cannot be achieved with acoustic models alone. In human language, the words within a sentence are closely connected; word-level information can narrow the search space over the acoustic model and effectively improve recognition accuracy. A language model is necessary for this task: it provides context information and semantic information between the words of the language. The General Language Model (G-LM) may specifically include an N-Gram language model, a Markov N-Gram, Exponential Models, Decision Tree Models, and so forth. N-Gram language models are the most commonly used statistical language models, in particular the bigram and the trigram.
In a preferred embodiment of the present invention, the step 602 may specifically include the following sub-steps:
a substep S21 of extracting acoustic features of a plurality of frames of voice signals in the voice data;
Extraction and selection of the acoustic features of speech data is an important link in speech recognition. Acoustic feature extraction is both a process of substantial information compression and a process of signal deconvolution, and aims to make the acoustic features easier to separate.
Due to the time-varying nature of speech signals, feature extraction must be performed on short segments of the speech signal, i.e., short-time analysis. Each analysis segment, which is considered stationary, is called a frame, and the shift from frame to frame is typically 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies and windowed to reduce edge effects of the short speech segments.
The acoustic features may specifically include linear prediction coefficients LPC, cepstral coefficients CEP, mel-frequency cepstral coefficients MFCC and perceptual linear prediction PLP, among others.
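The short-time analysis described above can be sketched in Python with NumPy as follows; the 25 ms frame length, 10 ms shift, pre-emphasis coefficient and Hamming window are common defaults and are assumptions here, since the text only fixes the shift at 1/2 or 1/3 of the frame length.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_len_s=0.025, frame_shift_s=0.010,
                 pre_emphasis=0.97):
    """Pre-emphasize, frame and window a speech signal for feature extraction."""
    # Pre-emphasis boosts the high-frequency part of the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    flen = int(round(frame_len_s * sample_rate))
    fshift = int(round(frame_shift_s * sample_rate))
    num_frames = 1 + (len(emphasized) - flen) // fshift  # assumes len >= flen

    window = np.hamming(flen)  # windowing softens the frame edges
    return np.stack([emphasized[i * fshift:i * fshift + flen] * window
                     for i in range(num_frames)])

frames = frame_signal(np.random.randn(16000), sample_rate=16000)
print(frames.shape)  # (98, 400); features such as MFCC are then computed per frame
```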
A substep S22 of recognizing a plurality of voice candidate words corresponding to the plurality of frames of voice information by respectively adopting the acoustic features;
In a specific implementation, a pronunciation dictionary may be used for recognizing the speech candidate words. The pronunciation dictionary stores the pronunciations of all words and connects the acoustic model with the language model. For example, a sentence may be divided into a sequence of connected words, and each word is associated with the phoneme sequence of its pronunciation by querying the pronunciation dictionary. The transition probabilities of adjacent words may be obtained from the language model, and the probability model of the phonemes may be obtained from the acoustic model, thereby generating a probability model of the word.
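To make the linking role of the pronunciation dictionary concrete, a toy sketch follows; the dictionary entries and phoneme symbols are invented placeholders, not taken from the patent.

```python
# Hypothetical pronunciation dictionary: each word maps to one or more
# phoneme sequences, connecting the language model's words with the
# acoustic model's phoneme HMMs.
PRONUNCIATIONS = {
    "I":    [("AY",)],
    "want": [("W", "AA", "N", "T"), ("W", "AO", "N", "T")],  # two variants
    "to":   [("T", "UW"), ("T", "AH")],
    "eat":  [("IY", "T")],
}

def first_pronunciation(words):
    """Expand a word sequence into phonemes using each word's first variant."""
    return [phone for w in words for phone in PRONUNCIATIONS[w][0]]

print(first_pronunciation(["I", "want", "to", "eat"]))
# ['AY', 'W', 'AA', 'N', 'T', 'T', 'UW', 'IY', 'T']
```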
Substep S23, respectively calculating the occurrence probabilities of the plurality of speech candidate words;
In a preferred example of the embodiment of the present invention, the occurrence probabilities of the plurality of speech candidate words may be calculated by the following formula:

$$P(X \mid W) = \sum_{Q} P(X \mid Q)\, P(Q \mid W)$$

wherein $X = x_1 x_2 \ldots x_T$ is the acoustic feature sequence of the multi-frame speech signal and $W$ is a language candidate word.
Suppose $Q$ is the Hidden Markov Model (HMM) sequence corresponding to $W$, and $S = s_1 s_2 \ldots s_T$ is a corresponding HMM state sequence; then

$$P(X \mid Q) = \sum_{S} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(x_t)$$

wherein the mapping from $W$ to $Q$ is given by the pronunciation dictionary, while the sum over the state sequences $S$ is handled with the Viterbi approximation (keeping only the most probable state sequence). The probability of the state output per frame, $b_{s_t}(x_t)$, is described using a Gaussian Mixture Model (GMM):

$$b_{s_t}(x_t) = \sum_{i=1}^{N} c_i\, \mathcal{N}(x_t; \mu_i, \Sigma_i)$$

wherein $N$ is the number of Gaussians, $\mathcal{N}(x_t; \mu_i, \Sigma_i)$ is the $i$-th Gaussian component of state $s_t$, and $c_i$ are the corresponding Gaussian component weights.
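The GMM output probability above translates directly into code; the diagonal-covariance form and all parameter values below are assumptions used for illustration.

```python
import numpy as np

def gmm_log_output_prob(x, weights, means, variances):
    """log b(x) = log sum_i c_i * N(x; mu_i, Sigma_i), diagonal covariances."""
    x = np.asarray(x, dtype=float)
    log_terms = []
    for c, mu, var in zip(weights, means, variances):
        log_gauss = (-0.5 * np.sum(np.log(2.0 * np.pi * var))
                     - 0.5 * np.sum((x - mu) ** 2 / var))
        log_terms.append(np.log(c) + log_gauss)
    return np.logaddexp.reduce(log_terms)  # numerically stable log-sum-exp

# Two-component GMM over 3-dimensional feature frames (made-up parameters).
print(gmm_log_output_prob([0.1, -0.2, 0.3],
                          weights=[0.6, 0.4],
                          means=[np.zeros(3), np.ones(3)],
                          variances=[np.ones(3), 2.0 * np.ones(3)]))
```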
Substep S24, calculating connection probabilities between the plurality of speech candidate words;
In a preferred example of the embodiment of the present invention, the connection probability between the plurality of speech candidate words may be calculated by the following formula:

$$P(W) = P(w_1 w_2 \ldots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1})$$

wherein $w_1, w_2, \ldots, w_n$ are the language candidate words.
Substep S25, combining the plurality of speech candidate words into a plurality of candidate recognition texts corresponding to the speech data;
Each speech signal corresponds to a plurality of language candidate words; therefore, a plurality of combinations of candidate recognition texts are obtained.
And a substep S26 of calculating a first recognition score of the corresponding candidate recognition text by using the occurrence probabilities of the plurality of speech candidate words and the connection probabilities between the plurality of speech candidate words, respectively.
In the embodiment of the present invention, under the guidance of the acoustic model and the general language model at the server, the maximum a posteriori (MAP) probability is calculated as the first recognition score of a candidate recognition text.
In a preferred example of the embodiment of the present invention, the first recognition score may be calculated by the following formula:

$$s = \log P(X \mid W) + \lambda \log P_{G\text{-}LM}(W) + WP$$

wherein $P(X \mid W)$ is the occurrence probability of the speech candidate words, $P_{G\text{-}LM}(W)$ is the connection probability of the candidate recognition text under the general language model, $\lambda$ is a weight, and $WP$ is a word insertion penalty parameter (representing an insertion penalty, used to reduce insertion errors as much as possible).
In particular, by Bayes' rule the posterior probability satisfies $P(W \mid X) \propto P(X \mid W) \cdot P(W)$, so maximizing the score above corresponds to MAP decoding with a weighted language model and an insertion penalty.
step 603, sending the first M candidate recognition texts with the highest first recognition scores and the recognition scores thereof to the client; the client is used for calculating a second recognition score of the candidate recognition texts by adopting the personalized text data of the current user, calculating a third recognition score of the M candidate recognition texts by adopting the first recognition score and the second recognition score, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition score, and displaying the first N candidate recognition texts according to the confidence degrees.
In a preferred embodiment of the present invention, the recognition information may further include the occurrence probabilities of the plurality of speech candidate words.
Referring to fig. 7, a flowchart illustrating steps of embodiment 3 of the speech input method of the present invention is shown, which may specifically include the following steps:
step 701, receiving voice data sent by a client;
step 702, recognizing a plurality of candidate recognition texts and recognition information thereof from the voice data; the identification information comprises a first identification score;
in a preferred example of the embodiment of the present invention, the step 702 may specifically include the following sub-steps:
a substep S31 of extracting acoustic features of a plurality of frames of voice signals in the voice data;
a substep S32 of recognizing a plurality of voice candidate words corresponding to the plurality of frames of voice information by respectively adopting the acoustic features;
substep S33, respectively calculating the occurrence probabilities of the plurality of speech candidate words;
In a specific implementation, the occurrence probabilities of the plurality of speech candidate words may be calculated by the following formula:

$$P(X \mid W) = \sum_{Q} P(X \mid Q)\, P(Q \mid W)$$

wherein $X$ is the acoustic feature sequence of the multi-frame speech signal and $W$ is a language candidate word.
Substep S34, calculating connection probabilities between the plurality of speech candidate words;
In practical applications, the connection probability between the speech candidate words can be calculated by the following formula:

$$P(W) = P(w_1 w_2 \ldots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1})$$

wherein $w_1, w_2, \ldots, w_n$ are the language candidate words.
Substep S35, combining the plurality of speech candidate words into a plurality of candidate recognition texts corresponding to the speech data;
Substep S36, calculating a first recognition score of the corresponding candidate recognition text by using the occurrence probabilities of the plurality of speech candidate words and the connection probabilities between the plurality of speech candidate words, respectively.
In a particular implementation, the first recognition score may be calculated by the following formula:

$$s = \log P(X \mid W) + \lambda \log P_{G\text{-}LM}(W) + WP$$

wherein $P(X \mid W)$ is the occurrence probability of the speech candidate words, $P_{G\text{-}LM}(W)$ is the connection probability of the candidate recognition text under the general language model, $\lambda$ is the weight, and $WP$ is the word insertion penalty parameter.
In a preferred embodiment of the present invention, the recognition score further includes a probability of occurrence of the plurality of speech candidate words.
Step 703, calculating second recognition scores of the first M candidate recognition texts with the highest first recognition scores by adopting the personalized text data of the current user;
By applying the embodiment of the present invention, the personalized text data of the user can be collected in advance to train a User Language Model (U-LM), and the user language model can then be used to calculate the second recognition scores of the first M candidate recognition texts.
As a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
inputting behavior text data, a user-defined word bank, equipment text data and a voice recognition text with confidence coefficient higher than a preset threshold value.
As shown in fig. 3, the input behavior text data may include information such as the user's pinyin input corpus, which records the user's everyday typing content and behavior information such as backspace and space usage. In practical application, the weight of the input behavior text data gradually decreases over time t. Here the weight refers to the contribution of information from different time periods to model training: the earlier the information was recorded, the smaller its contribution, and the relation between weight and time can be modeled with an appropriate mathematical function.
The custom thesaurus can record information such as custom words and sentences generated or set by a user when the user uses the input method tool.
The device text data may be text data in a device (e.g., computer, cell phone, tablet, etc.) that the user is using or using, such as an address book, a music playlist, an APP (application) list, and so on.
The speech recognition text may be the recognition text of the user's speech input, and its source may be the corpus of the voice input method. In practical application, a speech recognition text can be used for training the user language model only when its confidence CM is greater than a preset threshold Thresh; otherwise it is discarded. Because the confidence is used, the speech recognition text may be recognition text produced by the embodiment of the present invention, or recognition text from another scheme that returns a confidence, as long as the meaning of the confidence is consistent with the embodiment of the present invention; for example, the confidence lies in the range 0 to 1, and the closer it is to 1, the more credible the recognition text. Preferably, the preset threshold Thresh may take a value of 0.85-0.9. The weight of the speech recognition text may also gradually decrease over time t.
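A small sketch of this corpus filtering might look as follows; the exponential decay used for the time-dependent weight is only one possible choice of the mathematical function mentioned above, and the half-life is an invented parameter.

```python
import time

THRESH = 0.85            # preset confidence threshold, per the 0.85-0.9 range above
HALF_LIFE_DAYS = 30.0    # decay speed of the training weight: an assumption

def training_weight(confidence, timestamp, now=None):
    """Return the weight of one speech recognition text for U-LM training.

    Texts with confidence <= THRESH are discarded (weight 0); accepted
    texts contribute less the older they are.
    """
    if confidence <= THRESH:
        return 0.0
    now = time.time() if now is None else now
    age_days = (now - timestamp) / 86400.0
    return 0.5 ** (age_days / HALF_LIFE_DAYS)  # halves every HALF_LIFE_DAYS

print(training_weight(0.92, time.time() - 30 * 86400))  # ~0.5 after one half-life
```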
A user language model is obtained by training on the personalized text data of the user. The trained model can be an N-Gram model (a language model commonly used in large-vocabulary continuous speech recognition), a neural-network-based language model, and the like, and the learning of the user language model can be performed periodically or when the client is idle.
Taking the N-Gram model as an example as applied to the embodiment of the present invention, the personalized text data of the user can be segmented into words and word frequency counts collected; specifically, the word frequency counts may include the total word frequency with which a participle appears in the personalized text data of the user and the word frequency with which it appears before one or more other participles.
In particular implementations, word segmentation may be performed based on string matching. That is, the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). In practical word segmentation systems, mechanical word segmentation is used as the initial segmentation step, and various other kinds of linguistic information are then used to further improve the segmentation accuracy; a minimal sketch of the mechanical step follows.
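The sketch below shows forward maximum matching, one common string-matching strategy; the tiny dictionary is a placeholder standing in for the preset machine dictionary.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy string-matching segmentation: at each position, take the longest
    dictionary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# "I want to eat Chinese food" in Chinese, with a toy machine dictionary.
print(forward_max_match("我想吃中餐", {"我", "想", "吃", "中餐"}))
# ['我', '想', '吃', '中餐']
```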
The invention can train the user language model and reorder the candidate recognition texts with various kinds of user personalized text data, such as input behavior text data, a user-defined word bank, device text data and speech recognition texts whose confidence is higher than a preset threshold. The user personalized text data has high coverage and strong practicability, and its weight can change over time, which further improves the recognition accuracy of the candidate texts.
In a preferred example of the embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition information may further include occurrence probabilities of the plurality of speech candidate words; the step 703 may specifically include the following sub-steps:
substep S41, performing word segmentation on the first M candidate recognition texts to obtain a first word segmentation;
It should be noted that the method for segmenting the candidate recognition texts must be consistent with the method for segmenting the personalized text data of the user. For example, if the user's personalized text data is segmented based on character string matching, the candidate recognition texts must be segmented based on the same character string matching.
Substep S42, respectively mapping the first participles into preset second participles, where the second participles are participles of personalized text data of a current user, and the second participles have a word frequency;
the second segmentation can be a segmentation obtained when the personalized text data of the current user is subjected to user language model training, and the first segmentation and the second segmentation are substantially the same.
Substep S43, finding the occurrence probability of the first participle by using the second participle respectively; the occurrence probability is a ratio of a first word frequency to a second word frequency, wherein the first word frequency is a second word frequency corresponding to a current first word, the word frequency appears behind a second word corresponding to one or more first words before the current first word, and the second word frequency is a total word frequency of the second word corresponding to the one or more first words before the current first word;
substep S44, performing multiplication operation by using the occurrence probability of the first participle to obtain the connection probability of the candidate recognition text;
The N-Gram model may be based on a Markov assumption that the occurrence of a word depends only on a limited number of the words occurring before it. For a sentence $T$ composed of the word sequence $W_1, W_2, W_3, \ldots, W_n$, the connection probability of $T$ is

$$P(T) = P(W_1 W_2 W_3 \ldots W_n) = P(W_1)\, P(W_2 \mid W_1)\, P(W_3 \mid W_1 W_2) \cdots P(W_n \mid W_1 W_2 \ldots W_{n-1})$$

The model is called a bigram if the occurrence of a word depends only on the one word appearing before it, i.e.

$$P(T) \approx P(W_1)\, P(W_2 \mid W_1)\, P(W_3 \mid W_2) \cdots P(W_n \mid W_{n-1})$$

The model is called a trigram if the occurrence of a word depends only on the two words appearing before it. In practical applications of the N-Gram model, the bigram and the trigram dominate; models of order four and above are rarely used, because training them requires a far larger corpus, data sparsity becomes severe, the time complexity is high, and the accuracy gain is small.
The following description takes the candidate recognition text "I want to eat Chinese food" as an example:
the candidate recognition text "I want to eat Chinese food" is segmented to obtain the first participles "I", "want", "to", "eat", "Chinese" and "food"; in the user language model, the preset second participles and their word frequencies (covering the model vocabulary, which also includes "lunch") are shown in Table 1 and Table 2.
TABLE 1 statistical table of total word frequency of second participles
Second participle | Total word frequency |
I | 3437 |
want | 1215 |
to | 3256 |
eat | 938 |
Chinese | 213 |
food | 1506 |
lunch | 459 |
TABLE 2 statistical table of word frequencies for which a second participle appears before one or more other participles
Second participle | I | want | to | eat | Chinese | food | lunch |
I | 8 | 1087 | 0 | 13 | 0 | 0 | 0 |
want | 3 | 0 | 786 | 0 | 6 | 8 | 6 |
to | 3 | 0 | 10 | 860 | 3 | 0 | 12 |
eat | 0 | 0 | 2 | 0 | 19 | 2 | 52 |
Chinese | 2 | 0 | 0 | 0 | 0 | 120 | 1 |
food | 19 | 0 | 17 | 0 | 0 | 0 | 0 |
lunch | 4 | 0 | 0 | 0 | 0 | 1 | 0 |
For example, 1087 in the row "I" and the column "want" indicates that the word frequency with which "I" appears before "want" in the personalized text data of the user is 1087.
The connection probability of the candidate recognition text "I want to eat Chinese food" is as follows:
P(I want to eat Chinese food)
=P(I)*P(want|I)*P(to|want)*P(eat|to)*P(Chinese|eat)*P(food|Chinese)
=0.25*(1087/3437)*(786/1215)*(860/3256)*(19/938)*(120/213)
=0.000154171
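The bigram computation above can be checked with a few lines of Python; only the counts from Tables 1 and 2 that this sentence needs are included, and the unigram probability P(I) = 0.25 is taken as given in the example.

```python
# Total word frequencies (Table 1) and "row word appears before column word"
# frequencies (Table 2), restricted to the pairs used by this sentence.
total = {"I": 3437, "want": 1215, "to": 3256, "eat": 938, "Chinese": 213}
before = {("I", "want"): 1087, ("want", "to"): 786, ("to", "eat"): 860,
          ("eat", "Chinese"): 19, ("Chinese", "food"): 120}

p = 0.25  # P(I), as given in the example
words = ["I", "want", "to", "eat", "Chinese", "food"]
for prev, cur in zip(words, words[1:]):
    p *= before[(prev, cur)] / total[prev]  # bigram P(cur | prev)

print(p)  # ~0.000154, the connection probability computed above
```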
Substep S45, calculating a second recognition score of the candidate recognition text by using the occurrence probabilities of the plurality of speech candidate words and the connection probability of the candidate recognition text, respectively.
In a specific implementation, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:

$$u = \log P(X \mid W) + \lambda \log P_{U\text{-}LM}(W) + WP$$

wherein $P(X \mid W)$ is the occurrence probability of the plurality of speech candidate words, $P_{U\text{-}LM}(W)$ is the connection probability of the candidate recognition text under the user language model, $\lambda$ is the weight, and $WP$ is a word insertion penalty parameter.
$WP$ represents an insertion penalty, used to minimize insertion errors.
Step 704, calculating third recognition scores of the M candidate recognition texts by using the first recognition scores and the second recognition scores;
in practical applications, the third recognition score may be calculated using the following formula:
$$MS(i) = \alpha \cdot s_i + \beta \cdot u_i$$

wherein $MS(i)$ is the third recognition score of the $i$-th candidate recognition text, $s_i$ is the first recognition score of the $i$-th candidate recognition text, $u_i$ is the second recognition score of the $i$-th candidate recognition text, and $\alpha$ and $\beta$ are non-negative numbers.
Further, $\alpha + \beta = 1$. $\alpha$ and $\beta$ need to be adjusted according to the actual model accuracy; initially $\alpha > \beta$, and the value of $\beta$ may gradually increase as the accuracy of the user language model U-LM gradually improves.
Step 705, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
In a specific implementation, the embodiment of the present invention may calculate the confidence in any suitable manner, and the embodiment of the present invention is not limited in this respect.
In a preferred example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
Of course, there are various methods for calculating the confidence level, the above methods for calculating the confidence level are only examples, and when the embodiment of the present invention is implemented, other methods for calculating the confidence level may be set according to actual situations, which is not limited in the embodiment of the present invention.
It should be noted that in the given example, the third recognition score and the confidence level may be ranked consistently, but if the confidence level is calculated based on the lattice word posterior probability, the third recognition score and the confidence level may be ranked inconsistently.
Step 706, sending the first N candidate recognition texts and their confidence degrees to a client; and the client is used for displaying the first N candidate recognition texts according to the confidence degrees.
As shown in fig. 8, in the embodiment of the present invention, voice data is collected at the client (corresponding to "voice collection" shown in fig. 8); the speech recognition system deployed at the server, guided by an Acoustic Model (AM) and a General Language Model (G-LM), obtains a plurality of candidate recognition texts and their recognition scores for the received voice data (corresponding to "voice recognition" shown in fig. 8); the recognition results are then rearranged with a pre-trained User Language Model (U-LM) (corresponding to "M optimal results reordering" shown in fig. 8); finally, the rearranged candidate recognition texts are sent to the client for display (corresponding to "top-N optimal results on screen" shown in fig. 8).
By applying the embodiment of the present invention, the client may collect the personalized text data of the user in advance, and upload the personalized text data of the user to the server at a preset time (for example, 6 am every day, 12 pm every saturday, etc.) (corresponding to "uploading personalized text data of the user" shown in fig. 8).
The server can receive the personalized text data of the user sent by the client and perform user language model training. Personalized text data of the user can also be actively collected from data bound to the user at the client and the server. In order to protect the user's privacy and right to know, the server may acquire file information on the client only after being authorized. Specifically, it may first be checked whether the user of the client has joined a designated plan; if so, the user is deemed to have authorized the server's information acquisition, and the server may proceed with the acquisition process; if the user of the client has not joined the designated plan, the user is deemed not to have authorized the acquisition, and the server does not acquire information from the client.
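The opt-in gate and the scheduled upload could be sketched as follows; the client and server objects and their method names are hypothetical interfaces invented for illustration.

```python
import time

def maybe_upload_personalized_data(client, server, upload_hour=6):
    """Upload the user's personalized text data at a preset time, but only
    when the user has joined the designated plan (i.e. has authorized it)."""
    if not server.user_joined_designated_plan(client.user_id):
        return  # not authorized: the server must not collect client data
    if time.localtime().tm_hour != upload_hour:
        return  # outside the preset upload window (e.g. 6 am daily)
    data = client.collect_personalized_text_data()
    server.receive_personalized_text_data(client.user_id, data)
```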
It should be noted that, since method embodiment 3 is basically similar to method embodiments 1 and 2, the description is relatively simple, and the relevant points can be described with reference to parts of method embodiments 1 and 2, which are not described in detail herein.
In a preferred embodiment of the present invention, the steps of "acquiring voice data and sending the voice data to the server" and "displaying the first N candidate recognition texts according to the confidence" are performed at the client, and the steps of "recognizing a plurality of candidate recognition texts corresponding to the voice data and recognition information thereof" and "sending the first M candidate recognition texts with the highest first recognition scores and their recognition scores to the client" are performed at the server, except that:
the step of calculating the second recognition scores of the first M candidate recognized texts by using the personalized text data of the current user may be performed at the server, and then the step of calculating the third recognition scores of the first M candidate recognized texts by using the first recognition scores and the second recognition scores, and calculating the confidence degrees of the first N candidate recognized texts with the highest third recognition scores may be performed at the client;
after the client executes the step of calculating the second recognition scores of the first M candidate recognition texts by adopting the personalized text data of the current user, returning to the server to execute the step of calculating the third recognition scores of the first M candidate recognition texts by adopting the first recognition score and the second recognition score, and finally returning to the client to execute the step of calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
after the client executes the steps of calculating the second recognition scores of the first M candidate recognition texts by using the personalized text data of the current user and calculating the third recognition scores of the first M candidate recognition texts by using the first recognition score and the second recognition score, returning to the server to execute the step of calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition score;
after the server executes the steps of calculating the second recognition scores of the first M candidate recognized texts by using the personalized text data of the current user and calculating the third recognition scores of the first M candidate recognized texts by using the first recognition score and the second recognition score, returning to the client to execute the step of calculating the confidence degrees of the first N candidate recognized texts with the highest third recognition score;
after the server executes the step of calculating the second recognition scores of the first M candidate recognition texts by using the personalized text data of the current user, returning to the client to execute the step of calculating the third recognition scores of the first M candidate recognition texts by using the first recognition score and the second recognition score, and finally returning to the server to execute the step of calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
the step of calculating the second recognition scores of the first M candidate recognized texts using the personalized text data of the current user may be performed at the client, and then the steps of calculating the third recognition scores of the first M candidate recognized texts using the first recognition score and the second recognition score, and calculating the confidence degrees of the first N candidate recognized texts with the highest third recognition score may be performed at the server.
It should be noted that the implementation of the above embodiment is basically similar to that of method embodiments 1, 2, and 3, so that the description is relatively simple, and the relevant points can be described with reference to the parts of method embodiments 1, 2, and 3, and the embodiment of the present invention is not described in detail herein.
Referring to fig. 9, a flowchart illustrating steps of embodiment 4 of the speech input method of the present invention is shown, which may specifically include the following steps:
step 901, collecting voice data;
step 902, sending the voice data to a server;
step 903, receiving the top M candidate recognition texts with the highest first recognition scores in each piece of voice sub-data returned by the server, together with their recognition scores, where the voice sub-data are the plurality of pieces into which the server cuts the voice data, and the recognition scores include the first recognition scores;
in the embodiment of the present invention, the server adopts continuous speech recognition technology: the voice data is divided into a plurality of pieces of voice sub-data through silence detection; for each piece of voice sub-data, a plurality of candidate recognition texts are recognized through the acoustic model and the language model, the candidate recognition texts are then reordered through the user language model, and the confidences are calculated respectively.
Silence detection can detect silences in the voice data in temporal order, and the input voice data is segmented at silences of a certain length into a plurality of pieces of voice sub-data. For example, for the voice data "I want to eat noodles now [0.2 second silence] but the dining room does not sell noodles [0.3 second silence] let's go out to eat", silence detection judges the silence lengths in the voice data and, according to a certain threshold (in this example, 0.15 second is chosen as the threshold for deciding whether to cut), cuts the voice into 3 pieces of voice sub-data: "I want to eat noodles now", "but the dining room does not sell noodles" and "let's go out to eat".
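An energy-based realization of this silence-driven cutting is sketched below; the 0.15 s minimum silence follows the example above, while the frame size and energy threshold are assumptions.

```python
import numpy as np

def split_on_silence(signal, sample_rate, frame_ms=10,
                     energy_threshold=1e-4, min_silence_s=0.15):
    """Cut voice data into voice sub-data at silences of >= min_silence_s."""
    flen = int(sample_rate * frame_ms / 1000)
    num_frames = len(signal) // flen
    silent = [np.mean(signal[i * flen:(i + 1) * flen] ** 2) < energy_threshold
              for i in range(num_frames)]
    min_frames = int(min_silence_s * 1000 / frame_ms)

    segments, seg_start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        if is_silent:
            run += 1
            if run == min_frames:               # silence long enough: close segment
                end = (i + 1 - run) * flen
                if end > seg_start:
                    segments.append(signal[seg_start:end])
        else:
            if run >= min_frames:               # speech resumes: open a new segment
                seg_start = i * flen
            run = 0
    if run < min_frames and seg_start < num_frames * flen:
        segments.append(signal[seg_start:num_frames * flen])  # trailing speech
    return segments
```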
Step 904, respectively adopting the personalized text data of the current user to calculate second recognition scores of the first M candidate recognition texts corresponding to each voice subdata;
in a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
inputting behavior text data, a user-defined word bank, equipment text data and a voice recognition text with confidence coefficient higher than a preset threshold value.
In a preferred embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition information may further include occurrence probabilities of the plurality of speech candidate words;
the step 904 may specifically include the following sub-steps:
substep S51, performing word segmentation on the first M candidate recognition texts corresponding to each voice subdata respectively to obtain first word segmentation;
substep S52, respectively mapping the first participles into preset second participles, where the second participles are participles of personalized text data of a current user, and the second participles have a word frequency;
substep S53, finding the occurrence probability of each first participle by using the second participles; the occurrence probability is the ratio of a first word frequency to a second word frequency, wherein the first word frequency may be the frequency with which the second participle corresponding to the current first participle appears behind the second participle(s) corresponding to the one or more preceding first participles, and the second word frequency is the total word frequency of the second participle(s) corresponding to those preceding first participles;
substep S54, performing multiplication operation by using the occurrence probabilities of the first participles respectively to obtain connection probabilities of the candidate recognition texts;
and a substep S55 of calculating a second recognition score of the candidate recognition text using the occurrence probabilities of the plurality of speech candidate words and the connection probability of the candidate recognition text.
In practical applications, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:

$$u = \log P(X \mid W) + \lambda \log P_{U\text{-}LM}(W) + WP$$

wherein $P(X \mid W)$ is the occurrence probability of the plurality of speech candidate words, $P_{U\text{-}LM}(W)$ is the connection probability of the candidate recognition text under the user language model, $\lambda$ is the weight, and $WP$ is a word insertion penalty parameter.
Step 905, calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition score and the second recognition score;
in a particular implementation, the third recognition score may be calculated using the following formula:
$$MS(i) = \alpha \cdot s_i + \beta \cdot u_i$$

wherein $MS(i)$ is the third recognition score of the $i$-th candidate recognition text, $s_i$ is the first recognition score of the $i$-th candidate recognition text, $u_i$ is the second recognition score of the $i$-th candidate recognition text, and $\alpha$ and $\beta$ are non-negative numbers.
Step 906, respectively calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each piece of voice subdata;
in a preferred example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
Step 907, respectively displaying the candidate recognition texts with the highest confidence level in each voice subdata.
In practical application, the embodiment of the present invention may display the candidate recognition text with the highest confidence in any form, and the embodiment of the present invention is not limited in this respect.
In a preferred embodiment of the present invention, the method may further include the following steps:
step 908, when the candidate recognition text with the highest confidence coefficient is triggered, displaying other candidate recognition texts; and the other texts are candidate recognition texts except the candidate recognition text with the highest confidence coefficient in the first N candidate recognition texts.
When the user touches and clicks or triggers the candidate recognition text with the highest confidence coefficient displayed currently, other candidate recognition texts can be displayed in a pull-down menu mode, a pop-up menu mode and the like.
Referring to fig. 10, a flowchart illustrating steps of embodiment 5 of the speech input method of the present invention is shown, which may specifically include the following steps:
step 1001, receiving voice data sent by a client;
step 1002, segmenting the voice data into a plurality of voice subdata;
step 1003, respectively identifying a plurality of candidate identification texts and identification information thereof corresponding to each voice subdata; the identification information comprises a first identification score;
in a preferred embodiment of the present invention, the step 1003 may specifically include the following sub-steps:
substep S61, respectively extracting the acoustic features of the multi-frame voice signals in each voice sub-data;
a substep S62 of recognizing a plurality of voice candidate words corresponding to the plurality of frames of voice information by respectively adopting the acoustic features;
substep S63, respectively calculating the occurrence probabilities of the plurality of speech candidate words;
In a specific implementation, the occurrence probabilities of the plurality of speech candidate words may be calculated by the following formula:

$$P(X \mid W) = \sum_{Q} P(X \mid Q)\, P(Q \mid W)$$

wherein $X$ is the acoustic feature sequence of the multi-frame speech signal and $W$ is a language candidate word.
Substep S64, respectively calculating connection probabilities between the plurality of speech candidate words;
In practical applications, the connection probability between the speech candidate words can be calculated by the following formula:

$$P(W) = P(w_1 w_2 \ldots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1})$$

wherein $w_1, w_2, \ldots, w_n$ are the language candidate words.
A substep S65, combining the plurality of speech candidate words into a plurality of candidate recognition texts corresponding to the speech data respectively;
and a substep S66 of calculating a first recognition score of the corresponding candidate recognition text by using the occurrence probabilities of the plurality of speech candidate words and the connection probabilities between the plurality of speech candidate words, respectively.
In a particular implementation, the first recognition score may be calculated by the following formula:

$$s = \log P(X \mid W) + \lambda \log P_{G\text{-}LM}(W) + WP$$

wherein $P(X \mid W)$ is the occurrence probability of the speech candidate words, $P_{G\text{-}LM}(W)$ is the connection probability of the candidate recognition text, $\lambda$ is the weight, and $WP$ is the word insertion penalty parameter.
Step 1004, sending the top M candidate recognition texts with the highest first recognition scores in each piece of voice sub-data, together with their recognition scores, to the client; the client is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user, calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition scores and the second recognition scores, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata, and respectively displaying the candidate recognition texts with the highest confidence degree in each voice subdata.
In a preferred embodiment of the present invention, the recognition score may further include a probability of occurrence of the plurality of speech candidate words.
It should be noted that, since the application of the acoustic model, the general language model and the user language model of each piece of speech sub data in method embodiments 4 and 5 is basically similar to that in method embodiments 1 and 2, the description is relatively simple, and relevant points can be described with reference to parts of method embodiments 1 and 2, which is not described in detail herein.
Referring to fig. 11, a flowchart illustrating steps of embodiment 6 of the speech input method of the present invention is shown, which may specifically include the following steps:
step 1101, receiving voice data sent by a client;
step 1102, segmenting the voice data into a plurality of voice subdata;
step 1103, respectively identifying a plurality of candidate identification texts and identification information thereof of each voice subdata; the identification information comprises a first identification score;
in a preferred embodiment of the present invention, the step 1103 may specifically include the following sub-steps:
substep S71, respectively extracting the acoustic features of the multi-frame voice signals in each voice sub-data;
a substep S72 of recognizing a plurality of voice candidate words corresponding to the plurality of frames of voice information by respectively adopting the acoustic features;
substep S73, respectively calculating the occurrence probabilities of the plurality of speech candidate words;
In a specific implementation, the occurrence probabilities of the plurality of speech candidate words may be calculated by the following formula:

$$P(X \mid W) = \sum_{Q} P(X \mid Q)\, P(Q \mid W)$$

wherein $X$ is the acoustic feature sequence of the multi-frame speech signal and $W$ is a language candidate word.
Substep S74, respectively calculating connection probabilities between the plurality of speech candidate words;
In practical applications, the connection probability between the speech candidate words can be calculated by the following formula:

$$P(W) = P(w_1 w_2 \ldots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1})$$

wherein $w_1, w_2, \ldots, w_n$ are the language candidate words.
A substep S75, combining the plurality of speech candidate words into a plurality of candidate recognition texts corresponding to the speech data respectively;
and a substep S76 of calculating a first recognition score of the corresponding candidate recognition text by using the occurrence probabilities of the plurality of speech candidate words and the connection probabilities between the plurality of speech candidate words, respectively.
In a particular implementation, the first recognition score may be calculated by the following formula:

$$s = \log P(X \mid W) + \lambda \log P_{G\text{-}LM}(W) + WP$$

wherein $P(X \mid W)$ is the occurrence probability of the speech candidate words, $P_{G\text{-}LM}(W)$ is the connection probability of the candidate recognition text, $\lambda$ is the weight, and $WP$ is the word insertion penalty parameter.
Step 1104, respectively calculating second recognition scores of the first M candidate recognition texts with the highest first recognition score corresponding to each voice subdata by using the personalized text data of the current user;
in a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
inputting behavior text data, a user-defined word bank, equipment text data and a voice recognition text with confidence coefficient higher than a preset threshold value.
In a preferred embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition score may further include occurrence probabilities of the plurality of speech candidate words;
the step 1104 may specifically include the following sub-steps:
substep S81, performing word segmentation on the first M candidate recognition texts corresponding to each voice subdata respectively to obtain first word segmentation;
substep S82, respectively mapping the first participles into preset second participles, where the second participles are participles of personalized text data of a current user, and the second participles have a word frequency;
substep S83, finding the occurrence probability of each first participle by using the second participles; the occurrence probability is the ratio of a first word frequency to a second word frequency, wherein the first word frequency may be the frequency with which the second participle corresponding to the current first participle appears behind the second participle(s) corresponding to the one or more preceding first participles, and the second word frequency may be the total word frequency of the second participle(s) corresponding to those preceding first participles;
substep S84, respectively multiplying the occurrence probabilities to obtain connection probabilities of the candidate recognition texts;
and a substep S85 of calculating a second recognition score of the candidate recognition text using the occurrence probabilities of the plurality of speech candidate words and the connection probability of the candidate recognition text.
In practical applications, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:

$$u = \log P(X \mid W) + \lambda \log P_{U\text{-}LM}(W) + WP$$

wherein $P(X \mid W)$ is the occurrence probability of the plurality of speech candidate words, $P_{U\text{-}LM}(W)$ is the connection probability of the candidate recognition text under the user language model, $\lambda$ is the weight, and $WP$ is a word insertion penalty parameter.
Step 1105, calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition score and the second recognition score;
in a particular implementation, the third recognition score may be calculated using the following formula:
$$MS(i) = \alpha \cdot s_i + \beta \cdot u_i$$

wherein $MS(i)$ is the third recognition score of the $i$-th candidate recognition text, $s_i$ is the first recognition score of the $i$-th candidate recognition text, $u_i$ is the second recognition score of the $i$-th candidate recognition text, and $\alpha$ and $\beta$ are non-negative numbers.
Step 1106, calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition score of each piece of voice subdata respectively;
in a preferred example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
Step 1107, the first N candidate recognition texts with the highest third recognition score of each voice subdata and the confidence thereof are sent to the client; and the client is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
It should be noted that, since the application of the acoustic model, the general language model and the user language model of each piece of speech sub data in method embodiment 6 is substantially similar to that in method embodiments 1 and 2, the description is relatively simple, and relevant points can be described with reference to parts of method embodiments 1 and 2, which is not described in detail herein.
In a preferred embodiment of the present invention, the steps of "collecting voice data and sending the voice data to the server" and "respectively showing the candidate recognition text with the highest confidence level in each voice sub-data" are executed at the client, and the steps of "receiving the voice data sent by the client", "segmenting the voice data into a plurality of voice sub-data", "respectively recognizing a plurality of candidate recognition texts and recognition information thereof corresponding to each voice sub-data", and "sending the top M candidate recognition texts with the highest first recognition scores in each voice sub-data and their recognition scores to the client" are executed at the server, except that:
the server may perform a step of "calculating second recognition scores of the first M candidate recognition texts corresponding to each voice sub-data by respectively using personalized text data of the current user", and then perform a step of "calculating third recognition scores of the first M candidate recognition texts corresponding to each voice sub-data by respectively using the first recognition scores and the second recognition scores", and "calculating confidence degrees of the first N candidate recognition texts having the highest third recognition scores of each voice sub-data" at the client;
or after the client executes the step of calculating the second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user, returning to the server to execute the step of calculating the third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition scores and the second recognition scores, and finally returning to the client to execute the step of calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata;
after the step of "calculating the second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user" and the step of calculating the third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition score and the second recognition score "are executed by the client, the step of" calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata by returning to the server "may be executed;
after the server executes the steps of calculating the second recognition scores of the first M candidate recognition texts corresponding to each piece of voice sub-data by respectively adopting the personalized text data of the current user and calculating the third recognition scores of the first M candidate recognition texts corresponding to each piece of voice sub-data by respectively adopting the first recognition score and the second recognition score, returning to the client to execute the step of calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each piece of voice sub-data respectively;
or after the server executes the step of calculating the second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user, returning to the client to execute the step of calculating the third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition scores and the second recognition scores, and finally returning to the server to execute the step of calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata;
the step of calculating the second recognition scores of the first M candidate recognition texts corresponding to each sub-speech data by respectively adopting the personalized text data of the current user may be performed at the client, and then the step of calculating the third recognition scores of the first M candidate recognition texts corresponding to each sub-speech data by respectively adopting the first recognition scores and the second recognition scores, and the step of calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each sub-speech data by respectively adopting the first recognition scores and the second recognition scores may be performed at the server.
It should be noted that the implementation of the above embodiment is basically similar to that of method embodiments 4, 5, and 6, so that the description is relatively simple, and the relevant points can be described with reference to the parts of method embodiments 4, 5, and 6, and the embodiment of the present invention is not described in detail herein.
For simplicity of explanation, the method embodiments are described as a series of acts or combinations, but those skilled in the art will appreciate that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the embodiments of the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 12, a block diagram of a structure of an embodiment 1 of the speech input system of the present invention is shown, which may specifically include the following modules:
a first voice data collection module 1201, configured to collect voice data;
a first voice data sending module 1202, configured to send the voice data to a server;
a first receiving module 1203, configured to receive the first M candidate recognition texts with the highest first recognition scores corresponding to the voice data recognized by the server and recognition information thereof, where the recognition information may include the first recognition scores;
a first score calculating module 1204, configured to calculate second recognition scores of the top M candidate recognition texts by using personalized text data of the current user;
a second score calculating module 1205, configured to calculate third recognition scores of the top M candidate recognition texts by using the first recognition score and the second recognition score;
a first confidence calculation module 1206, configured to calculate confidence of the top N candidate recognition texts with the highest third recognition score;
a first displaying module 1207, configured to display the top N candidate recognition texts according to the confidence degrees.
In a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
inputting behavior text data, a user-defined word bank, equipment text data and a voice recognition text with confidence coefficient higher than a preset threshold value.
In a preferred embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition score may further include occurrence probabilities of the plurality of speech candidate words;
the first score calculating module 1204 may specifically include the following sub-modules:
the first word segmentation sub-module is used for carrying out word segmentation on the first M candidate recognition texts to obtain first word segmentation;
the first mapping sub-module is used for respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of a current user, and the second participles have word frequency;
the first searching submodule is used for finding the occurrence probability of each first participle by using the second participles respectively; the occurrence probability may be the ratio of a first word frequency to a second word frequency, wherein the first word frequency may be the frequency with which the second participle corresponding to the current first participle appears behind the second participle(s) corresponding to the one or more preceding first participles, and the second word frequency may be the total word frequency of the second participle(s) corresponding to those preceding first participles;
the first connection probability obtaining submodule is used for carrying out multiplication operation by adopting the occurrence probability of the first participle to obtain the connection probability of the candidate recognition text;
and the first candidate recognition text score calculating sub-module is used for calculating a second recognition score of the candidate recognition text by respectively adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
In a preferred example of the embodiment of the present invention, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the candidate recognition text, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
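As a minimal illustration of the module chain above, the sketch below computes a second recognition score, assuming the mapping from first participles to second participles is the identity, using add-one smoothing for unseen pairs, and with purely illustrative statistics and weights:

```python
import math

# Hypothetical word-frequency statistics from the user's personalized text data.
USER_UNIGRAMS = {"schedule": 30, "a": 55, "meeting": 18, "tomorrow": 12}
USER_BIGRAMS = {("schedule", "a"): 20, ("a", "meeting"): 14, ("meeting", "tomorrow"): 7}

def connection_probability(words):
    """Product over adjacent words of: count of the current participle after
    its predecessor / total count of the predecessor (add-one smoothing is
    an assumption, not stated in the text)."""
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        count_after = USER_BIGRAMS.get((prev, cur), 0) + 1
        total_prev = USER_UNIGRAMS.get(prev, 0) + len(USER_UNIGRAMS)
        prob *= count_after / total_prev
    return prob

def second_recognition_score(word_probs, words, lam=0.8, wp=-0.5):
    """u = log P_occ + lam * log P_conn + n * WP, mirroring the formula above."""
    log_p_occ = sum(math.log(p) for p in word_probs)
    log_p_conn = math.log(connection_probability(words))
    return log_p_occ + lam * log_p_conn + wp * len(words)
```

For example, second_recognition_score([0.9, 0.8, 0.7, 0.6], ["schedule", "a", "meeting", "tomorrow"]) scores a four-word candidate against the user's own statistics.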
In a preferred example of the embodiment of the present invention, the third recognition score may be calculated using the following formula:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
In a preferred example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
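The fusion and confidence steps reduce to a few lines; the following sketch assumes positive scores and illustrative values for α, β and N:

```python
def rerank(candidates, alpha=0.6, beta=0.4, top_n=3):
    """candidates: list of (text, first_score, second_score) tuples.
    Returns the top N texts with confidences MS_i / sum(MS over the top N)."""
    scored = [(text, alpha * s + beta * u) for text, s, u in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_n]
    total = sum(ms for _, ms in top)
    return [(text, ms / total) for text, ms in top]

ranked = rerank([("candidate A", 3.2, 4.1),
                 ("candidate B", 3.4, 1.0),
                 ("candidate C", 2.1, 2.0)])
```

Normalizing over the top N rather than over all candidates keeps the displayed confidences comparable from one utterance to the next.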
Referring to fig. 13, a block diagram of a structure of an embodiment 2 of the speech input system of the present invention is shown, which may specifically include the following modules:
a first voice data receiving module 1301, configured to receive voice data sent by a client;
a first recognition module 1302, configured to recognize a plurality of candidate recognition texts corresponding to the voice data and recognition information thereof, where the recognition information may include a first recognition score;
a first sending module 1303, configured to send the first M candidate recognition texts with the highest first recognition scores and the recognition information thereof to the client; the client is used for calculating second recognition scores of the candidate recognition texts by adopting the personalized text data of the current user, calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores, and displaying the first N candidate recognition texts according to the confidence degrees.
In a preferred embodiment of the present invention, the first identifying module 1302 may specifically include the following sub-modules:
the first voice signal extraction submodule is used for extracting the acoustic characteristics of multi-frame voice signals in the voice data;
the first voice candidate word recognition sub-module is used for recognizing a plurality of voice candidate words corresponding to the multi-frame voice information by respectively adopting the acoustic characteristics;
the first occurrence probability calculation submodule is used for calculating the occurrence probability of the voice candidate words respectively;
the first connection probability calculation submodule is used for calculating the connection probability among the voice candidate words;
a first candidate recognition text combination sub-module, configured to combine the multiple speech candidate words into multiple candidate recognition texts corresponding to the speech data;
and the third score calculating submodule is used for calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the multiple voice candidate words and the connection probabilities among the multiple voice candidate words.
In a preferred example of the embodiment of the present invention, the probability of occurrence of the plurality of speech candidate words may be calculated by the following formula:
P_occ(w_i) = P(x_i | w_i)
wherein x_i is the acoustic feature of the multi-frame speech signal and w_i is a speech candidate word.
In a preferred example of the embodiment of the present invention, the connection probability between the plurality of speech candidate words may be calculated by the following formula:
P_conn(w_{i-1}, w_i) = P(w_i | w_{i-1})
wherein w_{i-1} and w_i are adjacent speech candidate words; the connection probability of a candidate recognition text is the product of these terms over its words.
In a preferred example of embodiment of the present invention, the first recognition score may be calculated by the following formula:
s = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the speech candidate words, P_conn is the product of the connection probabilities between the speech candidate words, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
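Under the reconstruction above, the server-side combination could look like the following sketch; the occurrence and connection probabilities are assumed to come from the acoustic model and the general language model, neither of which is shown:

```python
import math

def first_recognition_score(occurrence_probs, connection_probs, lam=0.8, wp=-0.5):
    """occurrence_probs: P(x_i | w_i) for each candidate word;
    connection_probs: P(w_i | w_{i-1}) for each adjacent word pair.
    Computes s = log P_occ + lam * log P_conn + n * WP."""
    log_p_occ = sum(math.log(p) for p in occurrence_probs)
    log_p_conn = sum(math.log(p) for p in connection_probs)
    return log_p_occ + lam * log_p_conn + wp * len(occurrence_probs)
```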
In a preferred example of the embodiment of the present invention, the recognition information further includes occurrence probabilities of the plurality of speech candidate words.
Referring to fig. 14, a block diagram of a structure of an embodiment 3 of the speech input system of the present invention is shown, which may specifically include the following modules:
a second voice data receiving module 1401, configured to receive voice data sent by a client;
a second recognition module 1402, configured to recognize a plurality of candidate recognition texts and recognition information thereof from the voice data; the identification information comprises a first identification score;
a fourth score calculating module 1403, configured to calculate, by using personalized text data of the current user, second identification scores of the top M candidate identification texts with the highest first identification score;
a fifth score calculating module 1404, configured to calculate third recognition scores of the top M candidate recognized texts by using the first recognition score and the second recognition score;
a second confidence module 1405, configured to calculate confidence levels of the top N candidate recognition texts with the highest third recognition score;
a second sending module 1406, configured to send the top N candidate recognition texts and their confidence levels to the client; and the client is used for displaying the first N candidate recognition texts according to the confidence degrees.
In a preferred embodiment of the present invention, the second recognition module 1402 may specifically include the following sub-modules:
the first voice signal extraction submodule is used for extracting the acoustic characteristics of multi-frame voice signals in the voice data;
the first voice candidate word recognition sub-module is used for recognizing a plurality of voice candidate words corresponding to the multi-frame voice information by respectively adopting the acoustic characteristics;
the first occurrence probability calculation submodule is used for calculating the occurrence probability of the voice candidate words respectively;
the first connection probability calculation submodule is used for calculating the connection probability among the voice candidate words;
a first candidate recognition text combination sub-module, configured to combine the multiple speech candidate words into multiple candidate recognition texts corresponding to the speech data;
and the third score calculating submodule is used for calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the multiple voice candidate words and the connection probabilities among the multiple voice candidate words.
In a preferred example of the embodiment of the present invention, the probability of occurrence of the plurality of speech candidate words may be calculated by the following formula:
P_occ(w_i) = P(x_i | w_i)
wherein x_i is the acoustic feature of the multi-frame speech signal and w_i is a speech candidate word.
In a preferred example of the embodiment of the present invention, the connection probability between the plurality of speech candidate words may be calculated by the following formula:
P_conn(w_{i-1}, w_i) = P(w_i | w_{i-1})
wherein w_{i-1} and w_i are adjacent speech candidate words; the connection probability of a candidate recognition text is the product of these terms over its words.
In a preferred example of embodiment of the present invention, the first recognition score may be calculated by the following formula:
s = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the speech candidate words, P_conn is the product of the connection probabilities between the speech candidate words, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
In a preferred example of the embodiment of the present invention, the identification information further includes occurrence probabilities of the plurality of speech candidate words.
In a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
input behavior text data, a user-defined word bank, device text data, and voice recognition texts with a confidence coefficient higher than a preset threshold value.
In a preferred embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition score may further include occurrence probabilities of the plurality of speech candidate words;
the fourth score calculating module 1403 may specifically include the following sub-modules:
the first word segmentation sub-module is used for carrying out word segmentation on the first M candidate recognition texts to obtain first word segmentation;
the first mapping sub-module is used for mapping the first participles into preset second participles respectively, wherein the second participles can be participles of personalized text data of a current user, and the second participles can have word frequency;
the first searching submodule is used for respectively adopting the second participles to search the occurrence probability of the first participles; the occurrence probability is a ratio of a first word frequency number to a second word frequency number, wherein the first word frequency number is the number of times the second participle corresponding to the current first participle appears behind the second participles corresponding to the one or more first participles before it, and the second word frequency number is the total word frequency number of the second participles corresponding to the one or more first participles before the current first participle;
the first connection probability obtaining submodule is used for carrying out multiplication operation by adopting the occurrence probability of the first participle to obtain the connection probability of the candidate recognition text;
and the first candidate recognition text score calculating sub-module is used for calculating a second recognition score of the candidate recognition text by respectively adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
In a preferred example of the embodiment of the present invention, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the candidate recognition text, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
In a preferred example of the embodiment of the present invention, the third recognition score may be calculated using the following formula:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
In a preferred example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
Referring to fig. 15, a block diagram of a voice input system embodiment 4 of the present invention is shown, which may specifically include the following modules:
a second voice data collection module 1501, configured to collect voice data;
a second voice data sending module 1502, configured to send the voice data to a server;
a second receiving module 1503, configured to receive the top M candidate recognition texts with the highest first recognition scores in each piece of voice sub-data returned by the server and the recognition scores thereof, where the voice sub-data is a plurality of voice sub-data into which the voice data is divided by the server, and the recognition scores include the first recognition scores;
a sixth score calculating module 1504, configured to calculate second recognition scores of the first M candidate recognition texts corresponding to each voice sub-data by using personalized text data of the current user, respectively;
a seventh score calculating module 1505, configured to calculate, by using the first recognition score and the second recognition score, a third recognition score of the first M candidate recognition texts corresponding to each piece of speech subdata;
a third confidence coefficient calculation module 1506, configured to calculate confidence coefficients of the top N candidate recognition texts with the highest third recognition score of each piece of speech sub data, respectively;
a second displaying module 1507, configured to respectively display the candidate recognition texts with the highest confidence level in each speech sub-data.
In a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
input behavior text data, a user-defined word bank, device text data, and voice recognition texts with a confidence coefficient higher than a preset threshold value.
In a preferred embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition score may further include occurrence probabilities of the plurality of speech candidate words;
the sixth score calculating module 1504 may specifically include the following sub-modules:
the second word segmentation submodule is used for segmenting the first M candidate recognition texts corresponding to each voice subdata respectively to obtain first segmented words;
the second mapping sub-module is used for mapping the first participles into preset second participles respectively, wherein the second participles can be participles of personalized text data of a current user and can have word frequency;
the second searching submodule is used for respectively adopting the second participles to search the occurrence probability of the first participles; the occurrence probability is a ratio of a first word frequency number to a second word frequency number, wherein the first word frequency number is the number of times the second participle corresponding to the current first participle appears behind the second participles corresponding to the one or more first participles before it, and the second word frequency number is the total word frequency number of the second participles corresponding to the one or more first participles before the current first participle;
the second connection probability obtaining submodule is used for respectively adopting the occurrence probability of the first participle to carry out multiplication operation so as to obtain the connection probability of the candidate recognition text;
and the second candidate recognition text score calculating sub-module is used for calculating a second recognition score of the candidate recognition text by adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
In a preferred example of the embodiment of the present invention, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the candidate recognition text, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
In a preferred example of the embodiment of the present invention, the third recognition score may be calculated using the following formula:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
In a preferred example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
In a preferred embodiment of the present invention, the following modules may be specifically included:
the third display module is used for displaying other candidate recognition texts when the candidate recognition text with the highest confidence coefficient is triggered; and the other texts are candidate recognition texts except the candidate recognition text with the highest confidence coefficient in the first N candidate recognition texts.
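Tying embodiment 4 together, here is a client-side sketch (all structures assumed) that shows the highest-confidence candidate per voice sub-data and keeps the remaining candidates for display when that candidate is triggered:

```python
def display_segments(segment_candidates):
    """segment_candidates: one list per voice sub-data, each a list of
    (text, confidence) pairs produced by the reranking steps above."""
    shown = []
    for idx, ranked in enumerate(segment_candidates):
        ranked = sorted(ranked, key=lambda pair: pair[1], reverse=True)
        best_text, best_conf = ranked[0]
        alternates = [text for text, _ in ranked[1:]]  # revealed when the best text is triggered
        print(f"segment {idx}: {best_text} (confidence {best_conf:.2f})")
        shown.append((best_text, alternates))
    return shown

display_segments([
    [("candidate A1", 0.62), ("candidate A2", 0.38)],
    [("candidate B1", 0.71), ("candidate B2", 0.29)],
])
```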
Referring to fig. 16, a block diagram of a structure of an embodiment 5 of the speech input system of the present invention is shown, which may specifically include the following modules:
a third voice data receiving module 1601, configured to receive voice data sent by a client;
a first voice data splitting module 1602, configured to split the voice data into a plurality of voice subdata;
a second identifying module 1603, configured to respectively identify a plurality of candidate identification texts and identification information thereof corresponding to each piece of speech sub-data; the identification information may include a first identification score;
a third sending module 1604, configured to send the first M candidate recognition texts with the highest first recognition scores in each piece of voice sub-data and the recognition scores thereof to the client; the client is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user, calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition scores and the second recognition scores, calculating confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata, and respectively displaying the candidate recognition texts with the highest confidence degree in each voice subdata.
In a preferred embodiment of the present invention, the second identification module 1603 may specifically include the following sub-modules:
the second voice signal extraction submodule is used for respectively extracting the acoustic characteristics of the multi-frame voice signals in each voice subdata;
the second voice candidate word recognition sub-module is used for recognizing a plurality of voice candidate words corresponding to the multi-frame voice information by respectively adopting the acoustic characteristics;
the second occurrence probability calculation submodule is used for calculating the occurrence probability of the voice candidate words respectively;
the second connection probability calculation submodule is used for calculating connection probabilities among the voice candidate words respectively;
a second candidate recognition text combination sub-module, configured to combine the multiple speech candidate words into multiple candidate recognition texts corresponding to the speech data, respectively;
and the eighth score calculating submodule is used for calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the plurality of voice candidate words and the connection probabilities among the plurality of voice candidate words.
In a preferred example of the embodiment of the present invention, the probability of occurrence of the plurality of speech candidate words may be calculated by the following formula:
P_occ(w_i) = P(x_i | w_i)
wherein x_i is the acoustic feature of the multi-frame speech signal and w_i is a speech candidate word.
In a preferred example of the embodiment of the present invention, the connection probability between the plurality of speech candidate words may be calculated by the following formula:
P_conn(w_{i-1}, w_i) = P(w_i | w_{i-1})
wherein w_{i-1} and w_i are adjacent speech candidate words; the connection probability of a candidate recognition text is the product of these terms over its words.
In a preferred example of embodiment of the present invention, the first recognition score may be calculated by the following formula:
s = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the speech candidate words, P_conn is the product of the connection probabilities between the speech candidate words, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
In a preferred example of the embodiment of the present invention, the recognition score may further include a probability of occurrence of the plurality of speech candidate words.
Referring to fig. 17, a block diagram of a structure of an embodiment 6 of the speech input system of the present invention is shown, which may specifically include the following modules:
a fourth voice data receiving module 1701, configured to receive voice data sent by the client;
a second voice data splitting module 1702, configured to split the voice data into a plurality of voice subdata;
a third identifying module 1703, configured to identify a plurality of candidate identification texts of each piece of speech subdata and identification information thereof, respectively; the identification information may include a first identification score;
a ninth score calculating module 1704, configured to calculate second recognition scores of the first M candidate recognition texts corresponding to each piece of speech sub data by using personalized text data of the current user respectively;
a tenth score calculating module 1705, configured to calculate, by using the first identification score and the second identification score, third identification scores of the first M candidate identification texts with the highest first identification score corresponding to each piece of voice sub-data;
a fourth confidence calculating module 1706, configured to calculate confidence of the top N candidate recognition texts with the highest third recognition score of each piece of speech sub data, respectively;
a fourth sending module 1707, configured to send the top N candidate recognition texts with the highest third recognition score of each piece of voice sub-data and their confidence levels to the client; and the client is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
In a preferred embodiment of the present invention, the third identifying module 1703 may specifically include the following sub-modules:
the second voice signal extraction submodule is used for respectively extracting the acoustic characteristics of the multi-frame voice signals in each voice subdata;
the second voice candidate word recognition sub-module is used for recognizing a plurality of voice candidate words corresponding to the multi-frame voice information by respectively adopting the acoustic characteristics;
the second occurrence probability calculation submodule is used for calculating the occurrence probability of the voice candidate words respectively;
the second connection probability calculation submodule is used for calculating connection probabilities among the voice candidate words respectively;
a second candidate recognition text combination sub-module, configured to combine the multiple speech candidate words into multiple candidate recognition texts corresponding to the speech data, respectively;
and the eighth score calculating submodule is used for calculating the first recognition scores of the corresponding candidate recognition texts by respectively adopting the occurrence probabilities of the plurality of voice candidate words and the connection probabilities among the plurality of voice candidate words.
In a preferred example of the embodiment of the present invention, the probability of occurrence of the plurality of speech candidate words may be calculated by the following formula:
P_occ(w_i) = P(x_i | w_i)
wherein x_i is the acoustic feature of the multi-frame speech signal and w_i is a speech candidate word.
In a preferred example of the embodiment of the present invention, the connection probability between the plurality of speech candidate words may be calculated by the following formula:
P_conn(w_{i-1}, w_i) = P(w_i | w_{i-1})
wherein w_{i-1} and w_i are adjacent speech candidate words; the connection probability of a candidate recognition text is the product of these terms over its words.
In a preferred example of embodiment of the present invention, the first recognition score may be calculated by the following formula:
s = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the speech candidate words, P_conn is the product of the connection probabilities between the speech candidate words, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
In a preferred example of the embodiment of the present invention, the recognition score may further include a probability of occurrence of the plurality of speech candidate words.
In a preferred example of the embodiment of the present invention, the personalized text data may include one or more of the following:
input behavior text data, a user-defined word bank, device text data, and voice recognition texts with a confidence coefficient higher than a preset threshold value.
In a preferred embodiment of the present invention, the candidate recognition text may include a plurality of speech candidate words, and the recognition score may further include occurrence probabilities of the plurality of speech candidate words;
the ninth score calculating module 1704 may specifically include the following sub-modules:
the second word segmentation submodule is used for segmenting the first M candidate recognition texts corresponding to each voice subdata respectively to obtain first segmented words;
the second mapping sub-module is used for mapping the first participles into preset second participles respectively, wherein the second participles can be participles of personalized text data of a current user, and the second participles can have word frequency;
the second searching submodule is used for respectively adopting the second participles to search the occurrence probability of the first participles; the occurrence probability may be a ratio of a first word frequency number to a second word frequency number, wherein the first word frequency number may be the number of times the second participle corresponding to the current first participle appears behind the second participles corresponding to the one or more first participles before it, and the second word frequency number may be the total word frequency number of the second participles corresponding to the one or more first participles before the current first participle;
the second connection probability obtaining submodule is used for respectively adopting the occurrence probability of the first participle to carry out multiplication operation so as to obtain the connection probability of the candidate recognition text;
and the second candidate recognition text score calculating sub-module is used for calculating a second recognition score of the candidate recognition text by adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the candidate recognition text.
In a preferred example of the embodiment of the present invention, the following formula may be adopted to calculate the second recognition score of the candidate recognition text:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the candidate recognition text, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
In a preferred example of the embodiment of the present invention, the third recognition score may be calculated using the following formula:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
In a preferred example of the embodiment of the present invention, the confidence may be a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
In a preferred embodiment of the present invention, the following modules may be specifically included:
the third display module is used for displaying other candidate recognition texts when the candidate recognition text with the highest confidence coefficient is triggered; and the other texts are candidate recognition texts except the candidate recognition text with the highest confidence coefficient in the first N candidate recognition texts.
For the system embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing has described in detail a voice input method and a voice input system provided by the present invention. Specific examples have been applied herein to explain the principles and embodiments of the present invention, and the above description of the embodiments is only intended to help understand the method of the present invention and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (30)
1. A speech input method, comprising:
collecting voice data and sending the voice data to a server;
receiving the first M candidate recognition texts with the highest first recognition scores corresponding to the voice data recognized by the server and recognition information thereof, wherein the recognition information comprises the first recognition scores;
calculating second identification scores of the first M candidate identification texts by adopting the personalized text data of the current user;
calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores;
calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
and displaying the first N candidate recognition texts according to the confidence degrees.
2. The method of claim 1, wherein the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, device text data, and voice recognition texts with a confidence coefficient higher than a preset threshold value.
3. The method according to claim 1 or 2, wherein the top M candidate recognition texts comprise a plurality of speech candidate words, and the recognition information further comprises occurrence probabilities of the plurality of speech candidate words;
the step of calculating the second recognition scores of the first M candidate recognized texts by using the personalized text data of the current user includes:
performing word segmentation on the first M candidate recognition texts to obtain first word segmentation;
respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of a current user, and the second participles have word frequency;
respectively searching the occurrence probability of the first segmentation by adopting the second segmentation; the occurrence probability of the first participle is a ratio of a first word frequency number and a second word frequency number, wherein the first word frequency number is the number of times the second participle corresponding to the current first participle appears behind the second participles corresponding to the one or more first participles before the current first participle, and the second word frequency number is the total word frequency number of the second participles corresponding to the one or more first participles before the current first participle;
performing multiplication operation by adopting the occurrence probability of the first word segmentation to obtain the connection probability of the first M candidate recognition texts;
and calculating second recognition scores of the first M candidate recognition texts by respectively adopting the occurrence probabilities of the plurality of voice candidate words and the connection probabilities of the first M candidate recognition texts.
4. The method of claim 3, wherein the second recognition score of the top M candidate recognized texts is calculated using the following formula:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the first M candidate recognition texts, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
5. The method of claim 1, wherein the third recognition score is calculated using the following equation:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
6. The method of claim 1, wherein the confidence is a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
7. A speech input method, comprising:
receiving voice data sent by a client;
recognizing a plurality of candidate recognition texts and recognition information thereof from the voice data; the identification information comprises a first identification score;
calculating second recognition scores of the first M candidate recognition texts with the highest first recognition scores by adopting the personalized text data of the current user;
calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores;
calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores;
sending the first N candidate recognition texts and confidence degrees thereof to a client; and the client is used for displaying the first N candidate recognition texts according to the confidence degrees.
8. A speech input method, comprising:
collecting voice data and sending the voice data to a server;
receiving the first M candidate recognition texts with the highest first recognition scores in each voice subdata returned by the server and the recognition scores thereof, wherein the voice subdata is a plurality of voice subdata into which the voice data is cut by the server, and the recognition scores comprise the first recognition scores;
respectively adopting the personalized text data of the current user to calculate second recognition scores of the first M candidate recognition texts corresponding to each voice subdata;
calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition score and the second recognition score;
respectively calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata;
and respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
9. The method of claim 8, wherein the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, device text data, and voice recognition texts with a confidence coefficient higher than a preset threshold value.
10. The method according to claim 8 or 9, wherein the top M candidate recognition texts comprise a plurality of speech candidate words, and the recognition score further comprises occurrence probabilities of the plurality of speech candidate words;
the step of calculating the second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user comprises the following steps:
respectively segmenting the first M candidate recognition texts corresponding to each voice subdata to obtain first segmented words;
respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of a current user, and the second participles have word frequency;
respectively searching the occurrence probability of the first segmentation by adopting the second segmentation; the occurrence probability of the first participle is a ratio of a first word frequency count and a second word frequency count, wherein the first word frequency count is a word frequency count of a second participle corresponding to a current first participle appearing behind a second participle corresponding to one or more first participles before the current first participle, and the second word frequency count is a total word frequency count of the second participles corresponding to the one or more first participles before the current first participle;
respectively adopting the occurrence probability of the first participle to carry out multiplication operation so as to obtain the connection probability of the first M candidate recognition texts;
and calculating second recognition scores of the first M candidate recognition texts by adopting the occurrence probabilities of the plurality of voice candidate words and the connection probabilities of the first M candidate recognition texts.
11. The method of claim 10, wherein the second recognition score of the top M candidate recognized texts is calculated using the following formula:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the first M candidate recognition texts, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
12. The method of claim 8, wherein the third recognition score is calculated using the following equation:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
13. The method of claim 8, wherein the confidence is a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
14. The method of claim 8, further comprising:
when the candidate recognition text with the highest confidence coefficient is triggered, displaying other candidate recognition texts; and the other texts are candidate recognition texts except the candidate recognition text with the highest confidence coefficient in the first N candidate recognition texts.
15. A speech input method, comprising:
receiving voice data sent by a client;
segmenting the voice data into a plurality of voice subdata;
respectively identifying a plurality of candidate identification texts and identification information thereof of each voice subdata; the identification information comprises a first identification score;
respectively adopting the personalized text data of the current user to calculate second recognition scores of the first M candidate recognition texts with the highest first recognition score corresponding to each voice subdata;
calculating third recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the first recognition score and the second recognition score;
respectively calculating the confidence degrees of the first N candidate recognition texts with the highest third recognition scores of each voice subdata;
sending the first N candidate recognition texts with the highest third recognition score of each voice subdata and the confidence degrees of the candidate recognition texts to the client; and the client is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
16. A speech input system, comprising:
the first voice data acquisition module is used for acquiring voice data;
the first voice data sending module is used for sending the voice data to a server;
a first receiving module, configured to receive the first M candidate recognition texts with the highest first recognition score corresponding to the voice data recognized by the server and recognition information thereof, where the recognition information includes the first recognition score;
the first score calculating module is used for calculating second recognition scores of the first M candidate recognition texts by adopting the personalized text data of the current user;
the second score calculation module is used for calculating third recognition scores of the first M candidate recognition texts by adopting the first recognition scores and the second recognition scores;
the first confidence coefficient calculation module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores;
and the first display module is used for displaying the first N candidate recognition texts according to the confidence degrees.
17. The system of claim 16, wherein the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, device text data, and voice recognition texts with a confidence coefficient higher than a preset threshold value.
18. The system according to claim 16 or 17, wherein the top M candidate recognition texts comprise a plurality of speech candidate words, and the recognition information further comprises occurrence probabilities of the plurality of speech candidate words;
the first score calculation module comprises:
the first word segmentation sub-module is used for carrying out word segmentation on the first M candidate recognition texts to obtain first word segmentation;
the first mapping sub-module is used for respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of a current user, and the second participles have word frequency;
the first searching submodule is used for respectively adopting the second participles to search the occurrence probability of the first participles; the occurrence probability of the first participle is a ratio of a first word frequency number and a second word frequency number, wherein the first word frequency number is the number of times the second participle corresponding to the current first participle appears behind the second participles corresponding to the one or more first participles before the current first participle, and the second word frequency number is the total word frequency number of the second participles corresponding to the one or more first participles before the current first participle;
a first connection probability obtaining submodule, configured to perform multiplication operation by using the occurrence probability of the first participle to obtain connection probabilities of the first M candidate recognition texts;
and the first candidate recognition text score calculating sub-module is used for calculating second recognition scores of the first M candidate recognition texts by respectively adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the first M candidate recognition texts.
19. The system of claim 18, wherein the second recognition score for the top M candidate recognized texts is calculated using the following formula:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the first M candidate recognition texts, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
20. The system of claim 16, wherein the third recognition score is calculated using the following equation:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
21. The system of claim 16, wherein the confidence level is a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
22. A speech input system, comprising:
the first voice data receiving module is used for receiving voice data sent by the client;
the first recognition module is used for recognizing a plurality of candidate recognition texts and recognition information thereof from the voice data; the identification information comprises a first identification score;
the third score calculating module is used for calculating second recognition scores of the first M candidate recognition texts with the highest first recognition scores by adopting the personalized text data of the current user;
a fourth score calculating module, configured to calculate third recognition scores of the first M candidate recognition texts by using the first recognition score and the second recognition score;
the second confidence coefficient calculation module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores;
the first sending module is used for sending the first N candidate recognition texts and the confidence degrees thereof to a client; and the client is used for displaying the first N candidate recognition texts according to the confidence degrees.
23. A speech input system, comprising:
the second voice data acquisition module is used for acquiring voice data;
the second voice data sending module is used for sending the voice data to a server;
the second receiving module is used for receiving the first M candidate recognition texts with the highest first recognition scores in each voice subdata returned by the server and the recognition scores of the candidate recognition texts, wherein the voice subdata is a plurality of voice subdata into which the voice data are cut by the server, and the recognition scores comprise the first recognition scores;
the fifth score calculating module is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user;
a sixth score calculating module, configured to calculate, by using the first recognition score and the second recognition score, third recognition scores of the first M candidate recognition texts corresponding to each piece of speech sub-data;
the third confidence coefficient calculation module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores of each voice subdata respectively;
and the second display module is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
24. The system of claim 23, wherein the personalized text data comprises one or more of:
input behavior text data, a user-defined word bank, device text data, and voice recognition texts with a confidence coefficient higher than a preset threshold value.
25. The system according to claim 23 or 24, wherein the top M candidate recognition texts comprise a plurality of speech candidate words, and the recognition score further comprises occurrence probabilities of the plurality of speech candidate words;
the fifth score calculation module includes:
the second word segmentation submodule is used for segmenting the first M candidate recognition texts corresponding to each voice subdata respectively to obtain first segmented words;
the second mapping sub-module is used for respectively mapping the first participles into preset second participles, wherein the second participles are participles of personalized text data of the current user, and the second participles have word frequency;
the second searching submodule is used for respectively adopting the second participles to search the occurrence probability of the first participles; the occurrence probability of the first participle is a ratio of a first word frequency number and a second word frequency number, wherein the first word frequency number is the number of times the second participle corresponding to the current first participle appears behind the second participles corresponding to the one or more first participles before the current first participle, and the second word frequency number is the total word frequency number of the second participles corresponding to the one or more first participles before the current first participle;
a second connection probability obtaining submodule, configured to perform multiplication operation by using the occurrence probabilities of the first participles respectively to obtain connection probabilities of the first M candidate recognition texts;
and the second candidate recognition text score calculating sub-module is used for calculating second recognition scores of the first M candidate recognition texts by adopting the occurrence probability of the plurality of voice candidate words and the connection probability of the first M candidate recognition texts.
26. The system of claim 25, wherein the second recognition score of the candidate recognized text is calculated using the following formula:
u = log P_occ + λ * log P_conn + n * WP
wherein P_occ is the product of the occurrence probabilities of the plurality of speech candidate words, P_conn is the connection probability of the candidate recognition text, λ is the weight, WP is a word insertion penalty parameter, and n is the number of speech candidate words.
27. The system of claim 23, wherein the third recognition score is calculated using the following equation:
MS(i) = α*s_i + β*u_i
wherein MS(i) is the third recognition score of the ith candidate recognition text, s_i is the first recognition score of the ith candidate recognition text, u_i is the second recognition score of the ith candidate recognition text, and α and β are non-negative numbers.
28. The system of claim 23, wherein the confidence measure is a ratio of a third recognition score of the current candidate recognized text to a sum of third recognition scores of the top N candidate recognized texts.
29. The system of claim 23, further comprising:
the third display module is used for displaying other candidate recognition texts when the candidate recognition text with the highest confidence coefficient is triggered; and the other texts are candidate recognition texts except the candidate recognition text with the highest confidence coefficient in the first N candidate recognition texts.
30. A speech input system, comprising:
the second voice data receiving module is used for receiving voice data sent by the client;
the first voice data segmentation module is used for segmenting the voice data into a plurality of voice subdata;
the second identification module is used for respectively identifying a plurality of candidate identification texts of each voice subdata and identification information thereof; the identification information comprises a first identification score;
the seventh score calculating module is used for calculating second recognition scores of the first M candidate recognition texts corresponding to each voice subdata by respectively adopting the personalized text data of the current user;
the eighth score calculating module is used for calculating third recognition scores of the first M candidate recognition texts with the highest first recognition score corresponding to each voice subdata by respectively adopting the first recognition scores and the second recognition scores;
the fourth confidence coefficient calculation module is used for calculating the confidence coefficients of the first N candidate recognition texts with the highest third recognition scores of each voice subdata respectively;
the second sending module is used for sending the first N candidate recognition texts with the highest third recognition score of each voice subdata and the confidence degrees of the candidate recognition texts to the client; and the client is used for respectively displaying the candidate recognition text with the highest confidence level in each voice subdata.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310701517.9A CN103677729B (en) | 2013-12-18 | 2013-12-18 | Voice input method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103677729A CN103677729A (en) | 2014-03-26 |
CN103677729B true CN103677729B (en) | 2017-02-08 |
Family
ID=50315412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310701517.9A Active CN103677729B (en) | 2013-12-18 | 2013-12-18 | Voice input method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103677729B (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103956169B * | 2014-04-17 | 2017-07-21 | Beijing Sogou Technology Development Co., Ltd. | Voice input method, device and system
CN104021786B * | 2014-05-15 | 2017-05-24 | Beijing Zhongke Huilian Information Technology Co., Ltd. | Speech recognition method and speech recognition device
CN105244024B * | 2015-09-02 | 2019-04-05 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech recognition method and device
CN106095798A * | 2016-05-31 | 2016-11-09 | Beijing Hongma Media Culture Development Co., Ltd. | Gender feature recognition method, system and terminal based on a feature database
CN107657471B | 2016-09-22 | 2021-04-30 | Tencent Technology (Beijing) Co., Ltd. | Virtual resource display method, client and plug-in
CN106251869B | 2016-09-22 | 2020-07-24 | Zhejiang Geely Holding Group Co., Ltd. | Voice processing method and device
CN110770819B | 2017-06-15 | 2023-05-12 | Beijing Didi Infinity Technology and Development Co., Ltd. | Speech recognition system and method
CN109145281B * | 2017-06-15 | 2020-12-25 | Beijing Didi Infinity Technology and Development Co., Ltd. | Speech recognition method, apparatus and storage medium
CN109213970B * | 2017-06-30 | 2022-07-29 | Beijing Gridsum Technology Co., Ltd. | Method and device for generating notes
CN107679032A * | 2017-09-04 | 2018-02-09 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice conversion error correction method and device
CN107832286B * | 2017-09-11 | 2021-09-14 | Yuanguang Software Co., Ltd. | Intelligent interaction method, device and storage medium
US20200273447A1 * | 2017-10-24 | 2020-08-27 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for key phrase spotting
CN108021551B * | 2017-10-27 | 2021-02-19 | Beijing Jietong Huasheng Technology Co., Ltd. | Corpus expansion method and apparatus
CN107733762B * | 2017-11-20 | 2020-07-24 | Ningbo Xiangwang Intelligent Technology Co., Ltd. | Voice control method, device and system for smart home
CN109614499B * | 2018-11-22 | 2023-02-17 | Advanced New Technologies Co., Ltd. | Dictionary generation method, new word discovery method, device and electronic equipment
CN109493862B * | 2018-12-24 | 2021-11-09 | Shenzhen TCL New Technology Co., Ltd. | Terminal, voice server determination method, and computer-readable storage medium
CN110288993A * | 2019-06-26 | 2019-09-27 | Guangzhou Tungee Technology Co., Ltd. | Personalized intelligent voice interaction method and device based on container technology
CN112562675B * | 2019-09-09 | 2024-05-24 | Beijing Xiaomi Mobile Software Co., Ltd. | Voice information processing method, device and storage medium
CN110838290A * | 2019-11-18 | 2020-02-25 | Bank of China Limited | Voice robot interaction method and device for cross-language communication
CN114882887A * | 2022-06-13 | 2022-08-09 | China Telecom Corporation Limited | Voice message conversion method, device, equipment and storage medium
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7228275B1 (en) * | 2002-10-21 | 2007-06-05 | Toyota Infotechnology Center Co., Ltd. | Speech recognition system having multiple speech recognizers |
CN101847405A * | 2009-03-23 | 2010-09-29 | Sony Corporation | Speech recognition device and method, language model generation device and method, and program
CN102282609A * | 2008-11-19 | 2011-12-14 | Robert Bosch GmbH | System and method for recognizing proper names in dialog systems
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060149544A1 (en) * | 2005-01-05 | 2006-07-06 | At&T Corp. | Error prediction in spoken dialog systems |
KR100679051B1 * | 2005-12-14 | 2007-02-05 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method using a plurality of reliability measurement algorithms
US8700398B2 (en) * | 2011-11-29 | 2014-04-15 | Nuance Communications, Inc. | Interface for setting confidence thresholds for automatic speech recognition and call steering applications |
WO2013086736A1 * | 2011-12-16 | 2013-06-20 | Huawei Technologies Co., Ltd. | Speaker recognition method and device
2013-12-18: Application CN201310701517.9A filed in China; granted as patent CN103677729B (en); legal status Active.
Also Published As
Publication number | Publication date |
---|---|
CN103677729A (en) | 2014-03-26 |
Similar Documents
Publication | Title
---|---
CN103677729B (en) | Voice input method and system
CN103956169B (en) | Voice input method, device and system
US9466289B2 (en) | Keyword detection with international phonetic alphabet by foreground model and background model
US9396724B2 (en) | Method and apparatus for building a language model
CN105427858B (en) | Method and system for automatic voice classification
US8504367B2 (en) | Speech retrieval apparatus and speech retrieval method
EP1696421B1 (en) | Learning in automatic speech recognition
CN106297800B (en) | Method and device for adaptive speech recognition
US20200082808A1 (en) | Speech recognition error correction method and apparatus
CN109256152A (en) | Speech assessment method and device, electronic equipment, storage medium
CN109637537B (en) | Method for automatically acquiring annotated data to optimize a user-defined wake-up model
US20100094626A1 (en) | Method and apparatus for locating speech keyword and speech recognition system
US11024298B2 (en) | Methods and apparatus for speech recognition using a garbage model
WO2014190732A1 (en) | Method and apparatus for building a language model
CN108090038B (en) | Text sentence segmentation method and system
CN106935239A (en) | Construction method and device for a pronunciation dictionary
CN102280106A (en) | VWS method and apparatus for a mobile communication terminal
CN109979257B (en) | Method for correcting splitting operations based on automatic scoring of English reading
CN102194454A (en) | Device and method for detecting keywords in continuous speech
US10685644B2 (en) | Method and system for text-to-speech synthesis
CN110782918B (en) | Speech prosody assessment method and device based on artificial intelligence
US20210151036A1 (en) | Detection of correctness of pronunciation
CN110019741B (en) | Question-answering system answer matching method, device, equipment and readable storage medium
CN104750677A (en) | Speech translation apparatus, speech translation method and speech translation program
Chen et al. | Chinese spoken document summarization using probabilistic latent topical information
Legal Events
Code | Title
---|---
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant