CN105718239B

CN105718239B - Voice input method and apparatus

Info

Publication number: CN105718239B
Application number: CN201610054745.5A
Authority: CN
Inventors: 赵毅
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2016-01-27
Filing date: 2016-01-27
Publication date: 2019-03-08
Anticipated expiration: 2036-01-27
Also published as: CN105718239A

Abstract

The present invention provides a voice input method and apparatus. The method includes: after entering a voice input function interface, displaying an input text recommendation list on the voice input function interface, wherein the input text recommendation list is generated according to recognition results of the user's historical voice input. The invention enables a user, after entering the voice input function interface, to select an input text directly from the input text recommendation list. If the list already contains content that the user would otherwise input by voice, the voice input need not be repeated, which improves input efficiency. The invention also makes the function easier to use in non-private environments or in environments where speaking is inconvenient.

Description

Voice input method and device

[ technical field ]

The invention relates to the technical field of computer application, in particular to a voice input method and device.

[ background of the invention ]

With the rapid development of smart devices, text input alone no longer satisfies people's desire to free both hands, and voice input, together with voice search based on it, has become one of the important ways for people to interact with smart devices.

Currently, when a user needs to perform voice input, the user first enters a voice input function interface, as shown in fig. 1, records voice content by pressing and holding a voice input function key, and releases the key to have the voice content recognized or further searched. However, the user must speak the exact voice content every time. Even if the user has input the same voice content before, or inputs it frequently, the user must still press and hold the voice input function key and speak it again. On one hand, this input mode is inefficient because it requires the user to repeat the same voice content; on the other hand, the environments in which the voice input function can be used are limited: in non-private environments such as buses and cafes, or in environments where speaking is inconvenient such as meetings or classrooms, voice input cannot be performed efficiently.

[ summary of the invention ]

In view of the above, the present invention provides a method and an apparatus for inputting speech, so as to improve input efficiency and reduce restrictions on the use environment of speech input functions.

The specific technical scheme is as follows:

the invention provides a voice input method, which comprises the following steps:

after entering a voice input function interface, displaying an input text recommendation list on the voice input function interface; wherein the input text recommendation list is generated based on recognition results of historical speech inputs of the user.

According to a preferred embodiment of the invention, the method further comprises:

generating an input text recommendation list according to the recognition result of the historical voice input of the user; or,

and acquiring an input text recommendation list generated according to the recognition result of the historical voice input of the user from the server.

According to a preferred embodiment of the present invention, generating an input text recommendation list according to a recognition result of a user's historical speech input includes:

acquiring a recognition result record of historical voice input;

sorting the recognition results of the historical voice input according to at least one of the following factors: the number of occurrences of recognition results of the user's historical voice input in a recent period of time, the number of occurrences of recognition results of the user's historical voice input in the same period as the current time, the popularity of recognition results of all users' historical voice input at the same place as the current location, and the popularity of recognition results of all users' historical voice input in the same period as the current time;

and selecting the top M recognition results of the historical voice input to form the input text recommendation list, wherein M is a preset positive integer.

one of the input texts in the input text recommendation list is in a selected state.

when the operation of switching the input text by the user is acquired, switching another input text in the input text recommendation list to be in a selected state;

wherein only one input text in the input text recommendation list can be in a selected state at the same time.

According to a preferred embodiment of the present invention, when the operation of switching the input text by the user is obtained, switching another input text in the input text recommendation list to be in the selected state includes:

when a user operation that triggers the voice input state is acquired and the duration for which the input voice quality fails to meet the recognition requirement reaches a preset duration, switching, in sequence or randomly, the next input text in the input text recommendation list to the selected state; or,

when it is acquired that the user has clicked an input text in the unselected state in the input text recommendation list, switching the clicked input text to the selected state.

and when the operation of triggering input text input is acquired, inputting the input text in the selected state at present.

According to a preferred embodiment of the present invention, acquiring the operation of triggering input of the input text includes:

acquiring that the input voice quality still does not meet the recognition requirement when the user operation that triggered the voice input state ends.

when a user operation that triggers the voice input state is acquired and the input voice quality meets the recognition requirement, recognizing the input voice and inputting the recognition result when the end of that user operation is acquired.

According to a preferred embodiment of the present invention, triggering a user operation in a voice input state includes: pressing a voice input function button on the voice input function interface;

the ending of the user operation in the voice input state includes: and releasing the voice input function button.

The invention also provides a voice input device, comprising:

the recommendation unit is used for displaying an input text recommendation list on the voice input function interface after entering the voice input function interface;

wherein the input text recommendation list is generated based on recognition results of historical speech inputs of the user.

According to a preferred embodiment of the invention, the apparatus further comprises: a generating unit or an acquiring unit;

the generating unit is used for generating an input text recommendation list according to the recognition result of the historical voice input of the user;

the acquisition unit is used for acquiring an input text recommendation list generated according to the recognition result of the historical voice input of the user from the server.

According to a preferred embodiment of the present invention, the generating unit is specifically configured to:

acquiring a recognition result record of historical voice input;

According to a preferred embodiment of the invention, the apparatus further comprises:

and the selection unit is used for enabling one input text in the input text recommendation list to be in a selected state.

According to a preferred embodiment of the present invention, the selecting unit is further configured to switch another input text in the input text recommendation list to be in a selected state when acquiring an operation of switching the input text by a user;

According to a preferred embodiment of the present invention, the selecting unit is specifically configured to:

the first input unit is used for inputting the input text in the selected state when the operation of triggering the input of the input text is acquired.

According to a preferred embodiment of the present invention, when the first input unit acquires that the quality of the voice input when the user operation in the voice input state is ended does not meet the recognition requirement, it is determined that the operation of triggering the input of the text is acquired.

and the second input unit is used for, when a user operation that triggers the voice input state is acquired and the input voice quality meets the recognition requirement, recognizing the input voice and inputting the recognition result when the end of that user operation is acquired.

According to the above technical scheme, the input text recommendation list is generated from the recognition results of the user's historical voice input and is displayed on the voice input function interface, so that the user can select an input text directly from the list after entering the voice input function interface. If the list already contains content that the user would otherwise input by voice, the voice input need not be repeated, which improves input efficiency; in addition, the function becomes easier to use in non-private environments or in environments where speaking is inconvenient.

[ description of the drawings ]

FIG. 1 is a schematic diagram of a voice input function interface provided by the prior art;

FIG. 2 is a flow chart of a method provided by an embodiment of the present invention;

FIG. 3a is a schematic diagram of a speech input function interface according to an embodiment of the present invention;

FIG. 3b is a diagram illustrating a selected state according to an embodiment of the present invention;

FIG. 3c is a diagram illustrating an exemplary switching of an input text in a selected state according to an embodiment of the present invention;

FIG. 3d is a diagram illustrating the result of a voice input according to an embodiment of the present invention;

FIG. 3e is a diagram illustrating the result of another speech input provided by the embodiment of the present invention;

fig. 4 is a diagram illustrating an apparatus according to an embodiment of the present invention.

[ detailed description of the embodiments ]

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.

The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

In the invention, in order to improve the user's voice input efficiency and to suit a wider range of voice input environments, an input text recommendation list is displayed on the voice input function interface after the interface is entered, wherein the input text recommendation list is generated according to the recognition results of the user's historical voice input. The user can then select an input text from the input text recommendation list for input.

The above method is described in detail with reference to one embodiment. Fig. 2 is a flowchart of a method according to an embodiment of the present invention, and as shown in fig. 2, the method may include the following steps:

In 201, an input text recommendation list is generated based on the recognition results of the user's historical voice input.

It should be noted that this step may be a step executed in advance, or may be a step executed in real time after entering the voice input function interface, and in this embodiment, the step is executed in advance as an example.

In addition, there may be, but are not limited to, two implementations of this step:

In the first implementation, the input text recommendation list is generated by the voice input client.

The voice input client can perform the following processing:

First, a record of the recognition results of historical voice input is obtained. The record may be kept by the client, or kept by the server and then obtained by the client from the server. Alternatively, the client may record the recognition results of the current user's historical voice input while the server records the recognition results of all users' voice input. In addition to the user ID and the recognition result of the voice input, the record may include attribute information such as the input time and the input place.
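As a minimal sketch of the record described above, the following Python data class holds one recognition result together with the attribute information mentioned in the text. The class name, field names, and helper function are illustrative assumptions; the patent does not prescribe a storage format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RecognitionRecord:
    """One recognized voice input (all field names are hypothetical)."""
    user_id: str
    text: str             # recognition result of the voice input, as text
    input_time: datetime  # when the voice input was made
    place: str            # where it was made, e.g. a coarse location label


def records_for_user(records, user_id):
    """Return only the records that belong to the given user."""
    return [r for r in records if r.user_id == user_id]
```

A client that keeps only its own user's records would store a list of such objects, while a server-side record would hold them for all users and filter by user ID when needed.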

The recognition results of the historical speech inputs are then ranked according to a number of factors. Factors considered in ranking may include, but are not limited to, one or any combination of the following:

Factor 1: the number of occurrences of recognition results of the user's historical voice input over a recent period of time. With this factor, recognition results of voice content that the user has input frequently in the recent period can be recommended to the user. Note that the input text recommendation list contains the recognition results of the input speech in text form, so that the user can view them intuitively. The period of time may be one week, one month, one year, and so on, and can be set flexibly as required. For example, if the user has frequently input "Beijing weather" by voice in the last week to check the weather in Beijing, "Beijing weather" may be put into the input text recommendation list.

Factor 2: the number of occurrences of recognition results of the user's historical voice input in the same period as the current time. With this factor, recognition results of voice content that the user frequently inputs in the same period can be recommended to the user. For example, if the user inputs "water wood community" by voice every morning between 8:00 and 9:00 to browse the forum, then when the user opens the voice input function interface between 8:00 and 9:00 in the morning, "water wood community" can be put into the input text recommendation list.

Factor 3: the number of occurrences of recognition results of the user's historical voice input at the same place as the current location, where the user's current location can be obtained by calling a positioning API of the user equipment. With this factor, recognition results of voice content that the user frequently inputs at the same place can be recommended to the user. For example, if the user often inputs "Baidu map" by voice near the entrance of an expressway toll station to check expressway congestion, then when the user arrives near the toll station entrance and opens the voice input function interface, "Baidu map" can be put into the input text recommendation list.

Factor 4: the popularity of recognition results of all users' historical voice input over a recent period of time. The client can obtain from the server the recognition results of voice input that have been most popular among all users in the recent period, and put one or more of them into the input text recommendation list. For example, if "The Legend of Mi Yue" has recently been popular and many users input its title by voice to watch videos, then when the user opens the voice input function interface, "The Legend of Mi Yue" can be put into the input text recommendation list.

Factor 5: the popularity of recognition results of all users' historical voice input at the same place as the current location. The client can obtain from the server the recognition results of voice input that are most popular among all users at the same place, and put one or more of them into the input text recommendation list. As before, the user's current location can be obtained by calling a positioning API of the user equipment. For example, if a location is remote and users there usually input "taxi taking" by voice to use taxi-hailing software, then when the user opens the voice input function interface at that location, "taxi taking" can be put into the input text recommendation list.

Factor 6: the popularity of recognition results of all users' historical voice input in the same period as the current time. The client can obtain from the server the recognition results of voice input that are most popular among all users in the same period, and put one or more of them into the input text recommendation list. For example, if many users input "weather forecast" by voice between 7:00 and 8:00 in the morning to check the weather, then when the user opens the voice input function interface in that period, "weather forecast" can be put into the input text recommendation list.

Of course, the invention is not limited to the above-mentioned 6 factors, and other factors may be adopted to order the recognition results of the historical speech input, which is not exhaustive here. In addition, when more than one factor is used for sorting, the scores of the recognition results of the historical speech inputs can be calculated in a weighting mode, and sorting is carried out according to the scores.

Finally, the top M recognition results of the historical voice input form the input text recommendation list, wherein M is a preset positive integer.
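The weighted ranking and top M selection described above can be sketched as follows. This is only one possible reading: the factor names, the weight values, the use of raw counts as per-factor scores, and the alphabetical tie-break are all assumptions made for illustration and are not specified by the patent.

```python
from collections import Counter


def rank_recommendations(factor_counts, weights, m):
    """Score each text as a weighted sum of per-factor counts; return the top m.

    factor_counts: dict mapping a factor name to a Counter of text -> count
    weights: dict mapping a factor name to its weight (assumed values)
    """
    scores = Counter()
    for factor, counts in factor_counts.items():
        w = weights.get(factor, 0.0)
        for text, n in counts.items():
            scores[text] += w * n
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [text for text, _ in ranked[:m]]


# Hypothetical counts for three of the six factors.
factor_counts = {
    "user_recent": Counter({"Beijing weather": 5, "PM 2.5": 2}),
    "user_same_period": Counter({"water wood community": 3}),
    "all_users_same_period": Counter({"weather forecast": 30, "Beijing weather": 10}),
}
weights = {"user_recent": 1.0, "user_same_period": 1.5, "all_users_same_period": 0.1}
top = rank_recommendations(factor_counts, weights, m=3)
```

In practice the per-factor scores would likely be normalized before weighting so that counts from all users do not dominate counts from a single user; the small weight on the all-users factor stands in for that here.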

In the second implementation mode, the server generates an input text recommendation list, and then the input text recommendation list is issued to the voice input client for the client to use when presenting a voice input function interface. The method and factors considered in this implementation are similar to those in the first implementation, except that the server needs to obtain the location information of the user from the client when considering the location factor.

In 202, after the instruction to enter the voice input function interface is obtained, the input text recommendation list is displayed on the voice input function interface.

For example, as shown in fig. 3a, when the user enters the speech input function interface, several recommended input texts are presented on the speech input function interface: "weather forecast", "water wood community", ". x taxi", "PM 2.5".

The user can select an input text from the input text recommendation list to complete input, for example by long-pressing an input text. However, in order to better fit the functions of existing speech input interfaces, the present invention provides a preferred alternative, which is described in more detail below.

To facilitate user selection of input text, one of the input texts in the input text recommendation list may be placed in a selected state by default when the list is presented. For example, the first input text may be in the selected state by default. Various highlighting means may be employed to indicate the selected state, such as bolding, enlarging, framing, etc. As shown in FIG. 3b, when the user enters the speech input function interface, the first input text "weather forecast" is in the selected state by default, and the selected state is indicated by a box.

However, in many cases the user does not want to input the text that is selected by default, so a need to switch the input text arises. In 203, when the operation of switching the input text by the user is acquired, another input text in the input text recommendation list is switched to the selected state, and only one input text in the input text recommendation list can be in the selected state at the same time.

The operation mode of the user for switching the input text can be various, and the following two modes are listed here:

The first mode is as follows: when it is acquired that the user has clicked an input text in the unselected state in the input text recommendation list, the clicked input text is switched to the selected state. This mode is relatively easy to understand. However, the second mode is preferred in order to better fit existing speech input function interfaces.

The second mode is as follows: when a user operation that triggers the voice input state is acquired and the duration for which the input voice quality fails to meet the recognition requirement reaches a preset duration, the next input text in the input text recommendation list is switched, in sequence or at random, to the selected state. For example, when the user presses the voice input function button, the client is triggered into the voice input state. If the quality of the recorded voice does not meet the recognition requirement, for example because the voice strength or the voice clarity is insufficient, this indicates that the user is not inputting voice. When this condition has lasted for a preset duration, for example 1 s, the next input text is switched to the selected state; as shown in fig. 3c, "water wood community" is switched to the selected state. If the user keeps holding the voice input function button without releasing it and the input voice quality fails to meet the recognition requirement for another 1 s, that is, the user has held the button for 2 s in total, the next input text is again switched to the selected state.
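The sequential variant of the second mode can be sketched as a pure function from the time the button has been held without usable speech to the index of the selected input text. The function name and parameters are hypothetical, and wrapping around to the first item after the last one is an assumption; the text does not say what happens at the end of the list.

```python
def selected_index(silent_hold_seconds, list_length, step_seconds=1.0, start=0):
    """Index of the selected item after holding the button without usable speech.

    Every full `step_seconds` of silent holding advances the selection by one.
    Wrap-around at the end of the list is an assumed behavior.
    """
    if list_length <= 0:
        raise ValueError("recommendation list must not be empty")
    steps = int(silent_hold_seconds // step_seconds)
    return (start + steps) % list_length
```

With a list of four texts and a 1 s step, holding for 1 s selects the second text and holding for 2 s in total selects the third, which matches the example in the text.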

In 204, when an operation of triggering input of an input text is acquired, an input text currently in a selected state is input. The operation for triggering input of the text may be that the quality of the input voice still does not meet the recognition requirement when the user operation in the voice input state is ended. For example, when the user presses the voice input function button and the input voice quality does not meet the recognition requirement, the user releases the voice input function button to trigger the input of the input text currently in the selected state.
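A minimal sketch of the decision made when the button is released is given below. The quality check is an assumption: the patent only mentions insufficient voice strength or clarity as examples, so a root-mean-square amplitude threshold on normalized samples is used here purely for illustration. The `recognize` callback stands in for a speech recognizer and is hypothetical; the branch that recognizes the speech follows the summary of the invention, which inputs the recognition result when the quality requirement is met.

```python
import math


def meets_recognition_requirement(samples, min_rms=0.05):
    """Assumed check: RMS amplitude of normalized samples reaches a threshold."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= min_rms


def text_on_release(samples, selected_text, recognize):
    """Return the text to input when the voice input button is released."""
    if meets_recognition_requirement(samples):
        # The user really spoke: input the recognition result.
        return recognize(samples)
    # No usable speech was recorded: input the currently selected text.
    return selected_text
```

Keeping the two branches in one function reflects that the same press-and-release gesture serves both purposes, so the existing interface needs no extra button.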

Depending on the type of client, different functions are triggered after the input text in the selected state is input. For example, an instant messaging client may include a voice input function; in the above manner, if the user releases the voice input function button at the moment shown in fig. 3c, the client goes to the interface shown in fig. 3d, in which "water wood community" is input as a message.

For another example, a search client may include a voice input function; in the above manner, if the user releases the voice input function button at the moment shown in fig. 3c, the client goes to the interface shown in fig. 3e, which shows the search results obtained with "water wood community" as the query.

The above is a detailed description of the method provided by the present invention, and the following is a description of the apparatus provided by the present invention with reference to the examples.

Fig. 4 is a structural diagram of an apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus may include a recommendation unit 01, and may further include a generation unit 02 or an acquisition unit (not shown in the figure); the apparatus may further include a selection unit 03, a first input unit 04, and a second input unit 05. The main functions of each constituent unit are as follows:

the recommending unit 01 is responsible for displaying an input text recommending list on a voice input function interface after entering the voice input function interface; wherein the input text recommendation list is generated based on recognition results of historical speech inputs of the user.

The input text recommendation list can be generated by the device, or generated by the server and then sent to the device. In addition, the input text recommendation list may be generated in advance, or may be generated in real time after entering the voice input function interface.

If the input text recommendation list is generated by the apparatus, the generation unit 02 in the apparatus generates the input text recommendation list according to the recognition result of the user's historical speech input.

Specifically, the generation unit 02 may acquire a record of the recognition results of historical voice input; sort the recognition results of the historical voice input according to at least one of the following factors: the number of occurrences of recognition results of the user's historical voice input in a recent period of time, the number of occurrences of recognition results of the user's historical voice input in the same period as the current time, the popularity of recognition results of all users' historical voice input at the same place as the current location, and the popularity of recognition results of all users' historical voice input in the same period as the current time; and select the top M recognition results of the historical voice input to form the input text recommendation list, wherein M is a preset positive integer.

If the input text recommendation list is generated by the server, the acquisition unit acquires the input text recommendation list generated according to the recognition result of the historical voice input of the user from the server. The way of generating the input text recommendation list by the server is similar to the way of generating the input text recommendation list by the generating unit 02, and is not described herein again.

In order to facilitate the user to select the input text, the selecting unit 03 may put one of the input texts in the input text recommendation list in a selected state. For example, the first input text may default to a selected state when entering the speech input function interface. Various highlighting means may be employed in indicating the selected state, such as bolding, enlarging, framing, etc.

However, in many cases, the user does not select the default input text in the selected state for input, and therefore, a need for switching the input text is generated. When the selection unit 03 acquires the operation of switching the input text by the user, another input text in the input text recommendation list can be switched to be in a selected state; wherein only one input text in the input text recommendation list can be in a selected state at the same time.

The operation mode of the user for switching the input text can be various, and two modes are listed here:

the first mode is as follows: the selecting unit 03 acquires user operation triggering the voice input state, and if the time length of the input voice quality not meeting the recognition requirement reaches the preset time length, sequentially or randomly switches the next input text in the input text recommendation list to be in the selected state.

The second mode is as follows: the selecting unit 03 acquires the input text which is in the unselected state in the input text recommendation list clicked by the user, and switches the input text clicked by the user to be in the selected state.

The first mode is preferred.
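The behavior of the selection unit can be sketched as a small class that keeps a single index, which by construction allows only one input text to be in the selected state at a time. The class and method names are hypothetical, and wrapping around at the end of the list is an assumption.

```python
class SelectionUnit:
    """Tracks which input text in the recommendation list is selected."""

    def __init__(self, texts):
        if not texts:
            raise ValueError("recommendation list must not be empty")
        self.texts = list(texts)
        self.index = 0  # the first input text is selected by default

    @property
    def selected(self):
        return self.texts[self.index]

    def select_next(self):
        """Sequential switch, e.g. after the preset silent-hold duration."""
        self.index = (self.index + 1) % len(self.texts)
        return self.selected

    def select_clicked(self, text):
        """Switch to the input text the user clicked (ValueError if absent)."""
        self.index = self.texts.index(text)
        return self.selected
```

For example, a unit created with the texts "weather forecast", "water wood community", and "PM 2.5" starts with "weather forecast" selected, and one call to `select_next()` moves the selection to "water wood community".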

When the operation of triggering input text input is acquired, the first input unit 04 inputs the input text currently in the selected state. When the first input unit 04 acquires that the quality of the input voice does not meet the recognition requirement when the user operation in the voice input state is ended, it may be determined that the operation for triggering the input of the text is acquired.

If the user operation in the voice input state is triggered and the input voice quality meets the recognition requirement, the second input unit 05 can confirm that the user really has the voice input requirement and inputs the voice, so that when the user operation in the voice input state is obtained and ended, the input voice is recognized and the recognition result is input.

The user operation of triggering the voice input state may include: pressing a voice input function button on the voice input function interface; ending the user operation in the voice input state may include: the voice input function button is released.

The apparatus provided in the embodiment of the present invention may be an application located in the user terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) located in the application of the user terminal, which is not particularly limited in the embodiment of the present invention.

As can be seen from the above description, the method and apparatus provided by the present invention can have the following advantages:

1) According to the method and the device, the input text recommendation list is generated from the recognition results of the user's historical voice input and is displayed on the voice input function interface, so that the user can select an input text directly from the list after entering the voice input function interface. If the list already contains content that the user would otherwise input by voice, the voice input need not be repeated, which improves input efficiency; in addition, the function becomes easier to use in non-private environments or in environments where speaking is inconvenient.

2) The method and the device can generate the input text recommendation list according to the user's recent voice input records, the user's voice input records in a specific time period, the user's voice input records at a specific place, currently popular voice inputs, and the like, so as to meet the user's voice input needs as far as possible.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.

The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method of speech input, the method comprising:

after entering a voice input function interface, displaying an input text recommendation list on the voice input function interface; wherein the input text recommendation list is generated according to a recognition result of a user historical voice input;

placing one of the input texts in the input text recommendation list in a selected state;

when an operation of switching the input text by the user is acquired, switching the input text that is in the selected state, wherein acquiring the operation of switching the input text by the user comprises: acquiring the user operation that triggers the voice input state, with the duration for which the input voice quality fails to meet the recognition requirement reaching a preset duration.

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein generating the input text recommendation list based on the recognition of the user's historical speech input comprises:

acquiring a recognition result record of historical voice input;

4. The method of claim 1, further comprising:

5. The method of claim 4, wherein when the operation of switching the input text by the user is obtained, switching another input text in the input text recommendation list to be in a selected state comprises:

when the user operation that triggers the voice input state is acquired and the duration for which the input voice quality fails to meet the recognition requirement reaches the preset duration, switching, sequentially or randomly, a next input text in the input text recommendation list to the selected state.

6. The method of claim 1, further comprising:

7. The method of claim 6, wherein acquiring the operation that triggers input of the input text comprises:

8. The method of claim 1, further comprising:

9. The method of claim 5, 7 or 8, wherein the user operation that triggers the voice input state comprises: pressing a voice input function button on the voice input function interface;

10. An apparatus for speech input, the apparatus comprising:

a recommendation unit, configured to display an input text recommendation list on the voice input function interface after the voice input function interface is entered; wherein the input text recommendation list is generated according to a recognition result of a user historical voice input;

a selection unit, configured to place one of the input texts in the input text recommendation list in a selected state;

the selection unit being further configured to switch the input text that is in the selected state when an operation of switching the input text by the user is acquired, wherein acquiring the operation of switching the input text by the user comprises: acquiring the user operation that triggers the voice input state, with the duration for which the input voice quality fails to meet the recognition requirement reaching a preset duration.

11. The apparatus of claim 10, further comprising: a generating unit or an acquiring unit;

12. The apparatus according to claim 11, wherein the generating unit is specifically configured to:

acquiring a recognition result record of historical voice input;

13. The apparatus according to claim 10, wherein the selection unit is further configured to switch another input text in the input text recommendation list to the selected state when an operation of switching the input text by the user is acquired;

14. The apparatus according to claim 13, wherein the selection unit is specifically configured to:

15. The apparatus of claim 10, further comprising:

16. The apparatus according to claim 15, wherein the first input unit determines that the operation triggering input of the input text is acquired when it acquires that the quality of the voice input during the user operation in the voice input state has not met the recognition requirement.

17. The apparatus of claim 10, further comprising:

18. The apparatus of claim 14, 16 or 17, wherein the user operation that triggers the voice input state comprises: pressing a voice input function button on the voice input function interface;