
WO2016187910A1 - Voice-to-text conversion method and device, and storage medium - Google Patents

Info

Publication number
WO2016187910A1
WO2016187910A1 (PCT/CN2015/081688)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
text
voice information
microphones
Prior art date
Application number
PCT/CN2015/081688
Other languages
French (fr)
Chinese (zh)
Inventor
吴建明
Original Assignee
西安中兴新软件有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安中兴新软件有限责任公司 filed Critical 西安中兴新软件有限责任公司
Publication of WO2016187910A1 publication Critical patent/WO2016187910A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Definitions

  • the present invention relates to information conversion technology, and in particular, to a voice text conversion method and device, and a storage medium.
  • with an improved microphone signal-to-noise ratio and a sensible design layout, the high-definition recording level of a professional voice recorder can be achieved on a mobile phone.
  • ADC Analog-to-Digital Converter
  • recording quality is assured, the voice-to-text engine achieves a high recognition rate, and recording-to-text conversion has reached a fully commercial level.
  • at present, the voice-to-text function of mobile phones is rudimentary: it can only roughly convert a stretch of speech into text, and hardware or software limitations keep the recognition rate low. The speaker cannot be identified, so when several people speak at once the converted text cannot be labelled by speaker. A long recording, such as a conference, a class lecture, or a group discussion, can only be converted into one undifferentiated block of text, with no structure and no way to separate the voices; this falls far short of a high-quality, efficient design and degrades human-machine interactivity.
  • current mobile phones install voice-to-text applications (APPs) that collect voice through a microphone, upload it to the cloud over the network, and convert it to text with a cloud engine.
  • APP voice-to-text application
  • in practice the recognition rate is low, the pickup distance is short, the conversion quality is mediocre, and the user experience is poor.
  • in summary, the voice-to-text function in current mobile phones can only handle a single voice, requires a connection to a cloud server, has a low recognition rate, cannot recognize and separate several people speaking at once, and cannot classify the converted text by speaker.
  • an embodiment of the present invention provides a voice text conversion method and device, and a storage medium.
  • the voice information corresponding to each user is converted into corresponding text information.
  • before the voice information collected by the microphones is analyzed and processed, the method further includes:
  • the voice information collected by the microphones is analyzed and processed to obtain sound source characteristic parameters of each user, including:
  • the sound source characteristic parameters of each user are calculated according to the time difference in which the respective microphones receive the concurrent speech.
  • the method further includes:
  • the text information corresponding to each user is displayed by category.
  • the method further includes:
  • the text information corresponding to one or more users is displayed in categories.
  • the information collecting unit is configured to collect voice information of one or more users by using two or more microphones;
  • the voice analyzing unit is configured to analyze and process the voice information collected by the microphones to obtain sound source characteristic parameters of each user, and to classify the collected voice information according to the sound source characteristic parameters of each user, obtaining the voice information corresponding to each user;
  • the voice text conversion unit is configured to convert the voice information corresponding to each user into corresponding text information.
  • the device further includes:
  • the noise filtering unit is configured to filter out background noise in the voice information collected by the microphones.
  • the voice analysis unit includes:
  • the analyzing subunit is configured to analyze the voice information collected by the microphones to obtain a time difference between the received voices of the microphones;
  • the calculating subunit is configured to calculate a sound source characteristic parameter of each user according to a time difference in which the respective microphones receive the concurrent speech.
  • the device further includes:
  • the display unit is configured to display the text information corresponding to each user separately.
  • the device further includes:
  • the display unit is configured to display, according to the selected user identifier, text information corresponding to one or more users respectively.
  • a storage medium storing a computer program configured to perform the aforementioned method for converting a voice text.
  • the voice text conversion device has high-performance hardware, including N (N ≥ 2) sensibly laid-out high-SNR microphones forming a microphone array, a high-performance ADC, and a high-performance digital signal processor (DSP).
  • the device can collect high-definition voice information. While collecting, it distinguishes each user's spoken content by computing sound source characteristic parameters such as the user's angle and distance; when another person speaks at the same time, a different set of source parameters is computed to tell them apart, so the voice information of each user is separated according to its sound source characteristic parameters.
  • when converting voice to text, a local voice engine converts each user's voice information into the corresponding text without connecting to the cloud, thereby solving the problem of converting voice into text classified by user when several people speak at the same time.
  • FIG. 1 is a schematic flowchart of a voice text conversion method according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of a voice collection scenario according to an embodiment of the present invention;
  • FIG. 3 is a first schematic diagram of a classified text conversion interface according to an embodiment of the present invention;
  • FIG. 4 is a second schematic diagram of a classified text conversion interface according to an embodiment of the present invention;
  • FIG. 5 is a third schematic diagram of a classified text conversion interface according to an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a voice text conversion device according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a voice text conversion method according to an embodiment of the present invention.
  • the voice text conversion method in this example is applied to a voice text conversion device.
  • the voice text conversion method includes the following steps:
  • Step 101 Acquire voice information of more than one user by using two or more microphones.
  • the voice text conversion device may be an electronic device such as a mobile phone, a tablet computer, or a notebook computer.
  • the voice text conversion device has high-performance hardware, including N (N ≥ 2) sensibly laid-out high-SNR microphones forming a microphone array, a high-performance ADC, and a high-performance digital signal processor (DSP).
  • N (N ≥ 2)
  • ADC Analog-to-Digital Converter
  • DSP Digital Signal Processor
  • when more than one user speaks to the voice text conversion device at the same time, all of its microphones (two or more) start up and capture the users' voice information. Thus, for each microphone, the collected signal is a mixture of several users' voices.
  • the example of the present invention aims to separate the voice information of different users so that each user's voice information can be converted to text separately.
  • Step 102 Perform analysis processing on the voice information collected by the microphones to obtain sound source characteristic parameters of each user.
  • the background noise in the voice information collected by the microphones is filtered out before the analysis and processing of the voice information collected by the microphones.
  • the background noise in the speech information is filtered out.
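As a toy illustration of this pre-filtering step (the patent does not specify a filtering algorithm, so this simple energy gate is an illustrative stand-in): estimate the noise floor from leading frames assumed to contain only background noise, then zero out any frame whose energy stays near that floor.

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def gate_noise(samples, frame_len=4, noise_frames=2, margin=2.0):
    """Suppress frames whose energy is close to the estimated noise floor.

    The first `noise_frames` frames are assumed to hold background noise
    only; any frame below `margin` times that RMS is zeroed out.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    noise_rms = max(frame_rms(f) for f in frames[:noise_frames])
    out = []
    for f in frames:
        if frame_rms(f) <= margin * noise_rms:
            out.extend([0.0] * len(f))   # treated as background noise
        else:
            out.extend(f)                # kept as speech
    return out

# Quiet hiss followed by a louder "speech" burst.
signal = [0.01, -0.02, 0.015, -0.01, 0.02, -0.015, 0.01, -0.02,
          0.8, -0.7, 0.9, -0.85, 0.75, -0.8, 0.85, -0.9]
cleaned = gate_noise(signal)
print(cleaned[:8])   # noise frames zeroed
print(cleaned[8:])   # speech frames kept unchanged
```

A real device would use the DSP for spectral noise suppression; the gate above only conveys the idea of removing background noise before the time-difference analysis.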
  • the voice information collected by each microphone is analyzed to obtain the time difference with which each microphone receives the same speech; from these time differences, the sound source characteristic parameters of each user are calculated.
  • here, concurrent speech refers to the same utterance as received by the different microphones.
  • for example, user A says a “hello” voice.
  • the voice text conversion device has two microphones; since microphone 1 and microphone 2 are at different positions, there is a time difference between the moment microphone 1 receives the “hello” voice and the moment microphone 2 receives it.
  • the two “hello” voices in microphone 1 and microphone 2 are concurrent speech.
  • assuming the position coordinates of user A are (x1, y1), the positions of microphone 1 and microphone 2 and the analyzed time difference of the concurrent speech are known, so the position of user A can be calculated, and this position determines A's sound source characteristic parameters.
  • the sound source characteristic parameters may be parameters such as the user's angle and distance relative to the microphones, which can be characterized by the user's position coordinates.
  • similarly, user B says a “pretty” voice.
  • the voice text conversion device has two microphones; since microphone 1 and microphone 2 are at different positions, they receive the “pretty” voice at different moments, with a time difference.
  • the two “pretty” voices in microphone 1 and microphone 2 are concurrent speech. Assuming the position coordinates of user B are (x2, y2), the positions of microphone 1 and microphone 2 and the analyzed time difference of the concurrent speech are known, so the position of user B can be calculated to determine B's sound source characteristic parameters.
  • Step 103 Classify the collected voice information according to the sound source characteristic parameters of each user, and obtain voice information corresponding to each user.
  • different users are at different physical locations, so their sound source characteristic parameters differ. The voice information of multiple users can therefore be classified according to these parameters, yielding the voice information corresponding to each user.
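A minimal sketch of this classification step, assuming each speech segment has already been tagged with an estimated source angle. The patent does not specify a grouping algorithm; the nearest-cluster rule below is an illustrative stand-in.

```python
def classify_segments(segments, tolerance=5.0):
    """Group speech segments by their estimated source angle.

    Each segment is (angle_degrees, payload); a segment within `tolerance`
    degrees of an existing speaker cluster joins that speaker, otherwise
    it starts a new speaker.
    """
    speakers = []    # representative angle per speaker (first one seen)
    by_speaker = {}  # speaker index -> list of payloads
    for angle, payload in segments:
        for idx, ref in enumerate(speakers):
            if abs(angle - ref) <= tolerance:
                by_speaker[idx].append(payload)
                break
        else:
            speakers.append(angle)
            by_speaker[len(speakers) - 1] = [payload]
    return by_speaker

# Interleaved segments from two directions (~30 deg and ~-40 deg).
mixed = [(29.5, "hello"), (-40.2, "pretty"), (30.4, "how are you"),
         (-39.1, "thanks")]
print(classify_segments(mixed))
# {0: ['hello', 'how are you'], 1: ['pretty', 'thanks']}
```

Each resulting group corresponds to one user's voice information and can then be fed to the voice-to-text step separately.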
  • Step 104 Convert the voice information corresponding to each user into corresponding text information.
  • the voice information corresponding to each user may be converted into corresponding text information by using a local voice engine.
  • the text information corresponding to each user is displayed in a classified manner.
  • the text information corresponding to one or more users is displayed in a category.
  • the technical solution of the embodiment of the present invention uses a local voice engine, with no cloud connection, to convert each user's voice information into the corresponding text, thereby solving the problem of converting voice into text classified by user in scenarios where several people speak at the same time.
  • the voice text conversion method of the embodiment of the present invention is described below step by step in combination with specific application scenarios.
  • the device includes microphone 1 and microphone 2; suppose A and B hold a discussion in turn, or A, B, and C speak alternately.
  • the voice information conversion device of the embodiment of the present invention sequentially passes the collected voice information through the information collection unit, the voice analysis unit, and the voice text conversion unit.
  • the device can separate the voice and text of the three persons A, B, and C, and the user can choose to generate the voice and text of A, of B, or of C alone.
  • the classification processing text result shown in FIG. 3 is formed.
  • in a conference speech or keynote scenario where A is the presenter, the technical solution of the embodiment of the present invention can retain only the presenter A's voice, convert only A's voice into text, and remove the voices of B and C.
  • the classification processing text result shown in FIG. 4 is formed.
  • in the question-and-answer part of a meeting, A as the presenter may need to interact with other members while speaking; the interaction between presenter A and questioner B can then be collected and converted to text in chronological order.
  • the classification processing text result shown in FIG. 5 is formed.
  • FIG. 6 is a schematic structural diagram of a voice text conversion device according to an embodiment of the present invention. As shown in FIG. 6, the device includes:
  • the information collecting unit 61 is configured to collect voice information of one or more users by using two or more microphones;
  • the voice analyzing unit 62 is configured to analyze and process the voice information collected by the microphones to obtain sound source characteristic parameters of each user, and to classify the collected voice information according to the sound source characteristic parameters of each user, obtaining the voice information corresponding to each user;
  • the voice text conversion unit 63 is configured to convert the voice information corresponding to each user into corresponding text information.
  • the device further includes:
  • the noise filtering unit 64 is configured to filter out background noise in the voice information collected by the respective microphones.
  • the voice analyzing unit 62 includes:
  • the analyzing sub-unit 621 is configured to analyze the voice information collected by the microphones to obtain a time difference between the received voices of the microphones;
  • the calculating sub-unit 622 is configured to calculate a sound source characteristic parameter of each user according to a time difference that the respective microphones receive the concurrent speech;
  • the classification sub-unit 623 is configured to classify the collected voice information according to the sound source characteristic parameters of the users, and obtain voice information corresponding to each user.
  • the device further includes:
  • the display unit 65 is configured to classify and display the text information corresponding to each of the users.
  • the display unit 65 is further configured to display, according to the selected user identifier, text information corresponding to one or more users.
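The three main units (information collecting unit 61, voice analyzing unit 62, voice text conversion unit 63) can be sketched as a toy pipeline. The recognizer below is a trivial stand-in for the local voice engine, and the per-segment source identifiers are assumed to have been produced by the analysis described above; none of these names come from the patent.

```python
class VoiceToTextDevice:
    """Toy pipeline mirroring Fig. 6: collect -> analyze/classify -> convert."""

    def __init__(self, recognizer):
        self.recognizer = recognizer  # callable: audio segment -> text

    def collect(self, mic_streams):
        # Information collecting unit 61: merge per-microphone captures.
        return [seg for stream in mic_streams for seg in stream]

    def analyse(self, segments):
        # Voice analyzing unit 62: group segments by source identifier
        # (here the sound source parameter is precomputed per segment).
        per_user = {}
        for source_id, audio in segments:
            per_user.setdefault(source_id, []).append(audio)
        return per_user

    def convert(self, per_user):
        # Voice text conversion unit 63: run the "local engine" per user.
        return {user: [self.recognizer(a) for a in audio]
                for user, audio in per_user.items()}

device = VoiceToTextDevice(recognizer=str.upper)  # fake local engine
streams = [[("A", "hello"), ("B", "pretty")], [("A", "bye")]]
result = device.convert(device.analyse(device.collect(streams)))
print(result)  # {'A': ['HELLO', 'BYE'], 'B': ['PRETTY']}
```

The display unit 65 would then render `result` either fully classified by user or filtered to the selected user identifiers.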
  • the embodiment of the invention further describes a storage medium in which a computer program is stored, the computer program being configured to execute the voice text conversion method of the foregoing embodiments.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • in actual implementation there may be other division manners, for example: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated into one unit;
  • the unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the foregoing program may be stored in a computer readable storage medium; when executed, the program performs the steps of the foregoing method embodiments.
  • the foregoing storage medium includes media that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • the integrated unit of the present invention described above may, if implemented in the form of a software function module and sold or used as a standalone product, be stored in a computer readable storage medium.
  • based on such an understanding, the technical solution of the embodiments of the present invention may, in essence, be embodied in the form of a software product stored in a storage medium and including a plurality of instructions.
  • a computer device (which may be a personal computer, server, or network device, etc.) is caused to perform all or part of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes: a removable storage device, a read only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, and the like, which can store program codes.
  • the invention separates the voice information of each user according to different sound source characteristic parameters.
  • when converting voice to text, a local voice engine converts each user's voice information into the corresponding text without connecting to the cloud, thereby solving the problem of converting voice into text classified by user when several people speak at the same time.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice-to-text conversion method and device. The method comprises: using two or more microphones to acquire voice information about one or more users (101); analysing and processing the voice information acquired by the various microphones to obtain sound source feature parameters of various users (102); classifying the acquired voice information according to the sound source feature parameters of the various users to obtain voice information respectively corresponding to the various users (103); and converting the voice information respectively corresponding to the various users into corresponding text information (104).

Description

Voice text conversion method and device, and storage medium

Technical Field

The present invention relates to information conversion technology, and in particular to a voice text conversion method and device, and a storage medium.

Background

As smart terminals, mobile phones are becoming ever more intelligent, and the demand for human-computer interaction keeps growing. Voice, as a basic medium of human-computer interaction, plays an irreplaceable role. With a new generation of voice-enabled phones, the owner can control all kinds of phone operations entirely by voice command, such as making calls, reading and writing text messages, and opening applications; tapping the deeper potential of voice is bound to become a trend in voice products.

With the improved performance of the phone's recording-chip analog-to-digital converter (ADC) and a higher microphone signal-to-noise ratio, a sensible design layout now allows a mobile phone to reach the high-definition recording level of a professional voice recorder. Recording quality is assured, the accompanying voice-to-text engine achieves a high recognition rate, and recording-to-text conversion has reached a fully commercial level.

At present, the voice-to-text function of mobile phones is rudimentary: it can only roughly convert a stretch of speech into text, and hardware or software limitations keep the recognition rate low. The speaker cannot be identified, so when several people speak at once the converted text cannot be labelled by speaker. A long recording, such as a conference, a class lecture, or a group discussion, can only be converted into one undifferentiated block of text, with no structure and no way to separate the voices; this falls far short of a high-quality, efficient design and degrades human-machine interactivity.

Moreover, current mobile phones install voice-to-text applications (APPs) that collect voice through a microphone, upload it to the cloud over the network, and convert it to text with a cloud engine. In practice the recognition rate is low, the pickup distance is short, the conversion quality is mediocre, and the user experience is poor.

In summary, the voice-to-text function in current mobile phones can only handle a single voice, requires a connection to a cloud server, has a low recognition rate, cannot recognize and separate several people speaking at once, and cannot classify the converted text by speaker.
Summary of the Invention

To solve the above technical problem, embodiments of the present invention provide a voice text conversion method and device, and a storage medium.

The voice text conversion method provided by an embodiment of the present invention includes:

collecting voice information of more than one user by using two or more microphones;

analyzing and processing the voice information collected by the microphones to obtain sound source characteristic parameters of each user;

classifying the collected voice information according to the sound source characteristic parameters of each user to obtain the voice information corresponding to each user; and

converting the voice information corresponding to each user into corresponding text information.

In an embodiment of the present invention, before the voice information collected by the microphones is analyzed and processed, the method further includes:

filtering out background noise in the voice information collected by the microphones.

In an embodiment of the present invention, analyzing and processing the voice information collected by the microphones to obtain the sound source characteristic parameters of each user includes:

analyzing the voice information collected by each microphone to obtain the time differences with which the microphones receive concurrent speech; and

calculating the sound source characteristic parameters of each user according to the time differences with which the microphones receive the concurrent speech.

In an embodiment of the present invention, after the voice information corresponding to each user is converted into corresponding text information, the method further includes:

displaying the text information corresponding to each user by category.

In an embodiment of the present invention, after the voice information corresponding to each user is converted into corresponding text information, the method further includes:

displaying, by category and according to the selected user identifier, the text information corresponding to one or more users.
The voice text conversion device provided by an embodiment of the present invention includes:

an information collecting unit configured to collect voice information of more than one user by using two or more microphones;

a voice analyzing unit configured to analyze and process the voice information collected by the microphones to obtain sound source characteristic parameters of each user, and to classify the collected voice information according to the sound source characteristic parameters of each user, obtaining the voice information corresponding to each user; and

a voice text conversion unit configured to convert the voice information corresponding to each user into corresponding text information.

In an embodiment of the present invention, the device further includes:

a noise filtering unit configured to filter out background noise in the voice information collected by the microphones.

In an embodiment of the present invention, the voice analyzing unit includes:

an analyzing subunit configured to analyze the voice information collected by the microphones to obtain the time differences with which the microphones receive concurrent speech; and

a calculating subunit configured to calculate the sound source characteristic parameters of each user according to the time differences with which the microphones receive the concurrent speech.

In an embodiment of the present invention, the device further includes:

a display unit configured to display the text information corresponding to each user by category.

In an embodiment of the present invention, the device further includes:

a display unit configured to display, by category and according to the selected user identifier, the text information corresponding to one or more users.

An embodiment of the present invention also provides a storage medium storing a computer program configured to perform the foregoing voice text conversion method.
In the technical solution of the embodiments of the present invention, the voice text conversion device has high-performance hardware, including N (N ≥ 2) sensibly laid-out high-SNR microphones forming a microphone array, a high-performance ADC, and a high-performance digital signal processor (DSP). The device can collect high-definition voice information. While collecting voice information, it distinguishes each user's spoken content by computing sound source characteristic parameters such as the user's angle and distance; when another person speaks at the same time, a different set of source parameters is computed to tell them apart, so the voice information of each user is separated according to its sound source characteristic parameters. When converting voice to text, a local voice engine converts each user's voice information into the corresponding text by category, with no cloud connection, thereby solving the problem of converting voice into text classified by user in scenarios where several people speak at the same time.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a voice text conversion method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a voice collection scenario according to an embodiment of the present invention;

FIG. 3 is a first schematic diagram of a classified text conversion interface according to an embodiment of the present invention;

FIG. 4 is a second schematic diagram of a classified text conversion interface according to an embodiment of the present invention;

FIG. 5 is a third schematic diagram of a classified text conversion interface according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a voice text conversion device according to an embodiment of the present invention.
DETAILED DESCRIPTION
To provide a more detailed understanding of the features and technical content of the embodiments of the present invention, their implementation is described in detail below with reference to the accompanying drawings, which are provided for reference and illustration only and are not intended to limit the embodiments of the present invention.
FIG. 1 is a schematic flowchart of a voice-to-text conversion method according to an embodiment of the present invention. The method in this example is applied to a voice-to-text conversion device. As shown in FIG. 1, the method includes the following steps:
Step 101: Collect voice information of one or more users by using two or more microphones.
In the embodiments of the present invention, the voice-to-text conversion device may be an electronic device such as a mobile phone, a tablet computer, or a notebook computer.
In the embodiments of the present invention, the voice-to-text conversion device has high-performance hardware, including: N (N ≥ 2) reasonably laid-out, high signal-to-noise-ratio microphones forming a microphone array; a high-performance analog-to-digital converter (ADC); and a high-performance digital signal processor (DSP).
In the embodiments of the present invention, when one or more users input voice information to the voice-to-text conversion device at the same time, the two or more microphones in the device are all activated and collect the voice information of the one or more users. Thus, for each microphone, the collected voice information is a mixture of the voices of multiple users. The examples of the present invention aim to separate the voice information of the different users, so that each user's voice information can be converted to text individually.
Step 102: Analyze and process the voice information collected by each microphone to obtain the sound source characteristic parameters of each user.
In the embodiments of the present invention, before the voice information collected by the microphones is analyzed and processed, the background noise in the collected voice information is filtered out. Here, the filtering is performed in order to eliminate non-speech (non-human-voice) noise.
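The embodiment does not specify how the background noise is removed; one common choice is spectral subtraction, sketched below with NumPy. The function name, the use of a separate noise-only sample, and the subtraction strength are illustrative assumptions, not part of the disclosed method.

```python
import numpy as np

def filter_background_noise(signal, noise_sample, strength=1.0):
    """Suppress stationary background noise by spectral subtraction.

    A noise-only sample (e.g. a pause before anyone speaks) supplies the
    noise magnitude spectrum, which is subtracted from the magnitude
    spectrum of the noisy signal; the original phase is kept. This is an
    illustrative choice -- the embodiment only states that background
    noise is filtered out before analysis.
    """
    spectrum = np.fft.rfft(signal)
    noise_mag = np.abs(np.fft.rfft(noise_sample, n=len(signal)))
    cleaned_mag = np.maximum(np.abs(spectrum) - strength * noise_mag, 0.0)
    cleaned = cleaned_mag * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(cleaned, n=len(signal))
```

In practice the noise estimate would be refreshed whenever no user is speaking, so that slowly changing room noise keeps being tracked.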
In the embodiments of the present invention, the voice information collected by the microphones is analyzed to obtain the time differences at which the microphones receive the concurrent speech; the sound source characteristic parameters of each user are then calculated from these time differences.
Specifically, concurrent speech refers to the same speech arriving at different microphones. For example, user A says "hello", and the voice-to-text conversion device has two microphones. Because microphone 1 and microphone 2 are at different positions, microphone 1 and microphone 2 receive the "hello" speech at different moments, i.e., with a time difference. Here, the two "hello" signals at microphone 1 and microphone 2 are concurrent speech. Assuming that the position coordinates of user A are (x1, y1), and given the positions of microphone 1 and microphone 2 and the time difference obtained from analyzing the concurrent speech, the position of user A can be calculated, and the sound source characteristic parameters can then be determined. Here, the sound source characteristic parameters may be parameters such as the user's angle and distance relative to the microphones, which can be represented by the user's position coordinates. Similarly, user B says "pretty"; because microphone 1 and microphone 2 are at different positions, they receive the "pretty" speech at different moments, i.e., with a time difference. Here, the two "pretty" signals at microphone 1 and microphone 2 are concurrent speech. Assuming that the position coordinates of user B are (x2, y2), and given the positions of microphone 1 and microphone 2 and the time difference obtained from analyzing the concurrent speech, the position of user B can be calculated, and the sound source characteristic parameters can then be determined.
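The time-difference computation of step 102 can be sketched as follows, under a far-field assumption for a two-microphone pair: the delay between the channels fixes the direction of arrival. Recovering a full position such as (x1, y1), as in the example above, would need additional microphones or distance information beyond what this sketch shows. The cross-correlation estimator and the 343 m/s sound speed are standard choices, not values taken from the disclosure.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, dry air at roughly 20 degrees C (assumed)

def concurrent_speech_delay(mic1, mic2, sample_rate):
    """Time difference (seconds) between the two channels' copies of the
    same ('concurrent') speech, estimated by cross-correlation.
    Positive means microphone 1 heard the speech later."""
    corr = np.correlate(mic1, mic2, mode="full")
    lag = int(np.argmax(corr)) - (len(mic2) - 1)
    return lag / sample_rate

def arrival_angle_deg(delay, mic_spacing):
    """Far-field direction of the speaker relative to the broadside of
    the microphone pair, derived from the path difference c * delay."""
    sin_theta = np.clip(SPEED_OF_SOUND * delay / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

For the example above, user A's "hello" delay yields one angle and user B's "pretty" delay a different one; those per-user angles can serve as the sound source characteristic parameters used in step 103.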
Step 103: Classify the collected voice information according to the sound source characteristic parameters of each user to obtain the voice information corresponding to each user.
In the embodiments of the present invention, different users are at different geographic positions, so the sound source characteristic parameters of different users differ. Therefore, the mixed voice information of multiple users can be classified according to the sound source characteristic parameters, thereby obtaining the voice information corresponding to each user.
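A minimal sketch of step 103's classification, assuming each speech segment has already been tagged with an estimated direction-of-arrival angle as its sound source characteristic parameter. Grouping by a fixed angular tolerance is an illustrative simplification; the embodiment does not say how close two parameter values must be to count as the same user.

```python
def classify_by_source(segments, angle_tolerance=10.0):
    """Group speech segments by speaker.

    `segments` is a chronological list of (angle_deg, payload) pairs,
    where payload is the audio (or text) of one segment. A segment whose
    angle falls within `angle_tolerance` of an already-seen speaker is
    assigned to that speaker; otherwise a new speaker is created.
    Returns {speaker_index: [payload, ...]}.
    """
    speaker_angles = []   # representative angle of each discovered speaker
    per_speaker = {}
    for angle, payload in segments:
        for idx, ref in enumerate(speaker_angles):
            if abs(angle - ref) <= angle_tolerance:
                per_speaker[idx].append(payload)
                break
        else:  # no existing speaker matched: register a new one
            speaker_angles.append(angle)
            per_speaker[len(speaker_angles) - 1] = [payload]
    return per_speaker
```

In the three-person conference of FIG. 2, A, B, and C would each occupy a distinct angle, so their interleaved segments end up in three separate groups.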
Step 104: Convert the voice information corresponding to each user into corresponding text information.
In the embodiments of the present invention, the voice information corresponding to each user may be converted into corresponding text information by a local speech engine.
In the embodiments of the present invention, after the voice information corresponding to each user is converted into corresponding text information, the text information corresponding to each user is displayed by category.
Alternatively, the text information corresponding to one or more users is displayed by category according to the selected user identifier(s).
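The two display modes just described (all users, or only selected user identifiers) can be sketched together. The `(user_id, text)` utterance format and the rendering as `user: text` lines are illustrative assumptions, not part of the disclosed interface.

```python
def render_transcript(utterances, selected_users=None):
    """Render a classified transcript from chronological (user_id, text)
    utterances. With selected_users=None every user's text is shown;
    with a set of user identifiers, only those users' lines are kept --
    e.g. keeping only presenter A while suppressing B and C."""
    lines = [
        f"{user}: {text}"
        for user, text in utterances
        if selected_users is None or user in selected_users
    ]
    return "\n".join(lines)
```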
When converting voice to text, the technical solutions of the embodiments of the present invention use a local speech engine, without connecting to the cloud, to convert each user's voice information into corresponding text by category, thereby solving the problem of converting voice into text classified by user in a scenario where multiple people speak at the same time.
The voice-to-text conversion method of the embodiments of the present invention is further described below with reference to specific application scenarios.
Referring to FIG. 2, in a multi-person conference scenario with three or more participants, taking three persons A, B, and C as an example, the device (a mobile phone) includes microphone 1 and microphone 2. A and B may discuss alternately, or A, B, and C may speak alternately. Using the voice-to-text conversion device of the embodiments of the present invention, the collected voice information passes in turn through the information collection unit, the voice analysis unit, and the voice-to-text conversion unit. The device can separate the voice and text of A, B, and C, and the user can choose to generate the voice and text of A, B, or C, forming the classified text result shown in FIG. 3.
Referring to FIG. 2, in a conference lecture or keynote scenario where A is the presenter, when the transcription should keep A as the presenter and suppress the voices of B and C, the technical solutions of the embodiments of the present invention can retain only the voice of presenter A, converting only A's voice into text and discarding the voices of B and C, forming the classified text result shown in FIG. 4.
Referring to FIG. 2, in a question-and-answer session of a conference, presenter A may need to interact with other members while speaking. In this case, the interaction between presenter A and questioner B can be collected and converted to text in chronological order, forming the classified text result shown in FIG. 5.
FIG. 6 is a schematic diagram of the structure of a voice-to-text conversion device according to an embodiment of the present invention. As shown in FIG. 6, the device includes:
an information collection unit 61, configured to collect voice information of one or more users by using two or more microphones;
a voice analysis unit 62, configured to analyze and process the voice information collected by the microphones to obtain the sound source characteristic parameters of each user, and to classify the collected voice information according to the sound source characteristic parameters of each user to obtain the voice information corresponding to each user; and
a voice-to-text conversion unit 63, configured to convert the voice information corresponding to each user into corresponding text information.
In an embodiment of the present invention, the device further includes:
a noise filtering unit 64, configured to filter out background noise in the voice information collected by the microphones.
In an embodiment of the present invention, the voice analysis unit 62 includes:
an analysis subunit 621, configured to analyze the voice information collected by the microphones to obtain the time differences at which the microphones receive concurrent speech;
a calculation subunit 622, configured to calculate the sound source characteristic parameters of each user according to the time differences at which the microphones receive the concurrent speech; and
a classification subunit 623, configured to classify the collected voice information according to the sound source characteristic parameters of each user to obtain the voice information corresponding to each user.
In an embodiment of the present invention, the device further includes:
a display unit 65, configured to display the text information corresponding to each user by category.
The display unit 65 is further configured to display, by category and according to the selected user identifier, the text information corresponding to one or more users.
Those skilled in the art should understand that the functions implemented by the units and subunits of the voice-to-text conversion device shown in FIG. 6 can be understood with reference to the foregoing description of the voice-to-text conversion method.
An embodiment of the present invention further describes a storage medium storing a computer program, the computer program being configured to execute the voice-to-text conversion method of the foregoing embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device and method may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function; in actual implementation there may be other divisions, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connections between the components shown or discussed may be indirect coupling or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be performed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
INDUSTRIAL APPLICABILITY
The present invention separates the voice information of each user according to their different sound source characteristic parameters. When converting voice to text, a local speech engine converts each user's voice information into corresponding text by category, without connecting to the cloud, thereby solving the problem of converting voice into text classified by user in a scenario where multiple people speak at the same time.

Claims (11)

  1. A voice-to-text conversion method, the method comprising:
    collecting voice information of one or more users by using two or more microphones;
    analyzing and processing the voice information collected by the microphones to obtain sound source characteristic parameters of each user;
    classifying the collected voice information according to the sound source characteristic parameters of each user to obtain voice information corresponding to each user; and
    converting the voice information corresponding to each user into corresponding text information.
  2. The voice-to-text conversion method according to claim 1, wherein before the analyzing and processing of the voice information collected by the microphones, the method further comprises:
    filtering out background noise in the voice information collected by the microphones.
  3. The voice-to-text conversion method according to claim 1, wherein the analyzing and processing of the voice information collected by the microphones to obtain the sound source characteristic parameters of each user comprises:
    analyzing the voice information collected by the microphones to obtain time differences at which the microphones receive concurrent speech; and
    calculating the sound source characteristic parameters of each user according to the time differences at which the microphones receive the concurrent speech.
  4. The voice-to-text conversion method according to any one of claims 1 to 3, wherein after the converting of the voice information corresponding to each user into corresponding text information, the method further comprises:
    displaying the text information corresponding to each user by category.
  5. The voice-to-text conversion method according to any one of claims 1 to 3, wherein after the converting of the voice information corresponding to each user into corresponding text information, the method further comprises:
    displaying, by category and according to a selected user identifier, text information corresponding to one or more users.
  6. A voice-to-text conversion device, the device comprising:
    an information collection unit, configured to collect voice information of one or more users by using two or more microphones;
    a voice analysis unit, configured to analyze and process the voice information collected by the microphones to obtain sound source characteristic parameters of each user, and to classify the collected voice information according to the sound source characteristic parameters of each user to obtain voice information corresponding to each user; and
    a voice-to-text conversion unit, configured to convert the voice information corresponding to each user into corresponding text information.
  7. The voice-to-text conversion device according to claim 6, wherein the device further comprises:
    a noise filtering unit, configured to filter out background noise in the voice information collected by the microphones.
  8. The voice-to-text conversion device according to claim 6, wherein the voice analysis unit comprises:
    an analysis subunit, configured to analyze the voice information collected by the microphones to obtain time differences at which the microphones receive concurrent speech; and
    a calculation subunit, configured to calculate the sound source characteristic parameters of each user according to the time differences at which the microphones receive the concurrent speech.
  9. The voice-to-text conversion device according to any one of claims 6 to 8, wherein the device further comprises:
    a display unit, configured to display the text information corresponding to each user by category.
  10. The voice-to-text conversion device according to any one of claims 6 to 8, wherein the device further comprises:
    a display unit, configured to display, by category and according to a selected user identifier, text information corresponding to one or more users.
  11. A storage medium storing a computer program, wherein the computer program is configured to execute the voice-to-text conversion method according to any one of claims 1 to 5.
PCT/CN2015/081688 2015-05-22 2015-06-17 Voice-to-text conversion method and device, and storage medium WO2016187910A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510266912.8A CN106297794A (en) 2015-05-22 2015-05-22 The conversion method of a kind of language and characters and equipment
CN201510266912.8 2015-05-22

Publications (1)

Publication Number Publication Date
WO2016187910A1 true WO2016187910A1 (en) 2016-12-01

Family

ID=57392481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/081688 WO2016187910A1 (en) 2015-05-22 2015-06-17 Voice-to-text conversion method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN106297794A (en)
WO (1) WO2016187910A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653042A (en) * 2016-12-13 2017-05-10 安徽声讯信息技术有限公司 Smart phone having voice stenography transliteration function

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107527623B (en) * 2017-08-07 2021-02-09 广州视源电子科技股份有限公司 Screen transmission method and device, electronic equipment and computer readable storage medium
CN107910006A (en) * 2017-12-06 2018-04-13 广州宝镜智能科技有限公司 Audio recognition method, device and multiple source speech differentiation identifying system
CN108053828A (en) * 2017-12-25 2018-05-18 无锡小天鹅股份有限公司 Determine the method, apparatus and household electrical appliance of control instruction
CN108847225B (en) * 2018-06-04 2021-01-12 上海智蕙林医疗科技有限公司 Robot for multi-person voice service in airport and method thereof
CN110875056B (en) * 2018-08-30 2024-04-02 阿里巴巴集团控股有限公司 Speech transcription device, system, method and electronic device
CN110648665A (en) * 2019-09-09 2020-01-03 北京左医科技有限公司 Session process recording system and method
CN110941737B (en) * 2019-12-06 2023-01-20 广州国音智能科技有限公司 Single-machine voice storage method, device and equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010037195A1 (en) * 2000-04-26 2001-11-01 Alejandro Acero Sound source separation using convolutional mixing and a priori sound source knowledge
CN1815556A (en) * 2005-02-01 2006-08-09 松下电器产业株式会社 Method and system capable of operating and controlling vehicle using voice instruction
US20090150146A1 (en) * 2007-12-11 2009-06-11 Electronics & Telecommunications Research Institute Microphone array based speech recognition system and target speech extracting method of the system
US20100070274A1 (en) * 2008-09-12 2010-03-18 Electronics And Telecommunications Research Institute Apparatus and method for speech recognition based on sound source separation and sound source identification
CN101882370A (en) * 2010-06-30 2010-11-10 中山大学 A voice recognition remote control
CN102074230A (en) * 2009-11-20 2011-05-25 索尼公司 Speech recognition device, speech recognition method, and program
CN104464750A (en) * 2014-10-24 2015-03-25 东南大学 Voice separation method based on binaural sound source localization

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009104332A1 (en) * 2008-02-19 2009-08-27 日本電気株式会社 Speech segmentation system, speech segmentation method, and speech segmentation program
JP5534413B2 (en) * 2010-02-12 2014-07-02 Necカシオモバイルコミュニケーションズ株式会社 Information processing apparatus and program
CN102592596A (en) * 2011-01-12 2012-07-18 鸿富锦精密工业(深圳)有限公司 Voice and character converting device and method
CN102509548B (en) * 2011-10-09 2013-06-12 清华大学 Audio indexing method based on multi-distance sound sensor
JP5791081B2 (en) * 2012-07-19 2015-10-07 日本電信電話株式会社 Sound source separation localization apparatus, method, and program
TWI502583B (en) * 2013-04-11 2015-10-01 Wistron Corp Apparatus and method for voice processing



Also Published As

Publication number Publication date
CN106297794A (en) 2017-01-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15892991; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15892991; Country of ref document: EP; Kind code of ref document: A1)