
CN114627854B - Speech recognition method, speech recognition system and storage medium - Google Patents


Info

Publication number
CN114627854B
CN114627854B (Application CN202011420932.3A)
Authority
CN
China
Prior art keywords
voice
frame
mute
recognition
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011420932.3A
Other languages
Chinese (zh)
Other versions
CN114627854A (en)
Inventor
朱云峰
严秋红
陆东明
张亮
董斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202011420932.3A priority Critical patent/CN114627854B/en
Publication of CN114627854A publication Critical patent/CN114627854A/en
Application granted granted Critical
Publication of CN114627854B publication Critical patent/CN114627854B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/285 Memory allocation or algorithm optimisation to reduce hardware requirements
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a speech recognition method, a speech recognition system, and a storage medium. The speech recognition method comprises: a voice stream processing step of receiving a voice stream and dividing it into voice frames; a voice frame processing step of performing silence judgment on the voice frames; and a voice recognition step of exchanging messages with a speech recognition engine according to the silence judgment result.

Description

Speech recognition method, speech recognition system, and storage medium
Technical Field
The present invention relates generally to the field of automated processing of speech, and more particularly to a speech recognition method, a speech recognition system, and a storage medium.
Background
Speech recognition technology has been widely applied in many aspects of production and daily life. In a call scenario, for example, the main applications of real-time speech recognition include, but are not limited to, intelligent applications in a call center, such as intelligent agent assistants and real-time quality inspection. These services rely on a speech recognition engine to transcribe the calling-party and called-party voices of a call into text, which then serves as the input to subsequent service processing modules; such scenarios place relatively high real-time requirements on speech recognition. A call is a two-party conversation, divided into a calling party and a called party. In the current implementation, a single call generally occupies the concurrency capacity of two speech recognition engine channels: one for the calling-party voice and one for the called-party voice. Under this mechanism, each engine channel supports a single voice stream. Converted to the number of calls: maximum number of calls supported by the speech recognition engine = speech recognition engine concurrency / 2.
With the rapid increase in the number of calls, there is also a greater number of concurrent demands on the speech recognition engine, which places greater demands on both hardware and software resources.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its purpose is to present some concepts related to the invention in a simplified form as a prelude to the more detailed description that is presented later.
When a person speaks, there are two alternating states, speaking and pausing, whose corresponding audio is speech and silence. In a two-person conversation, one party is usually listening while the other speaks. A call is such a two-party conversation: when the calling party speaks, the called party is silent, and vice versa. Under the current recognition scheme, if a call lasts 30 seconds, the speech recognition engine is occupied by two recognition channels, for 30 seconds each, one for the caller and one for the callee. However, single-channel speech in a conversation (either the calling party or the called party) typically contains long silence segments, and recognizing those silence segments is effectively a waste of speech recognition engine capacity.
The invention aims at the problem and provides a voice recognition method, a voice recognition system and a storage medium, which realize that only effective voice frames are transmitted to a voice recognition engine, and the engine recognition resources of a mute section are saved.
According to one aspect of the invention, a speech recognition method is provided, comprising: a voice stream processing step of receiving a voice stream and dividing it into voice frames; a voice frame processing step of performing silence judgment on the voice frames; and a voice recognition step of exchanging messages with a speech recognition engine according to the silence judgment result.
In an embodiment of the invention, the method further comprises a recognition result processing step, wherein after the recognition result is obtained from the voice recognition engine, the time position of the recognition result in the original voice stream is calculated.
In the embodiment of the invention, in the step of processing the recognition result, the time position of the recognition result in the original voice stream is calculated according to the time position returned by the voice recognition engine and the stored mute duration information.
In the embodiment of the invention, in the voice recognition step, the interaction message with the voice recognition engine comprises newly establishing a voice recognition session, sending a voice frame to be recognized, acquiring a recognition result and ending the voice recognition session.
In the embodiment of the invention, the newly-built session information of the voice recognition session comprises a session identifier, a voice identifier, voice call path information and a voice processing position.
In the embodiment of the present invention, in the voice frame processing step, silence judgment is performed on the voice frame through voice endpoint detection.
In the embodiment of the invention: if the current voice frame is not a silence frame and the previous frame was a silence frame, a speech recognition session is newly established; if the current frame is not a silence frame and the previous frame was also not a silence frame, speech recognition continues; if the current frame is a silence frame and the previous frame was not a silence frame, the speech recognition session is ended; and if the current frame is a silence frame and the previous frame was also a silence frame, the duration of the silence segment is accumulated.
In the embodiment of the invention, when the voice recognition session is newly established, the current concurrency number of the voice recognition engine is increased, and when the voice recognition session is ended, the current concurrency number of the voice recognition engine is reduced.
In an embodiment of the invention, if the concurrency number of the speech recognition engine reaches an upper limit, the new speech frame is buffered, and the early speech frame is discarded.
In an embodiment of the invention, the early speech frames exceeding the maximum duration of the speech frame buffer are discarded.
In the embodiment of the invention, the voice frames comprise a first frame, intermediate frames, and a last frame, which are processed as follows. For the first frame: if it is a silence frame, silence time is accumulated; if it is a non-silence frame, it is judged whether the speech recognition engine has spare recognition capacity; if spare capacity is available, speech recognition is performed; if capacity is full, the voice frame is buffered or discarded according to the buffering policy. For an intermediate frame: if it is a silence frame and the previous frame was also a silence frame, silence time is accumulated; if it is a silence frame and the previous frame was a non-silence frame, speech recognition is ended, the final recognition result is acquired, and the time position of the recognition result is calculated; if it is a non-silence frame and the previous frame was a silence frame, it is processed in the same way as a non-silence first frame; if it is a non-silence frame and the previous frame was also a non-silence frame, buffering or speech recognition continues according to the session state. For the last frame: if it is a silence frame, the buffer is cleared; if it is a non-silence frame, speech recognition is ended, the final recognition result is acquired, and the time position of the recognition result is calculated.
According to another aspect of the invention, a speech recognition system is provided, comprising a voice stream processing module that receives a voice stream and divides it into voice frames, a voice frame processing module that performs silence judgment on the voice frames, and a voice recognition module that exchanges messages with a speech recognition engine according to the silence judgment result.
According to yet another aspect of the present invention, there is provided a speech recognition system comprising a memory having instructions stored thereon, and a processor configured to execute the instructions stored on the memory to perform the speech recognition method described above.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the above-described speech recognition method.
According to the embodiment of the invention, only effective voice frames are transmitted to the voice recognition engine, so that the engine recognition resources of a mute segment are saved. This may effectively reduce the number of concurrency required by the speech recognition engine.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is an exemplary flowchart of a voice recognition method according to an embodiment of the present invention.
Fig. 2 is a detailed exemplary flowchart of processing steps for a speech frame in the speech recognition method of fig. 1 according to an embodiment of the present invention.
FIG. 3 illustrates an exemplary configuration of a computing device in which embodiments according to the invention may be implemented.
Detailed Description
The following detailed description is made with reference to the accompanying drawings and is provided to aid in a comprehensive understanding of various example embodiments of the invention. The following description includes various details to aid in understanding, but these are to be considered merely exemplary and not intended to limit the invention, which is defined by the appended claims and their equivalents. The words and phrases used in the following description are only intended to provide a clear and consistent understanding of the invention. In addition, descriptions of well-known structures, functions and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the spirit and scope of the invention.
Fig. 1 is an exemplary flowchart of a voice recognition method according to an embodiment of the present invention. The voice recognition method of the embodiment of the invention can comprise steps S101-S103.
As shown in fig. 1, step S101 is a voice stream processing step of receiving a voice stream and dividing the voice stream into voice frames.
After the voice stream is acquired, framing the voice stream according to a preset length. For example, a real-time speech stream is received, the number of bytes is calculated according to the audio sampling rate, the number of channels, the bit depth, and the frame duration, and the real-time speech stream is divided into N frames.
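The frame-size computation described above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation; the function and parameter names are chosen here for clarity.

```python
def frame_size_bytes(sample_rate_hz, channels, bit_depth, frame_seconds):
    """Bytes occupied by one frame of raw PCM audio."""
    return int(sample_rate_hz * channels * (bit_depth // 8) * frame_seconds)

def split_frames(pcm_bytes, frame_bytes):
    """Divide a PCM byte stream into fixed-length frames (last one may be short)."""
    return [pcm_bytes[i:i + frame_bytes] for i in range(0, len(pcm_bytes), frame_bytes)]

# Telephony audio as described later in the text: 8 kHz, mono, 16-bit,
# one-second frames, giving 16000 bytes per frame.
size = frame_size_bytes(8000, 1, 16, 1.0)
```

For a 32000-byte stream, `split_frames` would yield two one-second frames of 16000 bytes each.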
As shown in fig. 1, step S102 is a voice frame processing step, and performs mute judgment on the voice frame.
In some embodiments, the mute determination may be made on the speech frame by speech endpoint detection during the speech frame processing step.
For example, voice endpoint detection refers to finding silence segments from a piece of voice, i.e., finding the start and end points of non-silence segments from a voice signal.
In some embodiments: if the current frame is not a silence frame and the previous frame was a silence frame, a speech recognition session may be newly established; if neither the current frame nor the previous frame is a silence frame, speech recognition continues; if the current frame is a silence frame and the previous frame was not, the speech recognition session is ended; and if both the current frame and the previous frame are silence frames, the silence segment duration is accumulated.
The detection method performs silence judgment on each voice frame. If the frame is not a silence frame, whether a recognition session needs to be newly established or recognition should continue is determined by the state of the previous frame. If the frame is silent, whether to end a recognition session is determined by the previous frame; after the current session ends, the right to establish a new session is freed for another voice stream awaiting recognition. For example, if the voice frame is a silence frame and the previous frame was also silent, the silence segment is considered to continue, and only the silence time needs to be accumulated; if the previous frame was not a silence frame, this is regarded as the end point of a non-silence segment, and the speech recognition session is ended.
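The four (current frame, previous frame) cases can be sketched as a small decision function. This is an illustrative sketch only; the action names and the `session` dictionary are assumptions, not the patent's actual interface.

```python
def process_frame(is_silent, prev_silent, session):
    """Decide the action for one frame from the (current, previous) silence pair.

    `session` is a dict standing in for per-call cached state; the returned
    action names are illustrative labels, not a real API.
    """
    if not is_silent and prev_silent:
        return "start_session"        # non-silence begins: open a recognition session
    if not is_silent and not prev_silent:
        return "continue_recognition" # still inside a non-silence segment
    if is_silent and not prev_silent:
        return "end_session"          # non-silence ended: release the engine slot
    # both silent: only accumulate silence duration for later time alignment
    session["silence_seconds"] = session.get("silence_seconds", 0) + 1
    return "accumulate_silence"
```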
As shown in fig. 1, step S103 is a voice recognition step of interacting a message with a voice recognition engine according to the result of the mute judgment. For example, if a silence frame is determined, the frame may not be sent to the speech recognition engine. Only frames judged to be non-mute are sent to the speech recognition engine for recognition. This may save resources of the speech recognition engine.
For example, the silence judgment may adopt the following algorithm. The embodiment of the invention mainly uses the speech energy value as the speech feature for analysis, adopting a short-time energy method. The speech signal is first divided into frames; in this embodiment frame splitting is simplified by using a fixed frame length, and the energy of each frame is then computed. Let En denote the short-time energy of the n-th frame of the speech signal (represented by amplitude values x_n(m)). It is calculated as:
En = Σ_{m=0}^{N-1} x_n(m)², where m = 0, ..., N-1 indexes the N sampling points in the n-th frame.
En measures the variation of the signal's amplitude; because it uses the square of the amplitude, it is relatively sensitive to large signals. The difference between speech and the noise of silence segments is reflected in their energy. Since the signal-to-noise ratio of the speech signal in a call scenario is high, short-time energy alone suffices to distinguish speech segments (non-silence) from noise segments (silence). For example, frames whose short-time energy is below a preset threshold may be treated as silence segments.
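The short-time-energy silence check can be sketched directly from the formula above. This is a minimal sketch assuming signed 16-bit little-endian PCM; the threshold value is deployment-specific and would normally be tuned on representative call audio.

```python
import array

def short_time_energy(frame_bytes):
    """En: sum of squared 16-bit PCM sample amplitudes within one frame."""
    samples = array.array('h')      # 'h' = signed 16-bit integers
    samples.frombytes(frame_bytes)  # assumes little-endian PCM on this machine
    return sum(s * s for s in samples)

def is_silence(frame_bytes, threshold):
    """A frame whose short-time energy falls below the threshold is silence."""
    return short_time_energy(frame_bytes) < threshold
```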
For example, assuming silence accounts for 50% of the speech, and the invention recognizes only the non-silence, then in a 30-second call the speech engine is occupied for only 15 seconds. The saved 15 seconds of engine recognition capacity can be used for another call.
The voice recognition method of the embodiment of the present invention may further include step S104.
As shown in fig. 1, step S104 is a recognition result processing step of calculating the time position of a recognition result in the original voice stream after the recognition result is acquired from the voice recognition engine.
For example, in the recognition result processing step, the time position of the recognition result in the original voice stream is calculated according to the time position returned by the voice recognition engine and the stored mute duration information.
The time position returned by the speech engine may be the time position where the recognized result text is in the recognized speech segment.
That is, although periods of silence are not identified, the time information thereof may be recorded to accurately reproduce the correlation (e.g., interval time, etc.) of the respective periods of non-silence in time.
For example, while performing real-time speech recognition, a recognition result is obtained, where the recognition result includes a recognition text, and a front end point and a rear end point of the text. And adding the mute time stored in the buffer memory to the front end point to obtain the position of the recognition text result in the original audio.
Fig. 2 is a detailed exemplary flowchart of processing steps for a speech frame in the speech recognition method of fig. 1 according to an embodiment of the present invention. The flowchart is illustrated as an example of the more detailed processing in step S102 to step S104 in fig. 1.
For example, first, in step S101 of fig. 1, the byte stream is divided into one-second frames. The voice stream of a call is typically sampled at 8 kHz, 16 bits, mono, so each one-second frame of the byte stream is 16000 bytes.
Because call recordings contain little noise, this embodiment adopts the short-time energy method for voice endpoint detection. The buffer is implemented using an in-memory database. With a maximum voice buffering time of 2 seconds, the real-time performance of the recognition results can be guaranteed. The specific flow of the speech recognition process is shown in fig. 2.
In some embodiments, the voice frames comprise a first frame, intermediate frames, and a last frame (e.g., the first frame refers to the first frame of a piece of speech, the intermediate frames to its middle frames, and the last frame to its final frame). They are processed as follows.
First frame: if it is a silence frame, silence time is accumulated (e.g., the cached silence time of the call audio is increased by one time unit, e.g., 1 second); if it is a non-silence frame (which may be called a "non-silence point"), it is judged whether the speech recognition engine has spare recognition capacity (e.g., by checking the in-memory database for a free session); if spare capacity is available, speech recognition is started (e.g., a session identifier (id) is acquired and recognition begins); if capacity is full, the voice frame is buffered or discarded according to the buffering policy (e.g., the audio is buffered in the in-memory database to wait for another session to finish).
Intermediate frame: if it is a silence frame and the previous frame was also silent, silence time is accumulated; if it is a silence frame and the previous frame was not, speech recognition is ended, the final recognition result is acquired, and the time position of the result is calculated (e.g., the recognition result is obtained, the session is ended, the session's cached information in the in-memory database is cleared, and other buffered audio in the in-memory database is read to start processing); if it is a non-silence frame and the previous frame was silent, it is processed in the same way as a non-silence first frame (e.g., check whether a free session exists; if so, acquire a session id and start recognition; if not, buffer the audio in the in-memory database to wait for another session to finish); if it is a non-silence frame and the previous frame was also non-silent, buffering continues or speech recognition continues according to the session state.
Last frame: if it is a silence frame, the cache is cleared (e.g., the session's cached information in the in-memory database is cleared); if it is a non-silence frame, speech recognition is ended, the final recognition result is acquired, and the time position of the result is calculated (e.g., the recognition result is obtained, the session is ended, and the session's cached information in the in-memory database is cleared).
Since silence-segment audio is no longer sent to the speech recognition engine, the engine's bg (front end point of the current result) differs from the corresponding end point time in the original audio, so the original position of the recognition result in the audio must be recovered. The invention designs a method for unifying time endpoints: assuming the voice frame length is one second, whenever an audio frame is judged to be a silence point, one second of silence time is added to the call's entry in the cache. In this way, all processed silence points are recorded in the cache, and after the recognition engine returns a result, the cached silence time is added to align it with the original time endpoints of the audio. For example, suppose the first and second seconds of audio A are silent, so the accumulated silence time of A in the cache is 2 seconds, while the third and fourth seconds are non-silent. The recognition engine returns the result "good morning" with "bg:0", which differs from the real audio. The program then adds the previously accumulated silence time of 2 seconds to obtain the true bg:2 for "good morning" (indicating it starts 2 seconds after the initial position), consistent with the original audio.
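The endpoint-alignment bookkeeping reduces to a single addition; a minimal sketch, with illustrative names, follows.

```python
def align_endpoint(engine_bg_seconds, accumulated_silence_seconds):
    """Map the engine's front end point back onto the original audio timeline.

    The engine only saw non-silence frames, so its bg lags the original
    audio by the total silence removed before the recognized segment.
    """
    return engine_bg_seconds + accumulated_silence_seconds

# The document's example: seconds 1-2 of audio A are silent, the engine
# returns "good morning" with bg = 0, so the true position is bg = 2.
true_bg = align_endpoint(0, 2)
```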
In some embodiments, in the step of speech recognition, the interaction message with the speech recognition engine includes creating a speech recognition session, sending a speech frame to be recognized, obtaining a recognition result, and ending the speech recognition session.
For example, the speech recognition step may be implemented using a speech recognition interface. The processing mechanism of the interface is: when a speaker starts speaking, a session is established with the speech recognition engine; while the speaker talks, the voice stream is synchronized to the engine and recognition results are acquired in real time; after the speaker finishes, the final recognition result is acquired and the session is ended.
Message interaction with the speech recognition interface uses the HTTP protocol, which is easy to extend and highly compatible. For example, a non-blocking, event-driven programming framework may be used to implement highly concurrent interface calls. The interfaces are: a new-session interface, which establishes a new recognition session before recognition of a voice begins; a real-time voice stream synchronization interface, which synchronizes the voice frames to be recognized to the speech recognition engine frame by frame; a real-time result acquisition interface, through which real-time recognition results are obtained while the voice stream is being transmitted; and an end-session interface, which ends the speech recognition session after the speech ends.
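The four interface messages can be sketched as a small session object that builds message payloads. This is an assumption-laden sketch: the patent does not specify its message format, so the field names ("op", "session", "offset") and the payload shape here are invented for illustration; a real deployment would serialize these over HTTP.

```python
import uuid

class RecognitionSession:
    """Sketch of the four interaction messages; field names are assumptions."""

    def __init__(self, voice_id, call_info):
        self.session_id = str(uuid.uuid4())  # unique session identifier
        self.voice_id = voice_id             # caller-side or callee-side stream
        self.call_info = call_info           # calling/called number, call id, ...
        self.position = 0                    # processing position in the audio (s)

    def start_message(self):
        return {"op": "start", "session": self.session_id,
                "voice": self.voice_id, "call": self.call_info}

    def frame_message(self, frame_bytes, frame_seconds=1):
        msg = {"op": "frame", "session": self.session_id,
               "offset": self.position, "data": frame_bytes}
        self.position += frame_seconds
        return msg

    def end_message(self):
        return {"op": "end", "session": self.session_id}
```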
In some embodiments, the session information of the newly-built voice recognition session includes a session identifier, a voice identifier, voice call path information, and a voice processing location.
For example, the session identifier is the unique ID of the recognition session established with the speech recognition engine; the voice identifier is the unique ID distinguishing the calling-party voice from the called-party voice within a call; the call path information of the voice is information related to the call, including but not limited to the calling number, the called number, the call ID, and whether the voice is the calling or called side; and the processing position of the voice is the time position of the currently processed voice frame within the whole voice.
The established session information needs to be stored in the cache, and the session information includes session identification, voice identification, call path information of voice, and processing location of voice.
In some embodiments, the current concurrency of the speech recognition engine is increased when the speech recognition session is established, and the current concurrency of the speech recognition engine is decreased when the speech recognition session is ended.
The real-time speech recognition mechanism provided by a speech recognition engine generally adopts session management: after a recognition session is established, real-time voice frames are transmitted as the input data of recognition, real-time recognition results are obtained through an interface, and after the last frame has been sent to the engine, the session is ended. Commercially available speech recognition engines have concurrency limits. Since the engine's concurrency count is queried and changed frequently during real-time recognition, it is stored in a cache. The current concurrency must be increased when a recognition session starts and decreased when it ends.
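The acquire/release discipline for the cached concurrency count can be sketched as follows. The text stores this counter in an in-memory database shared across processes; as an assumption for a self-contained sketch, a lock-protected in-process counter shows the same logic.

```python
import threading

class ConcurrencyCounter:
    """Stand-in for the cached engine concurrency count (illustrative only)."""

    def __init__(self, limit):
        self.limit = limit      # engine's purchased concurrency limit
        self.current = 0        # sessions currently open
        self._lock = threading.Lock()

    def try_acquire(self):
        """Increase the count when starting a session, if capacity remains."""
        with self._lock:
            if self.current < self.limit:
                self.current += 1
                return True
            return False        # engine full: caller should buffer the frames

    def release(self):
        """Decrease the count when a session ends."""
        with self._lock:
            if self.current > 0:
                self.current -= 1
```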
In some embodiments, new speech frames are buffered and early speech frames are discarded if the number of concurrence of the speech recognition engine reaches an upper limit.
In some embodiments, early speech frames that exceed the maximum duration of the speech frame buffer are discarded.
Suppose the speech recognition engine supports 10 recognition sessions in total and 20 voice streams need to be recognized. If the silence segments of 10 streams exactly coincided with the speech segments of the other 10, the engine resources freed during the 10 silence segments could in theory be handed over to the other 10 speech segments, so that 10 recognition sessions would support 20 voice streams. In practice, however, silence segments are unevenly distributed and speech segments vary in length; the randomness of effective speech distribution and duration means that more than 10 sessions may request the engine's recognition capacity at the same time, so some transcription requests would be discarded.
To solve this problem, the invention introduces a voice frame buffering mechanism and a buffer discarding mechanism. If the engine's concurrency has reached its upper limit, new voice frames are buffered; because high-speed reads are required, the voice frame buffer is implemented with a cache. As long as the buffered voice frames are forwarded to the speech recognition engine before the whole sentence ends, no change in transcription response time is perceived. If, when a speaker's sentence ends, many voice frames still remain in the buffer awaiting recognition, the real-time performance of speech recognition suffers; this may appear as the speaker obtaining the recognition result only some time after finishing speaking. The longer the buffered audio, the worse the real-time performance. Therefore, a maximum buffering duration is selected according to actual service requirements; when it is exceeded, the earliest voice frames are regarded as expired and discarded, guaranteeing the real-time performance of speech recognition.
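The buffering-with-expiry policy can be sketched as a bounded queue. A minimal sketch under the assumption of one-second frames; the class and method names are illustrative, and a real deployment would back this with the in-memory database mentioned in the text.

```python
from collections import deque

class FrameBuffer:
    """Per-call frame buffer with the discard policy described above:
    when buffered audio exceeds the maximum duration, the earliest
    frames are dropped as expired to preserve real-time behaviour."""

    def __init__(self, max_seconds, frame_seconds=1):
        self.max_frames = int(max_seconds / frame_seconds)
        self.frames = deque()

    def push(self, frame_bytes):
        """Buffer one frame; returns how many expired frames were dropped."""
        self.frames.append(frame_bytes)
        dropped = 0
        while len(self.frames) > self.max_frames:
            self.frames.popleft()   # expired: discard the earliest frame
            dropped += 1
        return dropped

    def drain(self):
        """Hand all buffered frames to the engine once capacity frees up."""
        out = list(self.frames)
        self.frames.clear()
        return out
```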
Most existing techniques that apply voice endpoint detection to speech recognition target single-sentence recognition scenes, are mostly used in non-real-time recognition applications, and lack the response performance required for real-time recognition.
According to the embodiment of the invention, voice endpoint detection is used to distinguish effective voice frames from mute frames, and only the effective frames are recognized, saving the engine recognition resources consumed by silence segments in a real-time speech recognition scene. The voice frame buffering and buffer discard mechanisms raise the utilization of the speech recognition engine while preserving real-time recognition performance. The single-sentence recognition response time can be held to about 400 milliseconds, which suits service scenes with high real-time requirements such as agent assistants and real-time quality inspection. In addition, a real-time result position calculation technique accurately restores the position of the recognized text in the original voice.
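The mute/effective-frame split can be illustrated with a simple energy-based detector (an assumption for illustration only; the patent requires voice endpoint detection but does not prescribe a specific algorithm or threshold):

```python
import struct


def is_silence(frame, threshold=500):
    """Classify one little-endian 16-bit PCM frame as mute by mean absolute
    amplitude. The threshold value is an assumption; the patent only requires
    that endpoint detection separate mute frames from effective frames."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    mean_abs = sum(abs(s) for s in samples) / max(len(samples), 1)
    return mean_abs < threshold


def effective_frames(frames):
    """Yield only the effective (non-mute) frames, with their index so the
    caller can accumulate the duration of the skipped silence segments."""
    for i, frame in enumerate(frames):
        if not is_silence(frame):
            yield i, frame
```

Only the frames yielded by `effective_frames` would be forwarded to the engine; the indices of the skipped frames give the silence durations needed later to restore result positions.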
According to the embodiment of the invention, the utilization rate of the speech recognition engine is greatly improved in a real-time speech recognition scene without affecting the original recognition quality or response time, thereby reducing the purchase cost of the speech recognition engine. With the speech recognition method of the embodiment, the utilization rate of the speech engine can be nearly doubled; that is, the maximum number of calls supported by speech recognition in a call scene approaches twice the concurrency of the speech recognition engine.
The invention also provides a speech recognition system comprising a voice stream processing module, a voice frame processing module, and a voice recognition module. The voice stream processing module receives a voice stream and divides it into voice frames; the voice frame processing module performs mute judgment on the voice frames; and the voice recognition module exchanges messages with a speech recognition engine according to the mute judgment result.
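A minimal sketch of the three modules and their interaction might look as follows (all class and method names are hypothetical, and the engine interface is a stand-in for the real engine's session messages):

```python
class VoiceStreamProcessor:
    """Receives a voice stream and divides it into fixed-size voice frames."""

    def __init__(self, frame_bytes=640):   # 20 ms at 16 kHz/16-bit, assumed
        self.frame_bytes = frame_bytes
        self._pending = b""

    def feed(self, chunk):
        self._pending += chunk
        while len(self._pending) >= self.frame_bytes:
            frame = self._pending[:self.frame_bytes]
            self._pending = self._pending[self.frame_bytes:]
            yield frame


class VoiceFrameProcessor:
    """Performs the mute judgment on each frame via a pluggable detector."""

    def __init__(self, detector):
        self.detector = detector           # any VAD callable: frame -> bool

    def is_mute(self, frame):
        return self.detector(frame)


class SpeechRecognitionModule:
    """Exchanges messages with the engine according to the mute judgment:
    a non-mute frame after silence opens a session, non-mute frames inside
    a session are forwarded, and a mute frame after speech closes it."""

    def __init__(self, engine):
        self.engine = engine
        self.in_session = False

    def on_frame(self, frame, mute):
        if not mute and not self.in_session:
            self.engine.new_session()
            self.in_session = True
        if not mute:
            self.engine.send(frame)
        elif self.in_session:
            self.engine.end_session()
            self.in_session = False
```

The pluggable detector lets any endpoint-detection implementation drive the session messages without changing the module boundaries described above.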
The embodiment of the invention preprocesses the real-time voice stream and, using voice endpoint detection, engine concurrency management, voice session management, voice frame buffering, result endpoint calculation, and similar techniques, transmits only effective voice frames to the speech recognition engine in real time, thereby saving the engine recognition resources of silence segments.
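The result endpoint calculation can be illustrated as follows: because silence segments are withheld from the engine, a time position the engine returns must be shifted forward by the accumulated duration of every silence segment that precedes it in the original stream. The helper below is an illustrative assumption of that bookkeeping:

```python
def restore_position(engine_ms, silence_spans):
    """Map a time position returned by the engine (relative to the effective
    speech it actually received) back to the original voice stream.

    silence_spans: list of (start_ms_in_original, duration_ms) tuples for
    every silence segment withheld from the engine, in stream order. This
    representation is assumed; the patent only states that stored mute
    duration information is combined with the engine-returned position.
    """
    original = engine_ms
    for start, duration in silence_spans:
        if start <= original:
            original += duration   # silence before this point was removed
        else:
            break
    return original
```

For example, if the first second of the stream was silence, an engine position of 200 ms corresponds to 1200 ms in the original audio.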
The embodiment of the invention is applied to real-time speech recognition scenes in intelligent call-center applications such as the agent assistant and real-time quality inspection. The agent assistant displays the dialogue text in real time while an operator answers a call; after AI semantic understanding and processing, the text quickly locates the user's intention, helping the operator handle the subsequent business. Real-time quality inspection monitors the operators' service quality by recognizing keywords and other information in the transcribed text in real time. These services place extremely high demands on the real-time performance of speech recognition; with the voice preprocessing of the invention, real-time performance is not reduced, and the single-sentence transcription delay of about 400 milliseconds meets the service scene requirements.
By implementing and using the invention, the utilization rate of the real-time speech recognition engine can be nearly doubled, greatly reducing the investment cost of the speech recognition engine.
FIG. 3 illustrates an exemplary configuration of a computing device 300 in which embodiments according to the invention may be implemented.
Computing device 300 is an example of a hardware device that can employ the above aspects of the present invention. Computing device 300 may be any machine configured to perform processing and/or computing. Computing device 300 may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a Personal Data Assistant (PDA), a smart phone, an in-vehicle computer, or a combination thereof.
As shown in FIG. 3, computing device 300 may include one or more elements that may be connected to or in communication with bus 302 via one or more interfaces. Bus 302 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus. Computing device 300 may include, for example, one or more processors 304, one or more input devices 306, and one or more output devices 308. The one or more processors 304 may be any kind of processor and may include, but are not limited to, one or more general-purpose processors or special-purpose processors (such as special-purpose processing chips). The processor 304 may be configured to execute instructions stored in the memory, for example, to perform the speech recognition method described with reference to FIG. 1. Alternatively, the processor 304 may implement the functions of the above-described voice stream processing module, voice frame processing module, and voice recognition module. Input device 306 may be any type of device capable of inputting information to the computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote controller. Output device 308 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, a vibrator, and/or a printer.
The computing device 300 may also include or be connected to a non-transitory storage device 314, which may be any storage device that is non-transitory and capable of storing data, and may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a compact disc or any other optical medium, cache memory and/or any other memory chip or module, and/or any other medium from which a computer may read data, instructions, and/or code. Computing device 300 may also include Random Access Memory (RAM) 310 and Read-Only Memory (ROM) 312. The ROM 312 may store programs, utilities, or processes to be executed in a non-volatile manner. The RAM 310 may provide volatile data storage and store instructions related to the operation of computing device 300. Computing device 300 may also include a network/bus interface 316 coupled to a data link 318. Network/bus interface 316 may be any kind of device or system capable of enabling communication with external apparatuses and/or networks and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as a Bluetooth device, an 802.11 device, a Wi-Fi device, a WiMAX device, a cellular communication facility, etc.).
The invention may be implemented as any combination of an apparatus, a system, an integrated circuit, and a computer program on a non-transitory computer readable storage medium. One or more processors may be implemented as an Integrated Circuit (IC), an Application Specific Integrated Circuit (ASIC), or a large scale integrated circuit (LSI), a system LSI, a super LSI, or a super LSI assembly that performs some or all of the functions described in the present invention.
The present invention includes the use of software, applications, computer programs, or algorithms. The software, application, computer program, or algorithm may be stored on a non-transitory computer-readable storage medium to cause a computer, such as one or more processors, to perform the steps described above and depicted in the figures. For example, one or more memories may store the software or algorithm as executable instructions, and one or more processors may execute the set of instructions of the software or algorithm to provide various functions in accordance with the embodiments described herein.
The software and computer programs (which may also be referred to as programs, software applications, components, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural, object-oriented, functional, logical, or assembly or machine language. The term "computer-readable storage medium" refers to any computer program product, apparatus or device, such as magnetic disks, optical disks, solid state memory devices, memory, and Programmable Logic Devices (PLDs), for providing machine instructions or data to a programmable data processor, including computer-readable storage media that receive machine instructions as a computer-readable signal.
By way of example, a computer-readable storage medium may comprise Dynamic Random Access Memory (DRAM), Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired computer-readable program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or a general-purpose or special-purpose processor. Disk and disc, as used in the present invention, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable storage media.
The inventive subject matter is provided as examples of methods, systems, and computer readable storage media for performing the features described herein. Other features or variations in addition to those described above are contemplated. It is contemplated that the implementation of the components and functions of the present invention may be accomplished with any emerging technology that may replace any of the above-described implementation technologies.
In addition, the foregoing description provides examples without limiting the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the spirit and scope of the invention. Various embodiments may omit, replace, or add various procedures or components as appropriate. For example, features described with respect to certain embodiments may be combined in other embodiments.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (13)

1. A method of speech recognition, comprising:
A voice stream processing step of receiving a voice stream and dividing the voice stream into voice frames;
A voice frame processing step of performing mute judgment on the voice frame, and
A voice recognition step of exchanging messages with a voice recognition engine according to the mute judgment result;
a recognition result processing step of calculating a time position of the recognition result in the original voice stream after the recognition result is acquired from the voice recognition engine;
wherein, the voice frame comprises a first frame, an intermediate frame and a last frame,
The first frame is processed by the following voice frame:
if the first frame is a mute frame, performing mute time accumulation;
If the first frame is a non-mute frame, judging whether the voice recognition engine has free recognition capability; if so, performing the voice recognition, and if not, caching or discarding the voice frame according to a caching strategy,
And carrying out the following voice frame processing on the intermediate frame:
if the intermediate frame is a mute frame, judging whether the previous frame of voice is a mute frame; if the previous frame is a mute frame, accumulating the mute time; if the previous frame is a non-mute frame, ending the voice recognition, acquiring the final recognition result, and calculating the time position of the recognition result;
If the intermediate frame is a non-mute frame, judging whether the previous frame of voice is a mute frame; if the previous frame is a mute frame, performing the same processing as when the first frame is a non-mute frame; and if the previous frame is a non-mute frame, continuing to buffer or continuing the voice recognition according to the session condition.
2. The speech recognition method of claim 1, wherein,
In the recognition result processing step, the time position of the recognition result in the original voice stream is calculated according to the time position returned by the voice recognition engine and the stored mute duration information.
3. The speech recognition method of claim 2, wherein,
In the voice recognition step, the messages exchanged with the voice recognition engine comprise newly establishing a voice recognition session, sending voice frames to be recognized, acquiring a recognition result, and ending the voice recognition session.
4. The speech recognition method of claim 3, wherein,
The newly established session information of the voice recognition session comprises session identification, voice call path information and voice processing position.
5. The speech recognition method of claim 1, wherein,
In the voice frame processing step, silence judgment is performed on the voice frame through voice endpoint detection.
6. The speech recognition method of claim 5, wherein,
If the voice frame is not a mute frame and the previous frame of voice is a mute frame, the voice recognition session is newly established;
If the voice frame is not a mute frame and the previous frame of voice is not a mute frame, the voice recognition is continued;
If the voice frame is a mute frame and the previous frame of voice is not a mute frame, the voice recognition session is ended;
And if the voice frame is a mute frame and the previous frame of voice is a mute frame, the mute segment duration is calculated.
7. The method for speech recognition according to claim 4, wherein,
And when the voice recognition session is newly established, increasing the current concurrency number of the voice recognition engine, and when the voice recognition session is ended, reducing the current concurrency number of the voice recognition engine.
8. The speech recognition method of claim 7, wherein,
And if the concurrency number of the voice recognition engine reaches the upper limit, caching the new voice frames, and discarding the early voice frames.
9. The speech recognition method of claim 8, wherein,
Early speech frames that exceed the maximum duration of the speech frame buffer are discarded.
10. The speech recognition method of claim 1, wherein,
And carrying out the following voice frame processing on the last frame:
if the last frame is a mute frame, the buffer is cleared; if the last frame is a non-mute frame, the voice recognition is ended, the final recognition result is acquired, and the time position of the recognition result is calculated.
11. A speech recognition system, comprising:
A voice stream processing module for receiving a voice stream and dividing the voice stream into voice frames;
A voice frame processing module for carrying out mute judgment on the voice frame, and
A voice recognition module for exchanging messages with a voice recognition engine according to the mute judgment result;
After the voice recognition system acquires a recognition result from the voice recognition engine, calculating the time position of the recognition result in the original voice stream;
wherein, the voice frame comprises a first frame, an intermediate frame and a last frame,
The first frame is processed by the following voice frame:
if the first frame is a mute frame, performing mute time accumulation;
If the first frame is a non-mute frame, judging whether the voice recognition engine has free recognition capability; if so, performing the voice recognition, and if not, caching or discarding the voice frame according to a caching strategy,
And carrying out the following voice frame processing on the intermediate frame:
if the intermediate frame is a mute frame, judging whether the previous frame of voice is a mute frame; if the previous frame is a mute frame, accumulating the mute time; if the previous frame is a non-mute frame, ending the voice recognition, acquiring the final recognition result, and calculating the time position of the recognition result;
If the intermediate frame is a non-mute frame, judging whether the previous frame of voice is a mute frame; if the previous frame is a mute frame, performing the same processing as when the first frame is a non-mute frame; and if the previous frame is a non-mute frame, continuing to buffer or continuing the voice recognition according to the session condition.
12. A speech recognition system, comprising:
A memory having instructions stored thereon, and
A processor configured to execute instructions stored on the memory to perform the speech recognition method according to any one of claims 1 to 10.
13. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the speech recognition method of any one of claims 1 to 10.
CN202011420932.3A 2020-12-08 2020-12-08 Speech recognition method, speech recognition system and storage medium Active CN114627854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011420932.3A CN114627854B (en) 2020-12-08 2020-12-08 Speech recognition method, speech recognition system and storage medium

Publications (2)

Publication Number Publication Date
CN114627854A CN114627854A (en) 2022-06-14
CN114627854B (en) 2025-03-21

Family

ID=81896284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011420932.3A Active CN114627854B (en) 2020-12-08 2020-12-08 Speech recognition method, speech recognition system and storage medium

Country Status (1)

Country Link
CN (1) CN114627854B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115148193B (en) * 2022-07-04 2025-04-08 鼎富新动力(北京)智能科技有限公司 A speech recognition method and system
CN115497457A (en) * 2022-09-29 2022-12-20 贵州小爱机器人科技有限公司 Speech recognition method, device, electronic device and storage medium
CN115910043B (en) * 2023-01-10 2023-06-30 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN116631454B (en) * 2023-04-04 2025-12-19 湖北星纪魅族科技有限公司 An audio processing method and device
CN116153294B (en) * 2023-04-14 2023-08-08 京东科技信息技术有限公司 Speech recognition method, device, system, equipment and medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108986822A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and non-transient computer storage medium

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
JP3499113B2 (en) * 1997-09-26 2004-02-23 シャープ株式会社 Noise removal device
JPH11154378A (en) * 1997-11-20 1999-06-08 Sony Corp Audio signal processing device
JP4512848B2 (en) * 2005-01-18 2010-07-28 株式会社国際電気通信基礎技術研究所 Noise suppressor and speech recognition system
JP5385876B2 (en) * 2010-08-30 2014-01-08 日本電信電話株式会社 Speech segment detection method, speech recognition method, speech segment detection device, speech recognition device, program thereof, and recording medium


Also Published As

Publication number Publication date
CN114627854A (en) 2022-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant