
CN116631454B - An audio processing method and device - Google Patents

An audio processing method and device

Info

Publication number
CN116631454B
CN116631454B (application CN202310380459.8A)
Authority
CN
China
Prior art keywords
queue
audio
audio frame
frame
silence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310380459.8A
Other languages
Chinese (zh)
Other versions
CN116631454A (en)
Inventor
李林峰
黄海荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Xingji Meizu Technology Co Ltd
Original Assignee
Hubei Xingji Meizu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Xingji Meizu Technology Co Ltd filed Critical Hubei Xingji Meizu Technology Co Ltd
Priority to CN202310380459.8A
Publication of CN116631454A
Application granted
Publication of CN116631454B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems
    • G11: INFORMATION STORAGE
    • G11B: INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00: Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10: Digital recording or reproducing
    • G11B20/10527: Audio or video recording; Data buffering arrangements
    • G11B2020/10537: Audio or video recording
    • G11B2020/10546: Audio or video recording specifically adapted for audio data
    • G11B2020/1062: Data buffering arrangements, e.g. recording or playback buffers
    • G11B2020/10629: Data buffering arrangements, the buffer having a specific structure
    • G11B2020/10638: First-in-first-out memories [FIFO] buffers

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides an audio processing method and device. The method includes: collecting audio data and storing it in a recording queue, where the recording queue is a first-in first-out queue; taking the current audio frame out of the recording queue; determining that the silence detection result of the current audio frame is a non-silence audio frame while the global state is the silence state; determining that there are non-silence audio frames of a consecutive first frame number; and copying the audio frames of the silence detection queue to the tail of an identification queue, where the silence detection queue is a circular queue and the identification queue is a first-in first-out queue.

Description

Audio processing method and device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to an audio processing method and apparatus.
Background
In existing voice interaction scenarios, such as in-car voice interaction, speech recognition must be performed on the user's audio data so that the corresponding voice service can be provided to the user.
During speech recognition, the user is silent for a significant part of the time. This silence consumes the computational resources of speech recognition; existing schemes therefore detect the end of silence before starting the recognition program.
Disclosure of Invention
In a first aspect, an embodiment of the present application provides an audio processing method, including:
collecting audio data;
storing the audio data in a recording queue, where the recording queue is a first-in first-out queue;
taking the current audio frame out of the recording queue;
determining that the silence detection result of the current audio frame is a non-silence audio frame and that the global state is the silence state;
determining that there are non-silence audio frames of a consecutive first frame number;
copying the audio frames of the silence detection queue to an identification queue, where the silence detection queue is a circular queue and the identification queue is a first-in first-out queue.
In some embodiments, further comprising:
inserting the current audio frame into the identification queue, where the current audio frame is located in the identification queue after the copied audio frames of the silence detection queue.
In some embodiments, the number of audio frames copied from the silence detection queue is greater than or equal to the first frame number.
In some embodiments, further comprising:
determining that the global state is the non-silence state, and inserting the current audio frame into the identification queue.
In some embodiments, further comprising:
updating the global state to the non-silence state.
In some embodiments, further comprising:
determining that the silence detection result of the current audio frame is a silence audio frame, and inserting the current audio frame into the silence detection queue.
In some embodiments, the silence detection queue comprises a plurality of cyclically connected data spaces; the current audio frame is inserted into the data space pointed to by the tail pointer of the silence detection queue, and the tail pointer moves along the data update direction of the silence detection queue toward the data space pointed to by the head pointer of the silence detection queue.
In some embodiments, further comprising:
determining that the global state is the non-silence state, determining that there are silence audio frames of a consecutive second frame number, and updating the global state to the silence state.
In some embodiments, further comprising:
recognizing the audio frames in the identification queue in real time.
In a second aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the audio processing method according to any one of the first aspect when executing the program.
In a third aspect, embodiments of the present application also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio processing method according to any of the first aspects described above.
Drawings
To more clearly illustrate the technical solutions of the application or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the application; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a diagram of an application environment in which an audio processing method according to an embodiment of the present application may operate;
FIG. 2 is a flow chart of an audio processing method according to an embodiment of the present application;
FIG. 3 is an interaction schematic diagram of the queues of an audio processing method according to an embodiment of the present application;
FIG. 4 is a first schematic diagram of a data update scenario of a silence detection queue according to an embodiment of the present application;
FIG. 5 is a second schematic diagram of a data update scenario of a silence detection queue according to an embodiment of the present application;
FIG. 6 is a message timing diagram of the three tasks of an audio processing method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of the silence detection task in an audio processing method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. All other embodiments obtained by a person skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The terms "first", "second", and the like in the description and claims are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. It should be understood that terms so used are interchangeable where appropriate, so that the embodiments of the application can operate in sequences other than those illustrated or described herein. Objects distinguished by "first" and "second" are generally of one type, and the number of objects is not limited; the first object may, for example, be one or more. Furthermore, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.
In current vehicle-mounted voice interaction and mobile terminal (for example, mobile phone) interaction scenarios, such as an in-car voice interaction scenario, speech recognition must be performed on the user's audio data so that the corresponding voice service can be provided. During speech recognition, the user is silent for a significant part of the time, and this silence consumes the computing resources of speech recognition. Existing schemes therefore first perform silence detection on the user's audio data and start the recognition program on the subsequent audio data only when the end of silence is detected; with this approach, frames of the head and tail data are often lost.
Therefore, an embodiment of the application provides an audio processing method in which three independent tasks (a recording task, a silence detection task, and a recognition task) and two global states (a silence state and a non-silence state) are set, and messages flow between the three tasks through queues, so that no frames of head or tail data are lost at the boundary where the silence state changes.
The audio processing method provided by the embodiment of the application can be applied in the application environment shown in fig. 1. Fig. 1 is a diagram of an application environment in which an audio processing method according to an embodiment of the present application may operate. As shown in fig. 1, the application environment includes a terminal 110 and a server 120 that communicate through a network; the communication network may be wireless or wired, and the numbers of terminals and servers are not limited. The wireless communication network may include, but is not limited to, at least one of Wi-Fi (Wireless Fidelity) and Bluetooth, and the wired communication network may include, but is not limited to, at least one of a wide area network, a metropolitan area network, and a local area network.
In some embodiments, the terminal 110 (terminal device) includes various handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem, such as mobile phones, tablets, desktops, notebooks, and smart devices that can run applications, including the central console of a smart car. Specifically, it may be user equipment (UE), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, or a user agent. The terminal device may also be a satellite phone, a cellular phone, a smart phone, a wireless data card, a wireless modem, a machine-type communication device, a cordless phone, a session initiation protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA), a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, a vehicle-mounted or wearable device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in telemedicine, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, or a terminal device in a future communication network. The terminal may be powered by a battery, or may be attached to and powered by the power system of a vehicle or vessel; the power system of the vehicle or vessel may also charge the terminal's battery to extend its communication time.
The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
It should be noted that the method in the present application may be performed directly on the terminal 110, or performed on the server 120, with the result then sent by the server 120 to the terminal 110.
Fig. 2 is a flow chart of an audio processing method according to an embodiment of the present application. As shown in fig. 2, an audio processing method is provided; it is described here as applied to the terminal in fig. 1 and includes steps 210 to 260. These method steps are only one possible implementation of the application.
Step 210, collecting audio data;
In the embodiment of the application, the user's audio data can be obtained through the vehicle-mounted voice system. In different scenarios, the user opens the corresponding application program (APP), such as a navigation program or a music playing program, according to his or her needs, and the corresponding audio data is then obtained through that application. For example, after the user opens the navigation program and asks for a route to the navigation destination, the vehicle-mounted voice system obtains the audio data and carries out the voice interaction with the user.
Step 220, storing the audio data in a recording queue;
It should be noted that, in this embodiment, the recording queue is a first-in first-out (FIFO) queue of fixed size. As shown in fig. 3, when the recording task executes, new audio data is placed in the space pointed to by the tail pointer (tail) of the recording queue, and when the subsequent silence detection task executes, audio data is taken from the space pointed to by the head pointer (head). When the head pointer and the tail pointer overlap, i.e. the queue is full, insertion of new audio data blocks until free space is available in the recording queue.
Further, in an embodiment, the audio data is stored in the recording queue frame by frame. Typically the frame length of an audio frame is 10 ms, with one sample represented by a 16-bit value; the audio data stored in the recording queue may be floating point or fixed point.
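The recording-queue behaviour described above can be sketched with a bounded FIFO. This is an illustrative sketch, not the patent's implementation: the queue capacity is an assumption, and the frame size follows the 10 ms / 16,000 Hz figures used elsewhere in the text.

```python
import queue

FRAME_SAMPLES = 160   # one 10 ms frame at a 16,000 Hz sampling rate
QUEUE_CAPACITY = 64   # illustrative; the patent only says the size is fixed

# Bounded FIFO: put() blocks when the queue is full, which mirrors the
# blocking behaviour described for the recording queue.
recording_queue = queue.Queue(maxsize=QUEUE_CAPACITY)

def store_frame(samples):
    """Recording task: append one audio frame at the tail of the queue."""
    assert len(samples) == FRAME_SAMPLES
    recording_queue.put(samples)

def fetch_current_frame():
    """Silence detection task: take the frame at the head of the queue."""
    return recording_queue.get()

# FIFO order: the first frame stored is the first frame fetched.
store_frame([0.0] * FRAME_SAMPLES)
store_frame([0.5] * FRAME_SAMPLES)
current = fetch_current_frame()   # the all-zero frame
```

A `queue.Queue` is used here because its blocking `put()` naturally models the producer waiting for free space; a real implementation on a vehicle terminal might use a lock-free ring instead.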
Step 230, the current audio frame is fetched from the recording queue;
The current audio frame refers to the audio frame pointed to by the current head pointer in the recording queue.
As shown in fig. 3, the current audio frame is "a1" pointed to by the head pointer, and after the "a1" is fetched, the head pointer is moved backward by one bit to point to "a2".
Step 240, determining that the silence detection result of the current frame is a non-silence audio frame and the global state is a silence state;
When performing the silence detection task, one frame of data is typically fetched at a time, i.e. 10 ms of data. For a mono channel with 16 bits per sampling point and a typical sampling rate of 16,000 Hz, one frame of data comprises 160 single-precision floating point numbers.
In this embodiment, filter-bank (FBANK) speech feature data may be computed from the audio data as the input to a subsequent acoustic model. The FBANK features are input to the acoustic model, which maps them to characters; the output of the acoustic model is fed in turn to a fully connected layer and a softmax function (or a sigmoid function), yielding a probability value for each frame. Silence detection in this embodiment is thus achieved with a small model (the acoustic model); because the small model requires little computation, it occupies little storage and does not consume much power even when running continuously.
After the probability value of each frame is obtained, it is compared with a preset threshold; if the probability value exceeds the threshold, the silence detection result of the frame is judged to be a non-silence audio frame, and otherwise a silence audio frame.
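The per-frame decision can be sketched as a simple threshold test. The threshold value 0.5 and the function name are assumptions for illustration; the patent does not fix them.

```python
def silence_result(speech_prob, threshold=0.5):
    """Classify one frame from its model-output probability.

    speech_prob: probability that the frame contains speech, taken from
                 the small acoustic model's softmax/sigmoid output.
    threshold:   preset threshold (0.5 here is a hypothetical value).
    """
    return "non-silence" if speech_prob > threshold else "silence"

# One label per frame, in arrival order.
labels = [silence_result(p) for p in (0.1, 0.6, 0.9, 0.4)]
```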
In this embodiment, referring to fig. 6, after the silence detection result of the current audio frame is determined, whether the current audio frame needs to be recognized is decided in combination with the global state before the current audio frame was processed. If the global state is the non-silence state, the current audio frame most probably still contains speech, so it is inserted into the identification queue; if the global state is the silence state, the current audio frame most probably contains no speech, so it is inserted into the silence detection queue.
In one example, the global state may be updated from the silence detection results of the audio frames over a period of time, where the global state is either the silence state or the non-silence state. If the global state is the non-silence state before the current audio frame is processed, there is a run of consecutive non-silence frames before the current audio frame; if it is the silence state, there is a run of consecutive silence frames before the current audio frame.
Step 250, determining that there is a continuous first number of frames of non-mute audio;
It should be noted that, under the influence of environmental noise and other factors, the silence detection result of some audio data that is not human speech may also be a non-silence audio frame. Therefore, in this embodiment, when the global state is the silence state but the current audio frame is a non-silence audio frame, it must be determined whether non-silence audio frames of a consecutive first frame number currently exist. When the current audio frame is a non-silence audio frame, the global state is the silence state, and non-silence audio frames of the consecutive first frame number exist, a state change in the collected audio data is indicated.
In one example, the non-silence audio frames of the consecutive first frame number include the current audio frame. For example, in some scenes the user may speak slowly with a weak trailing voice, so when the user is far from the audio capturing device the detection results for the audio frames at the tail of the user's speech may be silence audio frames.
For example, for an audio frame with a frame length of 10 ms, since speaking one word takes 100-300 ms, the first frame number may be set to 10. When the silence detection results of the audio frames from the (N-10)-th frame up to the N-th frame (i.e. the current frame) are all non-silence audio frames, the first word spoken by the user can, with high probability, be recognized from the (N-10)-th to the N-th frame.
In another example, the non-silence audio frames of the consecutive first frame number do not include the current audio frame. In some scenarios the user may intermittently produce invalid voice data, for example hesitant filler words in a request such as "play ... a song". In this embodiment, to avoid a large amount of such invalid audio data entering the recognition task, the non-silence audio frames of the consecutive first frame number are set not to include the current audio frame.
For example, when the silence detection results of the audio frames starting from the (N-10)-th frame are all non-silence audio frames, voiced audio data has, with high probability, appeared. If the (N+1)-th frame (i.e. the current frame) is also detected as a non-silence audio frame, the audio frames from the (N-10)-th to the N-th frame contain valid audio data; otherwise, they do not.
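The consecutive-frame check described in the two examples above can be sketched as a run test over the recent detection history. The helper name is hypothetical, and `first_frame_number=10` follows the 10-frame example in the text.

```python
def has_consecutive_non_silence(results, first_frame_number=10):
    """True if the most recent `first_frame_number` detection results
    are all non-silence frames (e.g. frames (N-10) through N)."""
    if len(results) < first_frame_number:
        return False
    return all(r == "non-silence" for r in results[-first_frame_number:])

# Five silence frames followed by a 10-frame non-silence run.
history = ["silence"] * 5 + ["non-silence"] * 10
```

Whether the current frame is counted in the run (the first example) or the run must be complete before the current frame arrives (the second example) is just a matter of whether the caller appends the current result to `results` before calling.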
In step 260, the audio frames of the silence detection queue are copied to an identification queue, where the silence detection queue is a circular queue and the identification queue is a first-in first-out queue.
The silence detection queue in this embodiment is a ring-buffer-based queue. Referring to fig. 3 to 5, the silence detection queue may be regarded as a circular buffer divided into N equal-sized data spaces, each of which can store one frame of voice data; the silence detection queue has two pointers, a head pointer (head) and a tail pointer (tail).
As shown in fig. 4, the data update direction of the silence detection queue is clockwise. When there is no element in the silence detection queue, the head pointer and the tail pointer point to the same data space; when the audio frame "b1" is added to the silence detection queue, it is placed in the data space pointed to by the tail pointer, and the tail pointer moves one data space backward in the clockwise direction.
Further, as shown in fig. 5, when the silence detection queue is full and is allowed to overwrite previous data, the oldest data is overwritten as new data continues to be inserted. For example, when the silence detection queue first becomes full it holds "b1", "b2", "b3", "b4", "b5" and "b6". When "b7" then needs to be enqueued, "b7" is inserted into the data space pointed to by the current tail pointer, and the tail pointer moves one position backward, pointing to the same data space as the head pointer. As "b8", "b9" and "b10" are enqueued in turn, the oldest audio frames "b1", "b2" and "b3" are overwritten by "b8", "b9" and "b10" respectively, after which the latest audio frame data in the silence detection queue is, in order, "b4", "b5", "b6", "b7", "b8", "b9" and "b10", and so on.
That is, in general, the silence detection queue always holds the latest N audio frames, where N is the number of audio frames in the silence detection queue that are not covered by other audio frames under the full-queue condition.
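A minimal sketch of such a circular queue that always keeps the newest N frames is shown below. The class and field names are illustrative; the patent's queue additionally tracks an explicit head pointer, which this simplified variant derives from the write position and count.

```python
class SilenceDetectionQueue:
    """Ring buffer holding the newest `capacity` audio frames.

    When full, each insert overwrites the oldest frame, so the queue
    always contains the latest N frames, as described above.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = [None] * capacity
        self.tail = 0    # index of the next write (the tail pointer)
        self.count = 0   # number of stored frames, capped at capacity

    def insert(self, frame):
        self.buf[self.tail] = frame               # overwrites oldest when full
        self.tail = (self.tail + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def snapshot(self):
        """Stored frames in order, oldest first."""
        start = (self.tail - self.count) % self.capacity
        return [self.buf[(start + i) % self.capacity]
                for i in range(self.count)]

q = SilenceDetectionQueue(capacity=6)
for name in ["b1", "b2", "b3", "b4", "b5",
             "b6", "b7", "b8", "b9", "b10"]:
    q.insert(name)
# b1-b4 have been overwritten; only the newest 6 frames remain.
```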
In one example, when the silence detection queue is not full, all audio frames in the silence detection queue may be copied in order to the identification queue. When the silence detection queue is full, starting from the audio frame in the data space pointed to by the tail pointer (tail) of the silence detection queue, the audio frames up to the position (tail-1) before the tail pointer may be copied in order along the data update direction of the silence detection queue. As shown in fig. 5, starting from "b4", the audio frames "b4", "b5", "b6", "b7", "b8", "b9" and "b10" are fetched in order and stored in the identification queue in turn.
In this embodiment, the recognition queue is also a fixed-size first-in first-out (FIFO) queue: a newly inserted audio frame to be recognized is placed in the space pointed to by the tail pointer (tail) of the recognition queue, and when the recognition task executes, audio frames are taken out in turn from the position pointed to by the head pointer (head) toward the tail.
Before the copy, the tail of the identification queue holds audio frames from the non-silence state; copying the audio frames of the silence detection queue to the tail of the identification queue therefore ensures that no frames of head or tail data are lost at the boundary where the silence state changes.
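The copy-then-insert order on a silence-to-non-silence transition can be sketched as follows. This is a simplified sketch: the function name is hypothetical, and the buffered frames are assumed to be supplied already ordered oldest first, as produced by the copy procedure described above.

```python
def flush_to_identification(buffered_frames, identification_queue, current_frame):
    """Append the silence detection queue's frames to the tail of the
    identification queue, oldest first, then insert the current
    (non-silence) frame after them."""
    identification_queue.extend(buffered_frames)
    identification_queue.append(current_frame)
    return identification_queue

# Frames b4..b10 buffered during silence, followed by current frame b11:
out = flush_to_identification(
    ["b4", "b5", "b6", "b7", "b8", "b9", "b10"], [], "b11")
```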
The embodiment of the application provides an audio processing method that first collects audio data and stores it in a recording queue, and then, when the silence detection result of the current audio frame in the recording queue is a non-silence audio frame, the global state is the silence state, and non-silence audio frames of a consecutive first frame number exist, copies the audio frames of the silence detection queue to the identification queue. By setting three independent tasks (a recording task, a silence detection task, and a recognition task) and two global states (the silence state and the non-silence state), and circulating messages between the three tasks through queues, it is ensured that no frames of head or tail data are lost at the boundary where the silence state changes.
It should be noted that the embodiments of the present application may be freely combined, reordered, or executed separately, and need not rely on a fixed execution sequence.
In some embodiments, inserting the current audio frame into the identification queue is further included.
The current audio frame is located in the recognition queue after the copied audio frames of the silence detection queue.
For example, when the current audio frame is "b11" and the consecutive non-silence audio frames stored in the current silence detection queue are, from head to tail, "b4", "b5", "b6", "b7", "b8", "b9" and "b10", then after "b4" through "b10" are copied to the identification queue in order, "b11" is inserted after "b10", so that when the silence state switches to the non-silence state no frames of head or tail data are lost at the boundary of the state change.
In some embodiments, the number of audio frames in the silence detection queue that are copied is greater than or equal to the first number of frames.
In one example, the silence detection queue is divided into N equal-sized data spaces, where N is the number of latest audio frames the silence detection queue normally holds, i.e. the number of audio frames not covered by other audio frames under the full-queue condition. When copying audio frames from the silence detection queue to the identification queue, the latest N audio frames currently held may be copied to the identification queue.
In this embodiment, to avoid losing frames of the head and tail data at the boundary of the silence state change, N may be set greater than or equal to the first frame number.
In some embodiments, further comprising:
determining that the global state is the non-silence state, and inserting the current audio frame into the identification queue.
In this embodiment, when the current audio frame is a non-silence audio frame and the global state is the non-silence state, no state change has occurred, and the current audio frame may be inserted directly at the tail of the recognition queue to await subsequent speech recognition.
The embodiment of the application thus provides an audio processing method in which, when no state change of the current audio frame is detected, the current audio frame is inserted directly into the corresponding queue according to its silence detection result.
In some embodiments, after determining that there is a first consecutive number of non-mute audio frames, the method further comprises:
updating the global state to the non-silence state.
In this embodiment, when non-silence audio frames of the consecutive first frame number exist, speech has appeared with high probability, and the global state is updated to the non-silence state, so that subsequently detected non-silence audio frames can be inserted directly into the recognition queue, avoiding frame loss in subsequent speech recognition.
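The two global states with their entry and exit conditions (the first and second frame numbers) can be sketched as a small state machine with run-length hysteresis. The class name and the concrete frame counts below are illustrative, not values fixed by the patent.

```python
class GlobalState:
    """Silence/non-silence global state with run-length hysteresis."""
    def __init__(self, first_frame_number=10, second_frame_number=30):
        self.state = "silence"
        self.first = first_frame_number    # consecutive frames to leave silence
        self.second = second_frame_number  # consecutive frames to re-enter silence
        self.non_silence_run = 0
        self.silence_run = 0

    def update(self, label):
        """Feed one per-frame result ('silence'/'non-silence')."""
        if label == "non-silence":
            self.non_silence_run += 1
            self.silence_run = 0
            if self.state == "silence" and self.non_silence_run >= self.first:
                self.state = "non-silence"
        else:
            self.silence_run += 1
            self.non_silence_run = 0
            if self.state == "non-silence" and self.silence_run >= self.second:
                self.state = "silence"
        return self.state

# Small thresholds for demonstration: 3 frames up, 2 frames down.
gs = GlobalState(first_frame_number=3, second_frame_number=2)
states = [gs.update(l) for l in
          ["non-silence", "non-silence", "non-silence", "silence", "silence"]]
```

The hysteresis keeps one noisy frame from flipping the state in either direction, which is exactly why the patent requires consecutive runs rather than single detections.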
In some embodiments, the method further includes:
determining that the silence detection result of the current audio frame is a mute audio frame, and inserting the current audio frame into the silence detection queue.
In this embodiment, when the current audio frame is a mute audio frame, it is stored in the silence detection queue regardless of whether the current global state is the mute state or the non-mute state, thereby reducing the computing power consumed by invalid speech recognition.
In some embodiments, the silence detection queue includes a plurality of cyclically connected data spaces; the current audio frame is inserted into the data space pointed to by a tail pointer of the silence detection queue, and the tail pointer moves along a data update direction of the silence detection queue toward the data space pointed to by a head pointer of the silence detection queue.
As shown in figs. 3 to 5, the silence detection queue may be regarded as a circular buffer formed by a plurality of cyclically connected data spaces. When there is no element in the silence detection queue, the head pointer and the tail pointer point to the same data space. When a new audio frame needs to be stored, it is inserted into the data space pointed to by the tail pointer, and the tail pointer moves along the data update direction toward the data space pointed to by the head pointer until the silence detection queue is full.
When the silence detection queue is full, the head pointer and the tail pointer again point to the same data space, so that when a new audio frame is inserted, the tail pointer can continue to move along the data update direction toward the data space pointed to by the head pointer.
Therefore, even if the audio frames in the silence detection queue are not taken out in time, each new audio frame stored in the queue overwrites the oldest one, and the queue always holds the most recent audio frames that have not yet been overwritten.
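The pointer behaviour described above can be modelled explicitly. The following Python sketch (class and method names are illustrative, not from the application) keeps a write position that wraps around and overwrites the oldest frame once the buffer is full:

```python
class SilenceRingBuffer:
    """Fixed-capacity circular buffer; when full, the newest frame
    overwrites the oldest, so the last `capacity` frames are kept."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.tail = 0          # next write position (the tail pointer)
        self.count = 0         # frames currently stored

    def insert(self, frame):
        self.buf[self.tail] = frame                  # write at the tail pointer
        self.tail = (self.tail + 1) % self.capacity  # advance, wrapping around
        self.count = min(self.count + 1, self.capacity)

    def latest(self, n):
        """Return up to the n most recent frames, oldest first."""
        n = min(n, self.count)
        start = (self.tail - n) % self.capacity
        return [self.buf[(start + i) % self.capacity] for i in range(n)]
```

With `latest(N)`, the copy to the recognition queue always recovers the frames that have not yet been overwritten, matching the behaviour described above.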
In some embodiments, after determining that the silence detection result of the current audio frame is a silence audio frame, the method further includes:
determining that the global state is the non-mute state, determining that there are mute audio frames for a second consecutive number of frames, and updating the global state to the mute state.
In this embodiment, when the current audio frame is a mute audio frame but the global state is the non-mute state, a state change has occurred at the current audio frame. It should be noted that, influenced by factors such as the user's speaking volume, the silence detection result for part of the speech data may also be a mute audio frame.
In one example, the mute audio frames of the second consecutive frame number may include the current audio frame. In some scenarios, the user may intermittently produce invalid speech while delivering a voice instruction, such as hesitation fillers in "Hi... I want... um... play a song". In this embodiment, to prevent such factors from producing a large amount of invalid audio data (such as the audio between "I want" and "play a song"), the current audio frame is counted toward the mute audio frames of the second consecutive frame number.
In another example, the mute audio frames of the second consecutive frame number may not include the current audio frame. In some scenarios, influenced by speaking habits, the user may speak slowly with a weak trailing voice, so that several consecutive mute audio frames may appear in the detection results while the user is still speaking.
For example, for audio frames with a frame length of 10 ms, since pauses within a spoken passage last about 100 ms, the second frame number may be set to 10 frames. When the silence detection results from the (N-10)th frame through the Nth frame are all mute audio frames, the silence detection result of the (N+1)th frame is then examined: if the (N+1)th frame (i.e., the current frame) is a non-mute audio frame, the user's speech has not ended; otherwise, the user's speech has ended, and the global state may be updated.

In some embodiments, after determining that the silence detection result of the current audio frame is a mute audio frame, the method further includes:
determining that the global state is the non-mute state, determining that there are no mute audio frames for a second consecutive number of frames, and not updating the global state.
In this embodiment, when there are no mute audio frames for the second consecutive frame number, the user's speech has not been completely interrupted; for example, the user may pause briefly and then resume speaking. To avoid affecting the processing of subsequent audio frames, the global state is not updated in this case.
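This second-frame-number rule can be sketched in Python as a counter of consecutive mute frames, assuming 10 ms frames and a 10-frame threshold as in the example above; the state representation and all names here are assumptions:

```python
FRAME_MS = 10
SECOND_FRAME_NUMBER = 10   # ~100 ms of trailing silence (assumed threshold)

def update_global_state(is_silent, state):
    """Flip the global state to 'mute' only after SECOND_FRAME_NUMBER
    consecutive mute frames; any non-mute frame resets the run.
    `state` is a dict with keys 'global' and 'silent_run' (illustrative)."""
    if is_silent:
        state["silent_run"] += 1
        if (state["global"] == "non-mute"
                and state["silent_run"] >= SECOND_FRAME_NUMBER):
            state["global"] = "mute"    # the utterance has ended
    else:
        state["silent_run"] = 0         # speech resumed; keep listening
    return state
```

A brief pause shorter than the threshold never flips the state, which is exactly the behaviour the embodiment describes for intermittent speech.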
In some embodiments, the method further includes recognizing the audio frames in the recognition queue in real time.
When the speech recognition task is executed, the audio frames stored in the recognition queue are audio frames from which the mute portions have been removed. Referring to fig. 6, in this embodiment the speech recognition task is invoked only when audio frames are inserted into the recognition queue, so most of the time only the silence detection task is running and the speech recognition task is not, saving a large number of invalid recognition actions.
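One way to obtain this "recognize only when frames arrive" behaviour is a blocking queue, sketched below in Python; `recognize` is a stand-in for the real speech recognizer, and none of these names come from the application:

```python
import queue
import threading

recognition_q = queue.Queue()   # FIFO recognition queue

def recognition_task(recognize):
    """Consume frames as they are inserted; `get()` blocks while the
    queue is empty, so the recognizer consumes no CPU during silence."""
    while True:
        frame = recognition_q.get()   # sleeps until a frame is inserted
        if frame is None:             # sentinel: stop the task
            break
        recognize(frame)
```

The silence detection task simply calls `recognition_q.put(frame)` for non-mute frames; during mute stretches the recognition thread stays parked inside `get()`.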
Referring to fig. 7, fig. 7 is a complete flowchart of the silence detection task in an audio processing method according to an embodiment of the present application, including the following steps:
Step S1, taking an audio frame from the recording queue;
Step S2, silence detection;
Step S3, mute judgment: if the probability value of the silence detection of the current audio frame exceeds a preset threshold, the current audio frame is judged to be a non-mute audio frame and step S7 is entered; otherwise, the current audio frame is judged to be a mute audio frame and step S4 is entered;
Step S4, if the previous global state is the mute state, step S6 is entered; otherwise, step S5 is entered;
Step S5, if there are K consecutive mute frames before the current audio frame, updating the global state to the mute state;
Step S6, inserting the current audio frame into the silence detection queue;
Step S7, if the previous global state is the non-mute state, step S8 is entered; otherwise, step S9 is entered;
Step S8, inserting the current audio frame into the recognition queue;
Step S9, if there are M consecutive non-mute frames before the current audio frame, updating the global state to the non-mute state and executing step S10;
Step S10, copying the audio frames in the silence detection queue to the recognition queue.
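The steps above can be sketched as a single per-frame function in Python. The thresholds K and M, the buffer capacity, and the handling of non-mute frames before the run reaches M are assumptions, not the application's exact behaviour:

```python
from collections import deque

def process_frame(frame, is_voiced, state, silence_buf, recog_q, K=10, M=3):
    """One iteration of steps S3-S10 for a frame taken from the recording
    queue. `is_voiced` is the silence-detection verdict (S2/S3); `state`
    holds 'global' plus run counters; `silence_buf` is a deque(maxlen=N)."""
    if not is_voiced:                                   # mute branch (S4-S6)
        state["silent_run"] += 1
        state["voiced_run"] = 0
        if state["global"] == "non-mute" and state["silent_run"] >= K:
            state["global"] = "mute"                    # S5
        silence_buf.append(frame)                       # S6
    else:                                               # non-mute branch (S7-S10)
        state["voiced_run"] += 1
        state["silent_run"] = 0
        if state["global"] == "non-mute":
            recog_q.append(frame)                       # S8
        elif state["voiced_run"] >= M:
            state["global"] = "non-mute"                # S9
            recog_q.extend(silence_buf)                 # S10: recover lead-in frames
            silence_buf.clear()                         # simplification: avoid re-copying
            recog_q.append(frame)                       # current frame follows the copy
        else:
            silence_buf.append(frame)                   # assumed: buffer until run >= M
```

Buffering sub-threshold voiced frames in the last branch is one reading of why N is set to at least the first frame number: it lets the S10 copy recover the leading frames of an utterance.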
The audio processing system provided by the application is described below; the audio processing system described below and the audio processing method described above may be cross-referenced.
The system provided in the embodiment of the present application is used for executing the above method embodiments, and specific flow and details refer to the above embodiments, which are not repeated herein.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device may include a processor (Processor) 801, a communication interface (Communications Interface) 802, a memory (Memory) 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 communicate with each other through the communication bus 804. The processor 801 may invoke logic instructions in the memory 803 to perform an audio processing method, the method including: collecting audio data; storing the audio data in a recording queue, where the recording queue is a first-in-first-out queue; retrieving a current audio frame from the recording queue; determining that the silence detection result of the current audio frame is a non-mute audio frame and that the global state is the mute state; determining that non-mute audio frames for a first consecutive number of frames exist after the current audio frame; and copying the audio frames of the silence detection queue to the tail of the recognition queue, where the silence detection queue is a circular queue and the recognition queue is a first-in-first-out queue.
Further, the logic instructions in the memory 803 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present application provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions that, when executed by a computer, enable the computer to perform the audio processing method provided above, the method including: collecting audio data; storing the audio data in a recording queue, where the recording queue is a first-in-first-out queue; retrieving a current audio frame from the recording queue; determining that the silence detection result of the current audio frame is a non-mute audio frame and that the global state is the mute state; determining that non-mute audio frames for a first consecutive number of frames exist after the current audio frame; and copying the audio frames of the silence detection queue to the tail of the recognition queue, where the silence detection queue is a circular queue and the recognition queue is a first-in-first-out queue.
In yet another aspect, the present application further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio processing method provided in the above embodiments, the method including: collecting audio data; storing the audio data in a recording queue, where the recording queue is a first-in-first-out queue; retrieving a current audio frame from the recording queue; determining that the silence detection result of the current audio frame is a non-mute audio frame and that the global state is the mute state; determining that non-mute audio frames for a first consecutive number of frames exist after the current audio frame; and copying the audio frames of the silence detection queue to the tail of the recognition queue, where the silence detection queue is a circular queue and the recognition queue is a first-in-first-out queue.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.

Claims (8)

1. An audio processing method, characterized by comprising:
collecting audio data;
storing the audio data in a recording queue, wherein the recording queue is a first-in-first-out queue;
retrieving a current audio frame from the recording queue;
when it is determined that the silence detection result of the current audio frame is a non-silent audio frame and the global state is the silent state,
determining that there are non-silent audio frames for a first consecutive number of frames; and
copying audio frames of a silence detection queue to a recognition queue, wherein the silence detection queue is a circular queue and the recognition queue is a first-in-first-out queue;
wherein the number of audio frames copied from the silence detection queue is greater than or equal to the first frame number, and the audio frames in the recognition queue are recognized in real time.

2. The method according to claim 1, characterized by further comprising:
inserting the current audio frame into the recognition queue, wherein the current audio frame is located in the recognition queue after the copied audio frames of the silence detection queue.

3. The method according to claim 1, characterized by further comprising:
when it is determined that the global state of the current audio frame is the non-silent state, inserting the current audio frame into the recognition queue.

4. The method according to claim 1, characterized by further comprising:
updating the global state to the non-silent state.

5. The method according to claim 1, characterized by further comprising:
when it is determined that the silence detection result of the current audio frame is a silent audio frame, inserting the current audio frame into the silence detection queue.

6. The method according to claim 5, characterized in that the silence detection queue comprises a plurality of cyclically connected data spaces, the current audio frame is inserted into the data space pointed to by a tail pointer of the silence detection queue, and the tail pointer moves along a data update direction of the silence detection queue toward the data space pointed to by a head pointer of the silence detection queue.

7. The method according to claim 5, characterized by further comprising:
when it is determined that the global state of the current audio frame is the non-silent state, determining that there are silent audio frames for a second consecutive number of frames, and updating the global state to the silent state.

8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the audio processing method according to any one of claims 1 to 7.
CN202310380459.8A 2023-04-04 2023-04-04 An audio processing method and device Active CN116631454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310380459.8A CN116631454B (en) 2023-04-04 2023-04-04 An audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310380459.8A CN116631454B (en) 2023-04-04 2023-04-04 An audio processing method and device

Publications (2)

Publication Number Publication Date
CN116631454A CN116631454A (en) 2023-08-22
CN116631454B true CN116631454B (en) 2025-12-19

Family

ID=87608874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310380459.8A Active CN116631454B (en) 2023-04-04 2023-04-04 An audio processing method and device

Country Status (1)

Country Link
CN (1) CN116631454B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060685A (en) * 2019-04-15 2019-07-26 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN111026532A (en) * 2019-12-10 2020-04-17 苏州思必驰信息科技有限公司 Message Queue Management Method for Voice Data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697831B2 (en) * 2013-06-26 2017-07-04 Cirrus Logic, Inc. Speech recognition
EP3179472B1 (en) * 2015-12-11 2020-03-18 Sony Mobile Communications, Inc. Method and device for recording and analyzing data from a microphone
US9949027B2 (en) * 2016-03-31 2018-04-17 Qualcomm Incorporated Systems and methods for handling silence in audio streams
US11482225B2 (en) * 2020-09-15 2022-10-25 Motorola Solutions, Inc. System and method for concurrent operation of voice operated switch and voice control with wake word
US11356492B2 (en) * 2020-09-16 2022-06-07 Kyndryl, Inc. Preventing audio dropout
CN114627854B (en) * 2020-12-08 2025-03-21 中国电信股份有限公司 Speech recognition method, speech recognition system and storage medium

Also Published As

Publication number Publication date
CN116631454A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
JP6751433B2 (en) Processing method, device and storage medium for waking up application program
EP2994911B1 (en) Adaptive audio frame processing for keyword detection
CN107256707B (en) Voice recognition method, system and terminal equipment
WO2016000569A1 (en) Voice communication method and system in game applications
US20190237070A1 (en) Voice interaction method, device, apparatus and server
CN110287303B (en) Man-machine conversation processing method, device, electronic equipment and storage medium
CN111883117A (en) Voice wake-up method and device
CN106228047B (en) A kind of application icon processing method and terminal device
CN111309962A (en) Method and device for extracting audio clip and electronic equipment
CN110321447A (en) Determination method, apparatus, electronic equipment and the storage medium of multiimage
CN113327611B (en) Voice wakeup method and device, storage medium and electronic equipment
CN116631454B (en) An audio processing method and device
CN114664290B (en) Sound event detection method and device and readable storage medium
CN112527235A (en) Voice playing method, device, equipment and storage medium
CN110035167B (en) Recording method and related device
CN113763921B (en) Method and apparatus for correcting text
WO2025251518A1 (en) Speech playback method and apparatus, and electronic device and storage medium
CN114913853B (en) Voice wake-up method, device, storage medium and electronic device
CN114005436B (en) Method, device and storage medium for determining voice endpoint
CN113495712A (en) Automatic volume adjustment method, apparatus, medium, and device
CN112735451B (en) Scheduling audio code rate switching method based on recurrent neural network, electronic equipment and storage medium
CN110113494A (en) The way of recording and relevant apparatus
CN112489644B (en) Speech recognition method and device for electronic equipment
CN114817006A (en) Stack information writing method, device, equipment and medium
CN116567148A (en) A control method, device, medium and electronic equipment for an intelligent outbound call

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant