CN114596868B - Speech coding method, speech coding device, terminal device and storage medium - Google Patents
Speech coding method, speech coding device, terminal device and storage medium
- Publication number
- CN114596868B (application CN202210240298.8A)
- Authority
- CN
- China
- Prior art keywords
- frame
- frames
- sequence
- characteristic
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The application belongs to the technical field of artificial intelligence and provides a speech coding method, a speech coding device, a terminal device and a storage medium. The speech coding method comprises: encoding a plurality of ordered voice data frames to obtain a feature frame sequence, the voice data frames being obtained by framing the voice data to be recognized; calculating a blank probability of each feature frame in the feature frame sequence; determining valid frames from the feature frame sequence based on the blank probabilities; determining supplemental frames based on the positions of the valid frames in the feature frame sequence; and encoding again based on the valid frames and the supplemental frames to obtain encoded data. By applying a screening mechanism in the encoding stage of speech recognition, the method keeps only the valid frames and supplemental frames as input to the subsequent stage, thereby shortening the feature frame sequence used in later computation, reducing the amount of computation in the speech recognition process and improving the computational efficiency of speech recognition.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a speech coding method, a speech coding apparatus, a terminal device, and a computer readable storage medium.
Background
With the development of speech technology, speech coding has been widely applied in many areas of daily life. For longer speech sequences, end-to-end speech coding methods have achieved better results than conventional methods. Although such methods markedly improve recognition accuracy, their computation remains inefficient.
Disclosure of Invention
In view of this, embodiments of the present application provide a speech coding method, a speech coding apparatus, a terminal device and a computer readable storage medium, which can shorten the feature frame sequence length, thereby reducing the amount of computation in the speech recognition process and improving the computational efficiency of speech recognition.
A first aspect of an embodiment of the present application provides a speech coding method, including:
encoding a plurality of ordered voice data frames to obtain a feature frame sequence, wherein the plurality of voice data frames are obtained by performing a framing operation on voice data to be recognized;
calculating a blank probability of each feature frame in the feature frame sequence;
determining a valid frame from the feature frame sequence based on the blank probability;
determining a supplemental frame based on the position of the valid frame in the feature frame sequence;
and encoding again based on the valid frame and the supplemental frame to obtain encoded data.
A second aspect of an embodiment of the present application provides a speech coding apparatus, including:
the first encoding module is used for encoding a plurality of ordered voice data frames to obtain a feature frame sequence, where the plurality of voice data frames are obtained by performing a framing operation on voice data to be recognized;
the computing module is used for computing the blank probability of each feature frame in the feature frame sequence;
the first determining module is used for determining a valid frame from the feature frame sequence based on the blank probability;
a second determining module, configured to determine a supplemental frame based on the position of the valid frame in the feature frame sequence;
and the second encoding module is used for encoding again based on the valid frame and the supplemental frame to obtain encoded data.
A third aspect of an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the speech coding method provided in the first aspect.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the speech coding method provided in the first aspect.
The voice coding method, the voice coding device, the terminal equipment and the computer readable storage medium provided by the embodiment of the application have the following beneficial effects:
By encoding the plurality of ordered voice data frames, each voice data frame yields one feature frame after encoding, so the plurality of voice data frames yields a plurality of feature frames, which, arranged in time order, form the feature frame sequence. For each feature frame, a blank probability can be calculated, and whether the feature frame is a valid frame can then be determined based on that probability. After all valid frames in the feature frame sequence are determined, supplemental frames can be further determined based on the positions of the valid frames in the feature frame sequence, since a supplemental frame may also contain valid information related to a valid frame. Finally, the valid frames and supplemental frames are encoded again to obtain encoded data. In this way, the length of the feature frame sequence can be shortened and the amount of computation in the subsequent speech recognition process reduced, improving the computational efficiency of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a speech coding method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a conventional coding model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an improved coding model structure according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an FFM network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an extractor according to an embodiment of the present application;
FIG. 6 is a block diagram of a speech coder according to an embodiment of the present application;
FIG. 7 is a block diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The speech coding method according to the embodiment of the present application may be performed by a terminal device, such as a mobile phone, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, or a Personal Digital Assistant (PDA).
The embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, machine learning/deep learning, and other directions. In the speech processing technology of the present application, the feature frame sequence obtained by encoding is filtered and screened, and only the valid frames and supplemental frames in it are retained, which shortens the feature frame sequence, reduces the amount of computation in the speech recognition process and improves the efficiency of speech recognition.
Referring to fig. 1, fig. 1 shows a flowchart of an implementation of a speech coding method according to an embodiment of the present application. The speech coding method comprises the following steps:
Step 110, encoding a plurality of ordered voice data frames to obtain a feature frame sequence.
Because speech signals are short-time stationary, the speech signal may be segmented to facilitate subsequent feature extraction; each segment is called a frame, and the segmentation is called framing. After the voice data to be recognized is obtained, it can be framed to obtain a plurality of voice data frames. Because the speech signal is sequential in time, the voice data frames can be sorted by the time of each frame to obtain a plurality of ordered voice data frames. Each voice data frame is then encoded to obtain a corresponding feature frame. Arranging the feature frames corresponding to the voice data frames in time order (that is, in the order of the corresponding voice data frames) yields the feature frame sequence.
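As a concrete illustration, the framing operation can be sketched as follows. The 25 ms frame length and 10 ms frame shift are common choices in speech processing and are assumptions here, not values fixed by the application:

```python
import numpy as np

def frame_speech(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a speech waveform into ordered, overlapping voice data frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift = int(sample_rate * shift_ms / 1000)       # samples between frame starts
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Slicing with increasing i keeps the frames in time order, which matches
    # the ordering of the feature frame sequence produced by the encoder.
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])
```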
Step 120, calculating a blank probability of each feature frame in the feature frame sequence.
Connectionist Temporal Classification (CTC) is a classification method that avoids manual alignment of inputs and outputs. Considering the characteristics of sequences obtained with this method and the output-independence assumption of CTC, the application makes two assumptions: first, a feature frame whose output corresponds to a non-blank label contains enough information to distinguish the label of that frame, and is recorded as a valid frame; second, a feature frame whose output corresponds to the blank label contains little information for distinguishing the label of that frame, and is recorded as a blank frame. To tell which feature frames are valid frames and which are blank frames, a blank probability can be calculated for each feature frame to distinguish the two kinds of frames in the feature frame sequence.
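For illustration, if the first encoding pass is followed by a CTC output layer, the blank probability of each feature frame can be read off the per-frame softmax. The assumption that the blank label sits at index 0 is a common CTC convention, not something the application specifies:

```python
import torch

def blank_probabilities(ctc_logits: torch.Tensor, blank_id: int = 0) -> torch.Tensor:
    """ctc_logits: (T, V) frame-level logits over the CTC vocabulary.
    Returns a (T,) tensor with the blank probability of each feature frame."""
    return torch.softmax(ctc_logits, dim=-1)[:, blank_id]
```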
Step 130, determining a valid frame from the feature frame sequence based on the blank probability.
To reduce the amount of computation in speech recognition, the blank frames can be screened out of the feature frame sequence and only the valid frames retained, which shortens the feature frame sequence used in subsequent computation and improves the computational efficiency of speech recognition. Specifically, after the blank probability of each feature frame is calculated, whether the feature frame is a valid frame can be determined based on that probability.
Step 140, determining a supplemental frame based on the position of the valid frame in the feature frame sequence.
To avoid losing valid information related to a valid frame during the determination of valid frames, a supplemental frame may be determined based on the position of each valid frame in the feature frame sequence. For example, the valid frames may be graded according to their blank probabilities: the higher the blank probability of a valid frame, the lower its grade and the more supplemental frames may be selected for it; conversely, the higher the grade, the fewer supplemental frames are selected. It should be understood that supplemental frames are selected from among the blank frames.
Step 150, encoding again based on the valid frame and the supplemental frame to obtain encoded data.
After the feature frame sequence is filtered and screened, the valid frames and supplemental frames are obtained, ordered as they were in the feature frame sequence. The valid frames and supplemental frames are encoded again to obtain the encoded data. Because the blank frames carrying little valid information have been screened out, the sequence length of the encoded data is effectively reduced compared with the feature frame sequence obtained in step 110; performing the subsequent speech recognition operations on the encoded data therefore reduces the amount of computation and improves the efficiency of speech recognition.
By encoding the plurality of ordered voice data frames, each voice data frame yields a feature frame after encoding, so a plurality of feature frames are obtained, which, arranged in time order, form the feature frame sequence. For each feature frame a blank probability can be calculated, and whether the feature frame is a valid frame can then be determined based on that probability; after all valid frames in the feature frame sequence are determined, supplemental frames can be further determined based on the positions of the valid frames, considering that supplemental frames may also contain valid information related to the valid frames. Finally, the valid frames and supplemental frames are encoded again to obtain the encoded data. In this way, the length of the feature frame sequence is shortened, the amount of computation in the speech recognition process is reduced, and the computational efficiency of speech recognition is improved.
In some embodiments, in order to accurately determine whether each feature frame is a valid frame, the step 120 specifically includes:
For each feature frame:
Step 121, judging whether the blank probability of the feature frame is smaller than a preset first blank probability threshold.
And step 122, if the blank probability of the feature frame is smaller than a preset first blank probability threshold, determining the feature frame as a valid frame.
A blank probability threshold can be determined in advance from empirical values, and whether the blank probability is smaller than this threshold is used as the condition for judging whether each feature frame is a valid frame. The smaller the blank probability of a feature frame, the more valid information it contains; if its blank probability is smaller than the threshold, the feature frame is determined to be a valid frame. Judging every feature frame in the feature frame sequence in this way yields all valid frames in the sequence. This threshold is used for screening valid frames; to distinguish it from another blank probability threshold introduced below, it is recorded as the first blank probability threshold, and the one below as the second blank probability threshold.
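A minimal sketch of this screening step follows; the threshold value 0.5 is illustrative, since the application only requires that the first blank probability threshold be set empirically:

```python
def select_valid_frames(blank_prob, first_threshold: float = 0.5):
    """Return the indices of feature frames whose blank probability is below
    the first blank probability threshold, i.e. the valid frames."""
    return [t for t, p in enumerate(blank_prob) if p < first_threshold]
```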
In some embodiments, to reduce the risk of losing valid information when determining the supplemental frames, the step 140 specifically includes:
If a target feature frame exists to the left and/or right of a valid frame in the feature frame sequence, the target feature frame is determined to be a supplemental frame, where a target feature frame is a feature frame whose blank probability is smaller than a preset second blank probability threshold.
To avoid filtering out feature frames that still contain valid information, the blank probability threshold can be raised: a second blank probability threshold is set for screening the blank frames, and blank frames whose blank probability is smaller than the second blank probability threshold are determined to be supplemental frames, which improves the integrity of the valid information. "Left" and "right" of a valid frame refer to its position in the feature frame sequence. Suppose a segment of the feature frame sequence is "i, i+1, i+2", where i+1 is a valid frame; then the blank frame to the left of the valid frame is i, and the blank frame to the right is i+2.
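A sketch of this variant follows, assuming an illustrative second threshold of 0.8 and that only the immediate left and right neighbours of each valid frame are examined:

```python
def neighbor_supplemental_frames(blank_prob, valid_frames,
                                 second_threshold: float = 0.8):
    """Blank frames directly to the left/right of a valid frame whose blank
    probability is below the (higher) second threshold become supplemental."""
    valid = set(valid_frames)
    supplemental = set()
    for i in valid:
        for j in (i - 1, i + 1):                 # left and right neighbours
            if 0 <= j < len(blank_prob) and j not in valid \
                    and blank_prob[j] < second_threshold:
                supplemental.add(j)
    return sorted(supplemental)
```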
In some embodiments, in addition to the above method of performing secondary screening on blank frames by using the second blank probability threshold, the above step 140 may be implemented by:
Step 141, acquiring each label of each valid frame and the output probability corresponding to each label;
Step 142, if the target labels of two adjacent valid frames in the feature frame sequence are the same, determining the feature frames between the two adjacent valid frames as supplemental frames.
For each feature frame, a labeling operation can be performed by the indicator, giving the feature frame different labels, where each label corresponds to a probability of outputting that label, i.e., an output probability. That is, if a feature frame outputs several labels, each label has an output probability, and the label with the highest output probability is the target label of that feature frame. For example, if feature frame A has 3 labels, there are 3 corresponding output probabilities; assuming they are 65%, 55% and 80%, the label with the 80% output probability is the target label. In the feature frame sequence, if the target labels of two adjacent valid frames are the same, the valid information contained in the two valid frames is the same, so the blank frames between them also contain the same valid information and can be determined to be supplemental frames, improving the integrity of the valid information. "Adjacent" likewise refers to position in the feature frame sequence: after all valid frames have been determined, the position of each valid frame in the sequence is known, and two valid frames are adjacent if only blank frames lie between them.
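A sketch of this label-based variant follows; `target_labels[t]` is assumed to hold the highest-probability label of frame t, and `valid_frames` is assumed sorted in sequence order:

```python
def label_supplemental_frames(valid_frames, target_labels):
    """If two adjacent valid frames share the same target label, every blank
    frame lying between them is determined to be a supplemental frame."""
    supplemental = []
    for a, b in zip(valid_frames, valid_frames[1:]):
        if target_labels[a] == target_labels[b]:
            # With valid_frames sorted, all frames strictly between two
            # consecutive valid frames are blank frames.
            supplemental.extend(range(a + 1, b))
    return supplemental
```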
In some embodiments, in addition to the two ways of determining the supplemental frame, the step 140 may be implemented by:
(1) m frames (m ≤ n) are randomly selected from the n frames to the left of the valid frame; the closer a frame is to the valid frame, the greater its probability of being determined to be a supplemental frame.
(2) m frames (m ≤ n) are randomly selected from the n frames to the right of the valid frame; the closer a frame is to the valid frame, the greater its probability of being determined to be a supplemental frame.
(3) m frames (m ≤ n) are randomly selected from the n frames on each side of the valid frame; the closer a frame is to the valid frame, the greater its probability of being determined to be a supplemental frame.
After the valid frames in the feature frame sequence are determined, their positions in the sequence are known, and a specified number of blank frames can then be selected from the left and/or right of each valid frame according to a probability distribution. A selected blank frame may be called a candidate frame; the closer it is to the valid frame, the greater its probability of being determined to be a supplemental frame. Specifically, a distance threshold can be set: the distance between each candidate frame and its nearest valid frame is calculated first, then each distance is compared with the distance threshold, and candidate frames whose distance is smaller than the threshold can be determined to be supplemental frames.
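A sketch of the distance-weighted random selection follows; n = 3 candidates per side, m = 2 picks and the 1/distance weighting are all illustrative choices, not values given by the application:

```python
import random

def random_supplemental_frames(valid_frames, seq_len, n: int = 3, m: int = 2):
    """Randomly pick up to m of the n blank frames on each side of every
    valid frame; frames closer to a valid frame are more likely to be picked."""
    valid = set(valid_frames)
    supplemental = set()
    for i in valid_frames:
        candidates = [j for j in range(max(0, i - n), min(seq_len, i + n + 1))
                      if j != i and j not in valid]
        for _ in range(min(m, len(candidates))):
            weights = [1.0 / abs(j - i) for j in candidates]  # closer => heavier
            pick = random.choices(candidates, weights=weights, k=1)[0]
            candidates.remove(pick)
            supplemental.add(pick)
    return sorted(supplemental)
```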
In some embodiments, to further reduce the risk of losing valid information, the method further includes, after step 140:
A. extracting feature information from the valid frames, the supplemental frames and the feature frame sequence based on a preset extractor.
Some valid information may still remain in the blank frames, and feature extraction can be performed on the valid frames, the supplemental frames and the blank frames to reduce the risk of losing it. In practice, extracting feature information from the blank frames alone is difficult, so the possibility of extracting blank-frame information can be enhanced by expanding the extraction range to the whole feature frame sequence. Specifically, the valid frames and supplemental frames are extracted, and the feature frame sequence is extracted as well (which amounts to extracting feature information from the blank frames), further reducing the risk of losing valid information and improving the integrity of the feature information.
Accordingly, the step 150 specifically includes re-encoding based on the feature information to obtain encoded data.
Correspondingly, when encoding again, the feature information output by the extractor is encoded directly.
For a further understanding of the speech coding method of the present application, the model used to implement it is described below. Referring to fig. 2, fig. 2 shows a classical prior-art speech coding model, which lacks the filtering and extraction mechanism of the present application. To shorten the feature frame sequence obtained by encoding, the application introduces a filtering and extraction mechanism implemented by an FFM network. The FFM network can be embedded in the classical speech coding model; the resulting structure is shown in fig. 3, which shows the coding model improved by the present application. The improved model introduces the filtering and extraction mechanism into the encoding process and reduces the length of the feature frame sequence obtained after encoding, thereby reducing the amount of computation in subsequent speech recognition and improving the computational efficiency of speech recognition.
Specifically, as shown in fig. 3, the FFM network may be embedded between Conformer blocks. By way of example and not limitation, assuming the Conformer encoder has 15 layers in total, the FFM network may be embedded after the 7th layer: the first 7 layers implement the first encoding, and the last 8 layers perform the second encoding on the feature information filtered and extracted by the FFM network, yielding the encoded data.
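A sketch of this split encoder follows; `conformer_block` and `ffm` are placeholders for the Conformer block and FFM network implementations, whose internals the application does not tie to this arrangement:

```python
import torch.nn as nn

class EncoderWithFFM(nn.Module):
    """Hypothetical 15-layer encoder: first encoding -> FFM -> second encoding."""
    def __init__(self, conformer_block, ffm, num_layers: int = 15, split: int = 7):
        super().__init__()
        self.first = nn.ModuleList(conformer_block() for _ in range(split))
        self.ffm = ffm                         # filters frames, extracts features
        self.second = nn.ModuleList(conformer_block()
                                    for _ in range(num_layers - split))

    def forward(self, x):
        for block in self.first:               # layers 1-7: first encoding
            x = block(x)
        x = self.ffm(x)                        # keep valid + supplemental frames
        for block in self.second:              # layers 8-15: second encoding
            x = block(x)
        return x
```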
Referring to fig. 4, fig. 4 shows the structure of the FFM network. The FFM network includes an indicator, used for screening and determining the valid frames and supplemental frames in the feature frame sequence, and an extractor, used for extracting the feature information in the valid frames, the supplemental frames and the feature frame sequence (in effect, extracting the feature information in the blank frames).
In some embodiments, the extractor includes a first feed-forward neural network (FFN) module, a multi-head encoder-decoder attention (MHEDA) module, a convolution (CONV) module, a second feed-forward neural network module and a layer normalization (LN) module.
And the first feedforward neural network module is used for executing full-connection operation on the effective frames, the supplementary frames and the characteristic frame sequences to obtain a first characteristic frame sequence.
And the multi-head encoding and decoding attention module is used for executing self-attention encoding and decoding operation on the first characteristic frame sequence to obtain a second characteristic frame sequence.
The convolution module is used for performing convolution operation on the second characteristic frame sequence to obtain a third characteristic frame sequence;
and the second feedforward neural network module is used for executing a full-connection operation on the third feature frame sequence to obtain a fourth feature frame sequence.
And the normalization module is used for executing a normalization operation on the fourth feature frame sequence to obtain the feature information.
The mathematical expression of the extractor is as follows:

(1) X_ffn = X_f + 0.5 · FFN_1(X_f)

(2) X_eda = X_ffn + MHEDA(X_ffn, X_e, X_e)

(3) X_conv = X_eda + CONV(X_eda)

(4) Y_e = LN(X_conv + FFN_2(X_conv))

where X_f denotes the valid frames and supplemental frames ordered as in the feature frame sequence, X_e denotes the feature frame sequence, X_ffn is the first feature frame sequence, X_eda is the second feature frame sequence, X_conv is the third feature frame sequence, and Y_e is the feature information.
Referring to fig. 5, the above expressions are explained with the schematic diagram of the extractor:
Assume the feature frame sequence is X_e. After X_e is filtered, the valid frames and supplemental frames X_f are obtained, ordered as in the feature frame sequence. X_e and X_f are input into the extractor, and the feature information is obtained through the following steps:
In the first step, X_f is input into the first feedforward neural network module to obtain the output sequence FFN_1(X_f); FFN_1(X_f) is multiplied by 0.5 as a residual and added to X_f, giving the first feature frame sequence X_ffn.
In the second step, X_ffn and X_e are input into the MHEDA module, and the attention distribution is calculated to obtain the output sequence MHEDA(X_ffn, X_e, X_e); this output is added to X_ffn as a residual, giving the second feature frame sequence X_eda.
In the third step, X_eda is input into the convolution module to obtain CONV(X_eda), which is added to X_eda as a residual, giving the third feature frame sequence X_conv.
In the fourth step, X_conv is input into the second feedforward neural network module to obtain FFN_2(X_conv), which is added to X_conv as a residual and then regularized by the layer normalization module, giving the feature information Y_e.
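Putting the four steps together, a minimal PyTorch sketch of the extractor follows. The hidden width, activation and convolution design are assumptions, since the application fixes only the module order and the residual pattern of equations (1) to (4):

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """Sketch of the extractor: FFN_1 -> MHEDA -> CONV -> FFN_2 -> LN."""
    def __init__(self, d: int, num_heads: int = 4):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.mheda = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.ffn2 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, x_f: torch.Tensor, x_e: torch.Tensor) -> torch.Tensor:
        # x_f: (B, T_f, d) valid + supplemental frames; x_e: (B, T, d) full sequence
        x_ffn = x_f + 0.5 * self.ffn1(x_f)                                 # Eq. (1)
        x_eda = x_ffn + self.mheda(x_ffn, x_e, x_e)[0]                     # Eq. (2)
        x_conv = x_eda + self.conv(x_eda.transpose(1, 2)).transpose(1, 2)  # Eq. (3)
        return self.norm(x_conv + self.ffn2(x_conv))                       # Eq. (4)
```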
The specific expression of the MHEDA module is:

(5) head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

(6) MHEDA(Q, K, V) = Concat(head_1, …, head_n)·W^O

(7) Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V

where Q (query) denotes the feature frame sequence to be queried, K (key) denotes the keywords, V (value) denotes the value sequence, head_i denotes the i-th attention head, W_i^Q, W_i^K, W_i^V and W^O are projection matrices, and d_k is the dimension of the keys.
In some embodiments, the above-mentioned speech coding method further comprises:
The feature frame sequence, the valid frames, the supplemental frames, the plurality of voice data frames and/or the encoded data are deployed into a blockchain.
In order to ensure the security of the data and fairness and transparency to users, the feature frame sequence, the valid frames, the supplemental frames, the plurality of voice data frames and/or the encoded data can be deployed to a blockchain for verification. A user may then download these data from the blockchain through a corresponding device to verify whether they have been tampered with. The blockchain referred to in this embodiment is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
In addition, the embodiment of the application also provides a voice coding device.
Referring to fig. 6, fig. 6 is a block diagram of a speech coding apparatus according to an embodiment of the present application. The units included in the apparatus of this embodiment are configured to perform the steps of the embodiment corresponding to fig. 1; refer to fig. 1 and the related description of that embodiment for details. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 6, the speech coding apparatus 60 includes:
a first encoding module 61, configured to encode a plurality of ordered voice data frames to obtain a feature frame sequence, where the plurality of voice data frames are obtained by framing the voice data to be recognized;
a calculation module 62, configured to calculate a blank probability of each feature frame in the sequence of feature frames;
A first determining module 63, configured to determine a valid frame from the feature frame sequence based on the blank probability;
a second determining module 64, configured to determine a supplemental frame based on the position of the valid frame in the feature frame sequence;
the second encoding module 65 is configured to encode again based on the active frame and the supplemental frame, to obtain encoded data.
As an embodiment of the present application, the first determining module 63 may include:
For each feature frame:
the judging unit is used for judging whether the blank probability of the feature frame is smaller than a preset first blank probability threshold value;
and the first determining unit is used for determining the feature frame as a valid frame if the blank probability of the feature frame is smaller than the first blank probability threshold value.
As an embodiment of the present application, the second determining module 64 may include:
and the second determining unit is used for determining the target feature frame as a supplementary frame if the target feature frame exists on the left side and/or the right side of the effective frame in the feature frame sequence, wherein the target feature frame is a feature frame with the blank probability smaller than a preset second blank probability threshold value in the feature frame sequence.
As an embodiment of the present application, the second determining module 64 may include:
the acquisition unit is used for acquiring each label of each effective frame and each output probability corresponding to each label;
and a third determining unit, configured to determine the feature frames between two adjacent valid frames as supplemental frames if the target labels of the two adjacent valid frames in the feature frame sequence are the same, where the target label is the label with the highest output probability among the labels of a valid frame.
As an embodiment of the present application, the speech encoding apparatus 60 may include:
the extraction module is used for extracting feature information from the effective frame, the supplementary frame and the feature frame sequence based on a preset extractor after the supplementary frame is determined based on the position of the effective frame in the feature frame sequence;
Correspondingly, the second encoding module is specifically configured to encode again based on the feature information to obtain the encoded data.
As an embodiment of the present application, the speech encoding apparatus 60 further includes:
the deployment module is used for deploying the characteristic frame sequence, the effective frame, the supplementary frame, the plurality of voice data frames and/or the coded data into the block chain.
Fig. 7 is a block diagram of a terminal device according to another embodiment of the present application. As shown in fig. 7, the terminal device 70 of this embodiment includes a processor 71, a memory 72 and a computer program 73, such as a program implementing the speech coding method, stored in the memory 72 and executable on the processor 71. When executing the computer program 73, the processor 71 implements the steps of the above speech coding method embodiments, such as steps 110 to 150 shown in fig. 1. Alternatively, when executing the computer program 73, the processor 71 implements the functions of the modules in the embodiment corresponding to fig. 6, for example the functions of modules 61 to 65 shown in fig. 6; refer to the related description of that embodiment, which is not repeated here.
Illustratively, the computer program 73 may be divided into one or more units that are stored in the memory 72 and executed by the processor 71 to carry out the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 73 in the terminal device 70. For example, the computer program 73 may be divided into a first encoding module, a computing module, a first determining module, a second determining module and a second encoding module, each with the specific functions described above.
The terminal device may include, but is not limited to, the processor 71 and the memory 72. Those skilled in the art will appreciate that fig. 7 is merely an example of the terminal device 70 and does not limit it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may also include input and output devices, network access devices, buses, and the like.
The processor 71 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 72 may be an internal storage unit of the terminal device 70, such as a hard disk or memory of the terminal device 70. The memory 72 may also be an external storage device of the terminal device 70, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the terminal device 70. Further, the memory 72 may include both an internal storage unit and an external storage device of the terminal device 70. The memory 72 is used to store the computer program and other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
The foregoing embodiments are merely for illustrating the technical solution of the present application, but not for limiting the same, and although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or substituted for some of the technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solution of the embodiments of the present application and are intended to be included in the scope of the present application.
Claims (8)
1. A speech coding method, the speech coding method comprising:
encoding a plurality of ordered voice data frames to obtain a feature frame sequence, wherein the voice data frames are obtained by performing a framing operation on voice data to be recognized;
calculating a blank probability of each feature frame in the feature frame sequence based on a CTC framework;
determining a valid frame from the sequence of feature frames based on the blank probability;
determining a supplemental frame from the non-valid frames based on the position of the valid frame in the feature frame sequence;
coding again based on the effective frame and the supplementary frame to obtain coded data;
After the determination of the supplemental frame based on the position of the active frame in the sequence of feature frames, the speech encoding method further comprises:
Extracting feature information from the effective frame, the supplementary frame and the feature frame sequence based on a preset extractor;
accordingly, the re-encoding based on the effective frame and the supplementary frame to obtain encoded data includes:
re-encoding based on the characteristic information to obtain encoded data;
The extractor comprises a first feedforward neural network module, a multi-head encoding and decoding attention module, a convolution module, a second feedforward neural network module and a normalization module;
The first feedforward neural network module is used for executing full-connection operation on the effective frame, the supplementary frame and the characteristic frame sequence to obtain a first characteristic frame sequence;
The multi-head encoding and decoding attention module is used for executing self-attention encoding and decoding operation on the first characteristic frame sequence to obtain a second characteristic frame sequence;
The convolution module is used for performing convolution operation on the second characteristic frame sequence to obtain a third characteristic frame sequence;
the second feedforward neural network module is used for executing a full-connection operation on the third feature frame sequence to obtain a fourth feature frame sequence;
and the normalization module is used for executing a normalization operation on the fourth feature frame sequence to obtain the feature information.
2. The speech coding method according to claim 1, wherein said determining a valid frame from said sequence of feature frames based on said blank probability comprises:
For each feature frame:
Judging whether the blank probability of the feature frame is smaller than a preset first blank probability threshold value or not;
And if the blank probability of the feature frame is smaller than the first blank probability threshold value, determining the feature frame as a valid frame.
3. The speech coding method according to claim 1, wherein said determining a supplemental frame based on the position of the active frame in the sequence of feature frames comprises:
If a target feature frame exists on the left side and/or the right side of the effective frame in the feature frame sequence, determining the target feature frame as the supplementary frame, wherein the target feature frame is a feature frame with the blank probability smaller than a preset second blank probability threshold value in the feature frame sequence.
4. The speech coding method according to claim 1, wherein said determining a supplemental frame based on the position of the active frame in the sequence of feature frames comprises:
Acquiring each label of each effective frame and the output probability corresponding to each label;
and if the target labels of two adjacent valid frames in the feature frame sequence are the same, the target label being the label with the highest output probability among the labels of a valid frame, determining the feature frames between the two adjacent valid frames as supplemental frames.
5. The speech coding method according to claim 1, wherein after the encoding is performed again based on the effective frame and the supplemental frame, the speech coding method further comprises:
The feature frame sequence, the valid frames, the supplemental frames, the plurality of voice data frames and/or the encoded data are deployed into a blockchain.
6. A speech coder, the speech coder comprising:
the first coding module is used for coding a plurality of ordered voice data frames to obtain a feature frame sequence, wherein the plurality of voice data frames are obtained by performing a framing operation on voice data to be recognized;
the calculating module is used for calculating the blank probability of each feature frame in the feature frame sequence based on a CTC framework;
a first determining module, configured to determine a valid frame from the feature frame sequence based on the blank probability;
A second determining module for determining a supplemental frame from non-valid frames based on a position of the valid frame in the sequence of feature frames;
The second coding module is used for coding again based on the effective frame and the supplementary frame to obtain coded data;
The speech encoding apparatus further includes:
The extraction module is used for extracting characteristic information from the effective frames, the supplementary frames and the characteristic frame sequences based on a preset extractor;
correspondingly, the second coding module is specifically configured to encode again based on the feature information to obtain the coded data;
The extractor comprises a first feedforward neural network module, a multi-head encoding and decoding attention module, a convolution module, a second feedforward neural network module and a normalization module;
The first feedforward neural network module is used for executing full-connection operation on the effective frame, the supplementary frame and the characteristic frame sequence to obtain a first characteristic frame sequence;
The multi-head encoding and decoding attention module is used for executing self-attention encoding and decoding operation on the first characteristic frame sequence to obtain a second characteristic frame sequence;
The convolution module is used for performing convolution operation on the second characteristic frame sequence to obtain a third characteristic frame sequence;
the second feedforward neural network module is used for executing a full-connection operation on the third feature frame sequence to obtain a fourth feature frame sequence;
and the normalization module is used for executing a normalization operation on the fourth feature frame sequence to obtain the feature information.
7. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210240298.8A CN114596868B (en) | 2022-03-10 | 2022-03-10 | Speech coding method, speech coding device, terminal device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114596868A CN114596868A (en) | 2022-06-07 |
| CN114596868B (en) | 2025-12-16 |
Family ID: 81816911
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210240298.8A Active CN114596868B (en) | 2022-03-10 | 2022-03-10 | Speech coding method, speech coding device, terminal device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114596868B (en) |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8260609B2 (en) * | 2006-07-31 | 2012-09-04 | Qualcomm Incorporated | Systems, methods, and apparatus for wideband encoding and decoding of inactive frames |
| CN107452399B (en) * | 2017-09-18 | 2020-09-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio feature extraction method and device |
| CN110136715B (en) * | 2019-05-16 | 2021-04-06 | 北京百度网讯科技有限公司 | Speech recognition method and device |
| CN112017676B (en) * | 2019-05-31 | 2024-07-16 | 京东科技控股股份有限公司 | Audio processing method, device and computer readable storage medium |
| WO2021022521A1 (en) * | 2019-08-07 | 2021-02-11 | 华为技术有限公司 | Method for processing data, and method and device for training neural network model |
| CN111755029B (en) * | 2020-05-27 | 2023-08-25 | 北京大米科技有限公司 | Voice processing method, device, storage medium and electronic equipment |
| CN114065771A (en) * | 2020-08-01 | 2022-02-18 | 新加坡依图有限责任公司(私有) | Pre-training language processing method and device |
| CN113436620B (en) * | 2021-06-30 | 2022-08-30 | 北京有竹居网络技术有限公司 | Training method of voice recognition model, voice recognition method, device, medium and equipment |
| CN113707137B (en) * | 2021-08-30 | 2024-02-20 | 普强时代(珠海横琴)信息技术有限公司 | Decoding realization method and device |
- 2022-03-10: application CN202210240298.8A filed in China; granted as CN114596868B (status: Active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN114596868A (en) | 2022-06-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112131272B (en) | Detection methods, devices, equipment and storage media for multivariate KPI time series | |
| CN113641819B (en) | Argument Mining System and Method Based on Multi-task Sparse Shared Learning | |
| CN112069795A (en) | Corpus detection method, apparatus, device and medium based on mask language model | |
| CN111159407A (en) | Method, apparatus, device and medium for training entity recognition and relation classification model | |
| CN113886577B (en) | A text classification method, device, equipment and storage medium | |
| CN111695604B (en) | Method and device for determining image credibility, electronic equipment and storage medium | |
| CN113223502A (en) | Speech recognition system optimization method, device, equipment and readable storage medium | |
| CN116543351B (en) | A self-supervised group behavior recognition method based on spatiotemporal serial-parallel relation coding | |
| CN111091839B (en) | Voice awakening method and device, storage medium and intelligent device | |
| CN114241499A (en) | Table picture identification method, device and equipment and readable storage medium | |
| CN116612416B (en) | Method, device and equipment for dividing video target and readable storage medium | |
| CN114077841A (en) | Semantic extraction method and device based on artificial intelligence, electronic equipment and medium | |
| CN112347739B (en) | Applicable rule analysis method, device, electronic device and storage medium | |
| CN113190675A (en) | Text abstract generation method and device, computer equipment and storage medium | |
| CN111563161B (en) | A sentence recognition method, a sentence recognition device and an intelligent device | |
| CN118968373A (en) | Video processing method and device | |
| CN112764807B (en) | Code summary generation method and system based on multi-scale AST and feature fusion | |
| CN112529302A (en) | Method and system for predicting success rate of patent application authorization and electronic equipment | |
| CN113469237A (en) | User intention identification method and device, electronic equipment and storage medium | |
| CN113657318A (en) | Pet classification method, device, equipment and storage medium based on artificial intelligence | |
| CN114596868B (en) | Speech coding method, speech coding device, terminal device and storage medium | |
| CN114090797B (en) | A component retrieval method and device based on intelligent recommendation | |
| CN113283514B (en) | A method, device and medium for classifying unknown categories based on deep learning | |
| CN110928987B (en) | Legal provision retrieval method and related equipment based on neural network hybrid model | |
| CN118038215A (en) | Model knowledge distillation method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |