
CN112466282B - Speech recognition system and method oriented to aerospace professional field - Google Patents


Info

Publication number
CN112466282B
CN112466282B (application CN202011139217.2A)
Authority
CN
China
Prior art keywords
network, sequence, long, short, memory network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011139217.2A
Other languages
Chinese (zh)
Other versions
CN112466282A (en)
Inventor
温正棋
李博
刘进涛
任斌
李振龙
周仔恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Simulation Center
Original Assignee
Beijing Simulation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Simulation Center filed Critical Beijing Simulation Center
Priority to CN202011139217.2A priority Critical patent/CN112466282B/en
Publication of CN112466282A publication Critical patent/CN112466282A/en
Application granted granted Critical
Publication of CN112466282B publication Critical patent/CN112466282B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/083: Recognition networks
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a speech recognition system and method for the aerospace professional field. The system comprises: an encoder, composed of a first long-short-time memory network, which takes an acoustic feature sequence as input and, after encoding, outputs the hidden representations corresponding to the acoustic feature sequence; a prediction network, composed of a second long-short-time memory network, which first receives the text-sequence start symbol sos and outputs the hidden representation corresponding to the first word of the text sequence, then receives the embedding vector of one word at a time and outputs the hidden representation corresponding to the predicted word; a bias encoding network, composed of a third long-short-time memory network, which takes an aerospace professional vocabulary sequence as input and outputs the corresponding hidden representations; and a fusion network, composed of a multi-layer perceptron, which fuses the output results of the three networks and predicts the next word of the text sequence.

Description

Speech recognition system and method oriented to aerospace professional field
Technical Field
The invention relates to the technical field of electronic information, and more particularly to a speech recognition system and method oriented to the aerospace professional field.
Background
Voice interaction is one of the most natural modes of human-machine interaction. At its heart is speech recognition, i.e., converting speech into text for subsequent processing by a computer. In recent years, speech recognition has made great breakthroughs and has been put into practical use. Meanwhile, with the development of aerospace technology, people now have the opportunity to enter space, and enabling astronauts to interact with and control equipment more naturally and conveniently has become a necessary technology. A speech recognition system for the aerospace field must occupy fewer system resources and have a lower computation cost, while recognizing the professional vocabulary of aerospace equipment more accurately.
Currently, there are many speech recognition techniques and systems, such as large-vocabulary speech recognition systems based on hidden Markov models, which are used in many commercial products. These large-vocabulary continuous speech recognition systems often build decoding networks based on weighted finite-state transducers. The decoding network is very large, making the search during decoding computationally expensive. The storage and memory footprint of the whole system is high, and the power consumption during decoding is also high, which limits its application in the aerospace field. However, if the decoding network is compressed too aggressively, the performance of the recognition system is greatly impaired and the error rate rises sharply.
Therefore, a new speech recognition method and system oriented to the aerospace field is needed, one that reduces computation cost and storage footprint while recognizing both aerospace professional vocabulary and everyday language efficiently and accurately.
Disclosure of Invention
The invention provides a speech recognition system and method for the aerospace professional field, which address the high computation cost and the low accuracy on professional vocabulary of existing speech recognition systems.
In order to achieve the above object, the present invention provides the following technical solutions:
the first aspect of the present invention provides a speech recognition system oriented to the aerospace professional field, comprising:
an encoder, composed of a first long-short-time memory network, which takes as input the acoustic feature sequence extracted by a signal-processing-based feature extractor and, after encoding, outputs the hidden representations corresponding to the acoustic feature sequence;
a prediction network, composed of a second long-short-time memory network, which first receives the text-sequence start symbol sos and outputs the hidden representation corresponding to the first word of the text sequence, then receives the embedding vector of one word at a time and outputs the hidden representation corresponding to the predicted word;
a bias encoding network, composed of a third long-short-time memory network, which takes as input an aerospace professional vocabulary sequence and, after encoding, outputs the corresponding hidden representations;
and a fusion network, composed of a multi-layer perceptron, which takes as input the output results of the encoder, the prediction network, and the bias encoding network, and predicts the next word of the text sequence.
In a specific embodiment, the encoder composed of the first long-short-time memory network encodes the extracted acoustic feature sequence according to the following formula:
h_t = LSTM(h_{t-1}, x_t)
where LSTM is the unit function of a long-short-time memory network, h_t is the hidden representation corresponding to the acoustic feature sequence at time t, h_{t-1} is the hidden representation at time t-1, and x_t is the acoustic feature at time t.
In a specific embodiment, the prediction network composed of the second long-short-time memory network obtains the hidden representation of each word of the text sequence according to the following formula:
c_j = LSTM(c_{j-1}, y_j)
where LSTM is the unit function of a long-short-time memory network, c_{j-1} is the hidden representation corresponding to the word at position j-1, and y_j is the embedding vector of the word at position j.
In a specific embodiment, the bias encoding network composed of the third long-short-time memory network obtains the hidden representations corresponding to the aerospace professional vocabulary sequence according to the following formula:
b_k = LSTM(b_{k-1}, z_k)
where LSTM is the unit function of a long-short-time memory network, b_{k-1} is the hidden representation corresponding to the word at position k-1 of the aerospace professional vocabulary sequence, and z_k is the embedding vector of the word at position k of the aerospace professional vocabulary sequence.
In a specific embodiment, the fusion network composed of the multi-layer perceptron fuses the output results of the three networks, namely the encoder composed of the first long-short-time memory network, the prediction network composed of the second long-short-time memory network, and the bias encoding network composed of the third long-short-time memory network, and predicts the next word of the text sequence according to the following formula:
P(y_{j+1}) = MLP([c_j, b_k, h_t])
where MLP is the function of the multi-layer perceptron, h_t is the hidden representation corresponding to the acoustic feature sequence at time t extracted by the encoder, b_k is the hidden representation corresponding to the word at position k of the aerospace professional vocabulary sequence from the bias encoding network, and c_j is the hidden representation corresponding to the word at position j from the prediction network.
In a specific embodiment, in the recognition stage, the optimal text sequence is searched out with the Viterbi algorithm according to the following formula:
y* = argmax_y trans(x, z, y)
where trans denotes the whole speech recognition system model, argmax selects the word sequence with the maximum probability, x denotes the acoustic feature sequence, z denotes the aerospace professional vocabulary sequence, y ranges over all text sequences, and y* denotes the optimal text sequence.
In a specific embodiment, in response to the user providing an aerospace professional vocabulary sequence z, the system can recognize the corresponding aerospace professional vocabulary.
A second aspect of the invention provides a method of training the system of the first aspect, using the following loss function:
L(θ) = -log Σ_a P(a | x, z; θ)
where the sum runs over the text sequences a obtained by inserting filling symbols into the labelled text sequence y, θ denotes the parameters of the whole neural network, x denotes the acoustic feature sequence, and z denotes the aerospace professional vocabulary sequence.
A third aspect of the invention provides a method of speech recognition using a system trained by the method of the second aspect, comprising:
inputting the acoustic feature sequence extracted by the signal-processing-based feature extractor into the trained system, and outputting the hidden representations corresponding to the acoustic feature sequence through the encoder composed of the first long-short-time memory network;
in the trained system, inputting the text-sequence start symbol sos into the prediction network composed of the second long-short-time memory network, outputting the hidden representation corresponding to the first word of the text sequence, then inputting the embedding vector of one word at a time and outputting the hidden representation corresponding to the predicted word;
in the trained system, inputting the aerospace professional vocabulary sequence into the bias encoding network composed of the third long-short-time memory network and, after encoding, outputting the corresponding hidden representations;
in the trained system, fusing, by the fusion network composed of the multi-layer perceptron, the output results of the encoder, the prediction network, and the bias encoding network to predict the next word of the text sequence.
A fourth aspect of the invention provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to the second or third aspect of the invention.
A fifth aspect of the invention provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to the second or third aspect of the invention when executing the program.
The beneficial effects of the invention are as follows:
according to the aerospace-field-oriented voice recognition method and system, voice recognition is performed by inputting the acoustic feature sequence and the aerospace-field professional vocabulary provided by the user. The method can realize higher recognition accuracy for the professional vocabulary in the aerospace field.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a schematic diagram of a speech recognition system for the aerospace field according to one embodiment of the invention.
Detailed Description
In order to more clearly illustrate the present invention, the present invention will be further described with reference to preferred embodiments and the accompanying drawings. Like parts in the drawings are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and that this invention is not limited to the details given herein.
A first embodiment of the present invention provides a speech recognition system for the aerospace professional field, as shown in FIG. 1, comprising:
an encoder, composed of a first long-short-time memory network, which takes as input the acoustic feature sequence extracted by a signal-processing-based feature extractor and, after encoding, outputs the hidden representations corresponding to the acoustic feature sequence;
a prediction network, composed of a second long-short-time memory network, which first receives the text-sequence start symbol sos and outputs the hidden representation corresponding to the first word of the text sequence, then receives the embedding vector of one word at a time and outputs the hidden representation corresponding to the predicted word;
a bias encoding network, composed of a third long-short-time memory network, which takes as input an aerospace professional vocabulary sequence and, after encoding, outputs the corresponding hidden representations;
and a fusion network, composed of a multi-layer perceptron, which takes as input the output results of the encoder, the prediction network, and the bias encoding network, and predicts the next word of the text sequence.
In a specific embodiment, the encoder composed of the first long-short-time memory network encodes the extracted acoustic feature sequence according to the following formula:
h_t = LSTM(h_{t-1}, x_t)
where LSTM is the unit function of a long-short-time memory network, h_t is the hidden representation corresponding to the acoustic feature sequence at time t, h_{t-1} is the hidden representation at time t-1, and x_t is the acoustic feature at time t.
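As an illustration only (not part of the patent), the encoder recurrence h_t = LSTM(h_{t-1}, x_t) can be sketched in pure Python. All weight names and dimensions below are hypothetical, and biases are omitted for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_prev, c_prev, x, W):
    # W maps the concatenated [h_{t-1}; x_t] to the four gates
    # (input, forget, output, candidate); biases omitted for brevity.
    concat = h_prev + x  # list concatenation
    def affine(rows):
        return [sum(w * v for w, v in zip(row, concat)) for row in rows]
    i = [sigmoid(v) for v in affine(W["i"])]
    f = [sigmoid(v) for v in affine(W["f"])]
    o = [sigmoid(v) for v in affine(W["o"])]
    g = [math.tanh(v) for v in affine(W["g"])]
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def encode(frames, W, hidden):
    # h_t = LSTM(h_{t-1}, x_t): run the cell over the acoustic frames
    # and collect the hidden representation at every time step.
    h = [0.0] * hidden
    c = [0.0] * hidden
    outputs = []
    for x in frames:
        h, c = lstm_step(h, c, x, W)
        outputs.append(h)
    return outputs
```

A production encoder would use an optimized LSTM implementation from a deep-learning framework rather than this didactic loop.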
In a specific embodiment, the prediction network composed of the second long-short-time memory network obtains the hidden representation of each word of the text sequence according to the following formula:
c_j = LSTM(c_{j-1}, y_j)
where LSTM is the unit function of a long-short-time memory network, c_{j-1} is the hidden representation corresponding to the word at position j-1, and y_j is the embedding vector of the word at position j.
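A minimal sketch (not from the patent) of the prediction network's input convention: the start symbol sos is fed first, then the embedding of one word per step. A plain tanh recurrence stands in for the LSTM cell here, and the embedding table and weights are invented for illustration:

```python
import math

def predict_step(c_prev, y_emb, W_h, W_y):
    # c_j = Cell(c_{j-1}, y_j): one recurrence step. A plain tanh cell is
    # used as a stand-in for the LSTM cell to keep the sketch short.
    return [math.tanh(sum(wh * cv for wh, cv in zip(row_h, c_prev)) +
                      sum(wy * yv for wy, yv in zip(row_y, y_emb)))
            for row_h, row_y in zip(W_h, W_y)]

def run_prediction_network(tokens, embeddings, W_h, W_y, hidden):
    # The first input is the start symbol <sos>; afterwards the embedding
    # of each already-predicted word is fed in turn.
    c = [0.0] * hidden
    states = []
    for tok in ["<sos>"] + tokens:
        c = predict_step(c, embeddings[tok], W_h, W_y)
        states.append(c)
    return states
```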
In a specific embodiment, the bias encoding network composed of the third long-short-time memory network obtains the hidden representations corresponding to the aerospace professional vocabulary sequence according to the following formula:
b_k = LSTM(b_{k-1}, z_k)
where LSTM is the unit function of a long-short-time memory network, b_{k-1} is the hidden representation corresponding to the word at position k-1 of the aerospace professional vocabulary sequence, and z_k is the embedding vector of the word at position k of the aerospace professional vocabulary sequence.
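The bias recurrence b_k = LSTM(b_{k-1}, z_k) can be sketched the same way; since the vocabulary list is fixed for a session, its hidden representations can be computed once and reused across utterances. Again a plain tanh cell stands in for the LSTM, and all names are hypothetical:

```python
import math

def encode_bias_vocab(vocab, embeddings, W_b, W_z, hidden):
    # b_k = Cell(b_{k-1}, z_k): encode the user-provided aerospace terms.
    # Every step's state b_k is kept so the fusion network can use the
    # vocabulary representation when predicting the next word.
    b = [0.0] * hidden
    states = []
    for term in vocab:
        z = embeddings[term]
        b = [math.tanh(sum(w * v for w, v in zip(row_b, b)) +
                       sum(w * v for w, v in zip(row_z, z)))
             for row_b, row_z in zip(W_b, W_z)]
        states.append(b)
    return states
```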
In a specific embodiment, the fusion network composed of the multi-layer perceptron fuses the output results of the three networks, namely the encoder composed of the first long-short-time memory network, the prediction network composed of the second long-short-time memory network, and the bias encoding network composed of the third long-short-time memory network, and predicts the next word of the text sequence according to the following formula:
P(y_{j+1}) = MLP([c_j, b_k, h_t])
where MLP is the function of the multi-layer perceptron, and c_j, b_k, and h_t are the hidden representations output by the prediction network, the bias encoding network, and the encoder, respectively.
In one specific embodiment, the MLP is composed of parameter matrices and a nonlinear function:
MLP(q) = W_2 max(W_1 q, 0)
where W_1 and W_2 are parameter matrices and q is the input vector.
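The fusion step MLP(q) = W_2 max(W_1 q, 0), followed by a softmax to turn the logits into the distribution P(y_{j+1}), can be sketched as follows; the dimensions and weights are hypothetical:

```python
import math

def mlp(q, W1, W2):
    # MLP(q) = W2 . max(W1 . q, 0): one hidden layer with ReLU.
    hidden = [max(sum(w * v for w, v in zip(row, q)), 0.0) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(c_j, b_k, h_t, W1, W2):
    # P(y_{j+1}) = softmax(MLP([c_j, b_k, h_t])): concatenate the three
    # hidden representations and map them to a distribution over words.
    return softmax(mlp(c_j + b_k + h_t, W1, W2))
```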
In a specific embodiment, in the recognition stage, the optimal text sequence is searched out with the Viterbi algorithm according to the following formula:
y* = argmax_y trans(x, z, y)
where trans denotes the whole speech recognition system model, argmax selects the word sequence with the maximum probability, x denotes the acoustic feature sequence, z denotes the aerospace professional vocabulary sequence, y ranges over all text sequences, and y* denotes the optimal text sequence.
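As a hedged illustration of the search objective y* = argmax_y trans(x, z, y): the sketch below uses greedy decoding (pick the best word at each step) in place of the Viterbi/beam search over the transducer lattice that the patent describes, and `step_probs_fn` is a stand-in for the trained model:

```python
def greedy_decode(step_probs_fn, vocab, max_len=10, eos="<eos>"):
    # Greedy stand-in for y* = argmax_y trans(x, z, y): at each step pick
    # the word with the highest fused probability. A real system would run
    # beam/Viterbi search over the transducer lattice instead.
    # step_probs_fn(prefix) is assumed to return P(next word | prefix).
    prefix = []
    for _ in range(max_len):
        probs = step_probs_fn(prefix)
        best = max(vocab, key=lambda w: probs[w])
        if best == eos:
            break
        prefix.append(best)
    return prefix
```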
In a specific embodiment, in response to the user providing an aerospace professional vocabulary sequence z, the system can recognize the corresponding aerospace professional vocabulary.
A second embodiment of the present invention provides a method of training the system of the first embodiment, using the following loss function:
L(θ) = -log Σ_a P(a | x, z; θ)
where the sum runs over the text sequences a obtained by inserting filling symbols into the labelled text sequence y, θ denotes the parameters of the whole neural network, x denotes the acoustic feature sequence, and z denotes the aerospace professional vocabulary sequence.
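To make the training objective concrete: the sketch below (not from the patent) computes the negative log-likelihood -log P(y | x, z) for a single alignment from per-step probabilities. The full transducer loss would additionally sum P(a | x, z; θ) over all alignment sequences a containing filling symbols, which is omitted here:

```python
import math

def sequence_nll(step_probs, target):
    # -log P(y | x, z): sum of per-step negative log-probabilities of the
    # labelled words, for one fixed alignment. The full loss marginalises
    # over all filler-symbol alignments of y.
    return -sum(math.log(p[w]) for p, w in zip(step_probs, target))
```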
A third embodiment of the present invention provides a method of speech recognition using a system trained by the method of the second embodiment, comprising:
inputting the acoustic feature sequence extracted by the signal-processing-based feature extractor into the trained system, and outputting the hidden representations corresponding to the acoustic feature sequence through the encoder composed of the first long-short-time memory network;
in the trained system, inputting the text-sequence start symbol sos into the prediction network composed of the second long-short-time memory network, outputting the hidden representation corresponding to the first word of the text sequence, then inputting the embedding vector of one word at a time and outputting the hidden representation corresponding to the predicted word;
in the trained system, inputting the aerospace professional vocabulary sequence into the bias encoding network composed of the third long-short-time memory network and, after encoding, outputting the corresponding hidden representations;
in the trained system, fusing, by the fusion network composed of the multi-layer perceptron, the output results of the encoder, the prediction network, and the bias encoding network to predict the next word of the text sequence.
A fourth embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to the second or third embodiment of the invention.
A fifth embodiment of the invention provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to the second or third embodiment of the invention when executing the program.
It should be understood that the foregoing examples of the present invention are provided merely for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention, and that various other changes and modifications may be made therein by one skilled in the art without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (11)

1. A speech recognition system for the aerospace profession field, comprising:
the encoder is composed of a first long-short-time memory network and is used for inputting the acoustic feature sequence extracted by the feature extractor based on signal processing, and outputting hidden representation corresponding to the acoustic feature sequence after encoding;
the prediction network formed by the second long-short-time memory network inputs a text sequence initial symbol sos, outputs hidden representation corresponding to a first word of the text sequence through the prediction network, inputs an embedded vector of a word every time, and outputs hidden representation corresponding to the predicted word after passing through the prediction network;
the bias coding network is composed of a third long-short-time memory network and is used for inputting a professional vocabulary sequence in the aerospace field and outputting hidden representations corresponding to the professional vocabulary sequence in the aerospace field after coding;
and inputting output results of the encoder formed by the first long-short-time memory network, the prediction network formed by the second long-short-time memory network and the offset coding network formed by the third long-time memory network by the fusion network formed by the multi-layer perceptron, and predicting the next word of the text sequence.
2. The system of claim 1, wherein the encoder composed of the first long-short-time memory network encodes the extracted acoustic feature sequence according to the following formula:
h_t = LSTM(h_{t-1}, x_t)
where LSTM is the unit function of a long-short-time memory network, h_t is the hidden representation corresponding to the acoustic feature sequence at time t, h_{t-1} is the hidden representation at time t-1, and x_t is the acoustic feature at time t.
3. The system of claim 1, wherein the prediction network composed of the second long-short-time memory network obtains the hidden representation of each word of the text sequence according to the following formula:
c_j = LSTM(c_{j-1}, y_j)
where LSTM is the unit function of a long-short-time memory network, c_{j-1} is the hidden representation corresponding to the word at position j-1, and y_j is the embedding vector of the word at position j.
4. The system of claim 1, wherein the bias encoding network obtains the hidden representations corresponding to the aerospace professional vocabulary sequence according to the following formula:
b_k = LSTM(b_{k-1}, z_k)
where LSTM is the unit function of a long-short-time memory network, b_{k-1} is the hidden representation corresponding to the word at position k-1 of the aerospace professional vocabulary sequence, and z_k is the embedding vector of the word at position k of the aerospace professional vocabulary sequence.
5. The system of claim 1, wherein the fusion network composed of the multi-layer perceptron fuses the output results of the three networks, namely the encoder composed of the first long-short-time memory network, the prediction network composed of the second long-short-time memory network, and the bias encoding network composed of the third long-short-time memory network, to predict the next word of the text sequence according to the following formula:
P(y_{j+1}) = MLP([c_j, b_k, h_t])
where MLP is the function of the multi-layer perceptron, h_t is the hidden representation corresponding to the acoustic feature sequence at time t extracted by the encoder, b_k is the hidden representation corresponding to the word at position k of the aerospace professional vocabulary sequence from the bias encoding network, and c_j is the hidden representation corresponding to the word at position j from the prediction network.
6. The system of claim 1, wherein, in the recognition stage, the optimal text sequence is searched out with the Viterbi algorithm according to the following formula:
y* = argmax_y trans(x, z, y)
where trans denotes the whole speech recognition system model, argmax selects the word sequence with the maximum probability, x denotes the acoustic feature sequence, z denotes the aerospace professional vocabulary sequence, y ranges over all text sequences, and y* denotes the optimal text sequence.
7. The system of claim 6, wherein, in response to a user providing an aerospace professional vocabulary sequence z, the system recognizes the corresponding aerospace professional vocabulary.
8. A method of training the system of any of claims 1-7, characterized by training with the following loss function:
L(θ) = -log Σ_a P(a | x, z; θ)
where the sum runs over the text sequences a obtained by inserting filling symbols into the labelled text sequence y, θ denotes the parameters of the whole neural network, x denotes the acoustic feature sequence, and z denotes the aerospace professional vocabulary sequence.
9. A method of speech recognition using a system trained by the method of claim 8, comprising:
inputting the acoustic feature sequence extracted by the signal-processing-based feature extractor into the trained system, and outputting the hidden representations corresponding to the acoustic feature sequence through the encoder composed of the first long-short-time memory network;
in the trained system, inputting the text-sequence start symbol sos into the prediction network composed of the second long-short-time memory network, outputting the hidden representation corresponding to the first word of the text sequence, then inputting the embedding vector of one word at a time and outputting the hidden representation corresponding to the predicted word;
in the trained system, inputting the aerospace professional vocabulary sequence into the bias encoding network composed of the third long-short-time memory network and, after encoding, outputting the corresponding hidden representations;
in the trained system, fusing, by the fusion network composed of the multi-layer perceptron, the output results of the encoder, the prediction network, and the bias encoding network to predict the next word of the text sequence.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method of claim 8 or 9.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of claim 8 or 9 when executing the program.
CN202011139217.2A 2020-10-22 2020-10-22 Speech recognition system and method oriented to aerospace professional field Active CN112466282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011139217.2A CN112466282B (en) 2020-10-22 2020-10-22 Speech recognition system and method oriented to aerospace professional field


Publications (2)

Publication Number Publication Date
CN112466282A CN112466282A (en) 2021-03-09
CN112466282B true CN112466282B (en) 2023-11-28

Family

ID=74834120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011139217.2A Active CN112466282B (en) 2020-10-22 2020-10-22 Speech recognition system and method oriented to aerospace professional field

Country Status (1)

Country Link
CN (1) CN112466282B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903358B (en) * 2021-10-15 2022-11-04 贝壳找房(北京)科技有限公司 Voice quality inspection method, readable storage medium and computer program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN107408111A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 End-to-end speech recognition
CN110738984A (en) * 2019-05-13 2020-01-31 苏州闪驰数控系统集成有限公司 Artificial intelligence CNN, LSTM neural network speech recognition system
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111785257A (en) * 2020-07-10 2020-10-16 四川大学 A method and device for air-traffic speech recognition with a small number of labeled samples

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10170114B2 (en) * 2013-05-30 2019-01-01 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding
EP3510594B1 (en) * 2016-10-10 2020-07-01 Google LLC Very deep convolutional neural networks for end-to-end speech recognition
CN111429889B (en) * 2019-01-08 2023-04-28 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Music Chord Recognition Based on Midi-Trained Deep Feature and BLSTM-CRF Hybrid Decoding"; Yiming Wu; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); full text *
"Research on Deep Network Models for Air-Ground Speech Recognition"; Qiu Yi; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN112466282A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN113987169B (en) Methods, apparatus, devices, and storage media for semantic block-based text summarization generation
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
JP7346788B2 (en) Speech recognition model training methods, devices, equipment, and storage media
CN115309877B (en) Dialogue generation method, dialogue model training method and device
CN110196967B (en) Sequence labeling method and device based on deep conversion architecture
CN110427625B (en) Sentence completion method, apparatus, medium, and dialogue processing system
KR20180001889A (en) Language processing method and apparatus
JP2016110082A (en) Language model training method and apparatus, and speech recognition method and apparatus
CN113655893B (en) A word and sentence generation method, model training method and related equipment
CN112037773B (en) An N-optimal spoken language semantic recognition method, device and electronic device
CN117708692A (en) Entity emotion analysis method and system based on double-channel graph convolution neural network
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN113160820B (en) Speech recognition method, training method, device and equipment of speech recognition model
CN109979461B (en) Voice translation method and device
JP7765622B2 (en) Fusion of acoustic and textual representations in an automatic speech recognition system implemented as an RNN-T
CN115238048A (en) A Fast Interactive Approach to Joint Intent Recognition and Slot Filling
CN114912441A (en) Text error correction model generation method, error correction method, system, device and medium
CN109933773A (en) A multiple-semantics sentence analysis system and method
CN115101050A (en) Speech recognition model training method and device, speech recognition method and medium
KR20210058765A (en) Speech recognition method, device, electronic device and storage media
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN120239884A (en) Semi-supervised training scheme for speech recognition
CN116306612A (en) A method for generating words and sentences and related equipment
CN112257432A (en) Adaptive intent recognition method, device and electronic device
CN112466282B (en) Speech recognition system and method oriented to aerospace professional field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant