Real-time Controlling Dynamics Sensing in Air Traffic System
Figure 1. Structure of the proposed framework. ATC speech: the communication speech between air traffic controllers (ATCOs) and pilots; AM: acoustic model; PM: pronunciation model; PLM: phoneme-based language model; WLM: word-based language model; CIU: controlling instruction understanding; CI: controlling intent; CP: controlling parameter.
Figure 2. (a) Network of the AM; (b) network of the PM. FC: fully connected; CTC: Connectionist Temporal Classification; BLSTM: bidirectional long short-term memory.
Figure 3. Architecture of the CIU model.
Figure 4. (a) Spectrogram of an ATC speech segment; (b) spectrogram of another ATC speech segment collected from the same transmission channel as (a).
Figure 5. The core block of the CIU model.
Abstract
1. Introduction
- (a) Complex background noise: the ATC speech is transmitted over a radiotelephone system, which introduces unexpected noise, and the ambient noise of the ATCOs' office also affects the intelligibility of the speech [13]. Different application scenes make the noise even more diverse and complicated, and therefore hard to filter out. Taking the open THCHS30 (Mandarin Chinese) [14] and TED-LIUM (English) [15] corpora as references, we use PESQ [16] to evaluate the quality of the training samples used in this work. Given a reference speech (THCHS30 and TED-LIUM in this work), PESQ scores the speech quality between 1 (poor) and 4.5 (good), with 3.8 regarded as the acceptable score for telephone voice. The measurements for the Chinese and English speech used in this work are 3.359 and 3.441, respectively.
- (b) High speech rate: the speech rate in ATC is higher than that of speech in daily life, since ATC requires high timeliness.
- (c) Code switching: to eliminate misunderstandings of ATC speech, the International Civil Aviation Organization (ICAO) has published rules and criteria that regulate the pronunciation of homophone words [17]. For example, 'a' is switched to 'alpha'. This code switching makes ATC speech more like a dialect that is only used in ATC.
- (d) Multilingual: in general, the ATC speech of international flights is in English, while pilots of domestic flights usually use the local language to communicate with ATCOs. For instance, Mandarin Chinese is widely used for ATC communication in China. More specifically, the Civil Aviation Administration of China (CAAC) published ATC procedures and pronunciation rules for China on the basis of the ICAO standards. Under these regulations, Chinese characters and English words usually appear within one sentence of ATC speech, which raises special problems beyond universal ASR.
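The code switching described in (c) can be illustrated with a small normalizer. The mapping below follows the ICAO spelling alphabet and the standard ATC digit pronunciations ("tree" for 3, "fife" for 5, "niner" for 9); the exact vocabulary used in this work is not given here, so the function and its name are illustrative only.

```python
# Illustrative ICAO code-switching map: single letters become spelling-alphabet
# words, and certain digits take their ATC pronunciations. Other tokens pass
# through unchanged. This is a sketch, not the authors' preprocessing code.

ICAO_ALPHABET = {
    "a": "alpha", "b": "bravo", "c": "charlie", "d": "delta", "e": "echo",
    "f": "foxtrot", "g": "golf", "h": "hotel", "i": "india", "j": "juliett",
    "k": "kilo", "l": "lima", "m": "mike", "n": "november", "o": "oscar",
    "p": "papa", "q": "quebec", "r": "romeo", "s": "sierra", "t": "tango",
    "u": "uniform", "v": "victor", "w": "whiskey", "x": "xray",
    "y": "yankee", "z": "zulu",
}
ICAO_DIGITS = {"3": "tree", "5": "fife", "9": "niner"}

def to_atc_phraseology(token: str) -> str:
    """Map a single letter or digit to its ATC pronunciation."""
    t = token.lower()
    return ICAO_ALPHABET.get(t, ICAO_DIGITS.get(t, token))
```

Under this mapping, a call sign such as "CA5239" would be read out as "charlie alpha fife two tree niner", which is why ATC speech behaves like a dialect of its own.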
- (a) A framework is proposed to obtain real-time controlling dynamics in the air traffic system, which further supports ATC-related applications. An ASR- and CIU-based pipeline is designed to extract ATC-related elements with deep learning models.
- (b) A three-step architecture, comprising the AM, PM, and LM, is proposed to handle speech recognition in air traffic control. The proposed 'speech-phoneme labels-word labels' pipeline unifies multilingual ASR (Mandarin Chinese and English in this work) into one model. By fixing the output of the AM to the basic phoneme vocabulary, the reusability of the proposed ASR model is greatly improved: without considering dialects or other command-word differences, only a new PM is needed to expand the word vocabulary.
- (c) Conv2D and APL are applied to cope with the complex background noise and high speech rate of ATC speech. An encoder-decoder architecture is proposed for the PM to obtain the word sequence of the ATC speech.
- (d) A BLSTM-based CIU joint model is proposed to perform controlling intent detection and controlling parameter labelling, in which the two tasks enhance each other's performance based on ATC regulations.
- (e) Based on flight information, a correction procedure is proposed to revise minor mistakes in given sections of the ASR results and further improve the performance of the proposed framework.
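The cascaded pipeline in contributions (a) and (b) can be summarized in a few lines. The sketch below is hypothetical: each stage is passed in as a callable, and none of the model internals are reproduced; it only shows the order in which the sub-models are chained.

```python
# Hypothetical sketch of the 'speech - phoneme labels - word labels' pipeline:
# the AM decodes the spectrogram into a phoneme sequence, the PLM corrects it,
# the PM translates phonemes into words, the WLM corrects the word sequence,
# and the CIU model extracts the controlling intent and parameters.

def sense_controlling_dynamics(spectrogram, am, plm, pm, wlm, ciu):
    phonemes = plm(am(spectrogram))  # AM decoding + phoneme-based LM correction
    words = wlm(pm(phonemes))        # phoneme-to-word translation + word-based LM correction
    return ciu(words)                # controlling intent detection + parameter labelling
```

Because the AM output is fixed to the basic phoneme vocabulary, only the `pm` stage needs replacing to expand the word vocabulary, which is the reusability argument of contribution (b).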
2. The Proposed Framework
2.1. Architecture of ASR
2.2. Architecture of CIU
3. Methods
3.1. Multilingual ATC
3.2. AM in ASR
3.3. PM and LM in ASR
3.4. CIU
- (a) Extract the airline, the flight call sign, and the name of the controlling unit from the flight plan and ADS-B for each flight, and generate a flight pool. Note that only flights in the sector of the corresponding ATCO are considered in this step.
- (b) Extract the sections to be corrected from the result of the CIU model.
- (c) Compare each section from the CIU model with the corresponding information in the flight pool and select the most similar entry as the corrected result. In addition, a similarity measure between the decoded result and the corrected result is calculated to help ATCOs decide whether the correction should be accepted.
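Step (c) can be sketched with a generic string-similarity measure. The paper does not specify which similarity is used, so the `difflib` ratio below is an assumption, and the function name is illustrative.

```python
# Sketch of the flight-pool correction step: a decoded section is compared
# against every candidate in the flight pool, and the most similar entry is
# returned together with its similarity score, so the ATCO can decide whether
# to accept the correction. The similarity measure (difflib ratio) is an
# assumption; the paper's own measure may differ.

from difflib import SequenceMatcher

def correct_section(decoded: str, flight_pool: list[str]) -> tuple[str, float]:
    best, best_sim = decoded, 0.0
    for candidate in flight_pool:
        sim = SequenceMatcher(None, decoded, candidate).ratio()
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best, best_sim
```

For example, a call sign decoded with one garbled word is snapped to the closest call sign among the flights currently in the ATCO's sector, and the returned score tells the ATCO how confident the correction is.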
3.5. ATC Related
4. Results and Discussions
4.1. Data Description
4.2. Experimental Settings
- (a) WER: the word error rate between the predicted and true label sequences, a common measurement for ASR applications [34], computed as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the true label sequence.
- (b) Classification precision and the F1 score are applied to evaluate the performance of the CID and CPL tasks, respectively [35].
- (c) RTF: the real-time factor is applied to evaluate the decoding efficiency of the proposed framework. RTF is calculated as in (10), i.e., RTF = t_dec / t_dur, in which t_dec and t_dur are the decoding time and the duration of the ATC speech in seconds, respectively.
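The two formula-based metrics above can be implemented directly. The sketch below uses the standard word-level Levenshtein distance for WER and the plain ratio for RTF; the function names are illustrative.

```python
# Minimal implementations of the evaluation metrics: WER is the word-level
# edit distance divided by the reference length; RTF divides decoding time
# by the duration of the speech.

def wer(reference: list[str], hypothesis: list[str]) -> float:
    r, h = reference, hypothesis
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

def rtf(decoding_time_s: float, speech_duration_s: float) -> float:
    """Real-time factor: values below 1.0 mean faster-than-real-time decoding."""
    return decoding_time_s / speech_duration_s
```

For instance, one substitution among three reference words yields a WER of 1/3, and decoding a 10 s utterance in 1.47 s yields an RTF of 0.147.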
4.3. Test on ASR
- (a) The AM achieves high ASR performance from speech to phoneme-based label sequences, i.e., a 6.63% phoneme-based WER.
- (b) Taking both the AM and PLM results as input, the PM introduces additional errors, about 0.4% word-based WER, since the proposal is a cascaded pipeline.
- (c) Both the PLM and WLM prove useful for improving the overall ASR performance. The improvement obtained by the PLM is more significant than that of the WLM, since the phoneme label is a finer-grained representation of the ATC speech than the word label.
- (d) The proposed ASR model translates ATC speech into word-based label sequences with a 4.04% WER. Furthermore, each sub-model proves indispensable for achieving higher ASR performance in the proposed approach.
4.4. Test on CIU
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ADS-B | automatic dependent surveillance-broadcast |
| AM | acoustic model |
| APL | average pooling layer |
| ASR | automatic speech recognition |
| ATC | air traffic control |
| ATCOs | air traffic controllers |
| ATCSs | air traffic control systems |
| BLSTM | bidirectional long short-term memory |
| CI | controlling intent |
| CID | controlling intent detection |
| CIU | controlling instruction understanding |
| CNN | convolutional neural network |
| Conv2D | two-dimensional convolutional operation |
| CP | controlling parameters |
| CPL | controlling parameter labelling |
| CTC | Connectionist Temporal Classification |
| DNN | deep neural network |
| FC | fully connected |
| HMM/GMM | Hidden Markov Model/Gaussian Mixture Model |
| LM | language model |
| MFCCs | Mel Frequency Cepstral Coefficients |
| MLP | Multilayer Perceptron |
| PLM, WLM | phoneme-based LM, word-based LM |
| PM | pronunciation model |
| RNN | recurrent neural network |
| RTF | real-time factor |
References
1. Bergner, J.; Hassa, O. Air Traffic Control. In Information Ergonomics; Springer: Berlin/Heidelberg, Germany, 2012; pp. 197–225.
2. Skaltsas, G.; Rakas, J.; Karlaftis, M.G. An analysis of air traffic controller-pilot miscommunication in the NextGen environment. J. Air Transp. Manag. 2013, 27, 46–51.
3. Helmke, H.; Ohneiser, O.; Muhlhausen, T.; Wies, M. Reducing controller workload with automatic speech recognition. In Proceedings of the 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, USA, 25–29 September 2016; pp. 1–10.
4. Jagan Mohan, B.; Ramesh Babu, N. Speech recognition using MFCC and DTW. In Proceedings of the 2014 International Conference on Advances in Electrical Engineering (ICAEE), Vellore, India, 9–11 January 2014; pp. 1–4.
5. Gales, M.; Young, S. The Application of Hidden Markov Models in Speech Recognition. Found. Trends Signal Process. 2007, 1, 195–304.
6. Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 30–42.
7. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
8. Hannun, A.Y.; Maas, A.L.; Jurafsky, D.; Ng, A.Y. First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs. arXiv 2014, arXiv:1408.2873.
9. Sainath, T.N.; Vinyals, O.; Senior, A.; Sak, H. Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 19–24 April 2015; pp. 4580–4584.
10. Deng, L.; Abdel-Hamid, O.; Yu, D. A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6669–6673.
11. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Qiang, C.; Chen, G. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Int. Conf. Mach. Learn. 2016, 48, 173–182.
12. Zhang, Y.; Chan, W.; Jaitly, N. Very deep convolutional networks for end-to-end speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4845–4849.
13. Nguyen, V.N.; Holone, H. Possibilities, Challenges and the State of the Art of Automatic Speech Recognition in Air Traffic Control. Int. J. Comput. Electr. Autom. Control Inf. Eng. 2015, 9, 1916–1925.
14. Wang, D.; Zhang, X. THCHS-30: A Free Chinese Speech Corpus. arXiv 2015, arXiv:1512.01882.
15. Open Speech and Language Resources. Available online: http://www.openslr.org/7/ (accessed on 2 February 2019).
16. Beerends, J.G.; Hekstra, A.P.; Rix, A.W.; Hollier, M.P. Perceptual evaluation of speech quality (PESQ)-The new ITU standard for objective measurement of perceived speech quality, Part II-Psychoacoustic model. J. Audio Eng. Soc. 2002, 50, 765–778.
17. ICAO. Manual on the Implementation of ICAO Language Proficiency Requirements; International Civil Aviation Organization: Montréal, QC, Canada, 2010.
18. Kopald, H.D.; Chanen, A.; Chen, S.; Smith, E.C.; Tarakan, R.M. Applying automatic speech recognition technology to Air Traffic Management. In Proceedings of the 2013 IEEE/AIAA 32nd Digital Avionics Systems Conference (DASC), East Syracuse, NY, USA, 5–10 October 2013.
19. Ferreiros, J.; Pardo, J.M.; Córdoba, R.d.; Macias-Guarasa, J.; Montero, J.M.; Fernández, F.; Sama, V.; d'Haro, L.F.; González, G. A speech interface for air traffic control terminals. Aerosp. Sci. Technol. 2012, 21, 7–15.
20. Srinivasamurthy, A.; Motlicek, P.; Himawan, I.; Szaszák, G.; Oualil, Y.; Helmke, H. Semi-Supervised Learning with Semantic Knowledge Extraction for Improved Speech Recognition in Air Traffic Control. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 2406–2410.
21. Cordoba, R.d.; Ferreiros, J.; San-Segundo, R.; Macias-Guarasa, J.; Montero, J.M.; Fernandez, F.; D'Haro, L.F.; Pardo, J.M. Air traffic control speech recognition system cross-task and speaker adaptation. IEEE Aerosp. Electron. Syst. Mag. 2006, 21, 12–17.
22. Johnson, D.R.; Nenov, V.I.; Espinoza, G. Automatic Speech Semantic Recognition and verification in Air Traffic Control. In Proceedings of the 2013 IEEE/AIAA 32nd Digital Avionics Systems Conference (DASC), East Syracuse, NY, USA, 5–10 October 2013; pp. 5B5-1–5B5-14.
23. Pellegrini, T.; Farinas, J.; Delpech, E.; Lancelot, F. The Airbus Air Traffic Control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection. arXiv 2018, arXiv:1810.12614.
24. Biadsy, F. Automatic Dialect and Accent Recognition and its Application to Speech Recognition. Ph.D. Thesis, Columbia University, New York, NY, USA, 2011.
25. Haffner, P.; Tur, G.; Wright, J.H. Optimizing SVMs for complex call classification. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, China, 6–10 April 2003; pp. I-632–I-635.
26. Yao, K.; Peng, B.; Zweig, G.; Yu, D.; Li, X.; Gao, F. Recurrent conditional random field for language understanding. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4077–4081.
27. Bonnisseau, J.-M.; Lachiri, O. On the objective of firms under uncertainty with stock markets. J. Math. Econ. 2004, 40, 493–513.
28. Yao, K.; Peng, B.; Zhang, Y.; Yu, D.; Zweig, G.; Shi, Y. Spoken language understanding using long short-term memory neural networks. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, NV, USA, 7–10 December 2014.
29. Xu, P.; Sarikaya, R. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; pp. 78–83.
30. Guo, D.; Tur, G.; Yih, W.; Zweig, G. Joint semantic utterance classification and slot filling with recursive neural networks. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, NV, USA, 7–10 December 2014; pp. 554–559.
31. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
32. Chen, K.; Yan, Z.-J.; Huo, Q. A context-sensitive-chunk BPTT approach to training deep LSTM/BLSTM recurrent neural networks for offline handwriting recognition. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 411–415.
33. Grinter, R.E. Recomposition: Coordinating a Web of Software Dependencies. Comput. Support. Coop. Work 2003, 12, 297–327.
34. Graves, A.; Jaitly, N. Towards End-To-End Speech Recognition with Recurrent Neural Networks. JMLR Workshop Conf. Proc. 2014, 32, 1764–1772.
35. Liu, B.; Lane, I. Joint Online Spoken Language Understanding and Language Modeling With Recurrent Neural Networks. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Los Angeles, CA, USA, 13–15 September 2016; pp. 22–30.
36. Kaldi. Available online: http://kaldi-asr.org/ (accessed on 2 February 2019).
| Role | Speech Text |
|---|---|
| ATCO | China eastern fife two tree four, climb to six tousand meters |
| Pilot | climb to six thousand meters, China eastern fife two tree four |
| Label Type | English Sample | Chinese Sample |
|---|---|---|
| word | russian sky niner eight eight tree chengdu radar contact | Echo echo 八 november Charlie alpha 两 前 等 国 航 四 四 五 两 |
| phoneme | r ah1 sh ah0 s k ay1 n ay n er ey1 t ey1 t th r iy1 ch eng2 d u1 r ey1 d aa2 r k aa1 t ae2 k t | eh1 k ow0 eh1 k ow0 b a1 n ow0 v eh1 m b er0 ch aa1 r l iy0 aw1 l f ah0 l iang3 q ian2 d eng3 g uo2 h ang2 s iy4 s iy4 uu u3 l iang3 |
| CIU label | B-AL I-AL B-CS I-CS I-CS I-CS B-ATCN B-RADAR I-RADAR | B-TAXI I-TAXI I-TAXI B-RW I-RW I-RW I-RW B-HOLD I-HOLD B-AL I-AL B-CS I-CS I-CS I-CS |
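The CIU labels in the table follow the usual BIO scheme (B- marks the beginning of a controlling parameter, I- its continuation). A small decoder turns such a tag sequence into (parameter type, token span) pairs; the function name below is illustrative, the tag names come from the table.

```python
# Sketch of decoding BIO-style CIU labels (e.g. B-AL, I-AL, B-CS, ...) into
# (parameter_type, tokens) spans. A B- tag opens a new span; matching I- tags
# extend it; anything else closes the current span.

def bio_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, list[str]]]:
    spans: list[tuple[str, list[str]]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans
```

Applied to the English sample above, this yields the airline span "russian sky", the call sign "niner eight eight tree", the ATC unit name "chengdu", and the radar phrase "radar contact".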
| Model | Input | Output |
|---|---|---|
| AM | Spectrogram | Phoneme sequence |
| PLM | Phoneme sequence | Phoneme unit |
| PM | Phoneme sequence | Word sequence |
| WLM | Word sequence | Word unit |
| CIU | Word sequence | Controlling intent and parameters |
| Model Combination | AM WER (%) | PLM WER (%) | PM WER (%) | WLM WER (%) |
|---|---|---|---|---|
| AM/PM | 6.63 | - | 7.01 | - |
| AM/PM/WLM | 6.63 | - | 7.01 | 6.52 |
| AM/PLM/PM | 6.63 | 4.12 | 4.50 | - |
| AM/PLM/PM/WLM | 6.63 | 4.12 | 4.50 | 4.04 |
| Methods | LM | Modeling Unit | WER (%) | RTF |
|---|---|---|---|---|
| HMM/GMM | 3-gram | Phoneme/word | 9.14 | - |
| DS2-based end-to-end | RNN-based word LM | Word | 6.31 | 0.139 |
| Our proposal | PLM and WLM | Phoneme/word | 4.04 | 0.147 |
| Methods | Classification Precision (%) | F1 Score (%) |
|---|---|---|
| Our proposal | 99.45 | 97.71 |
| Independent CID model | 98.87 | - |
| Independent CPL model | - | 95.58 |
| CNN/CRF | 97.93 | 96.31 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite

Lin, Y.; Tan, X.; Yang, B.; Yang, K.; Zhang, J.; Yu, J. Real-time Controlling Dynamics Sensing in Air Traffic System. Sensors 2019, 19, 679. https://doi.org/10.3390/s19030679