Real-Time Sound Source Localization for Low-Power IoT Devices Based on Multi-Stream CNN
<p>(<b>a</b>) Limitation of the camera vision-based method is that it cannot track if the user deviates from the field of view of the camera; (<b>b</b>) The advantage of the microphone array-based sound localization method is that it can track the user by estimating the direction through sound regardless of any direction.</p> "> Figure 2
<p>System overview of real-time SSL system using IoT devices.</p> "> Figure 3
<p>(<b>a</b>) A 2D schematic map of the microphone array and sound sources. The blue dots indicate sound sources, and the colored lines are extension lines between the center of the microphone array and each microphone; (<b>b</b>) Regions were divided according to the locations of the microphones.</p> "> Figure 4
<p>(<b>a</b>) Block diagram of the model architecture, (<b>b</b>) conventional MS block, (<b>c</b>) efficient MS block, and (<b>d</b>) aggregation gate. The colored arrows indicate the stride of convolution, the subsequent batch normalization (BN) [<a href="#B17-sensors-22-04650" class="html-bibr">17</a>], and ReLU [<a href="#B18-sensors-22-04650" class="html-bibr">18</a>].</p> "> Figure 5
<p>Comparison between (<b>a</b>) standard 1D Convolution and (<b>b</b>) depth-wise separable 1D convolution.</p> "> Figure 6
<p>Comparison between (<b>a</b>) ACC and (<b>b</b>) DOA error in test data according to the type of block composing the SSL model. Here, “O” indicates efficient MS blocks, and “X” indicates conventional MS blocks.</p> "> Figure 7
<p>Comparison between the inference times using a desktop CPU according to the type of block composing the SSL model and the framework used for inference.</p> ">
Abstract
:1. Introduction
2. Materials and Methods
2.1. Overall System Architecture
2.2. Dataset
2.3. Classification and Regression together
2.4. Conventional Multi-Stream (MS) Block
2.5. Efficient Multi-Stream (MS) Block
2.6. Model Architecture
2.7. Performance Evaluation
3. Results
3.1. Model Performance Measurements
3.2. Model Inference Time Measurements
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
SSL | Sound Source Localization |
IoT | Internet of Things |
DNN | Deep Neural Network |
DOA | Direction of Arrival |
SPS | Spatial Pseudo-Spectrum |
CNN | Convolutional Neural Network |
MS | Multi-Stream |
AG | Aggregation Gate |
RIR | Room Impulse Response |
SNR | Signal-to-Noise Ratio |
MFCC | Mel-Frequency Cepstral Coefficient |
AoA | Angle of Arrival |
ACC | Accuracy |
ReLU | Rectified Linear Unit |
BN | Batch Normalization |
References
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- DiBiase, J. A High-Accuracy, Low-Latency Technique for Talker Localization in Reverberant Environments Using Microphone Arrays. Ph.D. Thesis, Brown University, Providence, RI, USA, 2000. [Google Scholar]
- Schmidt, R. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280. [Google Scholar] [CrossRef] [Green Version]
- Yalta, N.; Nakadai, K.; Ogata, T. Sound source localization using deep learning models. J. Robot. Mechatron. 2017, 29, 37–48. [Google Scholar] [CrossRef]
- Adavanne, S.; Politis, A.; Virtanen, T. Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. arXiv 2017, arXiv:1710.10059. [Google Scholar]
- Vera-Diaz, J.M.; Pizarro, D.; Macias-Guarasa, J. Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates. Sensors 2018, 18, 3418. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Sundar, H.; Wang, W.; Sun, M.; Wang, C. Raw Waveform Based End-to-end Deep Convolutional Network for Spatial Localization of Multiple Acoustic Sources. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4642–4646. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Kim, T.; Lee, J.; Nam, J. Comparison and Analysis of SampleCNN Architectures for Audio Classification. IEEE J. Sel. Top. Signal Process. 2019, 13, 285–297. [Google Scholar] [CrossRef]
- Lopez-Ballester, J.; Alcaraz Calero, J.M.; Segura-Garcia, J.; Felici-Castell, S.; Garcia-Pineda, M.; Cobos, M. Speech Intelligibility Analysis and Approximation to Room Parameters through the Internet of Things. Appl. Sci. 2021, 11, 1430. [Google Scholar] [CrossRef]
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Tech. Rep. 1993, 93, 27403. [Google Scholar]
- Allen, J.B.; Berkley, D.A. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950. [Google Scholar] [CrossRef]
- Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with sincnet. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 1021–1028. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.Q.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 1–9. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML, Lille, France, 6–11 July 2015; International Machine Learning Society (IMLS): Washington, DC, USA, 2015; Volume 1, pp. 448–456. [Google Scholar]
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
- Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 28 October 2019; pp. 3702–3712. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications; Technical Report; Google Inc.: Menlo Park, CA, USA, 2017. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
- Chen, T.; Moreau, T.; Jiang, Z.; Zheng, L.; Yan, E.; Cowan, M.; Shen, H.; Wang, L.; Hu, Y.; Ceze, L.; et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, Carlsbad, CA, USA, 8–10 October 2018; pp. 579–594. [Google Scholar]
- Gibson, P.; Cano, J.; Turner, J.; Crowley, E.J.; O’Boyle, M.; Storkey, A. Optimizing Grouped Convolutions on Edge Devices. 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP); IEEE Computer Society: Los Alamitos, CA, USA, 2020; pp. 189–196. [Google Scholar] [CrossRef]
Item | Value |
---|---|
Number of speakers | 220 |
Sampling rate | 16 kHz |
Room size | (m) |
Reverberation time | 0.4 s |
Distance | 0.1 m–2.12 m |
Angle | 0–360° |
Input duration | 40 ms |
Total number of samples | 90,767 |
Type of Module | Input | Output |
---|---|---|
Conv block | (640, 8) | (320, 32) |
Conventional MS block | (320, 32) | (160, 48) |
Conventional MS block | (160, 48) | (80, 96) |
Efficient MS block | (80, 96) | (40, 144) |
Global max-pooling | (40, 144) | (1, 144) |
Dense | (1, 144) | (1, 12) |
Hyperparameter | Value |
---|---|
Total number of epochs | 30 |
Batch size | 256 |
Optimizer | Adam |
Learning rate (LR) | 0.01 |
Row Number | Training SNR (dB) | Test SNR | |||||||
---|---|---|---|---|---|---|---|---|---|
30 dB | 20 dB | 10 dB | 0 dB | ||||||
ACC (%) | DOA (°) | ACC (%) | DOA (°) | ACC (%) | DOA (°) | ACC (%) | DOA (°) | ||
1 | 30, 20, 10, 0 | 96.01 | 4.6436 | 95.95 | 4.7519 | 95.63 | 4.9653 | 88 | 9.0297 |
2 | 30 | 96.53 | 3.9954 | 96.52 | 4.0413 | 92.32 | 6.5589 | 33.58 | 58.1142 |
3 | 20 | 96.67 | 3.8748 | 96.62 | 3.9359 | 93.19 | 5.96 | 33.61 | 59.6756 |
4 | 10 | 96.41 | 4.0014 | 96.39 | 4.0832 | 95.79 | 4.5159 | 55.28 | 36.6573 |
5 | 0 | 96.07 | 4.6869 | 96.09 | 4.6663 | 95.72 | 4.9086 | 91.41 | 7.4289 |
6 | X | 96.81 | 3.9654 | 96.68 | 4.0048 | 92.41 | 6.3149 | 32.99 | 57.9033 |
Types of Blocks (Kernel Sizes) | Test SNR | Computation | |||||||
---|---|---|---|---|---|---|---|---|---|
30 dB | 20 dB | 10 dB | 0 dB | ||||||
ACC (%) | DOA (°) | ACC (%) | DOA (°) | ACC (%) | DOA (°) | ACC (%) | DOA (°) | ||
XXO (3, 5, 7) | 95.91 | 4.7539 | 95.85 | 4.7132 | 95.52 | 4.9142 | 90.74 | 7.6871 | 3,258,868 |
XXO (7, 9, 13) | 96.07 | 4.6869 | 96.09 | 4.6663 | 95.72 | 4.9086 | 91.41 | 7.4289 | 3,285,748 |
Item | Batch Size | |
---|---|---|
1 | 16 | |
Inference time (ms) using Pytorch | 426.93 | 7.677 |
Inference time (ms) using TVM | 7.811 | 5.772 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ko, J.; Kim, H.; Kim, J. Real-Time Sound Source Localization for Low-Power IoT Devices Based on Multi-Stream CNN. Sensors 2022, 22, 4650. https://doi.org/10.3390/s22124650
Ko J, Kim H, Kim J. Real-Time Sound Source Localization for Low-Power IoT Devices Based on Multi-Stream CNN. Sensors. 2022; 22(12):4650. https://doi.org/10.3390/s22124650
Chicago/Turabian StyleKo, Jungbeom, Hyunchul Kim, and Jungsuk Kim. 2022. "Real-Time Sound Source Localization for Low-Power IoT Devices Based on Multi-Stream CNN" Sensors 22, no. 12: 4650. https://doi.org/10.3390/s22124650
APA StyleKo, J., Kim, H., & Kim, J. (2022). Real-Time Sound Source Localization for Low-Power IoT Devices Based on Multi-Stream CNN. Sensors, 22(12), 4650. https://doi.org/10.3390/s22124650