A CNN Sound Classification Mechanism Using Data Augmentation
Figure 1. Spectrogram of a sound signal.
Figure 2. The sound classification model using CNN.
Figure 3. (a) The accuracy of the sound classification model using the ESC-50 original dataset. (b) The accuracy of the sound classification model using the ESC-50 dataset, amplifying the amount of data by 100% (K = 2).
Figure 4. Sound classification accuracy with data augmentation in the ESC-50 dataset.
Figure 5. Sound classification precision with data augmentation in the ESC-50 dataset.
Figure 6. Sound classification recall with data augmentation in the ESC-50 dataset.
Figure 7. Sound classification F1 score with data augmentation in the ESC-50 dataset.
Figure 8. (a) The accuracy of the sound classification model using the UrbanSound8K original dataset. (b) The accuracy of the sound classification model using the UrbanSound8K dataset, amplifying the amount of data by 100% (K = 2).
Figure 9. The sound classification results using this new training set.
Abstract
1. Introduction
- Speech recognition [1,2]: Sound classification plays a crucial role in speech recognition systems. By accurately classifying and identifying speech sounds, these systems can convert spoken words or phrases into written text. This technology is utilized in voice assistants, transcription services, call center automation, and language learning applications. Accurate sound classification enables more precise and efficient speech recognition, leading to better user experiences and increased productivity.
- Music analysis and recommendation [3,4]: Sound classification allows for the analysis and categorization of music based on various features, such as genre, tempo, mood, and instrumentation. This classification enables personalized music recommendations, playlist generation, music organization, and automatic tagging. Music-streaming platforms and digital music libraries rely on accurate sound classification to provide personalized recommendations to users, enhancing their music listening experiences.
- Environmental sound monitoring [5,6]: Sound classification can be used to monitor and classify environmental sounds. This is useful in applications such as wildlife monitoring, noise pollution assessment, acoustic event detection, and surveillance systems. By automatically classifying sounds such as animal calls, vehicle sounds, alarms, or gunshots, sound classification aids in detecting anomalies, identifying specific events, and alerting authorities or users to potential threats or disturbances.
- Anomaly detection and security [7,8]: Sound classification can be used to identify abnormal or anomalous sounds in various contexts, including industrial settings, security systems, and healthcare environments. By training models to recognize normal sound patterns, deviations or unexpected sounds can be classified as anomalies. This technology helps to detect equipment failures, security breaches, and medical emergencies, allowing for timely interventions and preventive measures.
- Cost and resource constraints: Collecting and labeling sound data can be a time-consuming and resource-intensive process, which requires expertise, equipment, and human effort to capture, process, and accurately annotate sound signals. The cost and logistics associated with data collection and annotation can be significant, especially when aiming for large-scale and diverse datasets. Limited financial resources and access to specialized equipment or personnel can pose challenges in acquiring an adequate amount of labeled sound data.
- Data imbalance: Imbalanced class distributions within sound datasets can represent another obstacle. Certain sound classes may have an abundance of available data, while others are underrepresented. This data imbalance can negatively impact the model’s performance, as it may struggle to generalize well for minority classes with limited examples. Acquiring a balanced dataset with sufficient instances for each class becomes a challenge, leading to potential biases and a reduced classification accuracy for certain categories.
- Data annotation challenges: Accurate labeling of sound data is a complex task that often requires human expertise and domain knowledge. Annotating sound signals with the correct class labels or semantic information can be subjective and prone to errors. The process may involve multiple annotators, leading to variations in the annotations and potential inconsistencies. The scarcity of well-annotated sound data further hinders the acquisition of enough labeled samples for classification.
2. Related Works
3. Sound Classification Mechanism
3.1. Data Preprocessing
- Frame blocking: A sound signal changes continuously. To simplify the analysis, the signal is assumed to be stationary over a short time scale, so it is grouped into units of multiple sampling points (N), each called an “audio frame”. An audio frame is typically 20~40 ms long. If the frame is shorter, there are not enough sampling points in each audio frame for a reliable spectrum calculation; if it is too long, the signal changes too much within each audio frame. In addition, to avoid excessive changes between two adjacent audio frames, an overlapping region is allowed between them, and this overlapping region contains about one-half to one-third of the sampling points (M) in the audio frame.
- Pre-emphasis: To highlight the high-frequency formants, the sound signal first passes through a high-pass filter.
- Hamming window: Suppose that the audio-framed signal is S(n). To increase the continuity between the sound frames, the divided audio frames are multiplied by a Hamming window to avoid signal discontinuity in the subsequent Fourier transform. The Hamming window is shown in Formula (2), and the result of multiplying the sound frame by the Hamming window is shown in Formula (3):
- Signal transformation: A sound signal varies continuously in the time domain, so its characteristics are difficult to observe there. In the frequency domain, however, short-term speech signals appear periodic. The signal is therefore converted from the time domain to the frequency domain with a discrete Fourier transform (DFT), and its characteristics, or characteristic parameters, are extracted in the frequency domain. The formula for the discrete Fourier transform is as follows:
- Mel filter: A set of triangular bandpass filters is selected, usually P of them, with P typically set to a value between 20 and 40. The center frequency and bandwidth of each triangular bandpass filter are determined according to the Mel scale, a non-linear frequency scale based on the pitch perceived by the human ear. The frequency-domain response of each filter is a triangle that takes its maximum value at the center frequency and decreases linearly toward both sides until the response reaches zero. The response of each filter is applied to the spectrum of the frame, and the output represents the logarithmic energy (Ek) of the audio in that frequency band, which is equivalent to dividing the original signal into several bandpass signals of different frequencies. The Mel scale reflects the human ear’s sensitivity to frequency, and the ear’s perception of frequency, f, varies approximately logarithmically. The conversion between the Mel frequency and the ordinary frequency, f, is given in Formula (7):
- Discrete cosine transform (DCT): The logarithmic energies, Ek, are then passed through a discrete cosine transform to obtain the L-order Mel-scale cepstral coefficients, where L is usually 12. A sketch of this whole preprocessing chain follows this list.
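To make the chain above concrete, the following is a minimal NumPy/SciPy sketch of the preprocessing pipeline (pre-emphasis, frame blocking, Hamming windowing, DFT, triangular Mel filtering, and DCT). The function and parameter names (mfcc, frame_ms, hop_ms, alpha, n_fft) are illustrative assumptions, not the authors' implementation, and pre-emphasis is applied to the whole signal before framing as a common simplification; the formula numbers in the comments refer to the equations cited above.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Mel scale (Formula (7)): mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=44100, frame_ms=25, hop_ms=10, alpha=0.97,
         n_fft=2048, n_filters=40, n_ceps=12):
    signal = np.asarray(signal, dtype=np.float64)

    # Pre-emphasis: a simple high-pass filter, y[n] = x[n] - alpha * x[n - 1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Frame blocking: N-point frames that overlap by (frame_ms - hop_ms) milliseconds
    frame_len = int(round(sr * frame_ms / 1000))     # N sampling points per frame
    hop = int(round(sr * hop_ms / 1000))
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])

    # Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    frames = frames * np.hamming(frame_len)

    # DFT: power spectrum of each windowed frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular Mel filter bank: n_filters triangles evenly spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(1, n_filters + 1):
        left, center, right = bins[k - 1], bins[k], bins[k + 1]
        fbank[k - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[k - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log filter-bank energies E_k, then DCT; keep the first L coefficients (L = 12 here;
    # whether the 0th coefficient is retained varies between implementations)
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```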
3.2. Data Augmentation
3.3. Sound Classification Model
4. Experimental Results
- In the frame-blocking procedure: Since the sound signal is sampled at 44.1 kHz for 5 s of monophonic audio, the audio frame length is set to 25 ms, and the overlapping area between audio frames is set to 15 ms. Therefore, there are N = 1103 sampling points in one audio frame, and M = 662 of them are shared with the adjacent audio frame.
- In the pre-emphasis procedure: The high-frequency components of the audio are emphasized according to Formula (1).
- In the Hamming window procedure: The Hamming window is applied to the 1103 sampling points in each audio frame according to Formulas (2) and (3).
- In the signal transformation procedure: The audio in the time domain is converted into the energy distribution in the frequency domain according to Formulas (5) and (6).
- In the Mel filter procedure: The energy spectrum is multiplied by K sets of triangular bandpass filters, and the logarithmic energy (Ek) output by each filter is obtained according to Formula (7); extracting features with several filter sets yields several feature variants per clip, which is the basis of the data augmentation (see the sketch after this list).
- In the DCT procedure: The discrete cosine transform will be calculated according to Formula (8).
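One plausible reading of the K filter sets is that each clip is processed once per filter-bank size, multiplying the training data by K. The sketch below follows that reading and reuses the hypothetical mfcc helper from Section 3.1; the FILTER_SETS mapping mirrors the filter-set table reported with the experiments, and the function name augment is an illustrative assumption.

```python
# Number of triangular bandpass filters in each set, for K = 1..5
# (mirrors the filter-set table used in the experiments).
FILTER_SETS = {
    1: [40],
    2: [30, 40],
    3: [20, 30, 40],
    4: [20, 30, 35, 40],
    5: [20, 25, 30, 35, 40],
}

def augment(signal, sr=44100, k=2):
    """Return K feature maps for one clip, one per filter-bank size.

    Extracting features with K different Mel filter-bank sizes multiplies the
    amount of training data by K (K = 2 doubles it, i.e., amplifies it by 100%).
    """
    return [mfcc(signal, sr=sr, n_filters=p) for p in FILTER_SETS[k]]
```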
4.1. Experimental Results Based on the ESC-50 Dataset
4.2. Experimental Results Based on the UrbanSound8K Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Han, W.; Zhang, Z.; Zhang, Y.; Yu, J.; Chiu, C.-C.; Qin, J.; Gulati, A.; Pang, R.; Wu, Y. ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. In Proceedings of the Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 3610–3614. [Google Scholar]
- Orken, M.; Dina, O.; Keylan, A.; Tolganay, T.; Mohamed, O. A study of transformer-based end-to-end speech recognition system for Kazakh language. Sci. Rep. 2022, 12, 8337. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y. Music Recommendation System and Recommendation Model Based on Convolutional Neural Network. Mob. Inf. Syst. 2022, 2022, 3387598. [Google Scholar] [CrossRef]
- Huang, K.; Qin, H.; Zhang, X.; Zhang, H. Music recommender system based on graph convolutional neural networks with attention mechanism. Neural Netw. 2021, 135, 107–117. [Google Scholar]
- Marc, G.; Damian, M. Environmental sound monitoring using machine learning on mobile devices. Appl. Acoust. 2020, 159, 107041. [Google Scholar]
- Nogueira, A.F.R.; Oliveira, H.S.; Machado, J.J.M.; Tavares, J.M.R.S. Sound Classification and Processing of Urban Environments: A Systematic Literature Review. Sensors 2022, 22, 8608. [Google Scholar] [CrossRef] [PubMed]
- Nishida, T.; Dohi, K.; Endo, T.; Yamamoto, M.; Kawaguchi, Y. Anomalous Sound Detection Based on Machine Activity Detection. In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022. [Google Scholar]
- Wang, Y.; Zheng, Y.; Zhang, Y.; Xie, Y.; Xu, S.; Hu, Y.; He, L. Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Using Classification-Based Methods. Appl. Sci. 2021, 11, 11128. [Google Scholar] [CrossRef]
- Vasudevan, S.K.; Pulari, S.R.; Vasudevan, S. Deep Learning: A Comprehensive Guide, 1st ed.; Chapman & Hall: Boca Raton, FL, USA, 2021. [Google Scholar]
- Viarbitskaya, T.; Dobrucki, A. Audio processing with using Python language science libraries. In Proceedings of the 2018 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 19–21 September 2018; pp. 350–354. [Google Scholar]
- Hansen, E.W. Fourier Transforms: Principles and Applications; Wiley: Hoboken, NJ, USA, 2014. [Google Scholar]
- Sejdić, E.; Djurović, I.; Jiang, J. Time-frequency feature representation using energy concentration: An overview of recent advances. Digit. Signal Process. 2009, 19, 153–183. [Google Scholar] [CrossRef]
- Franzese, M.; Iuliano, A. Hidden Markov Models. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2019; Volume 1, pp. 753–762. [Google Scholar]
- Wan, H.; Wang, H.; Scotney, B.; Liu, J. A Novel Gaussian Mixture Model for Classification. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019. [Google Scholar]
- Han, T.; Kim, K.; Park, H. Location Estimation of Predominant Sound Source with Embedded Source Separation in Amplitude-Panned Stereo Signal. IEEE Signal Process. Lett. 2015, 22, 1685–1688. [Google Scholar] [CrossRef]
- Półrolniczak, E.; Kramarczyk, M. Estimation of singing voice types based on voice parameters analysis. In Proceedings of the 2017 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 20–22 September 2017; pp. 63–68. [Google Scholar]
- Thwe, K.Z.; War, N. Environmental sound classification based on time-frequency representation. In Proceedings of the 2017 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Kanazawa, Japan, 26–28 June 2017; pp. 251–255. [Google Scholar]
- Abdul, Z.K.; Al-Talabani, A.K. Mel Frequency Cepstral Coefficient and Its Applications: A Review. IEEE Access 2022, 10, 122136–122158. [Google Scholar]
- Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
- Xie, J.; Hu, K.; Zhu, M.; Yu, J.; Zhu, Q. Investigation of Different CNN-Based Models for Improved Bird Sound Classification. IEEE Access 2019, 7, 175353–175361. [Google Scholar] [CrossRef]
- Das, J.K.; Ghosh, A.; Pal, A.K.; Dutta, S.; Chakrabarty, A. Urban Sound Classification Using Convolutional Neural Network and Long Short Term Memory Based on Multiple Features. In Proceedings of the 2020 Fourth International Conference On Intelligent Computing in Data Sciences (ICDS), Fez, Morocco, 21–23 October 2020; pp. 1–9. [Google Scholar]
- Chi, Z.; Li, Y.; Chen, C. Deep Convolutional Neural Network Combined with Concatenated Spectrogram for Environmental Sound Classification. In Proceedings of the 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 19–20 October 2019; pp. 251–254. [Google Scholar]
- Piczak, K.J. ESC-50: Dataset for Environmental Sound Classification. Available online: https://github.com/karolpiczak/ESC-50 (accessed on 8 July 2023).
- Salamon, J.; Jacoby, C.; Bello, J.P. Urban Sound Datasets. Available online: https://urbansounddataset.weebly.com/urbansound8k.html (accessed on 8 July 2023).
- Mohaimenuzzaman, M.; Bergmeir, C.; West, I.; Meyer, B. Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices. Pattern Recognit. 2023, 133, 109025. [Google Scholar] [CrossRef]
- Takahashi, N.; Gygli, M.; Pfister, B.; Gool, L.V. Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, San Francisco, CA, USA, 8–12 September 2016; pp. 2982–2986. [Google Scholar]
- Salamon, J.; Bello, J.P. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 2017, 24, 279–283. [Google Scholar] [CrossRef]
- Nam, H.; Kim, S.-H.; Park, Y.-H. Filteraugment: An Acoustic Environmental Data Augmentation Method. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [Google Scholar]
- Nanni, L.; Maguolo, G.; Brahnam, S.; Paci, M. An Ensemble of Convolutional Neural Networks for Audio Classification. Appl. Sci. 2021, 11, 5796. [Google Scholar] [CrossRef]
- Garcia-Balboa, J.L.; Alba-Fernandez, M.V.; Ariza-López, F.J.; Rodriguez-Avi, J. Homogeneity Test for Confusion Matrices: A Method and an Example. In Proceedings of the IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1203–1205. [Google Scholar]
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar]
Sets of Triangular Bandpass Filters | Number of Triangular Bandpass Filters |
---|---|
K = 1 | 40 |
K = 2 | 30, 40 |
K = 3 | 20, 30, 40 |
K = 4 | 20, 30, 35, 40 |
K = 5 | 20, 25, 30, 35, 40 |
Hyperparameters | Values |
---|---|
Epoch | 500 |
Batch_Size | 100 |
Loss_Function | categorical_crossentropy |
Optimizer | Adam |
Learning Rate | 0.001 |
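The hyperparameters above correspond to a standard Keras training configuration. The sketch below is illustrative only: the convolutional architecture is a placeholder assumption (the paper's exact CNN layers are not reproduced here), while the optimizer, learning rate, loss function, epoch count, and batch size follow the table.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape, n_classes):
    # Placeholder CNN; the paper's actual layer configuration is not shown in this excerpt.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
    # Hyperparameters from the table above: Adam, learning rate 0.001, categorical cross-entropy.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Training would then use Epoch = 500 and Batch_Size = 100, e.g.:
# model.fit(x_train, y_train, epochs=500, batch_size=100, validation_data=(x_val, y_val))
```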
The five major categories of the ESC-50 dataset and their classes:

No. | Animals | Natural Soundscapes and Water Sounds | Human, Non-Speech Sounds | Interior/Domestic Sounds | Exterior/Urban Noises
---|---|---|---|---|---
1 | Dog | Rain | Crying baby | Door knock | Helicopter |
2 | Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
3 | Pig | Crackling fire | Clapping | Keyboard typing | Siren |
4 | Cow | Crickets | Breathing | Door, wood creaks | Car horn |
5 | Frog | Chirping birds | Coughing | Can opening | Engine |
6 | Cat | Water drops | Footsteps | Washing machine | Train |
7 | Hen | Wind | Laughing | Vacuum cleaner | Church bells |
8 | Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
9 | Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
10 | Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |
Actual \ Predict | Original Dataset (K = 1) | Augmentation (K = 2)
---|---|---|---|---|---|---|---|---|---|---
 | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 5
1 | 23 | 5 | 2 | 3 | 2 | 75 | 6 | 3 | 0 | 5 |
2 | 7 | 26 | 3 | 1 | 5 | 0 | 65 | 3 | 2 | 4 |
3 | 3 | 5 | 26 | 3 | 3 | 3 | 2 | 81 | 1 | 1 |
4 | 2 | 4 | 8 | 23 | 6 | 0 | 2 | 4 | 69 | 7 |
5 | 0 | 6 | 6 | 2 | 25 | 0 | 0 | 0 | 0 | 67 |
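The per-class precision, recall, and F1 values reported in the next table follow directly from a confusion matrix like the one above, with rows as actual classes and columns as predicted classes. A minimal NumPy sketch of that computation, using the K = 2 sub-matrix above as a worked example:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix
    whose rows are actual classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # TP / (TP + FP): column sum = everything predicted as that class
    recall = tp / cm.sum(axis=1)      # TP / (TP + FN): row sum = every actual sample of that class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# K = 2 sub-matrix from the table above
cm_k2 = [[75, 6, 3, 0, 5],
         [0, 65, 3, 2, 4],
         [3, 2, 81, 1, 1],
         [0, 2, 4, 69, 7],
         [0, 0, 0, 0, 67]]
precision, recall, f1 = per_class_metrics(cm_k2)
# e.g., class 1: precision ~0.96, recall ~0.84, F1 ~0.90, matching the next table
```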
Category | Original Dataset (K = 1) | Augmentation (K = 2) | ||||||
---|---|---|---|---|---|---|---|---|
Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score | |
1 | 0.66 | 0.66 | 0.67 | 0.68 | 0.84 | 0.96 | 0.84 | 0.90 |
2 | 0.62 | 0.56 | 0.62 | 0.59 | 0.88 | 0.87 | 0.88 | 0.87 |
3 | 0.65 | 0.58 | 0.63 | 0.60 | 0.92 | 0.89 | 0.92 | 0.91 |
4 | 0.53 | 0.72 | 0.52 | 0.61 | 0.84 | 0.96 | 0.84 | 0.90 |
5 | 0.64 | 0.61 | 0.63 | 0.60 | 1.00 | 0.80 | 1.00 | 0.89 |
Average | 0.62 | 0.63 | 0.62 | 0.62 | 0.90 | 0.90 | 0.90 | 0.89 |
No. | Classes | Number of Samples | No. | Classes | Number of Samples
---|---|---|---|---|---
1 | air_conditioner | 1000 | 6 | engine_idling | 1000
2 | car_horn | 429 | 7 | gun_shot | 374
3 | children_playing | 1000 | 8 | jackhammer | 1000
4 | dog_bark | 1000 | 9 | siren | 929
5 | drilling | 1000 | 10 | street_music | 1000
Class | Original Dataset (K = 1) | Augmentation (K = 2) | ||||||
---|---|---|---|---|---|---|---|---|
Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score | |
1 | 0.93 | 0.93 | 0.83 | 0.93 | 0.98 | 0.91 | 0.98 | 0.94
2 | 0.87 | 0.97 | 0.87 | 0.92 | 0.81 | 0.91 | 0.81 | 0.86 |
3 | 0.88 | 0.77 | 0.88 | 0.82 | 0.89 | 0.86 | 0.89 | 0.88 |
4 | 0.86 | 0.96 | 0.86 | 0.91 | 0.86 | 0.96 | 0.86 | 0.91 |
5 | 0.85 | 0.91 | 0.85 | 0.88 | 0.91 | 0.93 | 0.91 | 0.92 |
6 | 0.93 | 0.93 | 0.93 | 0.93 | 0.97 | 0.93 | 0.97 | 0.95 |
7 | 0.94 | 0.91 | 0.94 | 0.93 | 1.00 | 1.00 | 1.00 | 1.00 |
8 | 0.92 | 0.88 | 0.92 | 0.90 | 0.96 | 0.94 | 0.96 | 0.95 |
9 | 0.97 | 0.92 | 0.97 | 0.94 | 0.98 | 0.88 | 0.98 | 0.93 |
10 | 0.81 | 0.84 | 0.81 | 0.82 | 0.87 | 0.91 | 0.80 | 0.85 |
Average | 0.90 | 0.92 | 0.90 | 0.89 | 0.92 | 0.92 | 0.92 | 0.92 |
Category | New Training Set (Half of the Original Dataset, Doubled Using Data Augmentation)
---|---|---|---|---|
Accuracy | Precision | Recall | F1 Score | |
1 | 0.97 | 0.93 | 0.97 | 0.95 |
2 | 0.74 | 0.97 | 0.74 | 0.84 |
3 | 0.84 | 0.85 | 0.88 | 0.86 |
4 | 0.90 | 0.93 | 0.87 | 0.90 |
5 | 0.89 | 0.92 | 0.89 | 0.90 |
6 | 0.94 | 0.93 | 0.93 | 0.93 |
7 | 1.00 | 1.00 | 1.00 | 1.00 |
8 | 0.95 | 0.96 | 0.92 | 0.94 |
9 | 0.91 | 0.93 | 0.91 | 0.92 |
10 | 0.93 | 0.80 | 0.93 | 0.86 |
Average | 0.91 | 0.92 | 0.90 | 0.91 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chu, H.-C.; Zhang, Y.-L.; Chiang, H.-C. A CNN Sound Classification Mechanism Using Data Augmentation. Sensors 2023, 23, 6972. https://doi.org/10.3390/s23156972