1 Introduction

Modern technology continues to explore new frontiers in artificial intelligence and machine learning, with growing attention to sound and acoustic recognition. Research in sound and acoustic recognition aims to mimic the human auditory system, focusing on precise processing and deep understanding of auditory data [1,2,3,4]. Traditional sound and acoustic recognition systems have primarily relied on classical machine learning and mathematical models [5, 6]. However, recent advancements in AI technology have led to the predominance of neural network-based research in this area [1,2,3,4, 7, 8]. These auditory deep learning systems are used in fields such as sound classification, localization, and voice recognition, with various software-based neural network models actively researched in each category.

This review paper focuses on Spiking Neural Networks (SNNs) [9, 10] for sound and acoustic recognition, highlighting their potential to transcend digital computer architectures by leveraging analog computing structures. SNNs communicate and learn information in a manner similar to biological neurons, using sporadic electrical firings [11], otherwise known as ’spikes’. Unlike software-centric neural networks, SNNs are based on gate-level hardware, operating on analog computing principles rather than von Neumann or digital computer architectures. This leads to characteristics such as low power consumption and low latency, making SNNs particularly promising for embedded devices [10]. Moreover, SNNs’ capacity to naturally integrate temporal dynamics makes them exceptionally well-suited for handling time-series data like sound [12,13,14,15].

This paper will delve into the latest research trends in sound recognition using SNNs, reviewing key technical advancements and innovations. It will also discuss how this technology could be applied across various application areas and outline future research directions. This paper will first cover the characteristics of SNNs, then discuss recent SNN technologies and research related to sound and provide a summary of reported research findings, and lastly suggest future research directions based on current technology levels and limitations.

2 Auditory SNN overview

2.1 SNN outlines

SNNs [9] attempt to mimic neural cells at the hardware level. In contrast to traditional software-based virtual neural networks, they demonstrate exceptional performance in embedded devices and in dynamic environments that must operate under harsh conditions. These characteristics arise because SNNs fundamentally follow the structure of analog computers, using analog signals as design units to emulate the physical properties of biological neural networks through purely electrical operations. Recent studies have shown that, compared to conventional deep learning methods, SNNs offer superior energy efficiency and processing speed for voice signal processing, especially in time-sensitive sound recognition tasks. This advantage stems from their structure as ’analog computer’-based neural network systems, which surpass traditional neural networks in power efficiency and latency for sound recognition.

Although SNN models initially focused on simulating simple spiking activities, recent research has shifted attention to their learning capabilities and real-time data processing abilities. These virtues shine in mobile and wearable technologies, and so recent studies tend to focus on how the technology may be applied in these environments [16,17,18,19].

The operation of a typical SNN can be mathematically divided into two main components: the behavior of neurons and the transmission of spikes.

Behavior of Neurons: Each neuron changes its state according to the input signals. The state of a neuron is commonly represented by its voltage state, which varies over time. The voltage \( V(t) \) of a neuron is expressed as a function of time \( t \), and it generates a spike when it reaches a certain threshold voltage \( V_{\text {threshold}} \) [20, 21]. The voltage state of a neuron can be modeled by the following differential equation:

$$\begin{aligned} \tau \frac{dV(t)}{dt} = -V(t) + I(t) \end{aligned}$$
(1)

Here, \( \tau \) is the time constant, and \( I(t) \) is the input signal over time.
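
For illustration, the following minimal Python sketch integrates Eq. (1) with a forward-Euler step and adds an explicit threshold-and-reset rule; the parameter values (time constant, threshold, step size) are illustrative assumptions rather than values from any cited model.

```python
import numpy as np

# Minimal sketch: forward-Euler integration of Eq. (1) plus a
# threshold-and-reset rule. Parameter values are illustrative only.
TAU = 10.0          # membrane time constant (ms), assumed
V_THRESHOLD = 1.0   # firing threshold, assumed
V_RESET = 0.0       # membrane voltage after a spike
DT = 0.1            # integration step (ms)

def simulate_lif(current):
    """Integrate tau * dV/dt = -V + I(t); spike when V reaches threshold."""
    v = V_RESET
    voltages, spikes = [], []
    for i_t in current:
        v += (DT / TAU) * (-v + i_t)   # Euler step of Eq. (1)
        fired = v >= V_THRESHOLD
        if fired:
            v = V_RESET                # reset after the spike
        voltages.append(v)
        spikes.append(fired)
    return np.array(voltages), np.array(spikes)

# A constant supra-threshold input produces a regular spike train.
v_trace, spike_train = simulate_lif(np.full(1000, 1.5))
print(f"{spike_train.sum()} spikes in {1000 * DT:.0f} ms")
```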

Transmission of Spikes: When a neuron generates a spike, this spike is transmitted to other neurons. The effect of the spike is regulated through weights. The transmission of a spike from neuron \( j \) to neuron \( i \) is determined by the weight \( W_{ij} \). This can be expressed by the following equation:

$$\begin{aligned} I_i(t) = \sum _j W_{ij} S_j(t) \end{aligned}$$
(2)

Where \( I_i(t) \) is the total input signal reaching neuron \( i \), \( W_{ij} \) is the connection strength from neuron \( j \) to \( i \), and \( S_j(t) \) is the spike time function of neuron \( j \). In this way, SNNs can respond in a more natural manner to temporally varying inputs.
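
A sketch of Eq. (2) follows: the input to each postsynaptic neuron is the weighted sum of the presynaptic spikes at the current time step. The network size and random weights here are arbitrary illustration values.

```python
import numpy as np

# Sketch of Eq. (2): the input to neuron i is the weighted sum of
# presynaptic spikes. Sizes and weights are arbitrary illustrations.
rng = np.random.default_rng(0)
N_PRE, N_POST = 8, 3
W = rng.normal(0.0, 0.5, size=(N_POST, N_PRE))   # W[i, j]: strength from j to i

def synaptic_input(spike_vector):
    """I_i(t) = sum_j W_ij * S_j(t) for a binary spike vector S(t)."""
    return W @ spike_vector

s_t = (rng.random(N_PRE) < 0.3).astype(float)    # random spikes at time t
print(synaptic_input(s_t))                       # total input to each neuron i
```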

To provide appropriate inputs for an SNN designed in this way, conventional time-series data must be encoded into spike signals [22,23,24]. Two methods are typically used for signal encoding: Rate Coding and Temporal Coding. These methods encode different aspects of the signal into spikes.

Rate Coding: In this method, the intensity of the input signal is converted into the firing frequency of neurons [22]. A more intense signal is converted into a series of frequent spikes, whereas a less intense signal would be transformed into a series of infrequent spikes. Given a time-series signal \( x(t) \), rate coding to convert this into a spike sequence can be expressed as follows:

$$\begin{aligned} S(t) = f(x(t)) \end{aligned}$$
(3)

Where \( S(t) \) represents the occurrence of spikes at time \( t \), and \( f \) is the function that converts the signal into spiking frequency.
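
As a concrete example, the sketch below implements a simple Poisson-style rate coder: the normalized signal intensity sets the firing probability per time step. The linear mapping \( f \) and the maximum rate are illustrative choices, not taken from any cited work.

```python
import numpy as np

# Sketch of rate coding (Eq. 3): signal intensity sets the firing
# probability per step (Bernoulli approximation of a Poisson process).
# The linear mapping f and MAX_RATE are illustrative assumptions.
rng = np.random.default_rng(1)
MAX_RATE = 100.0   # firing rate for a full-scale input (Hz)
DT = 1e-3          # time step (s)

def rate_encode(x):
    """Spike with probability f(x) * dt, where f scales x in [0, 1] to Hz."""
    p_spike = np.clip(x, 0.0, 1.0) * MAX_RATE * DT
    return rng.random(x.shape) < p_spike

t = np.linspace(0.0, 1.0, 1000)
signal = 0.5 * (1.0 + np.sin(2.0 * np.pi * 2.0 * t))   # example input in [0, 1]
spikes = rate_encode(signal)
print(f"mean firing rate ~ {spikes.mean() / DT:.1f} Hz")
```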

Temporal Coding: In this method, the temporal pattern of the input signal is encoded in the timing at which the signal changes [22, 23]. Here, the timing of signal changes is more important than the signal’s intensity. For a time-series signal \( x(t) \), temporally coded spikes can have the following relationship:

$$\begin{aligned} S(t) = g(\Delta x(t)) \end{aligned}$$
(4)

Where \( \Delta x(t) \) is the change in the signal over time, and \( g \) is the function that converts this change into the timing of spikes.
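
The following sketch shows one common form of temporal coding, a delta-modulation scheme in which spikes mark the moments the signal has changed by more than a threshold, with separate ON/OFF channels for rises and falls; the threshold value is an illustrative choice.

```python
import numpy as np

# Sketch of temporal coding (Eq. 4): spikes mark when the signal has
# changed by more than a threshold since the last spike, with ON/OFF
# channels for rises and falls. THRESHOLD is an illustrative choice.
THRESHOLD = 0.05

def temporal_encode(x):
    on = np.zeros(len(x), dtype=bool)
    off = np.zeros(len(x), dtype=bool)
    ref = x[0]                        # value at the last emitted spike
    for t in range(1, len(x)):
        delta = x[t] - ref            # change of the signal (Delta x)
        if delta > THRESHOLD:         # rose enough: ON spike
            on[t], ref = True, x[t]
        elif delta < -THRESHOLD:      # fell enough: OFF spike
            off[t], ref = True, x[t]
    return on, off

t = np.linspace(0.0, 1.0, 1000)
on_spikes, off_spikes = temporal_encode(np.sin(2.0 * np.pi * 3.0 * t))
print(on_spikes.sum(), "ON spikes,", off_spikes.sum(), "OFF spikes")
```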

Both encoding methods are used to convert important characteristics of time-series data into spikes. Rate Coding reflects the intensity information of the input signal, while Temporal Coding captures the temporal pattern and changes in the signal more effectively. These converted spikes are processed through the SNN.

2.2 SNN in sound domain

2.2.1 Sound localization

Sound localization is an important ability that allows humans and animals to identify the source of sounds in their environment and determine their direction [3]. This process utilizes various signal processing mechanisms such as the Time Difference of Arrival (TDOA), Interaural Level Difference (ILD), and the filtering effects caused by the shape of the ears [strutt1907our, grumiaux2022survey, niu2019deep, desai2022review]. Research on sound localization using SNNs leverages the capability of these networks to process complex signals in real-time. For instance, SNNs can precisely calculate the TDOA of sound signals to estimate their direction, opening up potential applications in robotics, hearing aids, and environmental monitoring systems. Moreover, SNNs can be used to extract various features from sound signals and then use their findings to identify the source of the sounds. This can be particularly useful in distinguishing multiple sound sources in complex environments [25, 26].
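
To ground what such networks estimate, the following non-SNN sketch shows the classical computation: estimating the TDOA between two microphones by cross-correlation and converting it to a far-field azimuth. The sampling rate, microphone spacing, and test signal are hypothetical illustration values.

```python
import numpy as np

# Classical (non-SNN) baseline: estimate the TDOA between two microphones
# by cross-correlation, then convert it to a far-field azimuth.
# Sampling rate, spacing, and the test signal are assumed values.
FS = 16_000          # sampling rate (Hz)
MIC_DISTANCE = 0.2   # microphone spacing (m)
C = 343.0            # speed of sound (m/s)

def estimate_tdoa(a, b):
    """Lag (in seconds) by which channel `a` trails channel `b`."""
    corr = np.correlate(a, b, mode="full")
    lag = int(np.argmax(corr)) - (len(b) - 1)
    return lag / FS

def tdoa_to_azimuth(tdoa):
    """Far-field approximation: sin(theta) = c * TDOA / d."""
    return np.degrees(np.arcsin(np.clip(C * tdoa / MIC_DISTANCE, -1.0, 1.0)))

rng = np.random.default_rng(2)
src = rng.normal(size=FS // 10)         # a 100 ms noise burst
delay = 5                               # samples, ~0.31 ms of delay
left = np.roll(src, delay)              # sound reaches the left mic later
right = src
print(f"azimuth ~ {tdoa_to_azimuth(estimate_tdoa(left, right)):.1f} deg")
```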

Liu et al. [27] propose a spiking neural network inspired by the auditory sensory neural pathway of mammals. In mammals, three parts of the brainstem are associated with locating a sound source: the medial superior olive (MSO), the lateral superior olive (LSO), and the inferior colliculus (IC). Correspondingly, Liu et al. [27] use three types of artificial neurons to estimate the TDOA, the ILD, and the azimuth angle, respectively. Under the assumption that the azimuth angle is related to TDOA and ILD, they apply Bayes’ theorem to refine the estimate of the azimuthal location. Notably, unlike previous studies that did not use ILD information, this approach greatly improved localization performance on sound data with frequencies above 1 kHz.

Wall et al. [28] likewise mimic a mammalian auditory processing nucleus, namely the MSO. They introduce Ben’s Spiker Algorithm (BSA) to convert sound signals into biologically realistic spike trains. These spike trains are then filtered through another layer to remove any erroneous spikes resulting from noise. The resulting spike trains are finally sent to a bank of neurons simulating the MSO using Jeffress’ model [29]. This network produces biologically realistic spike trains while demonstrating that biologically inspired sound localization can compare favorably with classical techniques such as cross-correlation.
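
The coincidence-detection idea behind Jeffress’ model [29] can be sketched as follows: an array of detectors, each pairing complementary delays from the two ears, with the detector whose delay cancels the interaural time difference firing most. This is a minimal illustration under assumed spike trains and delays, not Wall et al.’s actual MSO implementation.

```python
import numpy as np

# Minimal sketch of Jeffress-style coincidence detection [29]: each
# detector advances the lagging pathway by a candidate delay and counts
# coincident spikes; the best-matching delay estimates the ITD.
# Spike trains and delays are illustrative, not Wall et al.'s MSO model.
def jeffress_itd(left_spikes, right_spikes, max_delay):
    """Return the ITD (in samples) whose detector sees the most coincidences."""
    delays = np.arange(-max_delay, max_delay + 1)
    counts = [np.sum(left_spikes & np.roll(right_spikes, -d)) for d in delays]
    return int(delays[int(np.argmax(counts))])

rng = np.random.default_rng(3)
base = rng.random(2000) < 0.05            # a sparse random spike train
true_itd = 7
left = base
right = np.roll(base, true_itd)           # the right ear hears it later
print(jeffress_itd(left, right, max_delay=20))   # recovers ~7
```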

In subsequent research, Wall et al. [30] take inspiration from the LSO and another distinct part of the auditory brainstem nuclei, the medial nucleus of the trapezoid body (MNTB). The network uses two layers, one derived from each nucleus. The data from both ’ears’ are filtered through a layer mimicking the MNTB, with the input from the ipsilateral ’ear’ relative to the sound source acting as an excitatory signal while the input from the contralateral ’ear’ acts as an inhibitory signal. Both signals are then used in the layer mimicking the LSO to determine the ILD.

Pan et al. [31] continue the trend of drawing on the biological workings of sound localization. Although the study also draws on Jeffress’ model [29], similar to Wall et al. [28], Pan et al. [31] introduce Multi-Tone Phase Coding (MTPC) before the Jeffress model layer to better simulate human hearing and improve localization performance. MTPC breaks down a single sound into multiple tones, similar to how the human cochlea recognizes tones at specific frequencies. The model then uses the Jeffress model to estimate ITD information from the different tones and converts them into spikes. Because Pan et al. [31] convert pure tones rather than a single complex sound into spikes, the model is computationally efficient and improves sound localization accuracy.

Roozbehi et al. [32] advance the domain of sound localization by implementing a dynamic-structured reservoir SNN (rSNN). Although this network also incorporates the Jeffress model, the unique integration of Adaptive Resonance Theory (ART) within the rSNN framework enhances the adaptability and efficiency of sound localization. The ART-rSNN model optimizes the neural arrangement dynamically to amplify the detection of energy near sound sources, significantly enhancing localization precision. This innovative architecture allows for real-time adjustments based on the acoustic environment, offering a substantial improvement in computational efficiency and localization accuracy over traditional models.

Lastly, Haghighatshoar and Muir [33] extend the technological boundaries of sound source localization with a low-power SNN method using a Hilbert transform spike encoding scheme. Unlike the above-mentioned studies, which are grounded heavily in biological findings, this study introduces a novel short-time Hilbert transform (STHT) that circumvents the need for complex band-pass filtering. This approach simplifies the auditory signal processing pipeline by directly obtaining a robust phase signal from wideband audio, which is then used to derive a new beamforming method. The result is state-of-the-art localization accuracy that rivals traditional non-SNN methods but with significantly lower power consumption, making it ideal for integration into low-power IoT devices. The implementation on ultra-low-power SNN hardware demonstrates how signal processing and neural network design can be co-optimized for high efficiency, pushing forward the capabilities of neuromorphic computing in practical applications.
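
A loose sketch of phase-based spike encoding via the Hilbert transform is shown below: the analytic signal yields an instantaneous phase, and spikes are emitted at upward zero-crossings of that phase. This illustrates only the core idea using a full-signal transform; the authors’ STHT is a short-time, streaming variant.

```python
import numpy as np
from scipy.signal import hilbert

# Loose sketch of Hilbert-phase spike encoding: the analytic signal gives
# an instantaneous phase, and a spike is emitted at each upward
# zero-crossing of that phase. This uses a full-signal transform for
# clarity; the STHT of [33] is a short-time, streaming variant.
FS = 16_000
t = np.arange(0.0, 0.05, 1.0 / FS)                 # 50 ms of audio
audio = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

phase = np.angle(hilbert(audio))                   # instantaneous phase
spikes = (phase[:-1] < 0.0) & (phase[1:] >= 0.0)   # upward zero-crossings
print(f"{spikes.sum()} spikes in 50 ms")
```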

The research trajectory in sound localization using SNNs demonstrates a significant evolution, moving from initial biologically inspired models to integrating diverse computational strategies that enhance efficiency and adaptability. Notably, most papers aim to advance the practical use of computational sound localization while also trying to better understand the inner workings of the biological human sound localization system. As SNN research on sound localization draws insights from neurophysiology and, in turn, contributes its findings to other fields related to this task, the potential for cross-disciplinary innovation grows significantly, promising further technological advances in this area.

Table 1 summarizes the main research results reported above. Because each paper developed and evaluated its model on different datasets and in different experimental environments, performance cannot be assessed with a single quantitative indicator; instead, each work advanced the state of the art within its own performance area. Table 2 revisits these results, focusing on the papers whose quantitative performance can be compared, summarizing tests conducted on data and in environments that are as similar as possible.

Table 1 Performance comparison and major contributions of SNN-based sound localization models
Table 2 Comparison of different systems for sound localization and classification

2.2.2 Sound classification

Table 3 Performance comparison and major contributions of SNN-based Sound classification models
Table 4 Classification accuracy comparison with RWCP dataset
Table 5 Classification accuracy comparison with TIDIGITS dataset

The integration of AI in sound or voice-related applications has become essential in most embedded devices [61,62,63]. Given the technological requirements for deployment on embedded systems, the low-latency and low-power-consumption advantages of SNNs are particularly appealing. Against this backdrop, a significant amount of research is being conducted on SNN-based sound classification applications, typically targeting speech.

Tavanaei and Maida [40] develop a foundational approach by integrating Spike-Timing-Dependent Plasticity (STDP) with backpropagation, creating a hybrid model known as BP-STDP. This model aligns with the principles of biological neural networks while enhancing computational efficiency, a theme that recurs in subsequent SNN research, particularly in adapting SNNs to more efficiently emulate functionalities like those of rectified linear units (ReLUs) in traditional ANNs.
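
For reference, the classic pair-based STDP update that hybrid schemes such as BP-STDP build on can be sketched as follows; the constants are illustrative, not the values used in [40].

```python
import numpy as np

# Sketch of the classic pair-based STDP rule: potentiation when the
# presynaptic spike precedes the postsynaptic spike, depression
# otherwise, with exponentially decaying influence. Constants are
# illustrative, not the values used in [40].
A_PLUS, A_MINUS = 0.01, 0.012     # learning rates for LTP / LTD
TAU_PLUS, TAU_MINUS = 20.0, 20.0  # decay time constants (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:    # pre fired first -> potentiate
        return A_PLUS * np.exp(-dt / TAU_PLUS)
    return -A_MINUS * np.exp(dt / TAU_MINUS)   # post fired first -> depress

print(stdp_dw(10.0, 15.0))   # positive: causal pairing strengthens
print(stdp_dw(15.0, 10.0))   # negative: anti-causal pairing weakens
```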

Dong et al. [41] extend this effort towards unsupervised learning in SNNs by employing a convolutional architecture and STDP for speech recognition, much like Tavanaei and Maida [40]. Dong et al. [41] focus on energy efficiency and the biological feasibility of their network, highlighting the unsupervised feature extraction capability which directly addresses the challenge of high power consumption noted in conventional ANNs. This reflects a shared emphasis on low power consumption that is crucial for embedded systems.

Martinelli et al. [43] further this exploration within the specific context of Voice Activity Detection (VAD). Similar to the previous studies, this research focuses on the power efficiency of SNNs but also tackles the challenge of effective training algorithms for SNNs. By adapting recurrent network training methodologies to SNNs, they manage to achieve state-of-the-art VAD performance while maintaining the low-energy consumption characteristic of SNNs, bridging a common gap in performance between artificial neural networks and SNNs.

Amin [42] introduces another innovative approach through the Adaptive Threshold Module (ATM), which dynamically adjusts neuron thresholds to enhance feature extraction. This adaptation directly targets the optimization of input spike train processing, a fundamental challenge across SNN applications aiming for real-time processing. The model echoes the prior emphasis on improving computational efficiency and reducing energy use.
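
The general adaptive-threshold mechanism that modules like the ATM exemplify can be sketched as below: the firing threshold jumps after every spike and decays back toward a baseline, so strongly driven neurons self-regulate their output rate. This is a generic illustration with assumed constants, not Amin’s exact ATM.

```python
import numpy as np

# Generic adaptive-threshold LIF sketch: the threshold jumps after each
# spike and decays back toward its baseline, so a strongly driven neuron
# fires progressively more sparsely. Not Amin's exact ATM; constants
# are illustrative.
def adaptive_lif(current, tau=10.0, dt=0.1,
                 theta0=1.0, theta_jump=0.5, tau_theta=50.0):
    v, theta = 0.0, theta0
    spikes = np.zeros(len(current), dtype=bool)
    for n, i_t in enumerate(current):
        v += (dt / tau) * (-v + i_t)                  # membrane update
        theta += (dt / tau_theta) * (theta0 - theta)  # threshold decays back
        if v >= theta:
            spikes[n] = True
            v = 0.0                 # reset the membrane
            theta += theta_jump     # raise the bar after each spike
    return spikes

spikes = adaptive_lif(np.full(2000, 2.0))
print(spikes[:1000].sum(), "spikes early vs.", spikes[1000:].sum(), "late")
```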

Bensimon et al. [12] and Xiang et al. [44] both push the boundaries of traditional SNN applications by incorporating novel elements, the SCTN and PCSNN models respectively, that integrate with sensory inputs or photonic components. These studies not only emphasize low power consumption but also explore novel hardware integrations to boost performance and efficiency, a common theme aimed at bridging the gap between laboratory research and real-world application.

Lastly, Yang and Chang [45] synthesize these themes into a practical application by developing an ultra-low-power speech recognition accelerator that utilizes an RSNN. Their work not only consolidates the advancements in reducing power consumption and computational complexity but also demonstrates effective integration of these strategies into a device suitable for edge computing.

Table 3 summarizes the main sound classification models discussed above. Each paper improves on prior models with a new methodology in its area of research. Because the papers were tested on different datasets and in different environments, direct quantitative comparison is difficult. To address this, Table 4 and Table 5 collect the reported results of the papers that were evaluated on the RWCP and TIDIGITS datasets, respectively, as quantitative figures.

The integration of SNN technology into voice classification systems is still in its early stages, with ongoing research focused on overcoming challenges such as effective training algorithms, spike encoding methods, and hardware integration. As these challenges are addressed, SNNs are expected to become increasingly viable for a wide range of voice processing applications, offering a path toward more efficient, responsive, and power-aware computing solutions.

2.2.3 Multimodal SNN

Recently, sound-processing SNNs have aimed to add visual processing to create multimodal models. The visual and auditory cortices in the human brain are functionally and structurally connected, enabling the integration of sensory information [64, 65]. Given that SNNs aim to replicate the brain’s processing structure and capabilities, the progression towards audio-visual multimodal SNNs seems like a natural development.

Rathi and Roy [66] introduce an unsupervised spiking neural network that integrates auditory and visual inputs using distinct unimodal networks linked through cross-modal connections trained via spike-timing-dependent plasticity (STDP). This approach harnesses the inherent correlations between sensory modalities, enhancing robustness and improving classification accuracy in noisy environments. The focus on unsupervised learning and organic integration of sensory data through cross-modal connections is particularly beneficial in environments where labeled data is scarce, marking a significant shift towards exploiting natural sensory interactions without supervised training.

In contrast, Liu et al. [67] build upon the concept of multimodal integration in SNNs but adopt a different approach, using a recurrent SNN for auditory data and a convolutional SNN for visual data, each optimized for temporal and spatial features, respectively. These networks are then fused through an attention-based cross-modal sub-network that dynamically evaluates and adjusts the weights assigned to each sensory input, optimizing integration based on data reliability. This architecture not only demonstrates the benefits of multimodal sensory processing but also significantly enhances performance over unimodal auditory SNNs by about 7 percentage points, illustrating the effectiveness of attention mechanisms in complex sensory environments. This contrast with the method of [66] underscores the diversity in strategies for integrating multimodal data in neural networks, each offering unique advantages depending on the application context.
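
The core of such attention-based fusion can be illustrated with a heavily simplified sketch: each modality’s feature vector receives a scalar reliability score, a softmax turns the scores into weights, and the fused representation is their weighted sum. The scoring vectors below are random stand-ins for what would normally be learned parameters; this is not Liu et al.’s actual cross-modal sub-network.

```python
import numpy as np

# Heavily simplified attention-style fusion: score each modality's
# feature vector, softmax the scores into weights, and take the weighted
# sum. The scoring vectors are random stand-ins for learned parameters;
# this is not Liu et al.'s actual cross-modal sub-network.
def attention_fuse(audio_feat, visual_feat, w_audio, w_visual):
    scores = np.array([audio_feat @ w_audio, visual_feat @ w_visual])
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights /= weights.sum()
    fused = weights[0] * audio_feat + weights[1] * visual_feat
    return fused, weights

rng = np.random.default_rng(4)
D = 16                                               # feature dimension, assumed
audio, visual = rng.normal(size=D), rng.normal(size=D)
fused, w = attention_fuse(audio, visual, rng.normal(size=D), rng.normal(size=D))
print("modality weights:", np.round(w, 3))
```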

Guo et al. [68] introduce the Spiking Multi-Model Transformer (SMMT), which advances multimodal SNNs by integrating auditory and visual data using a Spiking Cross-Attention (SCA) mechanism. This model refines Liu et al. [67]’s attention-based method by employing a unified transformer framework that processes both modalities simultaneously, simplifying the architecture and potentially increasing integration efficiency. This approach demonstrates improved performance on diverse datasets, showcasing a significant evolution in SNNs for handling complex multimodal data.

Table 6 summarizes the multimodal SNN models, which combine the sound domain reported above with other sensory domains.

Table 6 Performance of SNN-based multimodal learning Models

3 Discussion

SNNs demonstrate potential in bridging the gap between current artificial intelligence systems and biological neural processing for auditory signal processing. The temporal dynamics of acoustic signals align well with the characteristics of SNNs, marking a significant step towards mimicking human neural activity. This not only suggests the possibility of advancing AI but also underscores the importance of biological fidelity in future technology development. The architecture of SNNs is designed to emulate the operating principles of the human nervous system, offering a path to overcoming the limitations of traditional software-based neural networks. By replicating how human neurons encode and process stimuli, SNNs are expected to increase biological accuracy in AI systems. However, this poses the challenge of achieving the computational efficiency and speed of the human brain, which is not easily attainable within current digital computing paradigms.

The significance of this study lies in its pioneering role in extending the application scope of SNNs from the visual domain to various sensory data processing tasks, including audition. The study introduces diverse methods for encoding non-visual external stimuli, such as sound, into spikes. These advancements are expected to further broaden the potential of SNNs in sensory information processing, paving the way for novel applications and insights into biological neural processing. Encoding external stimuli into spikes presents a novel approach to AI development, potentially redefining the boundaries of neural network technology by providing a solution to the challenge of accurate stimulus embedding.

SNN research in the sound domain is currently making remarkable progress, as reported in this review. In sound localization, studies by Liu et al. [27], Wall et al. [8, 28], Pan et al. [31], and Haghighatshoar and Muir [33] have demonstrated that SNNs can accurately estimate sound source locations even in complex auditory environments. Moreover, in speech recognition, research by Dong et al. [41], Wu et al. [52, 69], and Xiang et al. [44] has confirmed that SNNs can achieve performance comparable to existing deep learning models.

Furthermore, recent studies on multimodal information processing using SNNs have gained significant attention. [66] proposed an unsupervised learning-based multimodal SNN that integrates visual and auditory information, while [67] introduced a supervised learning-based multimodal SNN with an attention mechanism. These approaches demonstrate similarities to human multimodal information processing by flexibly combining and modeling interactions among different sensory inputs. Notably, cross-modal connections at the spike signal level differentiate these models from traditional fusion methods that operate on abstracted features. These biologically plausible multimodal SNN models can contribute to a more sophisticated understanding and emulation of human sensory information processing. Moreover, they have the potential to model and interpret human perceptual confusion phenomena arising from interactions between senses. For instance, research could explore using SNN models to explain illusions caused by mismatches between visual and auditory information, such as the McGurk effect [70].

Despite their potential, SNNs are often overlooked in favor of models that rely on large numbers of parameters and high computing resources, as SNNs currently lag behind in terms of scalability, performance, and ease of development. However, once developed, SNNs consume significantly less power, making them environmentally friendly and capable of achieving sufficient performance with fewer parameters. These characteristics make SNNs well suited to addressing the energy demands of future AI technologies and to the development of local embedded AI that mitigates security and latency concerns, allowing them to contribute significantly to the sustainable development of AI. The development of AI technology is at a crossroads. While the pursuit of ever-larger models with massive computational requirements has yielded impressive results, it has also raised concerns about energy consumption and sustainability. SNNs offer an alternative path forward, one that prioritizes efficiency and environmental responsibility without sacrificing performance. By focusing on energy-efficient, locally executable models like SNNs, we can create a more sustainable and accessible AI ecosystem. This will require ongoing research and collaboration across disciplines, as well as a willingness to challenge the prevailing paradigms of AI development.

However, challenges remain before SNNs can be widely applied in practical settings. Key issues include improving SNN model learning efficiency, optimizing spike encoding methods, and implementing low-power neuromorphic hardware. Overcoming these challenges will position SNNs as the next-generation auditory information processing technology, surpassing current limitations.

4 Conclusion

This review examined the latest research trends and achievements of SNNs in auditory signal processing applications. By emulating the information processing methods of biological neural networks, SNNs possess the potential to overcome the limitations of existing AI technologies. Notably, SNNs have demonstrated excellent performance in sound localization and speech recognition and have shown the possibility of more sophisticated modeling of human auditory information processing mechanisms. However, for SNN technology to reach the application stage, various technical challenges must be addressed, including model learning efficiency, spike encoding, and hardware implementation. Overcoming these challenges requires interdisciplinary collaborative research among experts in neuroscience, semiconductor engineering, and algorithms.

By expanding the horizons of SNN research and breaking down technological barriers, we can implement artificial intelligence systems that approximate human auditory cognition capabilities. This goes beyond mere technological advancement; it is also an intellectual challenge that elevates our understanding of human intelligence. We anticipate that SNN research will open new frontiers in artificial intelligence technology across various domains, including the auditory field.