1. Introduction
Machine-learning (ML) applications based on neural networks are expanding dramatically. A neural network consists of layers of nodes and weights that indicate the strength of the connections between those nodes. ML is used in various fields, and networks are growing in size and depth so that they can cope with complex inputs. In deep learning, as the number of layers increases, the number of weights can approach billions. Such large-scale ML computation must be processed on a server. However, edge devices that interact with humans in the internet of things (IoT) are built around microcontroller units (MCUs). MCUs used in typical edge devices have a power consumption on the order of µW/MHz and hundreds of kilobytes of flash memory. Once software (i.e., firmware) is flashed into memory, the MCU repeats the same operation until it is reprogrammed. As the name indicates, edge devices sit at the edge of the IoT network, where they sense and collect data through interactions with objects. To collect vast amounts of real-world data, edge devices must be able to operate in various locations and under various conditions. Moreover, the available resources, e.g., memory and power, are meager, especially for IoT edge devices, because they are hard to access and operate in harsh environments. The farther an edge device sits from the center of the network in order to collect data from close interaction with the real world, the tighter its constraints become, such as insufficient power or a small memory size. Therefore, when artificial intelligence applications are run on conventional edge devices, the collected data are typically transmitted to a resource-rich server that performs the ML operations, and the results are then sent back to the edge device.
In this paper, we focus on the overhead of preprocessing data and passing it as input to the network model when an edge device executes an ML application. In a speech-recognition application, if raw audio samples from the microphone are passed directly to the model, the waveform contains too much data, and excessive resources are required for computation. In addition, when different people pronounce the same word, the classification result should be the same, yet their raw audio waveforms have different shapes. Edge devices therefore use preprocessing techniques, such as the fast Fourier transform (FFT), to reduce the amount of computation and improve the model’s classification performance. In this paper, we propose a technique that aims to improve the operation speed by moving preprocessing operations, previously performed in software, to custom-designed hardware, as shown in
Figure 1. In speech recognition, windowing can truncate the syllables of a word and lead to incorrect results; to prevent this, classification is repeated many times over a certain time interval, and accuracy increases with the number of classifications. In the existing application, the entire process of audio data collection, preprocessing, and classification by a model is performed in the MCU’s software. In this paper, an inter-integrated circuit sound (I²S) hardware module was custom designed for collecting and delivering sound data to an edge device. Moreover, the I²S module performs part of the preprocessing in the voice-recognition application. Preprocessing of raw audio data mainly uses digital signal processing (DSP) operations. DSP operations are faster and more efficient when performed in parallel in hardware, which reduces the time taken to preprocess audio data and thus allows more classifications to be performed in the same period. In addition, energy consumption can be reduced if a custom I²S module optimized for voice recognition is used. This approach is a software and hardware co-design for running ML applications efficiently on embedded edge devices.
To evaluate the proposed method, we used an MCU board with a 32-bit ARM Cortex-M4 processor as the edge device. The custom-designed, DSP-embedded I²S module was implemented in a Xilinx field-programmable gate array (FPGA). A TinyML program was modified to run on bare metal to implement an ML application on the MCU evaluation board. The audio data delivered to the edge device were transferred from the external FPGA via the I²S communication protocol. In the baseline for comparison, the I²S module only delivers audio data, and the ML application, including the preprocessing of raw data, runs in software. When preprocessing is included in the custom I²S module, the protocol used for data transfer is still I²S, except that the transferred data have already been preprocessed. The computation time and energy consumed in the two cases were compared.
The rest of the paper is organized as follows.
Section 2 describes related work on the TinyML approach and audio recognition using ML.
Section 3 presents the overall structure of the hardware and software co-designed audio-classification system.
Section 4 describes the system’s implementation and applies the proposed method to a word-classification application using an embedded board.
Section 5 concludes the paper.
2. Related Works
Recently, an on-device ML paradigm, in which the edge device itself processes data, has been gaining attention [1,2,3,4]. This paradigm reduces both the potential security problems that arise when data are transmitted from the edge device to the server and the energy consumed by data communication.
Much research has already been conducted on adding dedicated hardware, such as a co-processor, to the central processing unit of a device on which ML applications are executed. Furthermore, recent ML studies have focused on resource-rich servers or personal computers that support graphics processing units (GPUs) [5]. However, the conditions of the models used in a server differ from those used in an MCU. Even a model that is only several megabytes in size is difficult to use in MCUs [6]. Various conditions must be considered before ML operations can be performed on resource-constrained edge devices, such as the size of the code to be written to the flash memory and the configuration of the neural network layers. Therefore, the TinyML paradigm arose, and research is being conducted on running ML applications efficiently on MCUs. TinyML is a paradigm that focuses on compressing neural network models, rather than on handling complex inputs or maximizing performance, to enable ML applications on MCU-based edge devices [7,8,9,10,11,12]. Furthermore, TinyML allows a model trained in an ML framework and programmed in a high-level language such as Python to be converted into a C/C++ program by separating it into an interpreter and weights. This process involves quantization or precision reduction to fit the edge device’s code-memory size and memory usage [13].
In addition, for mobile devices, which have fewer resources and lower performance than PCs or servers, research is being conducted on accelerating ML applications using FPGAs or GPUs [14,15]. A mobile device with hundreds of megabytes of storage is sufficient to use the network output of the TensorFlow Lite framework. Hardware-assisted optimization of ML applications, using an FPGA or GPU, is particularly effective on mobile application processors that include a GPU [16,17]. One study approached the acceleration of on-device neural network inference from a framework perspective: it added an interface layer to TensorFlow Lite that mobile applications can access, allowing the GPU embedded in mobile devices to be used for ML inference [18]. Furthermore, many studies have accelerated neural network models by implementing the entire model on an FPGA [19,20], which reduced the time required for inference. The multiply-accumulate operations of a neural network may run faster when performed in parallel in hardware than when executed instruction by instruction in a processor. Other studies of TensorFlow Lite focused on compressing a deep learning model optimized for PC resources so that it can be used on mobile devices, or on minimizing the amount of data communication with the cloud or server. However, neural network models for mobile devices cannot be directly applied to edge devices. Edge devices often lack sufficient flash memory to store the code and have no GPU. Therefore, to deploy to tiny edge devices, the TinyML approach should be used rather than TensorFlow Lite for mobile.
In our previous work, we proposed a prototype of a partial firmware replacement for a run-time network model [21,22,23,24,25]. This technique involves selecting a model corresponding to the input data domain on a memory-limited device. Another technique was described for automatically generating a suitable convolutional neural network model [26,27,28,29]. This technique helps to overcome memory constraints in MCUs. Additionally, from the same perspective, an efficient network suitable for edge devices has been proposed, in which the memory used for inference is reduced by operator reordering. This approach focuses on reducing the memory used by the network model when the MCU runs ML applications. A technique has also been proposed for reducing the precision of parameters while minimizing the loss of network accuracy [30]. This technique helps reduce the memory occupied by ML applications.
The methods mentioned so far optimize embedded memory usage to match the MCU’s conditions. In this paper, we focus on reducing the computation time and energy by combining hardware with a model whose memory size is already optimized for the MCU. Several researchers have combined hardware accelerators with ML applications [31]. However, only a few studies have been conducted at the level of embedded systems, as opposed to accelerators used with high-performance resources. In this paper, instead of accelerating ML computation, preprocessing hardware is used in the data-acquisition module during the data-input process.
3. Proposed Architecture
3.1. TinyML
TinyML can be defined as a combination of hardware and software algorithms for ML that operates in the mW power range. TinyML targets battery-powered embedded systems. Most frameworks for designing ML applications are implemented in Python, but this high-level language cannot be used in embedded systems. The toolchain used for embedded systems often supports only C/C++. Compiled code, called firmware, is written to the nonvolatile flash memory of the embedded system, which then performs fixed operations until the firmware is updated. The TinyML framework is therefore implemented in C/C++ so that ML applications can be used in embedded systems.
The process of developing a TinyML application can be divided into two steps: generating a trained model, and developing firmware to be executed in an embedded system based on the generated model. The neural network model is generated and its weights are trained on a host machine, as shown in Figure 2b. In this paper, the TinyML framework used is TensorFlow Lite for MCU, which was released by Google [32]. When using TensorFlow in Python, typically on a PC or server, one writes code to configure and train the network architecture. Because the output of TensorFlow is large, it cannot be fitted into the limited memory of embedded systems. TensorFlow Lite, in contrast, is a framework for creating models in the range of hundreds of megabytes, which can be used on mobile devices. Embedded systems can use small models in the form of C/C++ arrays that take up several kilobytes, created with the conversion functions provided within the framework. Additionally, TensorFlow Lite quantizes the weights to reduce the model’s size, for example by converting 32-bit floating-point weights to 8-bit integers. The model is trained in the TensorFlow Lite environment, converted into a C/C++ array, and then used in the MCU-based embedded system. In the training process, configuring the layer structure and evaluating the accuracy of the trained model are similar to the flow used in TensorFlow.
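The weight quantization mentioned above can be illustrated with a minimal sketch. It assumes a generic affine (scale and zero-point) mapping from 32-bit floating-point values to 8-bit integers; the function names `quantize_int8` and `dequantize` are hypothetical, and the scheme is not claimed to be the exact one applied by TensorFlow Lite.

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 with an affine (scale, zero-point) scheme (assumed, generic)."""
    w = np.asarray(weights, dtype=np.float32)
    # The representable range must contain 0.0 so that zero maps to an exact integer.
    w_min = min(float(w.min()), 0.0)
    w_max = max(float(w.max()), 0.0)
    scale = (w_max - w_min) / 255.0 or 1.0   # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float32 weights from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
print(q.nbytes, weights.nbytes)   # int8 storage is a quarter of float32 storage
```

In this sketch the storage drops by a factor of four, and the reconstruction error of each weight stays within one quantization step (`scale`).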
The next step is to develop the firmware that applies the model, created as an array, to the TinyML application. In the TinyML application, the model and interpreter are separated, and the interpreter executes the model in the form of a C/C++ array. The application’s execution is described at the bottom of Figure 2c. The input data to be used in the ML application may be unprocessed, like an image input to a convolution layer, or may require preprocessing, like audio waveform data. First, the input data are passed to the interpreter, which configures the number and types of layers of the neural network. Then the model is combined with the interpreter to perform inference on the input data. Once the inference is complete, the application must interpret the model’s output and respond. When classification is performed on a single image, the class from the model’s output layer can be interpreted directly as the result. However, for an application that recognizes continuous speech, inference is performed several times within a short period, and the outputs are interpreted together. Finally, the cycle is completed when the device running the application performs a predetermined response according to the results. These techniques can be mapped to the IoT layers: in the physical layer, the windowing step of raw audio preprocessing is accelerated; in the communication layer, data are received from outside using the I²S communication protocol; and in the application layer, the TinyML audio-classification firmware is executed.
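The idea of interpreting several consecutive inference results together for continuous speech can be sketched as follows. This is only an illustration under stated assumptions: the class name `ScoreSmoother`, the window of three inferences, the 0.6 threshold, and the label order are all hypothetical and are not taken from the application described here.

```python
from collections import deque

class ScoreSmoother:
    """Average the most recent score vectors and report a label only when it is confident."""

    def __init__(self, labels, window=3, threshold=0.6):
        self.labels = list(labels)
        self.history = deque(maxlen=window)   # keeps only the last `window` inferences
        self.threshold = threshold

    def update(self, scores):
        """Add one inference result; return the winning label or None if below threshold."""
        self.history.append(list(scores))
        n = len(self.history)
        avg = [sum(col) / n for col in zip(*self.history)]
        best = max(range(len(avg)), key=avg.__getitem__)
        return self.labels[best] if avg[best] >= self.threshold else None

smoother = ScoreSmoother(["silence", "unknown", "yes", "no"])
first = smoother.update([0.1, 0.1, 0.7, 0.1])    # one confident inference
second = smoother.update([0.4, 0.3, 0.2, 0.1])   # the average drops below the threshold
```

Averaging in this way trades a short delay for robustness against a single inference in which a word was cut by the window boundary.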
In this paper, we focus on the aforementioned preprocessing stage of the TinyML application in the edge device. When external data are input from the sensor, the software stored in the on-chip flash memory does not perform the entire preprocessing; instead, the preprocessing is partitioned into a part executed in hardware and a part executed in software. Because hardware is well suited to parallel data processing, some preprocessing computations are performed in hardware to reduce overhead.
3.2. Audio Classification
In this paper, the implemented TinyML application classifies words in audio data. Unlike image classification, finding words in continuous raw audio data requires a substantial amount of preprocessing. A typical sampling frequency in the kHz range produces a large amount of raw audio data over time, but the network model used for classification cannot receive an input of that size. Therefore, features are extracted through several preprocessing steps. Among the various kinds of audio classification, the case used in this paper is word classification, which recognizes specific words in the speaker’s speech.
The word-classification application recognizes specific words in a human voice. Each person’s voice has a different frequency, amplitude, and pronunciation time. Therefore, it is necessary to find the target word’s unique characteristics in the voice data sequence. Figure 3 shows the differences between the audio waveforms and spectrograms generated when different people pronounce the same word. Figure 3a shows the audio waveform when the word “yes” is pronounced, together with the two-dimensional (2D) spectrogram image converted from it. Figure 3b shows the audio waveform when the word “no” is pronounced. It is difficult to detect a word’s unique characteristics from an audio waveform, but these characteristics are visible in the spectrogram. The features of audio waveforms may differ even for the same word, depending on the starting point of pronunciation. Additionally, the waveform’s amplitude may vary depending on the person’s voice. In contrast, the high-power regions of the spectrogram (yellow) show a shape unique to each word in a 2D image of the same size. These unique shapes are marked with red dotted lines in Figure 3. For the word “yes”, the shape bends to the right, and for the word “no”, it forms a dense region on the left. In ML training and inference, a spectrogram can yield better performance because these features are less affected by the characteristics of an individual voice. Moreover, a spectrogram reduces the amount of computation performed in the neural network by compressing the raw audio data. Among neural network structures, the convolution layer specializes in processing input images, so various applications convert audio into spectrogram images and use them as input to the model.
Figure 4 shows the process by which the spectrogram is generated. When an FFT is performed on a windowed segment of the waveform, the signal becomes discontinuous at the window boundaries. These discontinuities can introduce harmonic components into the FFT result that do not exist in the original waveform. Therefore, Hann windowing [33] is applied to the waveform, as shown in Equation (1). The maximum sample size is N.
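As a minimal sketch of the windowing step, the following code tapers one frame with a Hann window before the FFT. It assumes the common symmetric form w[n] = 0.5 (1 − cos(2πn/(L − 1))) for a frame of L samples; this may differ in indexing from Equation (1), and the helper names `hann_window` and `apply_window` are hypothetical.

```python
import numpy as np

def hann_window(length):
    """Symmetric Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / (length - 1)))."""
    n = np.arange(length)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (length - 1)))

def apply_window(frame):
    """Taper one frame of samples so that both ends fall to zero before the FFT."""
    frame = np.asarray(frame, dtype=np.float64)
    return frame * hann_window(len(frame))

# A constant frame jumps abruptly at both boundaries; after windowing the ends are ~0.
frame = np.ones(80)
tapered = apply_window(frame)
spectrum = np.abs(np.fft.rfft(tapered))   # magnitude spectrum of the tapered frame
```

Because the tapered frame starts and ends near zero, the boundary discontinuity that would otherwise leak into the spectrum is suppressed.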
Then, the mel-frequency scale, a nonlinear function, is applied to average adjacent frequencies into a downsampled second array. This scale, shown in Equation (2), gives more weight to low-frequency elements, reflecting the characteristics of human sound perception, and merges high frequencies. At this point, one row of the spectrogram has been generated. The number of rows generated for the spectrogram equals the number of times this operation is repeated for an audio data sequence of a certain length.
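The mel-scale downsampling can be sketched as below. The sketch assumes the widely used conversion m = 2595 log10(1 + f/700), which may not be the exact form of Equation (2), and it uses simple rectangular averaging per band. The 40 bands match the spectrogram width used later in this paper, whereas the 125 to 7500 Hz range and the 31.25 Hz bin spacing are illustrative assumptions.

```python
import math

def hz_to_mel(f_hz):
    """Common mel conversion (assumed form): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(num_bands, f_min_hz, f_max_hz):
    """Return num_bands + 1 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / num_bands
    return [mel_to_hz(lo + i * step) for i in range(num_bands + 1)]

def average_into_bands(magnitudes, bin_hz, edges_hz):
    """Average the FFT magnitudes whose bin frequency falls inside each band."""
    out = []
    for lo, hi in zip(edges_hz, edges_hz[1:]):
        band = [m for k, m in enumerate(magnitudes) if lo <= k * bin_hz < hi]
        out.append(sum(band) / len(band) if band else 0.0)
    return out

edges = mel_band_edges(40, 125.0, 7500.0)             # assumed frequency range
row = average_into_bands([1.0] * 257, 31.25, edges)   # one 40-value spectrogram row
```

The band edges are narrow at low frequencies and widen toward high frequencies, so low-frequency detail is preserved while many high-frequency bins are merged into each output value.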
3.3. Custom I²S Module
The edge device is assumed to be an embedded system with an MCU that has an operating frequency of about 100 to 200 MHz. Therefore, to run a TinyML word-recognition application on an edge device, features must be extracted from continuous speech waveforms to reduce the amount of computation to a level that the low clock frequency can sustain. In the conventional approach, the hardware module for collecting voice data, a peripheral of the edge device, is considered separately from the ML software application. This approach keeps the design simple, which is an advantage because the hardware and software can be designed and tested separately. However, if all of the computations and operations except data collection are processed in software, the time required may increase because the Von Neumann architecture requires instructions to be fetched. Most of the audio-preprocessing operations for voice-recognition applications are DSP operations, such as the FFT. DSP operations consist mainly of additions and multiplications, so it is more efficient to process them in hardware than with sequences of instructions.
In this paper, the implemented I²S module adds windowing logic based on the Hann window coefficients to the conventional structure, as shown in Figure 5a. The conventional I²S module is a communication module for sound interfaces that supports stereo or mono channels and may act as a master or slave. Figure 5b shows the synthesized netlist schematic of the designed custom I²S module. Additionally, the I²S module is capable of both transmitting and receiving sound. This custom I²S module is designed to perform the windowing function, a part of the word-classification preprocessing, in hardware. Time and energy overhead can be reduced by processing in parallel in hardware rather than in instruction-fetch-based software. The proposed approach involves receiving a voice signal from an external microphone and transmitting it to the embedded system in which the TinyML application is executed. The I²S module requires a serializer/deserializer that deserializes the received data bit by bit and re-serializes the data for transmission, because the protocol uses one data line each for transmission and reception. The standard I²S module does not manipulate the input pulse-code modulation (PCM) audio data. However, our custom-designed I²S module performs Hann windowing on each sample before transmitting the data received from the microphone. When the window boundary is determined, each sample is multiplied by the coefficient at the matching index of the Hann window function, and the windowing is implemented at the register-transfer level (RTL). The coefficients of the window function are in floating-point format, but fractional values are difficult to compute in RTL. Hence, each coefficient is quantized as a sum of powers of 2. As data arrive, they are multiplied by the coefficients in sequence, and the results form an output sequence in first-in, first-out order. When running TinyML applications, if a part of the preprocessing performed in the embedded system is moved to hardware, the logic area or power consumption of that hardware may increase. However, if the increase is less than the resource savings in the embedded system, the overall design can be considered more efficient.
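The multiplication by a coefficient expressed as a sum of powers of 2 can be modeled in software with shifts and additions only. The sketch below is a behavioral illustration, not the RTL of the custom module: the function names are hypothetical, five fractional bits are assumed as the default, and the window is assumed to restart every `len(coeffs)` samples without overlap.

```python
from collections import deque

def coefficient_shifts(coeff, bits=5):
    """Return shift amounts j such that coeff ~= sum(2**-j for j in shifts), 0 <= coeff <= 1."""
    k = round(coeff * (1 << bits))            # integer numerator of coeff over 2**bits
    return [j for j in range(bits + 1) if (k >> (bits - j)) & 1]

def shift_add_multiply(sample, shifts):
    """Multiply an integer sample by the quantized coefficient using shifts and adds only."""
    return sum(sample >> j for j in shifts)

def window_samples(samples, coeffs, bits=5):
    """Scale each incoming sample by the coefficient at its window index; output keeps arrival order."""
    table = [coefficient_shifts(c, bits) for c in coeffs]
    fifo = deque()
    for i, sample in enumerate(samples):
        fifo.append(shift_add_multiply(sample, table[i % len(table)]))
    return list(fifo)

# 0.75 = 2**-1 + 2**-2, so the product is (sample >> 1) + (sample >> 2).
shifts = coefficient_shifts(0.75)
product = shift_add_multiply(1000, shifts)
stream = window_samples([1000, 1000, 1000, 1000], [0.0, 0.75, 0.75, 0.0])
```

Each term of the sum corresponds to a fixed wire shift in hardware, so no multiplier or floating-point unit is needed; the cost is one adder input per nonzero bit of the quantized coefficient.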
4. Experimental Results
The experiments with the proposed structure were conducted on an embedded board that ran the TinyML application and an FPGA on which the custom I²S module was programmed. The STM32F4-Discovery embedded board, based on an ARM Cortex-M4 32-bit MCU, was used as the TinyML application execution system. Table 1 shows the specifications of the board. The FPGA was a Zynq-Z7 board that combines an ARM Cortex-A9 processor with a Xilinx 7-series FPGA [34]. The neural network used in the TinyML application on the embedded board was tiny conv [35,36].
The network consists of three layers. The first layer performs a convolution operation on a 49 × 40 spectrogram image with eight 10 × 8 filters. In the second layer, the images generated by the filters are fully connected to the output layer. The third layer is the output layer, which has four classes: the two trained words, silence, and unknown. The softmax function was used to sharpen the differences between the results. The Speech Commands dataset was used to train the above network [37]. This open-source dataset contains over 100,000 WAV files, each 1 s long. The word pair trained in this application was “yes” and “no”.
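For reference, the softmax step over the four output classes can be written as the following sketch. The label order and the example scores are hypothetical; only the softmax function itself is standard.

```python
import math

LABELS = ["silence", "unknown", "yes", "no"]   # assumed order, for illustration only

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, prob = classify([0.5, 0.2, 3.0, 1.0])   # made-up raw scores
```

The exponential widens the gap between the largest raw score and the others, which is what makes the winning class stand out in the output layer.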
Afterward, the FFT is applied to the audio data to generate the first frequency array. In the audio-classification application used in this research, the decision on whether a word is recognized is made at 1-s intervals. Preprocessing the entire 1-s period at once would incur a large overhead: at a sampling frequency of several kHz, the data collected over 1 s are too large for signal processing such as the FFT. Therefore, a small window is moved across the data in steps of the stride, and each position yields one frequency bucket. The results of the Hann windowing and FFT operations are stored until 1 s of results has accumulated.
If the period of decision is T, the window size is W, and the stride size is S, then the number of frequency buckets is determined as the minimum i that satisfies Equation (3).
The determined number of frequency buckets affects the number of input layer nodes in the neural network.
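As a rough illustration of how T, W, and S determine the number of window positions, the sketch below uses the standard count of complete windows, floor((T − W)/S) + 1. This is a general formula and is not claimed to be Equation (3). The 30 ms window and 20 ms stride in the example are assumed values, chosen only because they yield the 49 rows of the 49 × 40 input described above.

```python
def num_window_positions(period_ms, window_ms, stride_ms):
    """Count the complete windows of length window_ms that fit in period_ms at the given stride."""
    if window_ms > period_ms:
        return 0
    return (period_ms - window_ms) // stride_ms + 1

# Assumed example values: T = 1000 ms, W = 30 ms, S = 20 ms.
rows = num_window_positions(1000, 30, 20)
```

A shorter stride increases the number of positions, and therefore both the number of FFT operations per decision and the number of input nodes of the network.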
Figure 6 shows the measured share of the main loop taken by the front-end (preprocessing) functions when the TinyML application is executed on the embedded board. The front-end functions, which account for 78.8% of the feature-generate function, include windowing, FFT, and log scaling, as described in Figure 4. The feature-generate function occupies 81% of the main loop. In this experiment, the windowing function is handled in hardware.
The I²S module was implemented in RTL on the FPGA board. Voice was input through a microphone connected to the I²S module, and the embedded system received audio data using the I²S protocol. Thus, the audio data received by the embedded system had already passed through the windowing function in the FPGA. Floating-point operations would be required to compute with the coefficient values of the Hann window array. However, adding a floating-point unit for the windowing calculation would require excessive area. Therefore, we quantized the coefficients as sums of negative powers of 2. Because the Hann coefficients given by Equation (1) take values from 0 to 1, their fractional part can be expressed as a sum of negative powers of 2. In this quantization technique, the error from the original coefficient depends on the most negative power of 2 that is used. As the number of negative powers of 2 increases, the resolution of the fractional part increases, so the quantization error decreases. The hardware usage is determined by the length of the Hann window function, i.e., the number of coefficients. To apply the sliding Hann window to each sample of the audio input, the coefficients need to be stored in the hardware. In the current implementation, the cell area after synthesis tends to be large because combinational and sequential logic is used to hold the coefficients instead of a separate storage element such as RAM.
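The relationship between the number of negative powers of 2 and the quantization error can be checked numerically. The sketch below assumes a symmetric Hann window and rounding to the nearest multiple of 2^−bits; the rounding mode used in the actual hardware is not stated here, so this is an assumption, and the function names are hypothetical.

```python
import math

def hann(n, length):
    """Symmetric Hann coefficient at index n (assumed form)."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (length - 1)))

def quantize(coeff, bits):
    """Round coeff to the nearest multiple of 2**-bits."""
    return round(coeff * (1 << bits)) / (1 << bits)

def max_quantization_error(length, bits):
    """Largest absolute difference between exact and quantized coefficients."""
    return max(abs(hann(n, length) - quantize(hann(n, length), bits))
               for n in range(length))

# Worst-case error for an 80-coefficient window at 5 to 10 fractional bits.
errors = {bits: max_quantization_error(80, bits) for bits in range(5, 11)}
```

With rounding to nearest, the worst-case error is bounded by half a step, 2^−(bits+1), so each additional power of 2 halves the bound at the cost of more adder logic per coefficient.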
When synthesizing and implementing RTL on the Zynq FPGA, Xilinx Vivado maps the design onto resources that are already placed on the chip and routes between them. Because each FPGA has different specifications and resources, utilization and area figures obtained this way may not be accurate. Therefore, in this evaluation, the cell area was measured after RTL synthesis of the custom I²S module using Synopsys Design Compiler.
Figure 7a shows how the cell area of the custom I²S module changes with the number of coefficients, which is set by the length of the Hann window, and with the maximum number of powers of 2, which is set by the quantization precision. Figure 7b is the FPGA tool synthesis result for the smallest case, in which the coefficient window length is 10 and the quantization degree is −5. Figure 7c is the largest case, in which the coefficient window length is 80 and the quantization degree is −10. Since the Hann window coefficients take values from 0 to 1, the powers of 2 are negative. The number of coefficients was varied from 10 to 80, corresponding to window sizes from 0.625 to 5 ms. The quantization degree represents the number of negative powers of 2 used to express the fractional coefficients, and it was varied from −5 to −10.
As a result, the synthesized cell area increased as the window (coefficient) length and the magnitude of the quantization degree increased. To minimize the additional hardware usage, the coefficient length and quantization degree should be kept low. However, if the window coefficient length is too short, the window becomes smaller, and the number of FFT operations performed in the MCU software to recognize a single word increases. Likewise, if the quantization degree is lowered and the quantization becomes coarse, the result may deviate from that of the original Hann window function. Considering this trade-off, it is important to choose values suitable for the application. In this paper, the Hann window array implemented in RTL has a length of 80, which corresponds to a window size of 5 ms at 16 kHz. This dimension matches the input layer of the neural network and the sampling frequency of the audio data converted into the spectrogram. The quantization degree is −5; thus, the maximum resolution of the quantization is 2^−5 (0.03125).
Table 2 compares the conventional and proposed structures in terms of the execution time of the functions, the energy consumption of the embedded board, and the area after synthesis of the RTL implemented on the FPGA board, when the windowing process is removed from the TinyML application. In the row corresponding to the windowing function, a decrease of 0.219 ms occurred, excluding the time required for profiling. Furthermore, the execution time of the other preprocessing functions also decreased slightly. Additionally, the energy consumption was reduced by 3.27% compared with the conventional case. Energy consumption was measured using the current visualizer of the Microchip Power Debugger. However, for the proposed structure, the hardware area of the I²S module increased by 93.39%. The size of the window function for the audio data sequence affects the area increase of the I²S module: as the window grows, the number of coefficients to be stored inevitably increases. Therefore, it is important to select an appropriately sized window according to the characteristics of the application to which the proposed structure will be applied.