1. Introduction
According to the World Health Organization (WHO) [1], about of adults over 65 experience a fall each year, and over 300,000 individuals may die annually from these types of incidents. Based on the assessments of fall victims [2], falls were identified as one of the primary causes of injury in this age group, with at least one further fall occurring within the next six months for of the elderly who had already fallen. The injuries caused by this sort of incident can lead to major trauma and fractures, among other consequences, in addition to psychological issues and potential future psychological stress [3]. Accidents following a fall are often fatal if the person is not found and rescued in time [4]. From the discussion above, it is clear that research on fall-detection methods is crucial to lowering the number of accidents and their health consequences among the elderly. The present work adopts a technological perspective, attempting to detect falls using wearable equipment without addressing medical, geriatric, or sociological issues. The design of an information and communication technology model would complete a fall-detection system, complementing the current study [5].
Advances in sensor and signal processing technologies have paved the way for developing autonomous, body-based systems for monitoring activities and detecting falls. Sensor data are typically acquired and processed in a central unit to infer information about the individual’s posture. These platforms consist of different types of sensors, either at the body level (accelerometers, smartphones, or smartwatches) or in the environment (Passive Infrared Sensors (PIRs), microphones, or digital cameras) [6,7,8]. While the former are associated with monitoring the physical body, the latter cannot accompany an individual on trips outside their usual residential environment and can potentially raise privacy concerns; thus, they are not considered in this work. The first group is a field where artificial intelligence (AI) has been applied with some success [9]. In the literature, we find deep convolutional network structures [10], recurrent neural networks [11], and support vector machines [12] applied to the fall-detection problem. On the one hand, the reliability in detecting falls reaches significant values with the aid of AI methods, that is, high effective detection percentages in several works published in the literature. On the other hand, these results are obtained with simulation data reflecting the profiles of specific falls and can present high false-positive rates (alarm triggering without an actual fall).
The current work is thus framed within the study and development of algorithms for detecting falls in elderly populations based on AI and on-body measurements. Technically, it is based on readings of physical quantities describing the individual’s state, and it stands out for its innovation regarding where the data are intended to be processed. In addition to acquiring the usual on-body acceleration magnitudes, the computational platform, which is just a wearable device, also takes care of the inference of a pre-trained neural network, as opposed to central or edge computation [13]. As such, the system will only transmit in the case of a predicted fall. The main challenge is energy management, since the computational platform of a wearable is typically of low processing power and limited memory and should work with low duty cycles. In contrast, classical neural network inference usually relies on edge or cloud computing, with the consequent rise of privacy issues and a large amount of data to be transferred through communication channels [14].
The overall methodology comprises the following: (i) the proposal of a memory occupancy model for evaluating the memory required to implement the on-device inference of a Long Short-Term Memory (LSTM)-type neural network; (ii) a sensitivity analysis of the network hyper-parameters through a grid search procedure to refine the network topology; (iii) the proposal of a new methodology to quantize the network by reducing the memory occupancy of the hyper-parameter values. The concept of symmetry is also used to select the network structure to deploy on the embedded processor; that is, there is a symmetric relation between the model’s structure and the memory footprint on the embedded processor, as the number of parameters and the memory size mirror each other. Beyond the novelty of this general approach, it is shown that the proposed quantization method improves on the current state of the art.
While the Gate Disclosure technique was initially proposed in [15], the current manuscript greatly extends the methodology, considering a complete workflow that takes into account the embedded platform where the technique is to be implemented and its resource limitations. In addition to the overall methodology, which is original, two major novelties can be stated:
The memory modeling of the LSTM network, cross-validated with the network accuracy, as a tool to tune the LSTM size;
Quantizing the network with almost no degradation in accuracy so that it can be stored and deployed in an embedded environment by a low-power, low-complexity microcontroller.
The remainder of the work is organized as follows. Section 2 reviews the current state of the art with regard to applying AI to the fall-detection problem and to quantizing neural networks. Section 3 presents and analyzes a memory occupancy model for assessing the network memory footprint. Section 4 provides a sensitivity analysis of the network hyper-parameters through a grid search procedure, and Section 5 describes the quantization methodology in detail. Section 6 provides a description and a performance analysis of the results. Finally, Section 7 summarizes the main findings and addresses future work directions.
2. Related Work
In the last several years, deep learning methods have replaced threshold- or state-based algorithms for fall detection among the senior population [16,17]. Since the dynamics of the problem are unique to each individual, gathering data from the environment is not the best option and might limit freedom and portability, as stated in Section 1. Nevertheless, approaches with excellent accuracy are available in the computer vision literature [18]. In the context of hardware for acquiring angular velocity and/or acceleration, a wristwatch or smartphone is typically the first device that comes to mind. Nevertheless, when it comes to elderly individuals, these devices do not have the same widespread adoption and penetration potential as they have among younger users [19,20]. Despite this clear drawback, the study of environment-based settings is still quite active [21]. While there are many techniques for learning-based fall detection in the literature, recurrent neural network methods typically have the advantage of being more flexible in learning a large number of different features [22,23]. Other techniques include the hidden Markov model [24], decision trees [25], K-Nearest Neighbors [26], and the Support Vector Machine [27]. From a different perspective, one method that achieves the highest accuracy when applied to the fall-detection problem is the use of Attention-based Neural Networks. Although this approach reports accuracy as high as , the intricacy of the network makes its inference impractical on a low-power embedded device [28].
Obtaining a simplification that allows the models to be inferred more quickly is essential, both for the reasons mentioned above and because of the additional difficulties brought on by the growing complexity of machine learning algorithms. Quantization, the process of converting continuous real values into a smaller collection of discrete and finite data points, is the most prominent technique for reducing a model’s complexity. Since the bit width of the information is closely correlated with the memory occupancy, fewer computing operations must be performed, resulting in a reduction in the Central Processing Unit (CPU) power requirements, starting from the 32-bit floating-point representation obtained after the training stage. The Binary Neural Network, which operates with only 1-bit weights and activations [29], is the ultimate example of quantization. The quantization process can be applied either during or after training; the present work concentrates on quantization after training, known as Post-Training Quantization (PTQ). Using quantized hyper-parameters during training would entail Quantization-Aware Training (QAT), often utilizing fixed-point notation or half- or quarter-precision [30]. Quantization presents a hurdle because discretization by rounding is not continuous and, hence, has a mostly zero derivative; consequently, the standard gradient-based optimization techniques cannot be applied directly. Uniform quantization, in which a quantization function simply maps the input values onto equally spaced levels, is the most widely used and simplest method. Non-uniform quantization, on the other hand, maps the quantization steps (thresholds) as intervals with widths and bins determined by a mapping function. By allocating bits and discretizing the range of parameters in an irregular manner, non-uniform quantization is often more effective in capturing the information contained in the input values. Nonetheless, implementing non-uniform quantization techniques efficiently on common computing hardware, such as embedded microcontrollers, is usually challenging. Because of its ease of use and efficient translation to low-level code, uniform quantization is generally preferred over its non-uniform counterpart [31].
The absence of gradients has been addressed in the scientific literature in a number of ways [32]. In order to improve the quantizer with respect to uniform schemes, more recent work formulates non-uniform quantization as an optimization problem [33,34,35]. Also related, Stochastic Quantization (SQ) approaches exploit the stochastic characteristics of the data during quantization [36]. While deterministic methods cannot describe the inputs as accurately as optimization-based and SQ methods, the latter depend heavily on the stochastic features of the data, and their computational complexity rises, increasing the number of operations that an embedded microcontroller has to perform [22].
Keeping the previous discussion in mind, the following study applies a uniform quantization step to LSTM network architectures. The metrics employed and provided to the LSTM comprise 3-axis accelerations obtained using an Inertial Measurement Unit (IMU), which is assumed to be worn on the body as a bracelet [37]. To the authors’ knowledge, although the need for wearables in these types of applications is well known, there is no work proposing techniques that match AI (namely, LSTM) networks with low-power embedded processors. The methodology described and analyzed here shows a new way of deploying LSTM networks to embedded devices.
3. Memory Occupancy Model
Managing the data storage layout becomes critical when considering the embedded implementation of an LSTM network on a microcontroller. Firstly, the storage space is limited: whether the network structure is saved in non-volatile memory (flash, EEPROM, or other) or in volatile memory such as RAM, typical embedded microcontrollers only feature a few kilobytes or megabytes [38]. Secondly, the increasing size of the networks used for gathering more knowledge, together with the sparsity caused by optimization tools, implies higher latency and inference time [39,40]. It is thus highly pertinent to develop strategies for reducing the complexity of neural networks. Pruning, the process of deleting parameters from an existing neural network at a post-training stage, is one technique for network compression and could be a solution to minimize memory occupancy. Nonetheless, the sparsity of the resulting matrix operations increases latency through memory jumps and increases the floating-point operations (FLOPs), a metric that is often mistakenly neglected. Also, high pruning rates will usually strongly affect the network accuracy [41].
Regardless of the problem they address, neural networks are becoming increasingly complex, with progressively deeper architectures implying a high number of parameters (and a consequently large increase in memory) and growing latency. For example, an ESP32 Tensilica Xtensa LX7 dual-core 32-bit microprocessor from Espressif Systems features 512 kB of RAM and 384 kB of ROM. If it were applied to a computer vision problem with AlexNet (winner of the ImageNet Large-Scale Visual Recognition Challenge in 2012) [42], it would only support of the network’s 62,378,344 parameters with a 32-bit floating-point representation [14,43].
Let us first consider a standard recurrent neural network for time series classification. An LSTM unit comprises a forget gate, an input gate, an output gate, a candidate for the update stage, and the update stage itself (Figure 1). The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell through a cell state and a hidden state. Each of these gates is modeled as a non-linear function of the input signals, the previous hidden state, the cell state, and a constant value known as the bias. The following equations model the mathematical operations:

$$
\begin{aligned}
f_t &= \sigma\left(U_f x_t + W_f h_{t-1} + b_f\right)\\
i_t &= \sigma\left(U_i x_t + W_i h_{t-1} + b_i\right)\\
o_t &= \sigma\left(U_o x_t + W_o h_{t-1} + b_o\right)\\
\tilde{c}_t &= \tanh\left(U_c x_t + W_c h_{t-1} + b_c\right)\\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t\\
h_t &= o_t \circ \tanh\left(c_t\right)
\end{aligned}
\qquad (1)
$$

where the matrices $U_{\{f,i,o,c\}}$ are known as the Input Weights, $W_{\{f,i,o,c\}}$ as the Recurrent Weights, and $b_{\{f,i,o,c\}}$ as the Bias. The non-linearities are due to the sigmoid function $\sigma(\cdot)$ and the hyperbolic tangent $\tanh(\cdot)$. The operator ∘ denotes the element-wise product. The cell output corresponds to the last hidden state, $h_t$, when $t$ reaches the end of the input sequence. The inference process of the LSTM network consists of reading a set of accelerations ($a_x$, $a_y$, $a_z$) from an IMU and supplying it as a 3 × 1 input vector, identified as $x_t$ in (1).
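To make the inference step in (1) concrete, the following sketch (a minimal, framework-free Python implementation, assuming the trained matrices U, W, and b have already been exported and stacked gate by gate; the gate ordering and variable names are illustrative) runs one forward pass over a sequence of 3-axis acceleration samples and returns the final hidden state used for classification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, U, W, b, h_size):
    """Apply Expression (1) to a (T, 3) sequence of acceleration samples.

    U: (4*h, 3) input weights, W: (4*h, h) recurrent weights, b: (4*h,) bias,
    assumed stacked in the gate order [input, forget, candidate, output].
    Returns the last hidden state h_T, later fed to the FC classification layer."""
    h = np.zeros(h_size)                # hidden state h_{t-1}
    c = np.zeros(h_size)                # cell state c_{t-1}
    for x_t in x_seq:
        z = U @ x_t + W @ h + b         # all four gate pre-activations at once
        i_t = sigmoid(z[0 * h_size:1 * h_size])      # input gate
        f_t = sigmoid(z[1 * h_size:2 * h_size])      # forget gate
        c_tilde = np.tanh(z[2 * h_size:3 * h_size])  # candidate cell state
        o_t = sigmoid(z[3 * h_size:4 * h_size])      # output gate
        c = f_t * c + i_t * c_tilde     # cell state update
        h = o_t * np.tanh(c)            # hidden state update
    return h

# Example: 400 samples (2 s at 200 Hz) of 3-axis acceleration, 100 hidden units
rng = np.random.default_rng(0)
h_size = 100
U = 0.1 * rng.standard_normal((4 * h_size, 3))
W = 0.1 * rng.standard_normal((4 * h_size, h_size))
b = np.zeros(4 * h_size)
h_T = lstm_forward(rng.standard_normal((400, 3)), U, W, b, h_size)
print(h_T.shape)  # (100,)
```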
The memory footprint required by the LSTM cell corresponds to storing the matrices U, W, and b. While U depends on the input size, in this case the 3 dimensions of the acceleration, as well as on the cell size h, both W and b depend only on the cell size h. This parameter, also known as the Unit Size or the Cell Size, is the key feature of an LSTM and controls the amount and complexity of the information memorized by the LSTM cell [44]. The design stage of an LSTM network has to determine the number of units through trial and error, which is performed in this work by applying a grid search methodology.
Given the knowledge of the LSTM cell and its structure, a general case of the network’s global architecture to apply to the fall-detection problem is considered here, where the inputs correspond to the 3-axis accelerations (Figure 2).
The first step to consider is a normalization of the input values, which prevents vanishing gradient problems [45]. Secondly, one or multiple LSTM layers acting as a feature extraction stage are considered, where the number of layers and the number of units in each layer are parameters to be determined. The third block of the overall structure corresponds to one or multiple Fully-Connected (FC) layers that perform the classification, in this case, a “FALL” or an “ADL” (Activity of Daily Living). As such, the output of the FC layers will always consist of two cells, and their input will depend on the number of units in the last LSTM layer. Although, theoretically, only one FC layer would be needed, if the number of LSTM units is high, it may be advantageous for the training process to add more FC layers to make the transition smoother. Lastly, a SoftMax activation layer converts the vector of numbers from the FC output into a vector of probabilities.
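As an illustration of this pipeline, the short sketch below (hypothetical names; it reuses the lstm_forward function from the previous listing and random stand-ins for trained weights) chains input normalization, the LSTM feature extractor, a two-output FC layer, and a SoftMax, mirroring the block structure of Figure 2.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())             # numerically stable SoftMax
    return e / e.sum()

def classify(x_seq, U, W, b, h_size, fc_W, fc_b):
    """Normalization -> LSTM -> FC -> SoftMax, returning [P(FALL), P(ADL)]."""
    x = (x_seq - x_seq.mean(axis=0)) / x_seq.std(axis=0)  # input normalization
    h_T = lstm_forward(x, U, W, b, h_size)                # last hidden state
    logits = fc_W @ h_T + fc_b                            # FC: h_size -> 2 cells
    return softmax(logits)                                # vector of probabilities
```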
The memory occupancy required by the mentioned structure can be estimated using the following expression:

$$
\mathrm{Mem} = \sum_{i=1}^{L}\left(B_{b}^{i}\,n_{b}^{i} + B_{U}^{i}\,n_{U}^{i} + B_{W}^{i}\,n_{W}^{i}\right) + \sum_{j=1}^{F}\left(B_{Wfc}^{j}\,n_{Wfc}^{j} + B_{bfc}^{j}\,n_{bfc}^{j}\right)
\qquad (2)
$$

where $B_{b}^{i}$, $B_{U}^{i}$, $B_{W}^{i}$, $B_{Wfc}^{j}$, and $B_{bfc}^{j}$ correspond to the number of bytes used for the quantity representation in layer $i$ (or $j$) of the LSTM bias, input weights, and recurrent weights, and of the FC weights and bias, respectively, while $n$ denotes the corresponding number of parameters. The sum limits $L$ and $F$ are the number of LSTM and FC layers, respectively. Depending on the sensitivity analysis stage, which determines some parameters of the network, these representations might differ; this is one of the critical features of the proposed methodology.
The expression in (2) can be further refined by considering the number of units of the LSTM layers and a single FC layer as follows:

$$
\mathrm{Mem} = \sum_{i=1}^{L} 4\,h_i\left(B_{b}^{i} + B_{U}^{i}\,d_i + B_{W}^{i}\,h_i\right) + 2\,B_{Wfc}\,h_L + 2\,B_{bfc}
\qquad (3)
$$

where $h_i$ is the number of units of LSTM layer $i$, $d_1 = d$ is the number of inputs, $d_i = h_{i-1}$ for the subsequent layers, and the factor 4 accounts for the four gate weight sets of each LSTM unit. Here, the FC layer directly maps the $h_L$ units of the last LSTM layer to the two classification outputs. The model presented in (3) serves as a benchmark for selecting the adequate network topology (number of layers and their parameters) alongside the performance metrics of the classification problem.
Expression (3) shows that the highest-order term of the memory occupancy is quadratic in the cell unit number $h_i$, which makes the growth of the memory occupancy much steeper for high values of $h_i$.
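To make the model concrete, the sketch below is an illustrative Python implementation of Expression (3) under the parameter-count assumptions stated above (four gate weight sets per LSTM unit); it is not the authors’ original script. It evaluates the footprint of stacked LSTM layers plus a two-output FC layer so that candidate topologies can be checked against a given RAM budget.

```python
def lstm_fc_memory_bytes(units, d=3, bytes_per_value=4, fc_outputs=2):
    """Memory footprint (bytes) of stacked LSTM layers plus one FC layer.

    Per LSTM layer with h units: 4*h bias values, 4*h*d input weights, and
    4*h*h recurrent weights; the FC layer maps h_L units to fc_outputs cells."""
    total = 0
    d_i = d                                     # input size seen by the first layer
    for h in units:
        total += 4 * h * bytes_per_value        # biases b
        total += 4 * h * d_i * bytes_per_value  # input weights U
        total += 4 * h * h * bytes_per_value    # recurrent weights W (quadratic term)
        d_i = h                                 # next layer sees h features
    total += (units[-1] * fc_outputs + fc_outputs) * bytes_per_value  # FC weights + bias
    return total

# Two LSTM layers of 30 and 40 units, 32-bit floats, 3-axis acceleration input
mem = lstm_fc_memory_bytes([30, 40], d=3, bytes_per_value=4)
print(f"{mem} bytes ({mem / 1024:.1f} kB)")  # compare against the target MCU RAM budget
```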
Figure 3 represents a case where two LSTM layers are considered, with a single-precision floating-point representation (all B equal to 4 bytes) and an input size d = 3, as it would be when considering a fall-detection problem based on IMU acceleration measures. Also, to cross-reference the memory model with standard and typical embedded processors, two planes that represent of the available volatile memory on two processors are added. One plane regards an ultra-low-power 8-bit 8051-compatible Silicon Labs C8051F98x, featuring 8 kB Flash and 512 kB RAM, and the other an ultra-low-power 16-bit Microchip MSP430, featuring 512 kB Flash and 66 kB RAM [46]. From the intersection line of the surfaces (plotted in red), one can see that, for the MSP430, a combination of 10 and 50 cell units in the two layers can be deployed, with a middle conjunction between 30 and 40 cell units. When considering the C8051F98x, the supported network can be more complex, featuring a combination of 10 and 150 cell units, with a middle point of 100 cell units for each layer. It should be noted that the memory model is unambiguous (there is no uncertainty), as it maps the model’s parameters into a memory footprint based on the number of bits of the parameter representation and the model’s structure.
When considering two bytes for the representation of all quantities (bias and weights of both the LSTM and FC layers), the memory required by a given network is expected to shrink, allowing more complex networks to fit. This case could be obtained with a half-precision floating-point or a 16-bit integer representation. From Figure 4, one can see that the MSP430 could support a combination of 10 and 70 cell units for the two LSTM layers, with a middle interval of 50 cell units. The C8051F98x would support a combination of 80 and 200 cell units with a mid-range of 150 cell units.
The proposed model is used in the following section, in conjunction with an evaluation of the accuracy and other related metrics obtained using a public dataset of accelerations collected during different falls and diverse daily activities. It is worth mentioning that, in the event that more models coexist, for example, to perform another kind of detection, it would be enough to replicate the arithmetic. Another no less important relationship that can be drawn from the presented model concerns energy consumption and latency, which are very significant in the wearable context. Since the energy consumption of the CPU is directly correlated with the number of operations and can be quantified as a normalized energy cost or cycle count, the presented model can also be seen as a metric to evaluate the energy and latency efficiency of the embedded device [47].
4. Sensitivity Analysis
For simplicity and to highlight the proposed methodology, the topology of the network to analyze consists of the minimal requirements for a time series classification problem, that is, one LSTM layer for the feature extraction, one FC layer that has the task of classifying the features, and finally, a SoftMax activation layer that converts the vector of numbers from the FC into a vector of probabilities (Figure 5).
Since it is intended to implement the inference of the network on a low-power, low-complexity embedded microcontroller, the network is trained repeatedly while varying the number of cells of the LSTM layer from 1 to 200.
The SisFall dataset [37] is one of the most popular options for obtaining the data required for network training. Recordings of significant duration (between 10 s and 180 s, comprising ADLs and falls) and a wide diversity of emulated movements (19 classes of ADLs, including basic and sports activities, and 15 classes of falls) are included in the dataset. The 38 experimental volunteers span a wide age range. The number of samples, both of falls and of daily activities, is considered sufficient for the number of parameters to be tuned [48].
The first step consists of annotating the data used for training purposes. The SisFall dataset is publicly available as raw data, that is, samples acquired using an ADXL345 accelerometer (Analog Devices, Wilmington, MA, USA) at a sampling rate of 200 Hz. Using a MatLab® script, each data file is read keeping the sampling rate, but first adjusting the scale using Expression (4), considering the accelerometer’s full-scale range R (in g) and 13-bit resolution, as obtained from the manufacturer’s datasheet:

$$ a\,[\mathrm{g}] = \frac{2R}{2^{13}}\, a_{raw} \qquad (4) $$
Secondly, a 4th-order low-pass digital Butterworth filter with a cutoff frequency of 5 Hz is applied to the data samples to remove high-frequency noise. Thirdly, batch normalization is performed by rescaling the values to a distribution with zero mean and unit standard deviation as

$$ \hat{a} = \frac{a - \mu}{\sigma} \qquad (5) $$

where $\mu$ and $\sigma$ are the mean and the standard deviation of the samples.
Finally, a 2 s window is taken from the total number of samples, centered on the fall event when considering falls and centered on the recording interval when considering ADLs. This is because the number of samples differs between events, and some activities are acquired within a very long time frame. It is known that a fall event usually occurs within a window of 500 ms to 2 s [49]. Also, from each collection of falls and ADLs, is annotated as training samples and is reserved for validation.
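A minimal Python sketch of this pre-processing chain is given below; it assumes a raw SisFall recording already loaded as an integer array, and the accelerometer range and event index are illustrative placeholders rather than values taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200          # SisFall sampling rate [Hz]
RANGE_G = 16      # assumed accelerometer full-scale range [g] (placeholder)
RESOLUTION = 13   # ADC resolution [bits]

def preprocess(raw_counts, event_center_idx, window_s=2.0):
    """Scale raw counts to g, low-pass filter, normalize, and cut a 2 s window."""
    acc = raw_counts * (2.0 * RANGE_G) / (2 ** RESOLUTION)   # Expression (4)
    b, a = butter(4, 5.0, btype="low", fs=FS)                # 4th-order, 5 Hz cutoff
    acc = filtfilt(b, a, acc, axis=0)                        # zero-phase filtering per axis
    acc = (acc - acc.mean(axis=0)) / acc.std(axis=0)         # Expression (5)
    half = int(window_s * FS / 2)                            # 2 s -> 400 samples
    return acc[event_center_idx - half:event_center_idx + half]

# Example with synthetic data standing in for one SisFall file (N samples, 3 axes)
raw = np.random.randint(-2**12, 2**12, size=(6000, 3))
window = preprocess(raw, event_center_idx=3000)
print(window.shape)  # (400, 3)
```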
All the simulations are carried out with MatLab® R2022a under the Ubuntu operating system on an NVIDIA DGX Station (Santa Clara, CA, USA) with 2 Intel Xeon E5-2698 v4 CPUs (20 cores each), 256 GB of RAM, and 4 NVIDIA Tesla V100 GPUs. The training is performed with a batch size of 31, meaning that, with the 2852 training samples, 92 iterations occur per epoch. Also, the training runs for 15 epochs, considered sufficient for obtaining a reasonable accuracy without underfitting and yet not enough to induce overfitting. The validation frequency is set to 46 iterations, implying two validations per epoch. The Adam optimizer is used, with a learning rate of and a gradient decay factor of .
When the number of cell units h is considered, one can see from Figure 6 that, when the number of cells is close to one, the accuracy is relatively low. Nonetheless, after around cells, the accuracy enters a steady state, varying from to , as seen in Figure 6. The maximum accuracy occurs at h = 100, and, as such, there is no further advantage in increasing the number of cells. It should be noted that, when considering the 2 s window over the 200 Hz sampling frequency, a total of 400 samples represent the fall or the ADL event. Thus, the LSTM is clearly fulfilling its long-term-memory purpose, since the best accuracy is achieved with only 100 cells.
In addition to the accuracy and the loss metrics, in a problem such as fall detection it is important to analyze in what way the accuracy falls short of 100% correct predictions. A FALL being classified as an ADL, or vice versa, has different real-life consequences. There are only four cases for any classification result:
True Positive (TP): Prediction is a FALL, and there is, in fact, a FALL. This is a desirable situation;
True Negative (TN): Prediction is an ADL, and the subject is not falling. This is also a desirable situation;
False Positive (FP): The prediction is a FALL, but the subject performs an ADL. This is a false alarm and not desirable;
False Negative (FN): The prediction is an ADL, but the subject suffered a FALL. This is the worst situation and not desirable.
While both FP and FN are undesirable, one is worse than the other: when dealing with an FP, a second, more detailed check will correct it, whereas, when dealing with an FN, the person has fallen and may be injured, but no emergency response will be triggered. The related metrics are the Precision, formulated as $P = TP/(TP+FP)$, which evaluates “How many of those labeled as FALL are actually FALLs”, and the Recall, formulated as $R = TP/(TP+FN)$, which evaluates “Of all the samples that are FALLs, how many were correctly predicted”.
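For reference, these two metrics can be computed directly from the confusion matrix counts; the snippet below is a generic illustration in which the FP count is hypothetical (the TP and FN values match the numbers reported for Figure 8 later in the text).

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 338 of 345 actual FALLs detected (7 FNs); 25 false alarms is a made-up example
p, r = precision_recall(tp=338, fp=25, fn=7)
print(f"Precision = {p:.3f}, Recall = {r:.3f}")
```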
Figure 7 shows the Precision and the Recall versus the Accuracy.
When considering both P and R, one can see from Figure 7 that, although their behavior is not monotonic, both increase with the number of cells in the LSTM layer. Moreover, the Recall reaches its maximum value at the same cell unit number as the accuracy, which is desirable, as explained previously.
The results obtained here are considered for implementation in the case where the number of LSTM layer cell units corresponds to 100. The Confusion Matrix shown in Figure 8 provides some more details on what is happening with the predictions of the chosen architecture. Of the 345 actual FALLs, 338 are correctly predicted as FALLs, and only 7 are predicted as ADLs, representing about 2% of the cases. This is why the Recall reaches a value as high as approximately 98%. In this case, the Precision is lower ( ), but, as stated previously, this is a more tolerable situation.
Regarding the memory occupancy, as can be seen from Figure 9, although the network could be implemented on a C8051F98x, this is the only analyzed processor that could allocate the network, with a memory occupancy of . When considering 8-bit microcontrollers such as the STM8L or the 16-bit MSP430, the first would only support a network with 12 cells, and the second would only support 40. The STM32L would need all the memory for a network of 80 cells.
5. Proposed Methodology
This section begins with the implementation of one LSTM layer with 100 cells and one FC layer with 100 inputs and 2 outputs, based on the memory model presented in Section 3 and the sensitivity analysis given in Section 4. The mathematics of batch normalization is not covered here because it is regarded as a pre-processing step. Similarly, the SoftMax layer, which is regarded as a post-processing stage, is not involved. The purpose is to find the most effective technique to store the network hyper-parameters on an embedded microcontroller that would function as a wearable portable device.
In accordance with Figure 10, the quantization is treated as PTQ in this instance: a trained model is quantized before being stored, pending an inference procedure. The layer computations are then carried out once the hyper-parameters have been loaded and de-quantized. The de-quantization procedure receives a control word with characteristics established during the quantization process. An alternative approach would be to carry out the training while taking the quantized hyper-parameters into account, that is, Quantization-Aware Training (QAT). While that method is conceptually more natural, building a stable network with optimal performance becomes harder due to the added complexity of the training process [50].
To characterize the network, the histograms of the LSTM and FC weights and biases are analyzed (Figure 11). Also, an equivalent Normal Distribution is plotted, with the mean and standard deviation set from the hyper-parameters. Table 1 lists each layer’s maximum and minimum hyper-parameter values.
The results of the data analysis show that, after pre-processing with batch normalization, there is a high degree of similarity between the LSTM input weights and the LSTM recurrent weights (Figure 11). However, neither the FC weights nor the LSTM bias fit a normal distribution. It is important to note that, when examining the LSTM bias in further depth (Figure 11b), it combines four distinct sets of biases pertaining to the forget, input, output, and cell gates. Therefore, it is essential to understand how the distribution in Figure 11b is arranged with respect to the various gates, as seen in Figure 12. It is worth mentioning that, when dealing with a large number of parameters, a Normal Distribution among them is expected. This is due to the Central Limit Theorem, which states that, as the sample size tends to infinity, the sample mean becomes normally distributed. On the contrary, biases are known to be more stable, alternating between stable values [51,52]. This explains why Figure 11a,c,d are quite similar to the Normal Distribution while Figure 11b shows two distinct groups of values.
From Figure 12, it is clear that the unbalanced bias values in Figure 11b are due to the fact that the network keeps the input gate always active, while the same does not happen for the other gates. This situation is visible in Table 2, where the mean value of the input gate bias is close to one, and those of the remaining gates are close to zero.
The above discussion and the identified feature are the motivation for a global methodology to apply to the fall-detection problem. Based on the histogram information, one can see that uniform quantization is adequate for the way the hyper-parameters are distributed. Moreover, when individualizing the gate biases, that is, when considering a higher granularity, a simpler, hardware-friendly linear quantization can be applied to the pre-trained weights. The quantization function Q(r) takes the form of

$$ Q(r) = \operatorname{round}\left(\frac{r}{S}\right) + Z \qquad (6) $$

where r is the 32-bit floating-point number to quantize, S is a real-valued scaling factor, and Z is an integer zero displacement. The scaling factor S ensures that the numerical range of the original floating-point values is properly represented within the limited range of the quantized format, that is,

$$ S = \frac{r_{max} - r_{min}}{2^{b} - 1} \qquad (7) $$

where $r_{max}$ and $r_{min}$ correspond to the maximum and minimum values of the hyper-parameter set, respectively, and b is the number of bits used for the representation. The zero displacement Z is calculated as

$$ Z = q_{min} - \operatorname{round}\left(\frac{r_{min}}{S}\right) \qquad (8) $$

where $q_{min}$ corresponds to the minimum quantized value.
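The sketch below is a direct Python transcription of Expressions (6)–(8) for an arbitrary set of hyper-parameters; the function names, the signed integer grid, and the clipping step are illustrative choices rather than details taken from the authors’ implementation.

```python
import numpy as np

def calibrate(values, n_bits):
    """Return scaling factor S (7) and zero displacement Z (8) for one parameter set."""
    r_min, r_max = float(values.min()), float(values.max())
    q_min = -(2 ** (n_bits - 1))                 # assumed signed b-bit grid
    S = (r_max - r_min) / (2 ** n_bits - 1)
    Z = q_min - round(r_min / S)
    return S, Z

def quantize(values, S, Z, n_bits):
    """Expression (6): map 32-bit floating-point values onto b-bit integers."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    q = np.round(values / S) + Z
    return np.clip(q, q_min, q_max).astype(np.int32)

def dequantize(q, S, Z):
    """Inverse of (6): recover approximate floats as S * (Q(r) - Z)."""
    return S * (q.astype(np.float32) - Z)

# Round trip on a toy weight vector with an 8-bit representation
w = np.array([-0.42, -0.03, 0.11, 0.37], dtype=np.float32)
S, Z = calibrate(w, n_bits=8)
w_hat = dequantize(quantize(w, S, Z, n_bits=8), S, Z)
print(np.abs(w - w_hat).max())  # quantization error below one step S
```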
The overall methodology consists of calibrating the LSTM and FC layers while individualizing the gates of the LSTM, a procedure called here Gate Disclosure. The calibration stage delivers the $r_{min}$ and $r_{max}$ parameters for the quantization. Secondly, a uniform quantization is applied through (6), (7), and (8). The obtained network can then be stored in the memory of an embedded computer system.
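Building on the calibrate and quantize helpers of the previous listing, the following sketch illustrates the Gate Disclosure idea under the assumption (consistent with Figure 12) that the LSTM bias vector stacks the four gate biases; each gate slice gets its own (S, Z) pair instead of a single pair for the whole vector. The gate ordering and names are assumptions for illustration only.

```python
def quantize_bias_gate_disclosure(bias, h, n_bits=2):
    """Quantize an LSTM bias of length 4*h gate by gate (Gate Disclosure).

    Returns the quantized slices plus one (S, Z) control pair per gate, so the
    de-quantization step knows how to restore each gate's values separately."""
    gates = ("input", "forget", "candidate", "output")   # assumed stacking order
    quantized, controls = {}, {}
    for k, name in enumerate(gates):
        slice_k = bias[k * h:(k + 1) * h]
        S, Z = calibrate(slice_k, n_bits)                # per-gate calibration
        quantized[name] = quantize(slice_k, S, Z, n_bits)
        controls[name] = (S, Z)
    return quantized, controls

# With per-layer calibration, a 2-bit grid must span the whole bias range;
# with per-gate calibration, each gate's much narrower range gets its own grid.
```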
Figure 13 summarizes the proposed method. Initially, the network is quantized in an offline step. To obtain the quantized version based on the pre-trained LSTM model, a calibration is performed to obtain $r_{min}$ and $r_{max}$, the minimum and maximum values of the hyper-parameters. Secondly, the scaling factor S and the zero displacement Z are calculated through expressions (7) and (8). Next, the hyper-parameters are quantized with expression (6) and stored in the embedded microcontroller memory (Figure 13a). From here, the inference happens online: acceleration points are acquired, quantized, and fed to the quantized network (Figure 13b). Finally, a simulation stage is considered here to obtain performance metrics, namely, the accuracy of the de-quantized network. To that end, the hyper-parameters are recovered in 32-bit floating-point format from the inverse of (6), that is, $\tilde{r} = S\,(Q(r) - Z)$.
6. Results and Discussion
The proposed methodology was implemented in MatLab® R2022a and run on the NVIDIA DGX Station described in Section 4, using the network also described in Section 4. Further to the quantization with Gate Disclosure, and for comparison purposes, the network was also quantized considering a calibration stage on the individual layers, without discriminating the gates. The numbers of bits considered for the quantized representation were 16, 8, 4, and 2.
The first set of results, represented in Figure 14, corresponds to the current state of the art, where uniform quantization is applied to the network with a prior calibration performed at the layer level, that is, the scaling factor is obtained individually for each layer. From the confusion matrices in Figure 14, one can see that the network suffers no degradation when quantized with a 16-bit integer representation and barely any degradation with an 8-bit integer representation. Regarding lower-bit representations, the network precision degrades as the number of bits is lowered; the 2-bit-wide representation implies the complete degradation of the network, where all samples are classified as falls. Nevertheless, a substantial memory reduction can be obtained when considering the 16-bit or the 8-bit quantization: the footprint drops from bytes in the 32-bit floating-point representation to bytes for 16-bit or bytes for 8-bit, meaning reductions of 50% and 75%, respectively.
Secondly, the proposed Gate Disclosure methodology is applied, with the resulting confusion matrices represented in Figure 15. Figure 15 shows evidence of the improved performance provided by the Gate Disclosure methodology. One can see that the 16-bit, 8-bit, and 4-bit quantizations did not introduce any network performance degradation, as the confusion matrices are identical. When using 2 bits for the representation, the network performance slightly decreases but still keeps an accuracy of . The explanation for this behavior is, as discussed in Section 5, the distribution of the hyper-parameters.
While, in the first case (Figure 14), the small number of combinations available with 2-bit quantization has to be distributed over a wide range of bias values, in the second case, applying the Gate Disclosure (Figure 15), the four combinations are distributed over a much narrower range. Figure 16 illustrates this situation, where the red circles are the pre-trained 32-bit floating-point values, the blue squares denote the de-quantized values, and the dashed magenta lines represent the quantization bins. This also explains the RMSE for each case summarized in Table 3. Whether considering the calibration of the layers or the Gate Disclosure calibration method, the error increases as the number of bits decreases, but at a lower rate in the second case.
To conclude the performance discussion, and recalling the proposed memory model, if the quantization options of the Gate Disclosure method are overlaid on it (Figure 17), one can see that the 16-bit MSP430 microcontroller could now perfectly support the specified network with 100 hidden cells. Also, the STM8L would support an LSTM with 40 hidden cells, with an accuracy of .
7. Conclusions
This work proposes a new approach for quantizing an LSTM neural network with a high level of granularity, that is, down to the level of the LSTM gates, called Gate Disclosure. After detailed simulations and analyses of the network topology (1 LSTM layer + 1 FC layer), the method is applied to an LSTM with 100 hidden cells.
A memory occupancy model was also proposed to evaluate the feasibility of deploying this network on an embedded device used as a wearable for fall detection in senior groups. By relating the network topology and the binary representation of the network hyper-parameters to the memory footprint, the model allows the implementability of the network to be assessed.
The proposed methodology is compared with state-of-the-art uniform quantization at different bit widths (16 down to 2), demonstrating that the quantized and de-quantized networks remain accurate in predicting falls from acceleration measurements among elderly people. The numerical results show that, for an LSTM network, using different quantization thresholds “per gate” can improve the accuracy of the quantized model.
For future work, the authors will consider testing the methodology on different public datasets to strengthen the results and implementing other, non-uniform quantization functions for comparison. Additionally, the methodology could be applied to other problems with the same architecture (time series classification), such as predicting tremors in Parkinson’s disease.