Basic Activity Recognition From Wearable Sensors
Abstract
The field of human activity recognition (HAR) has developed rapidly, making its presence felt in sectors such as healthcare and surveillance. Identifying the fundamental behaviours that occur regularly in everyday life can be extremely useful for building systems that assist the elderly, and it opens the door to the detection of more complicated activities in a smart-home environment. Recently, deep learning techniques have made it possible to extract features from sensor readings automatically, in a hierarchical way, through non-linear transformations. In this study, we propose a deep learning model that works on raw data without any pre-processing. Our stacked LSTM network can recognize several human activities, and we demonstrate that its results are comparable to or better than those obtained by traditional feature-engineering approaches. Furthermore, our model is lightweight and can be deployed on edge devices.
1 Introduction
Human activity recognition is grounded in the gathering of data from different sensing devices. We distinguish two types: vision-based HAR, which requires cameras, and sensor-based HAR, which relies on sensors such as accelerometers, gyroscopes, and GPS receivers. These sensors are embedded in smartphones and other smart objects.
Advances made in the electronics industry in recent years have steadily reduced the size of sensors. This has motivated the development of several context-aware applications in various fields such as healthcare [1], surveillance [2], and security [3], as well as other daily-life applications, especially in the Internet of Things domain [4].
Vision-based HAR suffers from privacy concerns, restrictions related to mobility and power consumption, and the difficulty of obtaining good images under some climatic conditions. These complications directly impact recognition accuracy and push researchers toward sensor-based HAR, which is more accurate, flexible, and simple.
Sensors have the advantage of being ubiquitous. Thanks to the growth of smart gadgets and the Internet of Things, they can now be embedded into wearable devices such as phones, watches, and bracelets, as well as non-wearable items such as automobiles, doors, and furniture. Sensors are broadly present in our environment, logging people's motions in a non-intrusive and continual manner.
Today, almost everybody owns a smart device that contains at least one sensor. These devices are an excellent tool for collecting data and monitoring human activities, and there is currently a shift, especially toward smartphones, to realize this recognition.
For good recognition, traditional machine learning algorithms necessitate domain knowledge to pre-process data and select features. Recent achievements of deep learning in computer vision, speech recognition, and natural language processing have attracted researchers to investigate its effectiveness in the recognition of human activities. Indeed, deep learning can extract features from raw sensor data automatically, thus substituting for manual feature engineering.
2 Related Work
According to recent research reviewed in HAR, deep learning models have a greater learning capability. Convolutional neural networks have been employed as feature extractors, either alone or in combination with recurrent neural networks and their variants.
Oluwalade et al. [5] investigated the difference between the data generated by two types of devices embedding the same sensors (a watch and a phone). They used four models, long short-term memory (LSTM), bidirectional LSTM (BiLSTM), convolutional neural network (CNN), and convolutional LSTM (ConvLSTM), to classify fifteen hand- and non-hand-oriented activities. They also used a GRU to forecast the last 30 seconds of data generated by the watch accelerometer, and they obtained an average classification accuracy of more than 91%. In [6], the authors used the signals captured by a gyroscope and an accelerometer; after processing, they extracted 561 features in the time and frequency domains, applied kernel principal component analysis (KPCA) to reduce the dimensionality, and proposed a deep belief network to recognize 12 activities. As a result, they obtained a better score than SVM and ANN approaches.
3 Background
3.1 Recurrent Neural Network
RNNs are a type of neural network, initially created in the 1980s, for the purpose of anticipating what comes next in a sequence in a very exact way. This ability is what makes them the most suitable for sequential data such as time-series prediction, speech and audio recognition, financial data, weather forecasting, and much more. An RNN possesses cyclic connections that allow it to learn the temporal dynamics of sequential inputs. Each node in the hidden layer computes the current hidden state h_t and generates the output y_t based on its current input x_t and the previous hidden state h_{t-1}, according to the following equations:
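$$h_t = \sigma(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h) \tag{1}$$

$$y_t = W_{hy}\, h_t + b_y \tag{2}$$

where the W terms are weight matrices, the b terms are bias vectors, and $\sigma$ is a non-linear activation such as the hyperbolic tangent; this is the standard vanilla-RNN formulation.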
4 Proposed Methodology
4.1 Architectures
In this section we discuss our stacked LSTM network [21] against the other architectures we proposed, in order to investigate their performance on motion detection. We note that the UCI HAR database was used for evaluation.
The first architecture: stacked LSTM. This architecture has been frequently used in the literature for time-series prediction, for its capability of identifying and extracting temporal relationships between signal readings. Our first network uses three consecutive LSTM layers to represent more complicated patterns, adding depth to the network and increasing its extraction capacity, and we added a dropout layer after each LSTM layer, with a different number of nodes for each layer. Using a mix of 128, 64, and 32 nodes, we were able to reach an accuracy of up to 96.87%, as the effect of the dropout rate shown below illustrates.
[Figure: classification accuracy (%) as a function of the dropout rate (%); accuracies range from 94.46% to 96.87% for dropout rates between 0% and 90%.]
Table 3 shows that the best score is obtained when the batch size is set to 32.
Effect of the Optimizer. Table 4 shows that, compared with the Adadelta, Nadam, and SGD optimizers, the Adam optimizer produced the best results with the fastest computation using its default values (learning rate = 0.001). RMSprop also provided a good level of accuracy.
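As a concrete illustration, below is a minimal Keras/TensorFlow sketch of this first architecture. It assumes the UCI HAR input shape (128 timesteps of 9 channels, 6 activity classes) and reuses the hyper-parameters retained above (128/64/32 LSTM units, Adam with learning rate 0.001); the dropout rate and the helper name build_stacked_lstm are illustrative placeholders, not taken from the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_stacked_lstm(timesteps=128, channels=9, n_classes=6, dropout_rate=0.5):
    """Three stacked LSTM layers (128/64/32 units), each followed by dropout."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, channels)),
        layers.LSTM(128, return_sequences=True),   # first LSTM layer
        layers.Dropout(dropout_rate),
        layers.LSTM(64, return_sequences=True),    # second LSTM layer
        layers.Dropout(dropout_rate),
        layers.LSTM(32),                           # third layer keeps only the last state
        layers.Dropout(dropout_rate),
        layers.Dense(n_classes, activation="softmax"),  # softmax classifier
    ])
    # Adam with its default learning rate (0.001) gave the best results (Table 4).
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

Training would then call model.fit(X_train, y_train, batch_size=32, epochs=200), matching the batch size and epoch count retained in Section 5.3.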
The second architecture, called S-LSTM-CRF and presented in Figure 3, is a modification of the first stacked LSTM architecture with the same number of nodes but a different final layer. In this network, instead of using softmax for the classification, we chose a conditional random field (CRF), which is trained with a log-loss. The accuracy did not exceed 92%, and this result did not improve even when we changed the hyper-parameter values.

Figure 3: Three stacked LSTM layers with CRF in the last layer (S-LSTM-CRF).
The third architecture, S-LSTM-SVM, presented in Figure 4, uses the same LSTM model, but this time the CRF is replaced by an SVM, which uses a hinge loss.
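One plausible realization of this head, given as a sketch under the assumption that the model is built in Keras: a linear Dense output with L2 weight regularization trained under the categorical hinge loss behaves like a linear multi-class SVM on the LSTM features. The helper name and regularization strength are illustrative.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_stacked_lstm_svm(timesteps=128, channels=9, n_classes=6, dropout_rate=0.5):
    """Stacked LSTM trunk with the softmax head replaced by an SVM-like head."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, channels)),
        layers.LSTM(128, return_sequences=True),
        layers.Dropout(dropout_rate),
        layers.LSTM(64, return_sequences=True),
        layers.Dropout(dropout_rate),
        layers.LSTM(32),
        layers.Dropout(dropout_rate),
        # Linear output + L2 regularization: with hinge loss this approximates
        # a linear multi-class SVM over the learned LSTM features.
        layers.Dense(n_classes, activation="linear",
                     kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
    ])
    # categorical_hinge expects one-hot encoded labels.
    model.compile(optimizer="adam", loss="categorical_hinge", metrics=["accuracy"])
    return model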
The fourth architecture, S-LSTM + Inception, is based on an adaptation of the Inception network for time-series data.
The Inception module is a structure that permits various types of filters to be combined in a single network. We adjusted this module for time-series data by utilizing only the naïve version presented in Figure 5: we changed the number of convolution filters (1×1, 3×3, and 5×5) and used one-dimensional convolution and max-pooling layers instead of two-dimensional operations in each layer. Finally, all branches are concatenated into a single output in the next layer. The performance of all the proposed architectures, tested on UCI HAR, is evaluated in Table 5.
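A minimal sketch of this adapted naïve module, assuming Keras, is given below: parallel Conv1D branches with kernel sizes 1, 3, and 5 plus a max-pooling branch, concatenated along the channel axis. The filter counts are illustrative placeholders.

from tensorflow.keras import layers

def inception_module_1d(x, f1=32, f3=32, f5=32):
    """Naive Inception module adapted to one-dimensional signals."""
    b1 = layers.Conv1D(f1, 1, padding="same", activation="relu")(x)  # 1x1 branch
    b3 = layers.Conv1D(f3, 3, padding="same", activation="relu")(x)  # 3x3 branch
    b5 = layers.Conv1D(f5, 5, padding="same", activation="relu")(x)  # 5x5 branch
    bp = layers.MaxPooling1D(3, strides=1, padding="same")(x)        # pooling branch
    # Concatenation of all branches into a single output.
    return layers.Concatenate()([b1, b3, b5, bp])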
The combination of CNN and LSTM has been presented in several works in the literature and has achieved good results. The main idea is to design a network that is wide, using the concatenated convolutions, and deep, by combining our first stacked LSTM network with four Inception modules. This combination, shown in Figure 6, gives us the best accuracy, 97.20%, which is superior to our first stacked LSTM, but the number of parameters is large. We remarked that increasing the number of Inception modules improves the performance after training the model over 200 epochs; unfortunately, it also increases the complexity exponentially.
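Under the same assumptions as the earlier sketches, this wide-and-deep combination could be wired as follows, reusing inception_module_1d from the previous sketch; the exact placement of the four modules relative to the LSTM layers is our reading of Figure 6, not a verbatim reproduction.

import tensorflow as tf
from tensorflow.keras import layers

def build_inception_slstm(timesteps=128, channels=9, n_classes=6, dropout_rate=0.5):
    """Four 1-D Inception modules (wide part) feeding three stacked LSTMs (deep part)."""
    inputs = layers.Input(shape=(timesteps, channels))
    x = inputs
    for _ in range(4):                          # four inception modules
        x = inception_module_1d(x)
    x = layers.LSTM(128, return_sequences=True)(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.LSTM(32)(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)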
After analyzing the performance, we find that using three LSTM layers is best; this is also confirmed when these three layers are combined with other networks. Below three LSTM layers the model significantly loses its extraction capacity, and above this value the model falls into overfitting. We notice that using more than three Inception modules with our stacked LSTM increases the accuracy but also the complexity, which favors the stacked LSTM because of its lightness. The Inception module alone also gives a good result. The second part of the experiment shows that placing the softmax function in the last layer is preferable and gives better results than using CRF or SVM.
5 Experiment
5.1 Datasets
We considered two public benchmark datasets for HAR to evaluate the
performance of our network. Here is a brief description of each:
The performance is assessed with the standard metrics computed from TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

$$\text{F-Measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
5.3 Training
The UCI HAR dataset is divided into 70% for training and 30% for testing; as a result, the data of subjects 2, 4, 9, 10, 12, 13, 18, 20, and 24 remain unseen by the model. This split allows a fair comparison with previous works using the same evaluation technique. We used the Adam optimization algorithm to minimize the cost function. A softmax layer sits in the final layer, and the gradient back-propagates from this layer down to the input LSTM layers. The dropout is set to 90%. During training we tested different batch sizes over 200 epochs and selected a batch size of 32. The signals in this dataset were pre-processed with a low-pass filter and segmented with sliding windows of 2.56 s with 50% overlap.
To verify the generalization capability of our network, we tested it on the WISDM dataset using raw data with no feature selection; we only applied a sliding window, with a width of 180 samples and a step of 100.
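For illustration, a minimal NumPy sketch of this windowing step is given below (width 180 samples, step 100, as stated above). Labeling each window by its majority activity is an assumption, a common convention rather than a detail stated here.

import numpy as np

def sliding_windows(signal, labels, width=180, step=100):
    """signal: (n_samples, n_channels) raw readings; labels: (n_samples,) activity ids."""
    X, y = [], []
    for start in range(0, len(signal) - width + 1, step):
        X.append(signal[start:start + width])
        # Assumed convention: label the window with its majority activity.
        values, counts = np.unique(labels[start:start + width], return_counts=True)
        y.append(values[np.argmax(counts)])
    return np.stack(X), np.asarray(y)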
7 Conclusion
In this work, artificial intelligence and the IoT are exploited to design a lightweight network capable of recognizing human activities. Our approach is based on stacked LSTM layers used in an end-to-end fashion without any data pre-processing. It eliminates the need for feature engineering and provides better performance than traditional methods based on feature design. In the future, we plan to investigate other datasets with more complex actions, as well as an energy-efficient implementation of this model on smart devices.
References
[1] X. Zhou, W. Liang, K. I.-K. Wang, H. Wang, L. T. Yang, and Q. Jin, "Deep-Learning-Enhanced Human Activity Recognition for Internet of Healthcare Things," IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6429–6438, Jul. 2020, doi: 10.1109/JIOT.2020.2985082.
[2] K. K. Htike, O. O. Khalifa, H. A. Mohd Ramli, and M. A. M. Abushariah, "Human activity recognition for video surveillance using sequences of postures," in The Third International Conference on e-Technologies and Networks for Development (ICeND2014), Apr. 2014, pp. 79–82, doi: 10.1109/ICeND.2014.6991357.
[3] S.-R. Ke, H. L. U. Thuc, Y.-J. Lee, J.-N. Hwang, J.-H. Yoo, and K.-H. Choi, "A Review on Video-Based Human Activity Recognition," Computers, vol. 2, no. 2, Art. no. 2, Jun. 2013, doi: 10.3390/computers2020088.
[4] W. Lu, F. Fan, J. Chu, P. Jing, and S. Yuting, "Wearable Computing for Internet of Things: A Discriminant Approach for Human Activity Recognition," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2749–2759, Apr. 2019, doi: 10.1109/JIOT.2018.2873594.
[5] B. Oluwalade, S. Neela, J. Wawira, T. Adejumo, and S. Purkayastha, "Human Activity Recognition using Deep Learning Models on Smartphones and Smartwatches Sensor Data," arXiv:2103.03836 [cs, eess], Feb. 2021. [Online]. Available: http://arxiv.org/abs/2103.03836
[6] M. M. Hassan, Md. Z. Uddin, A. Mohamed, and A. Almogren, "A robust human activity recognition system using smartphone sensors and deep learning," Future Generation Computer Systems, vol. 81, pp. 307–313, Apr. 2018, doi: 10.1016/j.future.2017.11.029.
Biographies
Zakaria Benhaili received the engineering degree from the National School of Applied Sciences at Sultan Moulay Slimane University, Khouribga, Morocco, in 2017. He is currently a third-year PhD student in the Faculty of Sciences and Technologies at Hassan 1st University. His main research areas include the Internet of Things, deep learning and its applications in human activity recognition, pattern recognition, and smart homes.