1. Introduction
Human Activity Recognition (HAR) has become a pivotal technology across various domains, including transportation, active living, entertainment, and security. The integration of visual and non-visual sensors has significantly improved the efficiency and accuracy of activity monitoring and analysis. In traditional security operations, human operators rely on extensive camera networks, often leading to issues like operator fatigue and reduced effectiveness. These limitations have driven the development of automated vision-based systems, which offer greater reliability and scalability [1].
In the context of transportation, HAR is instrumental in enabling real-time traffic monitoring, optimizing routes, and predicting travel times. These advancements contribute to the development of intelligent transportation systems designed to improve traffic flow, enhance passenger safety, and maximize vehicle efficiency, ultimately fostering more adaptive and efficient transportation networks.
HAR identifies a variety of human activities, from basic gestures to complex movements, using raw sensor data [2]. Earlier studies predominantly utilized vision-based approaches, including RGB cameras and Kinect sensors, for activity classification and analysis [3,4,5]. Although these methods provide precise activity recognition and are suitable for long-range monitoring, they raise significant privacy concerns and face challenges in large-scale implementation [6]. In contrast, motion sensors such as accelerometers and gyroscopes embedded in smartphones offer a cost-effective, mobile alternative that alleviates the privacy issues associated with vision-based techniques. By integrating HAR with smartphone sensors and employing deep learning methods, the accuracy of activity recognition can be significantly improved.
Recent studies have demonstrated notable advancements in HAR using smartphone sensors. Palimote et al. achieved 99.06% accuracy in classifying activities such as walking, walking downstairs, walking upstairs, sitting, standing, and lying down by leveraging accelerometer and gyroscope data from a Samsung Galaxy S2. Their approach utilized an Artificial Neural Network (ANN), which proved highly effective in classifying these activities from the sensor data, showcasing the potential of ANNs for robust human activity recognition systems [7]. Koping et al. introduced a flexible sensor-based HAR system utilizing smartphones and wearable devices, employing Support Vector Machines (SVM) for activity recognition [8]. Jahangiri and Rakha conducted a comparative analysis of machine learning techniques for transportation mode detection using smartphone sensors, emphasizing the effectiveness of Random Forest (RF) and SVM [9]. Similarly, Lopez et al. evaluated smartphone sensor data for travel behavior analysis, identifying limitations in data accuracy due to hardware constraints. These constraints include sensor precision, variations in sampling rates, and noise, which are inherent to smartphone hardware, as well as inconsistencies across different device models. These limitations can affect the quality of collected data and lead to propagated errors in the analysis, potentially skewing the aggregated results [10].
The integration of machine learning in HAR has traditionally depended on heuristic feature extraction methods, which often limit classification accuracy and robustness [10]. In contrast, deep learning approaches, such as Long Short-Term Memory (LSTM) networks, utilize raw data to construct complex models with multiple layers, effectively tackling challenges in sequence learning [11]. LSTM’s capability to address vanishing gradient issues and manage information flow through controllable gates significantly improves accuracy compared to traditional methods.
Singh et al. utilized LSTM models to predict activities from smart home datasets, demonstrating superior performance compared to probabilistic models such as Naïve Bayes and Hidden Markov Models (HMM) [12]. Mekruksavanich and Jitpattanakul further advanced HAR in smart homes by applying accelerometer and gyroscope data with LSTM networks, achieving notable improvements in accuracy [13]. However, these studies largely focus on controlled environments with position-dependent datasets, which may not fully represent real-world scenarios where devices can be held in various positions.
Most HAR studies using smartphones rely on controlled, position-dependent datasets, which fail to capture the complexities of real-life scenarios where devices may be held in varying positions. This discrepancy can result in decreased accuracy in activity recognition. To address these challenges, it is crucial to utilize unconstrained, position-independent datasets [12,13]. Furthermore, many recent studies overlook the inclusion of sensors like gravity and linear accelerometers in smartphone-based HAR systems, despite their potential to enhance recognition accuracy.
In recent studies, HAR using smartphone sensors has shown promise in transportation and real-world activity tracking. However, the inherent challenges of position variability, sensor data diversity, and model robustness remain unresolved. This study addresses these issues by focusing on realistic, position-independent HAR, combining machine learning and deep learning approaches for improved recognition across various user scenarios. The primary contributions of this study are as follows:
Realistic, Position-Independent Data Collection: Our research investigates HAR using smartphones placed in multiple realistic positions—chest and leg pockets—reflecting common device placements in real-world settings. This approach addresses limitations associated with fixed-position data collection and improves the generalizability of activity recognition models.
Enhanced Sensor Suite Utilization: In addition to accelerometers and gyroscopes, this study incorporates linear accelerometers and gravity sensors, providing a more comprehensive dataset that better captures dynamic activity patterns and improves classification accuracy.
Comparative Analysis of Machine Learning and Deep Learning Models: This study evaluates both classical machine learning models (Decision Tree, K-Nearest Neighbors, Random Forest, Support Vector Classifier, and XGBoost) and advanced deep learning models (GRU, LSTM, TCN, and Transformer) for HAR.
Performance Optimization through Overlapping Data Segmentation: The study compares non-overlapping and 50% overlapping data segmentation methods, identifying the advantages of overlapping windows for improving model accuracy by capturing transitional information.
Application to Intelligent Transportation Systems (ITS): By demonstrating high-accuracy passenger behavior recognition, our research provides insights into practical ITS applications, contributing to enhanced traffic management, safety monitoring, and real-time emergency response systems.
These contributions establish a foundation for robust, real-world HAR implementations in intelligent transportation contexts, advancing the state of the art in activity recognition research.
The remainder of the paper is organized as follows: Section 2 reviews related work in HAR. Section 3 describes the materials and methods of the study. Section 4 presents and discusses the results. Finally, Section 5 concludes the paper and outlines directions for future research.
3. Materials and Methods
This study was designed to systematically evaluate and compare the performance of various machine learning and deep learning models for HAR using sensor data from smartphones. The research specifically focused on two key factors: the influence of different data segmentation strategies—comparing non-overlapping windows with 50% overlapping windows—and the impact of sensor configurations, contrasting the use of two sensors (accelerometer and gyroscope) against four sensors (including linear accelerometer and gravity sensors). Through this comprehensive evaluation, the study aims to provide actionable insights into the optimal methodologies for accurate and efficient activity recognition in real-world scenarios, addressing both model performance and practical deployment considerations.
3.1. Data Collection
This research utilized the embedded sensors of a Samsung Galaxy A50 Android smartphone (Samsung Electronics, Suwon, South Korea)—specifically the accelerometer, gyroscope, linear accelerometer, and gravity sensor—to collect uncontrolled data from various human activities, including walking, running, and being stationary, both while using and not using the smartphone. The addition of the linear accelerometer and gravity sensor was investigated to assess whether these sensors significantly enhance classification accuracy compared to using only the accelerometer and gyroscope. The linear accelerometer and gravity sensor are derivatives of the accelerometer, providing additional information about the magnitude and direction of acceleration and gravity, respectively.
A Graphical User Interface (GUI)-based application was employed to facilitate data collection, recording each volunteer’s unique identifier and the time duration for each activity (start and stop times). Data from all activities were collected at a sampling rate of 50 Hertz (Hz), capturing measurements across the X, Y, and Z axes for each of the four sensors (accelerometer, gyroscope, linear accelerometer, and gravity sensor). The gyroscope, initially collecting data at 120 Hz, was normalized to 50 Hz to ensure consistency with the other sensors.
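The paper does not state how the 120 Hz gyroscope stream was normalized to 50 Hz, so the sketch below is one plausible scheme (nearest-sample decimation; the function name `resample_to` is ours):

```python
def resample_to(signal, src_hz=120, dst_hz=50):
    """Downsample a 1-D signal from src_hz to dst_hz by nearest-sample
    picking. The study only states that 120 Hz gyroscope data were
    normalized to 50 Hz; this particular scheme is an assumption."""
    n_out = int(len(signal) * dst_hz / src_hz)
    return [signal[min(len(signal) - 1, round(i * src_hz / dst_hz))]
            for i in range(n_out)]
```

In practice, an anti-aliasing low-pass filter before decimation would also be advisable if the original pipeline did not already smooth the signal.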
The study involved eleven (11) volunteers, aged between 20 and 40 years, who were asked to perform a series of basic activities while holding the smartphone in different positions. These activities were categorized into two scenarios:
- Scenario 1: Activities with the smartphone placed in the left or right side pockets (not in use).
- Scenario 2: Activities with the smartphone held in the chest position (in use).
Each volunteer performed a sequence of activities over a period of 7 min. These activities fall into six classes: walking, running, and stationary (sitting and standing), each performed both while using the smartphone and while not using it, in a bus stop station setting.
Table 1 outlines the specific activities conducted in this study.
From Table 1, the walking speed in this study is standardized to two steps per second for each window sample. This rate is based on the typical gait cycle of an average, healthy individual, who takes approximately two steps per second. Running is assumed to involve a higher speed than walking. Activities such as standing, sitting, or any other motionless actions were categorized as stationary due to the lack of movement.
The study evaluates the impact of different sensor configurations on HAR accuracy by analyzing two distinct setups:
- Two-Sensor Configuration: Utilized only the accelerometer and gyroscope to establish a baseline for the effectiveness of these primary sensors in activity recognition.
- Four-Sensor Configuration: Incorporated all four sensors—accelerometer, gyroscope, linear acceleration sensor, and gravity sensor—to assess whether the inclusion of additional sensor data enhances the model’s activity recognition capabilities.
Table 2 presents a comparative overview of key smartphone sensors utilized in HAR. Each sensor type has distinct capabilities and limitations that affect its effectiveness in activity tracking. The sensors collectively capture detailed motion data, with accelerometers and gyroscopes being widely used in HAR for their ability to measure linear and angular motion. The addition of linear accelerometers and gravity sensors provides supplementary context, enhancing the system’s accuracy by minimizing noise from changes in acceleration. This integration of multiple sensors contributes to more robust and precise activity recognition.
3.2. Data Processing and Sensor Configuration
The raw sensor data were preprocessed to enhance data quality and ensure consistency across measurements. This preprocessing involved noise reduction and normalization to standardize the data before further analysis. To evaluate the impact of data segmentation strategies on model performance, we divided the data into fixed-length windows, processed both with 50% overlap and without overlap. This approach captured continuous sequences, which are essential for training deep learning models, and provided insight into how different windowing techniques affect performance.
To ensure comprehensive analysis, the data were segmented into fixed windows of 2.56 s (128 data points per window). Each window was processed with two different approaches: one set with 50% overlap and another set without overlap. This segmentation strategy aimed to capture temporal patterns effectively and provide a robust comparison of preprocessing methods.
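The two segmentation schemes described above can be sketched as follows (128-sample windows at 50 Hz, i.e., 2.56 s; the function name `segment` is illustrative):

```python
def segment(signal, window_size=128, overlap=0.0):
    """Split a 1-D signal into fixed-length windows.

    window_size=128 samples at 50 Hz corresponds to the 2.56 s windows
    used in this study; overlap=0.5 gives the 50% overlapping scheme,
    overlap=0.0 the non-overlapping one.
    """
    step = max(1, int(window_size * (1 - overlap)))
    return [signal[i:i + window_size]
            for i in range(0, len(signal) - window_size + 1, step)]
```

Note that 50% overlap roughly doubles the number of windows from the same recording while retaining the transitional samples that fall on non-overlapping window boundaries.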
Following segmentation, we applied handcrafted feature extraction to each window. This approach allowed us to capture the most relevant aspects of the sensor signals, ensuring that the models could effectively differentiate between various human activities. We extracted a variety of time-domain features commonly used in HAR. The feature set for the two-sensor configuration included 42 features, while the four-sensor configuration yielded 84 features. These features were chosen based on their relevance and effectiveness in distinguishing different activities.
To provide a clear representation of the sensor measurements during a running activity, we visualize the collected data for each sensor between 80 and 85 s of the recording. This segment was selected because it captures a steady, high-momentum phase of the running activity, during which consistent sensor readings are observed. Figure 1, Figure 2, Figure 3 and Figure 4 depict the data from the different sensors during this window.
The plots provide insight into the raw sensor data for the running activity. For instance, in Figure 1, the accelerometer data show the rapid fluctuations in acceleration along the x, y, and z axes as the volunteer propels forward while running. Figure 2 presents the gyroscope data, reflecting angular changes during body movement, while Figure 3 displays the measured linear acceleration, which isolates the acceleration component excluding gravitational effects. Figure 4 shows the gravitational component of acceleration, highlighting the orientation of the smartphone during the activity. These visualizations serve as the basis for the feature extraction process discussed in the subsequent section.
Statistical Feature Analysis for Activity Recognition
Table 3 summarizes the statistical analysis of the feature vectors computed from each sliding window. The process began with raw data collection, followed by segmentation into windows corresponding to each activity. For each window, signal features were extracted and processed using statistical methods. This comprehensive feature extraction approach provides a detailed basis for the subsequent classification analysis and allows for a robust evaluation of model performance under varying sensor configurations.
In this study, several key time-domain features were extracted to capture the essential characteristics of the sensor signals over time. The mean ($\mu$) provides the average value of the signal in each window, highlighting the overall level of activity:

$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$

where $n$ represents the number of data points in the window, and $x_i$ denotes each individual sensor reading.

The standard deviation ($\sigma$) quantifies the amount of variation in the signal, reflecting how much the values fluctuate around the mean:

$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$

Skewness assesses the asymmetry of the signal distribution, indicating whether the values tend to be higher or lower than the mean. This is particularly useful for understanding the balance of activity levels:

$\text{Skewness} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^3$

Additionally, the maximum ($x_{\max}$) and minimum ($x_{\min}$) values capture the range of the data, helping to identify the peak and lowest activity levels:

$x_{\max} = \max_{1 \le i \le n} x_i, \qquad x_{\min} = \min_{1 \le i \le n} x_i$

Lastly, kurtosis measures the “tailedness” or sharpness of the signal distribution. This feature helps to identify the presence of outliers or extreme values that may affect activity classification:

$\text{Kurtosis} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^4$

Together, these extracted features provide a rich representation of the sensor data and form the basis for further analysis in human activity recognition, allowing for robust classification models to be built based on the variability and distribution of the signal data [31].
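The formulas above translate directly into code; the sketch below computes the per-window statistics for a single sensor axis (the helper name `window_features` is ours):

```python
import math

def window_features(x):
    """Time-domain features of one window, following the formulas above:
    mean, standard deviation, skewness, kurtosis, maximum, and minimum."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    # Guard against constant windows, where skewness/kurtosis are undefined.
    skew = sum(((v - mu) / sigma) ** 3 for v in x) / n if sigma else 0.0
    kurt = sum(((v - mu) / sigma) ** 4 for v in x) / n if sigma else 0.0
    return {"mean": mu, "std": sigma, "skewness": skew,
            "kurtosis": kurt, "max": max(x), "min": min(x)}
```

With the median absolute deviation mentioned in Section 3.5 added analogously, seven statistics per axis over three axes give 2 × 3 × 7 = 42 features for the two-sensor configuration and 4 × 3 × 7 = 84 for the four-sensor one, matching the counts reported above.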
3.3. Experimental Setup
This section outlines the experimental setup used to assess the performance of various machine learning and deep learning models for human activity recognition tasks utilizing smartphone sensor data. The hyperparameters for each model were carefully tuned to achieve optimal performance.
Table 4 provides a detailed summary of the hyperparameter configurations explored for each model, enabling a systematic comparison across diverse machine learning and deep learning approaches.
The feature set was standardized using a Standard Scaler to ensure that each feature has a mean of 0 and a standard deviation of 1, transforming each feature $x$ as follows:

$z = \frac{x - \mu}{\sigma}$

where $z$ is the standardized value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation. Standardization was particularly crucial for models such as SVC, KNN, and neural networks (e.g., GRU and Transformer), which are sensitive to the scale of the input data. Standardizing the features improves the convergence of gradient-based optimization algorithms and ensures that all features contribute equally to distance calculations in KNN and margin maximization in SVC.
Nested Cross-Validation (CV) was employed to assess the generalizability of the models to unseen data. The outer loop used five folds, while the inner loop used three folds. The input tensor shape was defined as (window size, number of features): specifically, (2,3) for the three-input configurations and (2,4) for the four-input configurations. Scikit-learn’s KFold was used for dataset splitting, with a random state of 42 to ensure reproducibility. The dataset was split into 80% for training and 20% for testing, ensuring adequate data for model evaluation. This approach allowed each model to train and test multiple times on different folds, maximizing the utility of the available data while avoiding overfitting.
This two-level cross-validation technique provided an unbiased estimate of model performance and minimized overfitting during hyperparameter tuning. The outer loop ensured generalization to unseen data, while the inner loop rigorously selected the best hyperparameters based on the training data. This consistent setup was applied across all models, from traditional classifiers like DT, KNN, RF, and SVC to advanced gradient-boosted models like XGBoost, and further to deep learning architectures such as GRU, LSTM, TCN, and Transformer, providing a robust evaluation across the full spectrum of models used in this study.
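The 5×3 nested scheme above can be sketched as index generation (pure Python; beyond the stated random state of 42, the shuffling details are our assumption):

```python
import random

def kfold_indices(items, k, seed=42):
    """Shuffle once (seed mirrors the study's random_state=42) and split
    into k roughly equal folds; returns (train, test) index-list pairs."""
    idx = list(items)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]

def nested_cv_splits(n_samples, outer_k=5, inner_k=3):
    """Five-fold outer loop for unbiased evaluation; three-fold inner loop
    over each outer-training set for hyperparameter selection."""
    for outer_train, outer_test in kfold_indices(range(n_samples), outer_k):
        inner = kfold_indices(outer_train, inner_k)
        yield outer_train, outer_test, inner
```

The key property is that hyperparameters are chosen using only the inner splits, so each outer test fold remains untouched until the final evaluation.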
Table 4 illustrates the range of hyperparameters adjusted for each model to achieve optimal performance. For DT, hyperparameters such as the splitting criterion (gini or entropy), maximum depth (None, 10, 20, 30), and minimum samples per split and leaf were fine-tuned to balance model complexity and prevent overfitting.
The KNN model’s parameters, including the number of neighbors and weight functions, were adjusted to enhance sensitivity to local data variations. Support Vector Classifier (SVC) was optimized by experimenting with different kernels, regularization constants, and gamma values to refine the decision boundary and handle non-linear relationships effectively.
RF was optimized with hyperparameters including the number of estimators, maximum depth, and maximum features to balance performance and computational efficiency. RF’s ensemble structure and its ability to handle complex interactions in sensor data make it a robust model choice for activity recognition tasks.
XGBoost was tuned with parameters such as learning rate, maximum depth, number of estimators, and gamma. The model’s gradient-boosting framework allows it to capture intricate patterns in the data, while careful tuning of these parameters ensures that it achieves high performance without overfitting.
LSTM networks required the tuning of several parameters, including the number of LSTM layers, units per layer, dropout rates, optimizers (Adam, Adamax, RMSProp, SGD), and learning rates. These adjustments were crucial for managing memory, preventing overfitting, and improving the model’s ability to capture temporal dependencies in the data.
Similarly, TCNs were optimized by varying the number of layers, filters, kernel sizes, and learning rates to effectively capture temporal patterns and ensure robust training dynamics.
GRU models required adjustments in the number of GRU layers, units, dropout rates, optimizers, and learning rates. The GRU’s simpler structure compared to LSTM networks makes it computationally efficient while still retaining the ability to capture essential temporal dependencies in sequential data.
The Transformer model was fine-tuned by varying the number of layers, hidden units, attention heads, dropout rates, and learning rates. The Transformer’s self-attention mechanism allows it to focus on relevant parts of the input sequence, making it effective in handling long-range dependencies and interactions within the activity data.
Each model’s parameters were chosen through nested cross-validation to ensure they were well-suited for the data and provided generalizable performance across different configurations. The inclusion of a range of traditional and deep learning models, including DT, KNN, RF, SVC, XGBoost, GRU, LSTM, TCN, and Transformer, provided a comprehensive exploration of diverse techniques, enhancing the robustness of our activity recognition framework.
These hyperparameter optimizations were evaluated using a five-fold outer cross-validation and three-fold inner cross-validation approach, as shown in Table 5. This ensured an unbiased estimate of model performance while reducing the risk of overfitting.
3.4. Performance Metrics and Evaluation of the Trained Model
In evaluating the efficacy of the developed models, several performance metrics are employed, each providing unique insights into the model’s capability to accurately classify human activities.
Accuracy is the primary metric to evaluate the overall correctness of predictions made by the model. Accuracy represents the proportion of correctly classified instances (both positive and negative) to the total instances. This metric provides a broad indication of the model’s reliability and effectiveness across all activity categories. The mathematical expression for accuracy is given by

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Precision is another critical metric in this study, particularly when differentiating between similar activities. It indicates the proportion of correctly predicted positive observations to the total predicted positives. High precision is crucial in minimizing false positives, which could otherwise compromise the system’s utility in real-world applications. The mathematical expression for precision is given by

$\text{Precision} = \frac{TP}{TP + FP}$

Recall (Sensitivity), another critical metric, measures the model’s ability to identify all relevant instances. This is especially vital for activities that are subtly distinct, such as differentiating between sitting and standing. A higher recall rate ensures that fewer actual events of interest are missed, maintaining the system’s reliability across diverse scenarios. The mathematical expression for recall is given by

$\text{Recall} = \frac{TP}{TP + FN}$

The F1 Score is used to find the balance between precision and recall. An optimal F1 Score is indicative of the robustness of the model against the potentially imbalanced nature of the dataset, ensuring consistent performance in identifying correct activities while reducing false identifications. The mathematical expression for the F1 Score is given by

$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
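These four metrics reduce to simple counting of TP, TN, FP, and FN; a one-vs-rest sketch for a single class (the helper name `binary_metrics` is ours):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from TP/TN/FP/FN counts,
    following the formulas above (1 = positive class, 0 = rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

For the multi-class case (walking, running, stationary), these per-class values are typically macro-averaged across the classes.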
Lastly, a confusion matrix offers a comprehensive visualization of the model’s performance across all activity categories. It provides clarity on how accurately each category is predicted by displaying TP, FP, FN, and TN (Table 6). This matrix is instrumental in evaluating the model’s ability to differentiate between similar activities, such as walking and running. For example, the matrix can reveal how well the model distinguishes between these activities and help in assessing the impact of potential adjustments, such as increasing the number of sensors, on the overall performance of the system.
Collectively, these metrics facilitate a thorough evaluation of the model’s performance and are essential for guiding the iterative refinement of the activity recognition system.
3.5. Methodology
In Figure 5, the research workflow integrates both traditional machine learning techniques and advanced deep learning architectures, which are elaborated upon in the following subsections.
Figure 5 outlines the comprehensive workflow for human activity recognition using smartphone sensors, presenting the complete process from data acquisition to activity classification. Initially, raw data are collected from a variety of built-in sensors, including accelerometers, gyroscopes, linear accelerometers, and gravity sensors. These raw data undergo a crucial segmentation step to prepare them for detailed analysis, allowing for a more precise examination of different activity patterns.
Key statistical features—such as mean, standard deviation, median absolute deviation, skewness, maximum, minimum, and kurtosis—are extracted from the segmented data. These features are critical as they capture important characteristics of the signal’s distribution and variability. By analyzing these features, the system can better differentiate between various physical activities, such as running, walking, or stationary positions. The extraction of these features enables the capture of subtle differences in movement patterns and ensures that the activity classification is based on robust and informative data metrics.
The extracted features are then used as inputs for a range of machine learning models. This study incorporates traditional classifiers, including DT, KNN, and SVC, as well as ensemble models such as RF and XGBoost. These models were selected for their interpretability and effectiveness in structured data classification.
Additionally, advanced deep learning architectures, including LSTM networks, TCN, GRU, and Transformers, are integrated to capture sequential dependencies and complex temporal patterns in sensor data. The combination of traditional and deep learning models allows for a thorough exploration of both conventional and contemporary approaches to activity recognition.
The models are trained to classify activities into distinct categories, including running, stationary, and walking. Their performance is evaluated across two different scenarios, providing valuable insights into their ability to generalize and perform effectively in varied contexts. This dual approach—integrating traditional machine learning techniques with cutting-edge deep learning methods—significantly improves the accuracy and reliability of the human activity recognition system. By harnessing the strengths of both methodologies, the research strives to achieve superior performance in differentiating between various physical activities, thereby advancing the capabilities of activity recognition technologies.
3.6. Model Development and Hyperparameter Optimization
In this study, we employed a comprehensive suite of machine learning and deep learning models to tackle human activity recognition, ranging from traditional algorithms such as DT, KNN, RF, SVC, and XGBoost to advanced deep learning architectures including GRU, LSTM, TCN, and Transformer.
To maximize model performance, we implemented a nested cross-validation approach. This technique involved a five-fold outer cross-validation loop for unbiased model evaluation and a three-fold inner loop dedicated to fine-tuning hyperparameters. This nested structure was critical for producing robust estimates of model performance, minimizing the risk of overfitting, and ensuring the generalizability of each model [32].
For the machine learning models, we employed grid search within the inner loop of cross-validation to systematically explore combinations of hyperparameters. Key hyperparameters optimized included the depth and split criteria for DT, the number of neighbors for KNN, the number of estimators and maximum features for RF, the regularization parameter C and kernel type for SVC, and learning rates and boosting parameters for XGBoost.
For the deep learning models, we leveraged a Hyperband tuner to explore a fine-grained range of hyperparameters, including learning rates, number of layers, number of units per layer, and dropout rates. This method allowed us to optimize complex configurations efficiently, selecting the best settings to achieve peak model performance. The Hyperband tuner’s ability to dynamically allocate resources based on model performance enabled an adaptive search that significantly improved convergence times.
By rigorously optimizing each model’s hyperparameters through these tailored approaches, we ensured that all models were evaluated in their most effective configurations, enhancing the reliability of our comparative analysis and the robustness of our conclusions.
3.7. Models Overview
3.7.1. Decision Tree
The Decision Tree (DT) model, a widely used non-parametric supervised learning algorithm [33], was employed in this study to classify human activities using data from multiple sensors. The core strength of DT lies in its ability to recursively partition the feature space based on decision rules, yielding interpretable classifications. However, recognizing diverse human activities from multi-sensor data presents challenges, requiring careful parameter tuning to balance model complexity and prevent overfitting.
In this study, we optimized the DT using the Gini impurity criterion to guide node splits, aiming to maximize the purity of the resulting partitions. The formula for Gini impurity is as follows:

$Gini = 1 - \sum_{i=1}^{C} p_i^2$

where $p_i$ is the probability of a data point belonging to class $i$, and $C$ is the number of classes.
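As a quick illustration of the criterion, a pure-Python Gini impurity for the label list at one node (illustrative helper, not the study's implementation):

```python
def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2): 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))
```

A split candidate is scored by the weighted impurity of the child nodes it produces, and the tree greedily picks the split that lowers it most.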
Hyperparameters such as the splitting criterion, maximum tree depth, minimum samples per leaf, and minimum samples required to split an internal node were optimized using nested cross-validation with grid search, as described in Section 3.6. The optimal hyperparameters for each sensor configuration are presented in Table 7.
In the non-overlapping configurations, the entropy criterion was preferred, with a maximum depth of 5 to prevent overfitting and ensure robust pattern recognition. For 50% overlapping data, the minimum samples per split were increased to better handle the additional data points introduced by overlapping windows, addressing potential overfitting.
The switch from entropy to Gini impurity for the four-sensor, 50%-overlapping setup reflects the need to handle more complex feature interactions. This adjustment allowed for better computational efficiency while maintaining high classification accuracy [26].
In summary, tuning the DT hyperparameters ensured the model could generalize effectively across sensor setups and data segmentation strategies, offering accurate classification without unnecessary complexity. The DT model proved to be a reliable classifier for human activity recognition, especially when optimized for different sensor configurations [33].
3.7.2. K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) [34] is a simple yet effective instance-based learning algorithm. It classifies data points based on the majority vote of their k nearest neighbors, utilizing distance metrics such as the Minkowski distance. In this study, KNN was applied to human activity recognition using both non-overlapping and 50% overlapping data configurations.
The Minkowski distance, which generalizes both the Euclidean and Manhattan distances, is defined as:

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

where $p = 1$ yields the Manhattan distance and $p = 2$ the Euclidean distance.
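The distance metric can be sketched directly from the definition; the function name and example points are our own illustration:

```python
def minkowski(x, y, p):
    """Minkowski distance (sum_i |x_i - y_i|^p)^(1/p);
    p=1 gives Manhattan distance, p=2 gives Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski([0, 0], [3, 4], p=1))  # Manhattan: 7.0
print(minkowski([0, 0], [3, 4], p=2))  # Euclidean: 5.0
```

In KNN classification, the `p` exponent is simply one more hyperparameter: each query point is labeled by a vote among the `k` training points closest to it under this metric.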
To ensure optimal performance, we used nested cross-validation, which not only tunes hyperparameters but also validates the model, thus providing a more robust evaluation. The optimal hyperparameters across both two-sensor and four-sensor configurations (non-overlapping and overlapping) remained consistent, as shown in
Table 8. Specifically, the Minkowski distance with $p = 1$ (Manhattan distance), the tuned number of neighbors $k$, and uniform weights yielded the best results.
KNN’s simplicity in design—using local proximity to make decisions—makes it highly interpretable. Despite this, the nested cross-validation process confirmed that the selected setup with Manhattan distance strikes a balance between simplicity and accuracy, performing consistently well across different sensor configurations. While KNN is not as complex as deep learning models, it provides competitive results when applied in well-structured sensor setups. Its effectiveness highlights that even traditional methods can serve as reliable benchmarks in human activity recognition tasks [
35].
3.7.3. Random Forest (RF)
Random Forest (RF), an ensemble learning technique developed by Breiman [
26], is used to enhance classification accuracy by constructing multiple decision trees and aggregating their results. RF improves robustness against overfitting and is particularly suitable for handling high-dimensional, multi-sensor data in human activity recognition. Each tree in the forest is trained on a subset of the dataset, with the final prediction determined by a majority vote among the trees.
In RF, each decision tree classifier is denoted as $h(\mathbf{x}, \Theta_k)$, where $\mathbf{x}$ represents the input feature vector and $\Theta_k$ is a random vector indicating the subset of features used for the $k$-th tree. The final prediction $\hat{y}$ for a classification task is given by the following:

$$\hat{y} = \operatorname{majority\ vote}\left\{ h(\mathbf{x}, \Theta_k) \right\}_{k=1}^{K}$$

where $K$ is the total number of trees in the forest. This majority voting strategy allows RF to reduce the variance of individual decision trees and improve overall classification accuracy.
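The aggregation step itself is simple to sketch; the function name and toy per-tree votes below are our own illustration, not the study's forest:

```python
from collections import Counter

def rf_predict(tree_predictions):
    """Aggregate the K per-tree class predictions by majority vote,
    returning the most common class label."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three of five hypothetical trees vote "run", so the forest predicts "run".
votes = ["run", "walk", "run", "run", "stand"]
print(rf_predict(votes))
```

Because each tree sees a different bootstrap sample and feature subset, their errors are partly decorrelated, and the vote averages those errors out.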
Hyperparameters for RF were optimized through a nested cross-validation approach (outer = 5 folds, inner = 3 folds), ensuring robust evaluation and tuning across different configurations. The use of the Keras Tuner facilitated efficient hyperparameter selection.
Table 9 presents the final parameters.
RF’s ensemble nature makes it robust to noise and variability in the sensor data, contributing to improved classification accuracy across both overlapping and non-overlapping datasets.
3.7.4. Support Vector Classifier (SVC)
Support Vector Classifier (SVC), introduced by Vapnik [
36], is a powerful supervised learning method designed to find the optimal hyperplane that separates data points into distinct classes. In this research, SVC was applied to classify human activities using sensor data from smartphones, with a focus on optimizing its performance through nested cross-validation.
To effectively capture the non-linear relationships in the activity data, the Radial Basis Function (RBF) kernel was employed. The RBF kernel transforms input data into a higher-dimensional space, enabling the model to separate complex patterns in human movements. The kernel is controlled by the parameter $\gamma$, which defines its width, and was optimized through grid search.
The RBF kernel is mathematically expressed as:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2\right)$$

where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the input feature vectors, and $\gamma$ controls the kernel’s width, influencing the decision boundary.
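The kernel value can be evaluated directly from this formula; the function name and example vectors are our own illustration:

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2). Larger gamma
    narrows the kernel, producing a more local decision boundary."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=0.5))  # exp(-0.5), about 0.6065
```

The kernel returns 1 for identical inputs and decays toward 0 as the squared distance grows, so $\gamma$ directly trades off smoothness against sensitivity to local structure.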
The model’s performance is highly sensitive to its regularization parameter $C$ and kernel width $\gamma$, both of which were fine-tuned for each sensor configuration (non-overlapping and 50% overlapping). For two-sensor setups, the RBF kernel with the tuned $C$ and an automatically scaled $\gamma$ yielded the best results, while the four-sensor non-overlapping configuration favored a linear kernel for its ability to handle simpler decision boundaries. The 50%-overlapping configuration returned to the RBF kernel to manage more complex, overlapping data relationships.
Table 10 provides a summary of the optimal hyperparameters identified for each configuration.
The use of nested cross-validation ensured a reliable evaluation of model performance by optimizing hyperparameters. The RBF kernel, particularly, demonstrated strength in capturing non-linearities within overlapping data, while the linear kernel proved efficient for non-overlapping configurations. These findings highlight the flexibility of SVC in handling various sensor setups and classification tasks.
3.7.5. XGBoost
XGBoost (Extreme Gradient Boosting) [
27] is an advanced gradient-boosting technique known for its computational speed, scalability, and effectiveness in handling large and complex datasets. Unlike traditional boosting algorithms, XGBoost incorporates second-order derivatives (i.e., the Hessian) in the objective function, allowing for more accurate gradient-based optimization. This feature makes XGBoost particularly suitable for applications involving intricate sensor data, such as human activity recognition.
In each boosting iteration $t$, XGBoost adds a new tree $f_t$ that minimizes an objective function, defined as:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} L\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)$$

where $L$ represents the loss function quantifying the difference between the true label $y_i$ and the prediction $\hat{y}_i^{(t-1)}$ from the previous iteration. The term $\Omega(f_t)$ serves as a regularization component to penalize the complexity of $f_t$, thereby mitigating overfitting. This regularization, alongside the use of second-order gradients, is a distinctive feature that enhances XGBoost’s robustness in complex classification tasks.
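The role of the second-order information can be sketched for the binary logistic loss: XGBoost computes a gradient and a Hessian per sample, and the standard closed-form leaf weight $w^* = -G/(H + \lambda)$ follows from the second-order Taylor expansion of the objective. The function names and the $\lambda = 1$ default below are our own illustration:

```python
import math

def grad_hess_logistic(y_true, y_pred_raw):
    """Gradient and Hessian of the logistic loss with respect to the
    raw (pre-sigmoid) prediction, as used in second-order boosting."""
    p = 1.0 / (1.0 + math.exp(-y_pred_raw))  # sigmoid of the raw score
    grad = p - y_true                         # dL/df
    hess = p * (1.0 - p)                      # d^2L/df^2
    return grad, hess

def optimal_leaf_weight(grads, hesses, lam=1.0):
    """Closed-form leaf weight w* = -G / (H + lambda), where G and H are
    the summed gradients/Hessians of the samples in the leaf and lambda
    is the L2 regularization term contributed by Omega(f_t)."""
    G, H = sum(grads), sum(hesses)
    return -G / (H + lam)

g, h = grad_hess_logistic(1.0, 0.0)   # positive sample, raw score 0
print(g, h)                            # -0.5 0.25
print(optimal_leaf_weight([g], [h]))   # 0.5 / 1.25 = 0.4
```

Note how $\lambda$ in the denominator shrinks the leaf weight, which is exactly the overfitting control described above.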
The hyperparameters for XGBoost were optimized using a nested cross-validation approach, with five outer folds and three inner folds, along with the Keras Tuner to facilitate hyperparameter tuning. The optimized parameters for each sensor configuration are detailed in
Table 11.
Using nested cross-validation with the Keras Tuner allowed for precise tuning of XGBoost, capturing complex interactions within multi-sensor data and yielding high accuracy across various sensor configurations. This approach ensured that the model generalized effectively, enhancing its performance in real-world human activity recognition applications.
3.7.6. Gated Recurrent Unit (GRU)
Gated Recurrent Units (GRU) [
37] are a variant of recurrent neural networks that aim to improve upon traditional LSTMs by simplifying the architecture while maintaining the capability to capture sequential dependencies. GRUs are particularly effective for HAR tasks where capturing long-term dependencies is crucial, yet computational efficiency is also a consideration.
The GRU model consists of two main gates—reset and update gates—that regulate the flow of information. The reset gate $r_t$ controls how much of the previous information is forgotten, while the update gate $z_t$ determines how much of the past information is carried forward. The primary equations for the GRU are given by the following:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $r_t$ and $z_t$ are the reset and update gates, respectively, and $\tilde{h}_t$ is the candidate activation. GRUs reduce the complexity of the network by combining the forget and input gates into a single update gate, which often results in faster training times compared to LSTM.
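To make the gate interactions concrete, a single GRU step can be sketched in NumPy; the weight shapes, random initialization, and six-channel input are illustrative assumptions, not the trained model from this study:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W, U, b):
    """One GRU step. W, U, b hold per-gate parameters keyed 'z' (update),
    'r' (reset), and 'h' (candidate activation)."""
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])          # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])          # reset gate
    h_cand = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])
    return (1.0 - z) * h_prev + z * h_cand                        # new hidden state

rng = np.random.default_rng(0)
dim_x, dim_h = 6, 4          # e.g. 6 sensor channels, 4 hidden units
W = {k: rng.standard_normal((dim_h, dim_x)) * 0.1 for k in 'zrh'}
U = {k: rng.standard_normal((dim_h, dim_h)) * 0.1 for k in 'zrh'}
b = {k: np.zeros(dim_h) for k in 'zrh'}
h = np.zeros(dim_h)
for x_t in rng.standard_normal((5, dim_x)):   # five time steps
    h = gru_cell(x_t, h, W, U, b)
print(h.shape)  # (4,)
```

The final line of the cell shows the single-gate blending: $z_t$ interpolates between the previous state and the candidate, which is why the GRU needs no separate forget and input gates.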
Figure 6 illustrates this GRU structure, displaying the flow of data through the reset and update gates for sequential data processing.
The GRU hyperparameters, optimized using nested cross-validation (outer = 5, inner = 3), are summarized in
Table 12. The selected values include two layers, 64 GRU units, a dropout rate of 0.3, and the Adam optimizer with a learning rate of 0.0015.
By leveraging GRU’s simplified architecture, this model is able to efficiently capture both short-term and long-term dependencies, making it suitable for human activity recognition with complex sensor data. In this architecture, the reset and update gates play critical roles in managing sequential data processing. The asterisk (*) in the figure denotes multiplication, and the “−” represents subtraction, key operations that adjust the information flow through the GRU unit.
3.7.7. Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks [
38] are a specialized form of Recurrent Neural Networks (RNNs) designed to handle sequential data and capture long-term dependencies. This makes LSTMs highly suitable for HAR, where temporal patterns are crucial.
In this study, a multi-layer LSTM network was implemented to model the sequential patterns in sensor data, optimizing key parameters such as the number of LSTM units and learning rate for effective classification performance. The core of the LSTM consists of three gates—forget, input, and output—that control information flow, allowing the network to manage both short-term and long-term dependencies in the data. These gates interact with the cell state, the network’s memory, to retain relevant information and discard what is unnecessary over time.
As depicted in
Figure 7, the forget gate discards irrelevant information, the input gate allows new data to be added, and the output gate determines what is passed to the next layer. These mechanisms enable LSTM to retain useful patterns for activity recognition, selectively managing both short-term and long-term dependencies.
The LSTM’s core equations are expressed as:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $f_t$, $i_t$, and $o_t$ represent the activations of the forget, input, and output gates, respectively, and $c_t$ represents the cell state update. These gates use the sigmoid function $\sigma$, cell updates rely on element-wise operations $\odot$, and the previous hidden state $h_{t-1}$ and input $x_t$ influence the gating mechanisms.
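A single LSTM step following these equations can be sketched in NumPy; the concatenated-input parameterization, weight shapes, and random initialization are illustrative assumptions rather than the study's trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step operating on the concatenated input [h_{t-1}, x_t].
    W and b hold parameters for the forget (f), input (i), and output (o)
    gates and the candidate cell state (c)."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W['f'] @ z + b['f'])       # forget gate
    i = sigmoid(W['i'] @ z + b['i'])       # input gate
    o = sigmoid(W['o'] @ z + b['o'])       # output gate
    c_cand = np.tanh(W['c'] @ z + b['c'])  # candidate cell state
    c = f * c_prev + i * c_cand            # cell state update
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(1)
dim_x, dim_h = 6, 4          # e.g. 6 sensor channels, 4 hidden units
W = {k: rng.standard_normal((dim_h, dim_h + dim_x)) * 0.1 for k in 'fioc'}
b = {k: np.zeros(dim_h) for k in 'fioc'}
h, c = np.zeros(dim_h), np.zeros(dim_h)
for x_t in rng.standard_normal((5, dim_x)):   # five time steps
    h, c = lstm_cell(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive update `c = f * c_prev + i * c_cand` is what lets gradients flow across many time steps, which is the mechanism behind the long-term dependency handling discussed above.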
The LSTM’s hyperparameters were optimized using a nested cross-validation process, as shown in
Table 13. For the two-sensor, non-overlapping configuration, two LSTM layers with 32 units and a dropout rate of 0.30 performed best, while the 50%-overlapping setup used one LSTM layer with 128 units. The four-sensor non-overlapping configuration similarly required one LSTM layer, whereas the 50%-overlapping setup needed a more complex architecture with three layers.
The LSTM’s ability to capture both short-term and long-term dependencies was key to improving classification accuracy across different sensor configurations. By fine-tuning the hyperparameters, the model was able to effectively differentiate between various human activities, achieving robust performance through careful management of sequential data.
3.7.8. Temporal Convolutional Networks (TCN)
Temporal Convolutional Networks (TCN) [
28] are highly effective for sequence modeling, particularly in handling long-term dependencies through dilated causal convolutions. This enables TCNs to capture temporal patterns over extended periods without losing context. In this study, TCNs were employed to classify human activities based on multi-sensor data. Key parameters such as filter size, layers, and dilation factor were optimized for performance.
The core dilated convolution operation in a TCN is given by the following:

$$y_t = \sum_{i=0}^{k-1} f(i) \cdot x_{t - d \cdot i}$$

where $y_t$ is the output at time $t$, $x$ is the input signal, $f(i)$ denotes the filter weights for a filter of size $k$, and $d$ is the dilation factor. This formulation enhances the model’s ability to recognize temporal dependencies across varying time steps.
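The causal, dilated nature of this operation can be sketched directly from the sum; the function name, zero-padding convention, and toy signal are our own illustration:

```python
def dilated_causal_conv(x, f, d):
    """y_t = sum_{i=0}^{k-1} f[i] * x[t - d*i], treating x as zero-padded
    for t - d*i < 0. Causal: no future samples contribute to y_t."""
    k = len(f)
    y = []
    for t in range(len(x)):
        s = 0.0
        for i in range(k):
            j = t - d * i
            if j >= 0:
                s += f[i] * x[j]
        y.append(s)
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dilated_causal_conv(x, f=[1.0, 1.0], d=1))  # sums adjacent samples
print(dilated_causal_conv(x, f=[1.0, 1.0], d=2))  # reaches 2 steps back
```

Stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially in depth, which is how a TCN covers long histories without a correspondingly deep network.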
The TCN architecture, as shown in
Figure 8, processes temporal sequences using convolutional layers that expand the receptive field efficiently. Dilations in increasing order capture long-range dependencies without a significant computational cost. Dropout layers with a rate of 0.25 were applied to mitigate overfitting, and different optimizers (Adam, RMSProp) were used depending on sensor configurations.
The TCN demonstrated robustness in handling complex sensor configurations. Dilated convolutions enabled the model to efficiently capture long-range temporal relationships, crucial for differentiating subtle variations in human activities. In more complex configurations, such as the four-sensor setup, dropout layers further improved generalization by reducing overfitting, especially with high-dimensional data.
The optimal hyperparameters, identified through nested cross-validation, are summarized in
Table 14. For the two-sensor 50%-overlapping setup, the best results were achieved using two layers with filter sizes of 96, 32, and 128, and an Adam optimizer with a learning rate of 0.0006. In contrast, the four-sensor non-overlapping setup favored filter sizes of 96, 64, and 32, using Adam with a learning rate of 0.0011.
In the more complex four-sensor, 50%-overlapping configuration, a simplified architecture of one layer, with filter sizes of 32, 32, and 128, and RMSProp (learning rate 0.0025) was optimal. These configurations highlight the adaptability of TCNs in different sensor setups, capturing both local and long-range temporal dependencies effectively.
The nested cross-validation approach ensured unbiased tuning, contributing to the model’s ability to generalize across various sensor setups. By capturing both short-term and long-term dependencies, the TCN model achieved robust performance in human activity recognition tasks across different sensor configurations.
3.7.9. Transformer
The Transformer model [
29] is a groundbreaking architecture that relies solely on attention mechanisms to capture dependencies in sequential data, eliminating the need for recurrence. This self-attention mechanism allows the Transformer to focus on relevant parts of the input sequence, making it highly effective for handling long-term dependencies in HAR.
The core of the Transformer is the multi-head attention mechanism, which computes attention weights for each input token across multiple heads, allowing the model to focus on various aspects of the input sequence. The attention mechanism is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

where the query $Q$, key $K$, and value $V$ are matrices derived from the input embeddings, with $d_k$ representing the dimension of the queries and keys. As illustrated in
Figure 9, the multi-head attention mechanism allows the model to attend to various parts of the sequence in parallel, supported by positional encoding to handle sequential dependencies effectively.
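Scaled dot-product attention for a single head can be sketched in NumPy; the sequence length, dimension, and random inputs below are illustrative assumptions, not the study's model:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head.
    Returns the attended output and the attention weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(2)
seq_len, d_k = 5, 8           # illustrative sizes
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
out, w = attention(Q, K, V)
print(out.shape)       # (5, 8)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Multi-head attention simply runs several such projections in parallel and concatenates the outputs; the $\sqrt{d_k}$ scaling keeps the dot products from saturating the softmax as the dimension grows.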
The hyperparameters for the Transformer model were optimized using nested cross-validation (outer = 5, inner = 3), as outlined in
Table 15. The selected values include four attention heads, two layers, and an Adam optimizer with a learning rate of 0.0008.
The Transformer’s self-attention mechanism, combined with its ability to process sequences in parallel, makes it highly efficient and powerful for HAR tasks, particularly in capturing complex temporal relationships within sensor data.
4. Results and Discussion
This section evaluates the performance of nine models—DT, KNN, RF, SVC, XGBoost, GRU, LSTM, TCN, and Transformer—for activity recognition within Intelligent Transportation Systems (ITS). Effective activity recognition is crucial for enhancing safety and efficiency in ITS applications. The analysis compares model performance across two sensor configurations (two-sensor and four-sensor) and two data segmentation strategies (non-overlapping and 50% overlapping), focusing on key performance metrics such as accuracy, precision, and recall to assess their viability for real-time applications in ITS.
4.1. Model Performance Overview
Decision Trees (DT) were selected for their simplicity and interpretability, making them ideal for real-time ITS applications where model transparency is crucial. While DT maintains a balance between accuracy and efficiency, it is somewhat prone to overfitting, particularly with smaller datasets.
K-Nearest Neighbors (KNN) was included for its flexibility in handling diverse activity patterns. However, its performance can decline in high-dimensional spaces, which leads to computational challenges. Various values for k and distance metrics were tested to optimize KNN’s sensitivity to the sensor data.
Random Forest (RF) was introduced as an ensemble model to assess its classification consistency and ability to mitigate overfitting through the aggregation of multiple decision trees. RF produced reliable results across sensor configurations, demonstrating particularly strong performance in non-overlapping datasets, where it maintained high accuracy across various activities.
Support Vector Classifier (SVC) with the RBF kernel was chosen for its capacity to manage non-linear data effectively, enhancing accuracy in activity recognition, especially with smaller datasets. Its strong generalization capabilities contributed to high classification performance.
XGBoost, a gradient-boosting algorithm, was selected for its computational efficiency and advanced feature-handling capabilities. It effectively captured complex activity patterns, excelling in overlapping data segmentation, and achieving accuracy comparable to deep learning models such as TCN. Its robustness and precision make it particularly suitable for ITS applications where both accuracy and efficiency are paramount.
Gated Recurrent Unit (GRU) was incorporated to evaluate its sequential learning capabilities while offering reduced computational requirements compared to LSTM. The gating mechanism in GRU enables effective learning of activity sequences, which is beneficial for real-time ITS applications. GRU performed well across sensor configurations, particularly in non-overlapping data, where its lower complexity supported faster inference.
Long Short-Term Memory (LSTM) networks are well-suited for capturing long-term dependencies, which are crucial for distinguishing activities such as walking and running. However, their performance was slightly diminished with non-overlapping data configurations, as LSTMs rely on the continuity of sequential data.
Temporal Convolutional Networks (TCN), utilizing dilated convolutions to capture extended temporal dependencies, consistently achieved high accuracy, particularly in overlapping segmentation. Its ability to retain contextual information across sequences makes it an excellent choice for real-time ITS tasks.
Transformer models, which leverage self-attention mechanisms, were evaluated for their ability to manage complex dependencies in sequential data without the limitations of recurrent connections. They excelled in overlapping segmentation scenarios, effectively capturing long-term dependencies across activities, achieving competitive accuracy, and highlighting their scalability for ITS applications.
4.2. Sensor Configurations and Data Segmentation
The evaluation utilized two sensor configurations—two-sensor (accelerometer and gyroscope) and four-sensor (which includes a linear accelerometer and gravity sensor)—in conjunction with two segmentation strategies: non-overlapping and 50% overlapping. The implementation of overlapping segmentation proved advantageous by preserving transitional information between activities, which is crucial for accurately capturing the dynamics of human motion. This strategy significantly enhanced model performance, leading to notable improvements in accuracy across all models, demonstrating the effectiveness of this approach in real-time activity recognition within ITS.
4.3. Two-Sensor Configurations
4.3.1. Two-Sensors: Model Performance with Non-Overlapping Data
In this configuration, data from two sensors—an accelerometer and a gyroscope—were analyzed using non-overlapping segments. The models evaluated include DT, KNN, RF, SVC, XGBoost, GRU, LSTM, TCN, and Transformer.
The evaluation results in
Table 16 reveal that the
TCN model achieves the highest performance across most metrics, with an F1 Score of
0.9757 on the test set. This model demonstrates strong consistency across training, validation, and test sets, underscoring its robustness and suitability for non-overlapping configurations in human activity recognition.
DT and KNN models offer reliable baseline performance, with DT reaching an F1 Score of 0.9725 on the test set. However, these simpler models tend to lag behind more advanced models, particularly when handling high-dimensional sensor data [
33,
34].
Among the ensemble models, RF and XGBoost exhibit solid performance, leveraging ensemble learning to enhance generalization. RF achieves a test F1 Score of 0.9720, while XGBoost achieves a higher F1 Score of 0.9753, reflecting the benefits of boosting techniques for managing complex data interactions [
26,
27].
SVC also demonstrates adaptability to non-linear data with a test F1 Score of 0.9726, making it a strong candidate for activity recognition in configurations lacking temporal overlap [
36]. The deep learning models GRU and LSTM offer consistent results, with F1 Scores of 0.9748 and 0.9738, respectively, on the test set, highlighting their effectiveness in sequential data processing even without overlapping segmentation [
37,
38].
Despite its strengths, the Transformer model performs slightly lower than TCN in this configuration, with an F1 Score of 0.9724 on the test set, suggesting that while its self-attention mechanism is beneficial for long-range dependencies, TCN may be more effective in handling structured, sequential data in non-overlapping setups.
In conclusion, the TCN model emerges as the best performer in this configuration, demonstrating high accuracy, precision, recall, and F1 Score across all sets. Its superior generalization capabilities highlight its robustness for human activity recognition in non-overlapping data settings.
4.3.2. Confusion Matrix Analysis for Two-Sensor Non-Overlapping Configuration
By examining the confusion matrices presented in
Figure 10, we can gain deeper insights into how effectively each model classifies different activity categories, revealing specific strengths and weaknesses in handling certain tasks.
The confusion matrices provide insightful details about the performance of different models in the two-sensor non-overlapping configuration:
DT: As shown in
Figure 10a, the DT model demonstrates relatively consistent performance across most activity classes, correctly classifying 221 instances of
wlk_wos. However, the model struggles with 36 misclassifications of
stn_ws, suggesting difficulties in distinguishing between subtle differences in stationary and non-stationary activities. This pattern highlights DT’s known tendency to overfit, making confident but incorrect predictions, especially in cases with smaller datasets or less distinct class boundaries [
26].
KNN: As depicted in
Figure 10b, KNN shows sensitivity to closely spaced classes in the feature space, misclassifying 28 instances of
stn_wos. KNN correctly classifies 222 instances of
wlk_wos, but its instance-based approach struggles in high-dimensional sensor data without overlap, where the feature space becomes more complex [
35].
RF:
Figure 10c shows RF’s consistent performance, especially in classifying stationary activities like
stn_wos (481 correct classifications). However, it misclassifies 40 instances of
stn_ws, indicating room for improvement in handling subtle distinctions in movement. RF’s ensemble learning structure provides stability, which is reflected in its balanced performance across various activity classes [
26].
SVC: The confusion matrix in
Figure 10d illustrates the strong performance of SVC, with minimal misclassifications. SVC’s capacity to maximize margins effectively separates the classes, correctly identifying 228 instances of
run_wos and 477 instances of
stn_wos, confirming its suitability for non-linear data in non-overlapping setups [
39].
XGBoost: As seen in
Figure 10e, XGBoost excels in distinguishing complex activity patterns, with only a few misclassifications, such as three instances of
run_wos and 40 of
stn_ws. Its gradient-boosting approach enhances its capacity for handling complex data structures, making it a strong contender in the non-overlapping configuration [
27].
GRU: The GRU model, shown in
Figure 10f, demonstrates efficient classification of sequential activities, correctly classifying 136 instances of
run_ws and 229 of
stn_ws. Although it has a few misclassifications in
stn_wos, GRU’s simplicity and lower computational cost make it suitable for real-time ITS applications [
37].
LSTM: As indicated in
Figure 10g, LSTM misclassifies 31 instances of
stn_ws, reflecting its limitation in non-overlapping data. However, it still performs well across other activities, showing flexibility in handling sequential data despite the lack of continuity [
38].
TCN: As shown in
Figure 10h, TCN provides the highest classification accuracy, especially for dynamic activities, with minimal misclassifications. It is particularly effective in correctly classifying activities like
run_wos (234 correct classifications) and
wlk_wos (228 correct classifications). Its convolutional structure allows efficient handling of non-overlapping segments, reinforcing TCN as the top-performing model in this setup [
28].
Transformer: The Transformer model, illustrated in
Figure 10i, performs well in classifying
stn_wos with 502 correct classifications but shows slight limitations with dynamic activities, misclassifying 44 instances of
stn_ws. Although its self-attention mechanism aids in long-range dependency handling, it may not fully leverage the sequential structure needed in non-overlapping setups [
29].
In summary, both the table and confusion matrix analyses consistently highlight the TCN model as the best performer for this two-sensor, non-overlapping configuration. Its ability to handle non-overlapping temporal data without sacrificing accuracy or interpretability makes it ideal for human activity recognition tasks where data continuity is limited. TCN outperforms other models in classifying a range of activities accurately, proving its robustness and adaptability in this setup.
4.3.3. Two-Sensors: Model Performance with 50% Overlapping Data
In this configuration, the same two sensors—accelerometer and gyroscope—are utilized, but with a 50% overlapping data segmentation strategy. This method enhances the ability of models to capture contextual transitions between activities by retaining more temporal information, which is crucial for improved classification accuracy.
The evaluation results in
Table 17 indicate that the
TCN model achieves the highest performance across the majority of metrics, with an F1 Score of
0.9765 on the test set. This model exhibits strong consistency throughout training, validation, and test phases, emphasizing its robustness in addressing overlapping configurations for human activity recognition.
The Decision Tree (DT) and K-Nearest Neighbors (KNN) models serve as reliable baseline performance indicators, with DT attaining an F1 Score of 0.9707 on the test set. However, these simpler models typically fall short compared to their more advanced counterparts, particularly when processing high-dimensional sensor data [
33,
34].
Ensemble models such as Random Forest (RF) and XGBoost demonstrate commendable performance, effectively utilizing ensemble learning techniques to enhance generalization. RF achieves a test F1 Score of 0.9730, while XGBoost performs slightly better with an F1 Score of
0.9749. This improvement underscores the advantages of boosting methods in managing intricate data interactions [
26,
27].
Support Vector Classifier (SVC) also shows strong adaptability to non-linear data, reaching a test F1 Score of 0.9745. This positions SVC as a strong candidate for activity recognition, particularly in configurations that require efficient handling of overlapping data. Additionally, the deep learning models GRU and LSTM yield consistent outcomes, with F1 Scores of 0.9750 and 0.9762, respectively, on the test set, demonstrating their effectiveness in processing sequential data [
37,
38].
The Transformer model exhibits robust capabilities, achieving an F1 Score of 0.9746 on the test set. Although it shows promise with its self-attention mechanism for capturing complex relationships, it falls slightly short of TCN’s performance, indicating that while effective, TCN may be better suited for handling structured, sequential data in overlapping contexts.
In conclusion, the TCN model stands out as the best performer in this configuration, exhibiting high accuracy, precision, recall, and F1 Score across all evaluation sets. Its superior generalization capabilities further reinforce its robustness for human activity recognition in settings utilizing overlapping data.
4.3.4. Confusion Matrix Analysis for Two-Sensor 50%-Overlapping Configuration
The confusion matrices, displayed in
Figure 11, provide detailed insights into the classification results for each model under the 50% overlapping data configurations. By presenting a comprehensive view of the predicted versus actual activity classes, these matrices enable a thorough assessment of model performance. They not only reveal the overall accuracy of each model but also illustrate specific strengths and weaknesses in handling different activities.
The confusion matrices detail the performance of each model in the two-sensor, 50%-overlapping configuration, showcasing how well each model captures temporal dependencies and transitions between activities.
DT: In
Figure 11a, the DT model correctly classifies a majority of the activity classes. However, it struggles with distinguishing between similar stationary activities, as seen in the 73 misclassifications of the class
stn_ws. This limitation highlights the tendency of the DT model to overfit in overlapping data scenarios, which can result from its hierarchical structure that may not generalize well to nuanced differences between similar classes [
26]. Moreover, the reliance on predetermined splitting criteria might lead to a lack of adaptability when encountering the subtle variations present in stationary activities.
KNN: The KNN confusion matrix in
Figure 11b reflects its ability to accurately classify dynamic activities such as
run_wos, where it achieves 454 correct classifications. However, it shows some difficulty with stationary activities, particularly in the class
stn_ws, where 69 instances were misclassified. This behavior aligns with KNN’s sensitivity to local neighbor choices and highlights its struggles in higher-dimensional feature spaces [
35]. The model’s reliance on proximity for classification can lead to misclassifications, particularly when different activity classes have similar features or when noise is present in the data.
RF:
Figure 11c shows that the RF model effectively handles the majority of classes, particularly
stn_wos, with 956 correctly classified instances. However, it faces some difficulty with closely related classes, such as
wlk_wos, where it misclassifies seven instances. This performance indicates that while ensemble learning provides robustness, RF may still struggle with nuanced distinctions in overlapping configurations [
26].
SVC:
Figure 11d demonstrates the robustness of the SVC model. It correctly classifies 951 instances of
stn_wos and 455 instances of
run_wos, with only nine and one misclassifications, respectively. The SVC’s performance here highlights its ability to manage overlapping data effectively, particularly in scenarios with non-linear decision boundaries [
39]. Its effectiveness in capturing the relevant features of the data allows it to achieve high accuracy, even in the presence of overlapping segments. Additionally, SVC’s capacity to use kernel functions can help in transforming the feature space to better separate classes, further enhancing its classification accuracy.
XGBoost: As shown in
Figure 11e, XGBoost achieves strong classification performance, particularly in
stn_wos, with 958 correct classifications. It shows only minor confusion for wlk_ws, misclassifying a single instance as wlk_wos, which may result from the complexity of interactions in overlapping sequences. This outcome underlines the effectiveness of boosting in refining predictions across diverse activity classes [
27].
GRU:
Figure 11f reveals GRU’s strengths in handling temporal data with overlapping segments, as it achieves 446 correct classifications for
run_wos and 459 for
stn_ws. Despite its sequential learning ability, GRU misclassifies 69 instances of
stn_ws, suggesting that it may still benefit from additional temporal smoothing in stationary categories [
37].
LSTM: As shown in
Figure 11g, the LSTM model effectively classifies both stationary and dynamic activities, achieving 445 correct classifications for
run_wos and 923 for
stn_wos. However, it struggles slightly with
stn_ws, misclassifying 70 instances. This performance highlights LSTM’s strength in capturing long-term dependencies, leveraging overlapping temporal data to improve differentiation between similar activities. The model’s architecture enables it to retain past information, making it well-suited for complex activity recognition tasks, especially where temporal overlap enhances classification accuracy [
38].
TCN: Similarly,
Figure 11h illustrates the high performance of the TCN, which correctly classifies 448 instances of
run_wos and 967 instances of
stn_wos, demonstrating comparable performance to LSTM. TCN benefits from its ability to model both short-term and long-term dependencies through its convolutional architecture, which allows it to process overlapping data more efficiently by capturing spatial and temporal relationships within the data [
28]. TCN shows particular strength in activities such as
run_ws, correctly classifying 224 instances. However, like LSTM, it misclassifies instances of
stn_ws, suggesting that even with overlapping data, stationary activities remain a challenge for both deep learning models.
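The receptive-field growth behind this behavior follows the standard formula for stacked dilated causal convolutions, RF = 1 + (k − 1) Σ dᵢ with dilations doubling per layer; a sketch (kernel size and layer counts are illustrative, not the study’s hyperparameters):

```python
def tcn_receptive_field(kernel_size, n_layers):
    """Receptive field of a stack of dilated causal conv layers
    whose dilation doubles per layer (1, 2, 4, ...)."""
    dilations = [2 ** i for i in range(n_layers)]
    return 1 + (kernel_size - 1) * sum(dilations)

# With kernel size 3, six layers already see 127 past time steps,
# enough to span a multi-second sensor window at typical sampling rates.
for layers in (2, 4, 6):
    print(layers, tcn_receptive_field(3, layers))
```

This exponential coverage is what lets a TCN relate both short-term and long-term context without the step-by-step recurrence of LSTM or GRU.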
Transformer:
Figure 11i demonstrates the Transformer’s strong performance, correctly classifying 451 instances of
run_wos and 930 of
stn_wos. However, the model misclassifies 73 instances of
stn_ws, suggesting that while its self-attention mechanism captures long-range dependencies effectively, it faces challenges distinguishing closely related stationary activities. This indicates that, despite its proficiency with sequential patterns, the Transformer may benefit from further tuning for overlapping configurations in stationary states [
29].
In conclusion, overlapping data segmentation enhances model performance by preserving important temporal transitions. While LSTM, TCN, and Transformer exhibit strong classification capabilities, TCN emerges as the most consistent and accurate model, further affirming the importance of deep learning architectures designed for sequential data in human activity recognition applications. The deep learning models’ capability to leverage both short-term and long-term dependencies enables them to handle the complexity introduced by overlapping segments effectively, making them suitable choices for capturing activity transitions. Further analysis of the comparative performance of these models across configurations will be presented in
Section 4.5, where a holistic evaluation of their strengths and limitations will be discussed.
4.4. Four-Sensor Configurations
4.4.1. Four-Sensors: Model Performance with Non-Overlapping Data
In this configuration, four sensors—accelerometer, gyroscope, linear accelerometer, and gravity—were utilized with non-overlapping data segments. The inclusion of additional sensors aims to capture a wider range of movement patterns, potentially enhancing classification accuracy across different models. This strategy leverages the unique strengths of each sensor, allowing for a more comprehensive analysis of monitored activities.
The four-sensor setup enriches the feature set and provides nuanced information about the subject’s movements, aiding in distinguishing between similar activities. For instance, the gyroscope contributes to rotational movement detection, while the linear accelerometer and gravity sensor provide insights into linear acceleration and gravitational effects. This multidimensional approach is expected to improve the models’ ability to recognize subtle variations in activity patterns.
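One common way to realize such a multidimensional feature set is to compute per-window statistics on each axis of each sensor plus the signal magnitude; a minimal sketch with hypothetical samples (not the study’s actual feature pipeline):

```python
import math

def window_features(window):
    """Mean, standard deviation, and mean magnitude for one 3-axis
    sensor window given as a list of (x, y, z) samples."""
    n = len(window)
    feats = {}
    for i, axis in enumerate("xyz"):
        vals = [s[i] for s in window]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        feats[f"mean_{axis}"] = mean
        feats[f"std_{axis}"] = math.sqrt(var)
    feats["mean_magnitude"] = sum(
        math.sqrt(x * x + y * y + z * z) for x, y, z in window
    ) / n
    return feats

# Illustrative accelerometer window (gravity-dominated, near-stationary);
# repeating this per sensor and concatenating yields the enriched set.
accel_window = [(0.0, 0.1, 9.8), (0.1, 0.0, 9.9), (0.0, -0.1, 9.7)]
print(window_features(accel_window))
```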
The evaluation results in
Table 18 reveal that the
TCN model achieves the highest performance across most metrics, with an F1 Score of
0.9770 on the test set. This model demonstrates strong consistency across training, validation, and test sets, underscoring its robustness in handling non-overlapping configurations for human activity recognition.
DT and KNN models offer reliable baseline performance, with DT reaching an F1 Score of 0.9719 on the test set. However, these simpler models tend to lag behind more advanced models, particularly when handling high-dimensional sensor data [
33,
34].
Among the ensemble models, RF and XGBoost exhibit solid performance, leveraging ensemble learning to enhance generalization. RF achieves a test F1 Score of 0.9753, while XGBoost achieves a slightly lower F1 Score of 0.9735, reflecting the benefits of boosting techniques for managing complex data interactions [
26,
27].
SVC also demonstrates adaptability to non-linear data with a test F1 Score of 0.9735, making it a strong candidate for activity recognition in configurations that require efficient handling of non-overlapping data. The deep learning models GRU and LSTM offer consistent results, with F1 Scores of 0.9738 and 0.9752, respectively, on the test set, highlighting their effectiveness in sequential data processing, even without overlapping segmentation [
37,
38].
Despite its strengths, the Transformer model performs slightly lower than TCN in this configuration, with an F1 Score of 0.9760 on the test set, suggesting that while its self-attention mechanism is beneficial for long-range dependencies, TCN may be more effective in handling structured, sequential data in non-overlapping setups [
29].
In conclusion, the TCN model emerges as the best performer in this configuration, demonstrating high accuracy, precision, recall, and F1 Score across all sets. Its superior generalization capabilities highlight its robustness for human activity recognition in non-overlapping data settings. The ensemble models such as RF and XGBoost show commendable results, reinforcing the utility of ensemble learning approaches in enhancing classification performance. Meanwhile, GRU and Transformer also contribute valuable insights, particularly in capturing temporal dependencies, although they slightly trail behind TCN in this context.
4.4.2. Confusion Matrix Analysis for Four-Sensor Non-Overlapping Configuration
The confusion matrices in
Figure 12 provide additional insights into the classification performance of each model in the four-sensor non-overlapping configuration:
DT: The DT confusion matrix in
Figure 12a reveals that the model accurately classifies 226 instances of
run_wos and 220 instances of
stn_ws. It also shows strong performance in classifying
wlk_ws, with 127 instances correctly identified. However, there are notable misclassifications, with 36 instances of
stn_ws incorrectly labeled as
stn_wos. This misclassification pattern suggests a tendency for the DT model to overfit, especially when dealing with similar stationary activities where sensor readings may overlap [
26].
KNN: The KNN confusion matrix in
Figure 12b demonstrates strong performance, particularly for the
run_wos activity, where 229 instances are correctly classified. However, KNN struggles with distinguishing between
stn_ws and
stn_wos, as evidenced by the 33 misclassifications in the
stn_ws class. This confusion likely arises from KNN’s sensitivity to feature space proximity, particularly in high-dimensional data [
35].
RF: The RF model in
Figure 12c performs well, accurately classifying 228 instances of
run_wos and 480 of
stn_wos. However, it misclassifies 36 instances of
stn_ws as
stn_wos, suggesting limitations in distinguishing subtle temporal features in stationary activities [
26]. While effective in managing feature complexity, RF is slightly outperformed by deep learning models in capturing nuanced activity differences.
SVC: The SVC confusion matrix in
Figure 12d highlights SVC’s high accuracy, particularly in
stn_wos, where it correctly classifies 476 instances. However, the model struggles with distinguishing between
stn_ws and
stn_wos, with 42 instances of
stn_ws misclassified as
stn_wos and five instances of
stn_wos misclassified as
stn_ws. This confusion could be due to the overlapping nature of sensor readings between these stationary activities [
39].
XGBoost: The XGBoost confusion matrix (
Figure 12e) shows strong classification of
run_wos (226 instances) and
stn_wos (481 instances). However, it struggles with subtle distinctions, misclassifying 40 instances of
stn_ws as
stn_wos. While XGBoost’s ensemble framework captures complex data patterns, distinguishing closely related stationary activities remains challenging without sequential processing, a strength of temporal models like LSTM and TCN [
27].
GRU: The GRU model in
Figure 12f shows solid performance, particularly in the classification of
run_wos with 234 correctly identified instances and
stn_ws with 211 correct classifications. However, it has minor challenges in distinguishing between
stn_ws and
stn_wos, with 31 instances of
stn_ws misclassified as
stn_wos. GRU’s sequential processing enables it to retain important context over time, aiding in recognizing both stationary and dynamic activities [
37].
LSTM: The LSTM model in
Figure 12g shows strong classification performance across various activities. Specifically, it accurately classifies 203 instances of
run_wos and 503 instances of
stn_wos. However, the model struggles slightly with
stn_ws, where it misclassifies 37 instances as
stn_wos. This pattern highlights LSTM’s ability to capture both short-term and long-term dependencies in time-series data, which enhances its robustness across activities [
38]. The model also performs well in distinguishing between the
run_ws and
wlk_ws classes, achieving high classification accuracy in these categories, further underlining its effectiveness in handling sequential data.
TCN: In
Figure 12h, the TCN model shows excellent performance in dynamic activities, accurately classifying 234 instances of
run_wos and 500 instances of
stn_wos. TCN’s use of dilated convolutions allows it to model both short-term and long-term dependencies, making it ideal for recognizing complex movement patterns in non-overlapping configurations [
28]. The TCN model proves to be a strong competitor, performing comparably to LSTM, especially in dynamic classes, with minimal misclassifications.
Transformer: The Transformer model, shown in
Figure 12i, demonstrates strong classification performance, leveraging self-attention to effectively capture complex, long-range dependencies [
29]. Although it ranks just below TCN and LSTM for non-overlapping data, the model’s attention mechanism excels at identifying extended sequences. However, subtle distinctions in stationary activities present a challenge, as seen in the 40 instances of
stn_ws misclassified as
stn_wos, likely due to limitations of self-attention in distinguishing closely related activities without temporal continuity [
40].
In conclusion, the analysis of the four-sensor non-overlapping configuration underscores
TCN as the best-performing model overall. The TCN model demonstrates superior accuracy in both dynamic and stationary activities, with minimal misclassifications, thanks to its ability to capture both short-term and long-term dependencies through its dilated convolutional architecture [
28]. While the
LSTM model remains highly competitive and excels in capturing long-term dependencies in stationary activities, TCN’s efficiency and robustness across diverse activity types make it the top choice for this configuration. The
Transformer model, though slightly behind TCN and LSTM, still shows substantial potential by leveraging self-attention to capture long-range dependencies, proving valuable for applications requiring extensive sequential data analysis.
Overall, TCN’s strength in handling non-overlapping, dynamic activities with high accuracy and LSTM’s capability in managing long-duration dependencies highlight the effectiveness of both models for human activity recognition in scenarios where accurate temporal pattern recognition is essential. A detailed comparative analysis of these models across configurations will be presented in
Section 4.5, offering a comprehensive view of their relative advantages and trade-offs.
4.4.3. Four-Sensors: Model Performance with 50% Overlapping Data
In this configuration, four sensors—accelerometer, gyroscope, linear accelerometer, and gravity—are used with a 50% overlapping strategy applied to the data segments. The overlapping segments provide additional continuity and capture transitions between activities more effectively, enhancing the models’ ability to recognize patterns across time and improving overall performance.
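The 50% overlapping strategy corresponds to a sliding window whose stride is half its length; a minimal sketch (the window length and signal below are illustrative):

```python
def segment(signal, window_size, overlap=0.5):
    """Split a 1-D signal into fixed-length windows; consecutive
    windows share `overlap` of their samples."""
    stride = max(1, int(window_size * (1 - overlap)))
    return [
        signal[start:start + window_size]
        for start in range(0, len(signal) - window_size + 1, stride)
    ]

samples = list(range(10))
print(segment(samples, 4, overlap=0.5))  # stride 2: windows share half
print(segment(samples, 4, overlap=0.0))  # stride 4: non-overlapping
```

With a 50% overlap, every activity transition appears near the center of some window rather than only at window boundaries, which is the continuity benefit the models exploit here.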
The evaluation results in
Table 19 reveal that the
TCN model achieves the highest performance across most metrics, with an F1 Score of
0.9762 on the test set. This model demonstrates strong consistency across training, validation, and test sets, underscoring its robustness in handling overlapping configurations for human activity recognition.
DT and KNN models provide reliable baseline performance, with DT achieving an F1 Score of 0.9714 on the test set. However, these simpler models tend to lag behind more advanced architectures, particularly when managing high-dimensional sensor data [
33,
34].
RF and XGBoost show solid performance. RF achieves a test F1 Score of 0.9741, while XGBoost attains a slightly higher score of 0.9760, reflecting the effectiveness of ensemble learning in capturing complex data interactions [
26,
27].
SVC also performs well, reaching a test F1 Score of 0.9745. Its ability to handle non-linear relationships makes it a strong contender for activity recognition in overlapping data configurations. The deep learning models GRU and LSTM offer consistent results, with F1 Scores of 0.9762 and 0.9760, respectively, demonstrating their capacity to manage sequential data processing effectively [
37,
38].
While the Transformer model shows robust capabilities, achieving an F1 Score of 0.9761, it falls just slightly below the TCN in this configuration. This outcome suggests that while the Transformer’s self-attention mechanism is advantageous for long-range dependencies, the TCN may be better suited to handle structured, sequential data in overlapping setups [
29].
In conclusion, although the TCN and GRU models achieve identical high scores in overall metrics, the TCN model is superior due to its lower misclassification rates revealed in the confusion matrices. This means TCN more effectively distinguishes between similar activities in overlapping data, capturing temporal dependencies and subtle variations better than GRU. Consequently, TCN’s enhanced class-wise performance and superior generalization make it the most robust choice for human activity recognition in overlapping data settings, meeting the demands for high accuracy and reliability in practical applications.
4.4.4. Confusion Matrix Analysis for Four-Sensor 50%-Overlapping Configuration
In this configuration, four sensors—accelerometer, gyroscope, linear accelerometer, and gravity—are used with a 50% overlapping strategy applied to the data segments. This method enhances the models’ ability to capture contextual transitions between activities, improving overall performance and classification accuracy.
The confusion matrices reveal the following patterns:
DT: The DT confusion matrix in
Figure 13a shows accurate classification of 441 instances of
run_wos and 960 instances of
stn_ws. However, it records 15 misclassifications for the
run_wos class and one for
wlk_wos. These misclassifications suggest that, despite the additional overlapping data, DT still faces challenges in accurately distinguishing activities with similar sensor signals [
26].
KNN: The KNN confusion matrix in
Figure 13b indicates improved performance, particularly for the 456 correct classifications of
run_wos and 442 correct classifications of
wlk_wos. However, it misclassified 59 instances for
stn_ws, reflecting KNN’s sensitivity to overlapping data and potential difficulty in distinguishing between stationary activities [
35].
RF: The RF confusion matrix in
Figure 13c shows strong results, with 960 correct classifications for the
stn_ws class and only seven misclassifications for the
run_wos. RF’s ensemble learning approach enhances its ability to generalize well across various sensor features, making it effective in this configuration, despite its occasional challenges in capturing complex dependencies among sensor inputs [
26].
SVC: The SVC confusion matrix in
Figure 13d demonstrates strong performance, correctly classifying 455 instances of
run_wos and 952 instances of
stn_wos. While generally robust, SVC faces some challenges with stationary classes, misclassifying 73 instances of
stn_ws as
stn_wos. This suggests that although SVC effectively handles overlapping sensor data, further refinement could enhance accuracy in distinguishing stationary states [
39].
XGBoost: The XGBoost confusion matrix in
Figure 13e demonstrates strong performance, with 958 correct classifications for
stn_wos and 452 for
run_wos, with only four misclassifications in the latter. However, it misclassifies 73 instances of
stn_ws as
stn_wos, highlighting a challenge in distinguishing closely related stationary classes. This indicates XGBoost’s effectiveness with complex data interactions, albeit with slight limitations in handling subtle distinctions in stationary activities.
GRU: As shown in
Figure 13f, the GRU model performs well, with 923 correct classifications for
stn_wos and 461 for
stn_ws. However, it misclassifies 69 instances of
stn_ws as
stn_wos, highlighting some difficulty in distinguishing similar stationary activities. Overall, GRU’s ability to capture temporal dependencies makes it effective for most classes, though additional tuning could improve its handling of overlapping sensor data [
37].
LSTM: The LSTM confusion matrix in
Figure 13g indicates strong performance, particularly with 972 correct classifications for the
stn_wos class. However, there are some minor misclassifications, including 64 instances of
stn_ws incorrectly classified as
stn_wos. Despite these few errors, the LSTM model generally shows robustness across most activity categories, leveraging its ability to capture sequential dependencies effectively [
38].
TCN: The TCN confusion matrix in
Figure 13h demonstrates high classification accuracy across activities, with 984 correct classifications for the
stn_wos class. However, there are 60 misclassifications for the
stn_ws class, indicating a slight limitation in distinguishing between stationary activities. Overall, TCN effectively handles long-term dependencies, underscoring its strength in processing sequential data [
28].
Transformer: The Transformer confusion matrix in
Figure 13i showcases a strong ability to classify activities, achieving 927 correct classifications for the
stn_wos class. Despite some misclassifications, its self-attention mechanism effectively captures complex relationships over longer sequences, demonstrating its applicability for activity recognition.
In summary, the 50% overlapping configuration enhances the performance of all models. The TCN stands out as the best performer, achieving the highest accuracy and lowest misclassifications, while the GRU and XGBoost also demonstrate commendable results. These findings underscore the importance of selecting appropriate models based on the specific requirements of human activity recognition tasks, particularly in leveraging overlapping data to enhance classification accuracy.
4.5. Comparative Analysis Across Configurations
This section provides a detailed comparative analysis of the performance metrics for various models across different sensor configurations and data segmentation strategies, focusing on the test set results to evaluate the models’ generalization capabilities.
4.6. Performance Overview
The analysis reveals that the
Temporal Convolutional Network (TCN) consistently outperforms other models across all configurations. As summarized in
Table 20, TCN achieves the highest scores in accuracy, precision, recall, and F1 Score, reaching up to
0.9770 in the four-sensor non-overlapping setup. This superior performance is attributed to TCN’s ability to capture temporal dependencies effectively through dilated convolutions, making it particularly well-suited for handling multi-sensor data.
Recurrent models like LSTM and GRU also show strong performance on the test set, particularly in configurations with overlapping data and multiple sensors. For instance, LSTM achieves a test accuracy of 0.9760 in the four-sensor 50%-overlapping setup, closely following TCN and demonstrating its effectiveness in capturing long-term temporal patterns.
Traditional models such as Random Forest (RF) and Support Vector Classifier (SVC) perform adequately but do not match the performance of deep learning models. For example, RF achieves a test accuracy of 0.9753 in the four-sensor non-overlapping configuration, while SVC reaches up to 0.9745 in the four-sensor 50%-overlapping setup. This highlights their limitations in handling complex temporal dependencies inherent in ITS applications.
4.7. Detailed Analysis of Model Performance
4.7.1. Two Sensors with Non-Overlapping Data
In this configuration, the TCN achieves a test accuracy of 0.9757, outperforming other models. LSTM and GRU also perform well, with test accuracies of 0.9738 and 0.9748, respectively. Traditional models like RF and SVC achieve lower accuracies of 0.9720 and 0.9726, respectively, indicating challenges in capturing temporal patterns with limited sensor input.
4.7.2. Two Sensors with 50% Overlapping Data
With overlapping data, TCN’s performance improves to a test accuracy of 0.9765. LSTM and GRU also see performance gains, achieving test accuracies of 0.9762 and 0.9750, respectively. The overlapping data provide additional temporal context, enhancing the models’ ability to capture dependencies. Traditional models show modest improvements but remain behind deep learning models.
4.7.3. Four Sensors with Non-Overlapping Data
Increasing the sensor count to four, TCN achieves its highest test accuracy of 0.9770. LSTM follows with a test accuracy of 0.9752, and GRU achieves 0.9738. Traditional models like RF and XGBoost show improved performance due to the additional sensor data, with test accuracies of 0.9753 and 0.9735, respectively. KNN and DT lag behind, indicating limitations in scaling with more sensors.
4.7.4. Four Sensors with 50% Overlapping Data
In this configuration, both the TCN and GRU models achieve a high test accuracy of 0.9762. However, analysis of their confusion matrices reveals that the TCN model outperforms GRU in terms of misclassification rates across all activity classes.
Confusion Matrix Insights:
TCN Model: Fewer misclassifications overall, particularly in challenging stationary activities, indicating better discrimination of subtle patterns.
GRU Model: More misclassifications in certain classes, suggesting difficulties in distinguishing similar activities with overlapping data.
Why TCN is the Best Model:
Superior Class-wise Accuracy: TCN minimizes misclassifications across all activities, enhancing robustness and reliability critical for ITS applications.
Effective with Overlapping Data: TCN’s dilated causal convolutions capture long-term dependencies more effectively than GRU.
Better at Distinguishing Similar Activities: TCN excels in identifying subtle differences, reducing misinterpretation risks in real-world scenarios.
Conclusion: Despite identical overall metrics, the TCN model is the superior choice due to its lower misclassification rates and better class-wise performance. This underscores the importance of evaluating models beyond aggregate metrics to ensure reliable and effective HAR systems in ITS applications.
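Per-class recall computed from a confusion matrix exposes exactly these differences; a sketch with made-up counts (not the study’s actual matrices) showing how two models with identical overall accuracy can diverge on a hard class:

```python
def per_class_recall(matrix, labels):
    """Recall (row-wise accuracy) per class from a confusion matrix
    whose rows are true labels and columns are predictions."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls

labels = ["stn_ws", "stn_wos"]
# Hypothetical counts: both models classify 1420 of 1500 correctly,
# yet differ sharply on the harder stn_ws class.
model_a = [[460, 70], [10, 960]]   # weaker on stn_ws
model_b = [[500, 30], [50, 920]]   # more balanced
print(per_class_recall(model_a, labels))
print(per_class_recall(model_b, labels))
```

Aggregate accuracy alone would rank these two models as equal; the per-class view is what separates them, which is the basis for preferring TCN over GRU here.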
The LSTM model also performs well with a test accuracy of 0.9760 but does not surpass TCN in class-wise accuracy. Traditional models show improvement but still do not match the performance of deep learning models.
4.8. Impact of Overlapping Data and Sensor Count
The inclusion of overlapping data and additional sensors generally enhances model performance on the test set:
Overlapping Data: Provides additional temporal context, improving the models’ ability to capture dependencies. Deep learning models like TCN and LSTM benefit significantly, as evidenced by their higher test set metrics in overlapping configurations.
Increased Sensor Count: Offers more comprehensive data, allowing models to learn from richer features and improving classification capabilities on unseen data. TCN and LSTM benefit greatly from additional sensors, capturing complex spatiotemporal patterns more effectively.
However, in configurations with four sensors and overlapping data, the performance of TCN slightly decreases, suggesting that excessive redundancy may not always contribute to better generalization and might introduce overfitting.
4.9. Comparison Based on Performance Metrics
Accuracy: TCN consistently achieves the highest test set accuracy across most configurations, indicating robust overall performance and generalization capability.
Precision: High precision scores by TCN (up to 0.9770) reflect its effectiveness in minimizing false positives, crucial in ITS to avoid incorrect detections.
Recall: TCN’s high recall values demonstrate its ability to identify true positives reliably, ensuring critical events are not missed.
F1 Score: The balanced F1 Scores suggest that TCN maintains a good trade-off between precision and recall on unseen data, essential for reliable decision-making in ITS applications.
4.10. Recommendation
Based on the test set analysis, the TCN in the four-sensor non-overlapping configuration emerges as the most suitable model for ITS applications, achieving the highest test set performance metrics. The advantages of this configuration include:
Enhanced Temporal Modeling: TCN’s architecture effectively captures both short- and long-term dependencies in the data, even on unseen data.
Improved Contextual Understanding: The use of multiple sensors provides richer context, enabling better interpretation of dynamic environments in real-world scenarios.
Optimal Data Utilization: Non-overlapping data with multiple sensors avoids redundancy, enhancing generalization and preventing potential overfitting.
Robustness and Scalability: High test set performance indicates that the model is robust and can generalize well, increasing adaptability to different ITS challenges.
5. Conclusions and Future Works
5.1. Conclusions
This study makes significant contributions to the field of Human Activity Recognition (HAR) within Intelligent Transportation Systems (ITS) by effectively classifying passenger activities into six distinct categories: walking, running, and stationary (both while using and not using the smartphone). These insights hold substantial implications for optimizing smart transportation systems, as they can enhance decision-making processes for drivers and improve the overall commuter experience. By leveraging these findings, the study lays a foundation for the future development of smart transportation systems aimed at making public transit more efficient and user-friendly, thereby encouraging greater adoption of public transportation.
In this investigation, a comprehensive comparison was conducted between traditional machine learning models and advanced deep learning techniques. The models evaluated included Decision Tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Classifier (SVC), XGBoost, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCN), and Transformer models. The evaluation encompassed various data segmentation strategies, including both non-overlapping and 50% overlapping configurations, utilizing smartphone sensor data. The notable performance enhancements observed with overlapping data segments, particularly for deep learning models, underscore the advantages of integrating temporal continuity into activity recognition tasks.
The results reveal that employing four sensors significantly elevates model performance compared to two-sensor configurations. The implementation of 50% overlapping data segments further augmented the models’ generalization capabilities. Among the models evaluated, deep learning models—specifically LSTM, TCN, GRU, and Transformer—demonstrated superior outcomes. The TCN model exhibited exceptional performance in the four-sensor non-overlapping configuration, achieving the highest accuracy of 97.70%. In the same configuration, the LSTM and GRU models also achieved high accuracies of 97.52% and 97.38%, respectively, reinforcing their strong capacity for capturing both short-term and long-term dependencies in time-series data. The Transformer model also showed robust performance, highlighting the effectiveness of attention mechanisms in capturing temporal patterns.
Traditional machine learning models such as KNN, DT, RF, and SVC showed reasonable performance but consistently underperformed compared to deep learning models, especially in complex scenarios characterized by multi-sensor inputs and overlapping data segments. Ensemble models like RF and XGBoost improved upon simpler models by leveraging multiple decision trees and gradient boosting techniques, respectively. However, they still fell short in capturing the intricate temporal dependencies inherent in sequential sensor data, which deep learning models managed more effectively.
Moreover, the TCN model displayed notable performance across various configurations, particularly in handling the intricacies of overlapping sequences. Its architecture, leveraging dilated convolutions, facilitated the capture of temporal patterns over extended periods, making it an effective alternative to recurrent models like LSTM and GRU in certain scenarios. These findings accentuate the increasing significance of deep learning models, particularly TCN and LSTM, in real-time, context-aware ITS applications, such as driver monitoring and traffic condition prediction. This research not only highlights the strengths of deep learning methodologies in HAR tasks but also advocates for their broader application in advancing smart transportation systems [
28,
41].
5.2. Future Work
Beyond the directions detailed below, future research should incorporate a broader range of activities and more diverse user populations to enhance the generalizability of the models, investigate sensor fusion techniques to optimally combine data from multiple sensors, and integrate explainable AI techniques to improve the transparency of deep learning models, facilitating their acceptance in safety-critical ITS applications.
The findings of this study open several promising avenues for future research aimed at enhancing the role of machine learning and deep learning in Intelligent Transportation Systems (ITS):
- 1. Hybrid Model Development: Exploring hybrid models that integrate traditional machine learning techniques with advanced deep learning architectures such as LSTM, TCN, GRU, and Transformer. Such models could combine the interpretability and efficiency of traditional methods with the sophisticated pattern recognition capabilities of deep learning, potentially yielding improved performance in dynamic, real-time ITS environments [42,43].
- 2. Customized Preprocessing Techniques: Investigating preprocessing strategies tailored to specific ITS applications. Traffic monitoring and driver behavior analysis may benefit from domain-specific preprocessing that accounts for the unique noise characteristics and data variability of different sensor inputs, improving model accuracy and adaptability in real-time conditions and enabling better detection of critical activities [44,45].
- 3. Personalized Models through Transfer Learning: Developing personalized models that adapt to individual driving patterns and environmental contexts using techniques such as transfer learning. Fine-tuning models to individual users would make systems more context-aware and significantly improve the relevance and accuracy of predictions in tailored transportation applications, addressing diverse user needs effectively [46].
- 4. Integration of Multimodal Data Sources: Combining smartphone sensor data with inputs from vehicle sensors, environmental monitoring systems, and other relevant technologies. Multimodal integration can yield richer datasets and a deeper understanding of transportation behaviors and conditions, ultimately fostering more intelligent and responsive ITS solutions [28,41].
- 5. Real-Time Implementation and Optimization: Deploying the proposed models within live ITS environments to evaluate their practical performance and computational efficiency. Future research should address the computational constraints and latency issues of real-time ITS applications, optimizing models for scalability and efficient operation in settings where response times and computational resources are limited [39].
- 6. Energy-Efficient Architectures: Investigating energy-efficient model architectures suitable for deployment on resource-constrained devices such as smartphones and embedded systems. This is crucial for practical ITS applications where power consumption is a limiting factor.
- 7. Explainable AI Techniques: Integrating explainable AI methods to enhance the transparency and interpretability of deep learning models. This can facilitate their acceptance in safety-critical ITS applications by providing insight into model decision-making processes.
- 8. Expanded Activity Sets and Diverse Populations: Incorporating a broader range of activities and more diverse user populations to enhance the generalizability of the models. This can improve the robustness of HAR systems across different demographic groups and usage scenarios.
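To make the preprocessing concern in item 2 concrete, the distinction between non-overlapping and overlapping sliding-window segmentation, which this study compares across configurations, can be sketched in a few lines of pure Python. The window length and step values below are illustrative assumptions, not the study's actual parameters:

```python
# Illustrative sketch: sliding-window segmentation of a 1-D sensor stream.
# step == window -> non-overlapping windows (each sample used once)
# step <  window -> overlapping windows (adjacent windows share samples)

def segment(signal, window, step):
    """Split a 1-D signal into fixed-length windows advanced by `step`."""
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, step)]

readings = list(range(10))                           # stand-in for accelerometer samples
non_overlap = segment(readings, window=4, step=4)    # [[0..3], [4..7]]
overlap = segment(readings, window=4, step=2)        # [[0..3], [2..5], [4..7], [6..9]]
print(len(non_overlap), len(overlap))                # 2 4
```

Non-overlapping windows keep training examples statistically independent, which is one plausible reason the non-overlapping configuration generalized well in this study; overlapping windows trade that independence for more training examples per recording.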
5.3. Recommendation
Based on the comprehensive analysis and performance metrics evaluated across different configurations, the Temporal Convolutional Network (TCN) in the four-sensor non-overlapping configuration emerges as the most suitable model for ITS applications. This recommendation is grounded in its superior accuracy, precision, recall, and F1 Score, as detailed in Table 20.
The advantages of this configuration include:
- Enhanced Temporal Modeling: TCN's dilated causal convolutions capture both short- and long-term dependencies in the data, supporting strong performance even on unseen data [28].
- Improved Contextual Understanding: The use of multiple sensors provides richer context, enabling better interpretation of dynamic environments in real-world scenarios.
- Optimal Data Utilization: Non-overlapping windows across multiple sensors avoid redundancy, enhancing generalization and reducing the risk of overfitting.
- Robustness and Scalability: High test set performance indicates that the model is robust and generalizes well, increasing its adaptability to diverse ITS challenges.

This recommendation aligns with the objectives of ITS, which require reliable and accurate real-time analysis of multi-sensor data to improve transportation efficiency and safety.
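The dilated causal convolutions underlying TCN's temporal modeling can be illustrated with a minimal, framework-free sketch. The kernel size, dilation schedule, and weights below are illustrative only, not the architecture evaluated in this study:

```python
# Illustrative sketch of a dilated causal 1-D convolution, the TCN building block.
# "Causal" means output[t] depends only on the present and past:
# x[t], x[t-d], x[t-2d], ..., never on future samples.

def dilated_causal_conv1d(x, weights, dilation):
    """Apply one dilated causal convolution with zero-padding of the past."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation
            acc += w * (x[idx] if idx >= 0 else 0.0)  # positions before t=0 are zero
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Timesteps visible to one output after stacking dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations (1, 2, 4, 8) with kernel size 3 already cover 31 timesteps,
# which is how a TCN models long activity sequences without recurrence.
print(receptive_field(3, [1, 2, 4, 8]))  # 1 + 2*(1+2+4+8) = 31
```

The exponential growth of the receptive field with depth is what lets TCN match or exceed LSTM and GRU on long sequences while remaining fully parallelizable at training time.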
5.4. Summary of Findings
In conclusion, the Temporal Convolutional Network stands out as the optimal model for Intelligent Transportation Systems, particularly in the four-sensor non-overlapping configuration. Its superior performance across all key metrics on the test set makes it highly recommended for applications that require dependable and precise real-time analysis of multi-sensor data.