1. Introduction
Intelligent vehicles, equipped with advanced sensing systems, are transforming the way we perceive and navigate driving environments. These vehicles can gather extensive amounts of contextual information, surpassing the capabilities of human drivers. This advanced perceptual capability serves as a critical foundation for comprehending the driving environment and making accurate decisions in autonomous vehicle operations. Based on these sensor inputs, autonomous vehicles must infer the intentions of surrounding vehicles and accurately predict their future trajectories [1,2]. Such highly accurate trajectory predictions are crucial not only for anticipating upcoming driving conditions but also for informing the subsequent tactical decisions of autonomous vehicles [3].
In the context of complex urban road networks, decision-making in intelligent vehicles is ultimately reflected in their trajectory and driving behavior. Predicting a vehicle’s future trajectory is therefore a pivotal factor in ensuring safety in dynamic traffic conditions [4]. Furthermore, trajectory prediction is indispensable for modeling traffic states, facilitating decision-making, planning, and anticipating hazards. Accurately forecasting the movements of nearby vehicles is essential for the effective operation of intelligent vehicles. Yet, achieving consistent precision in trajectory predictions, particularly in challenging environments, remains a notable challenge [5]. Simultaneously, the diversity in driving behaviors and the inherent complexity of urban traffic also pose considerable challenges to achieving reliable and accurate predictions [6].
Traditional trajectory prediction models often overemphasize historical trajectory data, neglecting the fact that its influence may diminish over time. It is essential to evaluate the relevance of historical features across varying temporal horizons [7]. Moreover, many models address temporal and spatial attention independently, thereby overlooking their interdependencies. Beyond spatial feasibility and target positions, a driver’s style plays a significant role in influencing trajectory modifications. To enhance accuracy and stability, these challenges must be systematically addressed [8,9].
Recent studies [10] have categorized vehicle trajectory prediction approaches into four primary methodologies: physical models, machine learning, reinforcement learning, and deep learning. Among these, traditional methods, such as dynamic models, kinematic models [11], Kalman filtering [12], and Monte Carlo methods [13], utilize features like vehicle speed, acceleration, and turning rate. Although these approaches are computationally efficient, they often struggle to maintain accuracy in complex traffic scenarios. Machine learning approaches have introduced new paradigms for trajectory prediction. For example, Bahram et al. [14] integrated model-driven interaction perception with supervised learning, while Zhu et al. [15] employed SVM and GMM to analyze multi-participant interactions. However, these models often require deterministic maneuver characteristics, limiting them to simpler, short-term scenarios. Reinforcement learning (RL) has emerged as a promising framework for managing the complexity and high dimensionality of driving environments. Approaches such as HRL [16] and GAIL [17] have demonstrated efficacy in autonomous driving policy learning. However, RL’s computational intensity and lengthy training times hinder its feasibility for real-time applications.
Deep learning has gained significant prominence, marked by advancements in sequence models, graph neural networks (GNNs), and generative models. Sequence networks extract feature sequences from trajectory data, with models such as MFP [18] and mmTransformer [19] integrating environmental context for enhanced predictions. GNNs, such as TrGNN [20] and hierarchical GNN models [21], capture interdependencies among traffic participants. Generative Adversarial Networks (GANs) have been applied to analyze multimodal trajectory distributions, as seen in GAN-VEEP [22] and ME-GAN [23]. Generally, deep learning approaches outperform traditional methods in terms of both accuracy and scalability. GCN-based Prediction [24] uses spatial and temporal graphs in connected environments, highlighting its advantage in cooperative driving scenarios. Attention-based GCN [25] introduces a novel use of vehicle-to-traffic rule interactions, offering enhanced prediction in complex environments such as intersections. The Feature-Boosting Network [26] emphasizes feature enhancement techniques, achieving remarkable gains in trajectory prediction accuracy through its novel high-dimensional space analysis. L. Lin et al. introduced the STA-LSTM model incorporating an attention mechanism in 2022 [27].
While deep learning models generally achieve high prediction accuracy, they often face challenges in identifying key historical moments and influential neighboring vehicles. The STA-LSTM model’s spatial–temporal attention mechanism highlights crucial trajectory points and evaluates the influence of surrounding vehicles, enhancing interpretability. However, the 1 s prediction horizon of the model constrains its practical applicability and raises potential safety concerns. Moreover, while the model enhances interpretability, this improvement does not result in a significant increase in accuracy when compared to more advanced models such as CS-LSTM. The grid-based spatial modeling approach, while simplifying computational processes, is insufficient for capturing the intricate interactions present in dynamic environments. To address these limitations, future work should consider incorporating advanced data structures, such as graph neural networks (GNNs). Moreover, the STA-LSTM model’s predominant emphasis on highway scenarios restricts its applicability to diverse urban environments, necessitating further validation to ensure the model’s accuracy in trajectory prediction on urban road networks.
Accurate trajectory prediction must take into account both the temporal continuity of vehicle motion and the spatial influence of surrounding vehicles. Current research often neglects driver behavior variability, hindering personalized intelligent systems. This study establishes a road network using a fixed reference frame centered on the target vehicle, discretizing the surrounding road space into a 3 × 13 grid (Figure 1). By integrating driver behavior variability, the model achieves long-term trajectory prediction while accounting for different driving styles in multi-vehicle scenarios.
In this study, we developed an enhanced trajectory prediction model based on Long Short-Term Memory (LSTM) networks, named HTSA-LSTM. This model integrates spatiotemporal attention mechanisms and accounts for variations in driving behavior, specifically designed for long-term trajectory prediction in multi-vehicle interaction scenarios, predicting the potential positions a target vehicle may reach within the forecast horizon. The main contributions of this work are as follows:
Dynamic Spatiotemporal Attention Mechanism: This mechanism captures essential spatiotemporal dependencies between moving vehicles by integrating dynamic temporal and spatial attention. The model selectively identifies critical moments within historical traffic data and dynamically adjusts weight distributions according to real-time traffic conditions, enhancing its adaptability. This capability significantly improves prediction accuracy, especially in rapidly changing traffic environments, demonstrating the model’s effectiveness in handling complex, evolving scenarios.
Driver Behavior Analysis Module: This module segments time-series driving data into behavioral units, enhancing prediction accuracy by incorporating historical trajectories. Driving primitives are extracted through a Sparse Inverse Covariance Clustering (SICC-SC) method, clustering trajectory data to reveal distinct behaviors without predefined labels, achieving robust adaptability. Temporal consistency is maintained via a dynamic programming-based Expectation–Maximization (EM) framework, while a Toeplitz graphical Lasso with ADMM refines model sparsity, enabling precise and interpretable risk analysis and demonstrating effectiveness in complex, multi-vehicle scenarios.
Real-World Model Validation: Extensive data collection experiments were conducted across diverse road conditions in Wuhan, China, to assess the model’s generalization and effectiveness. Data from 20 drivers were collected over 12 km of urban roads, 34 km of expressways, and 45 km of highways. After preprocessing, this dataset validated the HTSA-LSTM model’s performance under real-world driving conditions, demonstrating its robustness and applicability in varied traffic environments.
In summary, this study introduces the HTSA-LSTM model, which integrates dynamic spatiotemporal attention mechanisms and variations in driving behavior to improve the prediction of future vehicle trajectories. The following sections will discuss the model’s architecture, the principles of the spatiotemporal attention mechanism, the training and evaluation using the NGSIM dataset, and the validation of its effectiveness with real-world driving data.
This paper is structured into four principal components. Section 2 delineates the model architecture development, emphasizing the integration of temporal–spatial attention mechanisms and a driving style analysis module for performance enhancement. Section 3 presents a rigorous experimental validation protocol, wherein individual module contributions were systematically assessed through comprehensive evaluations on the NGSIM dataset. Section 4 presents comparative experiments on an authentic real-world driving dataset to corroborate the model’s practical applicability and operational robustness. Finally, Section 5 summarizes and analyzes the entire text.
2. Model Development
In urban road network systems, straight road segments constitute the primary road form. Therefore, this chapter focuses on the driving states of traffic participants on straight-road segments. A fixed reference frame is employed in this study, with the origin fixed at time t at the geometric center of the target vehicle’s vertical projection onto the road surface [28], as shown in Figure 1. In this reference frame, the y-axis aligns with the direction of the target vehicle’s motion on the road, while the x-axis is perpendicular to the vehicle’s motion direction.
The historical trajectory of a traffic participant is defined as a numerical vector, where the values alternate between x-coordinates and y-coordinates and are separated by fixed time steps corresponding to the device’s sampling frequency. A trajectory consisting of n points is represented as a sequence of 2n values x_1, y_1, x_2, y_2, …, x_n, y_n, all of which are elements of the set T, defined as follows:
The trajectory point indices are mapped to timestamps through a mapping function, which follows the conditions specified in Equation (2), where Δt is the fixed time step:
In this chapter, the trajectory prediction problem is defined as the following sequence generation task: given an observed trajectory of m < n points as the sequence x_1, y_1, …, x_m, y_m, the goal is to predict each remaining point as accurately as possible by minimizing some error metric between the actual and predicted trajectories.
In the HTSA-LSTM trajectory prediction model proposed in this chapter—based on LSTM, integrating spatiotemporal attention mechanisms, and accounting for differences in driver behavior—a fixed reference frame is used, as illustrated in Figure 2. The model’s input consists of the instantaneous coordinates of the target vehicle and of the neighboring vehicles in adjacent lanes within a given distance at each observed time step. The specific value of this distance must comply with local traffic safety regulations, matching the safe distance required at the current driving speed. The model’s output is a probability distribution over the possible trajectory, i.e., over the coordinates that the target vehicle may reach within the prediction time steps.
2.1. Traditional Models
When utilizing traditional neural networks for time-series predictions, a significant limitation arises due to the network’s inherent disregard for whether information from a preceding time frame might influence tasks in subsequent moments. This oversight renders traditional neural networks unsuitable for time-series problems. In the context of vehicle trajectory prediction, it is indisputable that driving data from previous time intervals has a substantial impact on subsequent prediction tasks. To address this issue of “long-term dependency”, Long Short-Term Memory (LSTM) recurrent neural networks were introduced. These LSTM units are capable of analyzing temporal dependencies within the input data to the network. By incorporating gating mechanisms, LSTM units enhance the memory capabilities of standard recurrent units. Specifically, an LSTM consists of a forget gate, an input gate, an output gate, and a memory cell, as illustrated in Figure 2, which shows the traditional LSTM architecture. Several variants of the standard LSTM have emerged, such as LSTMs without forget gates and those with peephole connections. However, in the HTSA-LSTM model, all LSTMs utilized are standard LSTM units.
Forget Gate: Upon receiving input into the LSTM unit, the first task of the LSTM is to determine which information from the cell state should be discarded or “forgotten”. This decision is governed by the “forget gate layer” of the network. The forget gate takes two inputs: the output value from the previous time step h_{t−1} and the current input value x_t. It then determines the resulting information f_t, which represents the magnitude of information to be discarded. Each value in the cell state C_{t−1} is scaled within the range [0, 1], where 1 indicates “fully retain this state” and 0 indicates “completely disregard this state.” The forget gate’s update process is as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)

where W_f is the weight matrix, b_f is the bias term, h_{t−1} is the output value from the previous network unit, x_t is the current network input, and σ is the sigmoid function.
Input Gate: After information has been filtered through the forget gate, it is necessary to determine which information should be preserved in the LSTM cell state. This functionality is primarily achieved through the “input gate” layer and the tanh layer. The “input gate” layer primarily decides which information should be updated, while the tanh layer creates a new candidate vector C̃_t that can be integrated into the cell state update. The input gate’s update process is as follows:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)

where W_i and W_C are the weight matrices, b_i and b_C are the bias terms, and tanh refers to the hyperbolic tangent function.
LSTM Cell State Update: The process of updating the cell state involves transitioning the old state C_{t−1} to the new state C_t. Following the selection of information by the forget gate and the input gate, the old cell state is multiplied by the value f_t determined by the forget gate. This result signifies that the forget gate has opted to discard certain information. Subsequently, the value i_t from the input gate is multiplied by the candidate vector C̃_t and added to the previous result, thereby incorporating some new information into the cell state. The cell state update is expressed as:

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

where f_t denotes the output of the forget gate, i_t is the value from the input gate, and C̃_t represents the new candidate value, scaled according to the network’s updates for each cell state.
Output Gate: The determination of the output value is contingent upon the updated cell state. This value is derived by passing the current input and the hidden state from the previous time step through a fully connected layer, followed by the application of a sigmoid function, which yields the output value o_t within the range [0, 1]. A value of 0 indicates no output, while a value of 1 indicates full output. The cell state is then processed through a tanh function to produce a result within the range of [−1, 1], which is multiplied by the output gate value to obtain the final hidden state:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

where W_o is the weight matrix and b_o is the bias term.
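The four gate equations above can be condensed into a single step function. The following NumPy sketch is illustrative only; the stacked-gate weight layout, random initialization, and toy dimensions are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One standard LSTM step. W maps [h_prev; x_t] to the four stacked
    gate pre-activations (forget, input, candidate, output)."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:d])            # forget gate
    i_t = sigmoid(z[d:2 * d])        # input gate
    g_t = np.tanh(z[2 * d:3 * d])    # candidate cell values
    o_t = sigmoid(z[3 * d:4 * d])    # output gate
    c_t = f_t * c_prev + i_t * g_t   # cell state update
    h_t = o_t * np.tanh(c_t)         # hidden state
    return h_t, c_t

# Toy dimensions: hidden size 4, input size 3 (illustrative only).
rng = np.random.default_rng(0)
d, m = 4, 3
W = rng.normal(scale=0.1, size=(4 * d, d + m))
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=m), h, c, W, b)
```

Running the step repeatedly over a trajectory sequence yields the hidden-state matrix consumed by the attention layers below.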
2.2. Attention Mechanism
Traditional models treat temporal and spatial attention independently. In contrast, the HTSA-LSTM model integrates these mechanisms, allowing them to interact and influence one another, resulting in more accurate predictions. The following sections will detail the working principles of the dynamic spatiotemporal attention mechanism and its advantages in real-world applications.
2.2.1. Temporal Attention Mechanism
In the HTSA-LSTM model, at a given time step, the T-step historical trajectories of the target vehicle and its neighboring vehicles are used as the input to the model. For each vehicle v, the LSTM units in HTSA-LSTM produce hidden states H_v ∈ R^(T×d), with d representing the length of the hidden state. After generating these hidden states, the temporal attention weights associated with v, denoted as α_v, are computed as follows:

α_v = softmax(H_v w_α)  (11)

where w_α represents the learnable weights. Next, by combining the hidden states H_v with the temporal attention weights α_v, the tensor cell value z_v associated with v is derived as follows:

z_v = α_v^T H_v  (12)
Equation (12) computes the tensor cell value through temporal attention-weighted aggregation of the LSTM hidden states, yielding a compact latent representation that encapsulates discriminative temporal patterns within the observation window. Specifically, given the softmax-normalized attention weights derived in Equation (11), this operation aggregates the hidden states via Equation (12), dynamically emphasizing salient historical moments (e.g., abrupt maneuvers) while compressing contextual information into a single feature vector. By integrating the LSTM’s sequential modeling capacity with attention-driven dynamic weight allocation, this mechanism efficiently balances temporal dependencies and spatially aware interactions in trajectory prediction, serving as a critical bridge between temporal context extraction and spatial relation learning.
Overall, all tensor cell values are utilized to compute the spatial-level attention weights and to predict the trajectory of the target vehicle.
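As a minimal sketch of the temporal attention step described above, assuming the scoring function is a learned linear projection followed by a softmax (the dimensions and the random inputs are illustrative, not the paper's trained parameters):

```python
import numpy as np

def temporal_attention(H, w):
    """H: (T, d) LSTM hidden states for one vehicle; w: (d,) learnable
    scoring vector. Returns the per-step attention weights and the
    attention-weighted tensor cell value (a single d-vector)."""
    scores = H @ w                    # (T,) relevance of each time step
    scores -= scores.max()            # numerical stability for softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    z = alpha @ H                     # (d,) weighted aggregation
    return alpha, z

rng = np.random.default_rng(1)
T, d = 5, 8                           # toy window length and hidden size
H = rng.normal(size=(T, d))
alpha, z = temporal_attention(H, rng.normal(size=d))
```

The weights sum to one, so `z` is a convex combination of the hidden states, with salient time steps contributing more.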
2.2.2. Spatial Attention Mechanism
In the HTSA-LSTM model, the tensor cell values at time t are represented as follows:

Z_t = (z_1, z_2, …, z_N)

where N denotes the total number of tensor cells (N = 3 × 13 = 39). The tensor cell z_i is defined as the temporal attention output z_v of the vehicle v occupying grid cell i, or zero when the cell is empty. The spatial attention weights associated with all vehicles at time t are computed as follows:

β_t = softmax(Z_t w_β)

where w_β represents the learnable weights. Here, β_t represents the spatial attention weights corresponding to each cell at time t. Finally, the historical information of the target vehicle and its neighboring vehicles is aggregated as follows:

s_t = Σ_{i=1}^{N} β_{t,i} z_i  (17)
Subsequently, the spatiotemporal interaction information is coupled with the driving style information.
In the HTSA-LSTM model, there is a strong interdependence between the temporal attention mechanism and the spatial attention mechanism. This relationship is manifested primarily in the following two aspects:
Temporal Attention Mechanism: The model applies weighted processing to the T-step historical trajectory data of the target vehicle and its neighboring vehicles, generating a weighted hidden state for each time step. This process focuses on assigning appropriate weights based on the significance of historical data, thereby highlighting the most critical moments for future trajectory prediction. These weighted hidden states are subsequently integrated into the spatial attention mechanism, forming the tensor cell values. Specifically, if a vehicle is within a given network cell, its weighted hidden state is included as that cell’s tensor value; otherwise, the cell’s tensor value is set to zero. This demonstrates that the output of the temporal attention directly influences the input structure of the spatial attention mechanism.
Spatial Attention Mechanism: This mechanism leverages the tensor cell values of all vehicles, including the output from the temporal attention mechanism, to calculate the spatial attention weights. This step emphasizes identifying the regions within the various spatial cells (or network cells) that have the most significant impact on predicting the target vehicle’s trajectory. The model then performs a weighted sum of all tensor cell values using these weights, generating the final representation that integrates both temporal and spatial information, as described in Equation (17).
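The spatial step can be sketched in the same style; the zeroing of empty grid cells, the linear scoring, and the toy occupancy pattern are assumptions consistent with the description above, not the authors' exact formulation:

```python
import numpy as np

def spatial_attention(Z, w):
    """Z: (N, d) tensor cell values (rows for empty grid cells are zero);
    w: (d,) learnable scoring vector. Returns the per-cell weights and the
    aggregated spatio-temporal representation."""
    scores = Z @ w
    scores -= scores.max()            # numerical stability for softmax
    beta = np.exp(scores) / np.exp(scores).sum()
    s = beta @ Z                      # (d,) weighted sum over all cells
    return beta, s

rng = np.random.default_rng(2)
N, d = 39, 8                          # 3 x 13 grid around the target vehicle
Z = np.zeros((N, d))
occupied = [0, 7, 19, 25]             # hypothetical occupied cells
Z[occupied] = rng.normal(size=(len(occupied), d))
beta, s = spatial_attention(Z, rng.normal(size=d))
```

Empty cells still receive a (uniform) share of attention here because their scores are all zero; a masked softmax over occupied cells would be the natural refinement.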
2.3. Driving Style Analysis Module
To better utilize sensor data and capture driving styles, temporal state continuity is introduced to segment and cluster vehicle sensor data. A sparse inverse covariance clustering method based on temporal sequence state continuity (SICC-SC) was developed. For semantic analysis of driving primitives, a network analysis method is applied. Using the network structure of each driving primitive obtained from SICC-SC, the driving primitives are semantically interpreted and defined without manually predefined labels. The SICC-SC method is used to segment the driver’s trajectory data, resulting in the five clusters listed in Table 1.
As shown in Figure 3, after obtaining the multidimensional vehicle sensor data, a sliding window of fixed size is used to extract local time-series data. First, the relative distance series of neighboring data points within the window is calculated, including its mean and standard deviation. Based on the characteristics of the normal distribution, each value in the series is assigned a similarity metric, representing the temporal continuity of the time series. Subsequently, the Expectation–Maximization (EM) algorithm is used to solve the sparse inverse covariance matrix Θ. This matrix encapsulates conditional dependencies among sensor variables (e.g., throttle–velocity correlations during acceleration), effectively serving as a feature dictionary that defines discriminative interactions for each driving primitive. The sliding window mechanism, analogous to a dynamic lens scanning temporal behaviors, enables the model to decompose complex driving sequences into semantically coherent units, as visualized in Figure 3.
A dynamic programming (DP)-based clustering assignment method is introduced to address the issue of state continuity in time-series data. In the E-step of the EM algorithm, the combinatorial optimization problem is solved by maximizing the expected likelihood. The objective function minimizes the sum of the negative log-likelihood and a switching penalty term:

min_P Σ_{t=1}^{T} [ −ℓℓ(x_t, Θ_{P_t}) + β · 1{P_t ≠ P_{t−1}} ]  (18)

In the proposed dynamic programming (DP)-based clustering assignment method, the objective function in Equation (18) is designed to balance model fitness and state continuity in time-series data. Specifically, the term −ℓℓ(x_t, Θ_{P_t}) represents the negative log-likelihood of assigning time point t to cluster P_t under the current parameter estimates Θ, which measures the discrepancy between the observed data and the model’s predictions. The second term, β · 1{P_t ≠ P_{t−1}}, introduces a state continuity constraint: it imposes a penalty proportional to the parameter β whenever a time point is assigned to a different cluster than its predecessor. This encourages adjacent time points to be clustered together, enhancing the temporal coherence of the resulting clusters.
The dynamic programming approach optimizes this objective by constructing an optimal path that minimizes the cumulative cost across all time points. This path is determined by iteratively solving subproblems that balance the trade-off between local likelihood maximization and global continuity constraints, ensuring that the final clustering assignment is both statistically plausible and temporally smooth. The parameter β can be adaptively adjusted to control the strength of the continuity constraint, allowing the method to handle varying degrees of temporal dependence in different segments of the time series.
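A Viterbi-style dynamic program of the kind described above can be sketched as follows. The cost matrix and penalty value are toy inputs; the recursion implements the minimum of "stay in the same cluster" versus "switch and pay the penalty":

```python
import numpy as np

def dp_assign(cost, beta):
    """cost: (T, K) negative log-likelihood of assigning time point t to
    cluster k; beta: switching penalty. Returns the minimum-cost cluster
    path that enforces temporal continuity."""
    T, K = cost.shape
    D = np.zeros((T, K))              # D[t, k]: best cumulative cost ending in k
    back = np.zeros((T, K), dtype=int)
    D[0] = cost[0]
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]                  # remain in cluster k
            switch = D[t - 1].min() + beta      # switch from the best cluster
            if stay <= switch:
                D[t, k], back[t, k] = cost[t, k] + stay, k
            else:
                D[t, k], back[t, k] = cost[t, k] + switch, D[t - 1].argmin()
    path = np.empty(T, dtype=int)               # backtrack the optimal path
    path[-1] = D[-1].argmin()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Two segments with a clearly cheapest cluster each; the penalty suppresses
# spurious single-point switches (toy costs, illustrative only).
cost = np.array([[0.1, 2.0], [0.2, 2.0], [2.0, 0.1], [2.0, 0.2]])
path = dp_assign(cost, beta=0.5)
```

With these costs the path switches exactly once, from cluster 0 to cluster 1, mirroring the temporal-coherence behavior the penalty is meant to enforce.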
Once the cluster assignment result P is obtained, the cluster parameters are then updated. We propose an optimization method based on the Toeplitz graphical Lasso, solved using the Alternating Direction Method of Multipliers (ADMM). The objective function maximizes the log-determinant of the Toeplitz inverse covariance matrix while adding a Lasso sparsity regularization term, i.e., it minimizes

−log det Θ + tr(SΘ) + λ‖Θ‖₁

where Θ is the inverse covariance matrix, S is the empirical covariance matrix, and λ controls the sparsity. The ADMM iteratively updates Θ, Z, and the Lagrange multiplier U, solving the optimization problem in a recursive manner. The update steps involve optimizing the log-determinant term for Θ, applying the Lasso sparsity constraint to Z, and updating the Lagrange multiplier U to enforce the consensus constraint Θ = Z. Each cluster is represented by an inverse covariance matrix, defined as the driving basis, also known as the precision matrix. The dimensionality of this matrix is determined by the six sensors and the window size, and it encodes the conditional independence relationships between the six sensors. Due to the block Toeplitz structure of the inverse covariance matrix, assuming a window size of 2, the time-invariant assumption can be made, allowing a better representation of the temporal relationships between the sensors. The diagonal blocks represent the temporal correlations of the same sensor across different time points, while the off-diagonal blocks represent the conditional correlations between different sensors at the same time. A zero element in the matrix indicates that the corresponding pair of sensor variables is conditionally independent, whereas a nonzero element denotes a conditional dependency.
In this paper, a threshold is set to convert the matrix into a graph structure. Edges with weights lower than the threshold are removed, while edges with weights higher than the threshold are retained, with their weights given by the relative bias coefficient. Afterward, the distribution of the standardized driving primitives is used to analyze the statistical properties and risks of different drivers. The standardized driver characteristics are then input into the HTSA-LSTM.
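A hypothetical sketch of this thresholding step, assuming a partial-correlation weighting and an illustrative threshold value `tau` (the paper's exact threshold is not reproduced here):

```python
import numpy as np

def precision_to_graph(theta, tau):
    """Convert an inverse covariance (precision) matrix into an undirected
    edge list: keep edge (i, j) only when the magnitude of its
    partial-correlation weight reaches the threshold tau."""
    d = theta.shape[0]
    # Partial correlations derived from the precision matrix entries.
    w = -theta / np.sqrt(np.outer(np.diag(theta), np.diag(theta)))
    edges = []
    for i in range(d):
        for j in range(i + 1, d):
            if abs(w[i, j]) >= tau:
                edges.append((i, j, float(w[i, j])))
    return edges

# Toy 3-sensor precision matrix: one strong and one negligible dependency.
theta = np.array([[2.0, -0.8, 0.0],
                  [-0.8, 2.0, -0.05],
                  [0.0, -0.05, 2.0]])
edges = precision_to_graph(theta, tau=0.1)
```

Here only the strong (0, 1) dependency survives the threshold; the weak (1, 2) entry is pruned, which is exactly the sparsification the graph construction relies on.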
2.4. Coupled Spatiotemporal Attention Model Considering Driving Styles
The architecture of the HTSA-LSTM model is illustrated in Figure 4. To predict vehicle trajectories within a road network, the model leverages temporal and spatial attention mechanisms to capture dependencies in both time and space. Additionally, it integrates a driving style analysis module to account for the influence of varying driving behaviors on vehicle movement. The road network surrounding the target vehicle is discretized into a 3 × 13 grid, as shown in Figure 1. The grid’s rows represent the left, current, and right lanes relative to the target vehicle, while the columns correspond to discrete grid cells of width L [m], determined by the vehicle’s length and the driving environment. In alignment with related studies [28], L is set to 4.6 m (15 ft) during training with the NGSIM dataset. All vehicles within the 3 × 13 grid, excluding the target vehicle, are considered neighboring vehicles, each occupying a unique grid cell.
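The grid discretization can be illustrated with a small helper. The exact indexing convention (row/column origin, placement of the target in the center column) is an assumption for illustration, not the paper's specification:

```python
def grid_cell(dx, dy, L=4.6, n_cols=13):
    """Map a neighbor's position relative to the target vehicle to a cell
    of the 3 x n_cols grid. dx: lane offset (-1 left, 0 same, +1 right);
    dy: longitudinal offset in meters; L: cell length (4.6 m / 15 ft,
    following the NGSIM setup). Returns (row, col), or None when the
    vehicle falls outside the grid."""
    row = dx + 1                       # 0: left, 1: current, 2: right lane
    col = int(dy // L) + n_cols // 2   # center column holds the target
    if 0 <= row <= 2 and 0 <= col < n_cols:
        return row, col
    return None

# A vehicle one lane to the right, 10 m ahead of the target vehicle:
cell = grid_cell(dx=1, dy=10.0)
```

Vehicles beyond roughly half the grid length (about 30 m with these values) map to `None` and are excluded from the neighbor set.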
The input to the HTSA-LSTM model consists of the historical trajectories over T steps and the driving style information of all vehicles within the grid. The model outputs the predicted trajectory of the target vehicle for the next H steps. The temporal attention mechanism analyzes the influence of historical trajectories of both the target and neighboring vehicles, while the spatial attention mechanism captures the impact of the surrounding environment on the prediction. A driving style analysis module is integrated to capture individual driving habits.
In summary, the HTSA-LSTM model effectively integrates temporal and spatial attention mechanisms, combining time-series analysis with spatial feature recognition. The temporal attention identifies critical relationships between past trajectories and the current state, while the spatial attention emphasizes key environmental features. By employing a dynamic spatiotemporal attention mechanism, the model captures evolving spatiotemporal dependencies and incorporates driving style variations, enhancing its understanding of the interaction between these factors for more accurate trajectory predictions.
3. Experiments and Evaluation
This chapter explores and validates the application of the temporal–spatial attention mechanism-based Long Short-Term Memory (HTSA-LSTM) model for vehicle trajectory prediction. Using the NGSIM US101 trajectory dataset for analysis and model training, we demonstrate the superiority of the HTSA-LSTM in handling complex traffic scenarios. Comparisons with benchmark models further validate its advantages in prediction accuracy and generalization capability. Additionally, the model’s effectiveness in real-world driving conditions was confirmed using actual vehicle driving data.
During the dataset segmentation and model training phase, the data were split into training, validation, and test sets with a 7:1:2 ratio. The model inputs consisted of time-series segments, enabling precise future trajectory predictions. The configuration and optimization of model parameters, along with the experimental environment setup, laid a strong foundation for the research. This chapter also provides a detailed list of the model’s hyperparameters, serving as a reference for future experiments and comparisons.
Theoretically, the floating-point operations (FLOPs) for a single forward pass through the HTSA-LSTM model, considering both its spatial–temporal attention mechanism and LSTM parameters, are estimated to be approximately 820,000. Combined with hardware performance characteristics, the theoretical latency is projected to be 4.1 ms. Experimental measurements, which incorporate data preprocessing and communication overhead, reveal an estimated end-to-end latency of 5.6 ms. This result achieves a 10 Hz real-time processing rate, satisfying the system’s real-time performance requirements. All experiments were conducted using Python 3.8, CUDA 11.6.2, a 12-core Intel Xeon Platinum 8255C @ 2.50GHz CPU, an Nvidia RTX 3080 (10 GB) GPU, and 128 GB of RAM.
In order to prevent overfitting, we adopted early stopping and regularization strategies during training to enhance the model’s generalization capabilities. Future work will further investigate the model’s generalization performance in larger-scale and more diverse data scenarios, as well as the impact of training data selection on performance.
3.1. Evaluation Metrics
To assess the effectiveness of the HTSA-LSTM model in real-world applications, this section separately trains and tests the model using real-world driving data from urban city roads (CRs), urban expressways (UEs), and highways (Hs). The following evaluation metrics are employed to evaluate the model’s performance on the test set:
- 1.
Root Mean Square Error (RMSE):
RMSE is a common metric for evaluating the accuracy of a model’s predictions. However, it has certain limitations when assessing the accuracy of multimodal predictions, as RMSE tends to favor models that average the prediction results. In some specific scenarios, the mean value may not accurately reflect the quality of the prediction.
- 2.
Negative Log-Likelihood Loss (NLL):
To address this limitation, another evaluation metric is introduced: the negative log-likelihood loss (NLL) of the model’s predicted trajectory relative to the corresponding true trajectory [
29]. Although the NLL value cannot be directly interpreted as a specific physical quantity, it provides an intuitive comparison between unimodal and multimodal prediction distributions.
The trajectory information input into the HTSA-LSTM model comprises a series of trajectory points continuously distributed over time. For each true trajectory point, there is a corresponding predicted point. The probability assigned by the model to the predicted trajectory point at each position, abbreviated as p, reflects the prediction accuracy at that point. The goal is for the HTSA-LSTM model’s prediction accuracy to be as close to 1 as possible for each trajectory point. The objective of the negative log-likelihood loss (NLL) is to minimize −log p, which encourages the prediction accuracy to approach 1, thereby minimizing the loss and reducing the discrepancy between the predicted and true values. The NLL of the input data is the average NLL across all samples, where N represents the number of trajectory points in the input data.
- 3.
Mean Absolute Error (MAE):
MAE ranges from [0, +∞), where smaller MAE values indicate that the predictions are closer to the true values. An MAE of 0 suggests that the model is perfect; conversely, larger prediction errors result in a higher MAE.
- 4.
R-Squared (R2) Coefficient of Determination:
R2 ranges from [0, 1] and is used to characterize the fit of the model’s predictions to the actual data.
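As a concrete reference, the metrics above can be sketched in a few lines of NumPy. The Gaussian form used for the NLL is an assumption made purely for illustration; the model's actual predictive distribution over trajectory points may differ.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between predicted and true values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error; 0 indicates a perfect fit."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: fraction of variance explained."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def gaussian_nll(y_true, mu, sigma):
    """Average negative log-likelihood, assuming (for illustration) an
    independent Gaussian predictive density at each trajectory point."""
    y_true, mu, sigma = map(np.asarray, (y_true, mu, sigma))
    per_point = 0.5 * np.log(2 * np.pi * sigma ** 2) \
        + (y_true - mu) ** 2 / (2 * sigma ** 2)
    return float(per_point.mean())

y_true, y_pred = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))

# A biased prediction incurs a higher NLL than an unbiased one:
y = np.array([0.0, 1.0, 2.0])
print(gaussian_nll(y, y, np.full(3, 0.5)) < gaussian_nll(y, y + 1.0, np.full(3, 0.5)))  # True
```

Note how the NLL comparison captures the multimodality argument above: unlike RMSE, it penalizes a distribution that places little probability mass on the true trajectory, not just a mean that drifts from it.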
3.2. Dataset Segmentation and Model Training
Figure 5 illustrates the data segmentation process. The different colored lines represent multi-dimensional data from different sensors. This section utilizes the preprocessed NGSIM US101 trajectory dataset, which is discretized and mapped according to the road network structure shown in
Figure 1. The NGSIM dataset records vehicle movement data on highways, primarily capturing forward-driving and lane-changing behaviors. The US101 dataset includes three 15 min trajectory segments (7:50–8:05 am, 8:05–8:20 am, 8:20–8:35 am), which are divided into training, validation, and test sets in a 7:1:2 ratio. During model training, a specific vehicle is selected as the objective vehicle (OV), while the surrounding vehicles within a 3 × 13 road grid are considered neighboring vehicles (NVs).
In the HTSA-LSTM model, the historical trajectory data of the OV and NVs over T seconds is used to predict the OV’s trajectory over the next H seconds. Therefore, each feature sequence segment must be at least T + H seconds long (as illustrated in
Figure 5). In this study, the vehicle trajectory data are segmented into time-series fragments of T + H seconds, where T seconds represents the OV’s historical driving data and H seconds represents the model’s future trajectory prediction range. Both are sampled at a frequency of 10 Hz. Data segmentation is performed using a sliding time window of length T + H, allowing the predicted results to be merged into a continuous sequence, thus enabling predictions over longer time horizons. Before inputting the data into the HTSA-LSTM model, each vehicle’s trajectory is downsampled by a factor of 2 to reduce model complexity.
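The sliding-window segmentation and 2× downsampling described above can be sketched as follows. The concrete values T = 3 s and H = 5 s are assumptions chosen for illustration, not the paper's final settings.

```python
import numpy as np

T_SEC, H_SEC, HZ = 3, 5, 10  # assumed history/horizon lengths, 10 Hz sampling

def segment_trajectory(traj, t_sec=T_SEC, h_sec=H_SEC, hz=HZ, stride=1):
    """Slice one vehicle's trajectory (shape [steps, features]) into
    overlapping (history, future) pairs with a sliding window of
    length T + H, then downsample each fragment by a factor of 2."""
    t_len, h_len = t_sec * hz, h_sec * hz
    window = t_len + h_len
    pairs = []
    for start in range(0, len(traj) - window + 1, stride):
        hist = traj[start:start + t_len]          # T seconds of history
        fut = traj[start + t_len:start + window]  # H seconds to predict
        pairs.append((hist[::2], fut[::2]))       # downsample by 2
    return pairs

traj = np.arange(100).reshape(100, 1).astype(float)  # 10 s of dummy data
pairs = segment_trajectory(traj)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)  # 21 (15, 1) (25, 1)
```

Because the stride is one sample, consecutive windows overlap almost entirely, which is what later allows the per-window predictions to be merged into a continuous long-horizon sequence.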
Model Parameter Settings: Based on existing research results in trajectory prediction and considering subsequent comparative experiments, the length of the output window H in the HTSA-LSTM model is incrementally increased from 1, selecting the optimal future prediction step length according to the evaluation metrics. Hyperparameters in the model are optimized using grid search, with the final parameter settings provided in
Table 2. For the parameter setting of SICC-SC, we refer to previous work [
30].
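The grid search mentioned above can be sketched as an exhaustive loop over a candidate grid. The hyperparameter names and ranges below are hypothetical stand-ins, not the values reported in Table 2, and `fake_eval` stands in for a full train-and-validate run of HTSA-LSTM.

```python
from itertools import product

# Hypothetical search space; the actual grid in Table 2 may differ.
grid = {
    "hidden_size": [64, 128],
    "learning_rate": [1e-3, 5e-4],
    "dropout": [0.1, 0.2],
}

def grid_search(evaluate, grid):
    """Evaluate every hyperparameter combination and return the
    configuration with the lowest validation loss."""
    best_cfg, best_loss = None, float("inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        loss = evaluate(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

# Stand-in for training + validating HTSA-LSTM on one configuration.
fake_eval = lambda cfg: abs(cfg["hidden_size"] - 128) + cfg["learning_rate"] + cfg["dropout"]
best, loss = grid_search(fake_eval, grid)
print(best)
```

Exhaustive search is tractable here only because the grid is small; each additional hyperparameter multiplies the number of training runs.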
3.3. Driving Style Analysis Module
To comprehensively account for the influence of driving style on vehicle trajectory, a driving style analysis module has been embedded within the HTSA-LSTM model. This module performs semantic parsing of the input data, identifying and segmenting fundamental driving behavior units, thereby enabling an unsupervised analysis of the driver’s driving style. By incorporating the driving style analysis module, the model not only captures temporal and spatial dependencies through the spatiotemporal attention mechanism but also integrates the impact of driving style into the analysis.
3.4. Results and Comparison
In this section, the NGSIM US101 dataset is utilized to predict future vehicle trajectories using the HTSA-LSTM model alongside several baseline models. The performance of these models is evaluated using RMSE and NLL metrics. The baseline models assessed include:
Constant Velocity (CV) [
31]: An earlier trajectory prediction method based on physical principles, primarily using the vehicle’s dynamic model and Kalman filter equations. Although the dataset lacks factors like road surface friction, making complex physics-based comparisons impractical, a simple physics-based model is constructed to predict the vehicle’s trajectory under constant velocity.
Simple-LSTM (S-LSTM) [
32]: This model uses LSTM units within a recurrent neural network to analyze continuous time-series data for temporal vehicle trajectory predictions.
Maneuver-LSTM (M-LSTM) [
33]: Based on an encoder-decoder framework, this model uses historical trajectory data and lane structure, assigning confidence values to six maneuver categories to predict future motion through a multimodal distribution.
Attention-LSTM (A-LSTM) [
3]: This model uses an LSTM encoder with shared weights to generate vector encodings of each vehicle’s motion. An attention module models interactions based on the importance of adjacent vehicles, and the decoder uses these interaction vectors to predict trajectory distribution.
Convolutional Social Pooling LSTM (CS-LSTM) [
27]: This model, based on an LSTM encoder-decoder framework, employs a social convolutional network to accurately capture spatial interactions among agents, producing multimodal trajectory predictions.
Spatial–Temporal Attention-LSTM (STA-LSTM) [
1]: This model employs an LSTM encoder–decoder framework with integrated spatial and temporal attention mechanisms. It assigns weights to surrounding vehicles and historical time steps, capturing critical interactions for trajectory prediction. STA-LSTM enhances the accuracy of multimodal trajectory predictions by effectively modeling dynamic relationships over time and space.
Comparative models (4) and (5) serve as ablation experiments for HTSA-LSTM, with A-LSTM using only a unified attention mechanism (AM) and CS-LSTM employing only the spatial attention mechanism (social convolutional network). HTSA-LSTM integrates both, with AM as the temporal attention mechanism and the social convolutional network as the spatial attention mechanism. Models (2) and (3) also contribute to a broader ablation study, where removing the spatiotemporal attention from HTSA-LSTM results in S-LSTM and replacing it with a Maneuver module results in M-LSTM, retaining the overall framework while altering specific components.
Table 3 presents the RMSE and NLL values for predictions made by HTSA-LSTM and baseline models on the NGSIM US101 dataset. As shown, the evaluation metrics increase as the prediction horizon extends, indicating a decline in prediction accuracy for all models. Compared to other baseline models, the CV and S-LSTM models produce higher RMSE and NLL values, which can be attributed to the fact that other models consider the influence of interaction information from neighboring vehicles during trajectory prediction. This directly indicates the significance of vehicle interaction information in trajectory prediction. The results for prediction horizons of 1 to 6 s demonstrate that HTSA-LSTM yields significantly lower RMSE and NLL values than other baseline models, indicating higher accuracy in predicting future vehicle trajectories.
The analysis of the RMSE and NLL values for the STA-LSTM and HTSA-LSTM models reveals significant differences in their performance across prediction horizons. As with the overall trend observed in
Table 3, both models show increased RMSE and NLL values as the prediction horizon extends, indicating a decline in accuracy. However, HTSA-LSTM consistently outperforms STA-LSTM across all time frames, demonstrating its superior ability to incorporate vehicle interaction information into trajectory predictions. Specifically, HTSA-LSTM achieves markedly lower RMSE and NLL values compared to STA-LSTM, particularly in the shorter prediction horizons, highlighting its effectiveness in accurately forecasting future vehicle trajectories. This emphasizes the importance of utilizing interaction data from neighboring vehicles for improved prediction accuracy.
3.5. Analysis of Temporal and Spatial Attention Allocation
Following the overall analysis of the HTSA-LSTM model’s trajectory prediction performance, this section delves into the allocation of the temporal attention mechanism across time steps and the spatial attention mechanism within the road network structure.
3.5.1. Temporal Attention Allocation
This analysis uses three 15 min subsets from the NGSIM US101 dataset to examine the allocation of temporal attention, calculating the attention weights for 15 historical time steps (from t − 14 to the current time step t).
Figure 6 illustrates the average temporal attention weights from t − 14 to t. It is evident that the weights at the earliest time steps are nearly zero, indicating minimal influence on the prediction results, and thus, the model opts to ignore them. The temporal attention weight at the current time step t is the highest, suggesting that the future trajectory of the target vehicle is primarily influenced by the most recent state of the target and neighboring vehicles. This result also explains why the CS-LSTM model, which incorporates spatial but not temporal attention, underperforms compared to HTSA-LSTM.
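The averaging behind Figure 6 can be sketched as follows, using synthetic attention scores that favor recent steps; the score pattern is an assumption made purely to mimic the reported distribution.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, producing attention weights that sum to 1."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical attention scores for 1000 samples over 15 historical steps
# (t-14 ... t); recent steps score higher, mimicking the trend in Figure 6.
rng = np.random.default_rng(0)
scores = np.linspace(-4.0, 2.0, 15) + 0.1 * rng.standard_normal((1000, 15))
weights = softmax(scores, axis=1)   # per-sample temporal attention weights
avg_weights = weights.mean(axis=0)  # average over all samples, as in Figure 6

print(avg_weights.argmax())  # 14, i.e. the current time step t
```

With this score profile the earliest steps receive near-zero average weight, reproducing the "model opts to ignore them" behavior described above.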
3.5.2. Spatial Attention Allocation
The study section of US101 in the NGSIM dataset comprises five lanes and includes an on-ramp.
Figure 7 is a lane distribution map at the Ventura Boulevard on-ramp, captured from Google Maps. This section selects four lanes from the southbound section of Ventura Boulevard on US101: the innermost lane, the middle lane, the outermost lane, and the on-ramp section (corresponding to the lanes indicated by the blue, green, yellow, and red arrows, respectively, in the figure) to analyze the spatial attention allocation of the target vehicle in different lanes.
In analyzing the spatial attention mechanism’s weight distribution, each road network grid cell is labeled based on the lane name and its relative position to the target vehicle’s grid cell. To simplify, the following abbreviations are used: Current Lane (C), Right Lane (R), and Left Lane (L). The longitudinal representation of the lane grid is illustrated in
Figure 8, with values ranging from −6 to 6. For example, (C, 0) represents the target vehicle’s position, (C, 6) indicates the sixth grid cell ahead in the current lane, (R, 2) denotes the second grid cell ahead in the right lane, and (L, −3) refers to the third grid cell behind in the left lane. Regarding spatial attention weights, the grid cell occupied by the target vehicle holds the highest attention weight at 71.56%. This, combined with the temporal attention weight distribution, suggests that the target vehicle’s future trajectory is predominantly influenced by its own driving state. The study focuses on passenger vehicles, which are highly maneuverable and can quickly adjust their trajectories in response to surrounding vehicles.
To better describe the distribution of attention weights among neighboring vehicles in the road network structure, the remaining 28.44% of attention weights, excluding the 71.56% allocated to the target vehicle, are normalized and distributed across the 3 × 13 road grid. The results are shown in
Figure 8. Grid cells behind the target vehicle receive little attention, meaning rear vehicles have minimal impact and drivers focus less on them.
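The renormalization of the remaining 28.44% over the 3 × 13 grid can be sketched as below. The uniform toy weights are illustrative only, and the ego cell is taken to be the centre cell, i.e. position (C, 0).

```python
import numpy as np

def renormalize_neighbor_attention(attn, ego=(1, 6)):
    """Zero out the ego cell's weight and rescale the remaining weights
    over the 3 x 13 road grid so they sum to 1, as done for Figure 8."""
    out = attn.copy()
    out[ego] = 0.0
    return out / out.sum()

# Toy weights: 71.56% on the ego cell, the rest spread evenly over the
# 38 neighboring cells (illustrative, not the measured distribution).
attn = np.full((3, 13), (1 - 0.7156) / 38)
attn[1, 6] = 0.7156
neighbor_attn = renormalize_neighbor_attention(attn)
print(round(neighbor_attn.sum(), 6))  # 1.0
```

After renormalization the 3 × 13 map shows only the relative influence of neighboring vehicles, which is what makes the low weights behind the target vehicle visible in Figure 8.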
In
Figure 8, as more attention is allocated to a grid cell, its color changes from blue to green and finally to red.
Figure 8 further shows that the target vehicle’s spatial attention is primarily focused on vehicles ahead in the current lane. However, the on-ramp section is an exception, where the target vehicle pays more attention to vehicles ahead in the left lane, indicating a tendency to switch to that lane (as shown in the red box in
Figure 8). This is consistent with real-world behavior, as merging from an on-ramp onto the main road typically requires a lane change. Compared to the right lane, vehicles in the outermost lane also pay more attention to the left lane, showing a greater tendency to change lanes to the left rather than the right. Conversely, the innermost lane exhibits the opposite pattern, where the target vehicle not only focuses on the dynamics of vehicles in its current lane but also pays more attention to vehicles ahead in the right lane. Vehicles in the middle lane show little difference in attention distribution between the left and right lanes. The spatial attention distribution in other lanes can be explained similarly. These results demonstrate that HTSA-LSTM effectively captures the impact of surrounding traffic participants on the target vehicle’s future state at the spatial level.
To investigate whether spatial attention weights can explain specific driving behaviors, such as lane changes, this section selects Vehicle 1585 as the subject of study. This vehicle performed two consecutive lane change maneuvers on the US101 highway.
The target vehicle 1585 executed its first lane change from lane 4 to lane 5 at the 447th time step and its second lane change from lane 5 to lane 6 at the 518th time step, as shown in
Figure 9, where blue and purple indicate that the vehicle is changing lanes, and green indicates that it remains in its current lane. Additionally,
Figure 9 displays the road grid cells that received the highest spatial attention weight at each time step during this process. It can be observed that before the 446th time step, the target vehicle (Vehicle 1585) primarily focused on the neighboring vehicles ahead in the current lane, with the highest spatial attention weights concentrated on grid cells (C, 2), (C, 3), and (C, 4).
From the 447th to the 469th time step, the maximum attention gradually shifted from the current lane to (R, 1), and then to (R, 2) as the vehicle prepared to change lanes to the right. At the 518th time step, the maximum attention weight again shifted from the current lane to the right lane, remaining on grid cells (R, 1) and (R, 2) until the 554th time step, marking the completion of the lane change from lane 5 to lane 6. The duration of the maximum spatial attention weight shift indicates that the first lane change took 2.2 s (22 × 0.1), while the second lane change lasted 3.6 s (36 × 0.1).
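The duration estimate above (number of shifted time steps × 0.1 s) can be sketched with a small helper. The `max_attn_cells` representation and the toy sequence are hypothetical, arranged to reproduce the 2.2 s figure for the first lane change.

```python
DT = 0.1  # seconds per time step in the NGSIM data

def shift_duration(max_attn_cells, lane="R"):
    """Given the per-step (lane, offset) of the highest-weight grid cell,
    return the duration (s) during which it stays in the given lane."""
    steps = [i for i, (l, _) in enumerate(max_attn_cells) if l == lane]
    return (steps[-1] - steps[0] + 1) * DT if steps else 0.0

# Toy sequence mimicking the first lane change: 22 consecutive steps
# in which the maximum attention sits in the right lane.
cells = [("C", 2)] * 5 + [("R", 1)] * 10 + [("R", 2)] * 12 + [("C", 1)] * 5
print(round(shift_duration(cells), 1))  # 2.2
```

The same computation over 36 right-lane steps would yield the 3.6 s reported for the second lane change.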
3.6. Analysis of Driving Primitives
Figure 10 illustrates the driving behaviors in relation to various kinematic features, including lateral velocity (v_x), lateral acceleration (a_x), longitudinal velocity (v_y), and longitudinal acceleration (a_y), alongside the velocity differential (ΔV). The plots represent the temporal changes in each feature, labeled with different driving primitives identified by the Sparse Inverse Covariance Clustering (SICC-SC) method. The color-coded labels correspond to the driving primitives depicted in
Figure 1, providing a clear visual association of each primitive with specific kinematic behaviors. The variation in lateral and longitudinal velocities and accelerations highlights the influence of different driving primitives over time.
Figure 11 presents the distribution of driving primitives extracted using the SICC-SC method. The color map represents the frequency of each driving primitive, labeled from #1 to #5, with corresponding normalized occurrence proportions. Notably, primitive #1 (yellow) accounts for 28% of the overall driving behavior, reflecting consistent behaviors such as stable high-speed driving. Primitives #3 (green) and #4 (cyan) represent 22% and 17%, respectively, corresponding to low-speed operations and moderate acceleration. Primitive #2 (blue), contributing to 15% of the driving instances, characterizes stopping or idling behavior, indicating a significant share of routine traffic stops. Primitive #5 reflects sharp acceleration, accounting for 18%, which may be associated with overtaking or rapid lane changes. Overall, the analysis suggests that a considerable portion of the driving behavior involves regular high-speed cruising, low-speed control, and frequent stopping maneuvers, illustrating a diverse mixture of urban and expressway driving scenarios.
It is evident that each driving primitive demonstrates distinct effects on the kinematic features. For instance, during periods labeled with primitive #1, the lateral and longitudinal velocities (v_x and v_y) exhibit stable trends, indicating cruising at consistent speeds. Conversely, during periods associated with primitive #5, characterized by rapid acceleration, spikes in both longitudinal velocity and acceleration are observed. The presence of primitive #3 (low-speed driving) corresponds with lower values of both v_y and a_y, suggesting careful driving in high-traffic areas.
The bottom panel of
Figure 11 further demonstrates the segmentation of driving primitives over time, visualized as a color-coded bar. This segmentation highlights the temporal dynamics of each primitive throughout the driving period, effectively summarizing the diversity in driving styles and the frequent shifts in driving behavior. The frequent changes between primitives #1, #3, and #5 indicate a dynamic driving pattern, with the driver consistently alternating between high-speed cruising, low-speed maneuvering, and aggressive accelerations, reflecting the complexity of real-world traffic scenarios.
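SICC-SC itself fits sparse inverse-covariance models to segment the time series; as a much simpler illustrative stand-in, the sketch below labels time steps with primitives by clustering the same kinematic features (v_x, a_x, v_y, a_y, ΔV) with a minimal k-means. This is not the paper's method, only a way to see how per-step labels arise from kinematic regimes.

```python
import numpy as np

def kmeans_labels(X, k, iters=20):
    """Minimal Lloyd's k-means; returns one primitive label per time step.
    Centers are initialized deterministically from evenly spaced samples."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy kinematic features [v_x, a_x, v_y, a_y, dV] for two regimes:
# steady high-speed cruising vs. sharp acceleration (values are made up).
rng = np.random.default_rng(1)
cruise = rng.normal([0.0, 0.0, 20.0, 0.0, 0.0], 0.2, (100, 5))
accel = rng.normal([0.0, 0.5, 12.0, 2.5, 3.0], 0.2, (100, 5))
labels = kmeans_labels(np.vstack([cruise, accel]), k=2)
print(labels[0] != labels[-1])  # True: each regime maps to its own primitive
```

Plotting these labels over time as a color-coded bar would reproduce the kind of segmentation shown in the bottom panel of Figure 11.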
3.7. Vehicle Speed, Acceleration, and Trajectory Prediction
To provide a more intuitive demonstration of HTSA-LSTM’s predictive performance on other feature variables, this section selects Vehicle 1585 from the NGSIM US101 dataset to predict and analyze its longitudinal speed (v_y), longitudinal acceleration (a_y), lateral speed (v_x), and lateral acceleration (a_x). The prediction accuracy is evaluated using the mean absolute error (MAE) between the predicted and actual values:
MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|
where ŷ_i represents the predicted value of the target vehicle, y_i is the actual value, i denotes the observation point, and n is the number of samples. The corresponding mean errors for longitudinal speed v_y, longitudinal acceleration a_y, lateral speed v_x, and lateral acceleration a_x are denoted as MAE_{v_y}, MAE_{a_y}, MAE_{v_x}, and MAE_{a_x}, respectively.
Figure 12 provides examples of predictions for v_y, a_y, v_x, and a_x at different prediction horizons H. The predictions are merged based on the continuity of the time series, presenting combined results for 50 steps (5 s) and 80 steps (8 s) into the future.
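Merging overlapping fixed-horizon predictions into one continuous series, as done for the combined 50- and 80-step results, can be sketched by averaging the steps where windows overlap; the stride and window length below are illustrative.

```python
import numpy as np

def merge_windows(windows, stride):
    """Merge overlapping fixed-length prediction windows into one
    continuous sequence by averaging the overlapping steps."""
    horizon = len(windows[0])
    total = stride * (len(windows) - 1) + horizon
    acc = np.zeros(total)  # accumulated predictions per step
    cnt = np.zeros(total)  # number of windows covering each step
    for i, w in enumerate(windows):
        acc[i * stride:i * stride + horizon] += w
        cnt[i * stride:i * stride + horizon] += 1
    return acc / cnt

# Three 5-step windows with stride 2 over a linear signal: where the
# windows agree, the merged output reproduces the underlying sequence.
sig = np.arange(9.0)
wins = [sig[0:5], sig[2:7], sig[4:9]]
merged = merge_windows(wins, stride=2)
print(np.allclose(merged, sig))  # True
```

With real model output the windows disagree slightly in the overlap, and averaging smooths those seams into a single long-horizon prediction.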
Figure 13 reveals that the mean error for all predicted variables increases with the prediction horizon. Additionally, during periods of rapid changes in the predicted values, the accuracy of the predictions tends to decrease. For instance, in the prediction of longitudinal speed v_y, around the 100th time step, a sharp transition from rapid deceleration to rapid acceleration occurs. While the predicted trends align with the actual values, there is some discrepancy at the extremities, with the predictions generally being conservative. A similar pattern is observed for other variables around their extreme points.
This behavior can be reasonably explained by the distribution of temporal attention weights. As analyzed earlier, the temporal attention weights in the input sequence are primarily concentrated at the current time step (i.e., time t, as shown in Figure 13). In Figure 13, the red solid circle and the red hollow circle represent MAE_{v_y} and MAE_{a_y}, respectively, and the black solid rhombus and the black hollow rhombus represent MAE_{v_x} and MAE_{a_x}, respectively. For the future time series being predicted, the current time step corresponds to the previous time step. Thus, when an extreme value occurs, the model continues to predict based on the trend at time t and prior, until the temporal attention captures this change.
To visually demonstrate HTSA-LSTM’s trajectory prediction results, the predicted trajectories are merged based on the continuity of the time series to form long-range trajectory predictions according to the lane distribution in the road network model. Three samples from the NGSIM dataset were selected for trajectory display. The interactions among Vehicles 1585, 1576, and 1583 are depicted in
Figure 14: Vehicle 1576 is the lead vehicle before Vehicle 1585’s lane change (from lane L4 to lane L5), and Vehicle 1583 is the lead vehicle in the new lane (L5) after the lane change.
The results show that HTSA-LSTM’s trajectory predictions closely match the actual trajectories of the vehicles. However, during the lane change of Vehicle 1585, the model predicts the lane change slightly earlier than it occurs, which can be considered a conservative strategy. Vehicle 1583 moves toward the opposite side of the lane when its neighboring vehicle (Vehicle 1585) changes lanes into its lane, then gradually returns to the center of the road. This behavior aligns with the common driving habit of safely yielding to a lane-changing vehicle by moving to the right. The trajectory prediction results directly confirm that considering spatiotemporal dependencies is essential for accurately predicting vehicle trajectories.
5. Conclusions
Driving style analysis and vehicle trajectory prediction are critical for advancing intelligent vehicle technologies. This study examines variations in driving styles and introduces an innovative trajectory prediction model, HTSA-LSTM. The model is based on Long Short-Term Memory (LSTM) networks, integrating spatiotemporal attention mechanisms and driver behavior characteristics to achieve high-precision, long-term predictions. The key findings of this research are summarized as follows:
The study focuses on long-term vehicle trajectory prediction in multi-vehicle interaction scenarios. The HTSA-LSTM model incorporates a spatiotemporal attention mechanism to capture the complex spatiotemporal dependencies between vehicles. In addition to considering the continuity of vehicle motion and the influence of neighboring vehicles, the model segments driving data into driving primitives to analyze individual driving preferences, enabling precise future trajectory prediction. A 3 × 13 road grid structure was employed to represent the trajectories of the target vehicle and surrounding vehicles, and temporal and spatial attention weights were analyzed to identify key influential factors.
Trained and tested on the NGSIM US101 dataset, the model achieved a 20.72% reduction in RMSE and a 24.98% reduction in NLL for 5 s predictions compared to baseline models. The real-world applicability of HTSA-LSTM was validated using driving datasets, achieving over 97.9% accuracy for 5 s predictions on urban expressways (UEs) and highways (Hs), and 92.7% accuracy for 3 s predictions on city roads (CRs). Compared to STA-LSTM, HTSA-LSTM demonstrated superior prediction accuracy, especially in long-term forecasts (e.g., 6 s), with significantly lower RMSE and NLL metrics. This indicates that HTSA-LSTM exhibits greater robustness and accuracy in dynamic driving scenarios, showing clear advantages in handling vehicle interactions and complex environments, making it a promising solution for intelligent driving applications.
Although the HTSA-LSTM model has achieved good results in highway environments, its performance on urban roads still has room for improvement due to the more diverse traffic behaviors and complex vehicle–road interactions present in urban settings. Future work could incorporate advanced techniques such as multi-scale features, contextual information, and graph neural networks (GNNs) to more effectively model the complex traffic interactions in urban areas. While maintaining strong performance, future work will also examine the model’s generalization ability across diverse traffic scenarios and the impact of training data selection on its robustness.
By integrating long-term trend analysis with real-time data—such as traffic density, road conditions, and meteorological information—our model not only predicts regular congestion patterns, capacity changes, and the impacts of special events but also rapidly adapts to current environmental conditions even in unfamiliar scenarios. This combined approach, however, increases data preprocessing requirements and computational complexity, necessitating research into efficient multimodal fusion strategies and online updating methods to balance real-time performance with prediction accuracy. Moreover, the current framework’s omission of dynamic conditions like fluctuating traffic density, sudden road closures, and adverse weather, which can alter spatial–temporal correlations and driver behaviors, may limit its adaptability. Future enhancements will focus on dynamic data fusion within the spatial–temporal attention module—leveraging adaptive weight adjustments (e.g., using traffic density for grid aggregation and gating mechanisms for road closures) and integrating weather data into the driving style analysis layer—to better capture complex, real-world scenarios.