entropy-27-00181
1 School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China;
iezzhang@zzu.edu.cn (Z.Z.); wangsong61@163.com (S.W.); iexguo@zzu.edu.cn (X.G.);
jfgaozzu@163.com (J.G.)
2 School of Integrated Circuits, Zhongyuan University of Technology, Zhengzhou 451191, China
3 Department of Electrical, Computer and Software Engineering, The University of Auckland,
Auckland 1010, New Zealand; a.hu@auckland.ac.nz
* Correspondence: 6407@zut.edu.cn
Abstract: Industrial fault diagnosis faces unique challenges with high-dimensional data,
long time-series, and complex couplings, which are characterized by significant infor-
mation entropy and intricate information dependencies inherent in datasets. Traditional
image processing methods are effective for local feature extraction but often miss global
temporal patterns, crucial for accurate diagnosis. While deep learning models like Vision
Transformer (ViT) capture broader temporal features, they struggle with varying fault
causes and time dependencies inherent in industrial data, where adding encoder layers
may even hinder performance. This paper proposes a novel global and local feature fusion
sequence-aware ViT (GLF-ViT), modifying feature embedding to retain sampling point
correlations and preserve more local information. By fusing global features from the classi-
fication token with local features from the encoder, the algorithm significantly enhances
complex fault diagnosis. Experimental analyses on data segment length, network depth,
feature fusion and attention head receptive field validate the approach, demonstrating
that a shallower encoder network is better suited for high-dimensional time-series fault
diagnosis in complex industrial processes compared to deeper networks. The proposed
method outperforms state-of-the-art algorithms on the Tennessee Eastman (TE) dataset
and demonstrates excellent performance when further validated on a power transmission
fault dataset.
Keywords: fault diagnosis; vision Transformer; feature fusion; Tennessee Eastman process
Academic Editor: Yongbo Li
Received: 6 January 2025
Revised: 24 January 2025
Accepted: 6 February 2025
Published: 8 February 2025
non-dominated sorting genetic algorithm (NSGAII) [5] can be used for feature selection
to identify the optimal subset of features for classification. Performance can be further
enhanced by combining kernel PCA (KPCA) with kernel density estimation [6]. However,
these methods typically assume that the data follow a certain distribution and require high-
quality data preprocessing. For complex, multivariate chemical processes, the diagnostic
accuracy and robustness of these methods are relatively limited.
With the advancement of machine learning techniques, increasing research has focused
on applying machine learning algorithms to fault diagnosis. Traditional machine learning
methods, such as support vector machines (SVM) [7] and random forests (RF) [8], achieve
fault classification and diagnosis by learning various features in the data, particularly non-
linear features. These methods have demonstrated significant performance improvements;
however, due to limitations in their structural design, they are unable to fully capture
higher-order features and spatiotemporal dependencies, and their performance heavily
relies on feature engineering and parameter optimization.
The rise of deep learning technology has provided new solutions for fault diagnosis.
With the remarkable progress of convolutional neural networks (CNN) in computer vision,
more researchers have applied deep learning algorithms from the image processing field
to fault diagnosis. The application of image classification algorithms for time-series fault
diagnosis can be broadly categorized into two approaches. One approach involves using
feature extraction techniques to visually represent time-series data. For instance, Barrera-
Llanga et al. used a visual geometry group (VGG) 19-based deep learning approach to
transform current signals into spectral images for induction motor fault classification [9].
Similarly, Zhang et al. proposed a method combining a frequency-domain Markov transition
field (FDMTF) with a multi-branch residual convolutional neural network (MBRCNN) [10].
Another approach organizes multidimensional time-series data into graph structures for
further processing. Since industrial process data have temporal and dimensional char-
acteristics, they can be organized into a structure similar to images through slicing. By
leveraging the powerful feature extraction capabilities of CNN, various local features
within data segments can be automatically obtained [11–13]. To further extract multi-level
high-level features, researchers have combined CNN with various algorithm structures,
such as attention mechanisms [14], wavelet transforms [15], and auto-encoders [16], to
further optimize feature weighting, combine time-frequency domain feature information,
and train the network in an unsupervised manner to enhance classification performance
and applicability. However, due to the limited receptive field of convolutional kernels,
CNN-based algorithms lack the ability to capture temporal features. To address this,
researchers have employed algorithms such as 1D-CNN [17], dilated convolutional neural
networks (DCNN) [18], and temporal convolutional networks (TCN) [19] to learn temporal
variation features, and they have combined CNN with structures such as long short-term
memory (LSTM) [20], gated recurrent units (GRU) [21], and deep Shapley additive
explanations (SHAP) [22,23] to effectively handle high-dimensional and complex time-series data,
significantly improving the accuracy and robustness of fault diagnosis. However, these
algorithms still rely on aggregating local features [24], and their ability to capture global
temporal features in long-period, large-scale datasets needs further enhancement.
The Transformer model [25] has demonstrated exceptional performance in various
fields, including natural language processing and time-series forecasting. Building on
this, the Vision Transformer (ViT) [26] successfully adapted the Transformer architecture for
image processing tasks, showcasing strong capabilities in capturing temporal features. This
has made ViT a promising approach for fault diagnosis applications. Recent advancements
have explored innovative modifications to the Transformer framework. For example,
studies have combined CNN with Transformers to integrate the local feature extraction strengths of
Entropy 2025, 27, 181 3 of 25
CNN with the global information modeling capabilities of Transformers [27,28]. Pyramid
attention mechanisms have also been introduced, employing hierarchical structures to
capture temporal dependencies across different scales [29]. Additionally, convolutional
pooling and distillation operations have been applied between self-attention modules to
downsample features and enhance feature representation [30], while pyramid encoder–
decoder structures have been used to model multi-scale dependencies [31]. ViT retains
most elements of the Transformer architecture while incorporating a classification token
specifically designed for global feature learning and classification. Its potential has been
explored in various applications, such as geological fault detection [32] and the diagnosis of
rolling bearing faults in aircraft engines using model distillation techniques with multiple
ViT models [33]. In the context of the TE process dataset, ViT has been evaluated alongside
alternative approaches, including wavelet transforms and CNN, with its advantages in
fault diagnosis being systematically analyzed [34].
Although the Transformer architecture has shown remarkable capabilities in sequence
modeling, it faces several challenges when directly applied to high-dimensional time-series
fault diagnosis. First, the standard Transformer lacks an explicit mechanism for global
feature aggregation, which is crucial for fault pattern recognition across multiple sensor
channels. Second, its purely sequential processing nature may not effectively capture
the concurrent relationships among different sensor measurements. The ViT architecture,
with its patch-based processing and classification token design, provides a more suitable
framework for our scenario. The classification token serves as a natural aggregator for
global feature learning, while the patch-based approach allows for more efficient parallel
processing of multi-sensor data segments.
However, ViT was originally designed for image processing, and the high-dimensional
time-series data in our study differs significantly from real image pixel values. The data
points in these segments are often normalized values from sensors and lack the strong
spatiotemporal relationships that traditional image pixels have with surrounding pixels.
Simply applying image-based algorithms to high-dimensional time-series fault diagnosis
may lead to information confusion, which can negatively impact system performance [34].
Additionally, increasing the depth of algorithms used for fault diagnosis does not necessar-
ily improve recognition performance, which is quite different from image-based algorithms.
When using ViT for image processing, global feature aggregation of image patches is
necessary to effectively restore overall image semantics. However, when dealing with high-
dimensional data, it is essential to specifically analyze the characteristics of the data being
processed. This often requires recombining local features from each token and selecting
appropriate network parameters to achieve maximum effectiveness.
To address the limitations of previous studies, this research tackles the challenges
of applying ViT to high-dimensional time-series data, particularly in fault diagnosis for the
Tennessee Eastman chemical process. The main contributions of this paper are as follows:
• We propose a sequence-aware ViT network that is specifically adapted to high-
dimensional sensor data, addressing the limitations of traditional ViT models when
applied to data without inherent spatiotemporal relationships. This adaptation is
critical for achieving accurate fault diagnosis in complex industrial processes.
• We enhance the fusion of global and local features by employing a multi-head atten-
tion mechanism. This approach improves diagnostic accuracy while maintaining a
streamlined model design, avoiding additional structural complexity.
• We provide an in-depth analysis of attention focus across encoder layers, identifying
potential causes for performance degradation in deeper networks. This analysis offers
valuable guidance for designing more effective models in fault diagnosis, particularly
for applications requiring high reliability.
The organization of this paper is as follows: Section 2 describes the relevant theoretical
information for the algorithm. Section 3 introduces the embedding method and the global–
local feature fusion approach proposed in this paper. Section 4 presents the TE dataset,
compares the performance of the proposed algorithm in fault diagnosis with other
state-of-the-art (SOTA) algorithms, and analyzes the effects of parameters such as data length,
encoder depth, and the number of attention heads on the algorithm’s performance. Section 5
concludes the paper.
2. Preliminaries
2.1. Multi-Head Attention Mechanism
The multi-head attention mechanism is an extension of the self-attention mechanism. It
enhances the model’s representation capacity, allowing the model to learn diverse features
from different subspaces. Specifically, the multi-head attention mechanism computes
attention through multiple distinct heads, concatenates the outputs of these heads, and
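then applies a linear transformation to generate the final output. The calculation process
can be represented as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$ and $W^O \in \mathbb{R}^{h d_k \times d}$ are learnable parameter matrices. The
variable $h$ represents the number of attention heads, and $d_k$ is the dimensionality of each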
attention head. The advantage of this structure lies in its ability to capture different
feature representations in parallel through multiple attention heads, thereby enhancing
the model’s capability to identify complex patterns. Additionally, the concatenation of
attention results from multiple heads ensures that the final representation contains rich
contextual information, which significantly helps improve the model’s generalization
ability. As a result, the multi-head attention mechanism has become a core component of
Transformer and its variants, widely applied in fields such as natural language processing
and computer vision.
where $X_{\mathrm{embed}}$ represents the input vectors processed through the embedding layer, and $PE$
is the positional encoding matrix. This combination allows the Transformer to leverage
both the content information and the positional information when handling sequences,
significantly improving the model’s performance in tasks that rely on sequence order.
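As an illustration of how a positional encoding matrix can be combined with the embedded input, the following minimal sketch adds a sinusoidal encoding element-wise to the embeddings. The sinusoidal form is the one used in the original Transformer; whether this paper uses it or a learnable encoding is not stated here, so it is an assumption, and the function names are hypothetical.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model matrix of sinusoidal position codes.

    Assumption: the sinusoidal scheme of the original Transformer;
    the paper may use a different (for example learnable) encoding.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even index: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd index: cosine
    return pe


def add_positional_encoding(x_embed):
    """Element-wise sum X_embed + PE for a seq_len x d_model input."""
    seq_len, d_model = len(x_embed), len(x_embed[0])
    pe = sinusoidal_positional_encoding(seq_len, d_model)
    return [[x + p for x, p in zip(row, pe_row)]
            for row, pe_row in zip(x_embed, pe)]
```

Because the encoding depends only on position and dimension, it can be precomputed once and reused for every data segment of the same length.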
are added to each patch embedding to retain spatial information. The complete input
sequence is formed as:
This sequence is processed through multiple Transformer encoder layers, each con-
sisting of multi-head self-attention mechanisms and feed-forward networks. The encoder
layers employ residual connections and layer normalization to ensure stable gradient
propagation during training. In each layer, the self-attention mechanism enables each
token to attend to all other tokens, capturing both local and global dependencies in the
input. The final classification is performed by applying an MLP and softmax function to
the classification token output from the last encoder layer:
$$y = \mathrm{softmax}(\mathrm{MLP}(z_{\mathrm{cls}}^{L})) \tag{8}$$
where $z_{\mathrm{cls}}^{L}$ represents the final state of the classification token after $L$ encoder layers. This
architecture has demonstrated remarkable performance in various vision tasks, primarily
due to its ability to model long-range dependencies and capture global context information.
3. Proposed Algorithm
Unlike the natural images that ViT typically processes, the data in this study, although
arranged in image form, consist of time series derived from various sensor readings. The
pixels in these data lack the strong spatial relationships found in natural images, but there
is significant homogeneity between rows of pixels. Directly partitioning these data into
patches, as in image processing, could disrupt this dependency. Therefore, we designed a
new multidimensional time-series embedding method. In the Tennessee Eastman (TE) chemical
production process, different faults are primarily associated with different state variables,
and the time dependencies of each fault also vary. As a result, special attention is required
when extracting features
to ensure both depth and breadth. Overly deep feature extraction may lead to overfitting,
which can degrade system performance.
The traditional ViT obtains more global information by stacking encoder layers and
using larger datasets. However, for fault diagnosis in complex industrial processes, fault
causes are often associated with multiple factors, and there are strong nonlinear, spatiotem-
poral correlations and couplings between faults. The combination of global and local
information is crucial for achieving optimal fault diagnosis performance. Based on this, we
apply a sliding step size of 1 for slicing the data to increase the data volume while reducing
the number of stacked ViT encoder layers. Additionally, we introduce a multi-head atten-
tion mechanism to fuse the features learned by the classification token with the features
output by the encoder. Combining the above improvements, we propose a global–local
feature fusion ViT algorithm, with its structural flow shown in Figure 1. This approach
fully leverages both global and local feature information, enhancing the algorithm’s fault
diagnosis performance in complex industrial systems.
Figure 1. The structure of global and local feature fusion sequence-aware vision transformer (GLF-
ViT). Industrial condition data collected by sensors is preprocessed and segmented into m × n
matrices, then linearly projected by sampling points before being fed into the encoder for feature
extraction and fusion, enabling fault classification.
Instead, the sequence of sampling points in each data segment is linearly projected to map
the multidimensional features of all time steps into a unified high-dimensional space.
Specifically, we first project the input $m$-dimensional features into a $d$-dimensional
feature space through a linear mapping $W \in \mathbb{R}^{m \times d}$:
$$Z = XW + b \tag{9}$$
where $h_t^l$ represents the features of time step $t$ at layer $l$, and the attention weight $\alpha_{ti}^l$ is
computed as:
with $W_Q^l, W_K^l \in \mathbb{R}^{d \times d_k}$ being the query and key projection matrices at layer $l$.
This computation effectively models what we term “cross-temporal features”—the
dynamic relationships between different time points in the sequence. As each time step’s
features are computed by attending to all other time steps, the model captures both local
temporal patterns and their evolution throughout the sequence. The continuous changes in
multidimensional relationships over time are inherently modeled, allowing the capture of
cross-dimensional features based on temporal dependencies.
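The per-sampling-point projection of Equation (9) and the attention weights between time steps can be sketched as follows. This is a minimal plain-Python illustration rather than the authors' implementation: the scaled dot-product form of the scores is an assumption (the standard Transformer choice), and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def project_segment(segment, weight, bias):
    """Equation (9): Z = XW + b, producing one token per sampling point.

    segment: n x m (n sampling points, m sensor variables)
    weight:  m x d projection matrix, bias: length-d vector
    """
    return [[v + b for v, b in zip(row, bias)]
            for row in matmul(segment, weight)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_weights(h, w_q, w_k):
    """alpha[t][i]: how strongly time step t attends to time step i.

    Assumption: scaled dot-product scores q_t . k_i / sqrt(d_k).
    h: n x d token features, w_q and w_k: d x d_k projections.
    """
    q, k = matmul(h, w_q), matmul(h, w_k)
    d_k = len(w_q[0])
    return [softmax([sum(a * b for a, b in zip(q_t, k_i)) / math.sqrt(d_k)
                     for k_i in k])
            for q_t in q]
```

Each row of the returned matrix sums to one, so every time step distributes its attention over all sampling points in the segment, which is what allows cross-temporal relationships to be modeled without splitting a sampling point across patches.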
While deeper networks are often preferred in image-based ViT applications to com-
pensate for information loss from patch division, industrial process data present different
challenges. The temporal variability, strong coupling between variables, and nonlinear
characteristics of industrial processes require a more nuanced approach. Different fault
patterns often manifest in varying clusters of variables and across different temporal scales.
Therefore, we intentionally design our architecture with fewer encoder layers (L ≤ 4) but
leverage multi-head attention mechanisms with differentiated receptive fields:
where each attention head can specialize in capturing different aspects of the temporal
patterns. This approach enriches the feature representation while maintaining the temporal
integrity of the data, better suiting the characteristics of industrial fault diagnosis.
The high-dimensional features output by the encoder $H = [h_1, \ldots, h_n]$ (shown as dark
green parts in Figure 1) predominantly contain local temporal features, which will be
further enhanced through global–local feature fusion in the subsequent stage.
This sequence is fed into the Transformer encoder for feature extraction:
$$H = T(Z) \in \mathbb{R}^{(n+1) \times d} \tag{14}$$
The output $H$ of the encoder contains the classification token output $h_{\mathrm{CLS}}$ and the high-
dimensional representations $h_i$ of all the feature segments after processing through multiple
layers of self-attention and feed-forward networks:
$$H = [h_{\mathrm{CLS}}, h_1, h_2, \ldots, h_n] \tag{15}$$
where $h_{\mathrm{CLS}} \in \mathbb{R}^{d}$ is the classification token output from the encoder, representing global
features, and $h_i \in \mathbb{R}^{d}$ represents the encoded features of the $i$-th time segment. The
classification token output $h_{\mathrm{CLS}}$ interacts with the encoder feature maps $h_i$ through the
multi-head attention mechanism, resulting in fused features that contain both global and
local information.
Assuming the query (Q) in the multi-head attention mechanism comes from the
classification token, while the key (K) and value (V) come from the encoded feature maps:
The multi-head attention mechanism maps the query, key, and value to a lower-
dimensional space through linear projections and computes the attention weights:
$$Q = W_Q h_{\mathrm{CLS}}, \quad K = W_K H, \quad V = W_V H \tag{18}$$
where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are the trainable linear projection matrices, and $d_k$ is the
dimensionality of the lower-dimensional space. The attention weights are computed using
the Softmax function:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{19}$$
$h_{\mathrm{fusion}}$ is the representation that combines global and local features, and it is passed
through a fully connected layer and a softmax layer for classification:
$$y = \mathrm{softmax}(W h_{\mathrm{fusion}} + b)$$
where $W \in \mathbb{R}^{C \times d}$ is the weight matrix of the classification layer, $b \in \mathbb{R}^{C}$ is the bias term,
and $C$ represents the number of classes.
The fusion process through the multi-head attention mechanism provides several key
advantages in our approach. First, by using $h_{\mathrm{CLS}}$ as the query and encoder features as both
key and value matrices, we enable the model to learn selective feature aggregation based
on the global context. Each attention head can specialize in different aspects of the temporal
patterns: some heads focus on short-term dynamics by attending to temporally adjacent
features, while others capture long-range dependencies by attending to features across
the entire sequence. This multi-scale feature learning is particularly crucial for industrial
fault diagnosis, where fault patterns may manifest at various temporal scales. Second,
the attention weights learned during the fusion process effectively serve as an adaptive
feature selection mechanism, allowing the model to emphasize the most relevant temporal
patterns for different fault types. Finally, this fusion architecture maintains the integrity
of both global and local features throughout the process, as the original features from
both the classification token and encoder output participate in the attention computation
without information loss. This comprehensive feature utilization significantly enhances the
model’s capability to handle complex industrial faults that exhibit both global trends and
local anomalies.
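The fusion step described above, with the classification token as the single query and the encoder features as keys and values, can be sketched as a small multi-head cross-attention routine. This is an illustrative sketch under several assumptions rather than the authors' implementation: it uses a row-vector convention, feeds only the local tokens $h_1, \ldots, h_n$ as keys and values, splits $d \times d$ projections evenly across heads, and omits the output projection; the function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_global_local(h_cls, h_local, w_q, w_k, w_v, num_heads):
    """Cross-attention with the classification token as the only query.

    h_cls:   length-d global feature (classification token output)
    h_local: n x d local features (assumed to be h_1..h_n only)
    w_q, w_k, w_v: d x d projections, split evenly across the heads
    Returns (h_fusion, weights): the length-d fused vector and, for each
    head, its attention weights over the n local tokens.
    """
    d = len(h_cls)
    assert d % num_heads == 0, "d must be divisible by num_heads"
    d_k = d // num_heads
    q = matmul([h_cls], w_q)[0]
    k = matmul(h_local, w_k)
    v = matmul(h_local, w_v)
    h_fusion, weights = [], []
    for head in range(num_heads):
        lo, hi = head * d_k, (head + 1) * d_k
        scores = [sum(a * b for a, b in zip(q[lo:hi], k_i[lo:hi]))
                  / math.sqrt(d_k) for k_i in k]
        alpha = softmax(scores)
        weights.append(alpha)
        # Weighted sum of this head's value slice over all n tokens.
        h_fusion.extend(sum(a * v_i[j] for a, v_i in zip(alpha, v))
                        for j in range(lo, hi))
    return h_fusion, weights
```

Returning the per-head weights alongside the fused vector makes it possible to inspect which time steps each head emphasizes, which is the kind of adaptive feature selection the paragraph above describes.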
4. Case Study
This study employs both the Tennessee Eastman dataset and a power system transmis-
sion fault dataset for case studies. The TE dataset, with its numerous variables and diverse
fault types, enables comprehensive analysis and parameter exploration. Meanwhile, the
power system transmission fault data serve to validate the algorithm’s applicability, further
demonstrating its effectiveness.
4.1. TE Database
The TE database is a classic benchmark dataset used in the research of industrial
process fault detection and diagnosis. This dataset was first introduced by Downs and
Vogel in 1993 to simulate the dynamic behavior and fault scenarios of real chemical processes
(as shown in Figure 2), allowing for the evaluation of fault detection methods. The
TE database’s simulation environment replicates a chemical reaction process from Eastman
Chemical Company, involving multiple operating units and control loops, reflecting the
complexity and uncertainty found in real industrial environments.
In this paper, we use the TE dataset [36] to test and validate the performance of the
proposed algorithm. The dataset contains 41 measurable process variables and 11 control
variables, covering various aspects such as temperature, pressure, flow rate, and concentra-
tion. Each variable has its own operating range and dynamic characteristics. The simulation
data include normal operation data as well as 20 different types of fault data. These fault
types encompass various scenarios, including equipment faults, process control faults, and
external disturbances.
The TE database, provided by Harvard University, is divided into training and testing
sets, both of which include fault and normal data. For ease of data management, we used
the normal and fault data from the training set in our experiments. The training set fault
data simulate 500 runs, each simulating 20 fault conditions. Each fault condition runs for
25 h, with the fault introduced after one hour of normal operation, and data sampled every
three minutes. Each fault type has 500 data points per run. The training set normal data
also consist of 500 runs, each running for 25 h with the same sampling frequency as the
fault data, providing 500 normal data points per run.
extraction in the subsequent model stages. To ensure the correlation between data segments,
we used a sliding window approach with a step size of l = 1. This sliding window
technique maintains data continuity while increasing the number of training samples,
thereby improving the model’s generalization ability. For a given time-series
$\{x_t \mid t = 1, 2, \ldots, T\}$, where $x_t \in \mathbb{R}^{m}$ represents the $m$ variable values at time $t$, data segments are
generated through the sliding window. Assuming each data segment has a length of $n$, the
$i$-th data segment can be represented as:
$$X_i = [x_i, x_{i+1}, \ldots, x_{i+n-1}], \quad i = 1, 2, \ldots, T - n + 1 \tag{21}$$
where $X_i \in \mathbb{R}^{n \times m}$ represents a data segment containing $n$ time steps and $m$ variables.
Through this method, the original time-series is transformed into a series of sequences,
each of length n. These data segments serve as input to the deep learning model for fault
detection and classification tasks. To investigate the impact of input sequence length on
the model, we extracted segments of four different lengths: n = 5, 10, 20, 40 for subsequent
experimental analysis. To ensure that the data within each segment remains temporally
continuous, segmentation was performed on a per-run basis, without crossing runs. This
approach ensures that all data within a segment comes from a single process and maintains
temporal continuity, preserving the cross-temporal feature information as much as possible.
Table 1 shows the number of data segments obtained for different segment lengths.
Figure 3. Data slicing process. Slicing the n-dimensional data using a length of m sampling points
with a step size of L.
Segment Length (n)    5            10           20           40
Train                 4,598,160    4,549,860    4,453,260    4,260,060
Validation            199,920      197,820      193,620      185,220
Test                  199,920      197,820      193,620      185,220
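The per-run slicing of Equation (21) and the resulting segment counts can be sketched as follows. The function names are hypothetical. The figure of 480 usable samples per run is an inference and not a statement from the text: it is the value for which 21 classes × 500 runs reproduce the column totals of Table 1, and it would correspond to discarding the first 20 samples (one hour) of each 500-sample run.

```python
def sliding_segments(run, n, step=1):
    """Return all length-n windows of one run, never crossing run boundaries.

    run: list of per-time-step samples (each sample may itself be a vector).
    """
    return [run[i:i + n] for i in range(0, len(run) - n + 1, step)]


def segments_per_run(T, n, step=1):
    """Number of windows for a run of T samples; T - n + 1 when step = 1."""
    return (T - n) // step + 1 if T >= n else 0


if __name__ == "__main__":
    runs = 21 * 500  # 20 fault types plus normal operation, 500 runs each
    usable = 480     # ASSUMPTION inferred from Table 1, not stated in the text
    for n in (5, 10, 20, 40):
        print(n, segments_per_run(usable, n) * runs)
```

Under these assumptions the totals are 4,998,000, 4,945,500, 4,840,500, and 4,630,500 segments for n = 5, 10, 20, and 40, which equal the sums of the train, validation, and test counts in each column of Table 1.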
$$L(\Phi) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} t_{ik} \log p(\hat{y}_i = k \mid X_i; \Phi) \tag{22}$$
where $L$ is the loss function, $\Phi$ is the set of trainable parameters, $N$ is the number of training
samples, $K$ is the number of classes, $t_{ik}$ is an indicator function for sample $i$’s true label in class $k$
(with a value of 0 or 1), and $p(\hat{y}_i = k \mid X_i; \Phi)$ is the probability predicted by the model that $X_i$
belongs to class $k$.
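With one-hot targets, the inner sum in Equation (22) keeps only the log-probability of the true class, so the loss reduces to a mean negative log-likelihood. A minimal sketch of that computation is given below; the function name and the small clamping constant are illustrative additions, not part of the paper.

```python
import math


def cross_entropy(probs, labels):
    """Mean categorical cross-entropy, Equation (22) with one-hot targets.

    probs:  N rows, each holding K predicted class probabilities
    labels: N true class indices (t_ik = 1 only when k == labels[i])
    """
    eps = 1e-12  # guards against log(0); an implementation detail, not from the paper
    return -sum(math.log(max(p[y], eps))
                for p, y in zip(probs, labels)) / len(labels)
```

A prediction that assigns probability 1 to the correct class contributes zero loss, while confident wrong predictions are penalized heavily, which is the behavior the indicator $t_{ik}$ encodes.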
The proposed algorithm utilizes a four-layer encoder with eight attention heads per
encoder. The feature fusion mechanism employs 32 attention heads in the multi-head
attention module, and the linear projection dimension is set to 128. The dropout rate is
configured at 0.1. The model is trained using the Adam optimizer, with a learning rate set
to 0.00001, and mixed precision training is employed to improve computational efficiency.
Additionally, a learning rate scheduler with linear warmup and cosine annealing is used,
where the first 5% of training steps are dedicated to warmup, allowing for a smoother
adjustment of the learning rate. The batch size is set to 1024, and the model is trained for a
total of 600 epochs. To prevent overfitting, an early stopping mechanism is implemented,
terminating the training if the validation accuracy does not improve for 50 consecutive
epochs. The experiments were conducted on the PyTorch platform, using a server equipped
with four NVIDIA Tesla V100-SXM2-32GB GPUs located in Zhengzhou, China.
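The learning rate schedule described above (linear warmup over the first 5% of training steps, followed by cosine annealing) can be sketched as a plain function of the step index. The peak rate of 0.00001 and the 5% warmup fraction come from the text; the decay floor `min_lr`, the per-step (rather than per-epoch) update, and the function name are assumptions made for illustration.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.05, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine annealing.

    step:        zero-based training step
    total_steps: total number of training steps
    min_lr:      final learning rate (ASSUMPTION: decays to zero by default)
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup phase.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop, a function of this kind would typically be wrapped in a scheduler and evaluated once per optimizer step; the exact integration used by the authors is not described in the text.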
From the perspective of time complexity, the algorithm consists of three main components:
the input projection layer, the multi-head self-attention mechanism, and feature
fusion. For an input sequence of length $n$ and feature dimension $d$, the complexity of the
input projection is $O(nd)$. In the $L$-layer Transformer encoder, the self-attention mechanism
in each layer has a complexity of $O(n^2 d)$, with the major computational cost arising
from the matrix multiplication of query-key pairs. During the feature fusion stage, the
interaction between the classification token and the feature map is implemented using a
multi-head attention mechanism, which also has a complexity of $O(n^2 d)$. Thus, the overall
time complexity of the algorithm is $O(Ln^2 d)$, where the primary computational bottleneck
lies in the quadratic complexity of the self-attention mechanism. From the perspective
of space complexity, the model parameters are primarily composed of the Transformer
encoder parameters $O(Ld^2)$ and the input projection layer parameters $O(nd)$. During
runtime, the storage requirements include the attention matrix $O(n^2)$ and the intermediate
feature representations $O(nd)$. Considering a batch size of $b$, the actual memory cost
during training and inference is $O(bn^2 + bnd)$. Overall, the algorithm achieves efficient
computation while maintaining high performance, particularly when the sequence length
and feature dimensions are within a moderate range.
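To give a sense of scale, the dominant terms of this analysis can be evaluated for the reported settings (four encoder layers, projection dimension 128, batch size 1024). The sketch below ignores constant factors, the number of heads, and the extra classification token, so the values are order-of-magnitude estimates rather than measured costs; the function name is hypothetical.

```python
def attention_cost(n, d, num_layers, batch_size):
    """Dominant-term estimates from the complexity analysis (constants ignored)."""
    return {
        "time_per_sample": num_layers * n * n * d,  # O(L n^2 d)
        "attn_memory": batch_size * n * n,          # O(b n^2)
        "feature_memory": batch_size * n * d,       # O(b n d)
    }


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, attention_cost(n, d=128, num_layers=4, batch_size=1024))
```

For the segment lengths studied here, $n$ is smaller than $d$, so in these estimates the $O(bnd)$ feature term is larger than the $O(bn^2)$ attention term; the quadratic term would only dominate for considerably longer segments.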
We used t-SNE for dimensionality reduction to visualize the raw data. After applying
dimensionality reduction to the test set data as shown in Figure 5, it is evident that the raw
data are highly entangled, with multiple categories intertwined. Faults 3, 9, and 15 are
almost completely mixed together, and the raw data are characterized by a large number of
categories and complex distributions. However, after applying our proposed algorithm
in Figure 6, the chaotic and complex data have been organized into several distinct states,
with each color block representing a specific state. The analysis of t-SNE visualization
reveals the mechanism by which the GLF-ViT model captures key features and contributes
to improved diagnostic performance. The raw data, prior to model processing, exhibit a
highly entangled distribution, particularly for faults 3, 9, and 15, which overlap significantly
with other categories, making effective classification challenging. After processing with
the GLF-ViT model, the feature distribution becomes well structured, with distinct clusters
for each category, and the features of complex faults 3, 9, and 15 appear more cohesive
and concentrated. This demonstrates that the model effectively enhances the extraction of
cross-dimensional and temporal features through the classification token for global feature
learning and the encoder’s mechanism for preserving local features.
Figure 5. t-SNE visualization of test set data before inputting into GLF-ViT.
Figure 6. t-SNE visualization of test set data before the classification layer in GLF-ViT.
For the challenging faults 3, 9, and 15, the proposed global–local feature fusion mech-
anism effectively enhances classification performance by precisely modeling the unique
characteristics of these faults. Analysis of data segment length reveals that the complexity
of faults 3, 9, and 15 arises from their highly coupled temporal dependencies and nonlinear
inter-variable features. Shorter time segments fail to provide sufficient contextual informa-
tion to capture these features, whereas longer segments effectively capture cross-temporal
feature correlations. For example, when the segment length increases from 5 to 40, the F1
score for fault 3 improves significantly from 66.61% to 94.28% as shown in Figure 7, with
similar performance gains observed for faults 9 and 15. This demonstrates that increasing
the segment length enables the model to extract key features from complex temporal pat-
terns more comprehensively, thereby improving its diagnostic capability for these faults.
Figure 7. The correlation between the F1 score and the data segment length for faults 3, 9, and 15.
Table 4. Performance comparison for different numbers of feature-fusion attention heads and encoder layers.
nonlinear coupling between variables. Shallow networks can effectively balance the extrac-
tion of global information and critical local features through a dispersed attention range.
However, as the encoder depth increases, the attention mechanism tends to focus on longer
time spans, overly emphasizing global information while neglecting local dependencies,
which leads to an imbalance in feature representation. This issue is particularly pronounced
for complex faults such as faults 3, 9, and 15, which rely on short-term temporal features;
deeper networks fail to adequately capture these critical characteristics, thereby negatively
affecting diagnostic performance. Additionally, deeper networks are more prone to feature
redundancy and overfitting. While shallow networks focus on extracting sufficient global
and local features, deeper networks often introduce redundant features that fail to provide
additional useful information and may amplify noise in the data, thereby weakening gener-
alization capability. These findings highlight the need to strike a balance between network
depth and redundancy in designing models for industrial data, as shallow networks not
only avoid feature redundancy but also better adapt to the diverse temporal dependencies
of such data, resulting in superior diagnostic performance.
To further investigate how the number of attention heads in the encoder affects
algorithm performance, we conducted experiments using a four-layer encoder with
8, 16, and 32 attention heads. The attention heads in the fusion mechanism were
fixed at 32, and all other settings remained the same. The resulting F1 scores for
8, 16, and 32 attention heads were 98.37%, 98.33%, and 98.22%, respectively.
Increasing the number of attention heads therefore did not improve performance
and instead led to a slight decrease. Moreover, a higher number of
attention heads significantly increases algorithmic complexity. Considering these factors,
we ultimately set the encoder network’s attention head count to 8.
Figure 8. Four variant structure diagrams: (a) classification using only encoder features (Transformer),
(b) classification using only classification token features (ViT), (c) feature fusion using a gating
mechanism, (d) the proposed algorithm structure.
Table 5. F1 comparison of four models (a), (b), (c), and (d) across different classes.
To analyze the feature fusion process, we compared the proposed algorithm’s multi-
head attention mechanism with a gating mechanism. The gating mechanism functions like
a switch, computing a set of weights to determine the fusion ratio between two inputs.
The gated weights are computed through a fully connected layer using the classification
vector and the full set of encoder output features. These weights are then compressed into
the range of [0, 1] using the sigmoid activation function, and the two inputs are combined
proportionally to produce fused features for classification. In contrast, multi-head attention
calculates the correlations between input features and combines the outputs of multiple
attention heads to complete complex feature fusion. Each attention head models the
relationships between features in different subspaces, allowing it to capture both global
and fine-grained dependencies more effectively. The comparison results show that the
proposed algorithm performs better. This could be because the gating mechanism is a
linear weighted method, making it difficult to capture complex relationships and high-
order dependencies between input features. The multi-head attention mechanism, on the
other hand, excels at capturing long-range dependencies when handling complex data,
giving it greater flexibility and stronger expressive power in feature fusion, which leads to
better performance.
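The contrast between the two fusion strategies can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation evaluated in the paper. The encoder outputs are mean-pooled before gating, the attention variant uses the classification vector as the only query, and the learned query, key, and value projections are replaced by the identity. All names (`gated_fusion`, `attention_fusion`, the toy dimensions) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_fusion(cls_vec, local_vecs, weights, bias):
    """Sigmoid gate: fused = g * cls + (1 - g) * mean(local), per dimension."""
    dim = len(cls_vec)
    # Assumption: encoder outputs are mean-pooled over time steps.
    pooled = [sum(vec[i] for vec in local_vecs) / len(local_vecs) for i in range(dim)]
    gate_input = cls_vec + pooled  # concatenation, length 2 * dim
    gate = [sigmoid(dot(weights[i], gate_input) + bias[i]) for i in range(dim)]
    fused = [gate[i] * cls_vec[i] + (1.0 - gate[i]) * pooled[i] for i in range(dim)]
    return fused, gate


def attention_fusion(cls_vec, local_vecs, num_heads):
    """Multi-head attention with the classification vector as the query.

    Assumption: identity projections, so each head works on its own slice.
    """
    dim = len(cls_vec)
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    fused, all_weights = [], []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        query = cls_vec[lo:hi]
        keys = [vec[lo:hi] for vec in local_vecs]
        scores = [dot(query, key) / math.sqrt(head_dim) for key in keys]
        weights = softmax(scores)  # one weight per time step
        all_weights.append(weights)
        # Each head contributes its own weighted sum of local features.
        head_out = [sum(w * key[i] for w, key in zip(weights, keys))
                    for i in range(head_dim)]
        fused.extend(head_out)
    return fused, all_weights


random.seed(0)
dim, steps = 8, 6  # toy sizes (assumption)
cls_vec = [random.uniform(-1, 1) for _ in range(dim)]
local_vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(steps)]
gate_w = [[random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
gate_b = [0.0] * dim

gated, gate = gated_fusion(cls_vec, local_vecs, gate_w, gate_b)
attended, head_weights = attention_fusion(cls_vec, local_vecs, num_heads=2)
print(len(gated), len(attended))
```

In this sketch the gate can only blend two fixed vectors dimension by dimension, whereas each attention head assigns a separate weight to every time step, which is one way to read the greater flexibility described above.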
The multi-head attention mechanism plays a critical role in the deep fusion of global
and local features, which is essential for classifying faults 3, 9, and 15. These faults are
characterized by highly complex variable coupling and nonlinear temporal relationships.
The classification token captures global information through interactions with all time
segments, while the encoder outputs retain local features at each time step, ensuring the
integrity of fine-grained information. By leveraging multi-head feature fusion, the attention
mechanism effectively extracts cross-dimensional and cross-temporal dependencies inher-
ent in these faults. Notably, the shallow network structure and the dispersed coverage of
attention heads provide highly adaptive feature extraction capabilities for faults 3, 9, and 15.
In contrast, deeper networks may introduce feature redundancy, potentially undermining
the model’s ability to identify these faults. Therefore, the proposed fusion mechanism
precisely captures the essential characteristics of these complex faults, achieving accurate
classification for faults 3, 9, and 15, and offering strong support for fault diagnosis in
complex industrial processes.
Figure 9. Precision and recall of the GLF-ViT algorithm with 4-layer and 10-layer encoders.
Figure 10. Radar chart of F1 scores for the GLF-ViT algorithm with 4-layer and 10-layer encoders.
The radar chart provides a clearer visualization of the overall performance of the two
models across faults, excluding faults 3, 9, and 15. Both models performed well, with F1
scores above 95% for most states and even reaching 100% for several categories, indicating
that the proposed algorithm offers strong diagnostic capability for the TE chemical process
as a whole. Comparatively, the F1 scores of the 4-layer model were either equal to or
better than those of the 10-layer model, especially for faults 3, 9, and 15, where the 10-layer
network showed significant performance degradation. In the 10-layer network, the F1
score for fault 3 dropped to 86.76%, 8.21 percentage points lower than with the 4-layer model. For
fault 9, the F1 score was 70.84%, a decrease of 16.45 percentage points, and for fault 15, the F1 score was
77.55%, a drop of 12.12 percentage points. The reduced performance on these three faults was
the primary reason for the overall diagnostic performance decline in the 10-layer network.
However, for faults 12 and 18, the 10-layer network showed slight improvements compared
to the 4-layer network. The F1 score for fault 12 increased from 99.79% to 100%, a gain of 0.21
percentage points, and for fault 18, the F1 score rose from 96.45% to 96.84%, an increase of
0.39 percentage points. Although these gains were limited, they suggest that the causes
of faults in multi-class chemical process problems are complex. The main factors driving
each fault, along with their variable and temporal dependencies, vary, and adjustments to
the algorithm can have differing effects on diagnostic performance across different states.
We will further investigate these causes through an analysis of the attention mechanism.
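The changes quoted for faults 12 and 18 are absolute differences between two F1 scores, that is, percentage points rather than relative changes. The short check below recomputes them from the values given in the text. The helper names `point_change` and `relative_change` are illustrative, and only the two faults for which both scores are quoted are included.

```python
def point_change(before, after):
    """Absolute change in percentage points, rounded to two decimals."""
    return round(after - before, 2)


def relative_change(before, after):
    """Relative change as a percentage of the starting value."""
    return round(100.0 * (after - before) / before, 2)


# F1 scores (%) quoted in the text: (4-layer model, 10-layer model).
quoted_f1 = {
    "fault 12": (99.79, 100.0),
    "fault 18": (96.45, 96.84),
}

for fault, (four_layer, ten_layer) in quoted_f1.items():
    print(fault,
          point_change(four_layer, ten_layer),
          relative_change(four_layer, ten_layer))
```

The point changes match the 0.21 and 0.39 reported above. The relative changes are nearly the same here only because both starting scores are close to 100%.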
Figure 11 shows that the attention heads in the 4-layer encoder
cover a range of 10 to 40 sampling points, with varying lengths and significant differences
in attention distribution. In contrast, Figure 12 shows that the attention heads in the 10-layer
encoder focus on ranges above 20 sampling points, mostly concentrated around 30 points.
We speculate that this difference in the attention range contributes to the performance
disparity. In the TE process, faults exhibit varying degrees of correlation not only with
different sensors but also with temporal dependencies. The broader range of attention
distributions in the 4-layer model appears to suit fault diagnosis in the TE process better. By
combining this analysis with recognition performance, we can infer that a diverse range of
attention spans allows the model to better distinguish faults 3, 9, and 15. These traditionally
challenging faults do not seem to exhibit long temporal correlations. If all attention heads
focus on ranges beyond 20 sampling points, it may interfere with diagnosing these faults.
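This passage does not state how the receptive field of an attention head is computed. One plausible definition, used here purely as an assumption, is the smallest distance from the query within which a fixed share of the attention mass falls, averaged over all queries. The sketch below applies that definition to two synthetic heads. The function names, the 90% threshold, and the banded toy attention maps are all hypothetical.

```python
def attention_span(attn_row, query_index, mass=0.9):
    """Smallest distance d such that keys within d of the query hold `mass` of the attention."""
    by_distance = sorted(range(len(attn_row)), key=lambda j: abs(j - query_index))
    covered = 0.0
    for j in by_distance:
        covered += attn_row[j]
        if covered >= mass - 1e-12:  # small tolerance for floating-point sums
            return abs(j - query_index)
    return len(attn_row) - 1


def head_receptive_field(attn_matrix, mass=0.9):
    """Mean attention span over all query positions of one head."""
    spans = [attention_span(row, i, mass) for i, row in enumerate(attn_matrix)]
    return sum(spans) / len(spans)


def banded_attention(length, half_width):
    """Synthetic head: each query attends uniformly to keys within `half_width`."""
    rows = []
    for i in range(length):
        near = [j for j in range(length) if abs(i - j) <= half_width]
        rows.append([1.0 / len(near) if abs(i - j) <= half_width else 0.0
                     for j in range(length)])
    return rows


segment_length = 40  # matches the segment length discussed in the text
local_head = banded_attention(segment_length, half_width=2)
wide_head = banded_attention(segment_length, half_width=segment_length - 1)

print(head_receptive_field(local_head))
print(head_receptive_field(wide_head))
```

Under this definition, a model whose heads all resemble `wide_head` would report uniformly long spans, while a mix of narrow and wide heads would produce the dispersed distribution described for the 4-layer encoder.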
Figure 11. The attention head receptive field for the GLF-ViT algorithm with the 4-layer model.
Moreover, the attention span differences further explain why a data segment length of
40 achieves better performance—shorter data segments fail to provide sufficient temporal
dependencies. Some faults, such as faults 12 and 18, may require longer temporal relation-
ships, but longer attention spans alone may not be the main factor for fault identification,
as extended attention does not significantly improve performance for these faults. Addi-
tionally, the analysis of the attention field diagram shows that shallow networks, with their
dispersed attention distribution, are particularly effective in capturing short- and mid-term
dependency features, while the classification token ensures comprehensive modeling of
the overall fault patterns by extracting global features. This fusion mechanism of global
and local features significantly improves the distinction between categories and enhances
diagnostic accuracy, providing strong support for identifying complex faults.
Figure 12. The attention head receptive field for the GLF-ViT algorithm with the 10-layer model.
The disparity in attention spans also clarifies why traditional Transformer and ViT
models do not exhibit improved performance with deeper layers when applied to high-
dimensional time-series-to-image fault diagnosis. Ultimately, this is because the images
generated from high-dimensional time-series data differ fundamentally from the images
used in computer vision tasks. Unlike traditional images, where there are strong spatial
and temporal relationships between pixels, the transformed images of time-series data do
not exhibit such properties. As a result, deeper networks may cause feature redundancy,
leading to performance degradation. When migrating image classification algorithms to
the fault diagnosis domain, it is essential to consider the nature of the data, the causes of
faults, and the characteristics of the transformed images. This approach ensures the best
performance and enhances the algorithm’s applicability in fault diagnosis.
5. Conclusions
This paper proposed a ViT-based global and local feature fusion algorithm for high-
dimensional time-series fault diagnosis. The algorithm effectively combines the global
features obtained by the learnable classification vector in traditional ViT with the local
features extracted by the Transformer encoder, applying image recognition methods to the
diagnosis of dynamic and complex industrial process faults. Compared to existing SOTA
algorithms, whether based on CNN improvements or ViT improvements, our proposed
algorithm demonstrates superior performance. On the TE dataset, the
proposed algorithm achieves an average F1-score of 98.37% and an average recall of 98.38%
across all data types, including normal states, surpassing the advanced algorithms used for
comparison. These results support the effectiveness of the proposed model. Directly
applying ViT to time-series data may involve several potential risks, such as insufficient
modeling capability for temporal dependencies, inappropriate preprocessing methods, a
lack of interpretability and trustworthiness, as well as challenges in generalization and
robustness. Additionally, ViT requires significant computational resources, which could
limit its efficiency. To fully realize the potential of ViT in time-series analysis, more sys-
tematic and in-depth research is needed in areas such as algorithmic improvement, model
evaluation, and interpretability analysis. At the same time, attention should be given to
the latest technological advancements and practical experiences to continuously optimize
and refine relevant methods. This will ultimately contribute to the development of smarter,
more reliable, and efficient fault diagnosis tools for industrial applications.
Funding: This research was supported by the National Natural Science Foundation of China (Grant
Nos. 62101503 and 62301497), the Science and Technology Project of Henan Province (Grant No.
242102211017), and the Key Research and Development Program of Henan (Grant No. 231111212000).
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
References
1. Jiang, Y.; Yin, S.; Kaynak, O. Performance supervised plant-wide process monitoring in industry 4.0: A roadmap. IEEE Open J.
Ind. Electron. Soc. 2020, 2, 21–35.
2. Ge, Z. Review on data-driven modeling and monitoring for plant-wide industrial processes. Chemometr. Intell. Lab 2017, 171,
16–25.
3. López-Estrada, F.-R.; Astorga-Zaragoza, C.-M.; Theilliol, D.; Ponsart, J.-C.; Valencia-Palomo, G.; Torres, L. Observer synthesis for
a class of Takagi–Sugeno descriptor system with unmeasurable premise variable. Application to fault diagnosis. Int. J. Syst. Sci.
2017, 48, 3419–3430.
4. Lee, J.; Yoo, C.; Lee, I. Statistical process monitoring with independent component analysis. J. Process Contr. 2004, 5, 467–485.
5. Ardali, N.; Zarghami, R.; Gharebagh, R. Optimized data driven fault detection and diagnosis in chemical processes. Comput.
Chem. Eng. 2024, 186, 108712.
6. Shahzad, F.; Huang, Z.; Memon, W. Process monitoring using kernel PCA and kernel density estimation-based SSGLR method
for nonlinear fault detection. Appl. Sci. 2022, 12, 2981.
7. Shi, Q.; Zhang, H. Fault diagnosis of an autonomous vehicle with an improved SVM algorithm subject to unbalanced datasets.
IEEE Trans. Ind. Electron. 2021, 68, 6248–6256.
Entropy 2025, 27, 181 24 of 25
8. Li, C.; Sanchez, R.V.; Zurita, G.; Cerrada, M.; Cabrera, D.; Vásquez, R.E. Gearbox fault diagnosis based on deep random forest
fusion of acoustic and vibratory signals. Mech. Syst. Signal Process. 2016, 76, 283–293.
9. Barrera-Llanga, K.; Burriel-Valencia, J.; Sapena-Bano, A.; Martinez-Roman, J. Fault detection in induction machines using learning
models and Fourier spectrum image analysis. Sensors 2025, 25, 471.
10. Zhang, J.; Zhang, Q.; Qin, X.; Sun, Y. Robust fault diagnosis of quayside container crane gearbox based on 2D image representation
in frequency domain and CNN. Struct. Health Monit. 2024, 23, 324–342.
11. Yan, J.; Liu, T.; Ye, X.; Jing, X.; Dai, Y. Rotating machinery fault diagnosis based on a novel lightweight convolutional neural
network. PLOS ONE 2021, 16, e0256287.
12. Song, Q.; Jiang, P. A multi-scale convolutional neural network based fault diagnosis model for complex chemical processes.
Process Saf. Environ. 2022, 159, 575–584.
13. Xu, M.; Gao, J.; Zhang, Z.; Wang, H. Bearing-fault diagnosis with signal-to-rgb image mapping and multichannel multiscale
convolutional neural network. Entropy 2022, 24, 1569.
14. Xiao, B.; Zhang, Y.; Zhou, C.; Ou, J.; Huang, G. A noise-robust CNN architecture with global attention and gated convolutional
Kernels for bearing fault detection. Meas. Sci. Technol. 2024, 35, 086142.
15. Dong, Z.; Zhao, D.; Cui, L. An intelligent bearing fault diagnosis framework: One-dimensional improved self-attention-enhanced
CNN and empirical wavelet transform. Nonlinear Dynam. 2024, 112, 6439–6459.
16. Debasish, J.; Jayant, P.; Sudheendra, H.; Satish, N. CNN and Convolutional Autoencoder (CAE) based real-time sensor fault
detection, localization, and correction. Mech. Syst. Signal Pro. 2022, 169, 108723.
17. Saif, S.; Wahaibi, A.; Abiola, S.; Lu, Q. Improving convolutional neural networks for fault diagnosis in chemical processes by
incorporating global correlations. Comput. Chem. Eng. 2023 176, 108289.
18. Khan, M.A.; Choo, J.; Kim, Y. Intelligent fault detection using raw vibration signals via dilated convolutional neural networks. J.
Supercomput. 2020, 76, 8086–8100.
19. Ildar, L.; Mark, L.; Ilya, M. Fault detection in Tennessee Eastman process with temporal deep learning models. J. Ind. Inf. Integr.
2021 23, 100216.
20. Huang, T.; Zhang, Q.; Tao, X.; Zhao, S.; Lu, X. A novel fault diagnosis method based on CNN and LSTM and its application in
fault diagnosis for complex systems. Artif. Intell. Rev. 2022, 55, 1289–1315.
21. Meng, X.; Tan, H.; Yan, P.; Zheng, Q.; Chen, G.; Jiang, J. A GNSS/INS Integrated Navigation Compensation Method Based on
CNN–GRU + IRAKF Hybrid Model During GNSS Outages. IEEE Trans. Instrum. Meas. 2024, 73, 2510015.
22. Li, M.; Peng, P.; Sun, H.; Wang, M.; Wang, H. An order-invariant and interpretable dilated convolution neural network for
chemical process fault detection and diagnosis. IEEE Trans. Autom. Sci. Eng. 2023, 21, 3933–3943.
23. Li, Y.; Liu, Z.; Jia, Z.; Zhao, W.; Wang, K.; Qin, X. Fault Diagnosis Strategy for Flight Control Rudder Circuit Based on SHAP
Interpretable Analysis Optimization Transformer With Attention Mechanism. IEEE Trans. Instrum. Meas. 2024, 73, 1–14.
24. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE
Trans. Neur. Net. Lear. Syst. 2022, 33, 6999–7019.
25. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017.
26. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International
Conference on Learning Representations, Vienna, Austria, 4 May 2021.
27. Zhu, Q.; Qian, Y.; Zhang, N.; He, Y.; Xu, Y. Multi-scale Transformer-CNN domain adaptation network for complex processes fault
diagnosis. J. Process Contr. 2023, 130, 103069.
28. Wei, C.; Han, H.; Wu, Z.; Xia, Y.; Ji, Z. Transformer-Based Multiscale Reconstruction Network for Defect Detection of Infrared
Images. IEEE Trans. Instrum. Meas. 2024, 73, 1–14.
29. Liu, S.; Yu, H.; Liao, C.; Lin, J. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and
forecasting. In Proceedings of the Tenth International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April
2022.
30. Zhou, H.; Zhang, S.; Peng, S.; Zhang, J.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence
time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021.
31. Zhang, Y.; Wu, R.; Dascalu, S.; Harris, F. Multi-scale transformer pyramid networks for multivariate time series forecasting. IEEE
Access 2024, 12, 14731–14741.
32. Wang, J.; Ma, S.; An, Y.; Dong, R. A Comparative Study of Vision Transformer and Convolutional Neural Network Models in
Geological Fault Detection. IEEE Access 2024, 12, 136148–136159.
33. Kang, Y.; Chen, G.; Wang, H.; Shen, J.; Wei, X. Fault anomaly detection method of aero-engine rolling bearing based on distillation
learning. ISA Trans. 2023, 145, 387–398.
34. Zhou, K.; Tong, Y.; Li, X.; Huang, H.; Song, K.; Chen, X. Exploring global attention mechanism on fault detection and diagnosis
for complex engineering processes. Process Saf. Environ. 2023, 170, 660–669.
Entropy 2025, 27, 181 25 of 25
35. Downs, J.; Vogel, E. A plant-wide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245–255.
36. Amsel, R.; Tran, B.; Maia, R. Additional tennessee eastman process simulation data for anomaly detection evaluation. Harv.
Dataverse 2017, 1. https://doi.org/10.7910/DVN/6C3JR1.
37. Wei, Z.; Xu, J.; Li, Z.; Dang, Y.; Dai, Y. A novel deep learning model based on target transformer for fault diagnosis of chemical
process. Process Saf. Environ. 2022, 167, 480–492.
38. Jamil, M.; Sharma, S.K.; Singh, R. Fault detection and classification in electrical power transmission system using artificial neural
network. SpringerPlus 2015, 4, 1–13.