
CN118796133B - Slow fault detection method, device, equipment and medium in distributed storage system - Google Patents

Slow fault detection method, device, equipment and medium in distributed storage system Download PDF

Info

Publication number
CN118796133B
Authority
CN
China
Prior art keywords
slow
data set
slow fault
linear regression
regression model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411280438.XA
Other languages
Chinese (zh)
Other versions
CN118796133A (en)
Inventor
刘培德
汤国林
滕飞
王鹏
刘微俏
朱宝颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Finance and Economics
Original Assignee
Shandong University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Finance and Economics filed Critical Shandong University of Finance and Economics
Priority to CN202411280438.XA priority Critical patent/CN118796133B/en
Publication of CN118796133A publication Critical patent/CN118796133A/en
Application granted granted Critical
Publication of CN118796133B publication Critical patent/CN118796133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653Monitoring storage devices or systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/27Regression, e.g. linear or logistic regression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0617Improving the reliability of storage systems in relation to availability
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method, device, equipment and medium for detecting slow faults in a distributed storage system, which belongs to the technical field of distributed storage fault detection. The method comprises the following steps: step S1, collecting data of the interaction between the OC module and all OSDs at intervals of t, calculating the latency and throughput within the time t, and constructing a data set; step S2, preprocessing the data set using the PCAP algorithm and the DBSCAN algorithm; step S3, establishing a linear regression model using polynomial regression; step S4, predicting the preprocessed data set using the linear regression model, identifying out-of-bounds entries according to the slow fault detection threshold and marking them as slow entries; step S5, using a sliding window and a scoreboard mechanism to count and calculate scores for all slow fault events. The invention can identify slow fault events and discover such problems in advance, effectively improving system availability, alleviating the long-tail problem, and improving the availability of the Ceph system.

Description

Slow fault detection method, device, equipment and medium in distributed storage system
Technical Field
The invention relates to a slow fault detection method, a slow fault detection device, slow fault detection equipment and slow fault detection media in a distributed storage system, and belongs to the technical field of distributed storage fault detection.
Background
Slow faults, also known as grey faults, are of increasing concern. In a slow failure, a software or hardware component keeps running but delivers lower performance than expected. With the advent of new hardware devices and software stacks, the effects of slow failures (caused, for example, by failing NAND or scheduling problems), which might previously have been masked as noise, are more likely to be noticed. Recent studies have shown that slow fault events may occur as frequently per year as fail-stop events.
Accurately detecting fail-slow faults is challenging. Performance changes caused by internal factors (e.g., SSD garbage collection) or external factors (e.g., workload bursts) may show symptoms similar to slow failures. Unlike well-defined fail-stop faults, which have standard symptoms (e.g., software crashes, data loss), determining a slow fault is often empirical in practice and is therefore inherently inaccurate. In addition, slow failures are often transient, which makes them hard for field engineers to identify, let alone reproduce or trace to a root cause.
The OSDs in the Ceph distributed storage system already have slow-IO detection: requests whose tracked latency exceeds a fixed threshold trigger a warning that prompts engineers to step in and locate the problem. However, the causes of slow requests are numerous, and this type of slow fault is relatively short-lived and hard to capture, so once the problem has passed it can no longer be traced. The node-level IASO detection framework mainly counts timeouts to compute scores, produces an overall ranking table, and derives a service-degradation severity decision from the ranking to decide whether a node should be isolated from the cluster. However, that framework cannot be applied directly to Ceph, it cannot infer the slow-failure state of a drive, and manual intervention is still required to locate the specific root cause.
The data centers of the Langchao cloud are distributed nationwide; each data center hosts multiple storage clusters, and the distributed storage system Ceph or another system is deployed on each cluster to support storage services. Each cluster has anywhere from fewer than 40 hard disks to nearly 10,000. By default, the data storage drives in each node are of the same model. The existing Ceph system can detect and isolate failed hard disks, but when a slow fault occurs it can only raise slow-request alarms, and such faults are difficult to track; moreover, the PGs in an OSD are globally distributed across fault domains, so the scope of the impact is large. The present invention therefore proposes a slow fault detection method for distributed storage systems.
Disclosure of Invention
In order to solve the above problems, the invention provides a slow fault detection method, device, equipment and medium in a distributed storage system, which can improve the availability of a Ceph system.
The technical scheme adopted by the invention for solving the technical problems is as follows:
in a first aspect, a method for detecting a slow failure in a distributed storage system according to an embodiment of the present invention includes the following steps:
Step S1: collecting the data exchanged between the OC module and all OSDs at intervals of t, calculating the latency and throughput within t, and constructing a data set;
Step S2: preprocessing the data set using the PCAP algorithm and the DBSCAN algorithm;
Step S3: establishing a linear regression model using polynomial regression;
Step S4: predicting on the preprocessed data set with the linear regression model, identifying boundary entries according to the slow fault detection threshold, and marking them as slow entries;
Step S5: counting all slow fault events with a sliding window and scoreboard mechanism and calculating scores.
As a possible implementation manner of this embodiment, in step S1, the collected data includes a timestamp, a host, a disk ID, a delay, and a throughput.
As a possible implementation manner of this embodiment, the step S2 of preprocessing the data set by using the PCAP algorithm and the DBSCAN algorithm includes:
Carrying out standardization processing on the data set;
identifying abnormal values in the data set by adopting a DBSCAN algorithm;
The PCAP algorithm is used to discard outlier entries in the dataset.
As a possible implementation manner of this embodiment, the step S3 of establishing a linear regression model by using polynomial regression includes:
Constructing polynomial characteristics and converting the polynomial characteristics, and establishing a binomial relation between time delay and throughput;
Fitting a binomial relation between the time delay and the throughput to construct a linear regression model, and outputting a prediction result of the time delay by the linear regression model.
As a possible implementation manner of this embodiment, the step S4 of predicting the preprocessed data set by using a linear regression model, identifying and marking the boundary entry as a slow fault entry according to a slow fault detection threshold includes:
calculating standard deviation of residual error between a prediction result and an actual value of the linear regression model, determining a corresponding z quantile by using the characteristic of standard normal distribution in statistics through a given confidence level, multiplying the standard deviation by the z quantile to obtain an upper limit of a confidence interval, and taking the upper limit of the confidence interval as a slow fault detection threshold;
And identifying abnormal items, namely defining a sliding window after predicting each node, and if the predicted value in the sliding window exceeds the upper limit of the confidence interval, considering that a slow fault event occurs.
As a possible implementation manner of this embodiment, the step S5, using a sliding window and a scoreboard mechanism to count and calculate scores for all the slow fault events, includes:
judging whether a slow fault event occurs;
if a slow fault event occurs but no slow fault event occurred in the preceding period, the Score is decreased by a decay factor (the initial Score value is 10); if the Score keeps decaying and falls below 0, it is set to 0;
if a slow fault event occurs and a slow fault event also occurred in the preceding period, the Score is increased by a growth factor, and if the Score exceeds the threshold, the calculation is stopped.
As a possible implementation manner of this embodiment, the determining whether a slow fault event occurs includes:
the ratio slowRatio of slow entries within a given time, used to judge whether a slow fault event exists, is:
slowRatio = slowCount / totalCount,
wherein slowCount is the number of slow fault entries and totalCount is the total number of entries;
if slowRatio is greater than the slow fault event duty cycle threshold, then a slow fault event is deemed to have occurred.
In a second aspect, an embodiment of the present invention provides a slow fault detection device in a distributed storage system, including:
The data set construction module is used for acquiring the data interacted by the OC module and all OSD at intervals of t time, calculating the time delay and throughput in the t time and constructing a data set;
The data set preprocessing module is used for preprocessing the data set by adopting a PCAP algorithm and a DBSCAN algorithm;
the model building module is used for building a linear regression model by adopting polynomial regression;
The slow fault detection module is used for predicting the preprocessed data set by utilizing a linear regression model, identifying a boundary entry according to a slow fault detection threshold value and marking the boundary entry as a slow entry;
And the slow fault statistics module is used for counting all slow fault events and calculating scores by adopting a sliding window and scoreboard mechanism.
In a third aspect, an embodiment of the present invention provides a computer device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor, and when the computer device is running, the processor communicates with the memory through the bus, and the processor executes the machine-readable instructions to perform steps of a slow fault detection method in any of the distributed storage systems described above.
In a fourth aspect, embodiments of the present invention provide a storage medium having a computer program stored thereon, which when executed by a processor performs the steps of a slow failure detection method in any of the above-described distributed storage systems.
The technical scheme of the embodiment of the invention has the following beneficial effects:
Using machine learning techniques, the invention builds a regression model from the characteristics of the captured disk-level latency and throughput data, evaluates the severity of slow faults with a scoreboard mechanism, isolates problematic drives in time, and improves the availability of the Ceph system. The invention introduces a new node-level, lightweight, small-data-set regression-based detection of drive-level slow fault events, discovers such problems in advance, and effectively improves system availability and alleviates the long-tail problem.
The slow fault detection device in the distributed storage system of the technical scheme of the embodiment of the invention has the same beneficial effects as the slow fault detection method in the distributed storage system of the technical scheme of the embodiment of the invention.
Drawings
FIG. 1 is a flow chart illustrating a method of slow fault detection in a distributed storage system according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a slow failure detection mechanism in a distributed storage system according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a system architecture of an apparatus involved in the implementation of the present invention;
FIG. 4 is a schematic view of a polynomial regression model visualization;
FIG. 5 is a diagram showing the time delay variation of disk1 within 1h on a host;
Fig. 6 is a schematic diagram showing the delay variation of disk4 within 1h on the same host.
Detailed Description
In order to more clearly illustrate the technical features of the solution of the present invention, the present invention will be described in detail below with reference to the following detailed description and the accompanying drawings.
In order to detect the slow-failure state of drives in a Ceph system, the invention uses classical machine learning techniques (PCA, DBSCAN and polynomial regression) to establish a mapping between latency variation and workload pressure, from which an accurate adaptive threshold can be determined automatically for each node to identify tracked slow requests. Furthermore, based on the slow requests, the invention builds corresponding fail-slow events and uses a scoreboard mechanism to evaluate the severity of such events.
As shown in fig. 1, the method for detecting a slow failure in a distributed storage system according to the embodiment of the present invention includes the following steps:
Step S1: collecting the data exchanged between the OC module and all OSDs at intervals of t, calculating the latency and throughput within t, and constructing a data set;
Step S2: preprocessing the data set using the PCAP algorithm and the DBSCAN algorithm;
Step S3: establishing a linear regression model using polynomial regression;
Step S4: predicting on the preprocessed data set with the linear regression model, identifying boundary entries according to the slow fault detection threshold, and marking them as slow entries;
Step S5: counting all slow fault events with a sliding window and scoreboard mechanism and calculating scores.
As a possible implementation manner of this embodiment, in step S1, the collected data includes a timestamp, a host, a disk ID, a delay, and a throughput.
As a possible implementation manner of this embodiment, the step S2 of preprocessing the data set by using the PCAP algorithm and the DBSCAN algorithm includes:
Carrying out standardization processing on the data set;
identifying abnormal values in the data set by adopting a DBSCAN algorithm;
The PCAP algorithm is used to discard outlier entries in the dataset.
As a possible implementation manner of this embodiment, the step S3 of establishing a linear regression model by using polynomial regression includes:
Constructing polynomial characteristics and converting the polynomial characteristics, and establishing a binomial relation between time delay and throughput;
Fitting a binomial relation between the time delay and the throughput to construct a linear regression model, and outputting a prediction result of the time delay by the linear regression model.
As a possible implementation manner of this embodiment, the step S4 of predicting the preprocessed data set by using a linear regression model, identifying and marking the boundary entry as a slow fault entry according to a slow fault detection threshold includes:
calculating standard deviation of residual error between a prediction result and an actual value of the linear regression model, determining a corresponding z quantile by using the characteristic of standard normal distribution in statistics through a given confidence level, multiplying the standard deviation by the z quantile to obtain an upper limit of a confidence interval, and taking the upper limit of the confidence interval as a slow fault detection threshold;
And identifying abnormal items, namely defining a sliding window after predicting each node, and if the predicted value in the sliding window exceeds the upper limit of the confidence interval, considering that a slow fault event occurs.
As a possible implementation manner of this embodiment, the step S5, using a sliding window and a scoreboard mechanism to count and calculate scores for all the slow fault events, includes:
judging whether a slow fault event occurs;
if a slow fault event occurs but no slow fault event occurred in the preceding period, the Score is decreased by a decay factor (the initial Score value is 10); if the Score keeps decaying and falls below 0, it is set to 0;
if a slow fault event occurs and a slow fault event also occurred in the preceding period, the Score is increased by a growth factor, and if the Score exceeds the threshold, the calculation is stopped.
As a possible implementation manner of this embodiment, the determining whether a slow fault event occurs includes:
the ratio slowRatio of slow entries within a given time, used to judge whether a slow fault event exists, is:
slowRatio = slowCount / totalCount,
wherein slowCount is the number of slow fault entries and totalCount is the total number of entries;
if slowRatio is greater than the slow fault event duty-cycle threshold, a slow fault event is deemed to have occurred, namely:
curSlow = 1 if slowRatio > slowThreshold, otherwise curSlow = 0;
wherein slowThreshold is the slow fault event duty-cycle threshold; if curSlow = 1, a slow fault event is considered to have occurred.
As shown in fig. 2, a slow fault detection device in a distributed storage system according to an embodiment of the present invention includes:
The data set construction module is used for acquiring the data interacted by the OC module and all OSD at intervals of t time, calculating the time delay and throughput in the t time and constructing a data set;
The data set preprocessing module is used for preprocessing the data set by adopting a PCAP algorithm and a DBSCAN algorithm;
the model building module is used for building a linear regression model by adopting polynomial regression;
The slow fault detection module is used for predicting the preprocessed data set by utilizing a linear regression model, identifying a boundary entry according to a slow fault detection threshold value and marking the boundary entry as a slow entry;
And the slow fault statistics module is used for counting all slow fault events and calculating scores by adopting a sliding window and scoreboard mechanism.
As shown in FIG. 3, the components involved in the implementation of the present invention are the OSD (Object Storage Device), the MGR (Ceph Manager) module, and the OC module. The specific slow fault detection process performed by the invention is as follows.
1. And (5) data collection.
The OC module collects op_w_latency and op_w_in_bytes from its interaction with all OSDs on the node every 15 s, calculates the throughput and latency within the period, and collects for 1 hour each day while the load is idle before stopping. The format of the collected data is as follows:
The collected data includes time stamp (Timestamp), host (Host), disk ID (disk), latency (Latency), throughput (Throughput).
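The following is a minimal sketch of how one interval's latency and throughput could be derived from the cumulative OSD write counters named above, assuming op_w_latency exposes a running latency sum together with an operation count and op_w_in_bytes is a running byte total; the field and function names below are illustrative and not part of the original description.

```python
import time
from dataclasses import dataclass

@dataclass
class Sample:
    lat_sum: float   # cumulative write latency in seconds (assumed running sum)
    op_count: int    # cumulative number of write operations
    bytes_in: int    # cumulative bytes written (op_w_in_bytes)

def interval_metrics(prev: Sample, cur: Sample, interval_s: float = 15.0):
    """Per-interval mean latency and throughput from two consecutive samples."""
    ops = max(cur.op_count - prev.op_count, 1)
    latency = (cur.lat_sum - prev.lat_sum) / ops              # seconds per op
    throughput = (cur.bytes_in - prev.bytes_in) / interval_s  # bytes per second
    return latency, throughput

def make_row(host: str, disk_id: str, prev: Sample, cur: Sample) -> dict:
    """One data-set row in the format described above."""
    latency, throughput = interval_metrics(prev, cur)
    return {"Timestamp": time.time(), "Host": host, "Disk": disk_id,
            "Latency": latency, "Throughput": throughput}
```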
The following steps are then performed for the collected data set.
2. Outlier entries are identified and discarded using PCA and the density-based spatial clustering algorithm DBSCAN.
2.1 DBSCAN algorithm.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that defines a cluster as the largest set of density-connected points; it can partition regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a noisy spatial database.
A DBSCAN cluster is a maximal set of density-connected samples derived from the density-reachability relation, i.e., one class (cluster) of the final clustering. A cluster may contain one or more core objects. If there is only one core object, all the other non-core samples in the cluster lie in the ε-neighborhood of that core object; if there are multiple core objects, the ε-neighborhood of any core object in the cluster must contain at least one other core object, otherwise the two core objects would not be density-reachable. The union of all samples in the ε-neighborhoods of these core objects forms a DBSCAN cluster.
How can such a cluster sample set be found? A core object that has not yet been assigned to a class is chosen arbitrarily as a seed, and the set of all samples density-reachable from it is found, which constitutes one cluster. Then another core object without a class is chosen and its density-reachable sample set is found, yielding another cluster (every cluster obtained this way is certainly density-connected). This is repeated until all core objects have been assigned to a class.
2.2 PCA principal component analysis.
Dimensionality reduction is a preprocessing method for high-dimensional feature data. It keeps the important characteristics of high-dimensional data while removing noise and unimportant features, thereby increasing the data-processing speed. In practical production and application, dimensionality reduction within an acceptable range of information loss can save a large amount of time and cost, and it is a very widely used data preprocessing method.
PCA (Principal Component Analysis) is one of the most widely used dimensionality-reduction algorithms. Its main idea is to map n-dimensional features onto k dimensions; these k dimensions are completely new orthogonal features, also called principal components, reconstructed on the basis of the original n-dimensional features. PCA works by sequentially finding a set of mutually orthogonal axes from the original space, and the choice of axes is closely tied to the data itself. The first new axis is the direction of maximum variance in the original data; the second axis is the direction of maximum variance within the plane orthogonal to the first axis; the third axis is the direction of maximum variance within the space orthogonal to the first two axes; and so on, until n such axes are obtained. With the axes obtained in this way, most of the variance is contained in the first k axes, while the remaining axes contain almost zero variance. The remaining axes can therefore be ignored and only the first k axes, which carry the vast majority of the variance, are kept. In effect, this retains only the feature dimensions that contain most of the variance and ignores those whose variance is almost zero, achieving dimensionality reduction of the data features.
2.3 Data normalization.
StandardScaler is a common data preprocessing technique that standardizes feature data to a distribution with a mean of 0 and a variance of 1. Its main purposes include:
Eliminating scale differences between features: different features often have different units and scales; for example, one feature may range from several hundred to several thousand while another lies between 0 and 1. Such scale differences can affect the model during training, possibly causing slow convergence or poor results. Normalization removes these differences and makes it easier for the model to learn the relationships between features.
In many machine learning algorithms, such as gradient descent methods, the feature scale directly affects the convergence speed of the algorithm and the final model accuracy. By the normalization process, the algorithm can be made to converge faster and a more accurate model can be obtained.
Reducing the influence of outliers and extreme values: standardization converts the data to a distribution with a mean of 0 and a variance of 1, which reduces the influence of outliers and extreme values on the model, because the standardized data have a tighter distribution and outliers affect the mean and variance less.
Improving model interpretability: in some cases, if the scale differences between features are large, interpreting the model parameters becomes difficult. Normalization makes the magnitudes of the model parameters comparable, thereby improving the interpretability of the model.
2.4 Screening outliers using DBSCAN and PCA.
Removing outlier samples is a necessary preprocessing step before applying the regression model. Although the <throughput, latency> sample pairs of one node are typically clustered together, entries from slow-failing drives or from normal performance variation (e.g., internal GC) may still be skewed. Therefore, the present invention first screens out outliers before building the polynomial regression model. The density-based clustering algorithm DBSCAN (which measures spatial distance) is an effective way to identify potentially normal points and outliers.
In short, DBSCAN groups points that are sufficiently close in space, i.e., whose mutual distance is below a minimum. Note that <throughput, latency> pairs from long-term or permanently slow-failing drives may themselves cluster together, but far from the primary cluster. Therefore, the present invention retains only the cluster containing the most points for further modeling.
Unfortunately, screening the raw data set with DBSCAN alone proves to be of limited effectiveness. The root cause is that throughput and latency are positively correlated, so the samples (i.e., <throughput, latency> pairs) are stretched along a particular direction, and outliers (i.e., samples from slow-failing drives) may be falsely marked as inliers. Therefore, the present invention transforms the coordinates using principal component analysis (PCA) and penalizes deviations perpendicular to the tilt direction to reduce such false markings.
2.5 Based on the above invention, the following steps are performed in the engineering:
A. The data are first read from the CSV file using the pandas library, as shown in Table 1 (Table 1 shows a partial sample of the data).
TABLE 1 Data read from the CSV file
B. Take log10 of the throughput to obtain log10_throughput, construct a StandardScaler to normalize the data to zero mean and unit variance, construct a two-dimensional PCA object, and apply the PCA transform to the normalized data.
C. Construct a DBSCAN object with a neighborhood radius of 0.5 and a minimum sample count of 5, and run prediction screening on the PCA-transformed data.
D. Delete the abnormal entries whose predicted label is -1.
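As a reference, steps A-D above could be expressed with pandas and scikit-learn roughly as follows; this is a sketch under the stated parameters (neighborhood radius 0.5, minimum sample count 5), and the CSV column names are assumed to match the data format in Table 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def preprocess(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)                             # A. read the collected data
    df["log10_throughput"] = np.log10(df["Throughput"])    # B. log-scale the throughput
    features = df[["log10_throughput", "Latency"]].values
    scaled = StandardScaler().fit_transform(features)      # B. zero mean, unit variance
    projected = PCA(n_components=2).fit_transform(scaled)  # B. two-dimensional PCA transform
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(projected)  # C. density clustering
    return df[labels != -1].reset_index(drop=True)         # D. drop entries labeled -1
```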
3. A regression model is built by performing polynomial regression on the cleaned data set.
3.1 Model selection.
Since normal drives within a node have similar latency-to-throughput relationships (i.e., they cluster well together), the present invention can use a regression model to describe the behavior of "normal" drives and to delineate the range of variation for slow failure detection. Classical regression models include linear regression, polynomial regression, and advanced approaches such as kernel regression. The present invention does not use linear regression because the dependency of latency on throughput is clearly nonlinear. Advanced models (e.g., kernel regression) are also unnecessary because the mapping from throughput to latency is largely monotonic (i.e., latency increases with throughput). Polynomial regression is preferable because it can handle the nonlinearity while preserving model simplicity (i.e., achieving the required goodness of fit with sufficient parameters).
3.2 Polynomial regression.
Polynomial regression is a regression analysis method that establishes a relationship between an independent variable (feature) and a dependent variable (target) by fitting data using a polynomial function. Unlike linear regression models, polynomial regression models allow polynomial terms of independent variables, making the model more flexible, enabling better fitting of data of nonlinear relationships.
The principle of the polynomial regression model is as follows:
A. Model expression: the polynomial regression model can be written as:
y = β0 + β1·x + β2·x² + … + βn·xⁿ + ε,
where y is the dependent variable (target), x is the independent variable (feature), β0, β1, …, βn are the model coefficients, n is the order of the polynomial, and ε is the error term.
B. Fitting procedure: polynomial regression is fitted by minimizing the sum of squared residuals, i.e., by finding the coefficients that minimize the error between the model predictions and the true values. This can be solved with least squares or similar methods.
C. Feature transformation: in polynomial regression, if the input has only a single feature x, the polynomial terms such as x², x³, … in the model are obtained by polynomial conversion of the original feature. For example, for quadratic polynomial regression, the feature x is converted to [x, x²] and then fitted with a linear regression model.
D. Order selection: in practical applications it is important to choose a suitable polynomial order n. Too low an order leads to underfitting, where the model cannot capture the complexity of the data, while too high an order leads to overfitting, where the model becomes overly sensitive to noise and generalizes poorly to new data.
E. Evaluation: the performance of a polynomial regression model may be assessed with the usual regression metrics, such as the mean squared error (MSE), the root mean squared error (RMSE), and the coefficient of determination (R-squared).
3.3 Polynomial regression is applied to establish the binomial relationship between latency and log10_throughput.
First, a PolynomialFeatures object is constructed to perform the binomial feature conversion, then a LinearRegression object is constructed to perform the fitting, and finally latency_pred is predicted with the constructed model.
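A minimal sketch of this fitting step with scikit-learn is shown below; it assumes the cleaned DataFrame produced by the preprocessing sketch above, and degree 2 is used for the binomial relationship.

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def fit_latency_model(df):
    X = df[["log10_throughput"]].values
    y = df["Latency"].values
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)            # binomial feature conversion: [x, x^2]
    model = LinearRegression().fit(X_poly, y)
    latency_pred = model.predict(X_poly)      # predicted latency for each entry
    return model, poly, latency_pred
```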
4. The upper prediction limit is used as the slow fault detection threshold. The model is then applied to the original data set to identify the out-of-bounds entries and mark them as slow entries.
4.1 Constructing a threshold according to the model.
A. Calculate the standard deviation of the residual (prediction error) between the model prediction latency_pred and the actual value latency. This standard deviation measures the dispersion between the predicted and actual values, i.e., the prediction accuracy of the model.
B. Calculate the upper confidence limit: for a given confidence level (99.9%), the corresponding z quantile is determined from the standard normal (z) distribution; 3.291 is the z quantile at the 99.9% confidence level. The standard deviation is then multiplied by the z quantile to obtain the upper bound of the confidence interval. This upper limit represents the upper range of the model predictions at the 99.9% confidence level.
C. The model is visualized as shown in FIG. 4.
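A short sketch of steps A and B follows, assuming the residuals are roughly normally distributed so that the 99.9% z quantile (3.291) applies; whether the resulting offset is added to each prediction, as below, or used as a standalone threshold is an implementation choice.

```python
import numpy as np

Z_999 = 3.291  # z quantile at the 99.9% confidence level

def upper_limit(latency, latency_pred, z=Z_999):
    """Upper confidence bound per entry: prediction plus z times the residual std."""
    resid_std = np.std(np.asarray(latency) - np.asarray(latency_pred))  # step A
    return np.asarray(latency_pred) + z * resid_std                     # step B
```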
4.2 Identifying an exception entry.
FIG. 4 shows the fitted drive latency (green line) and the 99.9% upper limit (red line). The invention runs the statistics per node for 1 hour each day; after the per-node regression, with data collected every 15 s, a sliding window of 5 minutes is defined, and every 5 minutes the proportion of each drive's latency samples exceeding the upper limit is checked; if that proportion exceeds a certain threshold (configurable), a slow fault event is considered to have occurred.
5. Slow fault events are identified. The daily slow fault events of each drive are counted with a sliding window and scoreboard mechanism, scores are calculated, and the scores are reported to the Mgr.
The specific process for calculating the score is as follows:
Determine whether a slow fault event exists within a 2-minute window:
slowRatio = slowCount / totalCount,
wherein slowCount is the number of slow fault entries analyzed and totalCount is the total number of entries;
curSlow = 1 if slowRatio > slowThreshold, otherwise curSlow = 0,
wherein slowThreshold is the slow fault event duty-cycle threshold; if curSlow = 1, a slow fault event is considered to have occurred;
If a slow fault occurs currently and there was no slow fault event in the last 2 minutes, the Score is decreased by a decay factor (the initial Score value is 10); if the decay continues and the result falls below 0, the Score is set to 0.
If a slow fault event also occurred within the last 2 minutes, the Score is increased by a growth factor; once the Score exceeds the threshold (default 100), the calculation is stopped.
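The scoreboard update could be sketched as below. The original text does not preserve the exact decay and growth factors, so the values used here are placeholders, as are the update forms (subtractive decay, multiplicative growth); only the initial value of 10, the floor of 0 and the cap of 100 come from the description.

```python
def update_score(score, slow_count, total_count, prev_slow,
                 slow_threshold=0.5, decay=1.0, growth=1.5, cap=100.0):
    """One 2-minute scoreboard step; returns (new_score, cur_slow)."""
    slow_ratio = slow_count / total_count if total_count else 0.0
    cur_slow = 1 if slow_ratio > slow_threshold else 0  # slow fault event in this window?
    if cur_slow and not prev_slow:
        score = max(score - decay, 0.0)   # isolated event after a quiet period: decay, floor at 0
    elif cur_slow and prev_slow:
        if score <= cap:                  # stop updating once the cap is exceeded
            score = score * growth
    return score, cur_slow

# Example: a drive starts at Score = 10 and is evaluated window by window.
# score, prev = 10.0, 0
# for slow_count, total in [(3, 8), (5, 8), (6, 8)]:
#     score, prev = update_score(score, slow_count, total, prev)
```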
After the calculation is completed, the node's drive score pairs <OSDID, Score> are sent to the Mgr; the Mgr module ranks all drives and forwards them to second-line engineers for intervention.
The Mgr module collects the node drive scores reported by the OC modules every day, produces the overall ranking, derives the slow-fault disk list in the proportion specified by the configured threshold, and dispatches it to the second-line engineers for intervention.
The emerging class of fail-slow faults plagues both software and hardware: the affected component keeps running, but its performance is degraded. To address this problem, the invention introduces a new node-level, lightweight, small-data-set regression-based detection of drive-level slow fault events, discovers such problems in advance, and effectively improves system availability and alleviates the long-tail problem.
FIG. 5 shows the latency variation of disk1 within 1 h on a host, and FIG. 6 shows the latency variation of disk4 within 1 h on the same host. In FIGS. 5 and 6, the yellow curve is the raw drive data, the red curve is the polynomial regression curve over the cleaned data of multiple drives in the host, and the green curve is the 99.9% upper-limit curve of the polynomial regression. FIG. 6 shows that no slow fault event occurs on disk4, whereas FIG. 5 shows that a slow fault event may exist on disk1.
The invention can identify slow fault events and discover such problems in advance, thereby effectively improving system availability, alleviating the long-tail problem, and improving the availability of the Ceph system.
The embodiment of the invention provides a computer device, which comprises a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, when the device runs, the processor and the memory are communicated through the bus, and the processor executes the machine-readable instructions to execute the steps of the slow fault detection method in any distributed storage system.
In particular, the above-mentioned memory and processor can be general-purpose memory and processor, and are not particularly limited herein, and the slow failure detection method in the above-mentioned distributed storage system can be performed when the processor runs a computer program stored in the memory.
It will be appreciated by those skilled in the art that the structure of the computer device is not limiting of the computer device and may include more or fewer components than shown, or may be combined with or separated from certain components, or may be arranged in a different arrangement of components.
In some embodiments, the computer device may further include a touch screen operable to display a graphical user interface (e.g., a launch interface of an application) and to receive user operations with respect to the graphical user interface (e.g., launch operations with respect to the application). A particular touch screen may include a display panel and a touch panel. The display panel may be configured in the form of an LCD (Liquid CRYSTAL DISPLAY), an OLED (Organic Light-Emitting Diode), or the like. The touch panel may collect touch or non-touch operations on or near the user and generate preset operation instructions, for example, operations on or near the touch panel by the user using any suitable object such as a finger, a stylus, or the like. In addition, the touch panel may include two parts, a touch detection device and a touch controller. The touch controller receives touch information from the touch detection device, converts the touch information into information which can be processed by the processor, sends the information to the processor, and can receive a command sent by the processor and execute the command. In addition, the touch panel may be implemented by various types such as resistive, capacitive, infrared, and surface acoustic wave, or may be implemented by any technology developed in the future. Further, the touch panel may overlay the display panel, and a user may operate on or near the touch panel overlaid on the display panel according to a graphical user interface displayed by the display panel, and upon detection of an operation thereon or thereabout, the touch panel is transferred to the processor to determine a user input, and the processor then provides a corresponding visual output on the display panel in response to the user input. In addition, the touch panel and the display panel may be implemented as two independent components or may be integrated.
Corresponding to the above method, an embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, performs the steps of the slow failure detection method in any of the above distributed storage systems.
The device provided by the embodiment of the application can be specific hardware on the equipment, or software or firmware installed on the equipment. The device provided by the embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiment; for the sake of brevity, where the device embodiment is not described, reference may be made to the corresponding content of the foregoing method embodiment. It will be clear to those skilled in the art that, for convenience and brevity, the specific operation of the system, apparatus and units described above may refer to the corresponding process in the above method embodiment and is not described again here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of modules is merely a logical function division, and there may be additional divisions in actual implementation, and for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with respect to each other may be through some communication interface, indirect coupling or communication connection of devices or modules, electrical, mechanical, or other form.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiment provided by the application may be integrated in one processing module, or each module may exist alone physically, or two or more modules may be integrated in one module.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the specific embodiments of the present invention without departing from the spirit and scope of the present invention, and any modifications and equivalents are intended to be included in the scope of the claims of the present invention.

Claims (8)

1. A method for detecting a slow failure in a distributed storage system, comprising the steps of:
step S1, data interacted by an OC module and all OSD are acquired at intervals of t time, time delay and throughput in the t time are calculated, and a data set is constructed;
S2, preprocessing a data set by adopting a PCAP algorithm and a DBSCAN algorithm;
s3, establishing a linear regression model by adopting polynomial regression;
s4, predicting the preprocessed data set by using a linear regression model, identifying a boundary entry according to a slow fault detection threshold value and marking the boundary entry as a slow entry;
S5, counting all slow fault events by adopting a sliding window and scoreboard mechanism and calculating scores;
step S3, a linear regression model is established by polynomial regression, and the method comprises the following steps:
Constructing polynomial characteristics and converting the polynomial characteristics, and establishing a binomial relation between time delay and throughput;
Fitting a binomial relation between the time delay and the throughput to construct a linear regression model, wherein the linear regression model outputs a prediction result of the time delay;
Step S5, counting and calculating scores of all the slow fault events by adopting a sliding window and a scoreboard mechanism, including:
judging whether a slow fault event occurs;
if a slow fault event occurs but no slow fault event occurred in the preceding period, the Score is decreased by a decay factor (the initial Score value is 10); if the Score keeps decaying and falls below 0, it is set to 0;
if a slow fault event occurs and a slow fault event also occurred in the preceding period, the Score is increased by a growth factor, and if the Score exceeds the threshold, the calculation is stopped.
2. The method of claim 1, wherein in step S1, the collected data includes a time stamp, a host, a disk ID, a latency, and a throughput.
3. The method for detecting slow failure in a distributed storage system according to claim 1, wherein the step S2 of preprocessing the data set by using PCAP algorithm and DBSCAN algorithm comprises:
Carrying out standardization processing on the data set;
identifying abnormal values in the data set by adopting a DBSCAN algorithm;
The PCAP algorithm is used to discard outlier entries in the dataset.
4. The method for detecting slow faults in a distributed storage system according to claim 1, wherein said step S4 of predicting the preprocessed data set using a linear regression model, identifying and marking a boundary entry as a slow fault entry based on a slow fault detection threshold comprises:
calculating standard deviation of residual error between a prediction result and an actual value of the linear regression model, determining a corresponding z quantile by using the characteristic of standard normal distribution in statistics through a given confidence level, multiplying the standard deviation by the z quantile to obtain an upper limit of a confidence interval, and taking the upper limit of the confidence interval as a slow fault detection threshold;
And identifying abnormal items, namely defining a sliding window after predicting each node, and if the predicted value in the sliding window exceeds the upper limit of the confidence interval, considering that a slow fault event occurs.
5. The method of claim 1, wherein determining whether a slow failure event has occurred comprises:
the ratio slowRatio of slow entries within a given time, used to judge whether a slow fault event exists, is:
slowRatio = slowCount / totalCount,
wherein slowCount is the number of slow fault entries and totalCount is the total number of entries;
if slowRatio is greater than the slow fault event duty cycle threshold, then a slow fault event is deemed to have occurred.
6. A slow failure detection apparatus in a distributed storage system, comprising:
The data set construction module is used for acquiring the data interacted by the OC module and all OSD at intervals of t time, calculating the time delay and throughput in the t time and constructing a data set;
The data set preprocessing module is used for preprocessing the data set by adopting a PCAP algorithm and a DBSCAN algorithm;
the model building module is used for building a linear regression model by adopting polynomial regression;
The slow fault detection module is used for predicting the preprocessed data set by utilizing a linear regression model, identifying a boundary entry according to a slow fault detection threshold value and marking the boundary entry as a slow entry;
The slow fault statistics module is used for counting all slow fault events and calculating scores by adopting a sliding window and a scoreboard mechanism;
the specific process of the model building module for building the linear regression model by adopting polynomial regression is as follows:
Constructing polynomial characteristics and converting the polynomial characteristics, and establishing a binomial relation between time delay and throughput;
Fitting a binomial relation between the time delay and the throughput to construct a linear regression model, wherein the linear regression model outputs a prediction result of the time delay;
the slow fault statistics module adopts a sliding window and a scoreboard mechanism to count all slow fault events and calculates scores, and the specific process is as follows:
judging whether a slow fault event occurs;
if a slow fault event occurs but no slow fault event occurred in the preceding period, the Score is decreased by a decay factor (the initial Score value is 10); if the Score keeps decaying and falls below 0, it is set to 0;
if a slow fault event occurs and a slow fault event also occurred in the preceding period, the Score is increased by a growth factor, and if the Score exceeds the threshold, the calculation is stopped.
7. A computer device comprising a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the computer device is in operation, the processor executing the machine-readable instructions to perform the steps of the slow fault detection method in a distributed storage system as claimed in any one of claims 1 to 5.
8. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the slow fault detection method in a distributed storage system as claimed in any one of claims 1 to 5.
CN202411280438.XA 2024-09-13 2024-09-13 Slow fault detection method, device, equipment and medium in distributed storage system Active CN118796133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411280438.XA CN118796133B (en) 2024-09-13 2024-09-13 Slow fault detection method, device, equipment and medium in distributed storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411280438.XA CN118796133B (en) 2024-09-13 2024-09-13 Slow fault detection method, device, equipment and medium in distributed storage system

Publications (2)

Publication Number Publication Date
CN118796133A CN118796133A (en) 2024-10-18
CN118796133B true CN118796133B (en) 2024-12-20

Family

ID=93026353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411280438.XA Active CN118796133B (en) 2024-09-13 2024-09-13 Slow fault detection method, device, equipment and medium in distributed storage system

Country Status (1)

Country Link
CN (1) CN118796133B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194177A (en) * 2023-11-03 2023-12-08 四川省华存智谷科技有限责任公司 Method for improving detection accuracy of slow disk of storage system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9842013B2 (en) * 2014-10-27 2017-12-12 Aruba Networks, Inc. Dynamic adaptive approach for failure detection of node in a cluster
US11119800B1 (en) * 2018-06-28 2021-09-14 Amazon Technologies, Inc. Detecting and mitigating hardware component slow failures
CN111913667B (en) * 2020-08-06 2023-03-14 平安科技(深圳)有限公司 OSD blocking detection method, system, terminal and storage medium based on Ceph
CN115795403A (en) * 2022-10-26 2023-03-14 上海交通大学 A storage device slow fault detection method, system and storage medium
CN116149894B (en) * 2023-02-28 2023-10-27 哈尔滨工业大学(深圳) Method for detecting slow card and related equipment
CN117093461A (en) * 2023-08-31 2023-11-21 济南浪潮数据技术有限公司 Method, system, equipment and storage medium for time delay detection and analysis

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194177A (en) * 2023-11-03 2023-12-08 四川省华存智谷科技有限责任公司 Method for improving detection accuracy of slow disk of storage system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Dynamic Scheduling: Scoreboarding";Shuai Wang等;《Computer Architecture》;20161231;1-44页 *

Also Published As

Publication number Publication date
CN118796133A (en) 2024-10-18

Similar Documents

Publication Publication Date Title
JP7100155B2 (en) Alarm log compression methods, devices and systems, and storage media
US10055275B2 (en) Apparatus and method of leveraging semi-supervised machine learning principals to perform root cause analysis and derivation for remediation of issues in a computer environment
US9424157B2 (en) Early detection of failing computers
US11403164B2 (en) Method and device for determining a performance indicator value for predicting anomalies in a computing infrastructure from values of performance indicators
US8078913B2 (en) Automated identification of performance crisis
US11307916B2 (en) Method and device for determining an estimated time before a technical incident in a computing infrastructure from values of performance indicators
CN111459700A (en) Method and apparatus for diagnosing device failure, diagnostic device, and storage medium
US11675643B2 (en) Method and device for determining a technical incident risk value in a computing infrastructure from performance indicator values
US20190265088A1 (en) System analysis method, system analysis apparatus, and program
CN111400122B (en) Hard disk health degree assessment method and device
CN111459692B (en) Method, apparatus and computer program product for predicting drive failure
US9003076B2 (en) Identifying anomalies in original metrics of a system
CN118708461B (en) Intelligent central control system of commercial tablet personal computer
CN115964361B (en) Data enhancement method, system, equipment and computer readable storage medium
CN113656228A (en) Disk fault detection method and device, computer equipment and storage medium
CN108415810A (en) Hard disk state monitoring method and device
CN115964211A (en) Root cause positioning method, device, equipment and readable medium
CN111769974A (en) A method for fault diagnosis of cloud system
CN117112339A (en) Abnormality detection method, abnormality detection device, electronic device, and computer program product
CN118796133B (en) Slow fault detection method, device, equipment and medium in distributed storage system
CN118822510A (en) A computer room equipment status monitoring method, system and monitoring terminal
CN113093695A (en) Data-driven SDN controller fault diagnosis system
JP6973445B2 (en) Display method, display device, and program
CN117407245A (en) Model training task anomaly detection method and system, electronic equipment and storage medium
CN117520040A (en) Micro-service fault root cause determining method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant