TWI587152B

TWI587152B - Method for extending life expectancy of disks in cloud-based service system and system using the same

Info

Publication number: TWI587152B
Application number: TW105134472A
Authority: TW
Inventors: 陳文賢; 黃純芳; 黃明仁
Original assignee: 先智雲端數據股份有限公司
Priority date: 2016-10-26
Filing date: 2016-10-26
Publication date: 2017-06-11
Also published as: TW201816625A

Description

Method for extending the life expectancy of disks in a cloud service system and system using the same

The present invention relates to a method for extending the life expectancy of disks and a system using the method, and more particularly to a method for extending the life expectancy of disks in a cloud service system and a system using the method.

Disks are the main devices in a computer for storing the data that applications need to run. Whatever the type, be it a hard disk, a solid state disk, or even a magnetic tape, a disk will eventually fail after a long period of use. If data backup or archiving is not properly performed before the failure, the data on the disk is lost. Because that data may be important, such as the operating system and the configuration data of the computer system, the loss can be disastrous. A disk usually shows certain signs before it fails; for example, stored data disappears or programs frequently malfunction. A user can easily notice these signs and act to replace the disk and save the data on it. This is feasible because the computer may have only a few disks and the user can observe them every day through the computer's performance.

An architecture that operates a cloud service system faces the same problem. The situation is more complicated, however, because such an architecture usually contains a large number of disks for data access. Depending on the nature and content of the stored data, one disk may be accessed more often than the others, and frequent access is an important factor in shortening a disk's lifespan. Yet it is very difficult to observe the physical performance of every disk frequently and continuously, and for the administrators of a cloud service system, frequently backing up data and replacing failed disks is not cost-effective. Techniques that monitor clustered disks and predict disk lifetime have therefore been disclosed. For example, US Patent Application No. US2016232450 proposes a storage device lifetime monitoring system and a storage device lifetime monitoring method. The method comprises the steps of: collecting operational behavior information of the storage devices; storing a plurality of training data having operational behavior information and corresponding operational lifespans; building a storage device lifetime prediction model from the operational behavior information and the corresponding operational lifespans; inputting the operational behavior information of the storage devices into the model to generate an estimated lifespan for each storage device; and rebuilding the model according to the operational behavior information and the estimated lifespan of each storage device. In this way, the method can accurately predict the lifetime of storage devices.

The aforementioned patent application uses data from logs, such as system logs, application logs, or database logs (the operational behavior information), for training and lifetime prediction. Although log data may not reveal the actual condition of a disk, the records give hints about disk health: abnormal values in the records correlate with the real lifespan of the corresponding disk, so historical data can be used effectively for prediction. If the method could precisely determine the lifespan of every disk from what the logs reveal, then disks of a given model, built with the same manufacturing process and quality requirements, should have real lifespans within a specific range, for example, 4,000 to 5,000 hours of use. In practice, however, some disks of the same model work only for a short time and some work much longer, while most have lifespans within the predicted range. Even two disks with similar operational behavior information may not have the same lifespan. This means that some key factors are missing from the analysis.

For two disks that have similar logs but different lifespans, examining performance data, such as IOPS (Input/Output Operations Per Second), latency, and throughput, or related information, such as Central Processing Unit (CPU) load or host memory usage, can show that the two disks were operated differently, and this difference may be what caused the different lifespans. For example, two disks may have similar access and failure records over one year, yet one was accessed intensively during three of those months while the other was accessed evenly throughout the year. A method that predicts disk lifetime in a cloud service system more accurately, and that further extends the life expectancy of the disks by analyzing input/output patterns, is therefore urgently needed.

This paragraph extracts and compiles certain features of the present invention; other features are disclosed in the subsequent paragraphs. It is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims.

The present invention provides a method for extending the life expectancy of disks in a cloud service system and a system using the method. According to one aspect of the present invention, the method comprises the steps of: A. collecting performance data for each disk in a cloud service system from historical data; B. filtering out performance data that come from a failed disk or from a disk whose lifespan is shorter than a preset value; C. grouping the disks, with their corresponding performance data, according to lifespan levels; D. normalizing the performance data and the lifespans into unitless performance values and unitless lifespans; E. applying an LSTM (Long Short Term Memory) modeling algorithm to the unitless performance values of each disk in each group to obtain a predicted trend of the unitless performance value of each disk over a period of time in the future; F. assigning a specific unitless performance value to all disks in each group based on the predicted trends of the unitless performance values in that group; G. executing a k-means clustering algorithm on input sets to obtain output sets, wherein each input set represents a corresponding disk and contains a specific unitless performance value and a unitless lifespan; H. denormalizing each output set to obtain a performance limit and a target lifespan; and I. setting a performance limit for the disks that form a storage device, so that each disk has an expected lifetime no shorter than the target lifespan over the period of time in the future. The method may further comprise a step J after step I: J. assigning to each storage device a workload whose performance requirement matches or is below the performance limit.

According to the present invention, the performance data may be latency, throughput, Central Processing Unit (CPU) load, memory usage, or IOPS (Input/Output Operations Per Second). The unitless performance value may be calculated by dividing a first difference, between a performance data value and the smallest of all performance data values, by a second difference, between the largest and the smallest of all performance data values. The unitless lifespan may be calculated by dividing a third difference, between a lifespan and the smallest of all lifespans, by a fourth difference, between the largest and the smallest of all lifespans. The disk may be a Hard Disk Drive (HDD) or a Solid State Disk (SSD). The lifespan levels may be evenly distributed over the range between the largest and the smallest of all lifespans. The specific unitless performance value may be obtained by averaging the predicted unitless performance values of the group in the period of time in the future. The historical data may come from system logs, application logs, database logs, or S.M.A.R.T. (Self-Monitoring Analysis and Reporting Technology) logs. The central set of each cluster in step G may be chosen as the output set.

Another aspect of the present invention is a cloud service system. The system comprises: a host for running workloads and performing data access; a plurality of disks, connected to the host, for storing data for the workloads to access; and a life expectancy extension module, configured in or installed on the host, for collecting performance data for each disk from historical data; filtering out performance data that come from a failed disk or from a disk whose lifespan is shorter than a preset value; grouping the disks, with their corresponding performance data, according to lifespan levels; normalizing the performance data and the lifespans into unitless performance values and unitless lifespans; applying an LSTM modeling algorithm to the unitless performance values of each disk in each group to obtain a predicted trend of the unitless performance value of each disk over a period of time in the future; assigning a specific unitless performance value to all disks in each group based on the predicted trends of the unitless performance values in that group; executing a k-means clustering algorithm on input sets to obtain output sets, wherein each input set represents a corresponding disk and contains a specific unitless performance value and a unitless lifespan; denormalizing each output set to obtain a performance limit and a target lifespan; and setting a performance limit for the disks that form a storage device, so that each disk has an expected lifetime no shorter than the target lifespan over the period of time in the future. The life expectancy extension module may further assign to each storage device a workload whose performance requirement matches or is below the performance limit.

According to the present invention, the performance data may be latency, throughput, CPU load, memory usage, or IOPS. The unitless performance value may be calculated by dividing a first difference, between a performance data value and the smallest of all performance data values, by a second difference, between the largest and the smallest of all performance data values. The unitless lifespan may be calculated by dividing a third difference, between a lifespan and the smallest of all lifespans, by a fourth difference, between the largest and the smallest of all lifespans. The disk may be a hard disk or a solid state disk. The lifespan levels may be evenly distributed over the range between the largest and the smallest of all lifespans. The specific unitless performance value may be obtained by averaging the predicted unitless performance values of the group in the period of time in the future. The historical data may come from system logs, application logs, database logs, or S.M.A.R.T. logs. The central set of each cluster produced by the k-means clustering algorithm may be chosen as the output set.

The present invention uses LSTM modeling and a k-means clustering algorithm to find performance limits and target lifespans, so that a cluster of disks can be assigned to a specific workload running on the cloud service system. A minimum lifetime can be predicted for the disks while the requirements of the workload are met. Moreover, that minimum lifetime is the longest life over which the disks can keep operating.

10‧‧‧Cloud service system

100‧‧‧Host

101‧‧‧Central processing unit

102‧‧‧Memory

103‧‧‧Life expectancy extension module

201~230‧‧‧Disks

Figure 1 is a schematic diagram of a cloud storage system according to the present invention.

Figure 2 is a flow chart of a method for extending the life expectancy of disks in a cloud service system according to the present invention.

Figure 3 is a table showing the status of the disks.

Figure 4 is a table showing the grouping result.

Figure 5 shows, in its upper half, a chart of unitless performance values collected over a past period and, in its lower half, a chart of the predicted trend of the unitless performance values over the next 24 hours.

Figure 6 shows the predicted IOPS trends of the 3 disks in the high group.

Figure 7 is a chart showing the distribution of the input sets.

The present invention is described more specifically with reference to the following embodiments.

An ideal architecture for implementing the present invention is shown in Figure 1. A cloud service system 10 includes a host 100 and 30 disks (numbered 201 to 230 for the description below). The host 100 may be a server that runs workloads and performs data access for them. The host 100 is hardware that receives requests from client devices, such as personal computers, tablets, and smartphones, and from other remote devices over the Internet, a Local Area Network (LAN), or a Wide Area Network (WAN). The disks 201 to 230 are connected to the host 100 and store data for the workloads to access. Although there are 30 disks in this embodiment, this number does not limit the application of the present invention; the cloud service system 10 may have any number of disks as long as the requirements of the workloads are met. The disks may be Hard Disk Drives (HDDs) or Solid State Disks (SSDs). In this embodiment, the disks 201 to 230 are all SSDs. For the present invention to be applied, the disks should be of the same type, and preferably of the same model (from the same manufacturer and with the same architecture), so that they can be compared uniformly under the same conditions and the results of the provided method are more accurate.

The host 100 has several key components: a Central Processing Unit (CPU) 101, a memory 102, and a life expectancy extension module 103. The CPU 101 is in charge of the operation of the host 100. The memory 102 may be a Static Random Access Memory (SRAM) or a Dynamic Random Access Memory (DRAM) that temporarily stores data or programs for running the cloud service system 10. The life expectancy extension module 103 is the device that implements the method provided by the present invention, and it is configured in the host 100. Its main functions are collecting performance data for each disk from historical data; filtering out performance data that come from a failed disk or from a disk whose lifespan is shorter than a preset value; grouping the disks, with their corresponding performance data, according to lifespan levels; normalizing the performance data and the lifespans into unitless performance values and unitless lifespans; applying an LSTM (Long Short Term Memory) modeling algorithm to the unitless performance values of each disk in each group to obtain a predicted trend of the unitless performance value of each disk over a period of time in the future; assigning a specific unitless performance value to all disks in each group based on the predicted trends of the unitless performance values in that group; executing a k-means clustering algorithm on input sets to obtain output sets, wherein each input set represents a corresponding disk and contains a specific unitless performance value and a unitless lifespan; denormalizing each output set to obtain a performance limit and a target lifespan; and setting a performance limit for the disks that form a storage device, so that each disk has an expected lifetime no shorter than the target lifespan over the period of time in the future. In addition, the life expectancy extension module 103 can assign to each storage device a workload whose performance requirement matches or is below the performance limit. These functions are explained in detail below together with the method of the present invention. Note that in other embodiments the life expectancy extension module may take the form of software installed on the host 100 (stored in the memory 102 and run by the CPU 101). In still other embodiments, the life expectancy extension module may be a standalone device arranged in parallel with the host 100.

Please refer to Figure 2, which is a flow chart of a method for extending the life expectancy of disks in the cloud service system 10 according to the present invention. The first step of the method is collecting performance data for each disk in the cloud service system 10 from historical data (S01). The performance data can be collected from any component of the cloud service system 10; they need not relate to the physical performance of the disks, but they must be associated with some part of the cloud service system 10. For example, the performance data may be latency, throughput, CPU load, memory usage, or IOPS. The historical data are data collected continuously in the past, including performance data, metadata, and other required information, and may come from system logs, application logs, database logs, or S.M.A.R.T. (Self-Monitoring Analysis and Reporting Technology) logs; the performance data are obtained from these sources. This embodiment uses IOPS for illustration. In addition, the performance data for any disk must be uninterrupted. For example, data recorded continuously and periodically over the past 6 months but missing one week in the third month are not acceptable. Performance data with such gaps are of no use because the input/output pattern of the performance cannot be determined from them.
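The continuity requirement in step S01 can be checked mechanically. The following is a minimal sketch, assuming each disk's performance data carry timestamps and a known sampling interval; the function name `is_uninterrupted` and the sample data are hypothetical and do not come from the patent.

```python
from datetime import datetime, timedelta


def is_uninterrupted(timestamps, interval, tolerance=timedelta(0)):
    """Return True if no two consecutive samples are more than
    interval + tolerance apart (that is, the record has no gap)."""
    ordered = sorted(timestamps)
    return all(
        later - earlier <= interval + tolerance
        for earlier, later in zip(ordered, ordered[1:])
    )


# Hypothetical data: 30 days of hourly IOPS samples.
start = datetime(2016, 1, 1)
hourly = timedelta(hours=1)
complete = [start + i * hourly for i in range(24 * 30)]

# The same record with one week (days 10 to 16) missing.
gap_start = start + timedelta(days=10)
gap_end = start + timedelta(days=17)
with_gap = [t for t in complete if not (gap_start <= t < gap_end)]

print(is_uninterrupted(complete, hourly))   # True
print(is_uninterrupted(with_gap, hourly))   # False
```

A disk whose record fails this check would be left out of the analysis, in line with the one-missing-week example above.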

Next, the second step is filtering out performance data that come from a failed disk or from a disk whose lifespan is shorter than a preset value (S02). For a better understanding, please refer to Figure 3, which is a table showing the status of the disks 201 to 230. As Figure 3 shows, some disks are in the "failed" state at the moment of analysis. A failed disk may be one that has broken down completely and awaits replacement, or one with poor performance, such as a disk with more than a certain number of dead blocks or a disk that overheats easily. The disks 221 to 225 are judged "failed" and cannot be used by the present invention. The remaining disks are all good, but their recorded lifespans differ. The lifespan is the time (in days) a disk has operated normally so far; the disk may keep working for a long time or may fail soon. The lifespan differs from the lifetime, which is defined as the total time a disk can work before it fails. In other words, the lifetime is a fixed value, whereas the lifespan changes over time. Besides the performance data from "failed" disks, performance data from good disks that have been collected for too short a time (a short lifespan) are not representative either. The present invention does not limit how short a lifespan must be for a disk's data to be considered unrepresentative. In this embodiment, disks with a lifespan shorter than 50 days are excluded, so the disks 226 to 230 are not used.
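Step S02 amounts to a simple filter over the disk table. The sketch below is illustrative only: the `Disk` record, the status strings, and the lifespans given for disks 221 and 226 are assumptions (Figure 3 is not reproduced here); only the 50-day threshold and the 154-day lifespan of disk 216 come from the text.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    disk_id: int
    status: str          # "good" or "failed" (hypothetical labels)
    lifespan_days: int   # days of normal operation so far


def filter_disks(disks, min_lifespan_days=50):
    """Keep only good disks whose lifespan is at least the preset value."""
    return [
        d for d in disks
        if d.status != "failed" and d.lifespan_days >= min_lifespan_days
    ]


disks = [
    Disk(216, "good", 154),    # lifespan taken from the embodiment
    Disk(221, "failed", 200),  # hypothetical lifespan
    Disk(226, "good", 30),     # hypothetical lifespan below 50 days
]

kept = filter_disks(disks)
print([d.disk_id for d in kept])  # [216]
```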

The next step is grouping the disks (those that are good and have a lifespan longer than 50 days), with their corresponding performance data, according to lifespan levels (S03). Please see Figure 4, which is a table showing the grouping result. Grouping, or binning, is a data pre-processing technique used to reduce the effects of minor observation errors. The number of groups is not limited. In this embodiment there are 5 groups: higher, high, medium, low, and lower. In other embodiments there may be 3 groups: high, medium, and low. A lifespan level is the boundary value between two adjacent groups. Preferably, the lifespan levels are evenly distributed over the range between the largest and the smallest of all lifespans. Here, the lifespan levels are set to 122, 187, 252, and 317 (days). Accordingly, as Figure 4 shows, the disk 214 is in the higher group; the disks 213, 215, and 208 are in the high group; the disks 218, 203, 219, and 204 are in the medium group; the disks 216, 201, 220, 212, 211, 217, and 202 are in the low group; and the disks 205, 210, 209, 206, and 207 are in the lower group. The performance data are classified into the same group as their corresponding disk.
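The evenly spaced lifespan levels of step S03 can be reproduced from the smallest and largest lifespans quoted in the embodiment (56 and 382 days). This is a sketch under two assumptions that the text does not state: boundaries are rounded up to whole days, and a lifespan exactly equal to a level falls into the upper of the two adjacent groups. The helper names are hypothetical.

```python
import math
from bisect import bisect_right

GROUP_NAMES = ["lower", "low", "medium", "high", "higher"]


def lifespan_levels(min_days, max_days, groups):
    """Boundaries that split [min_days, max_days] into equal-width groups,
    rounded up to whole days (rounding rule is an assumption)."""
    step = (max_days - min_days) / groups
    return [math.ceil(min_days + step * i) for i in range(1, groups)]


def group_of(lifespan_days, levels, names=GROUP_NAMES):
    """Name of the group that a lifespan falls into."""
    return names[bisect_right(levels, lifespan_days)]


levels = lifespan_levels(56, 382, 5)
print(levels)                 # [122, 187, 252, 317]
print(group_of(154, levels))  # low (disk 216 in the embodiment)
```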

The fourth step of the method is normalizing the performance data and the lifespans into unitless performance values and unitless lifespans (S04). According to the present invention, the unitless performance value is calculated by dividing a first difference, between a performance data value and the smallest of all performance data values, by a second difference, between the largest and the smallest of all performance data values. The unitless lifespan is calculated by dividing a third difference, between a lifespan and the smallest of all lifespans, by a fourth difference, between the largest and the smallest of all lifespans. The two calculations are alike but apply to different targets. Take the normalization of the lifespans in Figure 4 as an example. The smallest of all lifespans is 56 and the largest is 382, so the fourth difference is 326 (382 minus 56). The disk 216 has a lifespan of 154, so its third difference is 98 (154 minus 56) and its unitless lifespan is 0.301 (98 divided by 326).
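The worked example above is a min-max normalization. A minimal sketch that follows the arithmetic of the example (154 minus 56, divided by 382 minus 56); the function name `normalize` is chosen here for illustration.

```python
def normalize(value, minimum, maximum):
    """Min-max normalization: map value from [minimum, maximum] to [0, 1]."""
    if maximum == minimum:
        raise ValueError("maximum and minimum must differ")
    return (value - minimum) / (maximum - minimum)


# Disk 216 in the embodiment: lifespan 154, smallest 56, largest 382.
unitless_lifespan = normalize(154, 56, 382)
print(round(unitless_lifespan, 3))  # 0.301
```

The same function applies to performance data such as IOPS, using the smallest and largest performance data values instead.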

Next, an LSTM (Long Short Term Memory) modeling algorithm is applied to the unitless performance values of each disk in each group to obtain a predicted trend of the unitless performance value of each disk over a period of time in the future (S05). An LSTM is a type of Artificial Neural Network (ANN) whose nodes are specially designed to suit the prediction of trends in long-term data. The detailed design of the LSTM modeling algorithm is not the focus of the present invention; any LSTM modeling algorithm can be used, although the results will differ to some degree. Please see Figure 5, which shows, in its upper half, a chart of unitless performance values collected over a past period and, in its lower half, a chart of the predicted trend of the unitless performance values over the next 24 hours. Both charts are for the same disk, for example the disk 208 in the high group. Each point on the solid line in the upper chart is a recorded IOPS value of the disk 208. The data may be collected at intervals of 5 minutes, 10 minutes, 30 minutes, or one hour; the present invention does not limit the interval. The dashed line in the lower chart shows the predicted trend of IOPS over the next 24 hours. Since 20 disks are analyzed, there are 20 such charts of predicted unitless performance values over the period of time in the future.
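For reference, one widely used formulation of an LSTM cell is given below. The patent does not specify which variant is used, so this is background only. Here $x_t$ would be the unitless performance value (or a window of such values) at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication; the weight matrices $W$, $U$ and biases $b$ are learned from the historical data.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)}\\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The gated cell state $c_t$ is what lets the network retain information over long sequences, which matches the description above of nodes designed for long-term trends.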

Next, a specific unitless performance value is assigned to all disks in each group based on the predicted trends of the unitless performance values in that group (S06). Note that the purpose of the present invention is to predict the life expectancy of the disks over the period of time in the future and to assign the best combination of disks to different workloads so as to extend the lifespan of all disks; the result must therefore hold for the disks within "the period of time in the future". That period may be, for example, the next 1 hour, the next 6 hours, or the 1 hour that follows the first 6 hours. The assigned specific unitless performance value therefore changes with how "the period of time in the future" is defined. For a better understanding of step S06, please refer to Figure 6, which shows the predicted IOPS trends of the 3 disks in the high group. To obtain a result for the next 6 hours, one predicted value (black dot) is sampled from each of the three charts. The specific unitless performance value of the high group is obtained by averaging these predicted unitless performance values at the 6th hour (any point in time within the next 6 hours could be used instead); it is 0.43, as shown in Figure 4. The specific unitless performance values of the other groups are also given in Figure 4. Other approaches, such as a weighted average or a geometric mean, can also be used to find the specific unitless performance value; the present invention does not limit this.
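The averaging in step S06 can be sketched as follows. The predicted series below are hypothetical values chosen so that the 6th-hour average comes out near the 0.43 reported for the high group; the actual predictions behind Figure 6 are not given in the text. The function name `specific_value` is also an assumption.

```python
from statistics import mean


def specific_value(predictions_by_disk, hour):
    """Average the predicted unitless performance values of all disks in a
    group at the given hour ahead (hour 1 is the first predicted hour)."""
    return mean(series[hour - 1] for series in predictions_by_disk.values())


# Hypothetical hourly predictions for the three disks of the high group.
high_group = {
    213: [0.38, 0.39, 0.40, 0.40, 0.41, 0.41],
    215: [0.47, 0.46, 0.46, 0.45, 0.45, 0.45],
    208: [0.40, 0.41, 0.42, 0.42, 0.43, 0.43],
}

value = specific_value(high_group, hour=6)
print(round(value, 2))  # 0.43
```

Swapping `mean` for a weighted or geometric mean gives the alternative approaches mentioned above.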

The seventh step of the method is to perform a k-means clustering algorithm to obtain output sets from input sets (S07). Here, each input set represents a corresponding disk and contains a specific unitless performance value and a unitless life value. Figure 7 is a chart showing the distribution of the input sets. The horizontal axis is the specific unitless performance value (based on IOPS), and the vertical axis is the unitless life value. The input sets are marked by solid diamonds. After the computation, the k-means clustering algorithm divides the input sets into three or more clusters (three clusters are used in this embodiment), regardless of which group each input set belongs to. In step S07, the central set of each cluster is selected as the output set. As shown in Figure 7, the output set of each cluster is marked by a hollow circle.
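Step S07 can be illustrated with a plain implementation of Lloyd's k-means iteration. This is a sketch under stated assumptions: the "central set" of a cluster is taken to be its centroid (the per-coordinate mean), the initial centers are supplied by the caller, and the function name `kmeans` and every input pair are hypothetical rather than the data of Figure 7.

```python
import math


def kmeans(points, centers, iterations=100):
    """Lloyd's algorithm on 2-D points.

    points are (unitless performance, unitless life) pairs and
    centers are the initial cluster centers. Returns the final
    centers and the cluster label of every point.
    """
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Hypothetical input sets: (specific unitless performance, unitless life).
input_sets = [
    (0.80, 0.90), (0.85, 0.95), (0.90, 0.85),
    (0.60, 0.20), (0.65, 0.25), (0.55, 0.30),
    (0.20, 0.50), (0.25, 0.45), (0.30, 0.55),
]
initial = [input_sets[0], input_sets[3], input_sets[6]]
output_sets, labels = kmeans(input_sets, initial)
print(output_sets)
```

Fixed initial centers keep the sketch deterministic; library implementations usually choose them randomly or with a seeding heuristic, so their clusters can differ between runs.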

The next step is to denormalize each output set to obtain a performance limit and a target lifetime value, respectively (S08). In step S08, denormalizing means multiplying the values in the output set by the corresponding second difference value and fourth difference value, and adding the respective minimum values, to obtain an IOPS value and a lifetime value. For the cluster with high IOPS and high lifetime value, the performance limit is 1542 IOPS and the lifetime value is 284 days. For the cluster with medium IOPS and shorter lifetime value, the performance limit is 1150 IOPS and the lifetime value is 85 days. For the cluster with low IOPS and medium lifetime value, the performance limit is 544 IOPS and the lifetime value is 147 days. Next, a performance limit is set for the disks forming a storage device, so that each disk has an expected lifetime value not shorter than the target lifetime value over the future period of time (S09). The term "storage device" used here has the same meaning as "cluster", that is, several disks linked together for a specific workload; it should not be confused with the "cluster" in the k-means clustering algorithm. If the disks in a storage device are set to an IOPS performance limit of 1542, then each disk in the storage device has an expected lifetime value of 284 days when a workload whose minimum IOPS is below 1542 is applied.
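The denormalization of step S08 is the inverse of a min-max normalization. The sketch below follows the wording of step S08 (multiply by the difference between maximum and minimum, then add the minimum); the function names, the ranges `IOPS_RANGE` and `LIFE_RANGE`, and the output set are hypothetical, since the description gives only the denormalized results and not the underlying minimum and maximum values.

```python
def normalize(value, lowest, highest):
    """Min-max normalization to a unitless value in [0, 1]."""
    return (value - lowest) / (highest - lowest)


def denormalize(unitless, lowest, highest):
    """Inverse of normalize: multiply by the range, add the minimum."""
    return unitless * (highest - lowest) + lowest


# Hypothetical ranges observed across all disks.
IOPS_RANGE = (200.0, 2000.0)   # minimum and maximum IOPS
LIFE_RANGE = (30.0, 365.0)     # minimum and maximum lifetime in days

# Hypothetical output set: (unitless performance, unitless life).
output_set = (0.5, 0.4)

performance_limit = denormalize(output_set[0], *IOPS_RANGE)
target_lifetime = denormalize(output_set[1], *LIFE_RANGE)
print(performance_limit, target_lifetime)
```

Applying `normalize` to the two results with the same ranges returns the original output set, which is a convenient check that both directions use the same minimum and maximum.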

Once a performance limit has been set for all disks, the final step of the method is to configure, for each storage device (several disks), a workload whose performance requirement matches or is lower than the performance limit (S10). In this case, the disks can be predicted to have a minimum lifetime value while the requirement of the workload (IOPS) is met. Moreover, this minimum lifetime value is the longest life over which the disks can remain in operation. It should be emphasized, however, that the result of step S09 holds only within the "future period of time". When the definition of the "future period of time" changes, for example from the next 1 hour to the next 6 hours, the result of step S09 changes accordingly. The disk configuration of a storage device is therefore dynamic.
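One possible way to carry out step S10 is sketched below. The description only requires that a workload's performance requirement match or be lower than the performance limit; the tightest-fit placement policy, the one-workload-per-storage-device restriction, the function name `assign_workloads`, and the device and workload names are all assumptions made for illustration. The three limits are the IOPS values given in the description of step S08.

```python
def assign_workloads(devices, workloads):
    """Place each workload on a storage device whose limit covers it.

    devices maps a storage device name to its IOPS performance limit;
    workloads maps a workload name to its required IOPS. Returns a
    mapping from workload name to device name, or None when no
    remaining device can serve the workload.
    """
    free = dict(devices)
    plan = {}
    # Place the most demanding workloads first.
    ordered = sorted(workloads.items(), key=lambda kv: kv[1], reverse=True)
    for name, required in ordered:
        candidates = [d for d, limit in free.items() if required <= limit]
        if candidates:
            # Tightest fit: the smallest limit that still suffices.
            chosen = min(candidates, key=lambda d: free[d])
            plan[name] = chosen
            del free[chosen]
        else:
            plan[name] = None
    return plan


# Performance limits from the description; device names are hypothetical.
devices = {"storage_high": 1542, "storage_mid": 1150, "storage_low": 544}

# Hypothetical workloads and their required IOPS.
workloads = {"db": 1400, "web": 900, "backup": 500}

plan = assign_workloads(devices, workloads)
print(plan)
```

Because the limits are valid only for the chosen "future period of time", such a placement would be recomputed whenever that period is redefined or new forecasts arrive.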

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone having ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the appended claims.

Claims

A method for extending a disk life expectancy value in a cloud service system, comprising the steps of: A. collecting, from historical data, performance data of each disk in a cloud service system to generate a plurality of performance data; B. filtering out, from the plurality of performance data collected in step A, the performance data from a faulty disk or a disk having a lifetime value shorter than a preset value; C. grouping the disks, according to life value levels, by their corresponding performance data; D. normalizing the performance data and lifetime values into unitless performance values and unitless life values; E. performing an LSTM (Long Short Term Memory) modeling algorithm on the unitless performance values of each disk in each group to obtain a predicted trend of the unitless performance value of each disk over a future period of time; F. assigning, based on the predicted trends of the unitless performance values in the group, a specific unitless performance value to all of the disks in each group; G. performing a k-means clustering algorithm to obtain output sets from input sets, wherein each input set represents a corresponding disk and contains a specific unitless performance value and a unitless life value; H. denormalizing each output set to obtain a performance limit and a target lifetime value, respectively; and I. setting a performance limit for the disks forming a storage device, so that each disk has an expected lifetime value not shorter than the target lifetime value over the future period of time.

The method of claim 1, further comprising a step J after step I: J. configuring a workload for each storage device, the workload having a performance requirement that matches or falls below the performance limit.

The method of claim 1, wherein the performance data is a delay time, a throughput, a central processing unit (CPU) load, a memory usage, or an IOPS (Input/Output Operations Per Second) value.

The method of claim 1, wherein the unitless performance value is calculated by dividing a first difference value between a performance data value and the minimum of all performance data values by a second difference value between the maximum and the minimum of all performance data values.

The method of claim 1, wherein the unitless life value is calculated by dividing a third difference value between a lifetime value and a maximum of all life values by a fourth difference value between the maximum and the minimum of all life values.

The method of claim 1, wherein the disk is a Hard Disk Drive (HDD) or a Solid State Disk (SSD).

The method of claim 1, wherein the life value levels are evenly distributed within a range between the maximum and the minimum of all life values.

The method of claim 1, wherein the specific unitless performance value is obtained by averaging the predicted unitless performance values of the group in the future period of time.

The method of claim 1, wherein the historical data is from a system log, an application log, a database log, or an S.M.A.R.T. (Self-Monitoring Analysis and Reporting Technology) log.

The method of claim 1, wherein the central set of each cluster in step G is the output set.

A cloud service system, comprising: a host for running workloads and performing data access; a plurality of disks connected to the host for storing data accessed by the workloads; and a life expectancy extension module, configured in or installed on the host, for: collecting, from historical data, performance data of each disk to generate a plurality of performance data; filtering out, from the collected plurality of performance data, the performance data from a faulty disk or a disk having a lifetime value shorter than a preset value; grouping the disks, according to life value levels, by their corresponding performance data; normalizing the performance data and lifetime values into unitless performance values and unitless life values; performing an LSTM modeling algorithm on the unitless performance values of each disk in each group to obtain a predicted trend of the unitless performance value of each disk over a future period of time; assigning, based on the predicted trends of the unitless performance values in the group, a specific unitless performance value to all of the disks in each group; performing a k-means clustering algorithm to obtain output sets from input sets, wherein each input set represents a corresponding disk and contains a specific unitless performance value and a unitless life value; denormalizing each output set to obtain a performance limit and a target lifetime value, respectively; and setting a performance limit for the disks forming a storage device, so that each disk has an expected lifetime value not shorter than the target lifetime value over the future period of time.

The cloud service system of claim 11, wherein the life expectancy extension module is further configured to configure a workload for each storage device, the workload having a performance requirement that matches or falls below the performance limit.

The cloud service system of claim 11, wherein the performance data is delay time, throughput, CPU load, memory usage, or IOPS.

The cloud service system of claim 11, wherein the unitless performance value is calculated by dividing a first difference value between a performance data value and the minimum of all performance data values by a second difference value between the maximum and the minimum of all performance data values.

The cloud service system of claim 11, wherein the unitless life value is calculated by dividing a third difference value between a lifetime value and a maximum of all life values by a fourth difference value between the maximum and the minimum of all life values.

The cloud service system of claim 11, wherein the disk is a hard disk or a solid state disk.

The cloud service system of claim 11, wherein the life value levels are evenly distributed within a range between a maximum and a minimum of all life values.

The cloud service system of claim 11, wherein the specific unitless performance value is obtained by averaging the predicted unitless performance values of the group in the future period of time.

The cloud service system of claim 11, wherein the historical data is from a system log, an application log, a database log, or an S.M.A.R.T. log.

The cloud service system of claim 11, wherein a central set of each cluster from the k-means clustering algorithm is selected as the output set.