TWI820814B - Storage system and drive recovery method thereof - Google Patents
Storage system and drive recovery method thereof
- Publication number
- TWI820814B (application TW111127518A)
- Authority
- TW
- Taiwan
- Prior art keywords
- storage device
- list
- devices
- storage
- determining
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0727—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
- Control Of Multiple Motors (AREA)
- Vehicle Body Suspensions (AREA)
Abstract
Description
The present invention relates to a storage system, and in particular to a storage system and a drive recovery method thereof.
In a network attached storage (NAS) device with multiple drives, a drive occasionally becomes disconnected from the system host. There are many possible causes: the drive itself may be damaged, or the drive backplane or the drive controller chip may have failed. Maintenance personnel often need to obtain the physical device and clone the drive before they can find out why the drive was disconnected from the system, which is time-consuming and labor-intensive.
Generally, if the disconnection is not caused by damage to the drive itself, the user can remove and reinsert the drive so that it re-establishes its connection with the system and its read and write operations return to normal. However, if the user cannot promptly reach the location of the NAS device to restart it or reseat the drive, the disconnected drive remains out of operation for a long time. If the disconnected drive belongs to an established redundant array of independent disks (RAID), the RAID becomes degraded and is at high risk of data loss.
In view of this, embodiments of the present invention provide a storage system and a drive recovery method thereof that can solve the above technical problem.
The drive recovery method of the storage system according to an embodiment of the present invention includes, but is not limited to, the following steps. In response to a failure event of a storage device, whether the storage device is present is determined. In response to determining that the storage device is present, the storage device is added to a pending-recovery device list. Whether the number of devices in a recovery-execution device list is less than or equal to a threshold is determined. In response to determining that the number of devices in the recovery-execution device list is less than or equal to the threshold, a power cycle operation is immediately performed on the storage device.
The storage system according to an embodiment of the present invention includes at least one storage device and a processor. The processor is connected to the storage device and is configured to perform the following steps. In response to a failure event of a storage device, whether the storage device is present is determined. In response to determining that the storage device is present, the storage device is added to a pending-recovery device list. Whether the number of devices in a recovery-execution device list is less than or equal to a threshold is determined. In response to determining that the number of devices in the recovery-execution device list is less than or equal to the threshold, a power cycle operation is immediately performed on the storage device.
Based on the above, in embodiments of the present invention, when a storage device fails and cannot operate normally, it is added to the pending-recovery device list. When the number of devices recorded in the recovery-execution device list is less than or equal to the threshold, a power cycle operation is automatically and immediately performed on the storage device in an attempt to restore it to normal operation. In this way, a disconnected storage device can be restored to normal operation as soon as possible, and the risk of data loss is reduced.
To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. Where the same reference numeral appears in different drawings, it denotes the same or a similar element. These embodiments are only a part of the present invention and do not disclose all of its possible implementations. More precisely, they are merely examples of the method and system within the scope of the claims of the present invention.
FIG. 1 is a block diagram of a storage system according to an embodiment of the present invention. Referring to FIG. 1, the storage system 10 includes at least one storage device 110_1~110_N, a processor 120, and a memory 130. In some embodiments, the storage system 10 may be implemented as a network attached storage (NAS) device or another kind of network server.
The storage devices 110_1~110_N are, for example, solid state drives (SSDs) or hard disk drives (HDDs); the present invention is not limited in this respect, nor does it limit the number of storage devices 110_1~110_N. In some embodiments, some or all of the storage devices 110_1~110_N may form a redundant array of independent disks (RAID). Note that the storage system 10 further includes a plurality of device bays, each provided with an electrical slot. The electrical slots are, for example, SATA slots or U.2 PCIe slots, but the present invention is not limited thereto. The storage devices 110_1~110_N are adapted to be placed in the device bays so that each can be plugged into the corresponding electrical slot and thereby connected to the processor 120.
The processor 120 may be a central processing unit (CPU), a general-purpose processor, another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), another similar element, or a combination of the above.
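The memory 130 is used to store data such as instructions, program code, and software modules. It may be, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, another similar device or integrated circuit, or a combination thereof.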
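The processor 120 can access and execute the software modules recorded in the memory 130 to implement the drive recovery method of the embodiments of the present invention. The term software module is to be interpreted broadly to mean instructions, instruction sets, code, program code, programs, applications, software packages, threads, procedures, functions, and the like, regardless of whether it is referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.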
FIG. 2 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention. Referring to FIG. 2, the method of this embodiment can be executed by the storage system 10 of FIG. 1, and the details of each step of FIG. 2 are described below with reference to the elements shown in FIG. 1. To make the concept of the present invention easier to understand, the following description takes a failure of the storage device 110_1 as an example.
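In step S202, the processor 120 detects a failure event of the storage device 110_1. Specifically, in some embodiments, when the processor 120 cannot detect or cannot identify the storage device 110_1, a failure event of the storage device 110_1 has occurred, and the processor 120 detects that failure event. Put another way, in some embodiments, when the connection between the processor 120 and the storage device 110_1 is lost, a failure event of the storage device 110_1 has occurred, and the processor 120 detects that failure event.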
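In step S204, in response to the failure event of the storage device 110_1, the processor 120 determines whether the storage device 110_1 is present. In other words, the processor 120 determines whether the storage device 110_1 is still plugged into its electrical slot. If the storage device 110_1 has been pulled out of the electrical slot, the processor 120 determines that the storage device 110_1 is not present. Conversely, if the storage device 110_1 is plugged into the electrical slot, the processor 120 determines that the storage device 110_1 is present.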
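If the determination in step S204 is yes, then in step S206, in response to determining that the storage device 110_1 is present, the processor 120 adds the storage device 110_1 to a pending-recovery device list. That is, when the storage device 110_1 that experienced the failure event is still plugged into the electrical slot, the processor 120 adds it to the pending-recovery device list. In other words, every storage device recorded in the pending-recovery device list has experienced a failure event and is still plugged into its electrical slot.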
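Next, in step S208, the processor 120 determines whether the number of devices in the recovery-execution device list is less than or equal to a threshold. In detail, the storage devices recorded in the recovery-execution device list come from the pending-recovery device list, and all of them are currently undergoing a recovery operation, which may include a power cycle operation and a data rebuild operation. The processor 120 counts the storage devices recorded in the recovery-execution device list and determines whether this device count is less than or equal to the threshold.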
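In some embodiments, the storage system 10 includes M device bays, and the storage device 110_1 is plugged into one of the M device bays. The threshold that is compared with the number of devices recorded in the recovery-execution device list may be the number of device bays M multiplied by a preset ratio. The preset ratio may be one half or another ratio, so the threshold is a value less than or equal to M. For example, if the preset ratio is one half, the threshold is M/2.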
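If the determination in step S208 is yes, then in step S210, in response to determining that the number of devices in the recovery-execution device list is less than or equal to the threshold, the processor 120 immediately performs a power cycle operation on the storage device 110_1. If the determination in step S208 is no, then in step S212, in response to determining that the number of devices in the recovery-execution device list is not less than or equal to the threshold, the processor 120 performs the power cycle operation on the storage device 110_1 after waiting for an elapsed time. For example, if the storage system 10 includes 4 device bays and the preset ratio is one half, the threshold is 4/2 = 2, and the processor 120 can power cycle at most 2 storage devices at the same time. As another example, if the storage system 10 includes 5 device bays and the preset ratio is one half, the threshold is 5/2 = 2.5, and the processor 120 can likewise power cycle at most 2 storage devices at the same time.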
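The threshold arithmetic in the two examples above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names `recovery_threshold` and `may_power_cycle_now` are hypothetical, and exact fractions are used only so that 5/2 stays 2.5 without rounding.

```python
from fractions import Fraction


def recovery_threshold(bay_count: int, ratio: Fraction = Fraction(1, 2)) -> Fraction:
    """Threshold = number of device bays M multiplied by a preset ratio."""
    return bay_count * ratio


def may_power_cycle_now(executing_count: int, threshold: Fraction) -> bool:
    """Step S208 comparison: is the recovery-execution list size <= threshold?"""
    return executing_count <= threshold
```

With 5 bays and a ratio of one half, the threshold is 5/2, so a recovery-execution list holding 2 devices passes the comparison and a list holding 3 devices does not.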
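That is, when the number of storage devices in the recovery-execution device list that are undergoing a recovery operation is less than or equal to the threshold, the processor 120 immediately performs the power cycle operation on the storage device 110_1 so that the storage device 110_1 begins its recovery operation. During the power cycle operation, the storage device 110_1 is first powered off and then powered on again. Conversely, when the number of storage devices in the recovery-execution device list that are undergoing a recovery operation is not less than or equal to the threshold, the processor 120 defers the power cycle operation so that the storage device 110_1 begins its recovery operation only after waiting for an elapsed time. This prevents too many storage devices from being power cycled at the same time, which would overload the power supply of the storage system 10.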
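Steps S204 to S212 can be summarized in a short sketch. The names `RecoveryState` and `handle_failure` and the returned action strings are hypothetical, chosen only for illustration; the sketch decides which action applies and leaves the actual power cycling to the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveryState:
    threshold: float
    pending: List[str] = field(default_factory=list)    # pending-recovery device list
    executing: List[str] = field(default_factory=list)  # recovery-execution device list


def handle_failure(state: RecoveryState, device: str, present: bool) -> str:
    """Choose the action for a device that raised a failure event."""
    if not present:
        # S204 answered "no": the drive was pulled out of its slot.
        return "ignore"
    state.pending.append(device)                   # S206
    if len(state.executing) <= state.threshold:    # S208
        return "power_cycle_now"                   # S210
    return "power_cycle_after_wait"                # S212
```

A device that is no longer plugged in is never added to the pending-recovery list, which matches the statement that every device in that list is still in its slot.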
Other embodiments are described below to illustrate other aspects of the present invention. For clarity, the following embodiments continue to use a failure event of the storage device 110_1 as the example.
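FIG. 3 is a block diagram of a storage system according to an embodiment of the present invention. Referring to FIG. 3, the storage system 10 further includes a GPIO (General Purpose Input/Output) interface 140, a power supply device 150, and a switch device 160. The storage device 110_1 in the storage system 10 may be connected to the processor 120 through the GPIO interface 140, and the processor 120 can use a GPIO pin of the GPIO interface 140 to detect whether the storage device 110_1 is plugged into its electrical slot. In some embodiments, when the GPIO pin used to detect the connection state is at a high level, the processor 120 determines that the storage device 110_1 is plugged into the electrical slot. When that GPIO pin is at a low level, the processor 120 determines that the storage device 110_1 has been pulled out of the electrical slot.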
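In addition, the switch device 160 is connected between the storage device 110_1 and the power supply device 150, and the switch device 160 is connected to the processor 120 through the GPIO interface 140. The processor 120 can turn the switch device 160 on or off to control whether the power output by the power supply device 150 is supplied to the storage device 110_1. The switch device 160 is, for example, an electronic fuse (eFuse IC). In some embodiments, the processor 120 provides a switch control signal to the switch device 160 through a GPIO pin of the GPIO interface 140. When the GPIO pin used to control the power supply is at a high level, the switch device 160 is turned on and supplies power to the storage device 110_1. When that GPIO pin is at a low level, the switch device 160 is turned off and stops supplying power to the storage device 110_1. Although FIG. 3 uses only the storage device 110_1 as an example, the other storage devices 110_2~110_N can be connected to the processor 120 with a similar hardware configuration.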
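The presence check and the power cycle through the switch device can be sketched with a simulated GPIO. Everything here is an assumption made for illustration: `SimulatedGpio` is an in-memory stand-in, not a real driver API, and the pin numbers and the `off_seconds` delay are hypothetical.

```python
import time


class SimulatedGpio:
    """Hypothetical in-memory stand-in for GPIO pins (not a real driver)."""

    def __init__(self):
        self.levels = {}    # pin number -> 0 (low) or 1 (high)
        self.history = []   # every write, in order, for inspection

    def read(self, pin: int) -> int:
        return self.levels.get(pin, 0)

    def write(self, pin: int, level: int) -> None:
        self.levels[pin] = level
        self.history.append((pin, level))


def is_present(gpio: SimulatedGpio, presence_pin: int) -> bool:
    """High level on the presence pin means the drive is plugged in."""
    return gpio.read(presence_pin) == 1


def power_cycle(gpio: SimulatedGpio, power_pin: int, off_seconds: float = 0.0) -> None:
    """Power the drive off, wait, then power it on again."""
    gpio.write(power_pin, 0)   # low level: switch off, power is cut
    time.sleep(off_seconds)    # assumed settle time; the source gives no value
    gpio.write(power_pin, 1)   # high level: switch on, power is restored
```

On real hardware the two functions would wrap whatever GPIO access the platform provides; the order of the two writes is the part taken from the description.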
FIG. 4 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention. Referring to FIG. 4, the method of this embodiment can be executed by the storage system 10 of FIG. 3, and the details of each step of FIG. 4 are described below with reference to the elements shown in FIG. 3.
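In step S402, the processor 120 detects a failure event of the storage device 110_1. In step S404, upon detecting the failure event, the processor 120 determines whether the electrical slot of the storage device 110_1 supports a power control function. In detail, as shown in FIG. 3, if the electrical slot of the storage device 110_1 is connected to the power supply device 150 through the switch device 160, which the processor 120 can control, the electrical slot supports the power control function. That is, the processor 120 can turn the switch device 160 on or off to control whether the storage device 110_1 is powered, and is therefore able to make the storage device 110_1 undergo a power cycle operation.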
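If the determination in step S404 is yes, then in step S406 the processor 120 determines whether the electrical slot of the storage device 110_1 supports a presence detection function. In detail, as shown in FIG. 3, if the electrical slot of the storage device 110_1 is connected to the processor 120 through a GPIO pin that is used to detect whether the storage device 110_1 is plugged into the electrical slot, the electrical slot supports the presence detection function.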
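If the determination in step S406 is yes, then in step S408 the processor 120 determines whether the storage device 110_1 is present. The detailed operation of step S408 is described in the foregoing embodiment and is not repeated here. If the determination in step S408 is yes, then in step S410 the processor 120 determines whether the number of power cycle operations performed on the storage device 110_1 exceeds a count threshold. For example, the processor 120 may determine whether the storage device 110_1 has been power cycled more than 5 times within one day. The count threshold can be designed according to the actual application, and the present invention is not limited in this respect.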
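In detail, if the processor 120 has power cycled the storage device 110_1 too many times within a unit of time, the storage device 110_1 itself may already be damaged, in which case repeating the power cycle operation cannot restore it to normal operation. Accordingly, if the determination in step S410 is yes, the processor 120 may give up recovering the storage device 110_1.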
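If the determination in step S410 is no, then in step S412, in response to determining that the number of power cycle operations performed on the storage device 110_1 does not exceed the count threshold, the processor 120 adds the storage device 110_1 to the pending-recovery device list. In step S414, the processor 120 determines whether the number of devices in the recovery-execution device list is less than or equal to the threshold. If the determination in step S414 is yes, then in step S416 the processor 120 immediately performs a power cycle operation on the storage device 110_1. If the determination in step S414 is no, then in step S418 the processor 120 performs the power cycle operation on the storage device 110_1 after waiting for an elapsed time. The detailed operation of steps S412 to S418 is described in the embodiment of FIG. 2 and is not repeated here.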
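The gating checks of steps S404 to S410 can be sketched as a single predicate. The `SlotInfo` fields and the function name are hypothetical. The description only spells out the "yes" branches of S404, S406, and S408, so treating a "no" at any of them as "do not attempt recovery" is an assumption of this sketch; the default of 5 comes from the per-day example in the text.

```python
from dataclasses import dataclass


@dataclass
class SlotInfo:
    supports_power_control: bool     # S404: slot is fed through a controllable switch
    supports_presence_detect: bool   # S406: slot has a presence-detect GPIO pin
    device_present: bool             # S408: drive is still plugged in
    power_cycles_today: int          # power cycles already performed in the current day


def eligible_for_recovery(slot: SlotInfo, max_cycles: int = 5) -> bool:
    """Return True if the device may be added to the pending-recovery list (S412)."""
    if not slot.supports_power_control:
        return False   # assumed handling of a "no" at S404
    if not slot.supports_presence_detect:
        return False   # assumed handling of a "no" at S406
    if not slot.device_present:
        return False   # assumed handling of a "no" at S408
    if slot.power_cycles_today > max_cycles:
        return False   # S410 "yes": give up recovering the device
    return True
```

A device that has been power cycled exactly 5 times still qualifies, because the check is "exceeds" rather than "reaches".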
FIG. 5 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention. Referring to FIG. 5, the method of this embodiment can be executed by the storage system 10 of FIG. 1, and the details of each step of FIG. 5 are described below with reference to the elements shown in FIG. 1.
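In step S502, the processor 120 detects a failure event of the storage device 110_1. In some embodiments, the processor 120 determines that a failure event of the storage device 110_1 has been detected when it does not receive a reply message from the storage device 110_1. For example, if the processor 120 and the storage device 110_1 communicate using the SATA protocol, the processor 120 may determine that a failure event has been detected when it cannot receive the frame D2h sent by the storage device 110_1, but the present invention is not limited thereto.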
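In some embodiments, the processor 120 starts detecting failure events of the storage devices 110_1~110_N after they have been initialized and installed. In some embodiments, when the storage devices 110_1~110_N are first installed, the processor 120 initializes them and records and saves their data configuration information. The data configuration information may include the RAID level and the disk array to which each storage device 110_1~110_N belongs. In detail, the storage devices 110_1~110_N may be assigned as needed to one or more disk arrays (also called disk array groups), and these disk arrays may correspond to different RAID levels, such as RAID-5 or RAID-6.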
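In step S504, the processor 120 determines whether the storage device 110_1 is present. If the determination in step S504 is yes, then in step S506 the processor 120 adds the storage device 110_1 to the pending-recovery device list. In step S508, the processor 120 determines whether the number of devices in the recovery-execution device list is less than or equal to the threshold. The detailed operation of steps S504 to S508 is described in the foregoing embodiments and is not repeated here.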
Notably, in some embodiments, the threshold that is compared with the number of devices recorded in the recovery-execution device list may be configured according to the RAID level to which the storage device 110_1 belongs. In detail, when the storage device 110_1 corresponds to a first RAID level, the threshold is a first value; when it corresponds to a second RAID level, the threshold is a second value that differs from the first value. Specifically, for a RAID level that tolerates the failure of more drives, the threshold may be set to the lower first value; for a RAID level that tolerates the failure of fewer drives, the threshold may be set to the higher second value. For example, when the storage device 110_1 belongs to a RAID-6 array, the threshold may be the lower first value, which keeps the power load on the storage system 10 as low as possible while the array is less likely to be degraded. When the storage device 110_1 belongs to a RAID-5 array, the threshold may be the higher second value, so that the storage device 110_1 can recover as soon as possible while the array is more likely to be degraded.
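Selecting the threshold by RAID level can be sketched as a lookup. The numeric values below are placeholders invented for illustration; the description only requires that the value for the more fault-tolerant level (RAID-6) be lower than the value for the less fault-tolerant level (RAID-5). The names `THRESHOLD_BY_RAID_LEVEL` and `threshold_for` and the fallback default are likewise hypothetical.

```python
# Placeholder values: only their ordering (RAID-6 lower than RAID-5) comes from the text.
THRESHOLD_BY_RAID_LEVEL = {
    "RAID-6": 1,   # tolerates more failed drives, so recovery can be throttled harder
    "RAID-5": 2,   # tolerates fewer failed drives, so recovery should start sooner
}


def threshold_for(raid_level: str, default: int = 2) -> int:
    """Return the recovery-execution threshold for a device's RAID level."""
    return THRESHOLD_BY_RAID_LEVEL.get(raid_level, default)
```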
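If the determination in step S508 is yes, then in step S510, in response to determining that the number of devices in the recovery-execution device list is less than or equal to the threshold, the processor 120 removes the storage device 110_1 from the pending-recovery device list and adds it to the recovery-execution device list. That is, the storage device 110_1 is moved from the pending-recovery device list to the recovery-execution device list. In step S512, in response to the storage device 110_1 being moved to the recovery-execution device list, the processor 120 immediately performs a power cycle operation on the storage device 110_1, for example by controlling the switch device 160 shown in FIG. 3 to power the storage device 110_1 off and then on.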
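If the determination in step S508 is no, then in step S522, in response to determining that the number of devices in the recovery-execution device list is not less than or equal to the threshold, the processor 120 performs the power cycle operation on the storage device 110_1 after waiting for an elapsed time. In some embodiments, after waiting for the elapsed time, the processor 120 performs the power cycle operation on the storage device 110_1 only when the number of devices in the recovery-execution device list has changed from not less than or equal to the threshold to less than or equal to the threshold. In some embodiments, a storage device recorded in the recovery-execution device list is removed from the list after it completes its recovery operation, so the number of devices in the list decreases as the recorded storage devices complete their recovery operations.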
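The interplay between the two lists in steps S508, S510, and S522 can be sketched as a function that is called whenever the recovery-execution list may have shrunk. The function name `promote_pending` is hypothetical, and serving the pending devices in first-in, first-out order is an assumption; the description does not state an order.

```python
from typing import List


def promote_pending(pending: List[str], executing: List[str], threshold: float) -> List[str]:
    """Move devices from the pending list to the executing list while S508 holds.

    Returns the devices that were moved; the caller power cycles each of them (S512).
    """
    started = []
    while pending and len(executing) <= threshold:   # S508
        device = pending.pop(0)                      # S510: leave the pending list
        executing.append(device)                     # S510: join the executing list
        started.append(device)
    return started
```

Calling the function again after a device finishes its recovery and is removed from the recovery-execution list lets the next waiting device start, which is the behavior described for step S522.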
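In step S514, after performing the power cycle operation on the storage device 110_1, the processor 120 determines whether the storage device 110_1 in the recovery-execution device list reconnects within a preset period. The preset period is, for example, 60 seconds, but the present invention is not limited thereto. For example, if the processor 120 and the storage device 110_1 communicate using the SATA protocol, the processor 120 may determine that the storage device 110_1 has reconnected when it receives the frame D2h sent by the storage device 110_1, but the present invention is not limited thereto.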
If the determination in step S514 is no, the storage device 110_1 can no longer be brought back into operation by power cycling. Therefore, in step S524, the processor 120 gives up recovering the storage device 110_1. Then, in step S520, the processor 120 removes the storage device 110_1 from the recovery-execution device list.
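If the determination in step S514 is yes, then in step S516 the processor 120 determines whether the storage device 110_1 belongs to a disk array (RAID). In some embodiments, the processor 120 makes this determination according to the data configuration information recorded when the storage device 110_1 was initialized.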
If the determination in step S516 is no, the storage device 110_1 does not need a disk array data rebuild operation. Then, in step S520, the processor 120 removes the storage device 110_1 from the recovery-execution device list.
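If the determination in step S516 is yes, then in step S518, in response to determining that the storage device 110_1 belongs to a disk array, the processor 120 performs a data rebuild operation on the storage device 110_1 according to the rebuild function supported by the disk array. For example, if the disk array to which the storage device 110_1 belongs supports a fast rebuild function (for example, with the ZFS file system), the processor 120 performs the data rebuild operation through the fast rebuild function. If the disk array supports a general rebuild function, the processor 120 performs the data rebuild operation through the general rebuild function. After the data rebuild operation is completed, in step S520, the processor 120 removes the storage device 110_1 from the recovery-execution device list.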
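Steps S514 to S524 can be sketched as one function that runs after the power cycle. All names are hypothetical: `wait_for_link`, `in_raid`, and `rebuild` are caller-supplied callables standing in for the link check, the lookup in the recorded data configuration information, and the array's rebuild function. The 60-second default comes from the example period in the text.

```python
from typing import Callable, List


def finish_recovery(
    device: str,
    executing: List[str],
    wait_for_link: Callable[[str, float], bool],
    in_raid: Callable[[str], bool],
    rebuild: Callable[[str], None],
    timeout_s: float = 60.0,
) -> str:
    """Complete the recovery of a power-cycled device and report the outcome."""
    try:
        if not wait_for_link(device, timeout_s):   # S514: did it reconnect in time?
            return "gave_up"                       # S524
        if not in_raid(device):                    # S516
            return "reconnected"                   # no rebuild needed
        rebuild(device)                            # S518
        return "rebuilt"
    finally:
        executing.remove(device)                   # S520 runs on every path
```

Putting the removal in a `finally` clause mirrors the flowchart, where every branch ends at step S520, and it also frees the slot in the recovery-execution list if the rebuild callable raises.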
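FIG. 6 is a schematic diagram of an event log of a storage system according to an embodiment of the present invention. Referring to FIG. 6, the processor 120 can record the content, time, and other related information of events that occur in the storage system 10 as a log 61, and the user can view the content of the log 61 through an operating system. When the connection between the processor 120 and the storage device 110_1 is lost, the processor 120 detects a failure event of the storage device 110_1 and records it as log message Msg1 in the log 61. The processor 120 then performs a power cycle operation on the storage device 110_1 so that it reconnects to the processor 120, and records the reconnection event as log message Msg2. Next, the processor 120 performs a data rebuild operation on the reconnected storage device 110_1, as shown by log messages Msg3 and Msg4. Thus, after the storage device 110_1 fails, it can automatically return to normal operation without human intervention, and the recovery process is recorded in the log.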
Note that the processing procedure of the drive recovery method executed by at least one processor is not limited to the examples of the above embodiments. For example, some of the above steps (processes) may be omitted, and the steps may be performed in a different order. Any two or more of the above steps may be combined, and some of the steps may be modified or deleted. Other steps may also be performed in addition to the above steps.
In summary, in the embodiments of the present invention, when a storage device fails and cannot operate normally, it is added to the pending-recovery device list. When the number of devices recorded in the recovery-execution device list is less than or equal to the threshold, the storage device is moved from the pending-recovery device list to the recovery-execution device list, and a power cycle operation is automatically performed on it. In this way, a disconnected storage device can be restored to normal operation as soon as possible, and the risk of data loss is reduced. When the number of devices recorded in the recovery-execution device list is not less than or equal to the threshold, the power cycle operation is performed on the storage device after waiting for an elapsed time. This prevents too many storage devices from being power cycled at the same time, which would overload the power supply of the storage system.
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the technical field may make minor changes and refinements without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.
10: Storage system
110_1~110_N: Storage devices
120: Processor
130: Memory
140: GPIO interface
150: Power supply device
160: Switch device
Msg1~Msg4: Log messages
S202~S212, S402~S418, S502~S524: Steps
FIG. 1 is a block diagram of a storage system according to an embodiment of the present invention.
FIG. 2 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention.
FIG. 3 is a block diagram of a storage system according to an embodiment of the present invention.
FIG. 4 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention.
FIG. 5 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of an event log of a storage system according to an embodiment of the present invention.
S202~S212: Steps
Claims (12)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW111127518A TWI820814B (en) | 2022-07-22 | 2022-07-22 | Storage system and drive recovery method thereof |
CN202211227693.9A CN117472619A (en) | 2022-07-22 | 2022-10-09 | Storage system and its hard disk recovery method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW111127518A TWI820814B (en) | 2022-07-22 | 2022-07-22 | Storage system and drive recovery method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
TWI820814B true TWI820814B (en) | 2023-11-01 |
TW202405655A TW202405655A (en) | 2024-02-01 |
Family
ID=89624390
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW111127518A TWI820814B (en) | 2022-07-22 | 2022-07-22 | Storage system and drive recovery method thereof |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117472619A (en) |
TW (1) | TWI820814B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110078495A1 (en) * | 2006-08-25 | 2011-03-31 | Hitachi, Ltd. | Storage control apparatus and failure recovery method for storage control apparatus |
US8725934B2 (en) * | 2011-12-22 | 2014-05-13 | Fusion-Io, Inc. | Methods and appratuses for atomic storage operations |
TW201423378A (en) * | 2012-09-18 | 2014-06-16 | Mitsubishi Electric Corp | Raid failure self-repair device |
TWI476610B (en) * | 2008-04-29 | 2015-03-11 | Maxiscale Inc | Peer-to-peer redundant file server system and methods |
US20150378858A1 (en) * | 2013-02-28 | 2015-12-31 | Hitachi, Ltd. | Storage system and memory device fault recovery method |
-
2022
- 2022-07-22 TW TW111127518A patent/TWI820814B/en active
- 2022-10-09 CN CN202211227693.9A patent/CN117472619A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
TW202405655A (en) | 2024-02-01 |
CN117472619A (en) | 2024-01-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4723290B2 (en) | Disk array device and control method thereof | |
CN103136012B (en) | Computer system and update method of basic input-output system thereof | |
JP5910444B2 (en) | Information processing apparatus, activation program, and activation method | |
JP6555096B2 (en) | Information processing apparatus and program update control method | |
CN103207816A (en) | Linux system repairing method | |
JP6124644B2 (en) | Information processing apparatus and information processing system | |
US7861112B2 (en) | Storage apparatus and method for controlling the same | |
TWI820814B (en) | Storage system and drive recovery method thereof | |
WO2008076203A1 (en) | Managing storage stability | |
JP5387767B2 (en) | Update technology for running programs | |
CN111427721B (en) | Abnormality recovery method and device | |
TWI547798B (en) | Data storage system and control method thereof | |
CN114385412A (en) | Storage management method, apparatus and computer program product | |
CN111158963A (en) | A server firmware redundant startup method and server | |
CN116048400A (en) | A hardware recovery method, device, equipment and readable storage medium | |
CN101192174A (en) | Recovery processing method and system after disk array device capacity expansion interruption | |
JP2015222454A (en) | Raid failure self-repairing device | |
US8909983B2 (en) | Method of operating a storage device | |
CN113312198B (en) | System and method for monitoring and recovering heterogeneous components | |
JP6398727B2 (en) | Control device, storage device, and control program | |
CN101452333A (en) | Method and computer device capable of handling power supply abnormity | |
JP2014123258A (en) | Disk array system, data recovery method, and data recovery program | |
CN113032182B (en) | Method and device for abnormal recovery of computer system | |
CN110347555B (en) | Hard disk operation state determination method | |
JP2019164578A (en) | Control system, information processing device, control method, raid controller restoration method, and program |