WO2021040810A1 - Device lifetime prediction - Google Patents

Info

Publication number
WO2021040810A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameters
performance measure
target digital
curve
digital device
Application number
PCT/US2020/026164
Other languages
French (fr)
Inventor
Yingxuan Zhu
Bowen Jiang
Yong Wang
Jian Li
Original Assignee
Futurewei Technologies, Inc.
Application filed by Futurewei Technologies, Inc.
Publication of WO2021040810A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/40: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection

Definitions

  • Data centers hosting the network-connected data processing services may employ thousands or even hundreds of thousands of components including: storage devices to hold the large amounts of data, servers to process the data, and telemetry devices to provide results of data processing services to clients. These components need to be replaced and/or maintained to ensure uninterrupted service to data center clients.
  • a storage device such as an SSD or a disk drive fails
  • data in the storage device may be lost, resulting in interruption of operations and financial loss to businesses as well as the loss of personal data, such as pictures, videos, and financial information, to consumers. Failure of storage devices is an issue for both businesses and consumers.
  • the lifetime of a device depends on attributes of the device and on how the device is used. Equivalent devices may have large differences in their actual lifetimes based on their different usage profiles. Maintenance schedules for these devices may use the minimum lifetime estimates to ensure that devices are replaced before they fail. This results in many of these devices being replaced too soon, well before the actual end of their lifetimes.

Summary

  • the examples below describe apparatus and methods for predicting the lifetime of a digital device.
  • the examples combine reference curves, representing a performance measure of devices, with a noise function that represents a difference between the reference curve and an actual performance measure curve for the device.
  • the apparatus and method predict parameters of the noise function to provide predicted noise function values.
  • the predicted noise function values are combined with the reference curve to produce an augmented performance measure curve.
  • the example apparatus and method use the augmented performance measure curve to generate maintenance recommendations for the digital device.
  • a method for predicting a lifetime of a digital device includes using a set of device attributes and usage attributes of the digital device to determine a device profile having an associated reference performance measure curve.
  • the method determines a set of parameters of a noise function where the noise function represents a difference between the reference performance measure curve and a current performance measure curve of the digital device.
  • the method calculates a set of predicted parameters of the noise function and combines the noise function having the set of predicted parameters with the reference performance measure curve to produce an augmented performance measure curve that indicates the future performance measure of the target digital device.
  • the method uses the augmented performance measure curve to determine an expected lifetime of the digital device and/or provide a maintenance recommendation for the digital device.
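
The combination step summarized above can be sketched in a few lines. This is a minimal illustration, not the claimed implementation: it assumes the augmented curve is the pointwise sum of the reference curve and the noise function, and that the expected lifetime is the first time at which the augmented curve reaches a rated limit. The function names, the linear curves, and every number are hypothetical.

```python
from typing import Callable, Optional

Curve = Callable[[float], float]


def augmented_curve(reference: Curve, noise: Curve) -> Curve:
    """Return a curve whose value at time t is reference(t) + noise(t)."""
    return lambda t: reference(t) + noise(t)


def expected_lifetime(curve: Curve, limit: float,
                      horizon: float, step: float = 1.0) -> Optional[float]:
    """Scan forward in time and return the first t at which curve(t) >= limit.

    Returns None when the limit is not reached within the horizon.
    """
    t = 0.0
    while t <= horizon:
        if curve(t) >= limit:
            return t
        t += step
    return None


# Hypothetical reference curve: erase-counts grow linearly with power-on hours.
reference = lambda hours: 0.05 * hours
# Hypothetical predicted noise: this device wears slightly faster than the reference.
noise = lambda hours: 0.01 * hours

curve = augmented_curve(reference, noise)
lifetime = expected_lifetime(curve, limit=3000.0, horizon=100_000.0, step=100.0)
print("expected lifetime (power-on hours):", lifetime)
```

Treating the curves as plain callables keeps the sketch independent of how the reference curve and the noise parameters are actually obtained.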
  • the method includes determining a trend in at least one of the device attributes of the digital device, in the augmented performance measure curve, or in at least one of the predicted parameters.
  • the method determines a second device profile for the digital device, different from the first device profile.
  • the second device profile includes a second reference performance measure curve.
  • the method determines a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to a current measure of the performance measure of the digital device.
  • the method calculates a second set of predicted parameters for the noise function based on the second set of parameters and combines the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve which is used to determine the expected lifetime of the digital device and/or to provide the maintenance recommendation for the digital device.
  • the method includes determining the device profile for the digital device by calculating respective conditional probability values for a plurality of predetermined device profiles based on the device attributes of the digital device and the usage attributes for the digital device to generate respective conditional probability values for the predetermined device profiles and selecting the predetermined device profile having the highest conditional probability value as the determined device profile.
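
One way to picture the profile selection is a small classifier over categorical attributes. The sketch below is an assumption-laden illustration: the text only says that conditional probabilities are calculated and the highest one wins, so the naive-Bayes estimate with add-one smoothing, the profile names, and the attribute names used here are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict


def train(records):
    """Count profiles and attribute values from (profile, attributes) records."""
    profile_counts = Counter()
    value_counts = defaultdict(Counter)  # (profile, attribute) -> value counts
    for profile, attrs in records:
        profile_counts[profile] += 1
        for name, value in attrs.items():
            value_counts[(profile, name)][value] += 1
    return profile_counts, value_counts


def profile_probabilities(model, attrs):
    """Estimate P(profile | attributes) with a naive-Bayes assumption."""
    profile_counts, value_counts = model
    total = sum(profile_counts.values())
    scores = {}
    for profile, n in profile_counts.items():
        score = n / total  # prior P(profile)
        for name, value in attrs.items():
            counts = value_counts[(profile, name)]
            # Add-one smoothing so unseen values do not zero out the score.
            score *= (counts[value] + 1) / (n + len(counts) + 1)
        scores[profile] = score
    norm = sum(scores.values())
    return {p: s / norm for p, s in scores.items()}


def select_profile(model, attrs):
    """Pick the predetermined profile with the highest conditional probability."""
    probs = profile_probabilities(model, attrs)
    return max(probs, key=probs.get)


# Hypothetical historic records: device attribute (grade) and usage attribute (workload).
records = [
    ("write_heavy", {"grade": "enterprise", "workload": "database"}),
    ("write_heavy", {"grade": "enterprise", "workload": "database"}),
    ("write_heavy", {"grade": "consumer", "workload": "database"}),
    ("read_heavy", {"grade": "consumer", "workload": "web"}),
    ("read_heavy", {"grade": "enterprise", "workload": "web"}),
]
model = train(records)
target = {"grade": "enterprise", "workload": "database"}
probabilities = profile_probabilities(model, target)
selected = select_profile(model, target)
print(selected, probabilities)
```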
  • the method includes calculating the set of predicted parameters of the noise function by applying the parameters of the noise function to a long short term memory (LSTM) network.
  • the noise function conforms to a Gaussian function and the set of parameters includes a set of parameters of the Gaussian function.
  • the set of parameters causes the Gaussian function to approximate a difference between the reference performance measure curve and the current performance measure curve of the digital device.
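
As a rough illustration of fitting Gaussian parameters to the gap between two curves, the sketch below uses moment matching followed by a least-squares amplitude. The three-parameter form (amplitude, mean, sigma), the fitting method, and the synthetic data are assumptions made for this example; the text above does not specify them. The moment estimate also assumes the differences are non-negative.

```python
import math


def gaussian(t, amplitude, mean, sigma):
    """Value of a scaled Gaussian bump at time t."""
    return amplitude * math.exp(-((t - mean) ** 2) / (2.0 * sigma ** 2))


def fit_gaussian(times, differences):
    """Estimate (amplitude, mean, sigma) from curve differences.

    Mean and sigma come from weighted moments of the differences; the
    amplitude is the least-squares scale of the resulting unit-height bump.
    """
    total = sum(differences)
    if total <= 0:
        raise ValueError("differences must have a positive sum")
    mean = sum(t * d for t, d in zip(times, differences)) / total
    variance = sum(d * (t - mean) ** 2 for t, d in zip(times, differences)) / total
    if variance <= 0:
        raise ValueError("differences must be spread over more than one time")
    sigma = math.sqrt(variance)
    shape = [gaussian(t, 1.0, mean, sigma) for t in times]
    amplitude = (sum(d * s for d, s in zip(differences, shape))
                 / sum(s * s for s in shape))
    return amplitude, mean, sigma


# Synthetic example: the current curve sits above a linear reference by a bump.
times = [float(t) for t in range(0, 201)]
reference = [0.05 * t for t in times]
current = [r + gaussian(t, 50.0, 100.0, 10.0) for t, r in zip(times, reference)]
differences = [c - r for c, r in zip(current, reference)]

amplitude, mean, sigma = fit_gaussian(times, differences)
print(amplitude, mean, sigma)
```

Moment matching needs no optimizer, which keeps the example dependency-free; a nonlinear least-squares fit would be the usual choice when the differences are noisy or change sign.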
  • the digital device is a solid state drive (SSD) device
  • the reference performance measure curve is a reference erase-count curve
  • the augmented performance measure curve of the digital device is a predicted erase-count curve for the SSD device.
  • the method includes determining the device profile by applying device attributes and usage attributes of the digital device to an encoder-decoder convolutional neural network (CNN) having trained parameters.
  • the determining of the first set of parameters of the noise function includes defining the noise function as a function of the device attributes of the target digital device, the usage attributes of the target digital device, and the set of parameters of the noise function.
  • the defined noise function corresponds to a difference between the current performance measure curve and the reference performance measure curve of the target digital device.
  • an apparatus for predicting a lifetime of a digital device includes a computing device that uses a set of device attributes and usage attributes of the digital device to determine a device profile associated with a reference performance measure curve.
  • the computing device determines a set of parameters of a noise function, wherein the noise function represents a difference between the reference performance measure curve and an actual performance measure curve of the digital device.
  • the computing device calculates a set of predicted parameters of the noise function based on the set of parameters and combines the noise function having the set of predicted parameters with the reference performance measure curve to produce an augmented performance measure curve.
  • the augmented performance measure curve predicts the performance measure of the target digital device.
  • the computing device determines an expected lifetime of the digital device and/or generates a maintenance recommendation for the digital device based on the augmented performance measure curve.
  • the computing device determines a trend in at least one of the device attributes of the digital device, in the augmented performance measure curve, or in at least one of the predicted parameters.
  • the computing device determines a second device profile for the digital device, different from the device profile.
  • the second device profile includes a second reference performance measure curve.
  • the computing device determines a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to a current performance measure of the digital device.
  • the computing device calculates a second set of predicted parameters for the noise function based on the second set of parameters and combines the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve.
  • the computing device determines the expected lifetime of the digital device and/or generates the maintenance recommendation for the digital device based on the second augmented performance measure curve.
  • the computing device determines the device profile for the digital device by calculating respective conditional probability values for a plurality of predetermined device profiles based on the device attributes of the digital device and the usage attributes of the digital device to generate a plurality of conditional probability values for the respective plurality of predetermined device profiles.
  • the computing device selects the predetermined device profile having the highest conditional probability value as the device profile.
  • the computing device calculates the set of predicted parameters of the noise function by applying the parameters of the noise function to a LSTM network.
  • the noise function conforms to a Gaussian function and the set of parameters includes a set of parameters of the Gaussian function. The set of parameters causes the Gaussian function to approximate a difference between the reference performance measure curve and the current performance measure curve of the digital device.
  • the digital device is an SSD device
  • the reference performance measure curve is a reference erase-count curve
  • the augmented performance measure curve of the digital device is a predicted erase-count curve for the SSD device.
  • the computing device is configured to select the device profile by applying the device attributes and the usage attributes of the digital device to an encoder-decoder CNN having trained parameters.
  • the computing device determines the first set of parameters of the noise function by defining the noise function as a function of the device attributes of the target digital device, the usage attributes of the target digital device, and the set of parameters of the noise function.
  • the defined noise function corresponds to a difference between the current performance measure curve and the reference performance measure curve of the target digital device.
  • the computing device calculates the set of predicted parameters of the noise function by applying the parameters of the noise function to a convolutional LSTM network.
  • a non-transitory computer-readable medium includes program instructions which, when executed by a processor, cause the processor to predict a lifetime of a digital device. The program instructions cause the processor to determine a first device profile having an associated reference performance measure curve.
  • the instructions cause the processor to determine a set of parameters of a noise function where the noise function represents a difference between the reference performance measure curve and an actual performance measure curve of the digital device.
  • the instructions further cause the processor to calculate a set of predicted parameters of the noise function based on the set of parameters and to combine the noise function having the set of predicted parameters with the reference performance measure curve to produce an augmented performance measure curve.
  • the augmented performance measure curve predicts the performance measure of the target digital device.
  • the instructions cause the processor to determine an expected lifetime of the digital device and/or to generate the maintenance recommendation for the digital device based on the augmented performance measure curve.
  • the program instructions cause the processor to determine a trend in at least one of the device attributes of the digital device, in the augmented performance measure curve, or in at least one of the predicted parameters. When the trend exceeds a respective threshold value, the instructions cause the processor to determine a second device profile for the digital device, different from the first device profile.
  • the second device profile includes a second reference performance measure curve.
  • the program instructions further cause the processor to determine a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to a current measure of the performance measure of the digital device.
  • the program instructions cause the processor to calculate a second set of predicted parameters for the noise function based on the second set of parameters and to combine the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve.
  • the instructions cause the processor to determine the expected lifetime of the digital device and/or to generate the maintenance recommendation for the digital device based on the second augmented performance measure curve.
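
The threshold test on a trend can be illustrated with a least-squares slope over recent values of one predicted noise parameter. This is only a sketch: the text does not say how a trend is measured, so the slope-based measure, the function names, the sample histories, and the threshold value are all hypothetical.

```python
def slope(values):
    """Least-squares slope of values against their sample index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to measure a trend")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def needs_new_profile(history, threshold):
    """True when the magnitude of the trend exceeds the threshold."""
    return abs(slope(history)) > threshold


# Hypothetical histories of one predicted noise-function parameter.
stable_history = [1.00, 1.01, 0.99, 1.00, 1.01]
drifting_history = [1.0, 1.2, 1.4, 1.6, 1.8]
threshold = 0.1  # hypothetical per-sample change that triggers re-profiling

print(needs_new_profile(stable_history, threshold))
print(needs_new_profile(drifting_history, threshold))
```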
  • FIG. 1 is a block diagram of a computer system including a data center that interfaces with multiple client networks in accordance with an embodiment.
  • FIG. 2 is a block diagram of a shared database or a shared file system in accordance with an embodiment.
  • FIG. 3 is a flow-chart diagram of methods in accordance with embodiments.
  • FIGs. 4A, 4B, 4C, and 4D are reference curves of erase-counts versus power on hours in accordance with embodiments.
  • FIG. 5A is a data flow diagram according to a first method for estimating an erase-count curve in accordance with embodiments.
  • FIGs. 5B and 5C are flow-chart diagrams illustrating the first method for estimating an erase-count curve in accordance with embodiments.
  • FIG. 6 is a graph of erase-count versus power on hours showing computation details of a noise function in accordance with an embodiment.
  • FIG. 7 is a block diagram of a multi-channel Long Short Term Memory (LSTM) network in accordance with an embodiment.
  • FIGs. 8A and 8B are block diagrams of an LSTM cell and an LSTM network in accordance with an embodiment.
  • FIG. 9 is a flow-chart diagram illustrating a second method for estimating an erase-count curve in accordance with embodiments.
  • FIG. 10 is a data diagram showing an encoder-decoder convolutional neural network (CNN) in accordance with an embodiment.
  • FIG. 11 is a block diagram of a convolutional LSTM network in accordance with an embodiment.
  • FIG. 12 is a flow-chart diagram illustrating a method for determining whether to change the reference erase-count curve of a target device.
  • FIG. 13 is a block diagram of a computing system according to an embodiment.
  • the functions, methods, and/or algorithms described herein may be implemented in software in the embodiments.
  • the software may consist of computer-executable instructions stored on computer-readable media or computer-readable storage devices, such as one or more non-transitory memories or other types of hardware-based storage devices, either local or networked.
  • modules which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples.
  • the software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.
  • the functionality can be configured to perform an operation using, for instance, software, hardware, firmware, or the like.
  • the phrase “configured to” can refer to a logic circuit structure of a hardware element that is to implement the associated functionality.
  • the phrase “configured to” can also refer to a logic circuit structure of a hardware element that is to implement the coding design of associated functionality of firmware or software.
  • the term “module” refers to a structural element that can be implemented using any suitable hardware (e.g., a processor, among others), software (e.g., an application, among others), firmware, or any combination of hardware, software, and firmware.
  • logic encompasses any functionality for performing a task.
  • each operation illustrated in the flowcharts corresponds to logic for performing that operation.
  • An operation can be performed using software, hardware, firmware, or the like.
  • the terms “component,” “system,” and the like may refer to computer-related entities, hardware, and software in execution, firmware, or a combination thereof.
  • a component may be a process running on a processor, an object, an executable, a program, a function, a method, a subroutine, a computer, or a combination of software and hardware.
  • processor may refer to a hardware component, such as a processing unit of a computer system and may include multiple single-core processors and/or multi-core processors.
  • the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computing device to implement the disclosed subject matter.
  • article of manufacture as used herein is intended to encompass a computer program accessible from any computer-readable storage device or media.
  • Computer-readable storage media can include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks, compact disks (CDs), digital versatile disks (DVDs), smart cards, and SSDs, among others.
  • the embodiments describe estimating the remaining lifetime of a target SSD device in a data center environment.
  • the same methods may be used to estimate other performance measures of an SSD device or performance measures of a mechanical hard disk drive (HDD) device or other devices (e.g., processor, random access memory (RAM) device, or cooling fan).
  • the described embodiments predict the remaining lifetime of the target SSD device based on a predicted erase-count curve for the target SSD device.
  • Other performance measures may be used to predict the lifetime of other types of devices.
  • reference performance measure curves of the spin-up time or spindle start/stop count of a mechanical HDD may be used to predict its remaining lifetime.
  • the embodiments are not limited to digital devices in data centers.
  • a reference erase-count curve for a consumer SSD may be obtained by sending the device attributes and usage attributes collected locally to a web service and receiving the reference erase-count curve from the web service.
  • Erase-count may be considered to be a key performance index (KPI) of SSD devices, because, for many SSD devices, the failure of the device may be predicted by analyzing its erase-counts.
  • the embodiments described herein describe two example machine-learning-based frameworks to predict an erase-count curve for a target device. The predicted erase-count curve predicts how the erase-counts of a target SSD device will change over the remaining lifetime of the target device and thus provides an estimate of the remaining device lifetime.
  • an example of the first framework uses conditional probabilities to select a reference erase-count curve for a target device and determines a difference between the reference erase-count curve and an actual erase-count curve for the target device.
  • This difference is referred to herein as a noise function.
  • the framework uses one or more LSTM networks to estimate parameters of the noise function.
  • the first framework employs a combination of the reference erase-count curve and the estimated noise function derived from the estimated parameters to define an augmented erase-count curve (e.g., a predicted erase-count curve) for the target device.
  • the first framework uses this augmented erase-count curve to determine whether the target SSD device needs maintenance and/or to determine the expected lifetime of the target SSD device.
  • An example of the second framework uses an encoder-decoder CNN to learn a set of parameters to be used to associate SSD devices with reference erase-count curves and, based on those parameters, to determine a particular reference erase-count curve for a target SSD device.
  • the second framework employs a convolutional LSTM network to estimate the noise function parameters.
  • the second framework combines the reference erase-count curve and the estimated noise function derived from the estimated parameters to define an augmented erase-count curve which is used to determine whether the target device needs maintenance and/or to determine the expected lifetime of the target device.
  • estimating the parameters of the noise function reduces the amount of testing data used for prediction, allowing calculation of a lifetime estimate earlier in the operation of the target SSD device than could be achieved by estimating the lifetime directly based on a predicted erase-count curve.
  • Modeling the erase-count curve as a reference erase-count curve plus a noise function also allows lifetime prediction to be based on a relatively small number of time points, as long as the noise function truly represents the difference between the actual erase-count curve and the reference erase-count curve.
  • the noise function can be predicted and, thus, the erase-count curve for the target device can also be predicted.
  • the first framework models the noise function as a Gaussian function.
  • the second framework uses an encoder-decoder CNN to determine parameters that define the reference erase-count curve based on training data. This framework applies testing data for a target SSD device to the encoder-decoder CNN to determine the reference erase-count curve.
  • the framework determines parameters of a noise function and uses a convolutional LSTM to estimate parameters for the noise function.
  • the second framework combines the noise function having the estimated parameters with the reference erase-count curve to produce an augmented erase-count curve that can be used to estimate the lifetime of the SSD device.
  • An SSD storage device is typically operated in a write-erase-write cycle.
  • the write-erase-write cycle is also referred to as a program-erase cycle (P/E cycle).
  • The number of P/E cycles is a criterion for quantifying the endurance of a device. Generally speaking, each cycle consumes a tiny portion of the lifetime of a device. As the number of cycles increases, the remaining lifetime of a device decreases. Because erasing pairs with writing, erase-count may be considered a KPI in measuring the lifetime of an SSD storage device.
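
The arithmetic behind "each cycle consumes a tiny portion of the lifetime" can be made concrete with a naive linear wear model. This is a baseline for intuition only, not the prediction method of the embodiments: it assumes a fixed rated number of P/E cycles and a future erase rate equal to the average rate so far. The function names and numbers are hypothetical.

```python
def remaining_life_fraction(erase_count, rated_pe_cycles):
    """Fraction of rated endurance left, clamped to the range [0, 1]."""
    if rated_pe_cycles <= 0:
        raise ValueError("rated_pe_cycles must be positive")
    used = min(max(erase_count / rated_pe_cycles, 0.0), 1.0)
    return 1.0 - used


def remaining_hours(erase_count, rated_pe_cycles, power_on_hours):
    """Hours until the rated count is reached at the average erase rate so far.

    Returns None when there is no usage history to derive a rate from.
    """
    if erase_count <= 0 or power_on_hours <= 0:
        return None
    rate = erase_count / power_on_hours  # erase-counts per power-on hour
    return max(rated_pe_cycles - erase_count, 0) / rate


# Hypothetical device: 750 of 3000 rated cycles used after 10,000 power-on hours.
fraction_left = remaining_life_fraction(750, 3000)
hours_left = remaining_hours(750, 3000, 10_000)
print(fraction_left, hours_left)
```

A constant-rate extrapolation like this ignores changes in workload, which is the gap that a reference curve adjusted by a noise function is meant to close.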
  • FIG. 1 is a block diagram of a computing system 100 including a data center 104 that interfaces with multiple client networks 102 in accordance with an embodiment.
  • the data-center environment is one example of the use of the illustrated frameworks. These frameworks may be used to predict expected performance measure curves for myriad devices used in personal or commercial environments.
  • the example data center 104 processes tasks received from the client networks 102 and provides results to the networks 102. These tasks include executing applications such as, without limitation, word processors, spreadsheets, database management systems (DBMSs), and accounting systems.
  • the data center 104 includes a load balancer 106 that distributes the tasks among multiple processing nodes 108A, 108B, 108C, 108D, and 108N.
  • the processing nodes 108A, 108B, 108C, 108D, and 108N are coupled, via a high-speed input/output (I/O) channel 110, to each other, to shared databases 112, and to shared file systems 118 and 120.
  • FIG. 1 shows two shared databases including SSD devices 114A, 114B, and 114C, and SSD devices 116A, 116B, and 116C, respectively.
  • FIG. 2 is a block diagram of a shared database or a shared file system 200 in accordance with an embodiment.
  • the database/file system shown in FIG. 2 includes multiple SSDs 202A, 202B, and 202N coupled to a storage device interface 204.
  • the storage device interface 204 handles I/O operations among the processing nodes 108A, 108B, 108C, 108D, and 108N and the SSDs 202A, 202B, and 202N. These I/O operations use a high-speed I/O interface 206 to pass data to and from the high-speed I/O channel 110, shown in FIG. 1.
  • Storage device interface 204 also couples the SSDs 202A, 202B, and 202N to a storage device management unit 208 that monitors the status (e.g., the expected lifetime) of the SSDs 202A, 202B, and 202N and generates maintenance recommendations as described in the embodiments.
  • Maintenance recommendations include, for example, pre-ordering an SSD device that is predicted to fail at a particular date (e.g., the SSD has a 95 percent probability of failing within seven days). Alternatively, the impending failure of an SSD device may depend on how the device is being used. In this instance, maintenance recommendations may include copying the data from the target device to another device and reassigning the target device to an application that performs fewer write-erase cycles.
  • the predicted failure of the target device may also depend on the environmental conditions in which the device is being used. In this instance, the recommended maintenance may include changing environmental factors, such as providing additional cooling.
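
The kinds of maintenance recommendations listed above can be expressed as simple rules. The sketch below is illustrative only: the 95 percent and seven-day figures come from the example in the text, while the `DeviceStatus` fields, the temperature limit, and the way the rules combine are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    failure_probability_7d: float  # predicted probability of failure within 7 days
    write_heavy_workload: bool     # whether the current application erases often
    temperature_c: float           # current operating temperature


def recommend_maintenance(status, probability_threshold=0.95,
                          max_temperature_c=70.0):
    """Map a predicted device status to a list of recommended actions."""
    actions = []
    if status.failure_probability_7d >= probability_threshold:
        actions.append("pre-order a replacement SSD")
        if status.write_heavy_workload:
            actions.append("copy data to another device and reassign this "
                           "device to an application with fewer write-erase cycles")
    if status.temperature_c > max_temperature_c:
        actions.append("provide additional cooling")
    return actions


at_risk = DeviceStatus(failure_probability_7d=0.97,
                       write_heavy_workload=True, temperature_c=75.0)
healthy = DeviceStatus(failure_probability_7d=0.02,
                       write_heavy_workload=False, temperature_c=40.0)
at_risk_actions = recommend_maintenance(at_risk)
healthy_actions = recommend_maintenance(healthy)
print(at_risk_actions)
print(healthy_actions)
```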
  • the storage device management unit 208 implements the methods 300, 500, 520, 540, 900, and 1200 described below with reference to FIGs. 3, 5A, 5B, 5C, 9 and 12.
  • the storage device management unit 208 may include or be coupled to the LSTM network 720 shown in FIGs. 7, 8A and 8B; the encoder-decoder CNN 1000 shown in FIG. 10; and/or the convolutional LSTM network 1100 shown in FIG. 11.
  • these methods could be implemented in other devices, for example, one or more of the processing nodes 108A, 108B, 108C, 108D, and 108N or in a combination of the storage device management unit 208 and one or more of the processing nodes 108A, 108B, 108C, 108D, and 108N.
  • the networks 720, 1000, and/or 1100 may be implemented in hardware coupled to, and/or in software configured to execute on, the storage device management unit 208 and/or one or more of the processing nodes 108A, 108B, 108C, 108D, and 108N.
  • the lifetime of an SSD device depends on a variety of factors, including device specifications (device attributes, A) and how a device is being used (usage attributes, H).
  • a general solution is to replace a device at a specific time before the end of its life. It is difficult, however, to anticipate a good time to replace a target device. Replacing a device too early will increase costs unnecessarily, while replacing it too late results in the loss of data when the device fails.
  • not every SSD device is used in the same way, so different SSD devices may have different maximum erase-counts and/or different numbers of power-on hours until failure.
  • the embodiments described herein use machine learning methods to predict device lifetimes. Specifically, the methods generate reference erase-count curves using historic data generated from actual and/or simulated SSD devices. The method uses the reference erase-count curves as benchmarks to guide prediction of the remaining lifetime for a target SSD device. In some embodiments, probability methods are applied to identify the most probable reference erase-count curve for a target device based on device attributes, A, and usage attributes, H, of the target device. Embodiments augment the identified erase-count curve with a noise function and use networks to predict parameters of the noise function.
  • Other embodiments train an encoder-decoder CNN on historic erase-count data to learn parameters, W, from the A and H attributes of the historic data. These embodiments then use the parameters W to associate target SSD devices with their respective reference erase-count curves based on the A and H attributes of the target SSD devices. These embodiments use the parameters, W, and the trained encoder-decoder CNN to determine a particular reference erase-count curve for a particular target device. These embodiments employ a convolutional LSTM to predict the parameters of the noise function. In both frameworks, the reference erase-count curve augmented by the noise function is a predicted erase-count curve that is used to predict the remaining lifetime of the SSD device.
  • the described embodiments predict the noise function instead of predicting the erase-count curve or device lifetime directly. This process achieves better results than direct prediction of the erase-count curve because the noise function represents the difference between the reference erase-count curve and the actual erase-count curve and is robust to changes.
  • the example methods predict the erase-count curve of the target SSD device based on the predicted parameters of the noise function at any point in the lifetime of the target SSD device. Thus, the example methods are also advantageous as they can predict the lifetime of an SSD device at an early stage of device usage. The ability to predict lifetimes early reduces the data needed to predict the device lifetime and allows the device to be used over a longer lifetime.
  • FIG. 3 is a flow-chart diagram of a method 300 according to embodiments.
  • the method 300 shown in FIG. 3 is relevant to both of the two frameworks described above.
  • Operation 302 obtains device attributes, A, and usage attributes, H, for a target SSD device.
  • Operation 304 uses these attributes to determine a reference erase-count curve for the target device.
  • operation 306 defines a noise function based on a difference between the reference erase-count curve and the current actual erase-count curve for the target device. Once this noise function is defined, operation 308 predicts the parameters of the noise function to generate a predicted noise function.
  • Operation 310 analyzes trends in the device attributes, actual erase-count curve, and the parameters of the noise function to determine whether the reference erase-count curve for the device is no longer valid and should be changed. When operation 310 determines that the reference erase-count curve should be changed, it transfers control to operation 304, described above, to select a new reference erase-count curve. When operation 310 determines that the current reference erase-count curve is valid, operation 312 uses the noise function to augment the reference erase-count curve and uses the augmented erase-count curve to predict the lifetime of the target SSD device and/or to generate maintenance recommendations for the target SSD device.
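  • The control flow of operations 304 through 312 can be sketched as follows. This is only an illustrative outline under stated assumptions: the function name `predict_augmented_curve`, the three caller-supplied callables, the `rejected` list, and the `max_attempts` guard are hypothetical and do not come from the described embodiments, and the toy callables in the example stand in for the real selection, prediction, and validity steps.

```python
def predict_augmented_curve(attributes, actual_curve, select_reference,
                            predict_noise, reference_is_valid, max_attempts=3):
    """Follow operations 304-312 of FIG. 3 with caller-supplied steps (hypothetical API)."""
    rejected = []  # reference curves already found invalid by operation 310
    for _ in range(max_attempts):
        # Operation 304: choose a reference curve, skipping rejected ones.
        reference = select_reference(attributes, rejected)
        # Operation 306: the noise observed so far is actual minus reference.
        observed = [a - r for a, r in zip(actual_curve, reference)]
        # Operation 308: predict the noise over the whole reference time grid.
        noise = predict_noise(observed, len(reference))
        # Operation 310: keep the reference only if it still fits the device.
        if reference_is_valid(reference, actual_curve, noise):
            # Operation 312: augmented curve y' = y + E.
            return [r + e for r, e in zip(reference, noise)]
        rejected.append(reference)
    raise RuntimeError("no valid reference curve found")


# Toy stand-ins (assumptions for illustration only).
candidates = [[0, 50, 100, 150], [0, 10, 20, 30]]


def select_reference(attributes, rejected):
    return next(c for c in candidates if c not in rejected)


def predict_noise(observed, length):
    # Toy predictor: extend the last observed step linearly.
    step = observed[-1] - observed[-2]
    extra = [observed[-1] + step * (i + 1) for i in range(length - len(observed))]
    return observed + extra


def reference_is_valid(reference, actual_curve, noise):
    # Toy rule: reject a reference when the noise is large relative to it.
    return all(abs(e) <= 0.5 * max(r, 1) for r, e in zip(reference, noise))


augmented = predict_augmented_curve({}, [0, 12, 24], select_reference,
                                    predict_noise, reference_is_valid)
```

  • In this toy run the first candidate is rejected and the second is accepted, which mirrors the transfer of control from operation 310 back to operation 304.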
  • An embodiment generates historical data for multiple devices 202A, 202B, and 202N (shown in FIG. 2) by operating the devices or simulating the operation of the devices over their lifetimes to generate respective historic erase-count curves for the devices 202A, 202B, and 202N.
  • Operation 302 of the method 300 obtains device attributes, A, and usage attributes, H, for each of the devices 202A, 202B, and 202N monitored by the storage device management unit 208 or for each of the devices monitored by all of the storage management units 208 in the data center 104.
  • Example device attributes, A, include values of the Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes for the particular SSD devices 202A, 202B, and 202N.
  • These device attributes include the model number of the device as well as information about how the device was manufactured (e.g., attributes such as built-in error correction that differentiate a consumer-grade device from an enterprise-grade device) and performance attributes (e.g., read-error rate, remaining space, and maximum erase-count).
  • the usage attributes, H, describe how the device is used and include factors such as, without limitation, the particular applications with which the SSD devices are used, the number of users using each of the SSD devices, and the identity of the client using the SSD device.
  • the usage attributes include time related attributes such as the hour/day/month/year at which the device is used and the duty cycle and/or duration of such use.
  • the usage attributes include environmental attributes (e.g., temperature, humidity, and vibration).
  • y’ = y + E, where:
  • y is the known reference erase-count curve of a target device,
  • y’ is a predicted erase-count curve (e.g., the augmented erase-count curve) used to guide device maintenance, and
  • E is a noise function.
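  • As a minimal sketch of how the relation between y, E, and y’ might be applied, the snippet below adds a noise curve to a reference curve point by point and reports the first time at which the augmented curve reaches a maximum erase-count. The function names, the sample values, and the rule "end of life is the first crossing of the maximum erase-count" are assumptions made for illustration, not details taken from the embodiments.

```python
def augment_curve(reference, noise):
    """Return y' = y + E, point by point."""
    if len(reference) != len(noise):
        raise ValueError("reference and noise must cover the same time points")
    return [y + e for y, e in zip(reference, noise)]


def first_time_at_limit(times, curve, max_erase_count):
    """Return the first time at which the curve reaches the limit, or None."""
    for t, count in zip(times, curve):
        if count >= max_erase_count:
            return t
    return None


# Made-up sample data: power-on hours, reference curve y, and noise E.
times = [0, 500, 1000, 1500, 2000]
reference = [0, 20000, 50000, 80000, 100000]
noise = [0, 1000, 3000, 5000, 6000]

augmented = augment_curve(reference, noise)            # y'
end_of_life = first_time_at_limit(times, augmented, 85000)
```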
  • the first method selects the reference erase-count curve, y, based on probability.
  • This method models the noise function, E, as a Gaussian function and uses an LSTM to predict the parameters of the noise function, E.
  • the second method uses a trained encoder-decoder CNN to determine the reference erase-count curve, y, and uses a convolutional LSTM to predict the parameters of the noise function, E.
  • the second method does not assume that the noise function, E, can be modeled as a Gaussian function or as any specific function. It uses the encoder-decoder CNN to determine the reference erase-count curve, y, for a target SSD device based on the A and H attributes of the device and a set of learned parameters, W.
  • the parameters W are generated by the encoder-decoder CNN based on the historical training data.
  • the second method defines a set of parameters, W’, that represent a difference between the current erase-count curve of the target device and the reference erase-count curve.
  • the parameters, W’, which correspond to the noise function, are estimated using a convolutional LSTM and the testing data to determine the augmented erase-count curve.
  • FIGs. 4A, 4B, 4C, and 4D show example graphs of erase-count versus power-on time, where the y-axis indicates erase-counts and the x-axis indicates power-on time. Each of these graphs shows a different erase-count curve relative to power-on time. In each graph, as time goes by, erase-counts increase at varying rates depending on characteristics of the device and on how the device is used. Near the end of the SSD lifetime, the increase of erase-counts slows down. These graphs may be generated based on the performance of actual or simulated SSD devices.
  • the graphs shown in FIGs. 4A, 4B, 4C, and 4D are synthetic curves generated from simulations of SSD devices.
  • the different erase-count curves reflect differences in the construction and configuration of the devices (device attributes, A) and in how devices are used (usage attributes, H).
  • the erase-count curves of university users are different from those of bank users. Bank users tend to erase and write SSD devices frequently during business hours and rarely after business hours. On the other hand, students in universities may run computers overnight to get experiment results, such that the erasing and writing of SSD devices by university students will not follow a workday schedule.
  • the curve 402 in FIG. 4A shows the erase-count curve for an SSD device that is accessed heavily over a relatively short term, after having been powered on for about 900 hours.
  • the device is accessed about 100,000 times over the next 1000 powered-on hours and approaches its maximum number of erase cycles at about 2000 powered-on hours.
  • This curve may correspond, for example, to a usage profile of a television network streaming media server that is continually writing new program material to the SSD device.
  • Curve 404 shown in FIG. 4B shows an erase-count curve that exhibits different rates of erase-counts at different times. This curve may correspond to a usage profile of an enterprise data processing system which has a greater number of erase cycles at the end of each three-month interval, for example, to accommodate quarterly financial statement processing. After the financial statement processing is complete, the number of erase cycles decreases for the remainder of the quarter.
  • FIG. 5A is a data flow diagram of an example method 500, described in more detail below with reference to the methods 520 of FIG. 5B and 540 of FIG. 5C.
  • the system shown in FIG. 5A includes a historical data store 502, a calculate module 504, a reference curve probabilities store 506, a source of testing data 507, a device profiles store 508, a reference curve store 510, and a prediction module 512.
  • the reference curves are generated before processing any target device.
  • An example method for generating the reference curves and the device profile probabilities is described below with reference to FIG. 5B and an example method for estimating the noise function parameters and using the estimated noise function parameters to predict the lifetime of an SSD device is described below with reference to FIG. 5C.
  • FIGs. 5B and 5C are flow-chart diagrams showing methods 520 and 540 according to the first framework that implement the data flow shown in FIG. 5A.
  • Method 520 generates a set of reference erase-count curves and method 540 selects one of the generated reference erase-count curves, predicts parameters of a noise function, and combines the reference erase-count curve with the predicted noise function to produce the augmented erase-count curve.
  • Operation 522 of the method 520 obtains historical data including the attributes A and H, and erase-count curves for multiple real and/or simulated SSD devices. These data are stored in the historical data store 502. The combination of the historical erase counts and the historical values of A and H for a particular device are referred to as a historical data set.
  • Operation 524 clusters the historical data sets for the multiple SSD devices. Any number of clustering algorithms may be used to cluster the data sets, including K-means clustering, mean-shift clustering, and Gaussian mixture model (GMM) clustering. Each cluster has a device profile that is a combination of A and H such that similar devices having similar usage profiles are clustered together.
  • operation 526 generates a respective reference erase-count curve to represent each cluster. These erase-count curves are stored in the reference curve store 510. In embodiments, operation 526 averages the historical erase-count curves of all of the historical data sets in the cluster to generate the reference erase-count curve for the cluster. Operation 526 stores the reference erase-count curves and an indication of the number of historical devices corresponding to each reference erase-count curve in the reference curve store 510. Each reference erase-count curve is associated with a respective device profile stored in the device profile store 508. The device profile also includes the attributes A and H.
  • Operation 528 determines probabilities relative to each device profile.
  • the probabilities may be determined from the relative numbers of devices in each cluster. These probabilities may include conditional probabilities relative to the A and H attributes of the devices as described below.
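  • Operations 526 and 528 can be illustrated with a short sketch that averages the historical curves in each cluster and derives a probability for each device profile from the relative cluster sizes. The cluster labels are taken as given here (any of the clustering algorithms named above could produce them); the function names, the labels "D1" and "D2", and the sample curves are hypothetical, and all curves are assumed to share one time grid.

```python
from collections import defaultdict


def build_reference_curves(curves, labels):
    """Average the curves in each cluster and count the devices per cluster."""
    grouped = defaultdict(list)
    for curve, label in zip(curves, labels):
        grouped[label].append(curve)
    references = {}
    counts = {}
    for label, members in grouped.items():
        counts[label] = len(members)
        # Pointwise mean over all member curves (same time grid assumed).
        references[label] = [sum(col) / len(members) for col in zip(*members)]
    return references, counts


def profile_probabilities(counts):
    """Probability of each profile from the relative number of devices."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Made-up historical erase-count curves and their cluster labels.
curves = [[0, 10, 20], [0, 20, 40], [0, 100, 300], [0, 30, 60]]
labels = ["D1", "D1", "D2", "D1"]

references, counts = build_reference_curves(curves, labels)
priors = profile_probabilities(counts)
```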
  • Operation 542 of method 540, performed by the calculate module 504, selects a device profile from the device profile store 508 and, thus, a reference erase-count curve for a target SSD device from the reference curve store 510. The device profile is selected using conditional probabilities based on the device attributes and usage attributes obtained from the testing data source 507.
  • Operation 542 uses current testing data, including current values of A and H, to calculate the conditional probability that a target SSD device corresponds to a particular drive profile.
  • Let D represent the drive profiles and let A represent the device attributes (e.g., attributes that are set once a drive is made), where m and n are the numbers of drive profiles and device attributes, respectively. Each attribute has a pool of options, such as 0 and 1.
  • H represents the usage attributes (e.g., the attributes that depend on how the devices are being used and can be changed by users).
  • Device attributes, A, affect usage attributes, H, because users tend to use a device according to its specifications.
  • the probability of a particular set of device attributes, A, can be estimated based on a set of usage attributes, H. Consequently, there are causal relationships among attributes A and H, and the drive profiles.
  • conditional probabilities may be used to select a drive profile for a target SSD device based on the device attributes and usage attributes for the target SSD device.
  • the conditional probability of a drive profile given the known device attributes can be calculated using the historical data, as described above with reference to FIG. 5A.
  • Conditional probabilities applied to the testing data may be used to select the most probable drive profile for a target SSD device.
  • the conditional probability of a set of device attributes given the drive profile, p(A|D), is known.
  • Drive profiles and device attributes can indicate the probability of usage attributes. That is to say, given D and A, the probability p(H|D, A) is known.
  • the conditional probability values may be calculated using either or both of equations (1a) and (1b).
  • Embodiments determine the probabilities of p(H|D, A), p(A|D), p(A|D, H), p(H|D), and p(D) using both the training data and the historical data.
  • Operation 542 of the example method 540 determines a drive profile for a target SSD device as the drive profile having the largest value of posterior probability based on the attributes A and H for the target device.
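  • One way to picture the selection of the most probable drive profile is through Bayes' rule, under which the posterior of a profile is proportional to p(H|D, A) · p(A|D) · p(D). The sketch below is written under that assumption; the exact form of the equations used by the embodiments is not reproduced in this text, and the function name, the dictionary layout, and the sample probabilities are hypothetical.

```python
def select_drive_profile(priors, p_a_given_d, p_h_given_da, a, h):
    """Return (best_profile, posteriors) for observed attributes a and h."""
    scores = {}
    for d, p_d in priors.items():
        # Unnormalised posterior: p(H | D, A) * p(A | D) * p(D).
        likelihood_h = p_h_given_da[d].get((a, h), 0.0)
        likelihood_a = p_a_given_d[d].get(a, 0.0)
        scores[d] = likelihood_h * likelihood_a * p_d
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no drive profile is consistent with the observed attributes")
    posteriors = {d: s / total for d, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Made-up probabilities for two profiles.
priors = {"bank": 0.5, "university": 0.5}
p_a_given_d = {
    "bank": {"enterprise": 0.8, "consumer": 0.2},
    "university": {"enterprise": 0.5, "consumer": 0.5},
}
p_h_given_da = {
    "bank": {("enterprise", "overnight"): 0.1},
    "university": {("enterprise", "overnight"): 0.6},
}

best, posteriors = select_drive_profile(
    priors, p_a_given_d, p_h_given_da, "enterprise", "overnight")
```

  • Normalising by the sum of the scores is optional when only the largest posterior is needed, since the normalising constant is the same for every profile.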
  • a data center (user) can define H according to the services that the data center provides and the particular hardware used by the data center.
  • usage attributes, H may vary among different drive profiles. In training, it is advantageous to define both the device attributes, A, and the usage attributes, H, in detail, so the calculation of probabilities will cover all possible drive types used in the data center.
  • the noise function, E changes according to the values of the device attributes and usage attributes.
  • Operation 546 determines the parameters of the noise function, E, using the testing data. Erase-counts of given time points are fit into the curve by combining the values of the reference erase-count curve, y, and the values of the noise function, E. Although embodiments model E as a Gaussian function, other types of functions can be used. Operation 546 predicts a set of parameters of the Gaussian function. Operation 548 generates the Gaussian function using the predicted set of parameters and combines the result with the value of the reference erase-count curve to generate an estimated erase-count curve that describes the target SSD device over its entire lifetime. Operations 546 and 548 estimate parameters of the function E over a range of times to generate a predicted erase-count curve over any time range.
  • curve 602 represents the actual erase-count curve, y’
  • curve 604 represents the reference erase-count curve, y, associated with the drive profile, D, of the target SSD device.
  • Equation (3) describes the noise function E in terms of points along a curve x that represents the difference between the actual erase-count curve y’ and the reference erase-count curve y.
  • the values of the parameters, s, r, k, and l, of the noise function E are derived as shown in equations (4) through (14).
  • This example derives the values of the parameters based on 11 points on the difference curve x, each point associated with a different set of parameter values of E between t-5 and t+5, as shown in FIG. 6.
  • the eleven points shown in FIG. 6 are only an example. More or fewer sets of parameter values may be used to derive the values of the parameters for the noise function, E.
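  • The sampling of the difference curve x between t-5 and t+5 can be sketched as below. Only the windowing step is shown; the derivation of the noise-function parameters from these points relies on equations that are not reproduced in this text, so it is not attempted here. The function name, the `half_width` argument, and the sample curves are assumptions for illustration.

```python
def difference_window(actual, reference, t, half_width=5):
    """Return [(m, x_m)] for m = t-half_width .. t+half_width, where x = y' - y."""
    lo, hi = t - half_width, t + half_width
    if lo < 0 or hi >= min(len(actual), len(reference)):
        raise IndexError("window extends past the available samples")
    return [(m, actual[m] - reference[m]) for m in range(lo, hi + 1)]


# Made-up curves sampled at unit time steps.
reference = [100 * i for i in range(20)]       # y
actual = [100 * i + i * i for i in range(20)]  # y'

window = difference_window(actual, reference, t=10)
```

  • Passing a different `half_width` gives more or fewer points, in line with the note that eleven points are only an example.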
  • the value of the index of each point is m.
  • equation (10) defines terms that may be used to derive the parameters k, l, r, and s according to equations (11) - (14).
  • Operation 546 of FIG. 5C, which is performed by the prediction module 512, applies the parameters of the Gaussian function to an LSTM network to extract change trends from the parameters. Specifically, parameters calculated at different time points along the difference function, x, are sent to LSTM units to train the LSTM and to predict the best set of parameters for all the unknown time points. Note that the prediction module uses the testing data as both training data and testing data for the LSTM network, because all the parameters are obtained from testing data. The model is robust to local changes because respective LSTM networks are used to predict each parameter of the function E.
  • the parameters of the noise function, E, calculated at a particular point in the lifetime of the SSD device are not necessarily the best parameters because they are derived from a relatively small number of points on the reference erase-count curve and the actual erase-count curve.
  • the values of the parameters may change during operation of the target SSD device due to changes in the usage attributes (e.g., the target SSD device being used by a different set of users and/or with a different application). Accordingly, while the parameters calculated at any particular point in time provide a good estimate of the noise function, they are not likely to be the optimal parameters for the noise function such that the combination of the noise function and the reference erase-count curve correctly models the remaining lifetime of the target SSD device.
  • the parameters, s, r, k, and l, of the noise function, E, are generated by one or more LSTM networks. Specifically, different values of each parameter at different time points are input to respective cells of the LSTM network to train the neural network to predict the best set of parameters for all the unknown time points.
  • FIG. 7 is a block diagram of an exemplary LSTM network 720, which is actually a set of LSTM networks, one for each parameter, s, r, k, and l, of the Gaussian noise function described above with reference to equation (3).
  • each LSTM network includes N LSTM cells.
  • the LSTM network for the parameter s includes LSTM cells 722A, 722B, and so on up to 722N; the LSTM network for the parameter r includes LSTM cells 724A, 724B, and so on up to 724N; the LSTM network for the parameter k includes LSTM cells 726A, 726B, and so on up to 726N; and the LSTM network for the parameter l includes LSTM cells 728A, 728B, and so on up to 728N.
  • Although the LSTM networks are shown as parallel networks, embodiments may calculate the values for each of the parameters at different times using a single LSTM network, that is, in a serial manner.
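  • Because each noise-function parameter has its own time series, one plausible way to prepare data for the per-parameter LSTM networks is to split each series into pairs of (recent values, next value). This sliding-window layout is an assumption made for illustration; the text does not state how the training sequences are formed. The function name, the window length of 3, and the sample histories are hypothetical.

```python
def make_training_pairs(series, window):
    """Split a parameter history into (past values, next value) pairs."""
    if window < 1 or window >= len(series):
        raise ValueError("window must be in 1..len(series) - 1")
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


# Made-up histories of the four noise-function parameters over time.
history = {
    "s": [1.0, 1.1, 1.3, 1.2, 1.4],
    "r": [0.5, 0.5, 0.6, 0.6, 0.7],
    "k": [2.0, 2.1, 2.1, 2.2, 2.3],
    "l": [0.1, 0.2, 0.2, 0.3, 0.3],
}

# One training set per parameter, matching one LSTM network per parameter.
pairs = {name: make_training_pairs(values, 3) for name, values in history.items()}
```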
  • FIG. 8A shows details of an example LSTM cell 800 and FIG. 8B shows multiple LSTM cells, 802, 804, and 806, each corresponding to the cell 800, connected to produce an LSTM network.
  • x_t is the input vector to the LSTM cell
  • f_t is the activation vector of the forget gate
  • i_t is the activation vector for the input gate
  • o_t is the activation vector for the output gate
  • h_t is the output vector of the LSTM cell
  • c_t is the cell state vector
  • W_f, W_i, W_o, W_c, U_f, U_i, and U_o are weight matrixes
  • b_f, b_i, b_o, and b_c are bias vectors.
  • the weight matrixes and the bias vectors are learned during the training of the LSTM network.
  • the activation function σ_g is a sigmoid function
  • the activation functions σ_c and σ_h are hyperbolic tangent (tanh) functions.
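  • For reference, the widely used textbook form of an LSTM cell that matches the quantities defined above is given below, with the sigmoid written as \sigma_g and the tanh functions as \sigma_c and \sigma_h. This is the standard formulation and is offered as an assumption about the cell of FIG. 8A, not as a reproduction of the numbered equations of the embodiments. The recurrent matrix U_c and the candidate state \tilde{c}_t appear in the standard form but are not listed in the definitions above.

```latex
\begin{aligned}
f_t &= \sigma_g\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma_g\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma_g\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \sigma_c\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \sigma_h\left(c_t\right)
\end{aligned}
```

  • Here \odot denotes element-wise multiplication, and h_{t-1} and c_{t-1} are the output and cell state passed from the previous cell in the chain.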
  • Operation 546 uses the LSTM network 720 to predict the parameters of E at different points in time.
  • the predicted values of E are combined with the reference erase-count curve to generate an augmented erase-count curve as shown in operation 548.
  • operation 312 of FIG. 3 uses the augmented erase-count curve to predict the lifetime of the target SSD device.
  • Using the LSTM network 720 to predict the noise function parameters is advantageous as it increases the robustness of the example method and reduces errors caused by fluctuation of the model over time. Predicting the noise function parameters also takes into account the overall dataset for the target device up to the last time point to predict the most recent set of parameters.
  • operation 546 can use data representing the entire erase-count curve or only a few data points of the target SSD device.
  • The first framework for SSD devices uses a Gaussian function for the noise function, E.
  • the actual noise function may not conform to a Gaussian function.
  • the second framework addresses this issue by using a trained encoder-decoder CNN to define a set of parameters, W, for the function y that map respective sets of attributes A and H to respective reference erase-count functions, y.
  • the difference between the reference erase-count function and the actual erase-count function is represented by a set of parameters, W’.
  • the parameters W’ form a two-dimensional matrix; consequently, they cannot be predicted using a regular LSTM network.
  • the embodiments described below address this by using a convolutional LSTM network to estimate the parameter matrix W’ for the function E.
  • W’ is a matrix that represents the difference between the reference data and the test data with respect to the parameters, W, learned by the encoder-decoder CNN.
  • y is also a function of the attributes A and H.
  • W is a matrix of parameters learned by the encoder-decoder CNN and used, with the attributes A and H, to determine a reference erase-count function, y.
  • the noise function, E, used in the second framework is a function of A, H, and W’, where values of the matrix W’ are predicted using the convolutional LSTM network.
  • each row of V is one of the attributes from A and H. Because the number of values of each attribute row in V is not the same, some of the elements may be filled with zeroes or predefined values.
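  • The padding of the attribute matrix V can be sketched as follows: each attribute contributes one row, and shorter rows are filled to the width of the longest row. The function name, the `fill_value` argument, and the sample rows are assumptions for illustration; the text only states that missing elements may be filled with zeroes or predefined values.

```python
def build_attribute_matrix(attribute_rows, fill_value=0):
    """Pad every attribute row to the same width so the rows form a matrix V."""
    if not attribute_rows:
        raise ValueError("at least one attribute row is required")
    width = max(len(row) for row in attribute_rows)
    return [list(row) + [fill_value] * (width - len(row)) for row in attribute_rows]


# Made-up rows: one attribute per row, with differing numbers of values.
rows = [
    [1, 0, 1],  # e.g., a device attribute with three values
    [35],       # e.g., a single temperature reading
    [4, 2],     # e.g., a usage attribute with two values
]

V = build_attribute_matrix(rows)
```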
  • the reference erase-count curve, y, is a two-dimensional matrix with a time value, t, for each column and an erase-count for each row.
  • FIG. 9 is a flow-chart diagram of an example method 900 according to the second framework that generates an augmented erase-count curve which may be used to predict lifetimes for a target SSD device.
  • Operation 902 trains an encoder-decoder CNN 1000 (shown in FIG. 10) to learn the parameters, W, that are used by the encoder-decoder CNN 1000 to associate attributes A and H of an SSD device with a reference erase-count curve for the device.
  • the parameters, W, are relative to all of the attributes A and H.
  • operation 902 of FIG. 9 applies the matrix V, including the attributes A and H of the devices in the historical data set, as training data to an input layer 1002 of the encoder-decoder CNN 1000.
  • the encoder-decoder CNN 1000 learns the parameters W that associate different matrixes V with different reference erase-count curves, y.
  • the output 1004 of the encoder-decoder CNN 1000 is a reference erase-count curve y, corresponding to the device profile into which the SSD device was classified based on the matrix V of attributes for the device and the parameters, W.
  • The method 900 of FIG. 9 applies the matrix V for a target SSD device to the input layer 1002 to determine the reference erase-count curve, y, for the target SSD device based on the parameters, W.
  • the method 900 determines initial values of the parameters W’ by solving equation (20) based on the portion of the actual current erase-count curve, y’, of the target SSD device known from the testing data, the attributes H and A (e.g. the matrix V), and the determined reference erase-count curve, y.
  • Operation 906 applies the initial values of the parameters, W’, to the convolutional LSTM network 1100, shown in FIG. 11, to predict the unknown parameters, W’, that define the noise function E(A,H,W’) of the second framework.
  • Convolutional LSTM network 1100 includes convolutional LSTM cells 1102A, 1102B, and 1102N which respectively receive input values w’_1, w’_t, and w’_n (e.g., particular noise function parameter values at different times) from input terminals 1104A, 1104B, and 1104N.
  • Output values of the convolutional LSTM network 1100 are provided at output terminals 1106A, 1106B, and 1106N, respectively.
  • the example convolutional LSTM network 1100 predicts spatial features of the matrix W’ over time.
  • Equations (22) - (26) define the operations performed by each cell of the convolutional LSTM network 1100 using the standard notation for LSTM cells.
  • f_t, i_t, o_t, C_t, and h_t have the same meaning as in the conventional LSTM described above with reference to FIGs. 8A and 8B.
  • the vectors are weighting vectors that are learned by the convolutional LSTM 1100 from the training data set.
  • the vectors b_f, b_i, b_o, and b_c are offset vectors that are learned by the convolutional LSTM from the training data set.
  • the testing data are used both to train the convolutional LSTM 1100 and to predict parameter values for the function E.
  • the output values of the neural networks shown in FIG. 11, h_1 through h_n, are values of the respective parameters, w’, at different instants as determined by the convolutional LSTM network.
  • the output of the convolutional LSTM is a set of parameters W’ that define the noise function E, which, when combined with the reference erase-count curve function, F, by operation 908 of FIG. 9, produces the augmented erase-count curve, y’, for the target SSD device as shown in equation (20).
  • operation 312 uses the augmented erase-count curve y’ to predict the lifetime of the target SSD device.
  • Embodiments generate maintenance recommendations to reassign the target SSD device, to copy data from the target SSD device, and/or to replace the target SSD before the end of its expected lifetime, as determined from the augmented erase-count curve y’.
  • the drive profile associated with the target SSD device may change during the lifetime of the target SSD device.
  • a hard drive can be assigned to bank users in the first six months of a year but to university users in the second six months of the year.
  • FIG. 12 is a flow chart diagram showing an example method 1200 to detect when the drive profile associated with a target SSD device does not agree with the current erase-count curve, indicating a change in the drive profile associated with the target SSD device.
  • the example method 1200 implements the example operation 310, described above with reference to FIG. 3.
  • An embodiment detects a change in drive profile by detecting changes in the device attributes, A, changes in the augmented erase-count curve, y’, and/or changes in the attributes of the noise function, E, of the first framework, or the noise function, E, of the second framework.
  • Operation 1202 determines whether any of the attributes of A for the target SSD device has changed by an amount greater than a threshold TA for that attribute. Operation 1202 may not check all of the attributes as some attributes will not change and, thus, do not need to be checked. The value of TA depends on the attribute of A being tested and on the tolerances of the target SSD device for the attribute of A.
  • operation 1208 returns a value to operation 310 of FIG. 3 indicating that the drive profile should be changed.
  • operation 1204 analyzes trends in the augmented erase-count curve, y’, generated from the testing data to determine if the trend exhibits a change greater than a threshold value TC. This step detects a change in the shape of the augmented erase-count curve y’. If the shape of the erase-count curve changes, it is usually due to a change in the H attributes. The decision to change the drive profile based on a change in the shape of the augmented erase-count curve y’, however, should be made carefully, because a large number of factors, such as noise, can lead to these changes. The duration, frequency, and magnitude of changes in the y’ curve should be considered.
  • An embodiment compares the absolute value of the first partial derivative (e.g., slope) of the curve y’ with respect to time to a first threshold value TC1 and compares the absolute value of the second partial derivative (e.g., change in slope) of the augmented erase-count curve y’ to a second threshold value TC2 over multiple time intervals. If either of these partial derivatives is greater than its corresponding threshold, operation 1208 returns the value indicating that the drive profile should be changed.
  • When operation 1204 detects no significant trend change in the augmented erase-count curve, operation 1206 compares trends in each of the parameters of the noise function, E, of the framework being used, to respective threshold values to determine whether a trend might indicate a need to change the reference erase-count curve, y.
  • Operation 1206 compares the absolute values of the first and second partial derivatives of each of the parameters with respect to time to a respective threshold value for the parameter over multiple time intervals. When the trend of any parameter exceeds its respective threshold, operation 1208 returns the value indicating that the drive profile should be changed.
  • operation 1210 returns a value to operation 310 of FIG. 3 indicating that the current drive profile and corresponding reference erase-count curve are correct.
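  • The three checks of method 1200 can be sketched with finite differences standing in for the first and second partial derivatives. This is a simplified illustration under stated assumptions: samples are taken at unit time steps, a single exceedance in any interval triggers a change (the text advises also weighing duration and frequency), and all function names, argument names, and sample values are hypothetical.

```python
def finite_differences(values):
    """Differences between consecutive samples (unit time step assumed)."""
    return [b - a for a, b in zip(values, values[1:])]


def trend_exceeds(values, first_threshold, second_threshold):
    """True if |first difference| or |second difference| exceeds its threshold."""
    first = finite_differences(values)
    second = finite_differences(first)
    return (any(abs(d) > first_threshold for d in first)
            or any(abs(d) > second_threshold for d in second))


def profile_should_change(attr_changes, attr_thresholds, curve, tc1, tc2,
                          param_histories, param_thresholds):
    # Operation 1202: has any device attribute moved by more than its threshold?
    for name, change in attr_changes.items():
        if abs(change) > attr_thresholds[name]:
            return True
    # Operation 1204: has the shape of the augmented curve y' changed?
    if trend_exceeds(curve, tc1, tc2):
        return True
    # Operation 1206: has any noise-function parameter drifted too quickly?
    for name, history in param_histories.items():
        first_threshold, second_threshold = param_thresholds[name]
        if trend_exceeds(history, first_threshold, second_threshold):
            return True
    # Operation 1210: the current drive profile still applies.
    return False


# Made-up inputs.
attr_changes = {"temperature": 2.0}
attr_thresholds = {"temperature": 10.0}
param_histories = {"k": [1.0, 1.01, 1.02]}
param_thresholds = {"k": (0.1, 0.1)}

steady = profile_should_change(attr_changes, attr_thresholds,
                               [0, 100, 200, 300, 400], 150, 50,
                               param_histories, param_thresholds)
shifted = profile_should_change(attr_changes, attr_thresholds,
                                [0, 100, 200, 500, 900], 150, 50,
                                param_histories, param_thresholds)
```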
  • FIG. 13 is a block diagram of a server computing device 1300, according to an embodiment. Similar components may be used in other types of computing devices. For example, the clients, servers, and network resources may each use a different set of the components shown in FIG. 13 and/or computing components not shown in FIG. 13 such as the components shown in FIG. 1 and as the storage device management unit 208, shown in FIG. 2 to execute the methods shown in FIGs. 3, 5B, 5C, 9, and 12.
  • One example computing device 1300 may include a processing unit (e.g., one or more processors and/or CPUs) 1302, memory 1303, removable storage 1310, and non-removable storage 1312 communicatively coupled by a bus 1301.
  • the removable storage 1310 may also or alternatively include storage in one of the nodes 108A through 108N of the data center 104 accessible via the high-speed I/O channel 110, shown in FIG. 1, and the high-speed I/O interface 206, shown in FIG. 2.
  • Memory 1303 may include volatile memory 1314 and non-volatile memory 1308.
  • Computing device 1300 may include or have access to a computing environment that includes a variety of computer-readable media, such as volatile memory 1314 and non-volatile memory 1308, removable storage 1310 and non-removable storage 1312.
  • Computer storage includes random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disk (DVD) or other optical disk storage devices, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
  • Computing device 1300 may include or have access to a computing environment that includes input interface 1306, output interface 1304, and communication interface 1316.
  • Output interface 1304 may provide an interface to a display device, such as a touchscreen, that also may serve as an input device.
  • the input interface 1306 may provide an interface to one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the server computing device 1300, and/or other input devices.
  • the computing device 1300 may operate in a networked environment using a communication interface 1316 to connect to one of the networks 102 and/or with one or more of the server nodes 108A through 108N of the data center 104 using the high-speed I/O channel 110.
  • the communication interface may include one or more of an interface to a local area network (LAN), a wide area network (WAN), a cellular network, a WLAN network, and/or a Bluetooth® network.
  • the processor 1302 of the server computing device 1300 executes computer-readable instructions stored on a computer-readable storage medium.
  • Computer-readable instructions may include applications 1318 such as the methods 300, 520, 540, 900, and 1200, described above, stored in the memory 1303.
  • a hard drive, CD-ROM, RAM, and flash memory are some examples of articles including a non-transitory computer-readable medium such as a storage device.
  • the terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory.
  • the functions or algorithms described herein may be implemented using software in one embodiment.
  • the software may consist of computer-executable instructions stored on computer-readable media or computer-readable storage device such as one or more non-transitory memories or other type of hardware-based storage devices, either local or networked, such as in applications 1318.
  • modules which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples.
  • the software may be executed on one or more CPUs, including a single-core or a multi-core processor, digital signal processor, application specific integrated circuit (ASIC), microprocessor, or other type of processor operating on a computer system, turning such computer system into a specifically programmed machine.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An apparatus and a method for predicting lifetime for a target digital device combine a reference performance measure curve, representing a performance measure of the target device, with a noise function that represents a difference between the reference performance measure curve and an actual performance measure curve for the target device. The apparatus and method predict parameters of the noise function to provide predicted noise function values. The predicted noise function values are combined with the reference performance measure curve to produce an augmented performance measure curve that predicts the performance measure of the target digital device. The apparatus and method use the augmented performance measure curve to generate maintenance recommendations for the target digital device.

Description

DEVICE LIFETIME PREDICTION
Cross-Reference to Related Applications
[0001] This application claims priority from U.S. Provisional Application No. 62/891,125, filed August 23, 2019, the contents of which are incorporated herein by reference.
Technical Field
[0002] Apparatus and methods for predicting the lifetimes of digital devices and, in particular, for predicting the expected lifetime of a solid-state drive (SSD) device, are disclosed.
Background
[0003] Companies make increasing use of digital devices for network-connected (e.g., “cloud”) data processing services to process large amounts of data (“big data”) to perform business tasks and enhance competitiveness. Data centers hosting the network-connected data processing services may employ thousands or even hundreds of thousands of components including: storage devices to hold the large amounts of data, servers to process the data, and telemetry devices to provide results of data processing services to clients. These components need to be replaced and/or maintained to ensure uninterrupted service to data center clients.
[0004] In addition, personal devices are becoming prevalent and process increasing amounts of data. The growing use of data brings new challenges to data centers and consumers regarding use and management of data storage devices. When a storage device, such as an SSD or a disk drive, fails, data in the storage device may be lost, resulting in interruption of operations and financial loss to businesses as well as the loss of personal data, such as pictures, videos, and financial information, to consumers. Failure of storage devices is an issue for both businesses and consumers.
[0005] The lifetime of a device depends on attributes of the device and on how the device is used. Equivalent devices may have large differences in their actual lifetimes based on their different usage profiles. Maintenance schedules for these devices may use the minimum lifetime estimates to ensure that devices are replaced before they fail. This results in many of these devices being replaced too soon, well before the actual end of their lifetimes.
Summary
[0006] The examples below describe apparatus and methods for predicting the lifetime of a digital device. The examples combine a reference curve, representing a performance measure of the device, with a noise function that represents a difference between the reference curve and an actual performance measure curve for the device. The apparatus and method predict parameters of the noise function to provide predicted noise function values. The predicted noise function values are combined with the reference curve to produce an augmented performance measure curve. The example apparatus and method use the augmented performance measure curve to generate maintenance recommendations for the digital device.
[0007] These examples are encompassed by the features of the independent claims. Further embodiments are apparent from the dependent claims, the description and the figures.
[0008] According to a first aspect, a method for predicting a lifetime of a digital device includes using a set of device attributes and usage attributes of the digital device to determine a device profile having an associated reference performance measure curve. The method determines a set of parameters of a noise function where the noise function represents a difference between the reference performance measure curve and a current performance measure curve of the digital device. The method calculates a set of predicted parameters of the noise function and combines the noise function having the set of predicted parameters with the reference performance measure curve to produce an augmented performance measure curve that indicates the future performance measure of the target digital device. The method uses the augmented performance measure curve to determine an expected lifetime of the digital device and/or provide a maintenance recommendation for the digital device.
[0009] In a first implementation of the method according to the first aspect, the method includes determining a trend in at least one of the device attributes of the digital device, in the augmented performance measure curve, or in at least one of the predicted parameters. When the trend exceeds a respective threshold value, the method determines a second device profile for the digital device, different from the first device profile. The second device profile includes a second reference performance measure curve. The method determines a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to a current measure of the performance measure of the digital device. The method calculates a second set of predicted parameters for the noise function based on the second set of parameters and combines the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve which is used to determine the expected lifetime of the digital device and/or to provide the maintenance recommendation for the digital device.
[0010] In a second implementation of the method according to the first aspect, the method includes determining the device profile for the digital device by calculating respective conditional probability values for a plurality of predetermined device profiles, based on the device attributes and the usage attributes of the digital device, and selecting the predetermined device profile having the highest conditional probability value as the determined device profile.
[0011] In a third implementation of the method according to the first aspect, the method includes calculating the set of predicted parameters of the noise function by applying the parameters of the noise function to a long short term memory (LSTM) network.
[0012] In a fourth implementation of the method according to the first aspect, the noise function conforms to a Gaussian function and the set of parameters includes a set of parameters of the Gaussian function. The set of parameters causes the Gaussian function to approximate a difference between the reference performance measure curve and the current performance measure curve of the digital device.
[0013] In a fifth implementation of the method according to the first aspect, the digital device is a solid state drive (SSD) device, the reference performance measure curve is a reference erase-count curve, and the augmented performance measure curve of the digital device is a predicted erase-count curve for the SSD device.
[0014] In a sixth implementation of the method according to the first aspect, the method includes determining the device profile by applying device attributes and usage attributes of the digital device to an encoder-decoder convolutional neural network (CNN) having trained parameters.
[0015] In a seventh implementation of the method according to the first aspect, the determining of the first set of parameters of the noise function includes defining the noise function as a function of the device attributes of the target digital device, the usage attributes of the target digital device, and the set of parameters of the noise function. The defined noise function corresponds to a difference between the current performance measure curve and the reference performance measure curve of the target digital device.
[0016] In an eighth implementation of the method according to the first aspect, the method includes calculating the set of predicted parameters of the noise function by applying the parameters of the noise function to a convolutional LSTM network.
[0017] According to a second aspect, an apparatus for predicting a lifetime of a digital device includes a computing device that uses a set of device attributes and usage attributes of the digital device to determine a device profile associated with a reference performance measure curve. The computing device determines a set of parameters of a noise function, wherein the noise function represents a difference between the reference performance measure curve and an actual performance measure curve of the digital device. The computing device calculates a set of predicted parameters of the noise function based on the set of parameters and combines the noise function having the set of predicted parameters with the reference performance measure curve to produce an augmented performance measure curve. The augmented performance measure curve predicts the performance measure of the target digital device. The computing device determines an expected lifetime of the digital device and/or generates a maintenance recommendation for the digital device based on the augmented performance measure curve.
[0018] In a first implementation of the apparatus according to the second aspect, the computing device determines a trend in at least one of the device attributes of the digital device, in the augmented performance measure curve, or in at least one of the predicted parameters. When the trend exceeds a respective threshold value, the computing device determines a second device profile for the digital device, different from the device profile. The second device profile includes a second reference performance measure curve. The computing device determines a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to a current performance measure of the digital device. The computing device calculates a second set of predicted parameters for the noise function based on the second set of parameters and combines the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve. The computing device determines the expected lifetime of the digital device and/or generates the maintenance recommendation for the digital device based on the second augmented performance measure curve.
[0019] In a second implementation of the apparatus according to the second aspect, the computing device determines the device profile for the digital device by calculating respective conditional probability values for a plurality of predetermined device profiles based on the device attributes of the digital device and the usage attributes of the digital device to generate a plurality of conditional probability values for the respective plurality of predetermined device profiles. The computing device selects the predetermined device profile having the highest conditional probability value as the device profile.
[0020] In a third implementation of the apparatus according to the second aspect, the computing device calculates the set of predicted parameters of the noise function by applying the parameters of the noise function to an LSTM network.
[0021] In a fourth implementation of the apparatus according to the second aspect, the noise function conforms to a Gaussian function and the set of parameters includes a set of parameters of the Gaussian function. The set of parameters causes the Gaussian function to approximate a difference between the reference performance measure curve and the current performance measure curve of the digital device.
[0022] In a fifth implementation of the apparatus according to the second aspect, the digital device is an SSD device, the reference performance measure curve is a reference erase-count curve, and the augmented performance measure curve of the digital device is a predicted erase-count curve for the SSD device.
[0023] In a sixth implementation of the apparatus according to the second aspect, the computing device is configured to select the device profile by applying the device attributes and the usage attributes of the digital device to an encoder-decoder CNN having trained parameters.
[0024] In a seventh implementation of the apparatus according to the second aspect, the computing device determines the first set of parameters of the noise function by defining the noise function as a function of the device attributes of the target digital device, the usage attributes of the target digital device, and the set of parameters of the noise function. The defined noise function corresponds to a difference between the current performance measure curve and the reference performance measure curve of the target digital device.
[0025] In an eighth implementation of the apparatus according to the second aspect, the computing device calculates the set of predicted parameters of the noise function by applying the parameters of the noise function to a convolutional LSTM network.
[0026] According to a third aspect, a non-transitory computer-readable medium includes program instructions which, when executed by a processor, cause the processor to predict a lifetime of a digital device. The program instructions cause the processor to determine a first device profile having an associated reference performance measure curve. The instructions cause the processor to determine a set of parameters of a noise function where the noise function represents a difference between the reference performance measure curve and an actual performance measure curve of the digital device. The instructions further cause the processor to calculate a set of predicted parameters of the noise function based on the set of parameters and to combine the noise function having the set of predicted parameters with the reference performance measure curve to produce an augmented performance measure curve. The augmented performance measure curve predicts the performance measure of the target digital device. The instructions cause the processor to determine an expected lifetime of the digital device and/or to generate a maintenance recommendation for the digital device based on the augmented performance measure curve.
[0027] In a first implementation of the computer-readable medium according to the third aspect, the program instructions cause the processor to determine a trend in at least one of the device attributes of the digital device, in the augmented performance measure curve, or in at least one of the predicted parameters. When the trend exceeds a respective threshold value, the instructions cause the processor to determine a second device profile for the digital device, different from the first device profile. The second device profile includes a second reference performance measure curve. The program instructions further cause the processor to determine a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to a current measure of the performance measure of the digital device. The program instructions cause the processor to calculate a second set of predicted parameters for the noise function based on the second set of parameters and to combine the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve. The instructions cause the processor to determine the expected lifetime of the digital device and/or to generate the maintenance recommendation for the digital device based on the second augmented performance measure curve.
Brief Description of the Drawings
[0028] FIG. 1 is a block diagram of a computer system including a data center that interfaces with multiple client networks in accordance with an embodiment.
[0029] FIG. 2 is a block diagram of a shared database or a shared file system in accordance with an embodiment.
[0030] FIG. 3 is a flow-chart diagram of methods in accordance with embodiments.
[0031] FIGs. 4A, 4B, 4C, and 4D are reference curves of erase-counts versus power on hours in accordance with embodiments.
[0032] FIG. 5A is a data flow diagram according to a first method for estimating an erase-count curve in accordance with embodiments.
[0033] FIGs. 5B and 5C are flow-chart diagrams illustrating the first method for estimating an erase-count curve in accordance with embodiments.
[0034] FIG. 6 is a graph of erase-count versus power on hours showing computation details of a noise function in accordance with an embodiment.
[0035] FIG. 7 is a block diagram of a multi-channel Long Short Term Memory (LSTM) network in accordance with an embodiment.
[0036] FIGs. 8A and 8B are block diagrams of an LSTM cell and an LSTM network in accordance with an embodiment.
[0037] FIG. 9 is a flow-chart diagram illustrating a second method for estimating an erase-count curve in accordance with embodiments.
[0038] FIG. 10 is a data diagram showing an encoder-decoder convolutional neural network (CNN) in accordance with an embodiment.
[0039] FIG. 11 is a block diagram of a convolutional LSTM network in accordance with an embodiment.
[0040] FIG. 12 is a flow-chart diagram illustrating a method for determining whether to change the reference erase-count curve of a target device.
[0041] FIG. 13 is a block diagram of a computing system according to an embodiment.
Detailed Description
[0042] In the following description, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosed subject matter and it is to be understood that other embodiments may be utilized and that structural, logical, and electrical changes may be made without departing from the scope of the appended claims. The following description of embodiments is, therefore, not to be taken in a limited sense.
[0043] The functions, methods, and/or algorithms described herein may be implemented in software in the embodiments. The software may consist of computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware-based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.
[0044] The functionality can be configured to perform an operation using, for instance, software, hardware, firmware, or the like. For example, the phrase “configured to” can refer to a logic circuit structure of a hardware element that is to implement the associated functionality. The phrase “configured to” can also refer to a logic circuit structure of a hardware element that is to implement the coding design of associated functionality of firmware or software. The term “module” refers to a structural element that can be implemented using any suitable hardware (e.g., a processor, among others), software (e.g., an application, among others), firmware, or any combination of hardware, software, and firmware. The term “logic” encompasses any functionality for performing a task. For instance, each operation illustrated in the flowcharts corresponds to logic for performing that operation. An operation can be performed using software, hardware, firmware, or the like. The terms “component,” “system,” and the like may refer to computer-related entities, hardware, software in execution, firmware, or a combination thereof. A component may be a process running on a processor, an object, an executable, a program, a function, a method, a subroutine, a computer, or a combination of software and hardware. The term “processor” may refer to a hardware component, such as a processing unit of a computer system, and may include multiple single-core processors and/or multi-core processors.
[0045] Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computing device to implement the disclosed subject matter. The term “article of manufacture,” as used herein, is intended to encompass a computer program accessible from any computer-readable storage device or media. Computer-readable storage media can include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., compact disks (CDs), digital versatile disks (DVDs)), smart cards, and SSDs, among others.
[0046] The embodiments describe estimating the remaining lifetime of a target SSD device in a data center environment. However, the same methods may be used to estimate other performance measures of an SSD device or performance measures of a mechanical hard disk drive (HDD) device or other devices (e.g., processor, random access memory (RAM) device, or cooling fan). The described embodiments predict the remaining lifetime of the target SSD device based on a predicted erase-count curve for the target SSD device. Other performance measures may be used to predict the lifetime of other types of devices. For example, reference performance measure curves of the spin-up time or spindle start/stop count of a mechanical HDD may be used to predict its remaining lifetime. Furthermore, the embodiments are not limited to digital devices in data centers. These methods may be applied to consumer devices using reference performance measure curves generated on the user equipment or obtained via a web service. It is contemplated that parts of the method may be implemented locally and other parts implemented by a web service. For example, a reference erase-count curve for a consumer SSD may be obtained by sending the device attributes and usage attributes collected locally to a web service and receiving the reference erase-count curve from the web service.
[0047] Erase-count may be considered to be a key performance index (KPI) of SSD devices because, for many SSD devices, the failure of the device may be predicted by analyzing its erase-counts. The embodiments described herein include two example machine-learning-based frameworks to predict an erase-count curve for a target device. The predicted erase-count curve predicts how the erase-counts of a target SSD device will change over the remaining lifetime of the target device and thus provides an estimate of the remaining device lifetime.
[0048] Specifically, an example of the first framework, described below with reference to FIGs. 5A, 5B, 5C, 6, 7, 8A, and 8B, uses conditional probabilities to select a reference erase-count curve for a target device and determines a difference between the reference erase-count curve and an actual erase-count curve for the target device. This difference, referred to herein as a noise function, is modeled as a Gaussian function. The framework uses one or more LSTM networks to estimate parameters of the noise function. The first framework employs a combination of the reference erase-count curve and the estimated noise function derived from the estimated parameters to define an augmented erase-count curve (e.g., a predicted erase-count curve) for the target device. The first framework uses this augmented erase-count curve to determine whether the target SSD device needs maintenance and/or to determine the expected lifetime of the target SSD device.
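The combination of a reference erase-count curve with a Gaussian noise function described above can be illustrated with a minimal Python sketch. The Gaussian parameterization (amplitude, mean, sigma), the linear reference curve, the maximum erase-count, and all function names are assumptions chosen for illustration; they are not taken from the embodiments, and the noise parameters here are fixed by hand rather than predicted by an LSTM network.

```python
import math

def gaussian_noise(t, amplitude, mean, sigma):
    """Gaussian-shaped noise term at power-on hour t (assumed parameterization)."""
    return amplitude * math.exp(-((t - mean) ** 2) / (2.0 * sigma ** 2))

def augmented_curve(reference, hours, amplitude, mean, sigma):
    """Add the noise term to the reference erase-count at each power-on hour."""
    return [reference(t) + gaussian_noise(t, amplitude, mean, sigma) for t in hours]

def expected_lifetime(hours, erase_counts, max_erase_count):
    """Return the first sampled hour at which the predicted erase-count
    reaches the limit, or None if the limit is never reached."""
    for t, count in zip(hours, erase_counts):
        if count >= max_erase_count:
            return t
    return None

def reference(t):
    """Hypothetical reference curve: 0.5 erases per power-on hour."""
    return 0.5 * t

hours = list(range(0, 10001, 100))
# Hypothetical noise parameters; in the embodiments these would be predicted.
predicted = augmented_curve(reference, hours, amplitude=200.0, mean=4000.0, sigma=1500.0)
lifetime = expected_lifetime(hours, predicted, max_erase_count=3000.0)
print(lifetime)
```

With these illustrative values, the positive noise term moves the crossing of the erase-count limit earlier than the reference curve alone would indicate, which is the kind of device-specific correction the augmented curve is meant to capture.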
[0049] An example of the second framework, described below with reference to FIGs. 9, 10, and 11, uses an encoder-decoder CNN to learn a set of parameters to be used to associate SSD devices with reference erase-count curves and, based on those parameters, to determine a particular reference erase-count curve for a target SSD device. The second framework employs a convolutional LSTM network to estimate the noise function parameters. The second framework combines the reference erase-count curve and the estimated noise function derived from the estimated parameters to define an augmented erase-count curve which is used to determine whether the target device needs maintenance and/or to determine the expected lifetime of the target device.
[0050] In both frameworks, estimating the parameters of the noise function reduces the amount of testing data used for prediction, allowing calculation of a lifetime estimate earlier in the operation of the target SSD device than could be achieved by estimating the lifetime directly based on a predicted erase-count curve. Modeling the erase-count curve as a reference erase-count curve plus a noise function also allows lifetime prediction to be based on a relatively small number of time points, as long as the noise function truly represents the difference between the actual erase-count curve and the reference erase-count curve. Once the parameters of the noise function are known, the noise function can be predicted and, thus, the erase-count curve for the target device can also be predicted. The first framework models the noise function as a Gaussian function. The second framework uses an encoder-decoder CNN to determine parameters that define the reference erase-count curve based on training data. This framework applies testing data for a target SSD device to the encoder-decoder CNN to determine the reference erase-count curve. The framework determines parameters of a noise function and uses a convolutional LSTM to estimate parameters for the noise function. The second framework combines the noise function having the estimated parameters with the reference erase-count curve to produce an augmented erase-count curve that can be used to estimate the lifetime of the SSD device.
[0051] An SSD storage device is typically operated in a write-erase-write cycle. When the SSD device is full or when the user deletes data from the SSD device, the data in the device are erased before new data are written. The write-erase-write cycle is also referred to as a program-erase cycle (P/E cycle). The P/E cycle is a criterion for quantifying the endurance of a device. Generally speaking, each cycle consumes a tiny portion of the lifetime of a device. As the number of cycles increases, the remaining lifetime of a device decreases. Because erasing pairs with writing, erase-count may be considered a KPI in measuring lifetime of an SSD storage device.
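The relationship between P/E cycles and remaining lifetime described above can be illustrated with a short arithmetic sketch in Python. The rated endurance value, the constant erase rate, and the function names are hypothetical; this naive linear estimate is shown only as a baseline and is not the prediction method of the embodiments.

```python
def remaining_life_fraction(erase_count, rated_pe_cycles):
    """Fraction of the (assumed) rated endurance still available, floored at 0."""
    if rated_pe_cycles <= 0:
        raise ValueError("rated_pe_cycles must be positive")
    return max(0.0, 1.0 - erase_count / rated_pe_cycles)

def remaining_hours(erase_count, rated_pe_cycles, erases_per_hour):
    """Naive estimate of power-on hours left at a constant erase rate."""
    if erases_per_hour <= 0:
        return float("inf")  # no wear is accumulating
    return max(0, rated_pe_cycles - erase_count) / erases_per_hour

# Hypothetical device: rated for 3000 P/E cycles, 750 already consumed.
print(remaining_life_fraction(750, 3000))
print(remaining_hours(750, 3000, erases_per_hour=0.5))
```

Because real erase rates vary with workload, a fixed-rate estimate such as this one can be far off for a given device, which motivates predicting a device-specific erase-count curve instead.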
[0052] FIG. 1 is a block diagram of a computing system 100 including a data center 104 that interfaces with multiple client networks 102 in accordance with an embodiment. The data-center environment is one example of the use of the illustrated frameworks. These frameworks may be used to predict expected performance measure curves for myriad devices used in personal or commercial environments. The example data center 104 processes tasks received from the client networks 102 and provides results to the networks 102. These tasks include executing applications such as, without limitation, word processors, spreadsheets, database management systems (DBMSs), and accounting systems. The data center 104 includes a load balancer 106 that distributes the tasks among multiple processing nodes 108A, 108B, 108C, 108D, and 108N. The processing nodes 108A, 108B, 108C, 108D, and 108N are coupled, via a high-speed input/output (I/O) channel 110, to each other, to shared databases 112, and to shared file systems 118 and 120. FIG. 1 shows two shared databases including SSD devices 114A, 114B, and 114C, and SSD devices 116A, 116B, and 116C, respectively.
[0053] FIG. 2 is a block diagram of a shared database or a shared file system 200 in accordance with an embodiment. The database/file system shown in FIG. 2 includes multiple SSDs 202A, 202B, and 202N coupled to a storage device interface 204. The storage device interface 204 handles I/O operations among the processing nodes 108A, 108B, 108C, 108D, and 108N and the SSDs 202A, 202B, and 202N. These I/O operations use a high-speed I/O interface 206 to pass data to and from the high-speed I/O channel 110, shown in FIG. 1. The storage device interface 204 also couples the SSDs 202A, 202B, and 202N to a storage device management unit 208 that monitors the status (e.g., the expected lifetime) of the SSDs 202A, 202B, and 202N and generates maintenance recommendations as described in the embodiments. Maintenance recommendations include, for example, pre-ordering a replacement for an SSD device that is predicted to fail by a particular date (e.g., the SSD has a 95 percent probability of failing within seven days). Alternatively, the impending failure of an SSD device may depend on how the device is being used. In this instance, maintenance recommendations may include copying the data from the target device to another device and reassigning the target device to an application that performs fewer write-erase cycles. The predicted failure of the target device may also depend on the environmental conditions in which the device is being used. In this instance, the recommended maintenance may include changing environmental factors, such as providing additional cooling. In embodiments, the storage device management unit 208 implements the methods 300, 500, 520, 540, 900, and 1200 described below with reference to FIGs. 3, 5A, 5B, 5C, 9, and 12. The storage device management unit 208 may include or be coupled to the LSTM network 720 shown in FIGs. 7, 8A, and 8B; the encoder-decoder CNN 1000 shown in FIG. 10; and/or the convolutional LSTM network 1100 shown in FIG. 11. It is contemplated, however, that these methods could be implemented in other devices, for example, one or more of the processing nodes 108A, 108B, 108C, 108D, and 108N or in a combination of the storage device management unit 208 and one or more of the processing nodes 108A, 108B, 108C, 108D, and 108N. It is also contemplated that the networks 720, 1000, and/or 1100 may be implemented in hardware coupled to, and/or in software configured to execute on, the storage device management unit 208 and/or one or more of the processing nodes 108A, 108B, 108C, 108D, and 108N.
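The maintenance recommendations described above (pre-ordering a replacement, reassigning the device to a lighter workload, or adjusting environmental factors) can be sketched as a simple decision rule in Python. The `Prediction` fields, the two cause flags, the 0.95 threshold default, and the recommendation strings are hypothetical names introduced for illustration; the embodiments do not specify this logic.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    failure_probability: float  # predicted probability of failure within the horizon
    horizon_days: int           # length of the prediction horizon
    usage_driven: bool          # hypothetical flag: wear attributed to the workload
    environment_driven: bool    # hypothetical flag: wear attributed to environment

def recommend(prediction, probability_threshold=0.95):
    """Map a failure prediction to a list of maintenance recommendations."""
    if prediction.failure_probability < probability_threshold:
        return []  # no action needed within this horizon
    if prediction.usage_driven:
        return ["copy data to another device",
                "reassign device to an application with fewer write-erase cycles"]
    if prediction.environment_driven:
        return ["provide additional cooling"]
    return ["pre-order a replacement SSD"]

# Example from the text: 95 percent probability of failing within seven days.
print(recommend(Prediction(0.95, 7, usage_driven=False, environment_driven=False)))
```

A rule table of this kind keeps the prediction step separate from the policy step, so the thresholds and actions could be tuned per data center without retraining the predictive networks.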
[0054] The lifetime of an SSD device depends on a variety of factors, including device specifications (device attributes, A) and how a device is being used (usage attributes, H). In order to prevent device failure, a general solution is to replace a device at a specific time before the end of its life. It is difficult, however, to anticipate a good time to replace a target device. Replacing a device too early will increase costs unnecessarily, while replacing it too late results in the loss of data when the device fails. Furthermore, not every SSD device is used in the same way, so different SSD devices may have different maximum numbers of erase-counts and/or different numbers of power-on hours until failure.
[0055] The embodiments described herein use machine learning methods to predict device lifetimes. Specifically, the methods generate reference erase-count curves using historic data generated from actual and/or simulated SSD devices. The method uses the reference erase-count curves as benchmarks to guide prediction of the remaining lifetime for a target SSD device. In some embodiments, probability methods are applied to identify the most probable reference erase-count curve for a target device based on device attributes, A, and usage attributes, H, of the target device. Embodiments augment the identified erase-count curve with a noise function and use networks to predict parameters of the noise function.
[0056] Other embodiments train an encoder-decoder CNN on historic erase-count data to learn parameters, W, from the A and H attributes of the historic data. These embodiments then use the parameters W to associate target SSD devices with their respective reference erase-count curves based on the A and H attributes of the target SSD devices. These embodiments use the parameters, W, and the trained encoder-decoder CNN to determine a particular reference erase-count curve for a particular target device. These embodiments employ a convolutional LSTM to predict the parameters of the noise function. In both frameworks, the reference erase-count curve augmented by the noise function is a predicted erase-count curve that is used to predict the remaining lifetime of the SSD device.
[0057] The described embodiments predict the noise function instead of predicting the erase-count curve or device lifetime directly. This process achieves better results than direct prediction of the erase-count curve because the noise function represents the difference between the reference erase-count curve and the actual erase-count curve and is robust to changes. The example methods predict the erase-count curve of the target SSD device based on the predicted parameters of the noise function at any point in the lifetime of the target SSD device. Thus, the example methods are also advantageous as they can predict the lifetime of an SSD device at an early stage of device usage. The ability to predict lifetimes early reduces the data needed to predict the device lifetime and allows the device to be used over a longer lifetime.
[0058] FIG. 3 is a flow-chart diagram of a method 300 according to embodiments. The method 300 shown in FIG. 3 is relevant to both of the two frameworks described above. Operation 302 obtains device attributes, A, and usage attributes, H, for a target SSD device. Operation 304 uses these attributes to determine a reference erase-count curve for the target device. Next, operation 306 defines a noise function based on a difference between the reference erase-count curve and the current actual erase-count curve for the target device. Once this noise function is defined, operation 308 predicts the parameters of the noise function to generate a predicted noise function. Operation 310 analyzes trends in the device attributes, actual erase-count curve, and the parameters of the noise function to determine whether the reference erase-count curve for the device is no longer valid and should be changed. When operation 310 determines that the reference erase-count curve should be changed, it transfers control to operation 304, described above, to select a new reference erase-count curve. When operation 310 determines that the current reference erase-count curve is valid, operation 312 uses the noise function to augment the reference erase-count curve and uses the augmented erase-count curve to predict the lifetime of the target SSD device and/or to generate maintenance recommendations for the target SSD device.
[0059] An embodiment generates historical data for multiple devices 202A, 202B, and 202N (shown in FIG. 2) by operating the devices or simulating the operation of the devices over their lifetimes to generate respective historic erase-count curves for the devices 202A, 202B, and 202N. Operation 302 of the method 300 obtains device attributes, A, and usage attributes, H, for each of the devices 202A, 202B, and 202N monitored by the storage device management unit 208 or for each of the devices monitored by all of the storage management units 208 in the data center 104. Example device attributes, A, include values of the Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes for the particular SSD devices 202A, 202B, and 202N. These device attributes include the model number of the device, information about how the device was manufactured (e.g., attributes such as built-in error correction that differentiate a consumer-grade device from an enterprise-grade device), and performance attributes (e.g., read-error rate, remaining space, and maximum erase-count).
[0060] The usage attributes, H, describe how the device is used and include factors such as, without limitation, the particular applications with which the SSD devices are used, the number of users using each of the SSD devices, and the identity of the client using the SSD device. In addition, the usage attributes include time related attributes such as the hour/day/month/year at which the device is used and the duty cycle and/or duration of such use. Furthermore, the usage attributes include environmental attributes (e.g., temperature, humidity, and vibration).
[0061] The problem addressed by the embodiments is represented by the function: y’ = y + E, where y is the known reference erase-count curve of a target device, y’ is a predicted erase-count curve (e.g., the augmented erase-count curve) used to guide device maintenance, and E is a noise function. As described above, embodiments employ one of two methods to predict y’. The first method selects the reference erase-count curve, y, based on probability. This method models the noise function, E, as a Gaussian function and uses an LSTM to predict the parameters of the noise function, E. The second method uses a trained encoder-decoder CNN to determine the reference erase-count curve, y, and uses a convolutional LSTM to predict the parameters of the noise function, E. The second method does not assume that the noise function, E, can be modeled as a Gaussian function or as any specific function. It uses the encoder-decoder CNN to determine the reference erase-count curve, y, for a target SSD device based on the A and H attributes of the device and a set of learned parameters, W. The parameters W are generated by the encoder-decoder CNN based on the historical training data. The second method defines a set of parameters, W’, that represent a difference between the current erase-count curve of the target device and the reference erase-count curve. The parameters, W’, which correspond to the noise function, are estimated using a convolutional LSTM and the testing data to determine the augmented erase-count curve.
[0062] FIGs. 4A, 4B, 4C, and 4D show example graphs of erase-count versus power-on time, where the y-axis indicates erase-counts and the x-axis indicates power-on time. Each of these graphs shows a different erase-count curve relative to power-on time. In each graph, as time goes by, erase-counts increase at varying rates depending on characteristics of the device and on how the device is used. Near the end of the SSD lifetime, the increase of erase-counts slows down. These graphs may be generated based on the performance of actual or simulated SSD devices. The graphs shown in FIGs. 4A, 4B, 4C, and 4D are synthetic curves generated from simulations of SSD devices. The different erase-count curves reflect differences in the construction and configuration of the devices (device attributes, A) and in how devices are used (usage attributes, H). For example, the erase-count curves of university users are different from those of bank users. Bank users tend to erase and write SSD devices frequently during business hours and rarely after business hours. On the other hand, students in universities may run computers overnight to get experiment results, such that the erasing and writing of SSD devices by university students will not follow a workday schedule.
[0063] The curve 402 in FIG. 4A shows the erase-count curve for an SSD device that is accessed heavily over a relatively short term, after having been powered on for about 900 hours. The device is accessed about 100,000 times over the next 1000 powered-on hours and approaches its maximum number of erase cycles at about 2000 powered-on hours. This curve may correspond, for example, to a usage profile of a television network streaming media server that is continually writing new program material to the SSD device.
[0064] Curve 404, shown in FIG. 4B, is an erase-count curve that exhibits different rates of erase-counts at different times. This curve may correspond to a usage profile of an enterprise data processing system which has a greater number of erase cycles at the end of each three-month interval, for example, to accommodate quarterly financial statement processing. After the financial statement processing is complete, the number of erase cycles decreases for the remainder of the quarter.
[0065] The curves 406 and 408 shown in FIGs. 4C and 4D are similar to the curve shown in FIG. 4B in that they describe varying rates of erase operations. These differences may be the result of device attributes, A, usage attributes, H, or both. The user does not need to understand these differences because the reference erase-count curves are automatically produced, for example, by clustering and averaging actual historical erase-count curves of real and/or simulated SSD devices.
[0066] FIG. 5A is a data flow diagram of an example method 500, described in more detail below with reference to the methods 520 of FIG. 5B and 540 of FIG. 5C, that implements the first framework. The system shown in FIG. 5A includes a historical data store 502, a calculate module 504, a reference curve probabilities store 506, a source of testing data 507, a device profiles store 508, a reference curve store 510, and a prediction module 512. In an embodiment, the reference curves are generated before processing any target device. An example method for generating the reference curves and the device profile probabilities is described below with reference to FIG. 5B, and an example method for estimating the noise function parameters and using the estimated noise function parameters to predict the lifetime of an SSD device is described below with reference to FIG. 5C.
[0067] FIGs. 5B and 5C are flow-chart diagrams showing methods 520 and 540 according to the first framework that implement the data flow shown in FIG. 5A. Method 520 generates a set of reference erase-count curves and method 540 selects one of the generated reference erase-count curves, predicts parameters of a noise function, and combines the reference erase-count curve with the predicted noise function to produce the augmented erase-count curve. Operation 522 of the method 520 obtains historical data including the attributes A and H, and erase-count curves for multiple real and/or simulated SSD devices. These data are stored in the historical data store 502. The combination of the historical erase counts and the historical values of A and H for a particular device are referred to as a historical data set. Operation 524, performed by the calculate module 504, clusters the historical data sets for the multiple SSD devices. Any number of clustering algorithms may be used to cluster the data sets, including K-means clustering, mean-shift clustering, and Gaussian mixture model (GMM) clustering. Each cluster has a device profile that is a combination of A and H such that similar devices having similar usage profiles are clustered together.
[0068] Once the historical data sets have been clustered, operation 526 generates a respective reference erase-count curve to represent each cluster. These erase-count curves are stored in the reference curve store 510. In embodiments, operation 526 averages the historical erase-count curves of all of the historical data sets in the cluster to generate the reference erase-count curve for the cluster. Operation 526 stores the reference erase-count curves and an indication of the number of historical devices corresponding to each reference erase-count curve in the reference curve store 510. Each reference erase-count curve is associated with a respective device profile stored in the device profile store 508. The device profile also includes the attributes A and H.
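The averaging step of operation 526 can be illustrated with a minimal Python sketch. It assumes that cluster labels have already been assigned by a separate clustering step, that all historical curves are sampled at the same power-on times, and that the names `build_reference_curves`, `historical_curves`, and `cluster_labels` are hypothetical and do not appear in the embodiments.

```python
from collections import defaultdict


def build_reference_curves(curves, labels):
    """Average the erase-count curves of each cluster, point by point.

    curves: list of equal-length lists of erase counts, one list per device.
    labels: one cluster label per device (assumed to come from a prior
            clustering step such as K-means; not computed here).
    Returns {label: (reference_curve, device_count)}.
    """
    grouped = defaultdict(list)
    for curve, label in zip(curves, labels):
        grouped[label].append(curve)

    references = {}
    for label, members in grouped.items():
        count = len(members)
        # zip(*members) walks the curves time point by time point.
        mean_curve = [sum(points) / count for points in zip(*members)]
        references[label] = (mean_curve, count)
    return references


# Illustrative data only: three devices, four time points each.
historical_curves = [
    [0, 10, 30, 60],
    [0, 20, 50, 80],
    [0, 5, 10, 15],
]
cluster_labels = ["profile_a", "profile_a", "profile_b"]
reference_store = build_reference_curves(historical_curves, cluster_labels)
```

Keeping the device count next to each averaged curve mirrors the indication of the number of historical devices that operation 526 stores with each reference erase-count curve.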
[0069] Operation 528 determines probabilities relative to each device profile. In an embodiment, the probabilities may be determined from the relative numbers of devices in each cluster. These probabilities may include conditional probabilities relative to the A and H attributes of the devices as described below.
[0070] Operation 542 of method 540, performed by the calculate module 504, selects a device profile from the device profile store 508 and, thus, a reference erase-count curve for a target SSD device from the reference curve store 510. The device profile is selected using conditional probabilities based on the device attributes and usage attributes obtained from the testing data source 507. Operation 542 uses current testing data, including current values of A and H, to calculate the conditional probability that a target SSD device corresponds to a particular drive profile.
[0071] To determine the drive profile and, thus, the reference erase-count curve for a target SSD device, let D = {D1, D2, ..., Dm} represent the drive profiles, and let A = {A1, A2, ..., An} represent the device attributes (e.g., attributes that are set once a drive is made), where m and n are the numbers of drive profiles and device attributes, respectively. Each attribute has a pool of options, such as 0 and 1. In addition, let H = {H1, H2, ...} represent the usage attributes (e.g., the attributes that depend on how the devices are being used and can be changed by users). Device attributes, A, affect usage attributes, H, because users tend to use a device according to its specifications. Thus, the probability of a particular set of device attributes, A, can be estimated based on a set of usage attributes, H. Consequently, there are causal relationships among attributes A and H, and the drive profiles. Thus, conditional probabilities may be used to select a drive profile for a target SSD device based on the device attributes and usage attributes for the target SSD device.
[0072] Not every data center has all SSD devices corresponding to all drive profiles, and not every usage attribute is known in practice. The conditional probability of a drive profile given the known device attributes can be calculated using the historical data, as described above with reference to FIG. 5A. Conditional probabilities applied to the testing data may be used to select the most probable drive profile for a target SSD device. As described above, the conditional probability of a set of device attributes given the drive profile, p(A|D), is known. Drive profiles and device attributes can indicate the probability of usage attributes. That is to say, given D and A, the probability p(H|D, A) is known. Thus, even without all the information in testing data, a drive profile can be determined by choosing the one with the highest conditional probability. The conditional probability values may be calculated using either or both of equations (1a) and (1b).
Figure imgf000022_0001
[0073] Embodiments determine the probabilities of p(H|D, A), p(A|D), p(A|D, H), p(H|D), and p(D) using both the training data and the historical data. Operation 542 of the example method 540 determines a drive profile for a target SSD device as the drive profile having the largest value of posterior probability based on the attributes A and H for the target device. A data center (user) can define H according to the services that the data center provides and the particular hardware used by the data center. Thus, usage attributes, H, may vary among different drive profiles. In training, it is advantageous to define both the device attributes, A, and the usage attributes, H, in detail, so the calculation of probabilities will cover all possible drive types used in the data center. However, this uses a large amount of data and time to define the drive profiles. In practice, when the testing data does not have a specific attribute that is defined in training, based on the historical data, the probabilities relative to that attribute can be simply set to 0 or 1 depending on the situation.
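The selection of the most probable drive profile can be sketched in Python as follows. This is a simplified illustration, not the calculation of equations (1a) and (1b): it assumes, for brevity, that the attributes are conditionally independent given the drive profile, and it simply skips attributes that are missing from the testing data. The function name `select_drive_profile`, the attribute names, and all probability values are hypothetical.

```python
def select_drive_profile(priors, attr_likelihoods, observed):
    """Return (most probable profile, normalized posterior per profile).

    priors:           {profile: p(D)}
    attr_likelihoods: {profile: {attribute: {value: probability}}}
    observed:         {attribute: value} for the target device; attributes
                      that were not observed are simply absent.
    Assumption: attributes are independent given the profile.
    """
    scores = {}
    for profile, prior in priors.items():
        score = prior
        for name, value in observed.items():
            table = attr_likelihoods[profile].get(name)
            if table is None:
                continue  # attribute not modeled for this profile
            score *= table.get(value, 0.0)
        scores[profile] = score

    total = sum(scores.values())
    if total == 0:
        raise ValueError("no profile is compatible with the observed attributes")
    posteriors = {profile: score / total for profile, score in scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Illustrative numbers only.
priors = {"bank": 0.6, "university": 0.4}
likelihoods = {
    "bank": {
        "enterprise_grade": {1: 0.9, 0: 0.1},
        "overnight_use": {1: 0.2, 0: 0.8},
    },
    "university": {
        "enterprise_grade": {1: 0.5, 0: 0.5},
        "overnight_use": {1: 0.7, 0: 0.3},
    },
}
best_full, post_full = select_drive_profile(
    priors, likelihoods, {"enterprise_grade": 1, "overnight_use": 1}
)
best_partial, post_partial = select_drive_profile(
    priors, likelihoods, {"enterprise_grade": 1}
)
```

With both attributes observed the overnight usage tips the choice toward the university profile, while with only the device attribute observed the larger prior and likelihood favor the bank profile, which illustrates how a profile can still be chosen from incomplete testing data.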
[0074] Presumably, every attribute affects the shape of an erase-count curve. Even though two drives have the same device profile, their actual erase-count curves may not be the same, because the exact value of each attribute can be different. However, devices with the same device profile usually have similar actual erase-count curves.
[0075] After operation 542 selects the device profile for the target SSD device, operation 544, performed by the prediction module 512, analyzes differences between the actual erase-count curve of the target SSD device, obtained from the testing data, and the assigned reference erase-count curve, determined from the device profiles store 508 and reference curve store 510, to determine parameters of a noise function representing the difference between these curves. Operation 544 models the noise function as a Gaussian function to approximate this difference.
[0076] An example of operation 544 is described below. Once operation 542 selects a device profile of the target SSD device, the reference erase-count curve of the target SSD device is known. Let y represent the reference erase-count curve, and y’ be the actual erase-count curve of the target SSD device. The erase-count curve y’ can be represented by equation (2).
y’ = y + E (2)
The noise function, E, changes according to the values of the device attributes and usage attributes.
[0077] Operation 546 determines the parameters of the noise function, E, using the testing data. Erase-counts of given time points are fit into the curve by combining the values of the reference erase-count curve, y, and the values of the noise function, E. Although embodiments model E as a Gaussian function, other types of functions can be used. Operation 546 predicts a set of parameters of the Gaussian function. Operation 548 generates the Gaussian function using the predicted set of parameters and combines the result with the value of the reference erase-count curve to generate an estimated erase-count curve that describes the target SSD device over its entire lifetime. Operations 546 and 548 estimate parameters of the function E over a range of times to generate a predicted erase-count curve over any time range.
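As a rough illustration of how a reference curve might be augmented and then used for lifetime estimation, the following Python sketch applies y’ = y + E at sampled times. The Gaussian parameterization used here (amplitude, center, width, offset) is an assumption made for the example and is not claimed to match the parameters of equation (3); the end of life is also assumed, for illustration, to be the first sampled time at which the predicted erase-count reaches the maximum erase-count of the device.

```python
import math


def gaussian_noise(t, amplitude, center, width, offset):
    """Hypothetical Gaussian-shaped noise term E(t) (assumed form)."""
    return offset + amplitude * math.exp(-((t - center) ** 2) / (2.0 * width ** 2))


def augment_curve(times, reference, params):
    """Apply y'(t) = y(t) + E(t) at each sampled time."""
    return [y + gaussian_noise(t, **params) for t, y in zip(times, reference)]


def first_time_reaching(times, curve, max_erase_count):
    """Return the first sampled time at which the curve reaches the limit,
    or None if the limit is not reached within the sampled range."""
    for t, value in zip(times, curve):
        if value >= max_erase_count:
            return t
    return None


# Illustrative values only: power-on hours and erase counts.
times = [0, 500, 1000, 1500, 2000]
reference = [0, 20000, 50000, 80000, 95000]
params = {"amplitude": 10000.0, "center": 1500.0, "width": 400.0, "offset": 0.0}

predicted = augment_curve(times, reference, params)
end_of_life = first_time_reaching(times, predicted, max_erase_count=90000)
reference_end_of_life = first_time_reaching(times, reference, max_erase_count=90000)
```

In this made-up example the noise term moves the estimated end of life one sample earlier than the reference curve alone would suggest, which is the kind of per-device correction the augmented curve is meant to capture.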
[0078] Referring to FIG. 6, curve 602 represents the actual erase-count curve, y’, while curve 604 represents the reference erase-count curve, y, associated with the drive profile, D, of the target SSD device. Equation (3) describes the noise function E in terms of points along a curve x that represents the difference between the actual erase-count curve y’ and the reference erase-count curve y.
Figure imgf000023_0001
The values of the parameters, s, r, k, and l, of the noise function E are derived as shown in equations (4) through (14). This example derives the values of the parameters based on 11 points on the difference curve x, each point associated with a different set of parameter values of E between t-5 and t+5, as shown in FIG. 6. The eleven points shown in FIG. 6 are only an example. More or fewer sets of parameter values may be used to derive the values of the parameters for the noise function, E. In these equations, the value of the index of each point is m. Thus, xm is the value of x(t) at t = m (shown below as tm).
Figure imgf000023_0002
Figure imgf000024_0001
From the above, equation (10) defines terms that may be used to derive the parameters k, l, r, and s according to equations (11) - (14).
Figure imgf000024_0002
[0079] Operation 546 of FIG. 5C, which is performed by the prediction module 512, applies the parameters of the Gaussian function to an LSTM network to extract change trends from the parameters. Specifically, parameters calculated at different time points along the difference function, x, are sent to LSTM units to train the LSTM and to predict the best set of parameters for all the unknown time points. Note that the prediction module uses the testing data as both training data and testing data for the LSTM network, because all the parameters are obtained from testing data. The model is robust to local changes because respective LSTM networks predict each parameter of the function E.
[0080] The parameters of the noise function, E, calculated at a particular point in the lifetime of the SSD device are not necessarily the best parameters because they are derived from a relatively small number of points on the reference erase-count curve and the actual erase-count curve. In addition, the values of the parameters may change during operation of the target SSD device due to changes in the usage attributes (e.g., the target SSD device being used by a different set of users and/or with a different application). Accordingly, while the parameters calculated at any particular point in time provide a good estimate of the noise function, they are not likely to be the optimal parameters for the noise function such that the combination of the noise function and the reference erase-count curve correctly models the remaining lifetime of the target SSD device.
[0081] As described above with reference to operation 546 of FIG. 5C, the parameters, s, r, k, and l, of the noise function, E, are generated by one or more LSTM networks. Specifically, different values of each parameter at different time points are input to respective cells of the LSTM network to train the neural network to predict the best set of parameters for all the unknown time points.
[0082] FIG. 7 is a block diagram of an exemplary LSTM network 720, which is actually a set of LSTM networks, one for each parameter, s, r, k, and l, of the Gaussian noise function described above with reference to equation (3). In the exemplary embodiment, each LSTM network includes N LSTM cells. The LSTM network for the parameter s includes LSTM cells 722A, 722B, and so on up to 722N; the LSTM network for the parameter r includes LSTM cells 724A, 724B, and so on up to 724N; the LSTM network for the parameter k includes LSTM cells 726A, 726B, and so on up to 726N; and the LSTM network for the parameter l includes LSTM cells 728A, 728B, and so on up to 728N. Although the LSTM networks are shown as parallel networks, embodiments may calculate the values for each of the parameters at different times using a single LSTM network, that is, in a serial manner.
[0083] FIG. 8A shows details of an example LSTM cell 800 and FIG. 8B shows multiple LSTM cells, 802, 804, and 806, each corresponding to the cell 800, connected to produce an LSTM network. Each of the LSTM cells shown in FIGs. 8A and 8B implements equations (15) - (19).
Figure imgf000025_0001
Figure imgf000026_0001
In equations (15) - (19), xt is the input vector to the LSTM cell, ft is the activation vector of the forget gate, it is the activation vector for the input gate, ot is the activation vector for the output gate, ht is the output vector of the LSTM cell, ct is the cell state vector, Wf, Wi, Wo, Wc, Uf, Ui, and Uo are weight matrixes, and bf, bi, bo, and bc are bias vectors. The weight matrixes and the bias vectors are learned during the training of the LSTM network. The activation function σg is a sigmoid function, and the activation functions σc and σh are hyperbolic tangent (tanh) functions.
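For orientation, one step of a conventional LSTM cell can be sketched in plain Python using the symbols defined above. This is the commonly used textbook formulation and is offered only as an illustration; it is not asserted to reproduce equations (15) - (19) exactly. The recurrent matrix `Uc` for the cell candidate, the helper `affine`, and the parameter dictionary layout are assumptions introduced for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a standard LSTM cell; returns (h_t, c_t)."""
    f = [sigmoid(z) for z in affine(p["Wf"], x, p["Uf"], h_prev, p["bf"])]  # forget gate
    i = [sigmoid(z) for z in affine(p["Wi"], x, p["Ui"], h_prev, p["bi"])]  # input gate
    o = [sigmoid(z) for z in affine(p["Wo"], x, p["Uo"], h_prev, p["bo"])]  # output gate
    g = [math.tanh(z) for z in affine(p["Wc"], x, p["Uc"], h_prev, p["bc"])]  # candidate
    # New cell state: keep part of the old state, add part of the candidate.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Output: gated tanh of the cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Tiny example: one input feature, one hidden unit, all weights and biases zero.
zero = [[0.0]]
params = {name: zero for name in ("Wf", "Uf", "Wi", "Ui", "Wo", "Uo", "Wc", "Uc")}
params.update({name: [0.0] for name in ("bf", "bi", "bo", "bc")})
h_t, c_t = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[1.0], p=params)
```

With all weights at zero every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved, which makes the gating arithmetic easy to check by hand.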
[0084] Operation 546 uses the LSTM network 720 to predict the parameters of E at different points in time. The predicted values of E are combined with the reference erase-count curve to generate an augmented erase-count curve as shown in operation 548. As described above, operation 312 of FIG. 3 uses the augmented erase-count curve to predict the lifetime of the target SSD device. Using the LSTM network 720 to predict the noise function parameters is advantageous as it increases the robustness of the example method and reduces errors caused by fluctuation of the model over time. Predicting the noise function parameters also takes into account the overall dataset for the target device up to the last time point to predict the most recent set of parameters. During prediction, operation 546 can use data representing the entire erase-count curve or only a few data points of the target SSD device.
[0085] The first framework, described above, for predicting the lifetime of SSD devices uses a Gaussian function for the noise function, E. The actual noise function, however, may not conform to a Gaussian function. The second framework addresses this issue by using a trained encoder-decoder CNN to define a set of parameters, W, for the function y that map respective sets of attributes A and H to respective reference erase-count functions, y. The difference between the reference erase-count function and the actual erase-count function is represented by a set of parameters, W’. As described below, the parameters W’ form a two-dimensional matrix; consequently, they cannot be predicted using a regular LSTM network. The embodiments described below address this by using a convolutional LSTM network to estimate the parameter matrix W’ for the function E.
[0086] The second framework modifies equation (2) as shown in equation (20).
y’ = y + E(A, H, W’) (20)
In equation (20), W’ is a matrix that represents the difference between the reference data and the test data with respect to the parameters, W, learned by the encoder-decoder CNN.
[0087] In the second framework, y is also a function of the attributes A and H. This may be represented by equation (21), in which W is a matrix of parameters learned by the encoder-decoder CNN and used with the attributes A and H to determine a reference erase-count function, y.
y = F(A, H, W) (21)
[0088] As shown in equation (20), the noise function, E, used in the second framework is a function of A, H, and W’, where values of the matrix W’ are predicted using the convolutional LSTM network.
[0089] The second framework defines a further matrix V such that V = (A, H). Each row of V is one of the attributes from A and H. Because the number of values of each attribute row in V is not the same, some of the elements may be filled with zeroes or predefined values. In the example second framework, the reference erase-count curve, y, is a two-dimensional matrix with a time value, t, for each column and an erase-count for each row.
[0090] FIG. 9 is a flow-chart diagram of an example method 900 according to the second framework that generates an augmented erase-count curve which may be used to predict lifetimes for a target SSD device. Operation 902 trains an encoder-decoder CNN 1000 (shown in FIG. 10) to learn the parameters, W, that are used by the encoder-decoder CNN 1000 to associate attributes A and H of an SSD device with a reference erase-count curve for the device. The parameters, W, are relative to all of the attributes A and H. With reference to FIG. 10, operation 902 of FIG. 9 applies the matrix V, including the attributes A and H of the devices in the historical data set, as training data to an input layer 1002 of the encoder-decoder CNN 1000. The encoder-decoder CNN 1000 learns the parameters W that associate different matrixes V with different reference erase-count curves, y. The output 1004 of the encoder-decoder CNN 1000 is a reference erase-count curve y, corresponding to the device profile into which the SSD device was classified based on the matrix V of attributes for the device and the parameters, W.
[0091] After the encoder-decoder CNN 1000 is trained, operation 904 of FIG. 9 applies the matrix V for a target SSD device to the input layer 1002 to determine the reference erase-count curve, y, for the target SSD device based on the parameters, W. The method 900 determines initial values of the parameters W’ by solving equation (20) based on the portion of the actual current erase-count curve, y’, of the target SSD device known from the testing data, the attributes H and A (e.g., the matrix V), and the determined reference erase-count curve, y.
[0092] Operation 906 applies the initial values of the parameters, W’, to the convolutional LSTM network 1100, shown in FIG. 11, to predict the unknown parameters, W’, that define the noise function E(A,H,W’) of the second framework. Convolutional LSTM network 1100 includes convolutional LSTM cells 1102A, 1102B, and 1102N which respectively receive input values w’1, w’t, and w’n (e.g., particular noise function parameter values at different times) from input terminals 1104A, 1104B, and 1104N. Output values, h1, ht, and hn, from the convolutional LSTM network 1100 are provided at output terminals 1106A, 1106B, and 1106N, respectively. The example convolutional LSTM network 1100 predicts spatial features of the matrix W’ over time. Equations (22) - (26) define the operations performed by each cell of the convolutional LSTM network 1100 using the standard notation for LSTM cells.
Figure imgf000028_0001
ft, it, Ot, Ct, and ht have the same meaning as in the conventional LSTM described above with reference to FIGs. 8A and 8B. The vectors
Figure imgf000028_0002
Figure imgf000029_0001
are weighting vectors that are learned by the convolutional LSTM 1100 from the training data set. Similarly, the vectors bf, bi, bo, and bc are offset vectors that are learned by the convolutional LSTM from the training data set. As with the first framework, the testing data are used both to train the convolutional LSTM 1100 and to predict parameter values for the function E.
[0093] The output values of the neural networks shown in FIG. 11, h1 through hn, are values of the respective parameters, w’, at different instants as determined by the convolutional LSTM network. The output of the convolutional LSTM is a set of parameters W’ that define the noise function E, which, when combined with the reference erase-count curve function, F, by operation 908 of FIG. 9, produces the augmented erase-count curve, y’, for the target SSD device as shown in equation (20). As described above with reference to FIG. 3, operation 312 uses the augmented erase-count curve y’ to predict the lifetime of the target SSD device. Embodiments generate maintenance recommendations to reassign the target SSD device, to copy data from the target SSD device, and/or to replace the target SSD before the end of its expected lifetime, as determined from the augmented erase-count curve y’.
[0094] As described above, the drive profile associated with the target SSD device may change during the lifetime of the target SSD device. For example, in a data center, a hard drive can be assigned to bank users in the first six months of a year but to university users in the second six months of the year. FIG. 12 is a flow-chart diagram showing an example method 1200 to detect when the drive profile associated with a target SSD device does not agree with the current erase-count curve, indicating a change in the drive profile associated with the target SSD device. The example method 1200 implements the example operation 310, described above with reference to FIG. 3. An embodiment detects a change in drive profile by detecting changes in the device attributes, A, changes in the augmented erase-count curve, y’, and/or changes in the attributes of the noise function, E, of the first framework, or the noise function, E, of the second framework. Operation 1202 determines whether any of the attributes of A for the target SSD device has changed by an amount greater than a threshold TA for that attribute. Operation 1202 may not check all of the attributes as some attributes will not change and, thus, do not need to be checked. The value of TA depends on the attribute of A being tested and on the tolerances of the target SSD device for the attribute of A. When operation 1202 detects a change in one of the attributes of A that is greater than the corresponding threshold, operation 1208 returns a value to operation 310 of FIG. 3 indicating that the drive profile should be changed.
[0095] If, in operation 1202, none of the attributes, A, changed by an amount greater than its corresponding threshold, operation 1204 analyzes trends in the augmented erase-count curve, y’, generated from the testing data to determine if the trend exhibits a change greater than a threshold value TC. This step detects a change in the shape of the augmented erase-count curve y’. If the shape of the erase-count curve changes, it is usually due to a change in the H attributes. The decision to change the drive profile based on a change in the shape of the augmented erase-count curve y’, however, should be made carefully, because a large number of factors, such as noise, can lead to these changes. The duration, frequency, and magnitude of changes in the y’ curve should be considered. An embodiment compares the absolute value of the first partial derivative (e.g., slope) of the curve y’ with respect to time (e.g., $\left|\partial y'/\partial t\right|$) to a first threshold value TC1 and compares the absolute value of the second partial derivative (e.g., change in slope) of the augmented erase-count curve y’ (e.g., $\left|\partial^2 y'/\partial t^2\right|$) to a second threshold value TC2 over multiple time intervals. If either of these partial derivatives is greater than its corresponding threshold, operation 1208 returns the value indicating that the drive profile should be changed.
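The derivative test of operation 1204 may be sketched with finite differences on a sampled curve. This is an illustration under assumptions: the uniform sampling interval, the finite-difference approximation, and the rule that a threshold must be exceeded for a minimum number of consecutive intervals (one possible reading of "over multiple time intervals") are not taken from the source, and all names are hypothetical.

```python
def first_differences(values, dt=1.0):
    """Approximate the time derivative of a uniformly sampled series."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def longest_run(values, threshold):
    """Length of the longest run of consecutive samples whose magnitude exceeds the threshold."""
    best = run = 0
    for v in values:
        run = run + 1 if abs(v) > threshold else 0
        best = max(best, run)
    return best

def trend_exceeds(curve, t_c1, t_c2, dt=1.0, min_intervals=3):
    """True when |dy'/dt| > TC1 or |d2y'/dt2| > TC2 persists for `min_intervals` intervals."""
    slopes = first_differences(curve, dt)
    curvatures = first_differences(slopes, dt)
    return (
        longest_run(slopes, t_c1) >= min_intervals
        or longest_run(curvatures, t_c2) >= min_intervals
    )

# Steady wear, a sustained increase in wear rate, and a single noisy spike.
steady = [0, 10, 20, 30, 40, 50]
sustained = [0, 10, 20, 50, 80, 110, 140]
spike = [0, 10, 20, 50, 60, 70]
results = [trend_exceeds(c, t_c1=15.0, t_c2=100.0) for c in (steady, sustained, spike)]
```

Requiring persistence is one way to account for the duration and frequency of a change, so that a single noisy sample does not by itself trigger a profile change.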
[0096] If operation 1204 detects no significant trend change in the augmented erase-count curve, operation 1206 compares trends in each of the parameters of the noise function E or E, depending on the framework being used, to respective threshold values to determine whether a trend might indicate a need to change the reference erase-count curve, y. Operation 1206 compares the absolute values of the first and second partial derivatives of each of the parameters with respect to time to a respective threshold value for the parameter over multiple time intervals. When the trend of any parameter exceeds its respective threshold, operation 1208 returns the value indicating that the drive profile should be changed. When operation 1206 determines that all of the parameter trends are within their respective thresholds, operation 1210 returns a value to operation 310 of FIG. 3 indicating that the current drive profile and corresponding reference erase-count curve are correct.
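The ordering of the checks in method 1200 may be sketched as a short-circuit cascade. The return values, the function name, and the representation of each check as a callable are hypothetical; the individual checks are stubbed with constant results so that only the control flow is shown.

```python
from typing import Callable, Optional, Sequence, Tuple

CHANGE_PROFILE = "change_profile"  # hypothetical value returned by operation 1208
KEEP_PROFILE = "keep_profile"      # hypothetical value returned by operation 1210

def run_method_1200(
    checks: Sequence[Tuple[int, Callable[[], bool]]]
) -> Tuple[str, Optional[int]]:
    """Evaluate checks in order and stop at the first one that exceeds its threshold.

    Each entry pairs an operation number with a callable that returns True when
    the corresponding trend or change exceeds its threshold.
    """
    for operation, exceeded in checks:
        if exceeded():
            return CHANGE_PROFILE, operation
    return KEEP_PROFILE, None

# Stubbed checks: attributes are stable, the curve is stable, a noise parameter drifts.
decision = run_method_1200([
    (1202, lambda: False),  # device attributes A against TA
    (1204, lambda: False),  # augmented curve y' against TC1 and TC2
    (1206, lambda: True),   # noise-function parameters against their thresholds
])
```

Because the cascade stops at the first exceeded threshold, the later and more noise-sensitive trend analyses run only when the cheaper attribute comparison finds nothing.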
[0097] FIG. 13 is a block diagram of a server computing device 1300, according to an embodiment. Similar components may be used in other types of computing devices. For example, the clients, servers, and network resources may each use a different set of the components shown in FIG. 13 and/or computing components not shown in FIG. 13, such as the components shown in FIG. 1 and the storage device management unit 208, shown in FIG. 2, to execute the methods shown in FIGs. 3, 5B, 5C, 9, and 12.
[0098] One example computing device 1300 may include a processing unit (e.g., one or more processors and/or CPUs) 1302, memory 1303, removable storage 1310, and non-removable storage 1312 communicatively coupled by a bus 1301. Although the various data storage elements are illustrated as part of the computing device 1300, the removable storage 1310 may also or alternatively include storage in one of the nodes 108A through 108N of the data center 104 accessible via the high-speed I/O channel 110, shown in FIG. 1, and the high-speed I/O interface 206, shown in FIG. 2.
[0099] Memory 1303 may include volatile memory 1314 and non-volatile memory 1308. Computing device 1300 may include or have access to a computing environment that includes a variety of computer-readable media, such as volatile memory 1314 and non-volatile memory 1308, removable storage 1310 and non-removable storage 1312. Computer storage includes random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disk (DVD) or other optical disk storage devices, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
[00100] Computing device 1300 may include or have access to a computing environment that includes input interface 1306, output interface 1304, and communication interface 1316. Output interface 1304 may provide an interface to a display device, such as a touchscreen, that also may serve as an input device. The input interface 1306 may provide an interface to one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the server computing device 1300, and/or other input devices. The computing device 1300 may operate in a networked environment using a communication interface 1316 to connect to one of the networks 102 and/or with one or more of the server nodes 108A through 108N of the data center 104 using the high-speed I/O channel 110. The communication interface may include one or more of an interface to a local area network (LAN), a wide area network (WAN), a cellular network, a WLAN network, and/or a Bluetooth® network.
[00101] The processor 1302 of the server computing device 1300 executes computer-readable instructions stored on a computer-readable storage medium. Computer-readable instructions may include applications 1318 such as the methods 300, 520, 540, 900, and 1200, described above, stored in the memory 1303. A hard drive, CD-ROM, RAM, and flash memory are some examples of articles including a non-transitory computer-readable medium such as a storage device. The terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory.
[00102] The functions or algorithms described herein may be implemented using software in one embodiment. The software may consist of computer-executable instructions stored on computer-readable media or a computer-readable storage device such as one or more non-transitory memories or other types of hardware-based storage devices, either local or networked, such as in applications 1318. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on one or more CPUs, including a single-core or a multi-core processor, digital signal processor, application specific integrated circuit (ASIC), microprocessor, or other type of processor operating on a computer system, turning such computer system into a specifically programmed machine.
[00103] Although a few embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.

Claims

CLAIMS What is claimed is:
1. A computer implemented method for predicting a lifetime of a target digital device, the method comprising: determining, based on device attributes of the target digital device and usage attributes for the target digital device, a first device profile associated with a first reference performance measure curve; determining a first set of parameters of a noise function, wherein a combination of the noise function, having the first set of parameters, and the first reference performance measure curve corresponds to a current performance measure curve of the target digital device; calculating a first set of predicted parameters of the noise function based on the first set of parameters; combining the noise function having the first set of predicted parameters with the first reference performance measure curve to produce a first augmented performance measure curve that predicts a future performance measure of the target digital device; and predicting the lifetime of the target digital device based on the first augmented performance measure curve.
2. The computer implemented method of claim 1, further comprising: determining a trend in at least one of the device attributes of the target digital device, in the first augmented performance measure curve, or in at least one parameter of the first set of predicted parameters; in response to the determined trend exceeding a respective threshold value, determining a second device profile for the target digital device, different from the first device profile, the second device profile including a second reference performance measure curve; and determining a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to the current performance measure curve of the target digital device; calculating a second set of predicted parameters for the noise function based on the second set of parameters; combining the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve that predicts the future performance measure of the target digital device; and predicting the lifetime of the target digital device based on the second augmented performance measure curve.
3. The computer implemented method of claim 1, wherein the determining of the first device profile for the target digital device includes: calculating respective conditional probability values for a plurality of device profiles based on the device attributes of the target digital device and the usage attributes of the target digital device to generate a plurality of conditional probability values for the plurality of device profiles, respectively; and selecting, as the first device profile, one of the plurality of device profiles having a highest conditional probability value among the plurality of conditional probability values.
4. The computer implemented method of claim 1, wherein the calculating of the first set of predicted parameters of the noise function includes applying the first set of parameters of the noise function to a long short term memory (LSTM) network.
5. The computer implemented method of claim 1, wherein the noise function conforms to a Gaussian function and the first set of parameters includes a set of parameters of the Gaussian function that cause the Gaussian function to approximate a difference between the first reference performance measure curve and the current performance measure curve of the target digital device.
6. The computer implemented method of claim 1, wherein the target digital device includes a solid state drive (SSD) device, the first reference performance measure curve is an erase-count curve, and the first augmented performance measure curve of the target digital device is a predicted erase-count curve.
7. The computer implemented method of claim 1, wherein the determining of the first device profile includes applying the device attributes and the usage attributes of the target digital device to an encoder-decoder convolutional neural network (CNN) having trained parameters.
8. The computer implemented method of claim 7, wherein the determining of the first set of parameters of the noise function includes defining the noise function as a function of the device attributes of the target digital device, the usage attributes of the target digital device, and the first set of parameters of the noise function, wherein the defined noise function corresponds to a difference between the current performance measure curve and the first reference performance measure curve of the target digital device.
9. The computer implemented method of claim 8, wherein the calculating of the first set of predicted parameters of the noise function includes applying the first set of parameters of the noise function to a convolutional LSTM network.
10. An apparatus for predicting a lifetime of a target digital device, the apparatus comprising: a memory including program instructions; and a processor, coupled to the memory, wherein the program instructions configure the processor to perform operations including: determining, based on device attributes of the target digital device and usage attributes of the target digital device, a first device profile associated with a first reference performance measure curve; determining a first set of parameters of a noise function, wherein a combination of the noise function, having the first set of parameters, and the first reference performance measure curve corresponds to a current performance measure curve of the target digital device; calculating a first set of predicted parameters of the noise function based on the first set of parameters; combining the noise function having the first set of predicted parameters with the first reference performance measure curve to produce a first augmented performance measure curve that predicts a future performance measure of the target digital device; and predicting the lifetime of the target digital device based on the first augmented performance measure curve.
11. The apparatus of claim 10, wherein the operations further comprise: determining a trend in at least one of the device attributes of the target digital device, in the first augmented performance measure curve, or in at least one parameter of the first set of predicted parameters; in response to the determined trend exceeding a respective threshold value, selecting a second device profile for the target digital device, different from the first device profile, the second device profile including a second reference performance measure curve; determining a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to the current performance measure curve of the target digital device; calculating a second set of predicted parameters for the noise function based on the second set of parameters; combining the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve that predicts the future performance measure of the target digital device; and predicting the lifetime of the target digital device based on the second augmented performance measure curve.
12. The apparatus of claim 10, wherein the operation of determining the first device profile includes: calculating respective conditional probability values for a plurality of device profiles based on the device attributes of the target digital device and the usage attributes of the target digital device to generate a plurality of conditional probability values for the plurality of device profiles, respectively; and selecting, as the first device profile, one of the plurality of device profiles having a highest conditional probability value among the plurality of conditional probability values.
13. The apparatus of claim 10, wherein the operation of calculating the first set of predicted parameters of the noise function includes applying the first set of parameters of the noise function to a long short term memory (LSTM) network.
14. The apparatus of claim 10, wherein the noise function conforms to a Gaussian function and the first set of parameters includes a set of parameters of the Gaussian function that cause the Gaussian function to represent a difference between the first reference performance measure curve and the current performance measure curve of the target digital device.
15. The apparatus of claim 10, wherein the target digital device includes a solid state drive (SSD) device, the first reference performance measure curve is an erase-count curve, and the first augmented performance measure curve of the target digital device is a predicted erase-count curve.
16. The apparatus of claim 10, wherein the operation of determining the first device profile includes applying the device attributes and the usage attributes of the target digital device to an encoder-decoder convolutional neural network (CNN) having trained parameters.
17. The apparatus of claim 16, wherein the operation of determining the first set of parameters of the noise function includes defining the noise function as a function of the device attributes of the target digital device, the usage attributes of the target digital device, and the first set of parameters of the noise function, wherein the defined noise function corresponds to a difference between the current performance measure curve and the first reference performance measure curve of the target digital device.
18. The apparatus of claim 17, wherein the operation of calculating the first set of predicted parameters of the noise function includes applying the first set of parameters of the noise function to a convolutional LSTM.
19. A computer-readable medium comprising program instructions for predicting a lifetime of a target digital device, the program instructions, when executed by a processor, cause the processor to perform operations including: determining, based on device attributes of the target digital device and usage attributes of the target digital device, a first device profile associated with a first reference performance measure curve; determining a first set of parameters of a noise function, wherein a combination of the noise function, having the first set of parameters, and the first reference performance measure curve corresponds to a current performance measure curve of the target digital device; calculating a first set of predicted parameters of the noise function based on the first set of parameters; combining the noise function having the first set of predicted parameters with the first reference performance measure curve to produce a first augmented performance measure curve that predicts a future performance measure of the target digital device; and predicting the lifetime of the target digital device based on the first augmented performance measure curve.
20. The computer-readable medium of claim 19, wherein the operations further comprise: determining a trend in at least one of the device attributes of the target digital device, in the first augmented performance measure curve, or in at least one parameter of the first set of predicted parameters; in response to the determined trend exceeding a respective threshold value, determining a second device profile for the target digital device, different from the first device profile, the second device profile including a second reference performance measure curve; determining a second set of parameters of the noise function, wherein a combination of the noise function, having the second set of parameters, and the second reference performance measure curve corresponds to the current performance measure curve of the target digital device; calculating a second set of predicted parameters for the noise function based on the second set of parameters; combining the noise function having the second set of predicted parameters with the second reference performance measure curve to produce a second augmented performance measure curve that predicts the future performance measure of the target digital device; and predicting the lifetime of the target digital device based on the second augmented performance measure curve.
PCT/US2020/026164 2019-08-23 2020-04-01 Device lifetime prediction WO2021040810A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962891125P 2019-08-23 2019-08-23
US62/891,125 2019-08-23

Publications (1)

Publication Number Publication Date
WO2021040810A1 true WO2021040810A1 (en) 2021-03-04

Family

ID=70457122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/026164 WO2021040810A1 (en) 2019-08-23 2020-04-01 Device lifetime prediction

Country Status (1)

Country Link
WO (1) WO2021040810A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5719675A (en) * 1992-06-16 1998-02-17 Honeywell Inc. Laser gyro life prediction
US20070198786A1 (en) * 2006-02-10 2007-08-23 Sandisk Il Ltd. Method for estimating and reporting the life expectancy of flash-disk memory
US20150046635A1 (en) * 2013-08-07 2015-02-12 SMART Storage Systems, Inc. Electronic System with Storage Drive Life Estimation Mechanism and Method of Operation Thereof

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592976A (en) * 2024-01-19 2024-02-23 山东豪泉软件技术有限公司 Cutter residual life prediction method, device, equipment and medium
CN117592976B (en) * 2024-01-19 2024-04-26 山东豪泉软件技术有限公司 Cutter residual life prediction method, device, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20721346

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20721346

Country of ref document: EP

Kind code of ref document: A1