White Paper On MTBF
ELECTRONIC SYSTEM
RELIABILITY - WHY IMPORTANT?
• PROBLEMS
– Electronic systems involve very large numbers of
very similar components.
– The designer has little control over their production
and manufacture but must specify catalogue items.
– The designer has little control over device reliability.
– Control of the production process is a major
determinant of reliability.
– It is difficult to test for electronic component defects
that do not immediately affect performance.
• SOLUTION: Very close attention must be paid to
electronics part reliability. The design must involve a
reliability team.
OUTLINE
• DEFINITIONS
• CAUSES OF ELECTRONIC
COMPONENT FAILURE
• PREDICTION METHODS - TEST
• MIL-HDBK-217 PREDICTION
METHODS - CALCULATIONS
– PARTS STRESS ANALYSIS
PREDICTIONS
– PARTS COUNT RELIABILITY
METHOD
– LIMITATIONS
• ADDITIONAL INFORMATION
– Other Failure Rate Data Sources
– Arrhenius Model
DEFINITIONS
• OPERATING STRESS
– The actual stress (or load) applied during operation of
the part (e.g. voltage for capacitors, dissipated power
for resistors).
• RATED STRESS
– The manufacturer's rating for the part.
• STRESS RATIO
– Ratio of operating stress to rated stress.
• PART GRADES
– Grade 1, 2, etc. designate high-quality standard parts.
– JAN, Industrial and Commercial Grades are designations
for other parts that can be used.
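As a small illustration of the stress ratio definition above, the sketch below divides an operating stress by a rated stress. The function name and the capacitor values (20 V applied to a 50 V part) are hypothetical and chosen only for the example.

```python
def stress_ratio(operating_stress: float, rated_stress: float) -> float:
    """Return operating stress divided by rated stress (both in the same units)."""
    if rated_stress <= 0:
        raise ValueError("rated stress must be positive")
    return operating_stress / rated_stress


# Hypothetical example: a capacitor rated at 50 V that sees 20 V in operation.
capacitor_ratio = stress_ratio(operating_stress=20.0, rated_stress=50.0)
print(f"stress ratio = {capacitor_ratio:.2f}")
```

A ratio well below 1 means the part is operated with margin against its rating.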
BACKGROUND
• Reliability engineering and management grew up largely in
response to the problems of electronic equipment reliability.
• Many reliability techniques have been developed from
electronics applications.
CAUSES OF ELECTRONIC COMPONENT FAILURES
Electronic Failures = f(design, mfg. process, quality type,
temperature, electrical load, vibration, chemical, stresses)
OTHER CAUSES OF ELECTRONIC COMPONENT FAILURES (cont'd)
Electrical Load
• Higher-than-anticipated voltage or current loads can
cause arcing and other damage.
Vibration
• Shock and vibration can cause fatigue damage to even
properly made components.
Chemical
• Contaminants introduced in the manufacturing process
may eventually degrade an IC or other device.
• Environmental contaminants (moisture, etc.) may
promote chemical attacks on components.
1.1 Purpose - The purpose of this handbook is to establish and maintain consistent and
uniform methods for estimating the inherent reliability (i.e., the reliability of a mature
design) of military electronic equipment and systems. It provides a common basis for
reliability predictions during acquisition programs for military electronic systems and
equipment. It also establishes a common basis for comparing and evaluating reliability
predictions of related or competitive designs. The handbook is intended to be used as a
tool to increase the reliability of the equipment being designed.
1.2 Application - This handbook contains two methods of reliability prediction - “Part
Stress Analysis” in Sections 5 through 23 and “Parts Count” in Appendix A. These methods
vary in degree of information needed to apply them. The Part Stress Analysis Method
requires a greater amount of detailed information and is applicable during the later design
phase when actual hardware and circuits are being designed. The Parts Count Method
requires less information, generally part quantities, quality level, and the application
environment. This method is applicable during the early design phase and during proposal
formulation. In general, the Parts Count Method will usually result in a more conservative
estimate (i.e., higher failure rate) of system reliability than the Parts Stress Method.
1.3 Computerized Reliability Prediction - Rome Laboratory - ORACLE is a computer
program developed to aid in applying the part stress analysis procedure of MIL-HDBK-217.
Based on environmental use characteristics, piece part count, thermal and electrical
stresses, subsystem repair rates and system configuration, the program calculates piece
part, assembly and subassembly failure rates. It also flags overstressed parts, allows the
user to perform tradeoff analyses and provides system mean-time-to-failure and
availability. The ORACLE computer program software (available in both VAX and IBM
compatible PC versions) is available at replacement tape/disc cost to all DoD organizations,
and to contractors for application on specific DoD contracts as government furnished
property (GFP). A statement of terms and conditions may be obtained upon written request
to: Rome Laboratory/ERSR,
What is MTBF?
MTBF is an acronym for Mean Time Between Failures. In general, a higher MTBF number
indicates a more reliable product. Beyond this simple definition, you’ll find a wide variety of
special meanings.
In the military/aerospace industries, MTBF is defined by a specific set of calculations. The formula
for system longevity is based on the thermal, electrical and environmental stresses on each
component. The engineer evaluates the components and subassemblies in a particular product
by these formulas and produces an overall number called calculated MTBF.
Another way to compute MTBF is to evaluate product reliability based on the product’s actual
performance in the field. Instead of theoretical calculations of what might occur, field MTBF is a
measure of the numbers and types of failures that the products actually experience in real
applications.
At Liebert, we track two types of field MTBF statistics: critical bus MTBF and hardware MTBF. In
the next few paragraphs, we will explain each of these.
Liebert maintains a database with information on every Series 600 UPS ever shipped. We also
keep records of all reported failures. Each quarter we evaluate the reliability information and tally
up the critical bus outages that were attributable to the UPS or System Control Cabinet.
Some events are excluded from the total. For example, if a UPS experiences an alarm condition
and successfully transfers the load to bypass, there is no critical bus outage.
Likewise, if utility input power fails and the UPS and batteries support the critical load for the
proper number of minutes, the UPS has done its job. If the utility power (or backup diesel
generator) is not available when the UPS has drained the batteries, the UPS -- with ample
warning to the operator -- will perform an orderly shutdown. This is not a chargeable critical bus
outage since the equipment performed as designed.
Other excluded situations are those caused by site conditions or operator error. For example, one
customer wired his facility fire alarm system to trigger the Emergency Power Off circuit on the
UPS. Unfortunately, he forgot to disconnect the circuit before performing a routine test of the fire
alarm system. This caused a critical bus outage, but did not count against UPS MTBF.
As of this writing, we have records of more than 7,000 Series 600 modules in more than 5,500
systems. Cumulative system operating hours exceed 220 million. Since shipments began in
1989, we have records of just 80 critical bus failures. Considering our exposure is approximately
4 million system operating hours per month, this is a remarkably small number of failures.
We compute our field MTBF numbers by dividing system operating hours by “failures plus one.”
We do this to be conservative and to be consistent with earlier published documents. Dividing
220 million hours by 81 (80 + 1) gives us a number considerably in excess of 2 million hours. We
recognize that some Series 600 sites are not under contract to Liebert Global Services and might
not be reporting all failures. Therefore, we choose not to advertise the exact calculated number.
“In excess of one million hours” is sufficient.
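The “failures plus one” calculation described above can be sketched in a few lines of Python. The function name is hypothetical. The inputs are the rounded figures quoted in the text (220 million hours, 80 failures), so the result is only approximate.

```python
def field_mtbf(operating_hours: float, failures: int) -> float:
    """Conservative field MTBF: operating hours divided by (failures + 1)."""
    if failures < 0:
        raise ValueError("failure count cannot be negative")
    return operating_hours / (failures + 1)


# Rounded figures quoted in the text; the true totals are not given exactly.
SYSTEM_OPERATING_HOURS = 220e6
CRITICAL_BUS_FAILURES = 80

critical_bus_mtbf = field_mtbf(SYSTEM_OPERATING_HOURS, CRITICAL_BUS_FAILURES)
print(f"critical bus MTBF is about {critical_bus_mtbf / 1e6:.1f} million hours")
```

Adding one to the failure count keeps the result finite when no failures have been recorded and slightly understates the MTBF otherwise, which is the conservative direction.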
Module MTBF
The other way we track reliability is the field MTBF of the UPS modules. For these purposes, we
count every type of module or System Control Cabinet failure that causes the module to take
itself off-line. As before, we exclude incidents of operator error, site problems or instances of
shutdown after successful discharge of batteries.
To compile this number, we have taken various sample periods. For a challenge, one of the
periods was chosen to coincide with one of the worst heat waves on record in large portions of
the Midwest and Northeast. A difficult test indeed!
During the sample periods, Series 600 UPS modules accumulated approximately 6 million
operating hours and 35 hardware failures. Of these, only one caused a critical bus outage. The
other 34 events featured the UPS successfully transferring the load to the bypass source.
Dividing 6 million hours by 35 gives a module MTBF of approximately 170,000 hours.
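The module figures can be checked the same way. This is a sketch using the approximate sample-period numbers from the text. The variable names are illustrative, and here the hours are divided by the failure count directly, as the text does for module MTBF.

```python
# Approximate sample-period figures quoted in the text.
MODULE_OPERATING_HOURS = 6e6
HARDWARE_FAILURES = 35
CRITICAL_BUS_OUTAGES = 1

# Module MTBF: operating hours divided by hardware failures.
module_mtbf = MODULE_OPERATING_HOURS / HARDWARE_FAILURES

# Failures where the UPS transferred the load to bypass instead of dropping it.
transferred_to_bypass = HARDWARE_FAILURES - CRITICAL_BUS_OUTAGES
bypass_share = transferred_to_bypass / HARDWARE_FAILURES

print(f"module MTBF is about {module_mtbf:,.0f} hours")
print(f"{transferred_to_bypass} of {HARDWARE_FAILURES} failures "
      f"({bypass_share:.0%}) ended in a successful transfer to bypass")
```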
Methodology
Description of Methodology
The parts count method is a technique for developing an estimate or
prediction of the average life, the Mean Time Between Failures (MTBF),
of an assembly. It is a prediction process whereby a numerical
estimate is made of the ability, with respect to failure, of a design to
perform its intended function. Once the failure rate is determined,
MTBF is easily calculated as the inverse of the failure rate, as follows:
MTBF = 1 / (FR1 + FR2 + FR3 + ... + FRn)
where FRi is the failure rate of the i-th component and n is the total
number of components in the system.
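A minimal sketch of this summation follows. The part names and failure rates are invented for illustration and are not taken from MIL-HDBK-217. Rates are assumed to be in failures per million hours, and the FIT conversion assumes the usual definition of one FIT as one failure per billion hours.

```python
def mtbf_from_failure_rates(rates_per_million_hours) -> float:
    """Return MTBF in hours as the inverse of the summed component failure rates."""
    total_rate = sum(rates_per_million_hours)
    if total_rate <= 0:
        raise ValueError("total failure rate must be positive")
    return 1e6 / total_rate


# Hypothetical component failure rates, in failures per million hours.
example_parts = {
    "microcontroller": 0.05,
    "resistor network": 0.002,
    "capacitor bank": 0.008,
    "connector": 0.04,
}

board_rate = sum(example_parts.values())          # failures per 10^6 hours
board_mtbf = mtbf_from_failure_rates(example_parts.values())
board_fits = board_rate * 1000                    # failures per 10^9 hours

print(f"board failure rate: {board_rate:.3f} failures per million hours")
print(f"board MTBF: {board_mtbf:,.0f} hours ({board_fits:.0f} FITs)")
```

Because the rates add, the board MTBF is always lower than the MTBF of its least reliable part.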
The general procedure for determining a board level (or system level)
failure rate is to sum the individual failure rates of the components. For
MIL-HDBK-217, the summation is then added to a failure rate for the
circuit board, which includes the effect of solder joints. Component
failure rates are taken either from the standard part failure rate models
in MIL-HDBK-217, "Military Handbook, Reliability Prediction of
Electronic Equipment", or directly from the manufacturers.
The failure rates presented apply to equipment under normal operating
conditions, i.e., with power on and performing its intended function in
its intended environment. Consideration is given to various
environments, component quality, and thermal aspects.
The Equations
A sample calculation for integrated circuits taken from MIL-HDBK-217 is
as follows:
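Failure Rate = (C1 * PiT + C2 * PiE) * PiQ * PiL
where C1 is the die complexity failure rate, C2 is the package failure
rate, PiT is the temperature factor, PiE is the environment factor, PiQ
is the quality factor, and PiL is the learning factor. Each factor
depends on a particular part parameter. The result is the failure rate
of the integrated circuit, expressed in failures per million hours.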
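The integrated circuit equation can be evaluated as in the sketch below. The function name and all numeric inputs are hypothetical placeholders, not values from the handbook tables. In practice each factor is looked up in MIL-HDBK-217 for the specific part and environment.

```python
def ic_failure_rate(c1: float, pi_t: float, c2: float, pi_e: float,
                    pi_q: float, pi_l: float) -> float:
    """Evaluate (C1 * PiT + C2 * PiE) * PiQ * PiL, in failures per million hours."""
    return (c1 * pi_t + c2 * pi_e) * pi_q * pi_l


# Placeholder inputs chosen for easy arithmetic, NOT handbook table values.
ic_rate = ic_failure_rate(
    c1=0.02,    # die complexity failure rate
    pi_t=0.5,   # temperature factor
    c2=0.005,   # package failure rate
    pi_e=2.0,   # environment factor
    pi_q=1.0,   # quality factor
    pi_l=1.0,   # learning factor
)
ic_mtbf = 1e6 / ic_rate  # hours, as the inverse of the failure rate

print(f"IC failure rate: {ic_rate:.3f} failures per million hours")
print(f"IC MTBF: {ic_mtbf:,.0f} hours")
```

The structure of the formula shows that the temperature factor scales only the die term and the environment factor only the package term, while quality and learning scale the whole result.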
What is MIL-HDBK-217?
MIL-HDBK-217 is a reliability prediction standard originally developed for defense and
aerospace-related organizations, but later adopted by many commercial and industrial companies.
Often referred to simply as 217, MIL-HDBK-217 includes mathematical reliability models for
nearly all types of electrical and electronic components. These reliability models are based on
parameters of the components such as number of pins, number of transistors, power dissipation,
and environmental factors. Results from MIL-HDBK-217 are provided as both a failure rate and
as an MTBF (Mean Time Between Failures) where the MTBF is the mathematical inverse of the
failure rate.