
Evaluating Analytical Data

with Microsoft Excel


Computer Application ES 370

Shadrack Fosu (PhD)


Introduction
• The procedure used to determine whether a given result is unacceptable involves repeatedly running a series of identical tests on the same sample, using the same instrument or other piece of equipment.

• In such a scenario, the indeterminate errors manifest themselves in values that deviate, positively and negatively, from the mean (average) of all the values obtained.

• This topic is an introduction to the sources and evaluation of errors in analytical measurements, the effect of measurement error on the result of an analysis, and the statistical analysis of data.
Some Important Terms

• In order to improve the reliability and to obtain information about the variability of results, two to five portions (replicates) of a sample are usually carried through an entire analytical procedure.

• Individual results from a set of measurements are seldom the same, so we usually consider the "best" estimate to be the central value for the set.

• We justify the extra effort required to analyze replicates in two ways.

• First, the central value of a set should be more reliable than any of the individual results.

• Usually, the mean or the median is used as the central value for a set of replicate measurements.

• Second, an analysis of the variation in the data allows us to estimate the uncertainty associated with the central value.
The Mean and the Median
• The most widely used measure of central value is the mean.

• The mean, also called the arithmetic mean or the average, is obtained by dividing the sum of replicate measurements by the number of measurements in the set:

\bar{X} = \frac{\sum_{i=1}^{N} x_i}{N}

• where x_i represents the individual values of x making up the set of N replicate measurements.
• The median is the middle result when replicate data are arranged in increasing or decreasing order.

• There are equal numbers of results that are larger and smaller than the median.

• For an odd number of results, the median can be found by arranging the results in order and locating the middle result.

• For an even number, the average value of the middle pair is used.

• In ideal cases, the mean and median are identical. However, when the number of measurements in the set is small, the values often differ.
Example
The table below shows results for the quantitative determination of iron. Six equal portions of an aqueous solution with a "known" concentration of 20.00 ppm of iron(III) were analysed in exactly the same way.

Sample | Conc. Fe3+ (ppm)
1      | 19.4
2      | 19.5
3      | 19.6
4      | 19.8
5      | 20.1
6      | 20.3

Calculate the mean and median for the data using Excel.
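A minimal Excel sketch for this example, assuming the six concentrations are entered in cells B2:B7 (the cell addresses are illustrative):

Mean:    =AVERAGE(B2:B7)
Median:  =MEDIAN(B2:B7)

For this data set, AVERAGE returns 19.78 ppm, and MEDIAN returns the average of the middle pair (19.6 and 19.8), 19.7 ppm.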
Precision
• Precision describes the reproducibility of measurements, in other words, the closeness of results that have been obtained in exactly the same way.

• Generally, the precision of a measurement is readily determined by simply repeating the measurement on replicate samples.

• Three terms are widely used to describe the precision of a set of replicate data:

  • Standard deviation
  • Variance
  • Coefficient of variation

• These three are functions of how much an individual result x_i differs from the mean, called the deviation from the mean, d_i.

d_i = x_i - \bar{X}

• The relationship between the deviation from the mean and the three precision terms is given by the standard deviation.

• It is the most common measure of the dispersion of data around the mean and is given as:

s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}

• The term n is the number of measurements, and n - 1 is referred to as the number of degrees of freedom. The term s represents the standard deviation.

• The significance of s is that the smaller it is numerically, the more precise the data (the more the measurements are "bunched" around the mean).

• For an infinite number of measurements (where the mean is \mu), the standard deviation is symbolized as \sigma (Greek letter sigma) and is known as the population standard deviation.

• An infinite number of measurements is approximated by 30 or more measurements.

• The standard deviation of each mean is known as the standard error of the mean and is given the symbol s_m.

• The standard error is inversely proportional to the square root of the number of data points N used to calculate the mean, as given by:

s_m = \frac{s}{\sqrt{N}}
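A minimal Excel sketch, again assuming the replicate results sit in B2:B7 (illustrative addresses):

Standard deviation (s):        =STDEV.S(B2:B7)
Standard error of mean (s_m):  =STDEV.S(B2:B7)/SQRT(COUNT(B2:B7))

STDEV.S uses the n - 1 (sample) form of the standard deviation; STDEV.P would give the population form.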

Variance and Other Measures of Precision

• Although the sample standard deviation is usually used in reporting the precision of analytical data, we often find three other terms.

Variance (s^2)
• The variance is just the square of the standard deviation. The sample variance s^2 is an estimate of the population variance \sigma^2 and is given by

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}

• Note that the standard deviation has the same units as the data, while the variance has the units of the data squared.

• Scientists tend to use standard deviation rather than variance because it is easier to relate a measurement and its precision if they both have the same units.

• The advantage of using variance is that variances are additive in many situations.
Relative Standard Deviation (RSD) and Coefficient of Variation (CV)
• Frequently, standard deviations are given in relative rather than absolute terms.

• We calculate the relative standard deviation by dividing the standard deviation by the mean value of the data set.

• The relative standard deviation, RSD, is sometimes given the symbol s_r:

\text{RSD} = s_r = \frac{s}{\bar{X}}

• The result is often expressed in parts per thousand (ppt) or in percent by multiplying this ratio by 1000 ppt or by 100%. For example,

\text{RSD in ppt} = \frac{s}{\bar{X}} \times 1000\ \text{ppt}

• The relative standard deviation multiplied by 100% is called the coefficient of variation (CV):

\text{CV} = \text{RSD in percent} = \frac{s}{\bar{X}} \times 100\%

• Relative standard deviation relates the standard deviation to the value of the mean and represents a practical and popular expression of data quality.
Spread or Range (w)

• The spread, or range, w, is another term that is sometimes used to describe the precision of a set of replicate results.

• It is the difference between the largest value in the set and the smallest.
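These precision measures translate directly into Excel; a sketch assuming the replicates are in B2:B7 (illustrative addresses):

Variance (s^2):  =VAR.S(B2:B7)
RSD (ppt):       =STDEV.S(B2:B7)/AVERAGE(B2:B7)*1000
CV (%):          =STDEV.S(B2:B7)/AVERAGE(B2:B7)*100
Range (w):       =MAX(B2:B7)-MIN(B2:B7)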

Accuracy
• It indicates the closeness of the measurement to the true or accepted value and is expressed by the error.

• Note that accuracy measures agreement between a result and the accepted value.

• Precision, on the other hand, describes the agreement among several results obtained in the same way.

• We can determine precision just by measuring replicate samples.

• Accuracy is often more difficult to determine because the true value is usually unknown.

• An accepted value must be used instead.

• Accuracy is expressed in terms of either absolute or relative error.
Absolute Error
• The absolute error E in the measurement of a quantity x is given by the equation

E = x_i - x_t

• where x_t is the true or accepted value of the quantity.

Relative Error
• The relative error E_r is often a more useful quantity than the absolute error.

• The percent relative error is given by the expression

E_r = \frac{x_i - x_t}{x_t} \times 100\%
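A quick Excel illustration, assuming a measured value in B2 and the accepted value x_t in E1 (illustrative cells):

Absolute error E:       =B2-$E$1
Relative error E_r (%): =(B2-$E$1)/$E$1*100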

Types of Errors in Experimental Data

• Random (or indeterminate) error causes data to be scattered more or less symmetrically around a mean value.

• Random error in a measurement is reflected by its precision.

• Systematic (or determinate) error causes the mean of a data set to differ from the accepted value.

• In general, a systematic error in a series of replicate measurements causes all the results to be too high or too low.

• An example of a systematic error is the loss of a volatile analyte while heating a sample.

• Gross errors differ from indeterminate and determinate errors.

• They usually occur only occasionally, are often large, and may cause a result to be either high or low.

• They are often the product of human error.

• For example, if part of a precipitate is lost before weighing, analytical results will be low.

• Touching a weighing bottle with your fingers after its empty mass is determined will cause a high mass reading for a solid weighed in the contaminated bottle.

• Gross errors lead to outliers, results that appear to differ markedly from all other data in a set of replicate measurements.
Systematic Errors
• Systematic errors have a definite value, an assignable cause, and are of the same magnitude for replicate measurements made in the same way.

• They lead to bias in measurement results. Note that bias affects all of the data in a set in the same way and that it bears a sign.

Sources of Systematic Errors

• Instrumental errors are caused by nonideal instrument behavior, by faulty calibrations, or by use under inappropriate conditions.

• Method errors arise from nonideal chemical or physical behavior of analytical systems.

• Personal errors result from the carelessness, inattention, or personal limitations of the experimenter.
Accuracy of an Analysis: Confidence Limits

• The mean or best value is the value that will be reported for an analytical determination.

The question is:

• How closely does the mean or best value agree with the true value \mu? In other words, how accurate is the analysis?

• Experimentally, an analytical procedure is often tested on a carefully prepared standard sample whose composition is known to a high degree of accuracy.

• Then the standard sample is analysed by the analytical procedure to be tested, and the analytical results are compared with the stated value for each constituent.

• In favourable cases, this gives a direct indication of the accuracy of which the analytical procedure is capable.

• The difficulty is that the actual sample may have a somewhat different composition from the standard sample, and this can affect the accuracy likely to be obtained.

• Another approach is to use the precision of the analysis, as outlined below.
Confidence Limits: the t Distribution
• If a valid analytical procedure having no systematic error is used to obtain a sample mean, then statistical theory may be used to describe the accuracy of the procedure.

• Specifically, the theory is used to predict within what limits the mean is likely to agree with \mu.

• The theory will not enable this prediction to be made with 100% probability.

• There is always some fraction of risk \alpha, or percentage (100 - 100\alpha)%, involved in such a prediction.

• The limits predicted for a certain risk or probability are called confidence limits.

• Instead of depending on the normal distribution curve, the limits depend on the t, or "Student's t," distribution curve.

• The distribution yields values for a constant called t.

• To formulate the confidence limits, the standard deviation s of the sample is calculated, and a value for t is found by consulting the t-table.

• The constant t depends on the fraction of risk \alpha and on n, the number of measurements in the sample.

• Many statisticians prefer n - 1, which is termed the degrees of freedom, and the probability level given in the table.

• The confidence limits for \bar{X} are calculated as follows:

\mu = \bar{X} \pm \frac{ts}{\sqrt{n}}

• It may be said that the fraction of risk that \mu lies outside these confidence limits is \alpha. Obviously, \alpha/2 is the fraction of risk that \mu is either larger or smaller than the confidence limits.

• It may also be said that the probability that \mu lies inside these limits is 100 - 100\alpha.
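A minimal Excel sketch of a 95% confidence interval (\alpha = 0.05), assuming the replicate results are in B2:B7 (illustrative addresses):

t value:    =T.INV.2T(0.05, COUNT(B2:B7)-1)
Half-width: =T.INV.2T(0.05, COUNT(B2:B7)-1)*STDEV.S(B2:B7)/SQRT(COUNT(B2:B7))

The built-in =CONFIDENCE.T(0.05, STDEV.S(B2:B7), COUNT(B2:B7)) returns the same half-width; the confidence limits are then the mean plus or minus this value.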
Trial Question 1
A new procedure for the rapid determination of sulphur in kerosene was tested on a sample known from its method of preparation to contain 0.123% S (\mu = 0.123% S). The results for % S were 0.112, 0.118, 0.115, and 0.119. Do the data indicate that there is bias in the method at the 95% confidence level? Use Excel to perform this hypothesis test.
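One way to set this up in Excel, sketched with the four results in B2:B5 and \mu in E1 (illustrative cells); the test statistic is t = (\bar{X} - \mu)/(s/\sqrt{n}):

t statistic:  =(AVERAGE(B2:B5)-$E$1)/(STDEV.S(B2:B5)/SQRT(COUNT(B2:B5)))
Critical t:   =T.INV.2T(0.05, COUNT(B2:B5)-1)
p value:      =T.DIST.2T(ABS((AVERAGE(B2:B5)-$E$1)/(STDEV.S(B2:B5)/SQRT(COUNT(B2:B5)))), COUNT(B2:B5)-1)

If |t| exceeds the critical t (equivalently, if the p value is below 0.05), the null hypothesis that the method is unbiased is rejected.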
Assignment 2
Sewage and industrial pollutants dumped into a body of water can reduce the dissolved oxygen concentration and adversely affect aquatic species. In one study, weekly readings were taken from the same location in a river over a two-month period.

Week Number | Dissolved O2 (ppm)
1           | 4.9
2           | 5.1
3           | 5.6
4           | 4.3
5           | 4.7
6           | 4.9
7           | 4.5
8           | 5.1

Some scientists think that 5.0 ppm is a dissolved O2 level that is marginal for fish to live. Conduct a statistical test to determine whether the mean dissolved O2 concentration is less than 5.0 ppm at the 95% confidence level. State clearly the null and alternative hypotheses, and perform the test using Microsoft Excel.
STANDARDISATION AND CALIBRATION
Introduction
• A very important part of all analytical procedures is the calibration and standardization process.

• Calibration determines the relationship between the analytical response and the analyte concentration.

• This relationship is usually determined by the use of chemical standards.

• The standards used can be prepared from purified reagents, if available, or standardized by classical quantitative methods.

• Most commonly, the standards used are prepared externally to the analyte solutions (external standard methods).

• E.g., the arsenic concentration in a sample was determined by calibration of the absorbance scale of a spectrophotometer with external standard solutions of known arsenic concentration.

• In some cases, an attempt is made to reduce interferences from other constituents in the sample matrix, called concomitants, by using standards added to the analyte solution (internal standard methods or standard addition methods) or by matrix matching or modification.

• Almost all analytical methods require some type of calibration with chemical standards.

• Gravimetric methods and some coulometric methods are among the few absolute methods that do not rely on calibration with chemical standards.

Calibration of an Analytical Instrument

• Signals are measured using equipment or instruments that must be properly calibrated if S_meas is to be free of determinate errors.

• Calibration is accomplished against a standard, adjusting S_meas until it agrees with the standard's known signal.

• The response of an instrument used for chemical analysis is proportional to the concentration of the analyte in a solution.

• This proportionality can be expressed as follows:

R = KC

• where R is the instrument readout, C is the concentration of the analyte, and K is the proportionality constant.

• (The most common example is Beer's law.)

• If the object of the experiment is the concentration (C) of the analyte in the sample solution, then the R and the K for this solution must be known so that C can be calculated:

C = \frac{R}{K}

• The calibration of an instrument for quantitative analysis can utilize a single standard, resulting in a calibration constant, or a series of standards, resulting in a standard curve.

• If a single standard is used, the value of K is the calibration constant.

• It is found by determining the instrument response for a standard solution of the analyte and then calculating K:

R_s = K C_s

K = \frac{R_s}{C_s}

• The concentration of the unknown sample solution is then calculated from its instrument response and the K determined from the equation above:

C_u = \frac{R_u}{K}

• In the above equations, R_s is the readout for the standard solution, C_s is the concentration of the analyte in the standard solution, R_u is the readout for the unknown sample solution, C_u is the concentration of the analyte in the unknown solution, and the proportionality constant (K) is the calibration constant.
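A single-standard calibration is a one-line calculation in Excel; a sketch assuming C_s in A2, R_s in B2, and the unknown's readout R_u in B3 (illustrative cells):

K:    =B2/A2
C_u:  =B3/(B2/A2)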
Limitations
• There are two limitations with the above process:

1) the analyst is "putting all the eggs in one basket" by comparing the sample to just one standard (not very statistically sound);

2) the calibration constant, K, must truly be constant at the two concentration levels, C_s and C_u (possible, but not guaranteed).

• Because of these limitations, the concept of the standard curve is used most of the time.
• The concept of a series of standards refers to an experiment in which a series of standard solutions is prepared, covering a concentration range within which the unknown concentration is expected to fall.

• For example, if an unknown concentration is thought to be around 4 parts per million (ppm), then a series of standards bracketing this value, such as 1, 3, 5, and 7 ppm, is prepared.

• The readout for each of these is then measured.

• The standard curve is a plot of the readout vs. concentration.

• The unknown concentration is determined from this plot.

Conc. (ppm) | Readout
1.0         | 2.1
3.0         | 6.3
5.0         | 10.5
7.0         | 14.7
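With the standards above in A2:B5 (concentrations in column A, readouts in column B; illustrative layout), Excel's regression functions give the curve directly:

Slope (K):     =SLOPE(B2:B5, A2:A5)
Intercept:     =INTERCEPT(B2:B5, A2:A5)
Unknown conc.: =(D2-INTERCEPT(B2:B5, A2:A5))/SLOPE(B2:B5, A2:A5)

where D2 is assumed to hold the unknown's readout. For this data set the slope is 2.1 and the intercept is 0, so the line passes through the origin.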
• The standard curve procedure is free from the limitations involved in comparing an unknown to a single standard because:

1) if the unknown concentration is compared to more than one standard, the results are more statistically sound;

2) if the curve is a straight line, the value of K is constant through the concentration range used.

• If the standard curve is a straight line free of bends or curves, the value of K is the slope of the line and is constant.

• If a portion of the standard curve is linear, this linear portion may be used to determine the unknown concentration.

• We should indicate at this point that, while the limitations of the single standard are eliminated, there is still considerable opportunity for experimental error in this procedure (error in solution preparation and instrument readout), and this error may make it appear that the data points are "scattered" and not linear. (This will be addressed later.)

• This method of using a series of standard solutions is commonplace in an instrumental analysis laboratory for the vast majority of all quantitative instrumental procedures.

• Examples of this abound for many spectrophotometric, chromatographic, and other techniques.

Comparison with Standards

• Direct comparison techniques
• Titration procedures
Direct Comparison
• Some analytical procedures involve comparing a property of the analyte (or the product of a reaction with the analyte) with standards such that the property being tested matches or nearly matches that of the standard.

• For example, in early colorimeters, the color produced as the result of a chemical reaction of the analyte was compared with the color produced by using standards in place of the analyte in the same reaction.

• If the concentration of the standard was varied by dilution, for example, it was possible to match colours fairly precisely.

• The concentration of the analyte was then equal to the concentration of the diluted standard.

• Such procedures are called null comparison or isomation methods.
External Standard Calibration
• In external standard calibration, a series of standard solutions is prepared separately from the sample.

• The standards are used to establish the instrument calibration function, which is obtained from analysis of the instrument response as a function of the known analyte concentration.

• Ideally, three or more standard solutions are used in the calibration process, although in some routine determinations, two-point calibrations can be reliable.

• The calibration function can be obtained graphically or in mathematical form.

• Generally, a plot of instrument response versus known analyte concentrations is used to produce a calibration curve, sometimes called a working curve.

• It is often desirable that the calibration curve be linear in at least the range of the analyte concentrations.

• A linear calibration curve of absorbance versus analyte concentration is shown below.
Example One
[Figure] Calibration curve of absorbance versus analyte concentration for a series of standards. Data for standards are shown as solid circles. The calibration curve is used in an inverse fashion to obtain the concentration of an unknown with an absorbance of 0.505. The absorbance is located on the line, and then the concentration corresponding to that absorbance is obtained by extrapolating to the x-axis (dashed lines). Residuals are distances on the y-axis between the data points and the predicted line, as shown in the inset.
• The concentration found is then related back to the analyte concentration in the original sample by applying appropriate dilution factors from the sample preparation steps.

• Computerized numerical data analysis has largely replaced graphical calibration methods, which are now seldom used except for visual confirmation of results.

• Statistical methods, such as the method of least squares, are routinely used to find the mathematical equation describing the calibration function.
The Least-Squares Method
• The calibration curve shown in Example One is for the determination of Ni(II) by reaction with excess thiocyanate to form an absorbing complex ion, Ni(SCN)+.

• The ordinate is the dependent variable, absorbance, while the abscissa is the independent variable, concentration of Ni(II).

• As is typical and usually desirable, the plot approximates a straight line.

• Note that, because of indeterminate errors in the measurement process, not all the data fall exactly on the line.

• Thus, the investigator must try to draw the "best" straight line among the data points.

• Regression analysis provides the means for objectively obtaining such a line and also for specifying the uncertainties associated with its subsequent use.

• We consider here only the basic method of least squares for two-dimensional data.
Assumptions of the Least-Squares Method
• Two assumptions are made in using the method of least squares.

• The first is that there is actually a linear relationship between the measured response y (absorbance in Example One) and the standard analyte concentration x.

• The mathematical relationship that describes this assumption is called the regression model, which may be represented as

y = mx + b

• where b is the y-intercept (the value of y when x is zero), and m is the slope of the line.

• We also assume that any deviation of the individual points from the straight line arises from error in the measurement.

• That is, we assume there is no error in the x values of the points (concentrations).

• Both of these assumptions are appropriate for many analytical methods, but bear in mind that, whenever there is significant uncertainty in the x data, basic linear least-squares analysis may not give the best straight line.

• In such a case, a more complex correlation analysis may be necessary.

• In addition, simple least-squares analysis may not be appropriate when the uncertainties in the y values vary significantly with x.

• In that instance, it may be necessary to apply different weighting factors to the points and perform a weighted least-squares analysis.

Finding the Least-Squares Line

• The least-squares procedure can be illustrated with the aid of the calibration curve for the determination of Ni(II) shown in Example One.

• Thiocyanate was added to the Ni(II) standards, and the absorbances were measured as a function of the Ni(II) concentration.

• The vertical deviation of each point from the straight line is called a residual, as shown in the inset.

• The line generated by the least-squares method is the one that minimizes the sum of the squares of the residuals for all the points.

• In addition to providing the best fit between the experimental points and the straight line, the method gives the standard deviations for m and b.

• The least-squares method finds the sum of the squares of the residuals, SS_resid, and minimizes the sum using calculus.

• The value of SS_resid is found from

SS_{resid} = \sum_{i=1}^{N} [y_i - (b + m x_i)]^2

• where N is the number of points used. The calculation of slope and intercept is simplified when three quantities are defined, S_xx, S_yy, and S_xy, as follows:

S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{N}

S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{N}

S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{N}

• where x_i and y_i are individual pairs of data for x and y, N is the number of pairs, and \bar{x} and \bar{y} are the average values for x and y.

• Note that S_xx and S_yy are the sums of the squares of the deviations from the mean for individual values of x and y.

• The expressions shown on the far right in the equations above are more convenient when a calculator without a built-in regression function is used.
• Six useful quantities can be derived from S_xx, S_yy, and S_xy:

• The slope of the line, m:

m = \frac{S_{xy}}{S_{xx}}

• The intercept, b:

b = \bar{y} - m\bar{x}

• The standard deviation about regression, s_r:

s_r = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{N - 2}}

• The standard deviation of the slope, s_m:

s_m = \sqrt{\frac{s_r^2}{S_{xx}}}

• The standard deviation of the intercept, s_b:

s_b = s_r \sqrt{\frac{\sum x_i^2}{N \sum x_i^2 - (\sum x_i)^2}} = s_r \sqrt{\frac{1}{N - (\sum x_i)^2 / \sum x_i^2}}

• The standard deviation for results obtained from the calibration curve, s_c:

s_c = \frac{s_r}{m} \sqrt{\frac{1}{M} + \frac{1}{N} + \frac{(\bar{y}_c - \bar{y})^2}{m^2 S_{xx}}}
• The equation for s_c gives us a way to calculate the standard deviation from the mean \bar{y}_c of a set of M replicate analyses of unknowns when a calibration curve that contains N points is used; recall that \bar{y} is the mean value of y for the N calibration points.

• This equation is only approximate and assumes that slope and intercept are independent parameters, which is not strictly true.

• The standard deviation about regression, s_r, is the standard deviation for y when the deviations are measured not from the mean of y (as is the usual case) but from the straight line that results from the least-squares prediction.

• The value of s_r is related to SS_resid by

s_r = \sqrt{\frac{\sum_{i=1}^{N} [y_i - (b + m x_i)]^2}{N - 2}} = \sqrt{\frac{SS_{resid}}{N - 2}}

• In this equation, the number of degrees of freedom is N - 2 since one degree of freedom is lost in calculating m and one in determining b.

• The standard deviation about regression is often called the standard error of the estimate. It roughly corresponds to the size of a typical deviation from the estimated regression line.

• With computers, the calculations are typically done using a spreadsheet program, such as Microsoft Excel.
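A sketch of how these quantities map onto Excel's built-in regression functions, assuming the x values sit in A2:A6 and the y values in B2:B6 (illustrative addresses):

Slope m:                       =SLOPE(B2:B6, A2:A6)
Intercept b:                   =INTERCEPT(B2:B6, A2:A6)
s_r (std. error of estimate):  =STEYX(B2:B6, A2:A6)
S_xx:                          =DEVSQ(A2:A6)

Alternatively, =LINEST(B2:B6, A2:A6, TRUE, TRUE), entered as an array formula over a two-column, five-row range (or allowed to spill in recent Excel versions), returns m and b in the first row, their standard deviations s_m and s_b in the second row, and SS_resid in the fifth row, second column.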
Exercise 1
Carry out a least-squares analysis of the calibration data for the determination of isooctane in a hydrocarbon mixture provided in the table below:

Calibration Data for the Chromatographic Determination of Isooctane in a Hydrocarbon Mixture

Mole Percent Isooctane, x_i | Peak Area, y_i
0.352                       | 1.09
0.803                       | 1.78
1.08                        | 2.60
1.38                        | 3.03
1.75                        | 4.01
Exercise 2
The calibration curve found in Exercise 1 was used for the chromatographic determination of isooctane in a hydrocarbon mixture. A peak area of 2.65 was obtained. Calculate the mole percent of isooctane in the mixture and the standard deviation if the area was (a) the result of a single measurement and (b) the mean of four measurements.
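One way to sketch part (a) in Excel, assuming the Exercise 1 data sit in A2:A6 (x) and B2:B6 (y) and the measured peak area 2.65 is in D2 (illustrative cells); Excel has no built-in function for s_c, so the formula above is entered directly:

x unknown:  =(D2-INTERCEPT(B2:B6,A2:A6))/SLOPE(B2:B6,A2:A6)
s_c (M=1):  =STEYX(B2:B6,A2:A6)/SLOPE(B2:B6,A2:A6)*SQRT(1/1 + 1/COUNT(A2:A6) + (D2-AVERAGE(B2:B6))^2/(SLOPE(B2:B6,A2:A6)^2*DEVSQ(A2:A6)))

For part (b), replace 1/1 with 1/4, since the area is then the mean of M = 4 measurements.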
