Evaluating Analytical Data
with Microsoft Excel
Computer Application ES 370
Shadrack Fosu (PhD)
Introduction
The procedure used to determine whether a given result is
unacceptable involves running a series of identical tests on
the same sample, using the same instrument or other piece
of equipment, repeatedly.
In such a scenario, the indeterminate errors manifest
themselves in values that deviate, positively and negatively,
from the mean (average) of all the values obtained.
This topic is an introduction to the sources and evaluation
of errors in analytical measurements, the effect of
measurement error on the result of an analysis, and the
statistical analysis of data.
Some Important Terms
In order to improve the reliability and to obtain information
about the variability of results, two to five portions
(replicates) of a sample are usually carried through an entire
analytical procedure.
Individual results from a set of measurements are seldom
the same, so we usually consider the “best” estimate to be
the central value for the set.
We justify the extra effort required to analyze replicates in
two ways.
First, the central value of a set should be more reliable than
any of the individual results.
Usually, the mean or the median is used as the central value
for a set of replicate measurements.
Second, an analysis of the variation in the data allows us to
estimate the uncertainty associated with the central value.
The Mean and the Median
o The most widely used measure of central value is the
mean.
o The mean, also called the arithmetic mean or the
average, is obtained by dividing the sum of replicate
measurements by the number of measurements in the set:
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
o where xi represents the individual values of x making up
the set of N replicate measurements.
The median is the middle result when replicate data are
arranged in increasing or decreasing order.
There are equal numbers of results that are larger and
smaller than the median.
For an odd number of results, the median can be found by
arranging the results in order and locating the middle result.
For an even number, the average value of the middle pair
is used.
In ideal cases, the mean and median are identical. However,
when the number of measurements in the set is small, the
values often differ.
Example
The table below shows results for the quantitative determination
of iron. Six equal portions of an aqueous solution with a
“known” concentration of 20.00 ppm of iron (III) were analysed
in exactly the same way.
Sample    Conc. Fe3+ (ppm)
1 19.4
2 19.5
3 19.6
4 19.8
5 20.1
6 20.3
Calculate the mean and median for the data using Excel.
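As a cross-check on the Excel functions AVERAGE and MEDIAN, the same central values can be computed with a short Python sketch using the six Fe(III) results above:

```python
from statistics import mean, median

# Six replicate Fe(III) results (ppm) from the table above
conc = [19.4, 19.5, 19.6, 19.8, 20.1, 20.3]

x_bar = mean(conc)   # Excel: =AVERAGE(range)
med = median(conc)   # Excel: =MEDIAN(range); even n, so average of middle pair

print(round(x_bar, 2))  # 19.78
print(round(med, 2))    # 19.7
```

Note that the mean (19.78 ppm) and median (19.7 ppm) differ slightly, as expected for a small set of measurements.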
Precision
Precision describes the reproducibility of measurements —in
other words, the closeness of results that have been obtained
in exactly the same way.
Generally, the precision of a measurement is readily determined
by simply repeating the measurement on replicate samples.
Three terms are widely used to describe the precision of a set
of replicate data:
Standard deviation
Variance
Coefficient of variation
These three are functions of how much an individual result
xi differs from the mean, called the deviation from the
mean di.
$$d_i = x_i - \bar{x}$$
The relationship between the deviation from the mean and
the three precision terms is given by the standard
deviation.
It is the most common measure of the dispersion of data
around the mean and given as:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
The term n is the number of measurements, and n – 1 is
referred to as the number of degrees of freedom. The term
s represents the standard deviation.
The significance of s is that the smaller it is numerically, the
more precise the data (the more the measurements are
“bunched” around the mean).
For an infinite number of measurements (where the mean is
𝜇), the standard deviation is symbolized as 𝜎 (Greek letter
sigma) and is known as the population standard deviation.
An infinite number of measurements is approximated by 30
or more measurements.
The standard deviation of each mean is known as the
standard error of the mean and is given the symbol sm.
The standard error is inversely proportional to the square
root of the number of data points N used to calculate the
mean as given by:
$$s_m = \frac{s}{\sqrt{N}}$$
Variance and Other Measures of Precision
Although the sample standard deviation is usually used in
reporting the precision of analytical data, we often find
three other terms
Variance (s2)
The variance is just the square of the standard deviation. The
sample variance s2 is an estimate of the population variance
𝜎2 and is given by
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Note that the standard deviation has the same units as the
data, while the variance has the units of the data squared.
Scientists tend to use standard deviation rather than
variance because it is easier to relate a measurement and its
precision if they both have the same units.
The advantage of using variance is that variances are
additive in many situations.
Relative Standard Deviation (RSD) and
Coefficient of Variation (CV)
Frequently standard deviations are given in relative rather
than absolute terms.
We calculate the relative standard deviation by dividing
the standard deviation by the mean value of the data set.
The relative standard deviation, RSD, is sometimes given
the symbol sr
$$RSD = s_r = \frac{s}{\bar{x}}$$
The result is often expressed in parts per thousand (ppt) or
in percent by multiplying this ratio by 1000 ppt or by 100%.
For example
$$RSD \text{ in ppt} = \frac{s}{\bar{x}} \times 1000 \text{ ppt}$$
The relative standard deviation multiplied by 100% is called
the coefficient of variation (CV).
$$CV = RSD \text{ in percent} = \frac{s}{\bar{x}} \times 100\%$$
Relative standard deviation relates the standard deviation
to the value of the mean and represents a practical and
popular expression of data quality.
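These precision measures can be sketched in Python for the same Fe(III) data (Excel equivalents: VAR.S for the variance, and the RSD/CV built from STDEV.S and AVERAGE):

```python
from statistics import mean, stdev, variance

data = [19.4, 19.5, 19.6, 19.8, 20.1, 20.3]  # Fe(III) results, ppm

x_bar = mean(data)
s = stdev(data)
var = variance(data)        # Excel: =VAR.S(range); units are ppm^2

rsd_ppt = s / x_bar * 1000  # relative standard deviation in parts per thousand
cv = s / x_bar * 100        # coefficient of variation, percent
```

For these data the CV is roughly 1.8%, i.e. about 18 ppt, which is the same ratio expressed two ways.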
Spread or Range (w)
The spread, or range, w, is another term that is sometimes
used to describe the precision of a set of replicate results.
It is the difference between the largest value in the set and
the smallest.
Accuracy
Accuracy indicates the closeness of a measurement to the true
or accepted value and is expressed by the error.
Note that accuracy measures agreement between a result
and the accepted value.
Precision, on the other hand describes the agreement
among several results obtained in the same way.
We can determine precision just by measuring replicate
samples.
Accuracy is often more difficult to determine because the
true value is usually unknown.
An accepted value must be used instead.
Accuracy is expressed in terms of either absolute or relative
error.
Absolute Error
The absolute error E in the measurement of a quantity x is
given by the equation
𝐸 = 𝑥𝑖 − 𝑥𝑡
where xt is the true or accepted value of the quantity.
Relative Error
The relative error Er is often a more useful quantity than the
absolute error.
The percent relative error is given by the expression
$$E_r = \frac{x_i - x_t}{x_t} \times 100\%$$
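A quick numeric illustration, using the accepted value of 20.00 ppm from the earlier iron example and its rounded mean of 19.78 ppm:

```python
true_value = 20.00   # accepted Fe(III) concentration, ppm
x_bar = 19.78        # mean of the six replicate results (rounded)

E = x_bar - true_value                         # absolute error, ppm
E_r = (x_bar - true_value) / true_value * 100  # percent relative error
```

Here the absolute error is -0.22 ppm and the relative error -1.1%, the negative sign showing the results run low.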
Types of Errors in Experimental Data
Random (or indeterminate) error, causes data to be
scattered more or less symmetrically around a mean value.
Random error in a measurement is reflected by its precision.
Systematic (or determinate) error, causes the mean of a
data set to differ from the accepted value.
In general, a systematic error in a series of replicate
measurements causes all the results to be too high or too
low.
An example of a systematic error is the loss of a volatile
analyte while heating a sample.
Gross errors differ from indeterminate and determinate
errors.
They usually occur only occasionally, are often large, and
may cause a result to be either high or low.
They are often the product of human errors.
For example, if part of a precipitate is lost before weighing,
analytical results will be low.
Touching a weighing bottle with your fingers after its empty
mass is determined will cause a high mass reading for a solid
weighed in the contaminated bottle.
Gross errors lead to outliers, results that appear to differ
markedly from all other data in a set of replicate
measurements.
Systematic Errors
Systematic errors have a definite value, an assignable
cause, and are of the same magnitude for replicate
measurements made in the same way.
They lead to bias in measurement results. Note that bias
affects all of the data in a set in the same way and that it
bears a sign.
Sources of Systematic Errors
Instrumental errors are caused by nonideal instrument
behavior, by faulty calibrations or by use under
inappropriate conditions.
Method errors arise from nonideal chemical or physical
behavior of analytical systems.
Personal errors result from the carelessness, inattention, or
personal limitations of the experimenter.
Accuracy of an analysis: Confidence Limits
The mean or best value is the value that will be reported
for an analytical determination.
The question is:
How closely does the mean or best value agree with the
true value μ? In other words, how accurate is the analysis?
Experimentally, an analytical procedure is often tested on
a carefully prepared standard sample whose composition
is known to a high degree of accuracy.
Then the standard sample is analysed by the analytical
procedure to be tested, and
the analytical results compared with the stated value for
each constituent.
In favourable cases, this gives a direct indication of the
accuracy of which the analytical procedure is capable.
The difficulty is that actual samples may have a somewhat
different composition from the standard sample, and
this can affect the accuracy likely to be obtained.
Another approach is to use the precision of the analysis, as
outlined below.
Confidence Limits: the t Distribution
If a valid analytical procedure having no systematic error is
used to obtain a sample mean,
then statistical theory may be used to describe the accuracy
of the procedure.
Specifically, the theory is used to predict within what limits
the mean is likely to agree with 𝜇.
The theory will not enable this prediction to be made with
100% probability.
There is always some fraction of risk α (or percentage
100 − 100α) involved in such a prediction.
The limits predicted for a certain risk or probability are called
confidence limits.
Instead of depending on the normal distribution curve, the
limits depend on the t, or “Student’s t,” distribution curve.
The distribution yields values for a constant called t.
To formulate the confidence limits, the standard deviation s
of the sample is calculated, and a value for t is found by
consulting a t-table.
The constant t depends on the fraction of risk α and on n,
the number of measurements in the sample.
Most tables are indexed by n − 1, termed the number of
degrees of freedom, and by the probability level.
The confidence limits for X are calculated as follows:
$$\mu = \bar{x} \pm \frac{ts}{\sqrt{n}}$$
It may be said that the fraction of risk that μ lies outside
these confidence limits is α. Obviously, α/2 is the fraction of
risk that μ is larger than the upper limit (and likewise α/2
that it is smaller than the lower limit).
It may also be said that the probability that 𝜇 lies inside
these limits is 100 - 100𝛼.
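A sketch of the confidence-limit calculation, using hypothetical replicate data and a t value taken from a t-table (in Excel, =T.INV.2T(0.05, 4) returns the same two-tailed t):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical replicate results, for illustration only
data = [10.08, 10.11, 10.09, 10.10, 10.12]
n = len(data)

x_bar = mean(data)
s = stdev(data)

# Two-tailed Student's t for 95% confidence, n - 1 = 4 degrees of freedom,
# taken from a t-table (Excel: =T.INV.2T(0.05, 4))
t = 2.776

half_width = t * s / sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width
```

The interval x̄ ± ts/√n then brackets μ with 95% probability, provided the method carries no systematic error.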
Trial Question 1
A new procedure for the rapid determination of Sulphur in
kerosene was tested on a sample known from its method of
preparation to contain 0.123% S (μ = 0.123% S). The
results for % S were 0.112, 0.118, 0.115 and 0.119. Do the
data indicate that there is a bias in the method at the 95%
confidence level? Perform this hypothesis test using Excel.
Assignment 2
Sewage and industrial pollutants dumped into a body of water can
reduce the dissolved oxygen concentration and adversely affect
aquatic species. In one study, weekly readings are taken from the
same location in a river over a two-month period.
Week Number    Dissolved O2 (ppm)
1 4.9
2 5.1
3 5.6
4 4.3
5 4.7
6 4.9
7 4.5
8 5.1
Some scientists think that 5.0 ppm is a dissolved O2 level that is
marginal for fish to live. Conduct a statistical test to determine
whether the mean dissolved O2 concentration is less than 5.0
ppm at the 95% confidence level. State clearly the null and
alternative hypotheses, and carry out the test using Microsoft Excel.
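The kind of one-sided test the assignment calls for can be sketched as follows. The data here are hypothetical (not the assignment data), and the one-tailed critical t comes from a t-table; in Excel, T.DIST gives the corresponding one-tailed probability:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical dissolved-O2 readings (ppm) -- not the assignment data
data = [4.8, 4.6, 5.0, 4.7, 4.9, 4.8]
mu_0 = 5.0  # H0: mean = 5.0 ppm; H1: mean < 5.0 ppm

n = len(data)
t_stat = (mean(data) - mu_0) / (stdev(data) / sqrt(n))

# One-tailed critical t for 95% confidence, n - 1 = 5 degrees of freedom
t_crit = 2.015

# Reject H0 only if the statistic falls in the lower tail
reject_H0 = t_stat < -t_crit
```

Since the test is one-sided (H1: mean < 5.0 ppm), only the lower tail of the t distribution is used.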
STANDARDISATION AND
CALIBRATION
Introduction
A very important part of all analytical procedures is the
calibration and standardization process.
Calibration determines the relationship between the
analytical response and the analyte concentration.
This relationship is usually determined by the use of
chemical standards.
The standards used can be prepared from purified reagents,
if available, or standardized by classical quantitative
methods.
Most commonly, the standards used are prepared externally
to the analyte solutions (external standard methods).
E.g., the arsenic concentration in a sample was determined by
calibration of the absorbance scale of a spectrophotometer with
external standard solutions of known arsenic concentration.
In some cases, an attempt is made to reduce interferences
from other constituents in the sample matrix, called
concomitants, by
using standards added to the analyte solution (internal
standard methods or standard addition methods) or by matrix
matching or modification.
Almost all analytical methods require some type of
calibration with chemical standards.
Gravimetric methods and some coulometric methods are
among the few absolute methods that do not rely on
calibration with chemical standards.
Calibration of an Analytical Instrument
Signals are measured using equipment or instruments that
must be properly calibrated if Smeas is to be free of
determinate errors.
Calibration is accomplished against a standard, adjusting
Smeas until it agrees with the standard’s known signal.
The response of an instrument used for chemical analysis is
proportional to the concentration of the analyte in a solution.
This proportionality can be expressed as follows:
𝑅 = 𝐾𝐶
where R is the instrument readout, C is the concentration of
the analyte, and K is the proportionality constant.
(The most common example is Beer's law.)
If the object of the experiment is the concentration (C) of the
analyte in the sample solution, then the R and the K for this
solution must be known so that C can be calculated:
$$C = \frac{R}{K}$$
The calibration of an instrument for quantitative analysis can
utilize a single standard, resulting in a calibration constant,
or a series of standards, resulting in a standard curve.
If a single standard is used, the value of K is the calibration
constant.
It is found by determining the instrument response for a
standard solution of the analyte and then calculating K:
$$R_S = KC_S$$
$$K = \frac{R_S}{C_S}$$
The concentration of the unknown sample solution is then
calculated from its instrument response and the K
determined from equation above:
$$C_U = \frac{R_U}{K}$$
In the above equations, RS is the readout for the standard
solution, CS is the concentration of the analyte in the standard
solution, RU is the readout for the unknown sample solution,
CU is the concentration of the analyte in the unknown solution,
and the proportionality constant (K) is the calibration constant.
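A single-standard calibration can be followed numerically; the readout and concentration values below are assumed, for illustration only:

```python
# Hypothetical single-standard calibration (values assumed)
R_s, C_s = 0.420, 2.00  # readout and concentration (ppm) of the standard
R_u = 0.630             # readout of the unknown sample solution

K = R_s / C_s           # calibration constant from the standard
C_u = R_u / K           # unknown concentration from its readout
```

Here K = 0.21 readout units per ppm, so a readout of 0.630 corresponds to 3.0 ppm.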
Limitations
There are two limitations with the above process:
1) the analyst is “putting all the eggs in one basket” by
comparing the sample to just one standard (not very
statistically sound)
2) the calibration constant, K, must truly be constant at the two
concentration levels, CS and CU (possible, but not guaranteed).
Because of these limitations, the concept of the standard
curve is used most of the time.
The concept of a series of standards refers to an experiment
in which a series of standard solutions is prepared;
covering a concentration range within which the unknown
concentration is expected to fall.
For example, if an unknown concentration is thought to be
around 4 parts per million (ppm),
then a series of standards bracketing this value, such as 1, 3,
5, and 7 ppm, are prepared.
The readout for each of these is then measured.
The standard curve is a plot of the readout vs.
concentration.
The unknown concentration is determined from this plot.
Conc. (ppm) Readout
1.0 2.1
3.0 6.3
5.0 10.5
7.0 14.7
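Fitting the tabulated standards gives the calibration line directly (Excel: SLOPE and INTERCEPT). A sketch using the table's data, with a hypothetical unknown readout of 8.4 read back off the curve:

```python
from statistics import mean

# Standard-curve data from the table above
x = [1.0, 3.0, 5.0, 7.0]    # concentration, ppm
y = [2.1, 6.3, 10.5, 14.7]  # readout

x_bar, y_bar = mean(x), mean(y)
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

m = S_xy / S_xx        # slope; Excel: =SLOPE(y_range, x_range)
b = y_bar - m * x_bar  # intercept; Excel: =INTERCEPT(y_range, x_range)

# Hypothetical unknown readout of 8.4, read back off the curve
C_u = (8.4 - b) / m
```

These particular data fall exactly on a line of slope 2.1 through the origin, so a readout of 8.4 maps to 4.0 ppm, inside the 1-7 ppm bracket of the standards.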
The standard curve procedure is free from the limitations
involved in comparing an unknown to a single standard
because:
1) if the unknown concentration is compared to more than one
standard, the results are more statistically sound,
2) if the curve is a straight line, the value of K is constant
through the concentration range used.
If the standard curve is a straight line free of bends or
curves, the value of K is the slope of the line and is
constant.
If a portion of the standard curve is linear, this linear portion
may be used to determine the unknown concentration.
We should indicate at this point that, while the limitations of
the single standard are eliminated,
there is still considerable opportunity for experimental error
in this procedure (error in solution preparation and
instrument readout), and
this error may make it appear that the data points are
“scattered” and not linear. (This will be addressed later.)
This series of standard solutions method is commonplace
in an instrumental analysis laboratory for the vast majority
of all quantitative instrumental procedures.
Examples of this abound for many spectrophotometric,
chromatographic, and other techniques.
Comparison with Standards
Direct comparison techniques
Titration procedures
Direct Comparison
Some analytical procedures involve comparing a property of
the analyte (or the product of a reaction with the analyte)
with standards, such that the property being tested matches
or nearly matches that of the standard.
For example, in early colorimeters, the color produced as
the result of a chemical reaction of the analyte was
compared with the color produced by using standards in
place of the analyte in the same reaction.
If the concentration of the standard was varied by dilution, for
example, it was possible to match colours fairly precisely.
The concentration of the analyte was then equal to the
concentration of the diluted standard.
Such procedures are called null comparison or isomation
methods.
External Standard Calibration
In external standard calibration, a series of standard
solutions is prepared separately from the sample.
The standards are used to establish the instrument
calibration function,
which is obtained from analysis of the instrument response
as a function of the known analyte concentration.
Ideally, three or more standard solutions are used in the
calibration process,
although in some routine determinations, two-point
calibrations can be reliable.
The calibration function can be obtained graphically or in
mathematical form.
Generally, a plot of instrument response versus known
analyte concentrations is used to produce a calibration
curve, sometimes called a working curve.
It is often desirable that the calibration curve be linear in at
least the range of the analyte concentrations.
A linear calibration curve of absorbance versus analyte
concentration is shown
Example 1
Calibration curve of absorbance
versus analyte concentration for a
series of standards. Data for
standards shown as solid circles.
The calibration curve is used in an
inverse fashion to obtain the
concentration of an unknown with
an absorbance of 0.505. The
absorbance is located on the line,
and then the concentration
corresponding to that absorbance is
obtained by extrapolating to the x-
axis (dashed lines). Residuals are
distances on the y-axis between the
data points and the predicted line as
shown in the inset.
The concentration found is then related back to the analyte
concentration in the original sample by applying appropriate
dilution factors from the sample preparation steps.
Computerized numerical data analysis has largely replaced
graphical calibration methods, which are now seldom used
except for visual confirmation of results.
Statistical methods, such as the method of least squares,
are routinely used to find the mathematical equation
describing the calibration function.
The Least-Square Method
The calibration curve shown in Example 1 is for the
determination of Ni(II) by reaction with excess thiocyanate
to form the absorbing complex ion Ni(SCN)+.
The ordinate is the dependent variable, absorbance, while
the abscissa is the independent variable, concentration of
Ni(II).
As is typical and usually desirable, the plot approximates a
straight line.
Note that, because of indeterminate errors in the
measurement process, not all the data fall exactly on the line.
Thus, the investigator must try to draw the “best” straight
line among the data points.
Regression analysis provides the means for objectively
obtaining such a line and also for specifying the
uncertainties associated with its subsequent use.
We consider here only the basic method of least squares
for two-dimensional data.
Assumptions of the Least-Squares Method
Two assumptions are made in using the method of least
squares.
The first is that there is actually a linear relationship
between the measured response y (absorbance in Example 1)
and the standard analyte concentration x.
The mathematical relationship that describes this
assumption is called the regression model, which may be
represented as
𝑦 = 𝑚𝑥 + 𝑏
where b is the y intercept (the value of y when x is zero),
and m is the slope of the line.
We also assume that any deviation of the individual points
from the straight line arises from error in the measurement.
That is, we assume there is no error in x values of the points
(concentrations).
Both of these assumptions are appropriate for many
analytical methods, but;
bear in mind that, whenever there is significant uncertainty
in the x data, basic linear least-squares analysis may not
give the best straight line.
In such a case, a more complex correlation analysis may
be necessary.
In addition, simple least-squares analysis may not be
appropriate when the uncertainties in the y values vary
significantly with x.
In that instance, it may be necessary to apply different
weighting factors to the points and perform a weighted
least-squares analysis.
Finding The Least-Square line
The least-squares procedure can be illustrated with the aid of
the calibration curve for the determination of Ni(II) shown in
Example 1.
Thiocyanate was added to the Ni(II) standards, and the
absorbances measured as a function of the Ni(II)
concentration.
The vertical deviation of each point from the straight line is
called a residual as shown in the inset.
The line generated by the least-squares method is the one
that minimizes the sum of the squares of the residuals for all
the points.
In addition to providing the best fit between the experimental
points and the straight line, the method gives the standard
deviations for m and b.
The least-squares method finds the sum of the squares of
the residuals SSresid and minimizes the sum using calculus.
The value of SSresid is found from
$$SS_{resid} = \sum_{i=1}^{N} \left[y_i - (b + mx_i)\right]^2$$
where N is the number of points used. The calculation of slope
and intercept is simplified when three quantities, Sxx, Syy,
and Sxy, are defined as follows:
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}$$
$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{N}$$
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\sum x_i \sum y_i}{N}$$
where xi and yi are individual pairs of data for x and y, N is
the number of pairs, and 𝑥ҧ and 𝑦ത are the average values for
x and y.
Note that Sxx and Syy are the sum of the squares of the
deviations from the mean for individual values of x and y.
The expressions shown on the far right in Equations above
are more convenient when a calculator without a built-in
regression function is used.
Six useful quantities can be derived from Sxx, Syy, and Sxy:
• The slope of the line, m:
$$m = \frac{S_{xy}}{S_{xx}}$$
• The intercept, b:
$$b = \bar{y} - m\bar{x}$$
• The standard deviation about regression, Sr:
$$s_r = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{N-2}}$$
The standard deviation of the slope, sm:
$$s_m = \sqrt{\frac{s_r^2}{S_{xx}}}$$
The standard deviation of the intercept, sb:
$$s_b = s_r \sqrt{\frac{\sum x_i^2}{N \sum x_i^2 - \left(\sum x_i\right)^2}} = s_r \sqrt{\frac{1}{N - \left(\sum x_i\right)^2 / \sum x_i^2}}$$
The standard deviation for results obtained from the
calibration curve, Sc:
$$s_c = \frac{s_r}{m} \sqrt{\frac{1}{M} + \frac{1}{N} + \frac{(\bar{y}_c - \bar{y})^2}{m^2 S_{xx}}}$$
The equation for $s_c$ gives us a way to calculate the standard
deviation from the mean $\bar{y}_c$ of a set of M replicate analyses
of unknowns when a calibration curve that contains N
points is used; $\bar{y}$ is the mean value of y for the N
calibration points.
This equation is only approximate and assumes that slope
and intercept are independent parameters, which is not
strictly true.
The standard deviation about regression sr is the standard
deviation for y when the deviations are measured not from
the mean of y (as is the usual case) but from the straight line
that results from the least-squares prediction.
The value of sr is related to SSresid by
$$s_r = \sqrt{\frac{\sum_{i=1}^{N} \left[y_i - (b + mx_i)\right]^2}{N-2}} = \sqrt{\frac{SS_{resid}}{N-2}}$$
In this equation the number of degrees of freedom is N – 2
since one degree of freedom is lost in calculating m and
one in determining b.
The standard deviation about regression is often called the
standard error of the estimate. It roughly corresponds to
the size of a typical deviation from the estimated regression
line.
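The quantities above can be collected into one Python sketch. The calibration data below are hypothetical, chosen only to exercise the formulas (in Excel, LINEST with its stats argument set to TRUE returns m, b, their standard deviations, and sr in one array):

```python
from math import sqrt

# Hypothetical calibration data, assumed for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
N = len(x)

x_bar, y_bar = sum(x) / N, sum(y) / N
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

m = S_xy / S_xx                                # slope
b = y_bar - m * x_bar                          # intercept
s_r = sqrt((S_yy - m ** 2 * S_xx) / (N - 2))   # std. dev. about regression
s_m = sqrt(s_r ** 2 / S_xx)                    # std. dev. of the slope

# Std. dev. of a result read off the curve, where y_c is the mean
# of M replicate readouts of the unknown
def s_c(y_c, M):
    return (s_r / m) * sqrt(1 / M + 1 / N + (y_c - y_bar) ** 2 / (m ** 2 * S_xx))
```

Note how s_c shrinks as M (replicates of the unknown) and N (calibration points) grow, and how it is smallest when the unknown's readout lies near the centroid of the calibration data.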
With computers, the calculations are typically done
using a spreadsheet program, such as Microsoft®
Excel
Exercise 1
Carry out a least-squares analysis of the calibration data for the
determination of isooctane in a hydrocarbon mixture provided
in the table below:
Calibration Data for the Chromatographic Determination of
Isooctane in a Hydrocarbon Mixture
Mole Percent Isooctane, xi    Peak Area, yi
0.352 1.09
0.803 1.78
1.08 2.60
1.38 3.03
1.75 4.01
Exercise 2
The calibration curve found in Exercise 1 was used for the
chromatographic determination of isooctane in a hydrocarbon
mixture. A peak area of 2.65 was obtained. Calculate the mole
percent of isooctane in the mixture and the standard deviation if
the area was (a) the result of a single measurement and (b) the
mean of four measurements.