Methods Chapter 2
Inference is the process of drawing conclusions about the characteristics of a population based on
data from a sample. In statistics, inference can be made in two ways, namely:
Estimation
Hypothesis testing
Estimation is one way of making inference about a population parameter when the
investigator does not have any prior notion about the values or characteristics of that
parameter. It is the process of using a sample statistic to estimate the value(s) of unknown
population parameters. In general, there are two types of estimation: point estimation and
interval estimation, as detailed below.
Point estimation is a procedure that results in a single value as an estimate for a parameter;
that is, a single value is calculated from the sample data to estimate the parameter. The
point estimates for a single population mean and proportion are as follows. The sample mean
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a point estimate for the population mean $\mu$.
For a proportion, suppose that we select a random sample of size $n$ from the population and let
$x$ be the number of objects/individuals in the sample possessing the characteristic of interest;
then $\hat{p} = x/n$ is the point estimate of the population proportion $P$.
The following are some qualities or properties of a good estimator of a population parameter:
It should be unbiased: its expected value must equal the value of the parameter being
estimated, i.e., $E(\hat{\theta}) = \theta$.
It should be consistent, in the sense that the estimator must get closer to the value of the
parameter as the sample size increases, i.e., $\hat{\theta} \to \theta$ as $n \to \infty$.
It should be relatively efficient: among the estimators of a parameter, the one with the
smallest variance is termed the relatively efficient estimator.
Interval estimation is the procedure that results in an interval of values as an estimate for a
parameter, an interval which contains the likely values of the parameter. It deals with
identifying the upper and lower limits of a
parameter. When we make confidence intervals for population parameter, we have to consider
Suppose that $X_1, X_2, \dots, X_n$ is a random sample from a population X having mean $\mu$ and
variance $\sigma^2$; then there are three different cases to be considered to construct
confidence intervals for the population mean ($\mu$), as given below:
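Case 1: If the population distribution is normal with known variance (the sample size may be
small or large)
$$P\left(-Z_{\alpha/2} < \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1-\alpha$$
$$P\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1-\alpha$$
From the standard normal table, the $Z_{\alpha/2}$ values corresponding to the most commonly used
confidence levels are:
$100(1-\alpha)\%$    $\alpha$    $\alpha/2$    $Z_{\alpha/2}$
90%                  0.10        0.05          1.645
95%                  0.05        0.025         1.96
99%                  0.01        0.005         2.58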
Solution
Hence, we can be 95% confident that the mean growth of the plants lies between 30.35 and
33.65 cm.
This implies that we can be 99% confident that the mean growth of the plants lies between
29.83 and 34.17 cm.
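Case 2: If the population distribution is not normal or is unknown, but the sample size is
large
Recall the Central Limit Theorem, which states that the sampling distribution of $\bar{X}$ has
mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, and
approaches a normal distribution as $n$ increases. A 100(1-α)% confidence interval for the
population mean is therefore $\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$.
But usually $\sigma^2$ is not known; in that case we estimate it by its point estimator $S^2$, and then
the interval becomes
$$\left(\bar{X} - Z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{S}{\sqrt{n}}\right)$$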
Exercise 2.1: A random sample of 625 households was drawn from a town and a survey
generated data on weekly expenditure on food. The mean in the sample was Birr 550 with a
standard deviation of Birr 90. Construct a 95% confidence interval for the population mean weekly
expenditure of households on food.
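The large-sample interval can be sketched in a few lines of Python. This is only an illustration of the Z-based formula, not part of the original notes: the function name `z_confidence_interval` is hypothetical, and the sample standard deviation is assumed to stand in for the unknown population standard deviation. The figures are those of Exercise 2.1, so the output can be used to check a hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Large-sample confidence interval for a population mean (hypothetical helper)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value Z_{alpha/2}
    margin = z * sd / sqrt(n)                # Z_{alpha/2} * S / sqrt(n)
    return mean - margin, mean + margin


# Figures from Exercise 2.1: n = 625, sample mean = 550, sample sd = 90
lower, upper = z_confidence_interval(mean=550, sd=90, n=625)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Because the critical value is computed rather than read from a table, the limits may differ in the last decimal from a calculation that uses 1.96.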
Case 3: If the population distribution is normal with unknown variance and the sample size is
small
A 100(1-α)% confidence interval for the population mean is:
$$\left(\bar{X} - t_{\alpha/2}(df = n-1)\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2}(df = n-1)\frac{S}{\sqrt{n}}\right)$$
In this case a table of the Student t distribution with n-1 degrees of freedom (refer to the
Student t distribution in chapter 5 of your lecture notes) is used rather than the standard
normal distribution.
Example 2. A drug company is testing a new drug which is supposed to reduce blood pressure.
For the six people used as subjects, the average drop in blood pressure
is 2.28 points, with a standard deviation of 0.95 points. Construct a 95% CI for the true mean
change in pressure, assuming a normal population distribution.
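A minimal sketch of the Case 3 calculation for Example 2 follows. It assumes the critical value is read from a Student t table, as the notes suggest: 2.571 is the tabulated $t_{0.025}$ value for 5 degrees of freedom. The function name `t_confidence_interval` is hypothetical.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_crit):
    """Small-sample CI for a mean; t_crit is t_{alpha/2} with n-1 df, taken from a table."""
    margin = t_crit * s / sqrt(n)
    return mean - margin, mean + margin


# Example 2: n = 6, sample mean = 2.28, S = 0.95
T_CRIT_DF5_95 = 2.571  # tabulated t_{0.025} for df = 5 (table value, not computed here)
lower, upper = t_confidence_interval(mean=2.28, s=0.95, n=6, t_crit=T_CRIT_DF5_95)
print(f"95% CI for the mean drop: ({lower:.2f}, {upper:.2f})")
```

Passing the table value in as an argument keeps the sketch free of third-party libraries; a statistics package could compute the same quantile directly.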
We discussed the sampling distribution of the sample mean in the previous section. However, there are
numerous real-world instances in business and other fields where data is collected in the form of
counts or is divided into two categories or groups based on an attribute. Examples include
dividing colony residents into two groups (male and female) based on characteristic sex, dividing
hospital patients into two groups based on whether they have cancer or not, and dividing a batch
of goods into defective and non-defective categories, among others.
Such data are typically evaluated in terms of the proportion of components, people, units, or
products that possess a particular characteristic or quality (a "success"). Illustrations include
the proportion of females in a population, the proportion of a hospital's patients who have
cancer, and the proportion of goods in a lot that are defective. Instead of dealing with the
population mean in these circumstances, we deal with the population proportion.
Often the population proportion is unknown and the population is too large for the proportion
to be determined directly. In this case, the sampling distribution of the sample proportion is
needed in order to draw conclusions about the population proportion.
To obtain the sampling distribution of the sample proportion, we draw all possible samples from
the population and for each sample we calculate the sample proportion as
$$p = \frac{x}{n}$$
where x is the number of observations in the sample which have the particular characteristic
under study and n is the sample size.
Suppose, there is a lot of 3 cartons A, B & C of electric bulbs and each carton contains 20 bulbs.
The number of defective bulbs in each carton is given below:
Carton    Number of Defective Bulbs
A         2
B         4
C         1
5|Page Statistical Methods
The population proportion of defective bulbs can be obtained as
$$P = \frac{2+4+1}{20+20+20} = \frac{7}{60} \approx 0.117$$
Now, let us assume that we do not know the population proportion of defective bulbs, so we
decide to estimate it on the basis of samples of size n = 2.
There are $3^2 = 9$ possible samples of size 2 with replacement. All possible samples and
their respective proportions of defectives are given in the following table:
Thus, we have seen that the mean of the sample proportions is equal to the population proportion.
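The enumeration can be reproduced with a short script. This is a sketch under one stated assumption: the proportion for a sample of two cartons is taken to be the number of defective bulbs in both cartons divided by the 40 bulbs they contain. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

BULBS_PER_CARTON = 20
defectives = {"A": 2, "B": 4, "C": 1}  # defective bulbs per carton, from the table

# Population proportion of defective bulbs
population_p = Fraction(sum(defectives.values()), BULBS_PER_CARTON * len(defectives))

# All samples of size 2 drawn with replacement, with their sample proportions
sample_props = {}
for pair in product(defectives, repeat=2):
    x = sum(defectives[c] for c in pair)            # defectives in the sampled cartons
    sample_props[pair] = Fraction(x, 2 * BULBS_PER_CARTON)

mean_of_props = sum(sample_props.values()) / len(sample_props)

for pair, p in sample_props.items():
    print(pair, p)
print("mean of sample proportions:", mean_of_props)
print("population proportion:     ", population_p)
```

Exact fractions are used so that the equality between the mean of the sample proportions and the population proportion is not blurred by rounding.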
As already mentioned in the previous unit, finding the mean, variance and standard
error by this process is tedious, so we calculate them by a short-cut method when the
population proportion is known.
The mean and variance of the sampling distribution of the sample proportion are then given by
$$E(p) = P \quad\text{and}\quad Var(p) = \frac{PQ}{n}, \quad\text{where } Q = 1 - P$$
If the sampling is done without replacement from a finite population, then the mean and variance
of the sample proportion are given by
$$E(p) = P$$
and
$$Var(p) = \frac{N-n}{N-1}\cdot\frac{PQ}{n}$$
where N is the population size and the factor (N-n)/(N-1) is called the finite population correction.
If sample size is sufficiently large, such that np > 5 and nq > 5 then by central limit theorem, the
sampling distribution of sample proportion p is approximately normally distributed with mean P
and variance PQ/n where, Q = 1‒ P.
Example: A machine produces a large number of items of which 15% are found to be defective.
If a random sample of 200 items is taken from the population and sample proportion is calculated
then find
a) Mean and standard error of sampling distribution of proportion.
b) The probability that less than or equal to 12% defectives are found in the sample.
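Solution: Here, we are given that P = 0.15 (so Q = 1 - P = 0.85) and n = 200.
a) We know that when the sample size is sufficiently large, such that np > 5 and nq > 5, the
sample proportion p is approximately normally distributed with mean P and variance PQ/n,
where Q = 1 - P. Here the sample proportion is not given, so we check the conditions with
P: nP = 200 × 0.15 = 30 > 5 and nQ = 200 × 0.85 = 170 > 5, so the normal approximation
holds. The mean and standard error of the sampling distribution of the sample proportion are
given by
$$E(p) = P = 0.15, \qquad SE(p) = \sqrt{\frac{PQ}{n}} = \sqrt{\frac{0.15 \times 0.85}{200}} \approx 0.0252$$
b) The probability that the sample proportion will be less than or equal to 12% defectives is
given by
$$P(p \le 0.12)$$
To get the value of this probability, we convert the random variate p into the standard normal
variate Z by the transformation
$$Z = \frac{p - P}{\sqrt{PQ/n}}$$
so that
$$P(p \le 0.12) = P\left(Z \le \frac{0.12 - 0.15}{0.0252}\right) = P(Z \le -1.19) \approx 0.117$$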
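The two parts of this example can be checked numerically. The sketch below assumes the normal approximation described above and uses Python's standard library normal distribution in place of a printed table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

P, n = 0.15, 200          # population proportion of defectives and sample size
Q = 1 - P

# a) mean and standard error of the sampling distribution of the proportion
mean_p = P
se_p = sqrt(P * Q / n)

# b) P(p <= 0.12) via the standard normal transformation Z = (p - P) / SE
z = (0.12 - P) / se_p
prob = NormalDist().cdf(z)

print(f"mean = {mean_p}, standard error = {se_p:.4f}")
print(f"Z = {z:.2f}, P(p <= 0.12) = {prob:.4f}")
```

The probability is computed from the unrounded Z value, so it can differ slightly in the third or fourth decimal from a table lookup at Z rounded to two places.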
Exercise: In a university with 5000 students, 38 students in a random sample of 250 are
found to be left-handed. Construct a 95% CI for the true population proportion of left-handed
students in the university.
Exercise: In a survey of diabetics in a large city, it was found that 100 out of 400 persons have
diabetic foot. Construct a 95% CI for the true proportion of diabetics in the city who have diabetic foot.
In the previous sections, we discussed the sampling distributions of the sample mean and the
sample proportion. But many practical situations are concerned with variability. For example, a
manufacturer of steel ball bearings may want to know about the variation in the diameter of the
bearings, a life insurance company may be interested in the variation in the number of policies
in different years, etc. Therefore, we need information about the sampling distribution of the
sample variance.
To describe the sampling distribution of the sample variance, we consider all possible samples
of the same size, say n, taken from a population having variance $\sigma^2$, and for each sample we
calculate the sample variance $S^2$. The values of $S^2$ may vary from sample to sample, so we construct
the probability distribution of the sample variances. The probability distribution thus obtained is
known as the sampling distribution of the sample variance. Therefore, the sampling distribution of
the sample variance can be defined as:
“The probability distribution of all values of the sample variance that would be obtained by
drawing all possible samples of the same size from the parent population is called the sampling
distribution of the sample variance.”
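Theorem: If $X_1, X_2, \dots, X_n$ are observations of a random sample of size n from a normal
population with mean $\mu$ and variance $\sigma^2$, with sample mean $\bar{X}$ and sample variance $S^2$,
then
1) $\bar{X}$ and $S^2$ are independent, and
2) $\frac{(n-1)S^2}{\sigma^2}$ follows a chi-square distribution with n-1 degrees of freedom.
Proof:
We leave out the proof of number 1, because it is beyond the scope of the course.
So, we'll just have to state it without proof.
For number 2, consider
$$W = \sum_{i=1}^{n}\left(\frac{X_i-\mu}{\sigma}\right)^2 = \sum_{i=1}^{n}\left(\frac{(X_i-\bar{X}) + (\bar{X}-\mu)}{\sigma}\right)^2$$
, by adding and subtracting $\bar{X}$. Expanding the square, the cross term vanishes because
$\sum_{i=1}^{n}(X_i-\bar{X}) = 0$, so
$$W = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{\sigma^2} + \frac{n(\bar{X}-\mu)^2}{\sigma^2}$$
We can do a bit more with the first term of W. As an aside, if we take the definition of the
sample variance:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$$
So, the numerator in the first term of W can be written as a function of the sample variance, i.e.:
$$\sum_{i=1}^{n}\left(\frac{X_i-\mu}{\sigma}\right)^2 = \frac{(n-1)S^2}{\sigma^2} + \frac{n(\bar{X}-\mu)^2}{\sigma^2}$$
The term on the left side of the equation is a sum of n independent chi-square(1) random variables.
That's because we have assumed that $X_1, X_2, \dots, X_n$ are observations of a random sample of
size n from the normal distribution, i.e. $X_i \sim N(\mu, \sigma^2)$. Therefore, $\frac{X_i-\mu}{\sigma}$ follows a
standard normal distribution. Now, recall that if we square a standard normal random variable, we
get a chi-square random variable with 1 degree of freedom. So, again:
$$\frac{n(\bar{X}-\mu)^2}{\sigma^2} = \left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\right)^2$$
is a chi-square(1) random variable. That's because the sample mean is normally
distributed with mean $\mu$ and variance $\sigma^2/n$, so
$$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$$
is a standard normal random variable. So, if we square Z, we get a chi-square random variable
with 1 degree of freedom:
$$Z^2 = \frac{n(\bar{X}-\mu)^2}{\sigma^2} \sim \chi^2(1)$$
Hence the left side of the equation is chi-square(n), while the second term on the right is
chi-square(1) and, by number 1, independent of the first term. It follows that
$\frac{(n-1)S^2}{\sigma^2}$ is chi-square with n-1 degrees of freedom.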
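A small simulation can make the result about the sample variance concrete. The sketch below is an illustration only: it draws repeated normal samples and checks that $(n-1)S^2/\sigma^2$ has mean close to n-1 and variance close to 2(n-1), the mean and variance of a chi-square distribution with n-1 degrees of freedom. The function name, the population values (mean 10, standard deviation 2), the sample size and the number of repetitions are arbitrary assumptions made for the demonstration.

```python
import random
from statistics import mean, variance


def scaled_sample_variances(n, mu, sigma, reps, seed=0):
    """Simulate (n-1) * S^2 / sigma^2 for `reps` normal samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        values.append((n - 1) * variance(sample) / sigma**2)
    return values


n = 5
w = scaled_sample_variances(n=n, mu=10.0, sigma=2.0, reps=10000)
print("simulated mean:    ", round(mean(w), 3), " theory:", n - 1)
print("simulated variance:", round(variance(w), 3), " theory:", 2 * (n - 1))
```

A simulation does not prove the theorem, but agreement of the first two moments with n-1 and 2(n-1) is what the chi-square result predicts.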
Hypothesis: an assertion or statement about the population parameter(s) whose plausibility is
to be evaluated on the basis of the sample data. Hypothesis testing is also one way of making
inference about a population parameter, used when the investigator has a prior notion about the
value of the parameter. There are many situations in which we have to make decisions based on
observations or data that are random variables. The theory behind the solutions for these
situations is known as decision theory or hypothesis testing. In this part we present a brief
view of hypothesis testing about the value of a single population characteristic. In any
hypothesis testing problem there are two contradictory hypotheses. These are:
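Null hypothesis: It is the hypothesis of no difference, and it is the hypothesis to be tested.
Usually denoted by H0.
Alternative hypothesis: It is the hypothesis of difference.
Usually denoted by H1 or Ha.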
Test statistic: a statistic whose value serves to determine whether to reject or not reject the
hypothesis to be tested. It is a random variable.
Hypothesis testing is a method for using sample information to decide whether the null
hypothesis is rejected or not. The null hypothesis will be rejected in favor of the alternative
hypothesis only if the sample evidence suggests that the null hypothesis is false. So, in a
hypothesis testing problem we will make one of two decisions:
I. Reject the null hypothesis and accept the alternative hypothesis, or
II. Fail to reject the null hypothesis, because the sample does not provide sufficient evidence for the alternative hypothesis.
Step3. Based on the sampling distribution of the appropriate sample statistic, evaluate the test statistic.
Step4. Based on the sampling distribution of the appropriate sample statistic, identify the critical or
rejection regions. Critical or rejection regions are the set of all test statistic values for
which the null hypothesis is rejected.
Step5. Make decision: Decide whether to accept or reject H0. We will reject H0 if and only if the
observed or computed value of the test statistic falls in the rejection region.
Step6. Draw conclusion: Based on the decision we made, we have to draw a conclusion about the
population characteristics using the information obtained from the sample evidence.
If $\mu_0$ denotes the hypothesized value of the population mean $\mu$, then one can formulate two
sided (1) and one sided (2 and 3) hypotheses as follows:
1. $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$
2. $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$
3. $H_0: \mu = \mu_0$ versus $H_1: \mu < \mu_0$
Then, the choice of the test statistic depends on the three different cases considered in constructing
confidence intervals for the population mean, as given below:
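Case 1: When sampling is from a normal distribution with known variance
The test statistic is computed as:
$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
After specifying $\alpha$, the critical/tabulated values from the standard normal distribution table
corresponding to the above three hypotheses are $Z_{\alpha/2}$, $Z_{\alpha}$ and $-Z_{\alpha}$ for two
sided at (1) and one sided at (2) and (3), respectively. For instance, at the common choice of
$\alpha = 0.05$ the critical/tabulated values from the standard normal distribution table corresponding
to the above three hypotheses are:
Under Ha            $\alpha$    Critical Values
$\mu \neq \mu_0$    0.05        $Z_{\alpha/2} = 1.96$
$\mu > \mu_0$       0.05        $Z_{\alpha} = 1.645$
$\mu < \mu_0$       0.05        $-Z_{\alpha} = -1.645$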
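By comparing the calculated value of the test statistic with the critical values from a table, the
decision rules corresponding to the above three hypotheses, two sided at (1) and one sided at (2) and (3), are:
Under Ha            Reject H0 if                   Accept H0 if                   Inconclusive if
$\mu \neq \mu_0$    $|Z_{cal}| > Z_{\alpha/2}$     $|Z_{cal}| < Z_{\alpha/2}$     $Z_{cal} = Z_{\alpha/2}$ or $Z_{cal} = -Z_{\alpha/2}$
$\mu > \mu_0$       $Z_{cal} > Z_{\alpha}$         $Z_{cal} < Z_{\alpha}$         $Z_{cal} = Z_{\alpha}$
$\mu < \mu_0$       $Z_{cal} < -Z_{\alpha}$        $Z_{cal} > -Z_{\alpha}$        $Z_{cal} = -Z_{\alpha}$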
Example 1: The mean life time of a sample of 36 light bulbs produced by a company is
computed to be 1570 hours. The life time of light bulbs produced by the company
follows a normal distribution with a standard deviation of 120 hours. Suppose the hypothesized
value for the population mean is 1600 hours. Can we conclude that the mean life time of the light
bulbs is different from 1600 hours? (Use $\alpha = 0.05$)
Solution: Let μ be the population mean and μo = 1600 the hypothesized population mean.
$H_0: \mu = 1600$ vs $H_1: \mu \neq 1600$
With $\alpha = 0.05$, the test statistic is $Z_{cal} = \frac{1570 - 1600}{120/\sqrt{36}} = -1.5$ and the
two sided critical value is $Z_{\alpha/2} = 1.96$.
Step 5: Decision
At the 5% level of significance we do not reject H0, since the absolute value of the calculated Z
test statistic (|Zcal| = 1.5) is not greater than the tabulated value of Z ($Z_{\alpha/2} = 1.96$).
Step 6: Conclusion
Thus, based on the above decision in (5), the sample is consistent with an average life time of
1600 hours for the population of light bulbs. In other words, at the 5% level of significance, we
conclude that there is no evidence that the life time of the light bulbs is different from 1600
hours, based on the given sample data.
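The two sided Z test in this example can be sketched as follows. The function name `z_test_mean` is hypothetical, and the critical value is computed from the standard normal distribution rather than read from a table, so it appears as 1.96 only after rounding.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(xbar, mu0, sigma, n, alpha=0.05):
    """Two sided Z test for a single mean with known sigma (hypothetical helper)."""
    z_cal = (xbar - mu0) / (sigma / sqrt(n))       # calculated test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}
    reject = abs(z_cal) > z_crit                   # decision rule for H1: mu != mu0
    return z_cal, z_crit, reject


# Light bulb example: n = 36, sample mean = 1570, sigma = 120, mu0 = 1600
z_cal, z_crit, reject = z_test_mean(xbar=1570, mu0=1600, sigma=120, n=36)
print(f"Z_cal = {z_cal:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

For a one sided alternative the same statistic would be compared with $Z_{\alpha}$ or $-Z_{\alpha}$ instead, as in the decision rule table.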
Exercise 1: A researcher claims that the average wind speed in a certain city is 8 miles per hour.
A sample of 32 days has an average wind speed of 8.2 miles per hour. The wind speed is
normally distributed with the hypothesized mean and a population standard deviation of
0.6 miles per hour. At $\alpha$ = 0.05, is there enough evidence to reject the claim?
Case 2: When sampling is from a non-normally distributed population or from a
population whose distribution is unknown, but sample size is large
If the sample size is large, one can test a hypothesis about a single population mean by
using the Z test statistic, computed as:
$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}, \quad \text{if } \sigma \text{ is known}$$
$$Z_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}, \quad \text{if } \sigma \text{ is unknown}$$
The critical/tabulated values and decision rules are obtained in the same way as in case 1 above.
Exercise 1: A random sample of 400 households was drawn from a town and a survey generated
data on weekly earnings. The mean in the sample was Birr 250 with a standard deviation of Birr 80.
Test the hypothesis that the average weekly earnings is Birr 280 at the 5% level of significance and
also construct a 95% confidence interval for the population mean earnings.
Case 3: When sampling is from a normal distribution with $\sigma^2$ unknown and sample size is
small
If the sample size is small, one can test a hypothesis about a single population mean by
using the t test statistic, computed as:
$$t_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t \text{ distribution with } n-1 \text{ degrees of freedom}$$
After specifying $\alpha$, the critical/tabulated values from the Student t distribution table
corresponding to the above three hypotheses are $t_{\alpha/2}(df = n-1)$,
$t_{\alpha}(df = n-1)$ and $-t_{\alpha}(df = n-1)$ for two sided at (1) and one sided at (2) and (3),
respectively.
By comparing the calculated value of the test statistic with the critical values from a table, the
decision rules corresponding to the above three hypotheses, two sided at (1) and one sided at (2) and (3), are:
Under Ha            Reject H0 if                           Accept H0 if                           Inconclusive if
$\mu \neq \mu_0$    $|t_{cal}| > t_{\alpha/2}(df = n-1)$   $|t_{cal}| < t_{\alpha/2}(df = n-1)$   $t_{cal} = t_{\alpha/2}(df = n-1)$ or $t_{cal} = -t_{\alpha/2}(df = n-1)$
$\mu > \mu_0$       $t_{cal} > t_{\alpha}(df = n-1)$       $t_{cal} < t_{\alpha}(df = n-1)$       $t_{cal} = t_{\alpha}(df = n-1)$
$\mu < \mu_0$       $t_{cal} < -t_{\alpha}(df = n-1)$      $t_{cal} > -t_{\alpha}(df = n-1)$      $t_{cal} = -t_{\alpha}(df = n-1)$
Example 1: Test the hypothesis that the average weight gain of sheep from a certain diet after 6
months of feeding is 10 kilograms, if the weight gains of a random sample of 10 sheep are 10.2, 9.7,
10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 kilograms. Use the 0.01 level of significance and
assume that the distribution of weight gain is normal.
Solution: Let μ be the population mean and μo = 10.
From the sample data, the sample mean and standard deviation are computed to be:
$\bar{X} = 10.06$, $S = 0.25$
Step 1: Identify the appropriate hypothesis
$H_0: \mu = 10$ vs $H_1: \mu \neq 10$
With $\alpha = 0.01$ and n - 1 = 9 degrees of freedom, the test statistic is
$t_{cal} = \frac{10.06 - 10}{0.25/\sqrt{10}} \approx 0.76$, which is smaller than the tabulated value
$t_{0.005}(df = 9) = 3.250$, so H0 is not rejected.
Thus, based on the above decision in (5), the sample is consistent with an average weight gain of
10 kilograms for sheep on this diet. In other words, at the 1% level of significance, we have no
evidence to say that the average weight gain of sheep from a certain diet is different from 10
kilograms, based on the given sample data.
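The sheep example can be reworked from the raw data with a short sketch. It assumes the critical value is read from a Student t table (3.250 is the tabulated $t_{0.005}$ value for 9 degrees of freedom); the variable names are illustrative. Using the unrounded standard deviation gives a test statistic of about 0.77, slightly different from the roughly 0.76 obtained with S rounded to 0.25.

```python
from math import sqrt
from statistics import mean, stdev

gains = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]  # weight gains in kg
mu0 = 10          # hypothesized mean weight gain
T_CRIT = 3.250    # tabulated t_{0.005} for df = 9 (table value, not computed here)

n = len(gains)
xbar = mean(gains)
s = stdev(gains)                      # sample standard deviation (divisor n - 1)
t_cal = (xbar - mu0) / (s / sqrt(n))  # calculated t statistic
reject = abs(t_cal) > T_CRIT          # two sided decision rule at alpha = 0.01

print(f"mean = {xbar:.2f}, S = {s:.4f}, t_cal = {t_cal:.3f}, reject H0: {reject}")
```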
Exercise 1: A manufacturer has developed a new fishing line, which the company claims has a
mean breaking strength of 15 kilograms. To test a claim about the mean a random sample of 25
lines was tested and their average was computed to be 14 with standard deviation of 0.5
kilograms. Test the hypothesis that μ = 15 kilograms against the alternative that μ≠15 kilograms
assuming that breaking strength follows normal distribution.
Hypothesis testing about a single population proportion deals with comparing a single sample
with a population value, and tests whether the proportion of a single population differs from a
specified constant. For a two-tailed test of a proportion, the hypothesis to be tested is
H0: p = p0 versus HA: p ≠ p0, where p is the population proportion and p0 is the hypothesized
value; the other possible alternatives are HA: p > p0 and HA: p < p0. If the hypothesized value of
the population proportion is given, then the test statistic about a single population proportion
takes the form:
$$Z_{cal} = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} \sim N(0,1)$$
where $q_0 = 1 - p_0$, provided that the sample size is large, i.e., np and nq are both at least 5.
The critical value is obtained from the standard normal table. The steps involved and the
decision rule in testing a hypothesis about a single population proportion remain the same as those
for testing a hypothesis about a single population mean under case 1 above.
Exercise 1: Out of 146 children examined for hearing disability at School-Z, 21 were found to
have some type of hearing abnormality. Is this consistent with the statement that 20% of these
school children have an abnormality?
Exercise 2: In a survey of diabetics in a large city, it was found that 100 out of 400 have
diabetic foot. Can we conclude that 20 percent of diabetics in the sampled population have
diabetic foot? Test at the $\alpha$ = 0.05 significance level.
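One way to check a hand calculation for these exercises is the sketch below of the single proportion Z test. The function name `z_test_proportion` is hypothetical, the test is assumed to be two sided, and the call uses the figures of Exercise 2.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(x, n, p0, alpha=0.05):
    """Two sided Z test for a single proportion (hypothetical helper)."""
    p_hat = x / n                                   # sample proportion
    z_cal = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # test statistic under H0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # Z_{alpha/2}
    return p_hat, z_cal, z_crit, abs(z_cal) > z_crit


# Exercise 2: 100 out of 400 diabetics have diabetic foot, p0 = 0.20
p_hat, z_cal, z_crit, reject = z_test_proportion(x=100, n=400, p0=0.20)
print(f"p_hat = {p_hat:.2f}, Z_cal = {z_cal:.2f}, critical = {z_crit:.2f}, reject H0: {reject}")
```

The standard error in the denominator uses the hypothesized value p0, not the sample proportion, because the statistic is evaluated under the null hypothesis.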
Sample size determination is closely related to statistical estimation. Quite often you ask: how
large a sample is necessary to make an accurate estimate? The answer is not simple, since it
depends on three things: the margin of error, the population standard deviation, and the degree of
confidence. For example, how close to the true mean do you want to be (2 units, 5 units, etc.),
and how confident do you wish to be (90, 95, 99%, etc.)?
When sample data are used to estimate a population mean μ, the margin of error, denoted by E, is
the maximum likely (with probability 1-α) difference between the observed sample mean and the
true value of the population mean. The margin of error is also called the maximum error of the
estimate and can be found by multiplying the critical value by the standard deviation of the estimator.
Sample size needed in comparing two means from independent samples
From confidence interval estimation for a single population mean ($\mu$), the margin of error, here
denoted by B, is the maximum likely (with probability 1-α) difference between the observed sample
mean and the true value of the population mean:
$$B = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad\Rightarrow\quad n = \frac{Z_{\alpha/2}^2\,\sigma^2}{B^2}$$
where $\sigma^2$ is the population variance, to be taken from a previous study or estimated from a pilot study.
Also, from confidence interval estimation for a single population proportion (P), the margin of
error, denoted by B, is the maximum likely (with probability 1-α) difference between the observed
sample proportion and the true value of the population proportion:
$$B = Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}} \quad\Rightarrow\quad n = \frac{Z_{\alpha/2}^2\,P(1-P)}{B^2}$$
where P is the population proportion, to be taken from a previous study or estimated from a pilot study.
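The sample size calculation can be sketched as below, assuming the usual relations obtained by solving the margin of error for n: n = (Z σ / B)² for a mean and n = Z² P(1-P) / B² for a proportion. The function names and the input values (σ = 90 with B = 5, and P = 0.5 with B = 0.05) are hypothetical illustrations, not figures from the notes. The result is rounded up, since a sample size must be a whole number that does not exceed the target margin.

```python
from math import ceil
from statistics import NormalDist


def _z_critical(confidence):
    """Z_{alpha/2} for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_mean(sigma, margin, confidence=0.95):
    """Smallest n with Z * sigma / sqrt(n) <= margin (hypothetical helper)."""
    z = _z_critical(confidence)
    return ceil((z * sigma / margin) ** 2)


def sample_size_proportion(p, margin, confidence=0.95):
    """Smallest n with Z * sqrt(p(1-p)/n) <= margin (hypothetical helper)."""
    z = _z_critical(confidence)
    return ceil(z**2 * p * (1 - p) / margin**2)


# Illustrative inputs only
print(sample_size_mean(sigma=90, margin=5))          # mean within 5 units, 95% confidence
print(sample_size_proportion(p=0.5, margin=0.05))    # proportion within 0.05, 95% confidence
```

When no prior estimate of P is available, P = 0.5 is a common conservative choice because it maximizes P(1-P) and therefore the required sample size.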