Intro to Descriptive Statistics
Intro to Descriptive Statistics
1.0 INTRODUCTION
All the spreadsheet figures used in this file has been put in a file. You can download
this file from the BCA pages of IGNOU’s website.
1.1 OBJECTIVES
After performing the activities of this section, you should be able to:
5
Statistical
Techniques Lab 1.2 FREQUENCY DISTRIBUTION OF A VARIABLE
A statistical study is conducted to generate valid conclusions about the problem under
investigation. Recall from BCS040: Block 1 Unit 1 to appreciate the fact that relevant
data is necessary for this purpose. Further, data can be collected either for the first
time (primary data) directly by the investigator or from other sources already
available (secondary data) in the form of some published work or from Government
data resources etc. Broadly, statistics deals with data collection, data analysis and
interpretation of results, with Statistical inference playing a role. Some of the key
definitions used from this view point are given below: (Please refer to BCS-040:
Block 1 Unit 1, for details).
Population and Sample: The set of all the observations relating to the problem under
investigation consists of the population. The sample on the other hand, consists of
data actually collected from the few units selected from the population. For example,
if we want to find the average income of adults in our country, the population consists
of the data on income of every adult in the country. However, collecting and
analyzing such huge data is difficult or impossible. Thus, we may take only a small
part (See Book 3 Chapter 5) of the observations in the population, which is called a
sample.
Discrete or discontinuous Variables: Consider an example of marks scored by
students in statistics out of 100 that are awarded as whole numbers. A variable of this
type that can take distinct, finite or countably infinite values is called a discrete
variable.
Continuous Variables: The continuous variables on the other hand can take any
value between a low and high value. Measures of height, weight etc. are examples of
this type.
Frequency Distribution: It is the most common method for summarizing and
presentation of data, which enables us to quickly assess how frequently any value
occurs in the given data set.
You must read Unit 1 of Block 1 of BCS-040 for more definitions and examples. In
the following example the use of spreadsheet package is demonstrated for the purpose
of generating a frequency distribution.
Example 1: (Data used for this example has been taken from Block 1 Unit 1 of BCS
040 example 7, table 3 with five modifications) Table 3 shows the lives of 100 electric
bulbs. You are required to construct a frequency distribution and create histogram
from this data using Spreadsheet package.
Solution: Figure 1 shows the spreadsheet (for the example, MS-Excel has been used,
but you are free to use any spreadsheet package). Enter the given data in the cells
A2….E21, that needs to be used to construct the frequency distribution.
(Read page 15, Unit 1 of Block 1 of BCS-040). In the next step, you are required to
specify class intervals to construct a frequency distribution. In this context, observe
that the minimum and maximum values for the raw data are 511.6 and 1314.7
respectively (use min(A2:E21), max(A2:E21)), as these are useful information in
deciding about the classes to be formed. Here, the data under bins shows the class
categories and they can be read as 0 to 510.5 (included), 510.6 to 590.5 (included) and
so on. The last category is 1390.6 and above. Please note that for the purpose of
calculation of frequencies, the upper limits of the classes (H2....H13) alone are
required.
Once the basic data is entered, you need to calculate the Frequency distribution for
these class intervals. For this, you need to enter an array formula in the spreadsheet.
An array formula performs calculations on an array of data and may produce an array
of results. To enter array formula, you may need to press CTRL+SHIFT+ENTER into
a worksheet. The formula for calculating frequency is also an array formula. To enter
6
this formula, perform the following steps: Descriptive Statistics
and Statistical
Select the cells in which result is expected in our case these are J2....J14.
Inferences
Enter the formula: =frequency(A2:E21,H2:H13) and press
CTRL+SHIFT+ENTER keys.
You can also find the Cumulative frequencies, use simple addition formula for
this purpose in cells K2….K14. Write your own formula for this purpose.
On completion of these operations, the screen will look like as that of Figure 1.
Please note that you may also use the Struges’ rule that enables you to
calculate the number of classes (k) in terms of total number of observations
(N) as k = [1 + log2N], where [….] is the celling operator (see help on function
CEILING(…,…)). Can you redo the task above using this method? Further,
discuss with other students or counselors how would it help in developing a
C-program for construction of frequency distribution without user
intervention.
7
Statistical transfer the column H data to G by incrementing previous G data by 0.1. Now to make
Techniques Lab the histogram perform the following steps:
1. Select Data Data Analysis and then select Histogram in the resulting dialog
box and press Ok button.
2. In the resulting dialog box, set the Input Range to A2..E21, Bin Range to G2..G13
and Output Range to I1..J14. Please do not forget to check the Chart Output check
box.
3. Click Ok
The Histogram will be displayed as shown in the Figure 2. You need to modify the
format of the histogram to make it look as per your need. For example, to remove
gaps in between bars:
Notice that horizontal axis labels now depict the upper class boundaries having class
width of 80 hours in Figure 2. Once again compare this histogram to the figure that
has been made in the Unit. Can you now plot the frequency polygon and the less than
type to give for the same frequency distribution (refer to Block 1, unit 1,
section 1.2.3, page 17). You can also plot Bar diagram for appropriate data as shown
in the Unit.
You can also explore the fact that all spreadsheet packages allow you to modify the
Title of the chart, Axis titles, colour of bars, size of bars, size of chart and many other
formatting features. For this you should refer to documentation of the spreadsheet
package that you are using.
8
Descriptive Statistics
and Statistical
1.3 SUMMARISATION OF DATA Inferences
Two important statistical measures that are used to summarise of data are:
measures of central tendency;
measure of dispersion
Let us discuss them with the help of the example given in BCS-040, Block 1, Unit 1
page 30-31. The following figure shows an implementation similar to Table 14 of the
said block/unit.
Please note that we have used the following procedure in the worksheet to arrive at the
result shown in Figure 3:
In the cell I2 to I14, you can use the array formula as explained in the last section
to calculate the frequencies of each class.
For calculating the Average life in hours (xj) or the class mid-points, you may
enter in the cell J2 the formula =(G2+H2)/2.
Copy this formula from cell J3 to J13. Notice that you need to find the value for
cell J14 using different formula.
Calculate the value of yj in the cell K2 as per the formula =(J2-$J$8)/80. Copy
this formula to cells K3 to K14.
Now enter the formula for calculating yj fj and (you can do it yourself).
Enter the formula for the µ’(see page 30, Unit 1) - the mean for x-observation,
variance for x-observations and Standard deviation for x -observation.
9
Statistical You can compare the results so obtained with the results that you can obtain by
Techniques Lab directly applying the formula of mean (AVERAGE), median (MEDIAN) and standard
deviation (STDEV) on the cell range A2:E21 containing the original data. Discuss
with other students or counselors, why are there differences in the values of mean and
standard deviations. Please note that the values so calculated are different than the
values as shown in the Table 14 of the course BCS040.
3) Create the measures of central tendencies and measure of dispersion for this
data.
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
Lab Session 1
1) Construct a frequency distribution, plot bar graph, and find the mean and
standard deviation of the data on marks of the 100 students as given in
BCS040/Block 1/Unit 1 Table 2 (page 11) using spreadsheet package.
2) Develop a program in C to construct a frequency distribution and hence
calculate mean & standard deviation for the data mentioned in problem 1.
Your program should make provision to input data from a file and output
results on the screen (you may also use Struges’ rule).
3) Using spreadsheet package generate hypothetical data on marks of 10
different students in 5 different subjects. You may use the spreadsheet
package function RANDBETWEEN(bottom, top) for this purpose. Find the
average (µ) standard deviation of marks (σ) and calculate grades of the
students on that basis of the following: A student who receives marks in the
range of µ± σ is awarded grade B. A student who gets marks above µ+ σ is
awarded an A grade. The student getting marks below µ – σ is awarded a C
grade. Make suitable assumptions, if any.
10
Descriptive Statistics
and Statistical
1.4 SAMPLING DISTRIBUTIONS Inferences
This section discusses about the basis of various tests that you can perform for
hypothesis testing. You should go through BCS-040/ Block 2 before going through
this section. Let us first revise some of the terms used in that Block.
Population and Sample: In statistics the term population refers to a set or collection
of observations relating to the phenomenon under investigation. Thus, the statistical
population or simply population comprises all the observations or measurements,
relating to the phenomenon under investigation, which can be collected. The
population can be finite or infinite. In an infinite population it is not possible to
observe the measurements on all the units; even in the case of a finite population it
may not be economical or feasible to observe the values from all the units of the
population. Thus, a representative set of units from the population are chosen using a
statistically valid procedure and measurements or observations are made from these
selected units. This subset of observations comprises the sample that are selected for
performing statistical analysis (please refer to Part II of Book 3).
Statistic: A function of sample of observations, which does not contain any unknown
parameter and whose values can be observed, is called a statistic. It is denoted by
. For example, T = is a statistic, as its values can be
observed and it does not contain any unknown parameter; T = + µ is not a statistic
when µ is unknown.
The Sampling distribution of a Statistic refers to the list of all possible values of the
statistic and its associated probability distribution.
Let us try to show you the use of Spreadsheet for calculating the sampling
distribution. Consider the BCS-040/Block 2, Unit 4, Page 9 Example 1, using which
the notion of population mean, sample, sample mean, list of all possible samples and
mean of sample mean or grand mean are presented.
Tasks:
List all possible samples of size n = 2 drawn from the population without
replacement
Calculate the mean and variance of each sample
Construct the sampling distribution of the sample mean
Calculate the mean and variance of this distribution
Note that in sampling without replacement, a total of 4C2 = 6 possible samples can be
drawn from the population. The results related to the tasks stated above are shown in
Figure 4. In order to generate the content of Figure 4, you can use your own formulas
by following the steps below:
Create the column F of sample elements
Calculate the sample means in columns G
Hence, construct the list of all possible values of sample mean in column I.
11
Statistical Use the array formula =FREQUENCY(G3:G8,I3:I7) in column J to get the
Techniques Lab frequency distribution of sample mean . From this frequency distribution
calculate the Relative frequency / Probability distribution P( ) of sample mean.
Note that here denotes a random variable and its value.
Now, you can insert the chart of Sample Mean versus relative Frequency. The
Chart type used should be decided keeping in view the discrete or continuous
nature of the distribution. This is the graph showing the sampling distribution of
.
Also note that you can calculate the Grand mean (see E10..G10) as well as
standard error SE( ) as shown in the figure. Their direct computations have also
been shown using worksheet functions (see I14..K15).
Computation of SE( ) using the Finite Population Correction Factor
Some important exact sampling distributions that are widely used in statistical data
analysis are Chi-Square distribution (χ2), Student’s t-distribution, F-distributions and
these are briefly discussed subsequently.
12
Some of the key terms used here are defined below for your recapitulation: Descriptive Statistics
and Statistical
Normal Distribution: It is an important probability distribution of a continuous Inferences
random variable, which you will use in solving many problems subsequently. The
probability density function (PDF) f(x) = f(x: µ, σ2) of a normal distribution having
parameters µ (Mean) and σ2 (Variance) is below:
In general, the expression z = gives us the standardize values for the values of
random variable X having Mean µ and Variance µ2. What are the mean and variance
of Z ?
Task: Try your hand in calculating the following in cells Q22..T24 by altering the
values of U and V in cells Q23..Q24.
P( –1 < Z < 1) = 0.6827
P( –2 < Z < 2) = 0.9545
P( –3 < Z < 3) = 0.9973
Can you now calculate probabilities P( –1 < Z < 0), P( –2 < Z < 3) ?
Task: Construct the frequency distribution of the standardized values for the data on
life of light bulbs used in Figure 1 & 2 (see Figure 5 for the results).
Following steps will enable you to get the content of Figure 5.
The standardized values Z of the random variable X are calculated in the range
H2..L21.
The mean of z-values and standard deviations are given in O20..O21.
The DataData_AnalysisHistogram option was used to generate summary
statistics (bin, frequency).
13
Statistical Point Estimates and Interval Estimates: An estimate of a population parameter that
Techniques Lab is a single number is called a point estimate. For example, sample mean is an
estimate of the unknown population mean µ. An interval estimate on the other hand
provides a set of two numbers, within which the unknown parameter is likely to
belong (see Book 3, Chapter 6).
Let µT and σT be the mean and standard deviation (standard error) of the sampling
distribution of a static T. Then, if the sampling distribution of T is approximately
normal (for CLT see BCS-040/Block 2, Unit 4 page 15), than the values of probability
for are given in the table below. It follows from the table
values that the end numbers T ± σS , ± 2σS , T ± 3σS are the 68.27%, 95.45%, 99.73%
confidence limits for µT respectively.
In the Figure 6(b), the z value for 95% confidence level will be 1.96.
Density Curve of
Sample Mean
95% probability
– error + error
margin margin
Confidence Interval
(a) The density curve for mean of various sample, the error margin here is for 95%
confidence level
P(value) = 1 – 95%
2 P(value) = 1 – 95%
2
–z 0 +z
(b) Standard Normal Curve for 95% confidence level
Figure 6: Confidence Interval, Confidence level and Normal Distribution using density and standard normal curves
14
Please note that Figure 5(a) shows the density of the mean for various possible Descriptive Statistics
samples of defined number of variables. The idea in this figure is that for a confidence and Statistical
level of 95% there will be a range of possible sample mean values that will range Inferences
within a confidence interval. However, if you convert the mean score to z score, you
get a normal distribution as shown in Figure 5(b). This figure shows that there will be
a standard confidence values which will be equal to the shown area on the curve
(95% in the case of Figure 5). The remaining area called α (5% or 0.05) will be in two
tails (2.5% or 0.025 each).
Null Hypothesis: The Null Hypothesis is the statement or the claim that there is no
difference in true means or proportions of groups that are being compared. It is
observed that a Null hypothesis many a times includes phrase like no effect or no
difference.
For example, take the case of test of the hypothesis that the average height of men in
North India (µN ) is greater than the average height of men in East India (µE). Notice
that the claim in this case is µN > µE and this forms the alternative hypothesis H1.
Having no predetermined reason for the heights to differ, the null hypothesis H0 is
that they are the same.
H0: µN = µE Average heights of men in North India and East India are same.
H1: µN > µE Average heights of men in North India is greater than East India.
In general, the H0 and H1 can be one sided or two sided. For example, alternative
hypothesis may be H1 : µN ≠ µE.
Type I error: this occurs when the null hypothesis H0 is rejected when in fact,
H0 true.
Type II error: this occurs when the Hypothesis H0 is false and it is not rejected.
α-Value: In order to test the null hypothesis, it is required to fix a cutoff value for
p-value. An α value of 0.05 means that the evidence provided by data against H0
should be so strong that it would not happen more than 5% of the time. In other
words, if we carry out the test 100 times, only 5 out of these values of mean or
proportion, etc. (statistic T) calculated from samples, is not in the region of
acceptance. Further, it is also termed the level of significance of the test.
1.6 t-DISTRIBUTION
Please read the content of BCS-040/Block 2/Unit 4/ Section 4.4 on page 18 on the
t-Distribution. We reproduced here the relevant portion below:
Suppose X1, X2, ….., Xn be a random sample drawn from the normal distribution
N(µ,σ2 ), where the population mean µ and the variance σ2 is unknown (i.i.d.). Then
the expression given below has Student’s t-distribution or simply t-distribution with a
parameter ν = n – 1.
15
Statistical
Techniques Lab
The probability distribution for different values of d.f. are shown in Figure 3,
BCS040/Block2/Unit4/Section 4.4/page 19. The probability P(t > tα) = α for different
values of v, α is tabulated in Page 93( right-tail) of BCS-040 / Block 2.
Thus, P(|t| > tα) = P(t < – tα) + P(t > tα) = 2P (t > tα) and – tα = t1-α
(Refer to Book 3, Chapter 4).
Please note that the values given in the table are only for the Right-Tail of the
t-distribution. For example, if α =0.05 and ν=10, one tailed value t0.05 = 1.812.
However, if we want to find the two tailed value for the same significance level, we
have to use t0.05/2 = t0.025 to read values from the columns of the table and the value of
v = 10 in the row to get t0.025 = 2.228.
Solution:
The solution using the Table of t-distribution is given in the said Block. To solve this
problem using spreadsheet, you need to know the following function (this function
may have different name in different spreadsheet package):
Probability –probability for which you want to determine the t-value. Please note
that the function return the value of t for two-tails probability of the curve denoted
as P(|t| > tα) = 2P(t > tα). What are two-tails? Please discuss it with other
students. Can you appraise the values of t in the table mentioned above and the
values returned by this function?
For more details on this function, you may refer to the spreadsheet package that you
are using.
16
Now, you are ready to solve the problem. Descriptive Statistics
and Statistical
(i) Shaded portion of the right (α) = 0.05. Inferences
Since the spreadsheet function takes probability in the form of two tails, the total
shaded portion will be twice the shaded portion on the right. Thus,
probability = 2 × 0.05 = 0.1 (it is calculated in cell F6 of Figure 7), the degree of
freedom is given to be 9 (Cell F7 of the Figure 7). Thus, you insert the
formula =TINV(F6, F7) in cell F9. Verify the result with the Example’s result
as-well-as the table value.
The total shaded area is same as the probability, so you just need to insert this
probability in cell G6, put 9 in G7 and formula =TINV(G6, G7) in cell G9
Calculate the total shaded area and insert the formula for t in cell H9.
This area is same as area in the right. So insert the required formulas.
Figure 7 shows these results. Now, let us solve the problem 4 of the same section
using the spreadsheet
Solution:
To solve this problem, you need to use a function from the spreadsheet package:
This function returns the percentage points (Probability) of the student t-distribution at
a t-value specified by x. You may be able to define the other two parameters. You
may refer to spreadsheet package help for more details.
Please check in Figure 7, cells F13 to F16 and G13 to G16, the values have been
entered respectively. You can Insert the formula =TDIST(F15, F14, F16) in cell F18
and copy it to G18. You will get the required probabilities.
17
Statistical
Techniques Lab
Given:
Claim: The fuse will blow in 12.40 minutes on an average with 20% overload.
In other words, you can state null hypothesis as:
H0: The average time for the fuse to blow with 20% overload is 12.40 minutes.
H1: The average time for the fuse to blow with 20% overload is below 12.40 minutes.
or simply stated as
H0 : µ = 12.40 minutes
H1 : µ = 12.40 minutes
The solution to the problem is discussed in the Unit. We can also use spreadsheet to
solve this problem as shown in Figure 8. Insert all the values in the spreadsheet and
calculate the t-value. By now you are familiar about the functions that are used in the
spreadsheet.
18
In the Figure 8, you may notice that the tabulated value of ttab is calculated (in cells Descriptive Statistics
B13, C13 and D13) using the spreadsheet function: and Statistical
Inferences
TINV(probability, degreesOFfreedom)
Please note that the inference in this case is drawn for various α values. You should
also notice the Standard error mentioned in the diagram. Please relate the standard
error to Figure 6.
The claim of the company has been Rejected as you can notice that ttab ≤ – ttab, which
is the condition to Reject the Null Hypothesis.
Further, the test can also be carried out directly using p-value of the test, which in this
case is p =Pr (t ≤ Pr (t ≤ tcal | H0 :µ = 12.40) = 0.002400778, , which is mentioned in
the last line of paragraph above Figure 6 of BCS-040/Block2/Unit4. Note that the
typical values of level of significance used in practice are α = 5%, 1%, 0.1%, and the
decision in the last case being different (Acceptance) in the present case.
Suppose X1, X2,…., Xn be a random sample drawn from the normal distribution
N(µ, σ2). Then distribution of the statistic
19
Statistical
Techniques Lab
Example 5: (Refer page 23 of the Block 2, Unit-4 stated above) you need to find the
value of of the distribution for α = 0.05 and 0.01 for the degree of freedom 5.
Solution:
In order to solve the problem using spreadsheet package, use the function
CHIINV(Probability, DegreeOfFreedom), that returns the value of for given value
of probability and d.f.v. The solution of this problem is shown in Figure 9. The value
of probability and degree of freedom are in cells F3 and F4 respectively for the first
case, and G3 and G4 for the second case. The formula entered for calculating
chi-square value in cell F6 is: =CHIINV(F3, F4). This formula is then copied to cell
G6.
What is the probability that a shipment be rejected even through the variance in the
population is 1.26 × 10-4?
Solution:
First please note (as stated in the solution given in the Unit) that the sample is taken
from a normal population having the variance σ2 . The sample has a variance of S2
and the size of sample is small (n ≤ 30), then chi-square value can be calculated using
You can now use this formula to calculate the value of chi-square and check it against
the tabulated values (you can use spreadsheet CHIINV function to calculate the
tabulated values). Figure 9 shows the details of the calculations. In the cells F9, F11
and F12 size of the sample, the values of variance of population and value of variance
of sample has been entered. The cell F14 contains the formula =((F9-1)*F12)/F11
that calculates the value of chi-square to be . To calculate the probability,
we first calculated the degree of freedom in cell G10 and using the function
CHIDIST(x,DegreeOfFreedom), where x is the value of chi-square on which
probability is to be calculated, you can calculate the probability of rejecting a valid
sample as . Thus, you can enter the formula
=CHIDIST(F14,G10) in the cell G16. The calculated probability is 0.05 which means
there are 5% chances that a due to our method of sampling, we reject a shipment that
has variance of the refractive index of 1.26 × 10-4.
20
Descriptive Statistics
and Statistical
Inferences
Statistic:
Computation of these values are shown in cells K9..N12 of Figure 9 using worksheet
function CHIINV.
21
Statistical Where are the observed number of values in the ith class and are expected
Techniques Lab number of values in the same class. Then the approximate distribution of is
Chi-square with k – 1 degrees of freedom. Example based on this will be discussed in
section 1.10.
1.8 F- DISTRIBUTION
The F- Distribution is defined in the BCS-040/Block 2/Unit 4/ Section 4.6 on page 24.
Before starting with this section, please read BCS-040/Block2/Unit6/Section
6.3.2/page 59, in particular page 64.
Recall that in applying t-test to test the hypothesis of the type H0 : , against
H1 : etc., it was assumed that the samples were drawn from normal
populations with means and equal variance , , which is
unknown. Note that this equality of variances, which is required can be tested using
F-statistics/distribution.
F is always positive
F-distribution is not symmetrical and is skewed
In order to decide which of the two samples is first or the second, we take larger
of the two quantities in the numerator and the smaller one in the denominator.
Table 4&5 in Unit 7 gives probability , and the values so
tabulated are greater than unity. Worksheet function to be used FDIST(x, )
, in order to get value of F for given use worksheet
function FINV( .
Example 6: (Refer to page 24 of the Unit stated above) A random variable has
F-distribution with 40 and 30 degrees of freedom, Find the probability that it will
exceed (i) 1.79 and (b) 2.30.
Solution:
To find the probability for the given value of random variable that has F-distribution,
you need to use the function FDIST(x, DegreeOfFreedom1, DegreeOfFreedom2),
where x is the value of random variable on which probability is to be calculated. You
can insert the data as shown in Figure 10, and enter the formula =FDIST(F5,F3,F4) in
the cell F6. Copy this formula to G6 to get Pr(F > 1.79) = 0.05 and
Pr(F > 2.30) = 0.10. You can compare the results that you have obtained using the
spreadsheet function and the example of the Block or the table lookup value.
22
Descriptive Statistics
Problem 6: (Refer to page 24 of the Unit stated above) If two independent random and Statistical
sample of size n1 = 7 and n1 = 13 are taken from a normal population. What is the Inferences
probability that the variance of the first sample will be at least three times as large as
that of second sample?
Solution:
You can easily find the number of degree of freedom for the value of random variable
and calculate the value of random variable under F-distribution as 3.00 (variance of
first sample is triple of the second). You can once again use the function FDIST as
explained above.
However, in the figure 9 you will find that we have also used a formula of
=FINV(K15,K11,K12) in cell K13. The purpose of this function is to calculate the
value of F for given probability and degrees of freedom. The function in spreadsheet
is FINV(probability, DegreesOfFreedom1, DegreesOfFreedom2).
Further, notice that the expression for α = 0.05 (in cell F15) is
verified in cell F17.
23
Statistical 2) Create chi square-distribution for the data given in Figure 9 of this section using a
Techniques Lab spreadsheet package.
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
3) Create F-distribution for the data given in Figure 10 of this section using a
spreadsheet package
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
Recapitulation: Based on what has been covered above, the following should be
useful in deciding a test statistic:
Before 9.4 10.3 8.4 6.8 7.8 9.8 9.2 11.2 9.4 9.0
After 9.3 10.6 8.8 7.0 7.7 10.0 9.8 11.7 9.7 9.0
Can you say that the reading course is a success at 0.05 level of significance?
Solution:
Note:
1. Since in this example, the question being investigated through statistical test of
hypothesis is: “Can you say that the reading course is a success?”, and it can be
considered to be a success in the case of higher average reading speed among
students after undergoing the course. In other words, we are effectively testing
.
2. Consequently, we will apply a one-tail (left-tail) t-test to test the hypothesis.
3. If we start with a definition (i.e., After –Before), we can reproduce
the solution discussed in Unit 6, page 65-66 with .
No difference in the mean reading speed before and after undergoing the
course i.e., there is no difference in students reading abilities before and after
the course, or .
The mean reading speed after the course is greater than before undergoing the
course i.e., the students reading course is a success, or .
You can apply the Left-Tail t-test to test the hypothesis using spreadsheet package.
Figure 11 shows the application of “t-test: Paired Two Sample for Means” test on the
data. You can apply this test in two ways in the spreadsheet.
On applying the test, the results so generated are shown in Figure 11. Some of the
terms that are of importance in the present case are: mean, variance and number of
observations for both the samples, number of degrees of freedom (d.f.) and p-value
and t-value for one or two tails.
25
Statistical Decision: Here, and . Since ,
Techniques Lab we Reject the Null Hypothesis and conclude that the reading course is a success. Here,
p-value for the test is .
TTEST(array1,array2, tails, type); type is the type of t-test, help on which can be
found from spreadsheet help.
The result shown in Figure 11 has been generated using the formula
=TTEST(B2:K2,B3:K3,1,1) in cell H9, which returns p = 0.007 in cell I9 (Please
Read Note above for explanation on arguments “1,1”).
Note:
In both the cases you have got p-value (Left-tailed) as 0.007, which is less
than the significance level of 0.05. Hence hypothesis H0 is REJECTED. This
implies that there is a significant difference/improvement in the average
reading ability of the students and hence the reading course is a success.
Extreme caution should be exercised in deciding the direction of the rejection
region (left, right or 2-tail) and this has a bearing from the basic question
being investigated.
Conclusion of the test should pertain to the basic question being investigated.
Chi-square testing is used in two important types of test: finding the goodness of fit
and performing the test of independence/ For more details on these tests using
chi-square testing you must go through the BCS-040/Block2/Unit 7. In this section,
we present few examples of these tests using spreadsheet.
26
Test for Goodness of fit Descriptive Statistics
and Statistical
Example 1: (Refer to Unit 7) Jaswant is interested in breeding flowers of a certain Inferences
species. The experimental breeding can result in four possible types of flowers:
MG has 84
MR has 35
RG has 28
RR has 13
Solution:
H0: The distribution of the flower types is multinomial with ratio 9:3:3:1
H1: The distribution is not as per the given ratio.
Figure 12 shows the goodness of fit test carried out for the data and the expected
theoretical model. In the cell C13, you can enter the formula for calculating the
tabulated value of chi-square using the function
The hypotheses of goodness of fit can be tested using the statement in cell C15:
IF(chi-square at alpha >= calculated U ) then Reject H0 else do not reject H0
You can enter the related formula to check if hypothesis is rejected or not Result:
Let us discuss another example for testing goodness of fit using chi-square statistic in
the case of fitting of normal distribution to the given data, which is based on
Problem 1 / Unit 7 (also see Book 3/chapter 7).
27
Statistical
Techniques Lab
Problem 1: (Refer to Unit 7 page 80) A chemical company wants to know if its sales
of a liquid chemical are normally distributed. This information will help them in
planning and controlling the inventory. The sales record for a random sample of 200
days is given in the following Table:
Assume that the upper limit of a class shows that quantities less than that limit are in
the class. So, for example, 35.5 will be included in the third class interval, not the
second one.
At the 5% level of significance, test the hypothesis that the company's sales are
normally distributed.
The spreadsheet calculates the mean and standard deviation based on the formulas
given on BCS040/Block 1/Unit 1 Page 30and 31. Please enter the formulas yourself.
You can also enter the formulas for calculating U as shown in Figure 13.
Conclusion:
Since the calculated value of U is greater than the chi-square tabulated value, we
Reject H0 and conclude that normal distribution does not provide a good fit for the
data.
Further, note that if we use the estimated values of µ and σ (enter values in cells
B19 and C19).
The d.f. has to be reduced by 1 for each parameter (enter 1 in cells N15 and N16).
Notice the expected frequencies and pool the classes appropriately (for frequency
less than 5). Hence, enter the appropriate adjustment to d.f. in cell N17. Is there
any change to the d.f.?
What is the conclusion now?
29
Statistical Test of Independence
Techniques Lab
Chi-square test can also be used to investigate if there is an agreement between the
observed frequencies and the expected frequencies or independence of attributes. See
Unit 7/section 7.3 or Book 3/Chapter 7. The following example explains this in detail.
Remark:
Details of all related computations for the two methods can be seen in the attached
spreadsheet. You can enter all the formulas yourself.
Notice that the notion of “range naming” in worksheets has been used here for the
purpose of demonstration and it is explained in the steps below.
To name a range Click on FormulasName ManagerNewwrite
NameSpecify Refers to
Then while writing a function in place of the range, you can write the named
range from the list.
Here, the range has been named as “ObservedFREQ”.
Can you identify which other places in earlier worksheets could you have used the
same?
The “hypothesis is rejected” in both the methods and we conclude that the income
of a person is related to the importance she attaches to the price of a brand name.
For Method 1, please refer to Unit 7.
Lab Sessions 2, 3, 4
1) Perform all the examples given in the Unit 4, 5, and 6 using spreadsheet package.
3) Write simple C functions to calculate the t and chi-square values from the formula
that are given for their calculations.
1.11 SUMMARY
This section has been an attempt to provide you details on some of the basic concepts
of statistics along with how their computations can be carried out in a spreadsheet
package. The attempt here was not only to introduce you to various steps that were
used to perform the said statistical data analysis, but also to the spreadsheet functions
that can be used in performing the same. The unit begin with frequency distribution,
summarization of data using central tendency and dispersion and subsequently covers
nature of sampling distribution and some important concepts such as p-value,
significance level, normal distribution, array formula and finally range naming. The
unit then explains how you can use spreadsheet to solve problems using t-distribution,
chi-square distribution, F-distribution etc. Finally, the Unit describes the use of
spreadsheet package for performing test of significance. The test of significance is
explained with the help of an example on t-test. The last section of the unit was
devoted to application of chi-square testing on testing goodness of fit and
independence of attribute.
Please use any spreadsheet package and enter all the data as shown in Figure 1
(for question 1), Figure 2 (for question 2) and Figure 3 (for question 3). You
may download the file containing all these figures from the website
www.ignou.ac.in from BCA page under MCSL044. Find out what formulas
have been entered. Also find the errors in the entry of formula.
31
Statistical
Techniques Lab Check Your Progress 2:
Please use any spreadsheet package and enter all the data as shown in
Figure 7, 8 (for question 1), Figure 9 (for question 2) and Figure 10 (for
question 3). You may download the file containing all these figures. Find out
what formulas have been entered. Also find the errors in the entry of formula.
Please use any spreadsheet package and enter all the data as shown in Figure 11
to Figure 14. You may download the file containing all these figures. Find out
what formulas have been entered. Also find the errors in the entry of formula.
You may define cell ranges and use in formulas.
Web link:
www.wikipedia.org
32