STATISTICAL METHODS
Unit 1
Data Introduction
and Summarization
WHAT IS STATISTICS?
Statistics is the science of
learning from data.
It is concerned with
data collection
data analysis
data interpretation
TYPES OF STATISTICS
Descriptive Statistics
It deals with collecting, summarizing and simplifying
data, which are otherwise quite unwieldy and
voluminous.
When the population of interest is small, we will be able
to directly describe the important aspects of the
population measurements.
Inferential Statistics
It is the science of using a sample to make
generalizations about the important aspects of a
population.
A descriptive value for a population is called a
parameter, and a descriptive value for a sample is
called a statistic.
STATISTICAL DATA
Statistical data are the basic raw
material of statistics.
It refers to those aspects of a
problem situation that can be
measured, quantified or counted.
STATISTICAL DATA AND ITS USES
• Data are facts and figures from which conclusions can be drawn.
These conclusions support decision making in many professions and
organizations. For example:
Economists use conclusions drawn from the latest data on
unemployment and inflation to help the government make policy
decisions.
Financial planners use recent trends in stock market prices and
economic conditions to make investment decisions.
Accountants use sample data concerning a company’s actual
sales revenues to assess whether the company’s claimed sales
revenue is valid.
Marketing professionals help businesses decide which products to
develop and market using data that reveal consumer
preferences.
Production supervisors use manufacturing data to evaluate,
control and improve product quality.
Politicians rely on data from public opinion polls to formulate
legislation and to devise campaign strategies.
Physicians and hospitals use data on the effectiveness of drugs
APPLICATIONS OF STATISTICS IN
MANAGEMENT AND INDUSTRY
• Location of Plant
• Size of Plant
• Production Planning
• Quality Control
• Finance Decisions
• Marketing Decisions
• Personnel Decisions
• Purchase Decisions
• Sales Decisions
• Accounting Decisions
DATA SOURCES
Data sources could be seen as of two types:
Secondary
Primary
Secondary data: They already exist in some
form: published or unpublished - in an
identifiable secondary source. They are,
generally, available from published source(s),
though not necessarily in the form actually
required.
Primary data: The data which do not already
exist in any form, and thus have to be collected
for the first time from the primary source(s). By
their very nature, these data require fresh and
first-time collection covering the whole
population or a sample drawn from it.
TYPES OF DATA
In statistics, data are classified into two
broad categories:
Quantitative Data: Data that can be quantified
in definite units of measurement.
Discrete data:
e.g. The number of customers visiting a
departmental store every day, the number of
incoming flights at an airport, the number of
defective items in a consignment received for
sale.
Continuous data:
e.g. All characteristics such as weight, length,
height, thickness, velocity, temperature etc.
TYPES OF DATA
Qualitative Data: Data that refer to the qualitative
characteristics of a subject or an object.
Nominal data
They are the outcome of classification into two
or more categories of items or units comprising
a sample or a population according to some
quality characteristic.
e.g. Classification of students according to
gender (as males and females), of workers
according to skill (as skilled, semi-skilled and
unskilled) and of employees according to the
level of education (as matriculates,
undergraduates and post-graduates).
TYPES OF DATA
Rank data
o They are the result of assigning ranks to specify
order in terms of the integers 1, 2, 3, ..., n.
o Ranks may be assigned according to the level
of performance in a test, a contest, a competition,
an interview or a show.
e.g. The candidates appearing in an interview
may be assigned ranks in integers
ranging from 1 to n, depending on their
performance in the interview.
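As a minimal sketch of the ranking idea, the Python snippet below assigns the integer ranks 1 to n to candidates according to their interview scores. The candidate names and scores are hypothetical and invented purely for illustration. Breaking ties by input order is also an assumption, since the slides do not say how ties are handled.

```python
# Hypothetical interview scores (illustrative values, not from the slides).
scores = {"Asha": 72, "Bilal": 88, "Chen": 65, "Divya": 91}


def assign_ranks(score_by_name):
    """Return {name: rank}, where rank 1 goes to the highest score."""
    # sorted() is stable, so candidates with equal scores keep their input order.
    ordered = sorted(score_by_name, key=lambda name: score_by_name[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


ranks = assign_ranks(scores)
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(rank, name)
```

Note that the ranks record only the order of the candidates; the size of the gap between two scores is lost.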
VARIABLES
A variable is a characteristic or
condition that can change or take on
different values.
Most research begins with a general
question about the relationship
between two variables for a specific
group of individuals.
POPULATION
A population is the set of all elements about
which we wish to draw conclusions.
SAMPLE
Usually populations are so large that a
researcher cannot examine the entire group.
Therefore, a sample is selected to represent
the population in a research study. The goal
is to use the results obtained from the
sample to help answer questions about the
population.
A sample is a subset of the elements of a
population.
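To make the population/sample distinction concrete, here is a small Python sketch. The population values are hypothetical (generated with a fixed random seed), the sample size of 50 is an arbitrary choice, and drawing at random without replacement is only one of several possible ways to select a sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 measurements (illustrative only).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Draw 50 elements at random without replacement: a subset of the population.
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)  # descriptive value for the population
sample_mean = statistics.mean(sample)          # descriptive value for the sample

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is why results from a sample are used to help answer questions about the population rather than to state its values exactly.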
DATA CLASSIFICATION AND PRESENTATION
Meaning and Definition of Data Classification
“Classification is the process of arranging data
into sequences and groups according to their
common characteristics, or separating them into
different but related parts” -- Secrist
METHODS OF CLASSIFICATION
Every item of the collected data has its own
characteristics. These characteristics can be of two types:
(i) Descriptive: (e.g. Honesty, beauty etc.)
These are characteristics that cannot be measured
directly; instead, items are counted according to whether
the characteristic is present or absent. (Non-measurable
characteristics or attributes)
(ii) Numerical: (e.g. height, weight, profit etc.)
TYPES OF CLASSIFICATION
Statistical data can have two types of classification:
(1) Qualitative classification
(2) Quantitative classification.
Qualitative classification can be of two types:
• Dichotomy or Two-fold Classification
• Manifold Classification
Students
    Male
        Employed
        Unemployed
    Female
        Employed
        Unemployed
QUANTITATIVE CLASSIFICATION
Data classification on the basis of phenomena which
are capable of quantitative measurement, such as age,
height, weight, prices, production, income,
expenditure, sales, profits, etc.
The main methods of such classification are:
(i) Geographical Classification
(ii) Chronological Classification
(iii) Variable Classification
(a) Continuous Variable (b) Discrete Variable
(i) Geographical Classification: This type of
classification is based on geographical or location
differences between various items in the data like
states, cities, regions, zones etc. For example, the yield of
agricultural output per hectare for different countries
in some given period may be presented as follows:
Agricultural Output of Different Countries (in kg per hectare)
Country        India    USA    Pakistan    Japan    China
Avg. Output    125      585    140         410      330
(ii) Chronological Classification: When data are
classified with respect to different periods of time
(hour, day, week, month, year, etc.) it is known as
chronological or temporal classification. For
example, the population of India for different
decades may be presented as follows:
Population of India (in crores)
Year 1951 1961 1971 1981 1991 2000
Population 36.1 43.9 54.7 68.5 84.4 102.7
(iii) Variable Classification: When data are classified
according to the values of a variable, it is known as
variable classification.
Variables are of two kinds:
(a) Discrete variable (b) Continuous variable
Classification based on discrete values
Height (cms.)    No. of Students
154              8
155              10
156              6
157              2
158              12
159              12
Classification based on continuous values
Income (Rs.)     No. of Employees
1000-1500        15
1500-2000        33
2000-2500        22
2500-3000        18
3000-3500        12
Total            100
TABULAR AND GRAPHICAL METHODS
Summarizing Qualitative Data
Summarizing Quantitative Data
Exploratory Data Analysis
Scatter Diagrams
SUMMARIZING QUALITATIVE DATA
Frequency Distribution
Relative Frequency
Percent Frequency Distribution
Bar Graph
Pie Chart
FREQUENCY DISTRIBUTION
A frequency distribution is a tabular summary of
data showing the frequency (or number) of items
in each of several non-overlapping classes.
The objective is to provide insights about the data
that cannot be quickly obtained by looking only at
the original data.
EXAMPLE: MARADA INN
Guests staying at Marada Inn were asked to rate
the quality of their accommodations as being
excellent, above average, average, below
average or poor. The ratings provided by a
sample of 20 guests are shown below.
Below Average    Average          Above Average    Average
Average          Below Average    Poor             Above Average
Poor             Above Average    Below Average    Average
Above Average    Average          Above Average    Above Average
Above Average    Above Average    Excellent        Above Average
EXAMPLE: MARADA INN
Frequency Distribution
Rating Frequency
Poor 2
Below Average 3
Average 5
Above Average 9
Excellent 1
Total 20
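The tally above can be reproduced with a short Python sketch. The list holds the 20 guest ratings from the sample; the order in which they are listed is an assumption and does not affect the counts. The variable names are illustrative.

```python
from collections import Counter

# The 20 Marada Inn ratings (listing order assumed; only the counts matter).
ratings = [
    "Below Average", "Average", "Above Average", "Average",
    "Average", "Below Average", "Poor", "Above Average",
    "Poor", "Above Average", "Below Average", "Average",
    "Above Average", "Average", "Above Average", "Above Average",
    "Above Average", "Above Average", "Excellent", "Above Average",
]

# Non-overlapping classes, listed from worst to best rating.
classes = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

counts = Counter(ratings)
frequency = {rating: counts[rating] for rating in classes}

for rating, count in frequency.items():
    print(f"{rating:<15}{count:>3}")
print(f"{'Total':<15}{sum(frequency.values()):>3}")
```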
RELATIVE FREQUENCY DISTRIBUTION
The relative frequency of a class is the fraction
or proportion of the total number of data items
belonging to the class.
A relative frequency distribution is a tabular
summary of a set of data showing the relative
frequency for each class.
PERCENT FREQUENCY DISTRIBUTION
The percent frequency of a class is the relative
frequency multiplied by 100.
A percent frequency distribution is a tabular
summary of a set of data showing the percent
frequency for each class.
EXAMPLE: MARADA INN
Relative Frequency and Percent Frequency Distributions
Rating           Relative Frequency    Percent Frequency
Poor             .10                   10
Below Average    .15                   15
Average          .25                   25
Above Average    .45                   45
Excellent        .05                   5
Total            1.00                  100
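The two definitions can be sketched in a few lines of Python, starting from the Marada Inn frequencies. The dictionary names are illustrative choices.

```python
# Frequencies from the Marada Inn frequency distribution.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}

total = sum(frequency.values())

# Relative frequency: fraction of all data items falling in the class.
relative = {rating: count / total for rating, count in frequency.items()}

# Percent frequency: relative frequency multiplied by 100.
percent = {rating: 100 * rel for rating, rel in relative.items()}

for rating in frequency:
    print(f"{rating:<15}{relative[rating]:>6.2f}{percent[rating]:>6.0f}")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any such table.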
BAR GRAPH
A bar graph is a graphical device for depicting
qualitative data.
On the horizontal axis we specify the labels that
are used for each of the classes.
A frequency, relative frequency, or percent
frequency scale can be used for the vertical
axis.
Using a bar of fixed width drawn above each
class label, we extend the height appropriately.
The bars are separated to emphasize the fact
that each class is a separate category.
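A rough text-only sketch of a bar graph is given below in Python. For simplicity it draws horizontal bars rather than the vertical bars described above, and the function name and the "#" symbol are illustrative choices. In practice a plotting library would normally be used instead.

```python
# Frequencies from the Marada Inn frequency distribution.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}


def text_bar_graph(freq, symbol="#"):
    """Return one line per class: padded label, a bar of `symbol`, and the count."""
    width = max(len(label) for label in freq)
    lines = []
    for label, count in freq.items():
        # The bar length is proportional to the class frequency.
        lines.append(f"{label:<{width}} | {symbol * count} {count}")
    return lines


bars = text_bar_graph(frequency)
# A blank line between bars keeps each category visually separate.
print("\n\n".join(bars))
```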
EXAMPLE: MARADA INN
Bar Graph
[Figure: bar graph of Frequency (vertical axis, scale 1 to 9) by Rating (horizontal axis: Poor, Below Average, Average, Above Average, Excellent)]
PIE CHART
The pie chart is a commonly used
graphical device for presenting relative
frequency distributions for qualitative
data.
First draw a circle, then use the relative
frequencies to subdivide the circle into
sectors that correspond to the relative
frequency for each class.
Since there are 360 degrees in a circle, a
class with a relative frequency of .25
would consume .25(360) = 90 degrees
of the circle.
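The sector calculation can be sketched in Python as follows, using the relative frequencies from the Marada Inn example. The variable names are illustrative.

```python
# Relative frequencies from the Marada Inn example.
relative = {
    "Poor": 0.10,
    "Below Average": 0.15,
    "Average": 0.25,
    "Above Average": 0.45,
    "Excellent": 0.05,
}

# Each class gets a sector of (relative frequency x 360) degrees.
angles = {rating: rel * 360 for rating, rel in relative.items()}

for rating, angle in angles.items():
    print(f"{rating:<15}{angle:>6.1f} degrees")
```

Because the relative frequencies sum to 1, the sector angles sum to the full 360 degrees of the circle.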
EXAMPLE: MARADA INN
Pie Chart
[Figure: pie chart of Quality Ratings: Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%]
EXAMPLE: MARADA INN
Insights Gained from the Preceding Pie Chart
One-half of the customers surveyed gave
Marada a quality rating of “above
average” or “excellent” (looking at the left
side of the pie). This might please the
manager.
For each customer who gave an
“excellent” rating, there were two
customers who gave a “poor” rating
(looking at the top of the pie). This should
displease the manager.
EXPLORATORY DATA ANALYSIS
The techniques of exploratory data analysis
consist of simple arithmetic and easy-to-draw
pictures that can be used to summarize data
quickly.
One such technique is the stem-and-leaf
display.
STEM-AND-LEAF DISPLAY
A stem-and-leaf display shows both the rank
order and shape of the distribution of the data.
It is similar to a histogram on its side, but it has
the advantage of showing the actual data
values.
The first digits of each data item are arranged to
the left of a vertical line.
To the right of the vertical line we record the last
digit for each item in rank order.
Each line in the display is referred to as a stem.
Each digit on a stem is a leaf.
8 | 5 7
9 | 3 6 7 8
STEM-AND-LEAF DISPLAY
Leaf Units
A single digit is used to define each leaf.
In the preceding example, the leaf unit was 1.
Leaf units may be 100, 10, 1, 0.1, and so on.
Where the leaf unit is not shown, it is assumed to
equal 1.
EXAMPLE: LEAF UNIT = 0.1
If we have data with values such as
8.6 11.7 9.4 9.1 10.2 11.0 8.8
a stem-and-leaf display of these data will be
Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
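The construction can be sketched in Python as below. This is a minimal illustration, not a definitive implementation: the function name `stem_and_leaf` is an illustrative choice, the data are assumed to be non-negative, and stems with no leaves are simply omitted.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return display lines of the form 'stem | leaf leaf ...'."""
    stems = defaultdict(list)
    # Sorting first puts the leaves on each stem in rank order.
    for value in sorted(values):
        # Express the value in leaf units, then split off the last digit.
        units = round(value / leaf_unit)
        stem, leaf = divmod(units, 10)
        stems[stem].append(leaf)
    width = len(str(max(stems)))  # pad stems so the vertical line aligns
    return [
        f"{stem:>{width}} | {' '.join(str(leaf) for leaf in stems[stem])}"
        for stem in sorted(stems)
    ]


data = [8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8]
display = stem_and_leaf(data, leaf_unit=0.1)
print("Leaf Unit = 0.1")
print("\n".join(display))
```

With the default `leaf_unit=1`, the same function applies to whole-number data, for example values rounded to the nearest dollar.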
EXAMPLE: HUDSON AUTO REPAIR
The manager of Hudson Auto would like to get
a better picture of the distribution of costs for
engine tune-up parts. A sample of 50
customer invoices has been taken and the
costs of parts, rounded to the nearest dollar,
are listed below.
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
EXAMPLE: HUDSON AUTO REPAIR
Stem-and-Leaf Display
 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9
SCATTER DIAGRAM
A scatter diagram is a graphical presentation of
the relationship between two quantitative
variables.
One variable is shown on the horizontal axis
and the other variable is shown on the vertical
axis.
The general pattern of the plotted points
suggests the overall relationship between the
variables.
EXAMPLE: PANTHERS FOOTBALL TEAM
Scatter Diagram
The Panthers football team is interested
in investigating the relationship, if any,
between interceptions made and points scored.
x = Number of y = Number of
Interceptions Points Scored
1 14
3 24
2 18
1 17
3 27
EXAMPLE: PANTHERS FOOTBALL TEAM
Scatter Diagram
[Figure: scatter diagram of Number of Points Scored (y, vertical axis, 0 to 30) against Number of Interceptions (x, horizontal axis, 0 to 3)]
EXAMPLE: PANTHERS FOOTBALL TEAM
The preceding scatter diagram indicates a
positive relationship between the number of
interceptions and the number of points scored.
Higher points scored are associated with a
higher number of interceptions.
The relationship is not perfect; not all plotted
points in the scatter diagram lie on a
straight line.
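The scatter diagram is read by eye. As a supplementary sketch only, one common numerical summary of such a pattern is the sample correlation coefficient, which is not defined in these slides and is introduced here as an outside technique. The Python code below computes it for the five Panthers observations; the function and variable names are illustrative.

```python
import math

# Panthers data: interceptions made (x) and points scored (y).
interceptions = [1, 3, 2, 1, 3]
points = [14, 24, 18, 17, 27]


def correlation(x, y):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = correlation(interceptions, points)
print(f"r = {r:.2f}")
```

A value above 0 but below 1 (here about 0.94) is consistent with the reading of the diagram: the relationship is positive, but the points do not all fall on one straight line.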
SCATTER DIAGRAM
[Figures: three scatter diagrams, each with x on the horizontal axis and y on the vertical axis, showing:]
A Positive Relationship
A Negative Relationship
No Apparent Relationship
TABULAR AND GRAPHICAL PROCEDURES
Data
    Qualitative Data
        Tabular Methods
            • Frequency Distribution
            • Rel. Freq. Dist.
            • % Freq. Dist.
            • Crosstabulation
        Graphical Methods
            • Bar Graph
            • Pie Chart
    Quantitative Data
        Tabular Methods
            • Frequency Distribution
            • Rel. Freq. Dist.
            • Cum. Freq. Dist.
            • Cum. Rel. Freq. Distribution
            • Stem-and-Leaf Display
        Graphical Methods
            • Histogram
            • Ogive
            • Scatter Diagram