Statistical Methods – Analysis of Frequency Distribution, Correlation and Regression, Coefficient
of Correlation, Time Series Analysis, Index Numbers, Theoretical Distribution, Normal, Binomial
and Poisson, Indian Statistics, Agricultural Data and Industrial Data with special reference to
Population, Limitations of Indian Statistics and Suggestion for Improvement.
Statistics deals with the collection, analysis, interpretation and presentation of numerical data.
The word Statistics was first used by the German author Gottfried Achenwall in the year 1749.
He is considered the father of Statistics.
In the plural form, Statistics refers to a collection of numerical facts. In the singular form, Statistics means
the methods by which the collection, analysis and interpretation of data are accomplished.
Statistical Data:
Quantitative Data: can be measured in numerical terms. For example, price, income, height,
weights, etc.
Qualitative Data: cannot be expressed in numerical terms. For example, intelligence, beauty,
aptitude etc.
Characteristics of Statistics:
➢ Aggregate of data
➢ Numerically expressed
➢ Affected by different factors
➢ Collected or estimated
➢ Reasonable standard of accuracy
➢ Predetermined purpose
➢ Comparable
➢ Systematic collection
Statistical Methods: Collection of data, Classification of data, Tabulation of data, Presentation of
data, Analysis of data, and Interpretation of data.
Collection of data is the basic activity of statistics. Data can be obtained directly from the
individual units, called primary sources or from the material published earlier elsewhere known
as the secondary sources.
Types of Data: Primary data and Secondary Data.
Primary data are original and are collected for the first time.
Methods of Collecting the Primary Data:
➢ Direct personal investigation or interview method.
➢ Indirect oral investigation.
➢ Investigation through Correspondents.
➢ Information through Schedules to be filled in by Informants.
➢ Information through Enumerators.
Questionnaire is a written tool of investigation that consists of a list of questions. It seeks exact
information from the respondents and is filled in by the respondents themselves.
Schedule is a formal method of asking questions. It is like a form containing a list of items
that is filled in by the investigators (enumerators).
Secondary data have been collected earlier by someone else and are found in published or unpublished
form.
Sources of Collection of Secondary Data:
Published Sources: international publications, government publications, report of commissions
and committees, semi government publications, commercial institutions, research institutions,
education institutions, unions, organisations, etc.
Unpublished Sources: preserved by individuals, societies, government or any other institutions.
Methods of Collecting Data: Census Method and Sample Method.
Census Method or Complete enumeration method: It refers to the method in which data are
collected about every item of the universe or population.
Sample Method: It refers to the method in which data are collected about a sample, that is, a group
of items taken from the population.
Methods of Sampling: Random sampling and Non-random sampling.
Random Sampling (Chance Sampling): Simple Random Sampling and Restrictive Random
Sampling.
1. Simple Random Sampling: each unit of the universe has an equal chance of being included in the sample. It may
use the lottery method, rotating the drum, or random numbers (Tippett’s table, Kendall and Babington Smith’s random
sampling numbers, Fisher and Yates’s numbers).
2. Restrictive Random Sampling:
➢ Stratified Random Sampling- Stratified random sampling is a method of sampling that
involves the division of a population into smaller sub-groups known as strata.
➢ Systematic Random Sampling- Systematic sampling is a type of probability sampling
method in which samples are selected according to a random starting point but with a fixed,
periodic interval. This interval, called the sampling interval, is calculated by dividing the
population size by the desired sample size.
➢ Group or Cluster Sampling- Cluster sampling is a probability sampling method in which
you divide a population into clusters, such as districts or schools, and then randomly select
some of these clusters as your sample.
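➢ Multistage Sampling- Multistage sampling is a more complex form of cluster sampling in
which the population is divided into groups and the sample is selected in two or more successive
stages, for example districts first, then villages within the selected districts, then households.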
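The random sampling methods above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered units, the strata names and the function names are assumptions made for the example, and proportional allocation is just one possible way of drawing from strata.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every unit has an equal chance of selection (like a lottery)."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Random start, then every k-th unit, where k = N // n."""
    rng = random.Random(seed)
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)          # random starting point in the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, seed=None):
    """Draw the same fraction at random from each stratum (assumed allocation rule)."""
    rng = random.Random(seed)
    sample = []
    for units in strata.values():
        size = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, size))
    return sample

population = list(range(1, 101))      # hypothetical universe of 100 numbered units
strata = {"urban": list(range(1, 41)), "rural": list(range(41, 101))}  # hypothetical strata

print(simple_random_sample(population, 10, seed=1))
print(systematic_sample(population, 10, seed=1))
print(stratified_sample(strata, 0.1, seed=1))
```

With N = 100 and a sample of 10, the sampling interval is 10, so the systematic sample always consists of units spaced 10 apart.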
Non-Random Sampling:
➢ Purposive sampling (Deliberate, judgmental, selective, subjective sampling): Purposive
sampling is a form of non-probability sampling in which researchers rely on their own
judgment when choosing members of the population.
➢ Quota sampling: Quota sampling is a sampling method in which the population is divided into
homogeneous groups and a fixed quota of units is collected from each group. Within each group, the
units are chosen by convenience sampling.
➢ Convenience sampling: Convenience sampling is a type of non-probability sampling that
involves the sample being drawn from that part of the population that is close to hand.
➢ Snowball sampling: Snowball sampling is where research participants recruit other
participants for a test or study. It's called snowball sampling because once you have the ball
rolling, it picks up more “snow” along the way and becomes larger and larger.
Law of Statistical Regularity states that if a sample is taken at random from a population, it is
likely to possess the same characteristics as the population.
Law of Inertia of Large Numbers is a corollary of the law of statistical regularity. It states that
the larger the size of the sample, the more accurate the results are likely to be.
Organisation of Data: Organization of data refers to the systematic arrangement of raw data, so
that the data becomes easy to understand.
Classification: It is a technique in statistics of organizing data into homogeneous or comparable
groups as per their common characteristics.
Types of Classification of Data: Spatial (geographical) classification, Chronological classification,
Qualitative classification and Quantitative classification.
➢ Spatial Classification: When data are classified on the basis of location or areas, it is called
Spatial or geographical classification.
➢ Chronological classification: When data are classified on the basis of time, like months, years
etc., it is called Chronological classification.
➢ Qualitative classification: When data are classified on the basis of some attributes or quality
such as colour of hair, literacy, religion etc., it is called Qualitative classification.
➢ Quantitative classification: When data are classified on the basis of some quantitative
variable such as height, weight, income, profits etc., it is called Quantitative classification.
Important terms used in classification of data:
Raw data: Data in its original or crude form.
Frequency: Number of times a particular value of a variable occurs in the data.
Variable: Data which is expressed in numerical terms and can vary from one object to other. It
can be of two types:
➢ Continuous variable: A Continuous variable can take any numerical value within a specific
interval.
➢ Discrete variable: A discrete variable can take only certain specific values that are whole
numbers.
Attribute: A characteristic which cannot be measured numerically, such as intelligence and aptitude.
Frequency Array: It refers to discrete series showing frequency corresponding to some discrete
value.
Frequency distribution: It refers to a series showing frequencies corresponding to different class-
intervals.
Frequency Curve: It is the graphic representation of a frequency distribution.
Tally bars or Tally marks: Vertical strokes used to make the counting of observations easy.
Class: Class means a group of data with lowest and highest values.
Class limits: Class limits are the lowest and highest values that can be included in a class.
Class interval: The difference between the upper and lower limit of a class.
Class frequency: The number of observations corresponding to a particular class.
Mid-value or Central Value: Middle value of a class interval.
Statistical series: The arrangement of classified data in a logical order, such as size or time of
occurrence, or according to some measurable or non-measurable characteristics.
Types of Statistical series: Individual series and Frequency series.
Individual series: Values of all units are shown individually.
Frequency series: Discrete series and Continuous series.
Discrete series: Where frequencies of a variable are given but the variable is without class
intervals.
Continuous series: The series in which values are shown in continuous manner. It can be
Cumulative frequency series, Unequal class interval series, Open end series, Exclusive method
series, and Inclusive method series.
➢ Cumulative frequency series: The Continuous series in which values are shown in less than
or more than manner.
➢ Unequal class interval series: The series in which width of class interval is not equal.
➢ Open end series: The series in which the lower limit of the first class interval or the upper limit of
the last class interval (or both) is not available.
➢ Exclusive method series: The series in which upper limit of one class is the lower limit of
the next class.
➢ Inclusive method series: The series in which the upper limit of a class is included in that class itself
instead of in the next class.
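As a rough illustration of how raw data become a continuous series, the sketch below builds an exclusive-method frequency distribution with mid-values and a less-than cumulative frequency. The marks and the function name are hypothetical, and equal class widths are assumed.

```python
from collections import Counter

def frequency_distribution(values, lower, width, n_classes):
    """Count observations per class using the exclusive method:
    a value equal to an upper limit is counted in the next class."""
    counts = Counter()
    for v in values:
        index = int((v - lower) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    table = []
    cumulative = 0
    for i in range(n_classes):
        lo = lower + i * width
        hi = lo + width
        cumulative += counts[i]
        # (lower limit, upper limit, mid-value, class frequency, less-than cumulative frequency)
        table.append((lo, hi, (lo + hi) / 2, counts[i], cumulative))
    return table

marks = [12, 25, 30, 7, 18, 20, 39, 33, 10, 28]   # hypothetical raw data
for lo, hi, mid, f, cf in frequency_distribution(marks, 0, 10, 4):
    print(f"{lo}-{hi}  mid-value={mid}  f={f}  less-than cf={cf}")
```

Note that the value 10 falls in the class 10-20 and the value 30 in the class 30-40, which is what distinguishes the exclusive method from the inclusive one.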
Presentation of Data: Exhibition of data in clear and attractive way to make it easily understood.
Types of Presentation of Data: Textual or descriptive, Tabular, and Diagrammatic.
Textual or descriptive Presentation: In textual presentation, data are described in the text form.
When the quantity of data is not too large, this form of presentation is more suitable.
Tabular Presentation: In a tabular presentation, data are presented in rows and columns.
Diagrammatic Presentation: In a diagrammatic presentation, data are presented using various types of
diagrams.
Kinds of Diagrams: Diagrams are classified into Geometric diagram, Frequency diagram and
Arithmetic line graph.
Geometric diagram can be Bar diagram and Pie diagram. The bar diagrams can be of three
types— simple, multiple and component bar diagrams.
➢ Simple Bar Diagram
➢ Multiple Bar Diagram
➢ Component Bar Diagram
➢ Pie Diagram
Frequency Diagram can be histogram, frequency polygon, frequency curve and ogive.
➢ Histogram
➢ Frequency polygon
➢ Frequency curve
➢ Ogive
Arithmetic Line Graph (Time Series Graph): In an arithmetic line graph, time is plotted on the x-axis
and the value of the variable is plotted on the y-axis. It helps in understanding the trend in
long-term time series data.
Measures of Central Tendency:
Statistical Average: It is a single value that represents the whole of a collection of numbers. It is
also called a measure of central tendency.
Kinds of Statistical Average or Central Tendency:
Mathematical Averages:
➢ Simple:
✓ Arithmetic Mean (AM)
✓ Geometric Mean (GM)
✓ Harmonic Mean (HM)
➢ Weighted:
Positional Averages:
➢ Median
➢ Mode
Arithmetic Mean (AM): Arithmetic mean in statistics is defined as the sum of all the observations
in the given dataset divided by the total number of observations.
AM = ΣX / N
Geometric Mean (GM): Geometric mean is the N-th root of the product of the N observations in the
series.
GM = (X1 × X2 × … × XN)^(1/N)
Harmonic Mean (HM): The harmonic mean is calculated by dividing the number of observations by
the sum of the reciprocals of the observations in the series.
HM = N / Σ(1/X)
Relation between arithmetic mean, geometric mean and harmonic mean:
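AM ≥ GM ≥ HM (for positive observations; the three are equal only when all observations are equal)
GM² = AM × HM (for two positive observations)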
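The relation between the three means can be checked numerically. The short sketch below uses two hypothetical positive observations (4 and 9); the identity between the squared GM and the product of AM and HM is exact only for two observations.

```python
import math
import statistics

a, b = 4, 9                              # two hypothetical positive observations

am = statistics.mean([a, b])             # (4 + 9) / 2
gm = math.sqrt(a * b)                    # square root of the product
hm = statistics.harmonic_mean([a, b])    # 2 / (1/4 + 1/9)

print(am, gm, hm)
print(am >= gm >= hm)                    # ordering of the three means
print(math.isclose(gm ** 2, am * hm))    # squared GM equals AM times HM
```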
Weighted Mean: The weighted mean is calculated by multiplying each value by the weight
associated with it, summing all the products together, and then dividing this sum by the
sum of the weights.
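A minimal sketch of the weighted mean follows. It divides the sum of the weighted values by the sum of the weights, which is the usual textbook form; the scores, the weights and the function name are assumptions for illustration.

```python
def weighted_mean(values, weights):
    """Sum of (weight x value) divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

marks = [60, 70, 80]      # hypothetical scores
weights = [1, 2, 3]       # hypothetical weights (relative importance)
print(weighted_mean(marks, weights))   # (60 + 140 + 240) / 6
```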
Median: Median is the middle value of a data set arranged in ascending or descending order.
Mode: Mode is the value that occurs most frequently in a given set of data.
Z = 3M – 2X (Mode = 3 Median – 2 Mean; an empirical relation for moderately skewed distributions)
Partition Values: Values of the items that divide the series into a number of equal parts are known as
partition values. A series may be divided into two, four, five, eight, ten and hundred equal parts by the
Median, Quartiles, Quintiles, Octiles, Deciles and Percentiles respectively.
Median = [(N+1)/2]th item
k-th Quartile = [k(N+1)/4]th item, k = 1, 2, 3
k-th Quintile = [k(N+1)/5]th item, k = 1, …, 4
k-th Octile = [k(N+1)/8]th item, k = 1, …, 7
k-th Decile = [k(N+1)/10]th item, k = 1, …, 9
k-th Percentile = [k(N+1)/100]th item, k = 1, …, 99
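The position rules above can be sketched for an individual series as follows. The k-th partition value is located at the [k(N+1)/parts]-th item of the sorted data; linear interpolation for fractional positions is an assumed convention, and the data and function name are hypothetical.

```python
def partition_value(data, k, parts):
    """Value at the [k(N+1)/parts]-th item of the sorted series.
    Fractional positions are linearly interpolated (an assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1) / parts        # 1-based position in the series
    position = min(max(position, 1), n)   # keep the position within the series
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower == n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [5, 8, 12, 15, 20, 24, 30]         # hypothetical individual series, N = 7
print(partition_value(data, 1, 2))        # median: 4th item
print(partition_value(data, 1, 4))        # first quartile: 2nd item
print(partition_value(data, 3, 4))        # third quartile: 6th item
```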
Measures of Dispersion: “The degree to which numerical data tend to spread about an average value
is called the variation or dispersion of the data.”—Spiegel
Absolute Measures of dispersion express the variation of observations in terms of the original units
of a series.
Relative Measures of dispersion express the variation of observations in ratios or percentages.
Absolute Measures of dispersion: Range (positional measure), Quartile deviation or Semi-Interquartile
range (positional measure), Mean deviation, Standard deviation, Variance, and Lorenz
curve (graphic method).
Relative Measures of dispersion: Coefficient of Range, Coefficient of Quartile deviation or Semi-Interquartile
range, Coefficient of Mean deviation, Coefficient of Standard deviation, Coefficient of
Variation.
Range: It is the difference between the greatest and the smallest observation of the distribution.
Range = L – S
Coefficient of Range = (L – S) / (L + S)
Quartile Deviation or Semi Inter-Quartile Range: Inter-Quartile Range is a measure of dispersion
based on the upper quartile Q3 and the lower quartile Q1.
Inter-quartile Range = Q3 – Q1
Quartile Deviation (Q.D.) = (Q3 – Q1)/2
Coefficient of Q.D. = (Q3 – Q1)/ (Q3 + Q1)
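The positional measures can be worked through on a small series. In the sketch below, L and S are the largest and smallest observations, and Q1 and Q3 are taken as the [(N+1)/4]-th and [3(N+1)/4]-th items; the data are hypothetical and chosen so that no interpolation is needed.

```python
def quartiles(data):
    """Q1 and Q3 as the [(N+1)/4]-th and [3(N+1)/4]-th items.
    Assumes N + 1 is divisible by 4 so no interpolation is needed."""
    ordered = sorted(data)
    n = len(ordered)
    if (n + 1) % 4 != 0:
        raise ValueError("this sketch needs N + 1 divisible by 4")
    q1 = ordered[(n + 1) // 4 - 1]
    q3 = ordered[3 * (n + 1) // 4 - 1]
    return q1, q3

data = [5, 8, 12, 15, 20, 24, 30]          # hypothetical series, N = 7
largest, smallest = max(data), min(data)

data_range = largest - smallest                             # L - S
coeff_range = (largest - smallest) / (largest + smallest)   # (L - S) / (L + S)

q1, q3 = quartiles(data)
inter_quartile_range = q3 - q1
quartile_deviation = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1)

print(data_range, coeff_range)
print(inter_quartile_range, quartile_deviation, coeff_qd)
```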
Mean Deviation or Average Deviation: It is obtained by taking the average of the deviations of the
given values from a measure of central tendency.
MD = Σ|X – A| / N, where A is the average (mean, median or mode) from which the deviations are taken.
Coefficient of M.D. = Mean Deviation/Mean
Standard Deviation: Standard deviation is denoted by the letter σ (small sigma) of the Greek alphabet.
It was first suggested by Karl Pearson as a measure of dispersion in 1893. It is the square root of the
arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean.
SD (σ) = √[Σ(X – X̄)² / N], where X̄ is the arithmetic mean and N is the number of observations.
Coefficient of S.D. = Standard Deviation/Mean
Variance is the square of the standard deviation and is denoted by σ².
Variance = SD²
Coefficient of Variation = (SD/Mean) × 100
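The sketch below computes the mean deviation (about the mean), the standard deviation, the variance and the relative measures for a small hypothetical data set. It follows the definition given above, in which the squared deviations are averaged over all N observations.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations
n = len(data)
mean = sum(data) / n

# Mean deviation: average of the absolute deviations from the mean
mean_deviation = sum(abs(x - mean) for x in data) / n

# Variance: average of the squared deviations; SD is its square root
variance = sum((x - mean) ** 2 for x in data) / n
sd = math.sqrt(variance)

coeff_md = mean_deviation / mean
coeff_sd = sd / mean
coeff_variation = sd / mean * 100

print(mean, mean_deviation, variance, sd)
print(coeff_md, coeff_sd, round(coeff_variation, 2))
```

Here the deviations from the mean of 5 are small whole numbers, so the results are easy to verify by hand.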
Lorenz Curve: It is a graphic method of studying the dispersion in a distribution. It was first used by
Max O. Lorenz, an economic statistician, for the measurement of economic inequalities such as in the
distribution of income and wealth between different countries. However today, it is also used in
business to study the disparities of the distribution of wages, profits, turnover, production,
population, etc.
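The points of a Lorenz curve are cumulative percentages, which can be sketched as follows. The incomes and the function name are hypothetical; the curve would be drawn by plotting the second value of each pair against the first.

```python
def lorenz_points(incomes):
    """Return (cumulative share of persons, cumulative share of income) pairs,
    starting from (0, 0), with incomes arranged from smallest to largest."""
    ordered = sorted(incomes)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, income in enumerate(ordered, start=1):
        running += income
        points.append((i / n, running / total))
    return points

incomes = [10, 20, 30, 40]      # hypothetical incomes of four persons
for share_persons, share_income in lorenz_points(incomes):
    print(f"{share_persons:.2f}  {share_income:.2f}")
```

If every person had the same income, each pair would have equal values and the curve would coincide with the line of equal distribution; the further the curve falls below that line, the greater the inequality.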
SKEWNESS: The literal meaning of skewness is ‘lack of symmetry’. Skewness gives an idea about
the shape of the curve which can be drawn with the help of the given frequency distribution.
Symmetrical Distribution: A distribution is called symmetrical when the frequencies are distributed
evenly on both sides of the central value, so that the mean, median, and mode are all equal.
(X=M=Z, where X is the mean, M the median and Z the mode)
(Q3 – M = M – Q1)
Asymmetrical or Skewed Distribution: A distribution is called asymmetrical when the frequencies are not
distributed evenly on both sides of the central value, so that the mean, median, and mode are not equal.
Positively Skewed Distribution: for which the curve has a longer tail towards the right.
(X>M>Z)
(Q3 – M > M – Q1)
Negatively Skewed Distribution: for which the curve has a longer tail towards the left.
(X<M<Z)
(Q3 – M < M – Q1)
Absolute Measures of Skewness:
Sk = Mean – Median, or
Sk = Mean – Mode, or
Sk = Median – Mode, or
Sk = Q3 + Q1 – 2M
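Relative Measures of Skewness or Coefficient of skewness:
Karl Pearson’s Coefficient of Skewness: Sk = (Mean – Mode) / SD, or Sk = 3(Mean – Median) / SD
Bowley’s Coefficient of Skewness: Sk = (Q3 + Q1 – 2M) / (Q3 – Q1)
Kelly’s Coefficient of Skewness: Sk = (P90 + P10 – 2P50) / (P90 – P10), where P10, P50 and P90 are the
10th, 50th and 90th percentiles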
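The three coefficients can be sketched as small functions, using their usual textbook forms: Karl Pearson's coefficient divides (Mean – Mode), or 3(Mean – Median), by the standard deviation; Bowley's uses the quartiles; Kelly's uses the 10th, 50th and 90th percentiles. The function names and the summary figures below are hypothetical.

```python
def pearson_skewness(mean, mode, sd):
    """Karl Pearson: (Mean - Mode) / SD."""
    return (mean - mode) / sd

def pearson_skewness_median(mean, median, sd):
    """Karl Pearson, when the mode is ill-defined: 3(Mean - Median) / SD."""
    return 3 * (mean - median) / sd

def bowley_skewness(q1, median, q3):
    """Bowley: (Q3 + Q1 - 2 Median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

def kelly_skewness(p10, p50, p90):
    """Kelly: (P90 + P10 - 2 P50) / (P90 - P10)."""
    return (p90 + p10 - 2 * p50) / (p90 - p10)

# Hypothetical summary figures of a positively skewed distribution
print(pearson_skewness(mean=50, mode=44, sd=12))
print(bowley_skewness(q1=20, median=30, q3=50))
print(kelly_skewness(p10=10, p50=30, p90=70))
```

A positive result indicates a longer tail towards the right, a negative result a longer tail towards the left, and zero a symmetrical distribution.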
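Kurtosis: Kurtosis is concerned with the flatness or peakedness of the frequency curve. A curve with
the peakedness of the normal curve is called mesokurtic; a more peaked curve is called leptokurtic and
a flatter curve is called platykurtic.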
Correlation: It is a statistical tool which studies the relationship between two variables.
Correlation analysis: It involves the use of various methods and techniques to study and measure the
extent of the relationship between the two variables.
POSITIVE CORRELATION: If the values of the two variables deviate in the same direction.
Example:
X 1 2 3 4 5 6 7
Y 12 20 30 38 45 52 65
NEGATIVE CORRELATION: If the values of the two variables deviate in the opposite direction.
Example:
X 1 2 3 4 5 6 7
Y 20 15 12 10 8 6 4
LINEAR CORRELATION: If corresponding to a unit change in one variable, there is a constant
change in the other variable over the entire range of the values.
X 1 2 3 4 5 6 7
Y 5 7 9 11 13 15 17
y = 2x + 3 (y=a+bx or y=bx+a)
NON-LINEAR CORRELATION: If corresponding to a unit change in one variable, there is no
constant change in the other variable but a fluctuating change is observed.
X 1 2 3 4 5 6 7
Y 5 8 9 12 13 17 22
METHODS OF STUDYING CORRELATION: Scatter diagram, Karl Pearson’s coefficient of
correlation, Spearman’s rank correlation, and Concurrent deviations.
Scatter diagram:
Karl Pearson’s Coefficient of Correlation: Pearson’s correlation coefficient lies between –1 and +1.
r = Σdxdy / (N σx σy), where dx and dy are the deviations of X and Y from their respective means.
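A minimal sketch of Karl Pearson's coefficient, computed as Σdxdy divided by N σx σy with deviations taken from the means, is given below. It reuses the linear and negative correlation examples from these notes; the function name is an assumption.

```python
import math

def pearson_r(xs, ys):
    """r = sum(dx * dy) / (N * sigma_x * sigma_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations of X from its mean
    dy = [y - mean_y for y in ys]          # deviations of Y from its mean
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sigma_x = math.sqrt(sum(a * a for a in dx) / n)
    sigma_y = math.sqrt(sum(b * b for b in dy) / n)
    return sum_dxdy / (n * sigma_x * sigma_y)

x = [1, 2, 3, 4, 5, 6, 7]
y_linear = [5, 7, 9, 11, 13, 15, 17]       # y = 2x + 3, perfect positive correlation
y_negative = [20, 15, 12, 10, 8, 6, 4]     # values move in the opposite direction

print(round(pearson_r(x, y_linear), 4))
print(round(pearson_r(x, y_negative), 4))
```

The first series gives a coefficient of +1 (up to rounding), while the second gives a negative value close to, but not exactly, –1 because the fall in Y is not constant.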
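Spearman’s Ranking Method: Spearman’s rank correlation coefficient lies between –1 and +1.
R = 1 – 6ΣD² / [N(N² – 1)], where D is the difference between the two ranks of a pair and N is the
number of pairs.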
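Spearman's coefficient is commonly computed as 1 – 6ΣD² / [N(N² – 1)], where D is the difference between the ranks of each pair. The sketch below assumes there are no tied values (ties need a correction that is not shown here) and reuses the positive and negative correlation examples from these notes; the function names are hypothetical.

```python
def ranks(values):
    """Rank 1 for the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rank_correlation(xs, ys):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), D = difference of ranks."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5, 6, 7]
y_positive = [12, 20, 30, 38, 45, 52, 65]
y_negative = [20, 15, 12, 10, 8, 6, 4]

print(spearman_rank_correlation(x, y_positive))   # ranks agree completely
print(spearman_rank_correlation(x, y_negative))   # ranks are exactly reversed
```

Because only the order of the values matters, the negative example gives exactly –1 here even though its Pearson coefficient is not exactly –1.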
Concurrent Deviation Method: This method is based on the signs of the deviations of the values of
variables from their preceding values. It does not take into account the exact magnitude of the values
of the variables. We put a plus (+) sign, minus (–) sign or equality (=) sign for the deviation. The
deviations in the values of two variables are said to be concurrent if they have the same sign.
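The following formula is used:
rc = ± √[± (2C – n) / n], where C is the number of concurrent deviations and n is the number of pairs of
deviations compared (one less than the number of pairs of observations). The sign of (2C – n) is used
both inside and outside the square root.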
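The coefficient of concurrent deviations is usually written as rc = ± √[± (2C – n) / n], with C the number of concurrent deviations and n the number of pairs of deviations. A minimal sketch under that assumption is given below; the function names and the third data pair are hypothetical, and two deviations are counted as concurrent when their signs are the same, as described above.

```python
import math

def signs(values):
    """Sign of each value's change from the preceding value: +1, -1 or 0."""
    return [(b > a) - (b < a) for a, b in zip(values, values[1:])]

def concurrent_deviation_coefficient(xs, ys):
    sx, sy = signs(xs), signs(ys)
    n = len(sx)                                   # number of pairs of deviations
    c = sum(1 for a, b in zip(sx, sy) if a == b)  # concurrent deviations
    ratio = (2 * c - n) / n
    # The sign of (2C - n) is applied inside and outside the square root
    return math.copysign(math.sqrt(abs(ratio)), ratio)

x = [1, 2, 3, 4, 5, 6, 7]
y_positive = [12, 20, 30, 38, 45, 52, 65]
y_negative = [20, 15, 12, 10, 8, 6, 4]

print(concurrent_deviation_coefficient(x, y_positive))
print(concurrent_deviation_coefficient(x, y_negative))
print(concurrent_deviation_coefficient([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))
```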
Regression: It is ‘returning to the average value’. It is a mathematical measure expressing the average
relationship between the two variables.
Regression analysis: It aims at establishing the functional relationship between the two variables. It
helps in predicting or estimating the value of the dependent variable for any given value of the
independent variable.
Line of regression of y on x: y = a + bx
Here, ‘b’ is the slope of the line of regression of y on x. It is called the coefficient of regression of y on
x. It represents the increment in the value of the dependent variable y for a unit change in the value of
the independent variable x.
Coefficient of regression of y on x is written as byx.
Line of regression of x on y: x = a + by
Here, ‘b’ is the slope of the line of regression of x on y. It is called the coefficient of regression of x on
y. It represents the increment in the value of the dependent variable x for a unit change in the value of
the independent variable y.
Coefficient of regression of x on y is written as bxy.
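As a rough illustration, the two regression coefficients can be computed from deviations about the means, using the common least-squares forms byx = Σdxdy / Σdx² and bxy = Σdxdy / Σdy². These forms are not stated in the notes and are assumed here; the data are the linear correlation example (y = 2x + 3) and the function name is hypothetical.

```python
def regression_coefficients(xs, ys):
    """Return (byx, bxy, mean_x, mean_y) from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_dxdy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_dx2 = sum((x - mean_x) ** 2 for x in xs)
    sum_dy2 = sum((y - mean_y) ** 2 for y in ys)
    byx = sum_dxdy / sum_dx2      # coefficient of regression of y on x
    bxy = sum_dxdy / sum_dy2      # coefficient of regression of x on y
    return byx, bxy, mean_x, mean_y

x = [1, 2, 3, 4, 5, 6, 7]
y = [5, 7, 9, 11, 13, 15, 17]

byx, bxy, mean_x, mean_y = regression_coefficients(x, y)
a = mean_y - byx * mean_x         # intercept of the line of regression of y on x

print(f"y = {a} + {byx}x")
print(a + byx * 8)                # estimated y for x = 8
print(byx * bxy)                  # equals the square of the correlation coefficient
```

Since the data lie exactly on a straight line, the product of the two coefficients is 1, matching a correlation coefficient of +1.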
Index number: Index number refers to a statistical device for indicating the relative movements of the
data where measurement of actual movements is difficult or incapable of providing useful information.
TYPES OF INDEX NUMBERS:
1. Price Index Numbers.
➢ Wholesale Price Index Numbers: The wholesale price index numbers reflect the changes in the
general price level of a country.
➢ Retail Price Index Numbers: The Retail Price Index Numbers reflect the general changes in the
retail prices of various commodities such as consumption goods, stocks and shares, bank deposits,
government bonds, etc.
2. Quantity Index Numbers: Quantity index numbers study the changes in the volume of goods.
3. Value Index Numbers: These are intended to study the change in the total value of production such
as indices of retail sales or profits or inventories.
METHODS OF CONSTRUCTING INDEX NUMBERS:
Simple Aggregate Method: Simple Aggregate Method uses the aggregate of prices in the current year as
a percentage of the aggregate of prices in the base year.
P01 = (Σp1 / Σp0) × 100
P01 : Price Index Number for the current year
p0 : Price of a commodity in the base year.
p1 : Price of a commodity in the current year.
Item            A     B     C     D     Total
Price in 2020   100   150   200   250   700
Price in 2021   200   300   400   500   1400
P01 = (1400/700) X 100 = 200
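The same calculation can be written as a short sketch, using the prices from the table above with 2020 as the base year. The function name is an assumption.

```python
def simple_aggregate_index(base_prices, current_prices):
    """Aggregate of current-year prices as a percentage of base-year prices."""
    return sum(current_prices.values()) / sum(base_prices.values()) * 100

prices_2020 = {"A": 100, "B": 150, "C": 200, "D": 250}   # base year (p0)
prices_2021 = {"A": 200, "B": 300, "C": 400, "D": 500}   # current year (p1)

print(simple_aggregate_index(prices_2020, prices_2021))  # 200.0
```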
Weighted Aggregate Method: In Weighted Aggregate method, appropriate weights are assigned to
values to reflect their relative importance in the group.
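Laspeyre’s Price Index or Base Year Method: P01 = (Σp1q0 / Σp0q0) × 100
Paasche’s Price Index: P01 = (Σp1q1 / Σp0q1) × 100
Fisher’s Price Index: P01 = √[(Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1)] × 100
Dorbish-Bowley Price Index: P01 = ½ [(Σp1q0 / Σp0q0) + (Σp1q1 / Σp0q1)] × 100
Kelly’s Price Index or Fixed Weights Index: P01 = (Σp1q / Σp0q) × 100
Here q0 and q1 are the quantities of a commodity in the base year and the current year, and q is a
fixed quantity weight.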
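The weighted aggregate indices differ only in the quantities used as weights. The sketch below uses their usual textbook forms: Laspeyres weights by base-year quantities (q0), Paasche by current-year quantities (q1), Fisher takes the geometric mean of the two, Dorbish-Bowley their arithmetic mean, and Kelly uses fixed quantities (q). All prices, quantities and function names are hypothetical.

```python
import math

def weighted_ratio(p0, p1, q):
    """(sum of p1*q / sum of p0*q) x 100 for a given set of quantity weights."""
    return sum(a * w for a, w in zip(p1, q)) / sum(b * w for b, w in zip(p0, q)) * 100

def laspeyres(p0, p1, q0):
    return weighted_ratio(p0, p1, q0)          # base-year quantities as weights

def paasche(p0, p1, q1):
    return weighted_ratio(p0, p1, q1)          # current-year quantities as weights

def fisher(p0, p1, q0, q1):
    return math.sqrt(laspeyres(p0, p1, q0) * paasche(p0, p1, q1))

def dorbish_bowley(p0, p1, q0, q1):
    return (laspeyres(p0, p1, q0) + paasche(p0, p1, q1)) / 2

def kelly(p0, p1, q):
    return weighted_ratio(p0, p1, q)           # fixed quantities as weights

p0, p1 = [2, 4], [3, 5]       # hypothetical base-year and current-year prices
q0, q1 = [10, 5], [8, 6]      # hypothetical base-year and current-year quantities

print(laspeyres(p0, p1, q0))                   # 55/40 x 100 = 137.5
print(round(paasche(p0, p1, q1), 2))           # 54/40 x 100
print(round(fisher(p0, p1, q0, q1), 2))
print(round(dorbish_bowley(p0, p1, q0, q1), 2))
```

Fisher's index always lies between the Laspeyres and Paasche values, since it is their geometric mean.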
Time Series: A time series is an arrangement of statistical data in a chronological order.
In the words of Ya-lun Chou, “A time series may be defined as a collection of readings belonging to
different time periods, of some economic variable or composite of variables”.
Components of a time series:
➢ Secular Trend or Long-term Movement
➢ Periodic Movements or Short-term Fluctuations:
o Seasonal Variations
o Cyclical Variations
➢ Random or Irregular Variations
Secular Trend: The general tendency of the time series data to increase or decrease or stagnate during
a long period of time is called the secular trend or simple trend. The term ‘long period of time’ is a
relative term and cannot be defined exactly.
Seasonal Variations: These variations in a time series are due to the rhythmic forces which operate in
a regular and periodic manner over a span of less than a year.
Cyclical Variations: The oscillatory movements in a time series with period of oscillation greater than
one year are termed as cyclical variations.
Random or Irregular Variations: These fluctuations are purely random and are the result of such
unforeseen and unpredictable forces which operate in absolutely erratic and irregular manner.
Hypothesis Testing:
Hypothesis is a specific, clear, and testable proposition or predictive statement about the possible
outcome of a research study.
Hypothesis Testing: Hypothesis testing is a form of statistical inference that uses data from a sample
to draw conclusions about a population parameter.
Null hypothesis is a tentative assumption made about the parameter or distribution. This assumption is
denoted by H0.
Alternative hypothesis is the opposite of what is stated in the null hypothesis. This is denoted by Ha.
Type I error refers to rejecting H0 when H0 is actually true.
Type II error refers to accepting (failing to reject) H0 when H0 is actually false.
Statistical test: It is a decision method that helps us to validate or invalidate a statistical hypothesis
with a certain degree of confidence. If the sample size is more than or equal to 30, it is called a large
sample and if it is less than 30, it is called a small sample.