Statistical Fundamentals
Introduction to Statistics
➢ What do you think of when you see/hear the word “Statistics”? The majority of people
immediately think of numerical facts, data, graphs and tables. But not only do statisticians
collect, classify and tabulate data, they also analyze data in order to make generalizations and
decisions.
➢ Many areas of study use statistics, such as psychology, sociology, business, biology,
government, engineering, and science education, and even areas such as history, language, and the
arts.
➢ Statistics is the science of collecting, organizing, summarizing, and analyzing data to draw
conclusions or answer questions. It also provides a measure of confidence in any
conclusions.
➢ Statistics is the discipline that concerns the collection, organization, analysis, interpretation,
and presentation of data.
• Inferential statistics: the use of numbers related to a random sample from a population
to give numerical information about the population itself. [analyzing the data to draw
conclusions or answer questions about the population]
Inferential Statistics is the branch of statistics that analyzes sample data to draw
conclusions about a population.
➢ Probability is very important in inferential statistics; it’s related to the risk of making an
error.
Fundamental Issues of Statistics
Two types of variables:
• Qualitative or categorical variable – classification based on some attribute or
characteristic of the individual (non-numerical).
Categorical (qualitative) variables have values that can only be placed into categories,
such as “yes” and “no”; major; architectural style; etc. Example: hair color, eye color,
gender, race, ethnicity
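• Quantitative or numerical variable – a numerical measure of the individual. A quantitative
variable is either discrete or continuous.
Discrete variables arise from a counting process. Example: number of cars at a light, number
of students in a classroom, number of rooms in a house.
Continuous variables arise from a measuring process. Example: height, weight, age, miles
per gallon, time.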
➢ Raw scores or data: Numbers obtained in a particular situation. A collection of raw scores is
usually called a distribution of scores. Data are individual facts or items of information.
Examples of a distribution:
• Test scores on an exam in a particular class.
• Ages of students at MCCC.
• IQs of a random sample of 6th grade students in the Trenton school district.
A population consists of all the items or individuals or subjects about which you want to
draw a conclusion. So, the population is the “large group” in which you are interested.
➢ Sample – any portion (subset) of a population under consideration.
A sample is the portion of a population selected for analysis. The sample is the “small group”
for whom we have (or plan to have) data, often randomly selected.
o Random Sample: A sample selected in such a way that every member of the population has
an equal chance of being selected. The members of the random sample are picked arbitrarily
from the population.
Explanation:
Population: All students attending Mercer County Community College
Variable: Some measure of mathematical ability
This is not a random sample from the population of all students at MCCC. From this sample we
should not attempt to infer anything about the mathematical ability of all students at MCCC.
A bias in obtaining a sample will destroy the value of the statistical information obtained since
statistical inferences made from this information would be invalid.
Basic Terms of Frequency Table
Let us consider a frequency distribution table of the weights of 50 students.
The frequency distribution table is an arrangement of the values that one or more variables take
in a sample. Each entry in the table contains the frequency or count of the occurrences of values
within a particular group or interval, and in this way, the table summarizes the distribution of
values in the sample.
In statistics, a frequency distribution is a list, table or graph that displays the frequency of
various outcomes in a sample. Each entry in the table contains the frequency or count of the
occurrences of values within a particular group or interval.
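As a small illustration of the idea, the sketch below counts how many raw values fall into each class of equal width. The data, the starting point, and the function name `frequency_table` are hypothetical choices for this example only. Each class is taken as closed on the left and open on the right.

```python
def frequency_table(values, lower, width, n_classes):
    """Return (class_lower, class_upper, frequency) for classes [lo, lo + width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)   # index of the class containing v
        if 0 <= k < n_classes:          # values outside the covered range are ignored
            counts[k] += 1
    return [(lower + k * width, lower + (k + 1) * width, counts[k])
            for k in range(n_classes)]

# Hypothetical weights (kg) of 10 students
weights = [52, 47, 61, 58, 49, 55, 63, 50, 57, 66]
for lo, hi, f in frequency_table(weights, lower=45, width=5, n_classes=5):
    print(f"{lo}-{hi}: {f}")
```

The frequencies always add up to the number of values, provided the classes cover the whole range from the minimum to the maximum.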
The class interval (or class width) is the same for all classes. The classes all taken together must
cover at least the distance from the lowest value (minimum) in the data to the highest
(maximum) value. Equal class intervals are preferred in a frequency distribution, while
unequal class intervals (for example logarithmic intervals) may be necessary for certain
situations to produce a good spread of observations between the classes and avoid a large
number of empty, or almost empty classes.
Corresponding to a class interval, the class limits may be defined as the minimum value and the
maximum value the class interval may contain. The minimum value is known as the lower-class
limit (LCL) and the maximum value is known as the upper-class limit (UCL).
The class boundaries may be defined as the actual class limits of a class interval. For overlapping
(mutually exclusive) classification, such as 10 to 20, 20 to 30, the class boundaries coincide with
the class limits. This is usually done for a continuous variable. However, for non-overlapping
(mutually inclusive) classification, such as 10 to 19, 20 to 29, the lower-class boundary (LCB) and
the upper-class boundary (UCB) have the following forms.
\[ LCB = LCL - \frac{D}{2} \qquad \text{and} \qquad UCB = UCL + \frac{D}{2} \]
where $D$ is the difference between the LCL of the next class interval and the UCL of the given
class interval.
The class midpoint (or class mark) is the point at the center of a class in a frequency
distribution table. It is also the center of the corresponding bar in a histogram. It is defined as
the average of the upper and lower class limits. The lower-class limit is the lowest value in a class
and the upper-class limit is the highest value that can be in the class. In other words, the class
midpoint of a class interval is the arithmetic mean of its class limits or, equivalently, of its class
boundaries.
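The boundary and midpoint rules can be sketched in a few lines. The class limits below are hypothetical, and the helper assumes that the gap D between consecutive classes is the same throughout the table.

```python
def boundaries_and_midpoints(classes):
    """classes: ascending list of (LCL, UCL). Returns (LCB, UCB, midpoint) per class."""
    # D = LCL of the next class minus UCL of the given class (assumed constant)
    d = classes[1][0] - classes[0][1] if len(classes) > 1 else 0
    return [(lcl - d / 2, ucl + d / 2, (lcl + ucl) / 2) for lcl, ucl in classes]

# Hypothetical inclusive (non-overlapping) classes
for row in boundaries_and_midpoints([(10, 19), (20, 29), (30, 39)]):
    print(row)
# (9.5, 19.5, 14.5)
# (19.5, 29.5, 24.5)
# (29.5, 39.5, 34.5)
```

For overlapping classes such as (10, 20), (20, 30), D is 0 and the boundaries equal the limits, as stated above. The midpoint is the same whether it is computed from the limits or from the boundaries.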
The frequency (or absolute frequency) of an event is the number of times the event occurred in
an experiment or study. These frequencies are often graphically represented in histograms.
Statistical Graphs
Frequency polygon: It is formed by line segments joining the points plotted for each class. On the
horizontal axis is the independent variable (class marks) and on the vertical axis is the dependent
variable (frequency).
Cumulative frequency polygon: It is formed by line segments that never decrease. On the horizontal
axis is the independent variable (class upper limit) and on the vertical axis is the dependent variable
(cumulative frequency).
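The points of a cumulative frequency polygon are running totals of the class frequencies paired with the upper class limits. A minimal sketch with hypothetical limits and frequencies:

```python
from itertools import accumulate

def cumulative_points(upper_limits, freqs):
    """Pair each class upper limit with the cumulative frequency up to that class."""
    return list(zip(upper_limits, accumulate(freqs)))

# Hypothetical classes 45-50, 50-55, ..., 65-70
print(cumulative_points([50, 55, 60, 65, 70], [2, 2, 3, 2, 1]))
# [(50, 2), (55, 4), (60, 7), (65, 9), (70, 10)]
```

The last cumulative frequency equals the total number of observations.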
Pie chart: A circle is divided into sectors. The angle of each sector is proportional to the
corresponding frequency.
Histogram: It is a bar graph in which the height of each bar is proportional to the frequency of its
class. There is no space between bars. It is only used if the variable is quantitative and the scale
of the values is continuous.
The relation between the frequency polygon and the histogram:
Measures of Central Tendency
Central tendency: A measure of central tendency is a single value that attempts to describe a set
of data by identifying the central position within that set of data. As such, measures of central
tendency are sometimes called measures of central location.
The Mean, Mode (Mo) and Median (Me) are all valid measures of central tendency, but under
different conditions other measures of location, such as Quartiles, Deciles, and
Percentiles, become more appropriate to use.
There are three types of Mean, namely Arithmetic Mean (AM), Geometric Mean (GM), and
Harmonic Mean (HM).
\[ AM = \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \]
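For the shift $a$ and scale $h$, the coding formula for the Arithmetic Mean can be found as
\[
\begin{aligned}
u_i &= \frac{x_i - a}{h} \\
\Rightarrow\ x_i &= a + h u_i \\
\Rightarrow\ f_i x_i &= a f_i + h f_i u_i \\
\Rightarrow\ \sum_{i=1}^{n} f_i x_i &= a \sum_{i=1}^{n} f_i + h \sum_{i=1}^{n} f_i u_i \\
\Rightarrow\ \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} &= a + h\, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i}
\end{aligned}
\]
So,
\[ AM = a + h\, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} \]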
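To see that the coding formula gives the same value as the direct formula, here is a short sketch. The midpoints, frequencies, shift a = 57.5 and scale h = 5 are hypothetical values chosen for illustration.

```python
def arithmetic_mean(x, f):
    """Direct formula: sum(f_i * x_i) / sum(f_i)."""
    return sum(fi * xi for xi, fi in zip(x, f)) / sum(f)

def coded_mean(x, f, a, h):
    """Coding formula: AM = a + h * mean of u, where u_i = (x_i - a) / h."""
    u = [(xi - a) / h for xi in x]
    return a + h * arithmetic_mean(u, f)

x = [47.5, 52.5, 57.5, 62.5, 67.5]   # hypothetical class midpoints
f = [2, 2, 3, 2, 1]                  # hypothetical frequencies
print(arithmetic_mean(x, f))         # 56.5
print(coded_mean(x, f, a=57.5, h=5))
```

Choosing a near the middle of the data and h equal to the class width turns the coded values into small integers (here -2, -1, 0, 1, 2), which is what makes the formula convenient for hand calculation.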
\[ GM = \left( \prod_{i=1}^{n} x_i^{f_i} \right)^{\frac{1}{\sum_{i=1}^{n} f_i}} \]
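For practical computation, the working formula for the Geometric Mean is obtained by taking logarithms:
\[
\begin{aligned}
\log(GM) &= \log \left( \prod_{i=1}^{n} x_i^{f_i} \right)^{\frac{1}{\sum_{i=1}^{n} f_i}} \\
\Rightarrow\ \log(GM) &= \frac{1}{\sum_{i=1}^{n} f_i} \log \left( \prod_{i=1}^{n} x_i^{f_i} \right) \\
\Rightarrow\ \log(GM) &= \frac{\sum_{i=1}^{n} \log\left(x_i^{f_i}\right)}{\sum_{i=1}^{n} f_i} \\
\Rightarrow\ \log(GM) &= \frac{\sum_{i=1}^{n} f_i \log(x_i)}{\sum_{i=1}^{n} f_i}
\end{aligned}
\]
So,
\[ GM = \operatorname{Antilog} \left( \frac{\sum_{i=1}^{n} f_i \log x_i}{\sum_{i=1}^{n} f_i} \right) = 10^{\left( \frac{\sum_{i=1}^{n} f_i \log x_i}{\sum_{i=1}^{n} f_i} \right)} \]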
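\[ HM = \left( \frac{\sum_{i=1}^{n} f_i \left( \frac{1}{x_i} \right)}{\sum_{i=1}^{n} f_i} \right)^{-1} = \left( \frac{\sum_{i=1}^{n} \left( \frac{f_i}{x_i} \right)}{\sum_{i=1}^{n} f_i} \right)^{-1} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} \left( \frac{f_i}{x_i} \right)} \]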
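The Geometric Mean (through its base-10 logarithm working formula) and the Harmonic Mean can be sketched as follows. The two-value data set is hypothetical, and both functions assume that every x_i is positive.

```python
import math

def geometric_mean(x, f):
    """GM = 10 ** (sum(f_i * log10(x_i)) / sum(f_i)); requires all x_i > 0."""
    return 10 ** (sum(fi * math.log10(xi) for xi, fi in zip(x, f)) / sum(f))

def harmonic_mean(x, f):
    """HM = sum(f_i) / sum(f_i / x_i); requires all x_i > 0."""
    return sum(f) / sum(fi / xi for xi, fi in zip(x, f))

x, f = [2, 8], [1, 1]          # hypothetical values, each occurring once
print(geometric_mean(x, f))    # approximately 4
print(harmonic_mean(x, f))     # 3.2
```

For these values the Arithmetic Mean is 5, so the example also illustrates the usual ordering HM ≤ GM ≤ AM for positive data.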
\[ Mo = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times C \]
The class with the highest frequency is called the modal class. $L$ is the lower limit
of the modal class, $\Delta_1 = f_m - f_{m-1}$ is the frequency difference between the modal and
pre-modal class, $\Delta_2 = f_m - f_{m+1}$ is the frequency difference between the modal and
post-modal class, and $C$ is the class size.
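A minimal sketch of the mode formula for grouped data follows. The class limits and frequencies are hypothetical. When the modal class is the first or last class, the missing neighbouring frequency is taken as 0, which is an assumption of this sketch rather than something the text specifies.

```python
def grouped_mode(lower_limits, freqs, c):
    """Mo = L + d1 / (d1 + d2) * C for the class with the highest frequency."""
    m = max(range(len(freqs)), key=freqs.__getitem__)      # index of the modal class
    f_prev = freqs[m - 1] if m > 0 else 0                   # assumed 0 if no pre-modal class
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0      # assumed 0 if no post-modal class
    d1 = freqs[m] - f_prev
    d2 = freqs[m] - f_next
    return lower_limits[m] + d1 / (d1 + d2) * c

# Hypothetical classes 45-50, 50-55, 55-60, 60-65, 65-70
print(grouped_mode([45, 50, 55, 60, 65], [3, 7, 12, 8, 5], c=5))   # 55 + (5/9)*5
```

If several classes share the highest frequency, this sketch simply uses the first of them.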
\[ Me = L + \frac{\frac{N}{2} - F_{m-1}}{f_m} \times C \]
The class containing the $\frac{N}{2}$-th observation, where $N = \sum_{i=1}^{n} f_i$, is called the
median class. It is the first class whose cumulative frequency reaches $\frac{N}{2}$. $L$ is the
lower limit of the median class, $F_{m-1}$ is the cumulative frequency of the class before the median
class, $f_m$ is the frequency of the median class, and $C$ is the class size.
The formula for the Quartile is
\[ Q_i = L + \frac{\frac{i \times N}{4} - F_{q-1}}{f_q} \times C ; \quad i = 1, 2, 3 \]
The class containing the $\frac{i \times N}{4}$-th observation, where $N = \sum_{i=1}^{n} f_i$, is
called the $i$-th quartile class. $L$ is the lower limit of the quartile class, $F_{q-1}$ is the
cumulative frequency of the class before the quartile class, $f_q$ is the frequency of the quartile
class, and $C$ is the class size.
The formula for the Decile is
\[ D_i = L + \frac{\frac{i \times N}{10} - F_{d-1}}{f_d} \times C ; \quad i = 1, 2, \ldots, 9 \]
The class containing the $\frac{i \times N}{10}$-th observation, where $N = \sum_{i=1}^{n} f_i$, is
called the $i$-th decile class. $L$ is the lower limit of the decile class, $F_{d-1}$ is the
cumulative frequency of the class before the decile class, $f_d$ is the frequency of the decile
class, and $C$ is the class size.
The formula for the Percentile is
\[ P_i = L + \frac{\frac{i \times N}{100} - F_{p-1}}{f_p} \times C ; \quad i = 1, 2, \ldots, 99 \]
The class containing the $\frac{i \times N}{100}$-th observation, where $N = \sum_{i=1}^{n} f_i$, is
called the $i$-th percentile class. $L$ is the lower limit of the percentile class, $F_{p-1}$ is the
cumulative frequency of the class before the percentile class, $f_p$ is the frequency of the
percentile class, and $C$ is the class size.
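The median, quartile, decile, and percentile formulas share one pattern: locate the class in which the cumulative frequency first reaches i × N / k, then interpolate inside it. The sketch below uses a single hypothetical helper, `grouped_quantile`, with k = 2, 4, 10, or 100. The class limits and frequencies are made up for illustration.

```python
def grouped_quantile(lower_limits, freqs, c, i, parts):
    """i-th quantile when the data are split into `parts` equal parts.

    parts=2, i=1 gives the median; parts=4 quartiles; parts=10 deciles;
    parts=100 percentiles.
    """
    n = sum(freqs)
    target = i * n / parts          # position i*N/parts
    cum = 0                         # cumulative frequency before the current class
    for lo, f in zip(lower_limits, freqs):
        if f > 0 and cum + f >= target:
            return lo + (target - cum) / f * c
        cum += f
    raise ValueError("i must satisfy 0 < i <= parts")

lows, fs = [45, 50, 55, 60, 65], [3, 7, 12, 8, 5]    # hypothetical table, N = 35
print(grouped_quantile(lows, fs, 5, 1, 2))    # median: 58.125
print(grouped_quantile(lows, fs, 5, 3, 4))    # Q3: 62.65625
print(grouped_quantile(lows, fs, 5, 5, 10))   # D5 equals the median
```

The second quartile, the fifth decile, and the fiftieth percentile all coincide with the median, which is a quick consistency check on any hand calculation.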
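For a moderately skewed distribution, the three measures are connected by the empirical relation
\[ AM - Mo = 3(AM - Me) \]
\[ \Rightarrow\ Mo = 3Me - 2AM \]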
Measures of Dispersion
Dispersion: Dispersion in statistics is a way of describing how spread out a set of data is.
When the dispersion of a data set is large, the values in the set are widely scattered; when it is
small, the items in the set are tightly clustered.
The spread of a data set can be described by a range of descriptive statistics including Mean
Deviation (MD), Standard Deviation (SD), and Interquartile Range. Those are called the absolute
measures of dispersion.
Also, there are some relative measures of dispersion, such as co-efficient of Mean Deviation, co-
efficient of Standard Deviation, and co-efficient of Interquartile Range.
There are three types of Mean Deviation, estimated from Arithmetic Mean, Mode, and Median,
respectively.
\[ CMD = \frac{MD_{Base}}{Base} \times 100\% ; \quad Base = AM, Mo, Me \]
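The sketch below computes a Mean Deviation and its co-efficient about a chosen base. It assumes the usual definition, the frequency-weighted average of the absolute deviations |x_i − Base|, since the defining formula is not reproduced in the text above. The midpoints and frequencies are hypothetical.

```python
def mean_deviation(x, f, base):
    """Assumed definition: MD_base = sum(f_i * |x_i - base|) / sum(f_i)."""
    return sum(fi * abs(xi - base) for xi, fi in zip(x, f)) / sum(f)

def coefficient_of_md(x, f, base):
    """CMD = MD_base / base * 100 (a percentage)."""
    return mean_deviation(x, f, base) / base * 100

x = [47.5, 52.5, 57.5, 62.5, 67.5]   # hypothetical class midpoints
f = [2, 2, 3, 2, 1]
am = sum(fi * xi for xi, fi in zip(x, f)) / sum(f)   # 56.5
print(mean_deviation(x, f, am))       # 5.2
print(coefficient_of_md(x, f, am))    # about 9.2 (percent)
```

Passing the Mode or the Median as `base` gives the other two versions of the measure.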
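The working formula for Standard Deviation is
\[
\begin{aligned}
\sigma &= \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}}
= \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)}{\sum_{i=1}^{n} f_i}}
= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2 - 2\bar{x} \sum_{i=1}^{n} f_i x_i + \bar{x}^2 \sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} f_i}} \\
&= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - 2\bar{x}\, \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} + \bar{x}^2}
= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - 2\, \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \cdot \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} + \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2} \\
&= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - 2 \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2 + \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2}
= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2}
\end{aligned}
\]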
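The coding formula for Standard Deviation is derived from $u_i = \frac{x_i - a}{h}$ as
\[
\begin{aligned}
x_i &= a + h u_i \\
\Rightarrow\ \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} &= a + h\, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} \\
\Rightarrow\ \bar{x} &= a + h \bar{u} \\
\Rightarrow\ x_i - \bar{x} &= h (u_i - \bar{u})
\end{aligned}
\]
Substituting this into the definition of $\sigma$ and applying the working formula to $u_i$ gives
\[ SD = h \times \sqrt{\frac{\sum_{i=1}^{n} f_i u_i^2}{\sum_{i=1}^{n} f_i} - \left( \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} \right)^2} \]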
\[ CSD = \frac{SD}{AM} \times 100\% \]
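The three routes to the Standard Deviation (definition, working formula, and coding formula) can be checked against each other with a short sketch. The midpoints, frequencies, shift a = 57.5 and scale h = 5 are hypothetical, and h is assumed to be positive.

```python
import math

def mean(x, f):
    return sum(fi * xi for xi, fi in zip(x, f)) / sum(f)

def sd_definition(x, f):
    """sqrt(sum f (x - mean)^2 / sum f)."""
    m = mean(x, f)
    return math.sqrt(sum(fi * (xi - m) ** 2 for xi, fi in zip(x, f)) / sum(f))

def sd_working(x, f):
    """sqrt(sum f x^2 / sum f - (sum f x / sum f)^2)."""
    return math.sqrt(mean([xi ** 2 for xi in x], f) - mean(x, f) ** 2)

def sd_coded(x, f, a, h):
    """h * SD of the coded values u = (x - a) / h, with h > 0."""
    u = [(xi - a) / h for xi in x]
    return h * sd_working(u, f)

x = [47.5, 52.5, 57.5, 62.5, 67.5]   # hypothetical class midpoints
f = [2, 2, 3, 2, 1]
print(sd_definition(x, f), sd_working(x, f), sd_coded(x, f, 57.5, 5))
print(sd_definition(x, f) / mean(x, f) * 100)   # CSD as a percentage
```

For this data the variance is 39, so all three functions return about 6.245. The working formula subtracts two large numbers, so on a computer the definition is numerically safer when the mean is large relative to the spread.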
Interquartile Range is
\[ IQR = Q_3 - Q_1 \]
Co-efficient of Interquartile Range
\[ CIQR = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\% \]
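For a distribution that is approximately normal, the measures of dispersion are related
approximately by
\[ MD \approx \frac{4}{5}\, SD \]
\[ \frac{IQR}{2} \approx \frac{2}{3}\, SD \]
Note that the factor $\frac{2}{3}$ applies to half of the Interquartile Range (the quartile
deviation), not to the full range $Q_3 - Q_1$.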
Moments, Skewness, and Kurtosis
Two distributions may have the same Mean and Standard Deviation but differ in the shape
of the distribution. A further description of their characteristics is therefore necessary, and it
is provided by measures of skewness and kurtosis. Moments are popularly used to describe the
characteristics of a distribution. They represent a convenient and unifying method for summarizing
many of the most commonly used descriptive statistical measures such as central tendency, variation,
Skewness, and Kurtosis.
The term ‘skewness’ refers to a lack of symmetry or departure from symmetry, i.e., when a
distribution is not symmetrical (or is asymmetrical) it is called a skewed distribution. The
measures of skewness indicate the difference between the manner in which the observations are
distributed in a particular distribution and in a symmetrical (or normal) distribution.
The concept of skewness gains importance from the fact that statistical theory is often based
upon the assumption of the normal distribution. A measure of skewness is, therefore, necessary
in order to guard against the consequence of this assumption.
In statistics, kurtosis refers to the degree of flatness or peakedness in the region about the mode
of a frequency curve. The degree of kurtosis of a distribution is measured relative to the flatness
or peakedness of a normal curve. A curve flatter than the normal curve is called “Platykurtic” and a
curve more peaked than the normal curve is called “Leptokurtic”. The normal curve itself is known as
“Mesokurtic”.
\[ x_i - A = h u_i \]
Again, we can write
\[ m_2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i} = \sigma^2 \]
So, $m_2$ is the variance of the data set and hence $SD = \sigma = \sqrt{m_2}$.
\[ m_2 = m_2' - m_1'^2 \]
\[ m_3 = m_3' - 3 m_2' m_1' + 2 m_1'^3 \]
\[ m_4 = m_4' - 4 m_3' m_1' + 6 m_2' m_1'^2 - 3 m_1'^4 \]
Co-efficient of Skewness
\[ \gamma_3 = \frac{m_3}{\sqrt{m_2^3}} \]
Co-efficient of Kurtosis
\[ \gamma_4 = \frac{m_4}{m_2^2} \]
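The conversion from raw moments to central moments, and from there to the two co-efficients, can be sketched as follows. The raw moment about an origin A is assumed here to be the frequency-weighted average of (x_i − A)^r, which is the usual definition. The small symmetric data set is hypothetical.

```python
import math

def raw_moment(x, f, r, origin=0.0):
    """Assumed definition: m'_r = sum(f_i * (x_i - origin)**r) / sum(f_i)."""
    return sum(fi * (xi - origin) ** r for xi, fi in zip(x, f)) / sum(f)

def central_moments(x, f, origin=0.0):
    """Return (m2, m3, m4) from the raw moments about `origin`."""
    r1, r2, r3, r4 = (raw_moment(x, f, r, origin) for r in (1, 2, 3, 4))
    m2 = r2 - r1 ** 2
    m3 = r3 - 3 * r2 * r1 + 2 * r1 ** 3
    m4 = r4 - 4 * r3 * r1 + 6 * r2 * r1 ** 2 - 3 * r1 ** 4
    return m2, m3, m4

def skewness_kurtosis(x, f, origin=0.0):
    m2, m3, m4 = central_moments(x, f, origin)
    return m3 / math.sqrt(m2 ** 3), m4 / m2 ** 2

x, f = [1, 2, 3], [1, 2, 1]          # hypothetical symmetric data
print(central_moments(x, f))         # (0.5, 0.0, 0.5)
print(skewness_kurtosis(x, f))       # (0.0, 2.0)
```

The central moments do not depend on the origin chosen for the raw moments, so any convenient A gives the same result up to rounding. A symmetric data set gives a skewness of zero, as the example shows.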
Corrected central moments (Sheppard's corrections) for grouped data with class size $c$:
\[ m_1^{(corrected)} = 0 \]
\[ m_2^{(corrected)} = m_2 - \frac{1}{12} c^2 \]
\[ m_3^{(corrected)} = m_3 \]
\[ m_4^{(corrected)} = m_4 - \frac{1}{2} m_2 c^2 + \frac{7}{240} c^4 \]
Correlation and Regression
Correlation Analysis: Correlation analysis is applied in quantifying the association between two
continuous variables, for example, between a dependent and an independent variable or between two
independent variables.
The sign of the coefficient of correlation shows the direction of the association. The magnitude
of the coefficient shows the strength of the association. The sample correlation coefficient is
estimated in the correlation analysis.
It is denoted by $r$, ranges between −1 and +1, and quantifies the strength and direction of the
linear association between two variables. The correlation between two variables can either be
positive, i.e., a higher level of one variable is related to a higher level of the other, or
negative, i.e., a higher level of one variable is related to a lower level of the other.
A model of the relationship is hypothesized, and estimates of the parameter values are used to
develop an estimated regression equation. Various tests are then employed to determine if the
model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can
be used to predict the value of the dependent variable given values for the independent variables.
Linear Regression: This is a linear approach to modeling the relationship between a scalar
response (the dependent variable) and one or more independent variables. If the regression has one
independent variable, then it is known as a simple linear regression. If it has more than one
independent variable, then it is known as multiple linear regression.
Linear regression focuses only on the conditional probability distribution of the response given
the values of the independent variables, rather than on the joint probability distribution of all
the variables. In general, real-world regression models involve multiple predictors. So, the term
linear regression often describes multiple linear regression.
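Where $N$ is the number of observations and $x$ and $y$ are two variables, the following notation
is standard:
\[ V(X) = \sigma_x^2 = \frac{\sum (x - \bar{x})^2}{N} = S_{xx} \]
\[ V(Y) = \sigma_y^2 = \frac{\sum (y - \bar{y})^2}{N} = S_{yy} \]
\[ Cov(X, Y) = \sigma_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{N} = S_{xy} \]
Here, $S_{xy}$ is called the covariance between $x$ and $y$.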
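The three quantities share one formula, since the variance of a variable is its covariance with itself. A minimal sketch with hypothetical paired data, using the divisor N as in the notation above:

```python
def s(x, y):
    """S_xy = sum((x - mean_x) * (y - mean_y)) / N; s(x, x) gives S_xx."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

x = [1, 2, 3, 4, 5]      # hypothetical observations
y = [2, 4, 5, 4, 5]
print(s(x, x))           # S_xx = 2.0
print(s(y, y))           # S_yy, about 1.2
print(s(x, y))           # S_xy, about 1.2
```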
The correlation coefficient
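Assume $u = \frac{x - a}{h}$ and $v = \frac{y - b}{k}$, where $a$ and $b$ are the shifts and $h$ and
$k$ are the scales. Then it can be shown that
\[ x = a + hu \qquad\qquad y = b + kv \]
That gives
\[ \bar{x} = a + h\bar{u} \qquad\qquad \bar{y} = b + k\bar{v} \]
So,
\[ x - \bar{x} = h(u - \bar{u}) \qquad\qquad y - \bar{y} = k(v - \bar{v}) \]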
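A consequence of these identities is that the correlation coefficient, taken here as r = S_xy / (σ_x σ_y), is unchanged by the shifts and by positive scales, because h and k cancel between the numerator and the denominator. The sketch below checks this numerically. The data, shifts, and scales are hypothetical.

```python
import math

def correlation(x, y):
    """r = S_xy / sqrt(S_xx * S_yy); the common divisor N cancels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1010, 1020, 1030, 1040, 1050]         # hypothetical observations
y = [2.0, 4.0, 5.0, 4.0, 5.0]
u = [(xi - 1030) / 10 for xi in x]         # a = 1030, h = 10
v = [(yi - 4.0) / 0.5 for yi in y]         # b = 4,    k = 0.5
print(correlation(x, y), correlation(u, v))   # the two values agree
```

If exactly one of h and k is negative, the magnitude of r is still unchanged but its sign flips.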
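\[
\begin{aligned}
b_{xy} &= \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}
= \frac{hk \sum (u - \bar{u})(v - \bar{v})}{k^2 \sum (v - \bar{v})^2}
= \frac{h}{k} \cdot \frac{\sum (u - \bar{u})(v - \bar{v})}{\sum (v - \bar{v})^2}
= \frac{h}{k} \times b_{uv} \\
&= \frac{h}{k} \times \left( \frac{N \sum uv - \sum u \sum v}{N \sum v^2 - \left( \sum v \right)^2} \right)
\end{aligned}
\]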
So, the Regression coefficients are independent of the shifts but dependent on scales.
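The regression line of $y$ on $x$ is
\[
\begin{aligned}
y - \bar{y} &= b_{yx} (x - \bar{x}) \\
\Rightarrow\ y - \frac{\sum y}{N} &= b_{yx} \left( x - \frac{\sum x}{N} \right) \\
\Rightarrow\ y &= b_{yx} \left( x - \frac{\sum x}{N} \right) + \frac{\sum y}{N}
\end{aligned}
\]
and the regression line of $x$ on $y$ is
\[
\begin{aligned}
x - \bar{x} &= b_{xy} (y - \bar{y}) \\
\Rightarrow\ x - \frac{\sum x}{N} &= b_{xy} \left( y - \frac{\sum y}{N} \right) \\
\Rightarrow\ x &= b_{xy} \left( y - \frac{\sum y}{N} \right) + \frac{\sum x}{N}
\end{aligned}
\]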
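Now, we have
\[ r_{xy} = \pm \sqrt{b_{yx} \times b_{xy}} \]
Then, the defining formulae can be applied to find the following.
\[
\frac{r_{xy}}{b_{yx}} = \frac{\frac{\sigma_{xy}}{\sigma_x \sigma_y}}{\frac{\sigma_{xy}}{\sigma_x^2}} = \frac{\sigma_x}{\sigma_y}
\qquad\qquad
\frac{r_{xy}}{b_{xy}} = \frac{\frac{\sigma_{xy}}{\sigma_x \sigma_y}}{\frac{\sigma_{xy}}{\sigma_y^2}} = \frac{\sigma_y}{\sigma_x}
\]
\[
\Rightarrow\ r_{xy} = b_{yx} \times \frac{\sigma_x}{\sigma_y}
\qquad\qquad
\Rightarrow\ r_{xy} = b_{xy} \times \frac{\sigma_y}{\sigma_x}
\]
\[
\Rightarrow\ b_{yx} = r_{xy} \times \frac{\sigma_y}{\sigma_x}
\qquad\qquad
\Rightarrow\ b_{xy} = r_{xy} \times \frac{\sigma_x}{\sigma_y}
\]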
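These relations can be checked numerically. The sketch below computes both regression coefficients from the covariance and variances, recovers r as the signed square root of their product (taking the sign of the covariance), and uses the line of y on x for a prediction. The data and the function names are hypothetical.

```python
import math

def regression_summary(x, y):
    """Return (mean_x, mean_y, b_yx, b_xy, r) using divisor N throughout."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / n
    syy = sum((yi - my) ** 2 for yi in y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    b_yx = sxy / sxx                      # slope of the line of y on x
    b_xy = sxy / syy                      # slope of the line of x on y
    r = math.copysign(math.sqrt(b_yx * b_xy), sxy)   # sign follows the covariance
    return mx, my, b_yx, b_xy, r

def predict_y(x_new, mx, my, b_yx):
    """Line of y on x: y = mean_y + b_yx * (x - mean_x)."""
    return my + b_yx * (x_new - mx)

x = [1, 2, 3, 4, 5]                       # hypothetical observations
y = [2, 4, 5, 4, 5]
mx, my, b_yx, b_xy, r = regression_summary(x, y)
print(b_yx, b_xy, r)                      # about 0.6, 1.0, 0.7746
print(predict_y(6, mx, my, b_yx))         # about 5.8
```

Both regression coefficients always carry the same sign as the covariance, which is why the sign of r can be read off from either of them.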
Rank Correlation: Sometimes there doesn’t exist a marked linear relationship between two
random variables but a monotonic relation (if one increases, the other also increases or instead,
decreases) is clearly noticed.
If, instead of measuring the correlation between the numerical values of two continuous random
variables $x$ and $y$, we replace the values by their rankings and correlate the ranks, then we
obtain the Rank Correlation coefficient. The rank of the $i$-th element of a sample of size $N$ is
its index among the order statistics, i.e., its position when the sample is sorted in ascending order.
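One way to sketch this idea is to rank each sample and then compute the ordinary correlation coefficient on the ranks. The helper below assumes there are no tied values (ties would need average ranks, which are not handled here). The data are hypothetical.

```python
import math

def ranks(values):
    """Rank 1 for the smallest value, N for the largest; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def rank_correlation(x, y):
    """Correlation coefficient computed on the ranks of x and y."""
    return correlation(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]          # monotonic but not linear in x
print(correlation(x, y))         # less than 1
print(rank_correlation(x, y))    # 1.0
```

The example shows the point made above: a perfectly monotonic relation gives a rank correlation of 1 even though the ordinary correlation coefficient falls short of 1.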