
Statistical Fundamentals

The document provides an overview of statistics, emphasizing its importance in various fields and everyday life. It distinguishes between descriptive and inferential statistics, explains types of variables, and discusses sampling methods, data types, and measures of central tendency. Additionally, it introduces frequency distributions and graphical representations of data, such as histograms and pie charts.

Uploaded by

tanjim09826
Copyright
© © All Rights Reserved

Fundamental Issues of Statistics

Introduction to Statistics
➢ What do you think of when you see/hear the word “Statistics”? The majority of people
immediately think of numerical facts, data, graphs and tables. But not only do statisticians
collect, classify and tabulate data, they also analyze data in order to make generalizations and
decisions.

Why study statistics?


• Everyone comes in contact with statistics in everyday life.
• People should understand reports in newspapers, magazines and journals.
• People should be able to question the statistics they read, and not blindly accept them as
proven fact.

➢ Many areas of study use statistics, such as psychology, sociology, business, biology,
government, engineering, science education and even areas such as history, language and the
arts.

➢ Statistics is the science of collecting, organizing, summarizing, and analyzing data to draw
conclusions or answer questions. It also provides a measure of confidence in any
conclusions.

➢ Statistics is the discipline that concerns the collection, organization, analysis, interpretation,
and presentation of data.

Two types of statistics:


• Descriptive statistics: the use of numbers to summarize information which is known
about some population. [collecting, organizing and summarizing the data]

Descriptive Statistics is the branch of statistics that focuses on collecting, summarizing,
and presenting a set of data.

• Inferential statistics: the use of numbers related to a random sample from a population
to give numerical information about the population itself. [analyzing the data to draw
conclusions or answer questions about the population]

Inferential Statistics is the branch of statistics that analyzes sample data to draw
conclusions about a population.

➢ Probability is very important in inferential statistics; it’s related to the risk of making an
error.

➢ Variables are characteristics of the individuals or things being studied. A variable is a
characteristic of an individual that will be analyzed using statistics.

Two types of variables:
• Qualitative or categorical variable – classification based on some attribute or
characteristic of the individual (non-numerical).

Categorical (qualitative) variables have values that can only be placed into categories,
such as “yes” and “no”; major; architectural style; etc. Example: hair color, eye color,
gender, race, ethnicity

• Quantitative or numerical variable – provides numerical measures of individuals.

Numerical (quantitative) variables have values that represent quantities.

Two types of quantitative variables:


✓ Discrete (something that can be counted) – has either a finite number of possible
values or a countable number of possible values.

Discrete variables arise from a counting process. Example: number of cars at a light, number
of students in a classroom, number of rooms in a house.

✓ Continuous (something that can be measured) – has an infinite number of possible
values that are not countable.

Continuous variables arise from a measuring process. Example: height, weight, age, miles
per gallon, time.

➢ Raw scores or data: Numbers obtained in a particular situation. A collection of raw scores is
usually called a distribution of scores. Data are individual facts or items of information.

Examples of a distribution:
• Test scores on an exam in a particular class.
• Ages of students at MCCC.
• IQ’s of a random sample of 6th grade students in the Trenton school district.

➢ Population – All people or things being considered in a particular situation.

A population consists of all the items or individuals or subjects about which you want to
draw a conclusion. So, the population is the “large group” in which you are interested.

Example of a population and of a population characteristic:

• All students in the 6th grade in the Trenton school district (the population).
• The mean IQ score of all 6th grade students in the Trenton school district (a numerical
characteristic of that population).

✓ A parameter is a numerical summary of a population. A parameter is a numerical measure that
describes a characteristic of a population.

➢ Sample – any portion (subset) of a population under consideration.

A sample is the portion of a population selected for analysis. The sample is the “small group”
for whom we have (or plan to have) data, often randomly selected.

Example of a sample and of a sample characteristic:

• Fifty 6th grade students from the Trenton school district (the sample).
• The mean IQ score for fifty 6th grade students from the Trenton school district (a numerical
characteristic of that sample).

✓ A statistic is a numerical summary of a sample. A statistic is a numerical measure that
describes a characteristic of a sample.

✓ A parameter goes with a population and a statistic goes with a sample.

o Random Sample: A sample selected in such a way that every member of the population has
an equal chance of being selected. The members of the random sample are picked from the
population by chance, not by the choice of the investigator.

Explanation:
Population: All students attending Mercer County Community College

Variable: Some measure of mathematical ability

Sample: Students leaving a section of calculus at MCCC.

This is not a random sample from the population of all students at MCCC. From this sample we
should not attempt to infer anything about the mathematical ability of all students at MCCC.

A bias in obtaining a sample will destroy the value of the statistical information obtained since
statistical inferences made from this information would be invalid.

Why use a sample instead of a population?


• Time
• Money
• Population continuously changing.
• Cannot use the entire population.

Primary & Secondary Data:


Primary data are the original data derived from your own research endeavors, that is,
information collected through original or first-hand research. Examples are surveys and focus
group discussions. Secondary data, on the other hand, are information that has been collected
in the past by someone else. Examples are material found on the internet, in newspaper
articles, and in company reports.

Basic Terms of Frequency Table

Let us consider the following frequency distribution table of the weights of 50 students.

Table: Frequency distribution table for the weight of 50 students


Class Interval   Class Boundary     Class Mark        Frequency   Cumulative
(Class)          (Original Class)   (Mid Value) (x)   (f)         Frequency (F)
54 - 57          53.5 - 57.5        55.5               6           6
58 - 61          57.5 - 61.5        59.5               9          15
62 - 65          61.5 - 65.5        63.5              11          26
66 - 69          65.5 - 69.5        67.5              16          42
70 - 73          69.5 - 73.5        71.5               8          50

A frequency distribution shows a summarized grouping of data divided into mutually
exclusive classes, together with the number of occurrences in each class. It is a way of
presenting unorganized data, notably to show the results of an event of interest.

The frequency distribution table is an arrangement of the values that one or more variables take
in a sample. Each entry in the table contains the frequency or count of the occurrences of values
within a particular group or interval, and in this way, the table summarizes the distribution of
values in the sample.

The frequency distribution is a representation, in either a graphical or a tabular format, that
displays the number of observations within a given interval. The interval size depends on the
data being analyzed and the goals of the analyst. The intervals must be mutually exclusive and
exhaustive.

In statistics, a frequency distribution is a list, table or graph that displays the frequency of
various outcomes in a sample. Each entry in the table contains the frequency or count of the
occurrences of values within a particular group or interval.

The class interval (or class width) is usually the same for all classes. Taken together, the
classes must cover at least the distance from the lowest value (minimum) in the data to the
highest value (maximum). Equal class intervals are preferred in a frequency distribution, while
unequal class intervals (for example, logarithmic intervals) may be necessary in certain
situations to produce a good spread of observations between the classes and to avoid a large
number of empty, or almost empty, classes.

Corresponding to a class interval, the class limits may be defined as the minimum value and the
maximum value the class interval may contain. The minimum value is known as the lower-class
limit (LCL) and the maximum value is known as the upper-class limit (UCL).

The class boundaries may be defined as the actual class limits of a class interval. For an
overlapping (mutually exclusive) classification, the class boundaries coincide with the class
limits. This is usually done for a continuous variable. However, for a non-overlapping (mutually
inclusive) classification, the lower-class boundary (LCB) and upper-class boundary (UCB) have
the following forms.

\[ LCB = LCL - \frac{D}{2} \quad \text{and} \quad UCB = UCL + \frac{D}{2} \]

where D is the difference between the LCL of the next class interval and the UCL of the given
class interval.

The class midpoint (or class mark) is the point at the center of a class in a frequency
distribution table. It is also the center of the corresponding bar in a histogram. It is defined
as the average of the lower and upper class limits, where the lower-class limit is the lowest
value that can be in the class and the upper-class limit is the highest. In other words, the
class midpoint of a class interval is the arithmetic mean of the class limits or, equivalently,
of the class boundaries.

The frequency (or absolute frequency) of an event is the number of times the event occurred in
an experiment or study. These frequencies are often graphically represented in histograms.

Cumulative frequency is defined as a running total of frequencies. The frequency of an element
in a set refers to how many of that element there are in the set. Cumulative frequency can also be
defined as the sum of all previous frequencies up to the current point.

Cumulative frequency analysis is the analysis of the frequency of occurrence of values of a
phenomenon less than a reference value. The phenomenon may be time- or space-dependent.
Cumulative frequency is also called the frequency of non-exceedance.
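
The columns of the weight table can be reproduced from the class limits and frequencies alone. The following is a minimal Python sketch under the assumption that the gap D between consecutive classes is constant; the variable names are illustrative and are not part of the original notes.

```python
# Class limits (LCL, UCL) and frequencies taken from the weight table.
limits = [(54, 57), (58, 61), (62, 65), (66, 69), (70, 73)]
freqs = [6, 9, 11, 16, 8]

# D: difference between the LCL of the next class and the UCL of this class
# (assumed to be the same for every pair of neighbouring classes).
D = limits[1][0] - limits[0][1]

rows = []
running_total = 0
for (lcl, ucl), f in zip(limits, freqs):
    lcb = lcl - D / 2          # lower-class boundary
    ucb = ucl + D / 2          # upper-class boundary
    mark = (lcl + ucl) / 2     # class mark (midpoint)
    running_total += f         # cumulative frequency
    rows.append((lcb, ucb, mark, f, running_total))

for row in rows:
    print(row)
```

Each printed tuple holds the lower boundary, upper boundary, class mark, frequency, and cumulative frequency of one class.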

Statistical Graphs
Frequency polygon: It is formed by line segments joining the points plotted for each class. On
the horizontal axis is the independent variable (class marks) and on the vertical axis is the
dependent variable (frequency).

Cumulative frequency polygon: It is formed by rising line segments, since a cumulative
frequency never decreases. On the horizontal axis is the independent variable (class upper
limit) and on the vertical axis is the dependent variable (cumulative frequency).

Pie chart: A circle is divided into sectors. The angle of each sector is proportional to the
corresponding frequency.
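
Because the whole circle spans 360 degrees, a class with frequency f out of a total N receives a sector of 360 × f / N degrees. A small Python sketch of this rule follows; it reuses the frequencies of the weight table purely as illustrative input, and the dictionary layout is an assumption of this example.

```python
import math

# Frequencies of the five weight classes, used here only as sample input.
freqs = {"54-57": 6, "58-61": 9, "62-65": 11, "66-69": 16, "70-73": 8}

total = sum(freqs.values())

# Sector angle in degrees: proportional to the class frequency.
angles = {label: 360 * f / total for label, f in freqs.items()}

for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```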

Example: Suppose the total number of students is 300.

Number secured     Allocated subject
40 − 55            French
55 − 70            English
70 − 85            Science
85 − 100           Mathematics

Example: A woman has to buy 150 domestic items in total. The price ranges of the items are
listed below, and the percentage of each type of item is as provided in the given pie chart.

Toys 1000 − 2000, furniture 3000 − 4000, home décor 4000 − 5000, and
electronics 2000 − 3000.

Histogram: It is a bar graph in which the height of each bar is proportional to the frequency
of its class (for classes of equal width). There is no space between bars. It is used only if
the variable is quantitative and the scale of the values is continuous.

The relation between the frequency polygon and the histogram:

Please, visit the following links for more details.


https://www.statisticshowto.datasciencecentral.com/
https://www.tutorialspoint.com/statistics/index.htm
https://people.richland.edu/james/lecture/m170/ch02-def.html
https://www.scribbr.com/statistics/
https://libguides.library.curtin.edu.au/uniskills/numeracy-skills/statistics
https://www.statisticshowto.com/probability-and-statistics/
https://courses.lumenlearning.com/introstats1/chapter/learning-outcomes/

Measures of Central Tendency
Central tendency: A measure of central tendency is a single value that attempts to describe a set
of data by identifying the central position within that set of data. As such, measures of central
tendency are sometimes called measures of central location.

The Mean, Mode (Mo) and Median (Me) are all valid measures of central tendency, but under
different conditions some of them become more appropriate to use than others. Related
positional measures, such as the Quartile, Decile, and Percentile, locate other points of the
distribution and are estimated in the same way as the Median.

There are three types of Mean, namely Arithmetic Mean (AM), Geometric Mean (GM), and
Harmonic Mean (HM).

For 𝑛 classes, the Arithmetic Mean can be estimated as

\[ AM = \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \]

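For the shift 𝑎 and scale ℎ, the coding formula for the Arithmetic Mean can be found as

\[
\begin{aligned}
u_i &= \frac{x_i - a}{h} \\
\Rightarrow x_i &= a + h u_i \\
\Rightarrow f_i x_i &= a f_i + h f_i u_i \\
\Rightarrow \sum_{i=1}^{n} f_i x_i &= a \sum_{i=1}^{n} f_i + h \sum_{i=1}^{n} f_i u_i \\
\Rightarrow \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} &= a + h \, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i}
\end{aligned}
\]

So,

\[ AM = a + h \, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} \]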
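
As a check on the coding formula, the direct and coded calculations can be compared on the weight table. This is a minimal Python sketch; the choice of shift a = 63.5 (the middle class mark) and scale h = 4 (the class size) is an assumption made for the illustration, and the function names are hypothetical.

```python
import math

# Class marks and frequencies from the weight table.
marks = [55.5, 59.5, 63.5, 67.5, 71.5]
freqs = [6, 9, 11, 16, 8]

def arithmetic_mean(x, f):
    """Weighted mean: sum(f_i * x_i) / sum(f_i)."""
    return sum(fi * xi for xi, fi in zip(x, f)) / sum(f)

def coded_mean(x, f, a, h):
    """Mean via coded values u_i = (x_i - a) / h, then AM = a + h * mean(u)."""
    u = [(xi - a) / h for xi in x]
    return a + h * arithmetic_mean(u, f)

am_direct = arithmetic_mean(marks, freqs)
am_coded = coded_mean(marks, freqs, a=63.5, h=4)

print(am_direct, am_coded)
```

With this choice of a and h the coded values are the small integers −2, −1, 0, 1, 2, which is what makes the coding method convenient for hand calculation.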

For 𝑛 classes, the Geometric Mean can be estimated as

\[ GM = \left( \prod_{i=1}^{n} x_i^{f_i} \right)^{\frac{1}{\sum_{i=1}^{n} f_i}} \]

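For feasible estimation, the working formula for the Geometric Mean can be found as

\[
\begin{aligned}
\log(GM) &= \log \left( \prod_{i=1}^{n} x_i^{f_i} \right)^{\frac{1}{\sum_{i=1}^{n} f_i}} \\
\Rightarrow \log(GM) &= \frac{1}{\sum_{i=1}^{n} f_i} \log \left( \prod_{i=1}^{n} x_i^{f_i} \right) \\
\Rightarrow \log(GM) &= \frac{\sum_{i=1}^{n} \log(x_i^{f_i})}{\sum_{i=1}^{n} f_i} \\
\Rightarrow \log(GM) &= \frac{\sum_{i=1}^{n} f_i \log(x_i)}{\sum_{i=1}^{n} f_i}
\end{aligned}
\]

So,

\[ GM = \operatorname{Antilog} \left( \frac{\sum_{i=1}^{n} f_i \log x_i}{\sum_{i=1}^{n} f_i} \right) = 10^{\left( \frac{\sum_{i=1}^{n} f_i \log x_i}{\sum_{i=1}^{n} f_i} \right)} \]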

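For 𝑛 classes, the Harmonic Mean can be estimated as

\[ HM = \left( \frac{\sum_{i=1}^{n} f_i \left( \frac{1}{x_i} \right)}{\sum_{i=1}^{n} f_i} \right)^{-1} = \left( \frac{\sum_{i=1}^{n} \left( \frac{f_i}{x_i} \right)}{\sum_{i=1}^{n} f_i} \right)^{-1} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} \left( \frac{f_i}{x_i} \right)} \]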
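
The logarithmic working formula for the Geometric Mean and the reciprocal form of the Harmonic Mean translate directly into code. The sketch below is illustrative only: it applies them to the class marks and frequencies of the weight table, assumes all class marks are positive, and uses hypothetical function names.

```python
import math

# Class marks and frequencies from the weight table.
marks = [55.5, 59.5, 63.5, 67.5, 71.5]
freqs = [6, 9, 11, 16, 8]

def geometric_mean(x, f):
    """GM = 10 ** (sum(f_i * log10(x_i)) / sum(f_i)); requires every x_i > 0."""
    log_mean = sum(fi * math.log10(xi) for xi, fi in zip(x, f)) / sum(f)
    return 10 ** log_mean

def harmonic_mean(x, f):
    """HM = sum(f_i) / sum(f_i / x_i); requires every x_i > 0."""
    return sum(f) / sum(fi / xi for xi, fi in zip(x, f))

am = sum(fi * xi for xi, fi in zip(marks, freqs)) / sum(freqs)
gm = geometric_mean(marks, freqs)
hm = harmonic_mean(marks, freqs)

print(am, gm, hm)
```

For positive values that are not all equal, the three means satisfy HM < GM < AM, which gives a quick sanity check on the output.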

The Mode of a frequency distribution can be estimated as

\[ Mo = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times C \]

The class at which the highest frequency is present is called the modal class. 𝐿 is the lower limit
of the modal class, ∆1 = 𝑓𝑚 − 𝑓𝑚−1 is the frequency difference between the modal and pre-modal
class, ∆2 = 𝑓𝑚 − 𝑓𝑚+1 is the frequency difference between the modal and post-modal class, and
𝐶 is the class size.
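
A short Python sketch of the mode formula, applied to the weight table, is given below. Two choices here are assumptions of the sketch rather than statements from the notes: the lower class boundary (65.5 rather than 66) is used for 𝐿 because the classes are non-overlapping, and a missing neighbouring class is treated as having frequency 0. The sketch also assumes a single modal class.

```python
# Lower class boundaries and frequencies from the weight table; class size 4.
lower_bounds = [53.5, 57.5, 61.5, 65.5, 69.5]
freqs = [6, 9, 11, 16, 8]
width = 4

def grouped_mode(lower_bounds, freqs, width):
    """Mo = L + d1 / (d1 + d2) * C for the class with the highest frequency."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])       # modal class index
    f_prev = freqs[m - 1] if m > 0 else 0                    # pre-modal frequency
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0       # post-modal frequency
    d1 = freqs[m] - f_prev
    d2 = freqs[m] - f_next
    return lower_bounds[m] + d1 / (d1 + d2) * width

mode = grouped_mode(lower_bounds, freqs, width)
print(mode)
```

Here the modal class is 66 - 69 with frequency 16, so Δ1 = 16 − 11 = 5 and Δ2 = 16 − 8 = 8.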

The Median of a frequency distribution can be estimated as

\[ Me = L + \frac{\frac{N}{2} - F_{m-1}}{f_m} \times C \]

The class in which the cumulative frequency first reaches \( \frac{N}{2} \), where \( N = \sum_{i=1}^{n} f_i \), is called the median class. 𝐿 is the
lower limit of the median class, \( F_{m-1} \) is the cumulative frequency of the pre-median class, \( f_m \) is the
frequency of the median class, and 𝐶 is the class size.

The formula for the Quartile is

\[ Q_i = L + \frac{\frac{i \times N}{4} - F_{q-1}}{f_q} \times C ; \quad i = 1, 2, 3 \]

The class in which the cumulative frequency first reaches \( \frac{i \times N}{4} \), where \( N = \sum_{i=1}^{n} f_i \), is called the 𝑖-th quartile class.
𝐿 is the lower limit of the quartile class, \( F_{q-1} \) is the cumulative frequency of the pre-quartile class, \( f_q \) is
the frequency of the quartile class, and 𝐶 is the class size.

The formula for the Decile is

\[ D_i = L + \frac{\frac{i \times N}{10} - F_{d-1}}{f_d} \times C ; \quad i = 1, 2, \ldots, 9 \]

The class in which the cumulative frequency first reaches \( \frac{i \times N}{10} \) is called the 𝑖-th decile class. 𝐿
is the lower limit of the decile class, \( F_{d-1} \) is the cumulative frequency of the pre-decile class, \( f_d \) is the
frequency of the decile class, and 𝐶 is the class size.

The formula for the Percentile is

\[ P_i = L + \frac{\frac{i \times N}{100} - F_{p-1}}{f_p} \times C ; \quad i = 1, 2, \ldots, 99 \]

The class in which the cumulative frequency first reaches \( \frac{i \times N}{100} \) is called the 𝑖-th percentile
class. 𝐿 is the lower limit of the percentile class, \( F_{p-1} \) is the cumulative frequency of the pre-percentile
class, \( f_p \) is the frequency of the percentile class, and 𝐶 is the class size.
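
The Median, Quartile, Decile, and Percentile formulas share one pattern: only the divisor (2, 4, 10, or 100) changes. The Python sketch below captures that pattern in a single hypothetical function, `grouped_quantile`, and applies it to the weight table. As an assumption of the sketch, the lower class boundaries are used for 𝐿 because the classes are non-overlapping.

```python
from itertools import accumulate

# Lower class boundaries and frequencies from the weight table; class size 4.
lower_bounds = [53.5, 57.5, 61.5, 65.5, 69.5]
freqs = [6, 9, 11, 16, 8]
width = 4

def grouped_quantile(lower_bounds, freqs, width, i, parts):
    """i-th of `parts` quantiles: L + (i*N/parts - F_prev) / f * C."""
    N = sum(freqs)
    target = i * N / parts
    cumulative = list(accumulate(freqs))
    # First class whose cumulative frequency reaches the target position.
    k = next(j for j, F in enumerate(cumulative) if F >= target)
    F_prev = cumulative[k - 1] if k > 0 else 0
    return lower_bounds[k] + (target - F_prev) / freqs[k] * width

median = grouped_quantile(lower_bounds, freqs, width, 1, 2)
q1 = grouped_quantile(lower_bounds, freqs, width, 1, 4)
q3 = grouped_quantile(lower_bounds, freqs, width, 3, 4)
d5 = grouped_quantile(lower_bounds, freqs, width, 5, 10)
p50 = grouped_quantile(lower_bounds, freqs, width, 50, 100)

print(median, q1, q3, d5, p50)
```

The second quartile, fifth decile, and fiftieth percentile all coincide with the Median, since each targets the position 𝑁/2.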

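For a moderately skewed distribution, the Mean, Median, and Mode are approximately related by

\[ AM - Mo = 3(AM - Me) \]

\[ \Rightarrow Mo = 3Me - 2AM \]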

Measures of Dispersion
Dispersion: Dispersion in statistics is a way of describing how spread out a set of data is.
When a measure of dispersion is large, the values in the set are widely scattered; when it is
small, the items in the set are tightly clustered.

The spread of a data set can be described by a range of descriptive statistics including Mean
Deviation (MD), Standard Deviation (SD), and Interquartile Range. Those are called the absolute
measures of dispersion.

Also, there are some relative measures of dispersion, such as co-efficient of Mean Deviation, co-
efficient of Standard Deviation, and co-efficient of Interquartile Range.

There are three types of Mean Deviation, estimated from Arithmetic Mean, Mode, and Median,
respectively.

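\[ MD_{AM} = \frac{\sum_{i=1}^{n} f_i \, |x_i - AM|}{\sum_{i=1}^{n} f_i} \]

\[ MD_{Mo} = \frac{\sum_{i=1}^{n} f_i \, |x_i - Mo|}{\sum_{i=1}^{n} f_i} \]

\[ MD_{Me} = \frac{\sum_{i=1}^{n} f_i \, |x_i - Me|}{\sum_{i=1}^{n} f_i} \]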

Co-efficient of Mean Deviation is

\[ CMD = \frac{MD_{Base}}{Base} \times 100\% ; \quad Base = AM, Mo, Me \]
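
The three Mean Deviations differ only in the base value subtracted, so one function with a `base` argument covers them all. The following Python sketch, with hypothetical names, computes the Mean Deviation about the Arithmetic Mean and its co-efficient for the weight table.

```python
import math

# Class marks and frequencies from the weight table.
marks = [55.5, 59.5, 63.5, 67.5, 71.5]
freqs = [6, 9, 11, 16, 8]

def mean_deviation(x, f, base):
    """MD about `base`: sum(f_i * |x_i - base|) / sum(f_i)."""
    return sum(fi * abs(xi - base) for xi, fi in zip(x, f)) / sum(f)

am = sum(fi * xi for xi, fi in zip(marks, freqs)) / sum(freqs)
md_am = mean_deviation(marks, freqs, am)

# Co-efficient of Mean Deviation, expressed as a percentage of the base.
cmd_am = md_am / am * 100

print(md_am, cmd_am)
```

Passing the Mode or the Median as `base` gives the other two Mean Deviations.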

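Variance and Standard Deviation of the statistical data are

\[ V(X) = \sigma^2 = \frac{\sum_{i=1}^{n} f_i (x_i - AM)^2}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i} \]

\[ SD = \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - AM)^2}{\sum_{i=1}^{n} f_i}} = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}} \]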

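The working formula for Standard Deviation is

\[
\begin{aligned}
\sigma &= \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}}
= \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)}{\sum_{i=1}^{n} f_i}}
= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2 - 2\bar{x} \sum_{i=1}^{n} f_i x_i + \bar{x}^2 \sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} f_i}} \\
&= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - 2\bar{x} \, \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} + \bar{x}^2}
= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - 2 \, \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \cdot \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} + \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2} \\
&= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - 2 \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2 + \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2}
= \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - \left( \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \right)^2}
\end{aligned}
\]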

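The coding formula for Standard Deviation is derived from \( u_i = \frac{x_i - a}{h} \) as

\[
\begin{aligned}
x_i &= a + h u_i \\
\Rightarrow \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} &= a + h \, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} \\
\Rightarrow \bar{x} &= a + h \bar{u}
\end{aligned}
\]

So,

\[ x_i - \bar{x} = h (u_i - \bar{u}) \]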

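Then, we can write

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}} = \sqrt{\frac{\sum_{i=1}^{n} f_i h^2 (u_i - \bar{u})^2}{\sum_{i=1}^{n} f_i}} = h \times \sqrt{\frac{\sum_{i=1}^{n} f_i (u_i - \bar{u})^2}{\sum_{i=1}^{n} f_i}} \]

Now, the working formula for Standard Deviation is

\[ SD = h \times \sqrt{\frac{\sum_{i=1}^{n} f_i u_i^2}{\sum_{i=1}^{n} f_i} - \left( \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} \right)^2} \]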

Co-efficient of Standard Deviation/Variation is

\[ CSD = \frac{SD}{AM} \times 100\% \]
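
The definition, the working formula, and the coded working formula should all give the same Standard Deviation. The Python sketch below checks this on the weight table. The shift a = 63.5 and scale h = 4 are assumptions chosen for convenience, and the function names are hypothetical.

```python
import math

# Class marks and frequencies from the weight table.
marks = [55.5, 59.5, 63.5, 67.5, 71.5]
freqs = [6, 9, 11, 16, 8]

def sd_definition(x, f):
    """sqrt(sum(f_i * (x_i - mean)^2) / sum(f_i))."""
    n = sum(f)
    mean = sum(fi * xi for xi, fi in zip(x, f)) / n
    return math.sqrt(sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f)) / n)

def sd_working(x, f):
    """sqrt(sum(f_i * x_i^2) / N - (sum(f_i * x_i) / N)^2)."""
    n = sum(f)
    mean_sq = sum(fi * xi ** 2 for xi, fi in zip(x, f)) / n
    mean = sum(fi * xi for xi, fi in zip(x, f)) / n
    return math.sqrt(mean_sq - mean ** 2)

def sd_coded(x, f, a, h):
    """h times the working-formula SD of the coded values u_i = (x_i - a) / h."""
    u = [(xi - a) / h for xi in x]
    return h * sd_working(u, f)

sd = sd_definition(marks, freqs)
am = sum(fi * xi for xi, fi in zip(marks, freqs)) / sum(freqs)
csd = sd / am * 100   # co-efficient of Standard Deviation, in percent

print(sd, sd_working(marks, freqs), sd_coded(marks, freqs, 63.5, 4), csd)
```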

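Interquartile Range is

\[ IQR = Q_3 - Q_1 \]

Co-efficient of Interquartile Range is

\[ CIQR = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\% \]

For an approximately normal distribution, the measures of dispersion are related by

\[ MD \approx \frac{4}{5} SD \]

\[ QD = \frac{IQR}{2} \approx \frac{2}{3} SD \]

where QD is the quartile deviation (half of the Interquartile Range).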

Moments, Skewness, and Kurtosis

Two distributions may have the same Mean and Standard Deviation but may differ in the shape
of the distribution. A further description of their characteristics is therefore necessary, and
it is provided by measures of skewness and kurtosis. Moments are popularly used to describe the
characteristics of a distribution. They represent a convenient and unifying method for
summarizing many of the most commonly used descriptive statistical measures such as central
tendency, variation, Skewness, and Kurtosis.

The term ‘skewness’ refers to a lack of symmetry or departure from symmetry, e.g., when a
distribution is not symmetrical (or is asymmetrical) it is called a skewed distribution. The
measures of skewness indicate the difference between the manners in which the observations are
distributed in a particular distribution compared with the symmetrical (or normal) distribution.
The concept of skewness gains importance from the fact that statistical theory is often based
upon the assumption of the normal distribution. A measure of skewness is, therefore, necessary
in order to guard against the consequence of this assumption.

In statistics, kurtosis refers to the degree of flatness or peakedness in the region about the mode
of a frequency curve. The degree of kurtosis of a distribution is measured relative to the flatness
or peakedness of a normal curve. A curve flatter than the normal curve is called “Platykurtic” and
a curve more peaked than the normal curve is called “Leptokurtic”. The normal curve itself is
known as “Mesokurtic”.


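The 𝑟-th raw moment about an arbitrary point 𝐴 is

\[ m_r' = \frac{\sum_{i=1}^{n} f_i (x_i - A)^r}{\sum_{i=1}^{n} f_i} ; \quad A \neq \bar{x} \]

The coding formula for the 𝑟-th raw moment follows from \( u_i = \frac{x_i - A}{h} \), which gives

\[ x_i - A = h u_i \]

Then, we can write

\[ m_r' = \frac{\sum_{i=1}^{n} f_i h^r u_i^r}{\sum_{i=1}^{n} f_i} = h^r \times \frac{\sum_{i=1}^{n} f_i u_i^r}{\sum_{i=1}^{n} f_i} \]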

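The 𝑟-th central moment about the arithmetic mean \( \bar{x} \) is

\[ m_r = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^r}{\sum_{i=1}^{n} f_i} \]

We can write

\[ m_1 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i x_i - \bar{x} \sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} - \bar{x} = \bar{x} - \bar{x} = 0 \]

So, \( m_1 = 0 \) for every dataset.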

Again, we can write

\[ m_2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i} = \sigma^2 \]

So, \( m_2 \) is the variance of the data set and hence \( SD = \sigma = \sqrt{m_2} \).

Estimation of the central moments from the raw moments

\[ m_1 = 0 \]

\[ m_2 = m_2' - m_1'^2 \]

\[ m_3 = m_3' - 3 m_2' m_1' + 2 m_1'^3 \]

\[ m_4 = m_4' - 4 m_3' m_1' + 6 m_2' m_1'^2 - 3 m_1'^4 \]

Co-efficient of Skewness

\[ \gamma_3 = \frac{m_3}{\sqrt{m_2^3}} \]

If \( \gamma_3 < 0 \), the provided data set is called negatively skewed.
If \( \gamma_3 = 0 \), the provided data set is called non-skewed (symmetric, as the Normal distribution is).
If \( \gamma_3 > 0 \), the provided data set is called positively skewed.

Co-efficient of Kurtosis

\[ \gamma_4 = \frac{m_4}{m_2^2} \]

If \( \gamma_4 < 3 \), the provided data set is called platykurtic (flat).
If \( \gamma_4 = 3 \), the provided data set is called mesokurtic (balanced).
If \( \gamma_4 > 3 \), the provided data set is called leptokurtic (peaked).
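
The conversion from raw to central moments and the two co-efficients can be checked numerically. The Python sketch below uses the weight table, with the arbitrary point A = 63.5 chosen as an assumption of the example. It compares the converted central moments against those computed directly about the mean. All function names are hypothetical.

```python
import math

# Class marks and frequencies from the weight table.
marks = [55.5, 59.5, 63.5, 67.5, 71.5]
freqs = [6, 9, 11, 16, 8]

def moment(x, f, r, point):
    """r-th moment about `point`: sum(f_i * (x_i - point)^r) / sum(f_i)."""
    return sum(fi * (xi - point) ** r for xi, fi in zip(x, f)) / sum(f)

def central_from_raw(r1, r2, r3, r4):
    """Central moments m2, m3, m4 from raw moments m1', m2', m3', m4'."""
    m2 = r2 - r1 ** 2
    m3 = r3 - 3 * r2 * r1 + 2 * r1 ** 3
    m4 = r4 - 4 * r3 * r1 + 6 * r2 * r1 ** 2 - 3 * r1 ** 4
    return m2, m3, m4

A = 63.5  # arbitrary point, different from the mean
raw = [moment(marks, freqs, r, A) for r in (1, 2, 3, 4)]
m2, m3, m4 = central_from_raw(*raw)

# Direct central moments about the arithmetic mean, for comparison.
mean = moment(marks, freqs, 1, 0)
direct = [moment(marks, freqs, r, mean) for r in (2, 3, 4)]

gamma3 = m3 / math.sqrt(m2 ** 3)   # co-efficient of skewness
gamma4 = m4 / m2 ** 2              # co-efficient of kurtosis

print(m2, m3, m4, gamma3, gamma4)
```

For these data the skewness co-efficient comes out negative and the kurtosis co-efficient falls below 3, so the weight distribution would be classed as negatively skewed and platykurtic.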

Corrected central moments for grouped data with class size 𝑐 (Sheppard's corrections)

\[ m_1^{(corrected)} = 0 \]

\[ m_2^{(corrected)} = m_2 - \frac{1}{12} c^2 \]

\[ m_3^{(corrected)} = m_3 \]

\[ m_4^{(corrected)} = m_4 - \frac{1}{2} m_2 c^2 + \frac{7}{240} c^4 \]

Correlation and Regression

Correlation Analysis: Correlation analysis is applied to quantify the association between two
continuous variables, for example, between a dependent and an independent variable or between
two independent variables.

The sign of the coefficient of correlation shows the direction of the association. The magnitude
of the coefficient shows the strength of the association. The sample correlation coefficient is
estimated in the correlation analysis.

It is denoted by 𝑟, ranges between −1 and +1, and quantifies the strength and direction of the
linear association between two variables. The correlation between two variables can be either
positive, i.e., a higher level of one variable is related to a higher level of the other, or
negative, i.e., a higher level of one variable is related to a lower level of the other.


Regression Analysis: Regression analysis involves identifying the relationship between a
dependent variable and one or more independent variables. The outcome variable is known as the
dependent or response variable, and the risk factors and confounders are known as predictors or
independent variables. In regression analysis, the dependent variable is denoted by 𝒚 and the
independent variables are denoted by 𝒙.

A model of the relationship is hypothesized, and estimates of the parameter values are used to
develop an estimated regression equation. Various tests are then employed to determine if the
model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can
be used to predict the value of the dependent variable given values for the independent variables.

Linear Regression: This is a linear approach to modeling the relationship between a scalar
response (the dependent variable) and one or more independent variables. If the regression has
one independent variable, then it is known as a simple linear regression. If it has more than one
independent variable, then it is known as multiple linear regression.

Linear regression focuses only on the conditional probability distribution of the response given
the values of the predictors, rather than on the joint probability distribution of all the
variables. Most real-world regression models involve multiple predictors, so the term linear
regression often refers to multiple linear regression.


Comparison between Correlation and Regression:

Basis: Meaning
    Correlation: A statistical measure that defines co-relationship or association of two variables.
    Regression: Describes how an independent variable is associated with the dependent variable.

Basis: Dependent and Independent variables
    Correlation: No difference.
    Regression: Both variables are different.

Basis: Usage
    Correlation: To describe a linear relationship between two variables.
    Regression: To fit the best line and estimate one variable based on another variable.

Basis: Objective
    Correlation: To find a value expressing the relationship between variables.
    Regression: To estimate the values of a random variable based on the values of a fixed variable.

Where 𝑁 is the number of observations and 𝑥 & 𝑦 are two variables, some well-known notations are

\[ V(X) = \sigma_x^2 = \frac{\sum (x - \bar{x})^2}{N} = S_{xx} \]

\[ V(Y) = \sigma_y^2 = \frac{\sum (y - \bar{y})^2}{N} = S_{yy} \]

\[ Cov(X, Y) = \sigma_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{N} = S_{xy} \]

Here, \( S_{xy} \) is called the covariance between 𝑥 & 𝑦.

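The correlation coefficient

\[
\begin{aligned}
r_{xy} = r_{yx} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
&= \frac{\frac{\sum (x - \bar{x})(y - \bar{y})}{N}}{\sqrt{\frac{\sum (x - \bar{x})^2}{N} \cdot \frac{\sum (y - \bar{y})^2}{N}}}
= \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \\
&= \frac{N \sum xy - \sum x \sum y}{\sqrt{\left( N \sum x^2 - (\sum x)^2 \right) \left( N \sum y^2 - (\sum y)^2 \right)}}
\end{aligned}
\]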

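Assume \( u = \frac{x - a}{h} \) and \( v = \frac{y - b}{k} \), where 𝑎 & 𝑏 are the shifts and ℎ & 𝑘 are the (positive) scales. Then it can be
shown that

\[ x = a + h u \quad \text{and} \quad y = b + k v \]

That gives

\[ \bar{x} = a + h \bar{u} \quad \text{and} \quad \bar{y} = b + k \bar{v} \]

So,

\[ x - \bar{x} = h (u - \bar{u}) \quad \text{and} \quad y - \bar{y} = k (v - \bar{v}) \]

\[
\begin{aligned}
r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}
&= \frac{h k \sum (u - \bar{u})(v - \bar{v})}{\sqrt{h^2 k^2 \sum (u - \bar{u})^2 \sum (v - \bar{v})^2}}
= \frac{\sum (u - \bar{u})(v - \bar{v})}{\sqrt{\sum (u - \bar{u})^2 \sum (v - \bar{v})^2}} = r_{uv} \\
&= \frac{N \sum uv - \sum u \sum v}{\sqrt{\left( N \sum u^2 - (\sum u)^2 \right) \left( N \sum v^2 - (\sum v)^2 \right)}}
\end{aligned}
\]

So, the Correlation coefficient is independent of the shift and scale.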
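
The computational form of the correlation coefficient, and its invariance under shift and scale, can be illustrated in a few lines of Python. The five (x, y) pairs below are hypothetical data invented for this sketch, as are the shifts and scales used for coding. The function name is also an assumption.

```python
import math

# Hypothetical paired observations, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def pearson_r(x, y):
    """r = (N*Sxy - Sx*Sy) / sqrt((N*Sxx - Sx^2) * (N*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Coded variables u = (x - a) / h and v = (y - b) / k with positive scales.
a, h = 3, 2
b, k = 4, 0.5
u = [(xi - a) / h for xi in x]
v = [(yi - b) / k for yi in y]

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)

print(r_xy, r_uv)
```

Both calls return the same value, consistent with the result that the coefficient does not depend on the shifts or on positive scales.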

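There are two Regression co-efficients

\[ b_{yx} = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{\frac{\sum (x - \bar{x})(y - \bar{y})}{N}}{\frac{\sum (x - \bar{x})^2}{N}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - (\sum x)^2} \]

\[ b_{xy} = \frac{\sigma_{xy}}{\sigma_y^2} = \frac{\frac{\sum (x - \bar{x})(y - \bar{y})}{N}}{\frac{\sum (y - \bar{y})^2}{N}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{N \sum xy - \sum x \sum y}{N \sum y^2 - (\sum y)^2} \]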

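Again, for a shifted and scaled data set

\[
\begin{aligned}
b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}
&= \frac{h k \sum (u - \bar{u})(v - \bar{v})}{h^2 \sum (u - \bar{u})^2}
= \frac{k}{h} \cdot \frac{\sum (u - \bar{u})(v - \bar{v})}{\sum (u - \bar{u})^2}
= \frac{k}{h} \times b_{vu} \\
&= \frac{k}{h} \times \left( \frac{N \sum uv - \sum u \sum v}{N \sum u^2 - (\sum u)^2} \right)
\end{aligned}
\]

\[
\begin{aligned}
b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}
&= \frac{h k \sum (u - \bar{u})(v - \bar{v})}{k^2 \sum (v - \bar{v})^2}
= \frac{h}{k} \cdot \frac{\sum (u - \bar{u})(v - \bar{v})}{\sum (v - \bar{v})^2}
= \frac{h}{k} \times b_{uv} \\
&= \frac{h}{k} \times \left( \frac{N \sum uv - \sum u \sum v}{N \sum v^2 - (\sum v)^2} \right)
\end{aligned}
\]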

So, the Regression coefficients are independent of the shifts but dependent on scales.

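The Regression lines can be formed as given below.

Regression line of 𝑦 on 𝑥 is

\[
\begin{aligned}
y - \bar{y} &= b_{yx} (x - \bar{x}) \\
\Rightarrow y - \frac{\sum y}{N} &= b_{yx} \left( x - \frac{\sum x}{N} \right) \\
\Rightarrow y &= b_{yx} \left( x - \frac{\sum x}{N} \right) + \frac{\sum y}{N}
\end{aligned}
\]

Regression line of 𝑥 on 𝑦 is

\[
\begin{aligned}
x - \bar{x} &= b_{xy} (y - \bar{y}) \\
\Rightarrow x - \frac{\sum x}{N} &= b_{xy} \left( y - \frac{\sum y}{N} \right) \\
\Rightarrow x &= b_{xy} \left( y - \frac{\sum y}{N} \right) + \frac{\sum x}{N}
\end{aligned}
\]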

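Now, we have

\[
\begin{aligned}
b_{yx} \times b_{xy} &= \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \times \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}
= \frac{\left( \sum (x - \bar{x})(y - \bar{y}) \right)^2}{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2} \\
&= \left\{ \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \right\}^2 = r_{xy}^2
\end{aligned}
\]

So, it can be written as

\[ r_{xy} = \pm \sqrt{b_{yx} \times b_{xy}} \]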

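Then, the defining formulae can be applied to find the following.

\[
\frac{r_{xy}}{b_{yx}} = \frac{\frac{\sigma_{xy}}{\sigma_x \sigma_y}}{\frac{\sigma_{xy}}{\sigma_x^2}} = \frac{\sigma_x}{\sigma_y}
\quad \Rightarrow \quad r_{xy} = b_{yx} \times \frac{\sigma_x}{\sigma_y}
\quad \Rightarrow \quad b_{yx} = r_{xy} \times \frac{\sigma_y}{\sigma_x}
\]

\[
\frac{r_{xy}}{b_{xy}} = \frac{\frac{\sigma_{xy}}{\sigma_x \sigma_y}}{\frac{\sigma_{xy}}{\sigma_y^2}} = \frac{\sigma_y}{\sigma_x}
\quad \Rightarrow \quad r_{xy} = b_{xy} \times \frac{\sigma_y}{\sigma_x}
\quad \Rightarrow \quad b_{xy} = r_{xy} \times \frac{\sigma_x}{\sigma_y}
\]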
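
The two Regression co-efficients, the regression line of 𝑦 on 𝑥, and the identity linking the co-efficients to the correlation coefficient can be illustrated together. The Python sketch below uses five hypothetical (x, y) pairs invented for the example. The function names are assumptions, and the sign of r is taken from the common sign of the two co-efficients.

```python
import math

# Hypothetical paired observations, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def regression_coefficients(x, y):
    """Return (b_yx, b_xy) from the computational sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = n * sxy - sx * sy
    return numerator / (n * sxx - sx ** 2), numerator / (n * syy - sy ** 2)

b_yx, b_xy = regression_coefficients(x, y)
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

def predict_y(x_new):
    """Regression line of y on x: y = y_bar + b_yx * (x - x_bar)."""
    return y_bar + b_yx * (x_new - x_bar)

# r has the same sign as the two regression co-efficients.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(b_yx, b_xy, r, predict_y(6))
```

For these data b_yx = 0.6 and b_xy = 1.0, so their product 0.6 equals the square of the correlation coefficient.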

Rank Correlation: Sometimes there doesn’t exist a marked linear relationship between two
random variables but a monotonic relation (if one increases, the other also increases or instead,
decreases) is clearly noticed.

If instead of measuring the correlation between two sets of continuous random variables 𝑥 & 𝑦
we replace their numerical values by their rankings, then we obtain the Rank Correlation
coefficient. The rank of the 𝑖 − 𝑡ℎ element of a sample of size 𝑁 is equal to the index of the
order statistic.

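The Rank Correlation coefficient

\[ r_{rank} = 1 - \frac{6 \sum d_i^2}{N (N^2 - 1)} ; \quad d_i = x_i - y_i \]

where \( x_i \) and \( y_i \) here denote the ranks of the 𝑖-th pair of observations, so \( d_i \) is the difference
between the two ranks.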
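
A minimal Python sketch of the rank correlation formula follows. It assumes there are no tied values, since ties would need averaged ranks that this simple formula does not handle. The data and function names are hypothetical and serve only as an illustration.

```python
# Hypothetical paired observations without ties, invented for illustration.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

def ranks(values):
    """Rank 1 for the smallest value, N for the largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def rank_correlation(x, y):
    """r_rank = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)), d_i = rank difference."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

r_rank = rank_correlation(x, y)
print(r_rank)
```

A perfectly increasing relation gives +1 and a perfectly decreasing relation gives −1, whatever the actual numerical values are.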
