Fds Unit II Notes
Types of Data
• Data is a collection of facts and figures which relay something specific but
which are not organized in any way. Data can be numbers, words, measurements,
observations or even just descriptions of things. We can say that data is the
raw material in the production of information.
• A data set is a collection of data objects and their attributes. An attribute
captures a basic characteristic of an object.
• Each row of a data set is called a record. Each data set also has multiple
attributes, each of which gives information on a specific characteristic.
• Data can broadly be divided into the following two types: qualitative data
and quantitative data.
Qualitative data:
1. Nominal data
2. Ordinal data
Quantitative data:
• There are two types of quantitative data: Interval data and ratio data.
1. Advantages:
customers.
• Avoid pre-judgments
2. Disadvantages:
• Time consuming
1. Advantages:
2. Disadvantages:
Ranked Data
• Ranked data is a variable whose values are taken from an ordered set and
recorded in order of magnitude. Ranked data is also called ordinal data.
c) Along with the information provided by the nominal scale, ordinal scales
give the rankings of those variables
e) The surveyors can quickly analyze the degree of agreement concerning the
identified order of variables
• Examples:
• There are four different scales of measurement. The data can be defined as
being one of the four scales. The four types of scales are: Nominal, ordinal,
interval and ratio.
Nominal
• Nominal data usually deals with non-numeric variables or with numbers that
carry no quantitative value. While developing statistical models, nominal data
are usually transformed before building the model.
3. The numbers don't define the object characteristics. The only permissible
aspect of numbers in the nominal scale is "counting".
• Example:
Interval
a) The interval data is quantitative as it can quantify the difference between the
values.
c) To understand the difference between the variables, you can subtract the
values between the variables.
d) The interval scale is the preferred scale in statistics as it helps to assign
numerical values to arbitrary assessments such as feelings, calendar types, etc.
• Examples:
1. Celsius temperature
2. Fahrenheit temperature
Ratio
• Any variable for which the ratios can be computed and are meaningful is
called ratio data.
d) Ratio data has unique and useful properties. One such feature is that it allows
unit conversions, such as kilocalories to calories or kilograms to grams.
Types of Variables
Discrete variables:
• The word discrete means countable. For example, the number of students in a
class is countable or discrete. The value could be 2, 24, 34 or 135 students, but
it cannot be 23/32 or 12.23 students.
• The number of pages in a book is a discrete variable. Discrete data can only
take on certain individual values.
Continuous variables:
• A continuous variable is a variable which can take all values within a given
interval or range. A continuous variable consists of numbers whose values, at
least in theory, have no restrictions.
• Continuous data can take on any value in a certain range. Length of a file is a
continuous variable.
• For example, 2, 4, 9 are exact numbers as they do not need any approximation.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
1. Independent variables
• The independent variable is the one that the researcher intentionally changes
or controls.
2. Dependent variables
• The dependent variable is the factor that the research measures. It changes in
response to the independent variable or depends upon it.
Observational Study
• These studies are often qualitative in nature and can be used for both
exploratory and explanatory research purposes. While quantitative observational
studies exist, they are less common.
• Observational studies are generally used in hard science, medical and social
science fields. This is often due to ethical or practical concerns that prevent the
researcher from conducting a traditional experiment. However, the lack of
control and treatment groups means that forming inferences is difficult and
there is a risk of confounding variables affecting the analysis.
Confounding Variable
• Confounding variables are those that affect other variables in a way that
produces spurious or distorted associations between two variables. They
confound the "true" relationship between two variables. Confounding refers to
differences in outcomes that occur because of differences in the baseline risks of
the comparison groups.
• A difference between groups might be due not to the independent variable but
to a confounding variable.
• Consider, for example, a study whose objective is to show that alcohol
drinkers have more heart disease than non-drinkers. The result can be
influenced by another factor: alcohol drinkers might also smoke more cigarettes
than non-drinkers. Cigarette smoking then acts as a confounding variable when
studying the association between drinking alcohol and heart disease.
• For example, suppose a researcher collects data on ice cream sales and shark
attacks and finds that the two variables are highly correlated. Does this mean
that increased ice cream sales cause more shark attacks? That's unlikely. The
more likely cause is the confounding variable temperature. When it is warmer
outside, more people buy ice cream and more people go in the ocean.
• In order to find the frequency distribution of quantitative data, we can use the
following table that gives information about "the number of smartphones owned
per family."
• For such quantitative data, it is quite straightforward to make a frequency
distribution table. Families own either 1, 2, 3, 4 or 5 smartphones, so all we
need to do is find the frequency of 1, 2, 3, 4 and 5. Arranging this information
in table format gives the frequency table for quantitative data.
• When observations are sorted into classes of single values, the result is
referred to as a frequency distribution for ungrouped data. It is the
representation of ungrouped data and is typically used when we have a smaller
data set.
1. Grouped data:
• Grouped data refers to the data which is bundled together in different classes
or categories.
• Data are grouped when the variable stretches over a wide range and there are a
large number of observations, so that listing every distinct value in order is
impractical and time consuming. Hence, the values are sorted into groups
called class intervals and a frequency is recorded for each interval.
• Suppose we conduct a survey in which we ask 15 families how many pets they
have in their home. The results are as follows:
1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8
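• The tallying for this survey can be sketched in a few lines of Python. This is
only an illustration: the class width of 2 used for the grouped version is an
assumption and is not taken from the notes.

```python
from collections import Counter

# Pet counts reported by the 15 families in the survey above.
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

# Ungrouped frequency distribution: one class per distinct value.
frequency = Counter(pets)
for value in sorted(frequency):
    print(f"{value} pets: {frequency[value]} families")

# Grouped frequency distribution with an assumed class width of 2.
width = 2
lowest = min(pets)
grouped = Counter()
for value in pets:
    lower = lowest + ((value - lowest) // width) * width
    grouped[(lower, lower + width - 1)] += 1

for (lower, upper), count in sorted(grouped.items()):
    print(f"{lower}-{upper} pets: {count} families")
```

Every observation falls into exactly one class, so both sets of frequencies add
up to 15.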
2. Classes should be set up so that they do not overlap and so that each piece of
data belongs to exactly one class.
• The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled
boundary.
• Table 2.3.4 gives a frequency distribution of the IQ test scores for 75 adults.
• IQ score is a quantitative variable and according to Table 2.3.4, eight of the
individuals have an IQ score between 80 and 94, fourteen have scores between
95 and 109, twenty-four have scores between 110 and 124, sixteen have scores
between 125 and 139 and thirteen have scores between 140 and 154.
• If the lower class limit for the second class, 95, is added to the upper class
limit for the first class, 94, and the sum divided by 2, the result, 94.5, is both
the upper boundary for the first class and the lower boundary for the second
class. Table 2.3.5 gives all the boundaries for Table 2.3.4.
• If the lower class limit is added to the upper class limit for any class and the
sum divided by 2, the class mark for that class is obtained. The class mark for a
class is the midpoint of the class and is sometimes called the class midpoint
rather than the class mark.
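• As a rough sketch, the boundaries and class marks for the IQ classes described
above can be computed as follows. The class limits and frequencies are the ones
quoted in the text; the unit of measurement is assumed to be 1 because the
scores are whole numbers.

```python
# (lower limit, upper limit, frequency) for each IQ class quoted in the notes.
classes = [(80, 94, 8), (95, 109, 14), (110, 124, 24), (125, 139, 16), (140, 154, 13)]
unit = 1  # assumption: scores are recorded to the nearest whole number

rows = []
for lower, upper, freq in classes:
    lower_boundary = lower - unit / 2   # half a unit below the lower limit
    upper_boundary = upper + unit / 2   # half a unit above the upper limit
    class_mark = (lower + upper) / 2    # midpoint of the class
    rows.append((lower_boundary, upper_boundary, class_mark, freq))

for row in rows:
    print(row)
```

The upper boundary of each class equals the lower boundary of the next one, so
the boundaries leave no gaps between adjacent classes.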
Example 2.3.1: Following table gives the frequency distribution for the
cholesterol values of 45 patients in a cardiac rehabilitation study. Give the
lower and upper class limits and boundaries as well as the class marks for
each class.
• Solution: Below table gives the limits, boundaries and marks for the classes.
(123 - 69) / 10 = 54 / 10 = 5.4 ≈ 5
b) Real limits for the lowest class interval in this frequency distribution =
64.5 - 69.5.
Example 2.3.3: Given below are the weekly pocket expenses (in Rupees) of
a group of 25 students selected at random.
37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38, 49,
45, 44, 37, 36
Solution:
• In the given data, the smallest value is 26 and the largest value is 49. So, the
range of the weekly pocket expenses = 49-26=23.
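• A minimal sketch of how the range and a grouped frequency distribution could
be obtained for these 25 values. The starting point 25 and the class width 5
are assumptions chosen for illustration; the notes do not state which classes
the worked solution uses.

```python
# Weekly pocket expenses (in Rupees) from Example 2.3.3.
expenses = [37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49,
            27, 37, 33, 38, 49, 45, 44, 37, 36]

data_range = max(expenses) - min(expenses)
print("Range:", data_range)

# Assumed classes: 25-29, 30-34, ..., 45-49 (width 5, non-overlapping).
start, width = 25, 5
counts = {}
for lower in range(start, max(expenses) + 1, width):
    upper = lower + width - 1
    counts[(lower, upper)] = sum(lower <= x <= upper for x in expenses)

for (lower, upper), count in counts.items():
    print(f"{lower}-{upper}: {count}")
```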
Outliers
• An outlier is a value that departs from the normal pattern of the data and can
cause anomalies in the results obtained through algorithms and analytical
systems. Therefore, outliers always need some degree of attention.
• Understanding the outliers is critical in analyzing data for at least two aspects:
• The simplest way to find outliers in data is to look directly at the data table,
the dataset, as data scientists call it. The case of the following table clearly
exemplifies a typing error, that is, input of the data.
• The age field for the individual Antony Smith certainly does not represent a
true age of 470 years. Looking at the table it is possible to identify the outlier,
but it is difficult to say which would be the correct age. There are several
possibilities for the right age, such as 47, 70 or even 40 years.
Relative and Cumulative Frequency Distribution
• A relative frequency distribution lists the data values along with the percent
of all observations belonging to each group. These relative frequencies are
calculated by dividing the frequencies for each group by the total number of
observations.
• Example: Suppose we take a sample of 200 Indian families and record the
number of people living in each household. We obtain the following:
Cumulative frequency:
• A cumulative frequency distribution can be useful for ordered data (e.g. data
arranged in intervals, measurement data, etc.). Instead of reporting frequencies,
the recorded values are the sum of all frequencies for values less than and
including the current value.
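• Both kinds of distribution can be sketched from an ordinary frequency table.
The example below reuses the pet survey of 15 families from earlier in these
notes, because the household table for the 200 families is not reproduced here.

```python
from collections import Counter
from itertools import accumulate

# Pet counts from the survey of 15 families earlier in the notes.
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

frequency = Counter(pets)
values = sorted(frequency)
total = len(pets)

# Relative frequency: frequency of each value divided by the total count.
relative = [frequency[v] / total for v in values]

# Cumulative frequency: running sum of frequencies up to and including v.
cumulative = list(accumulate(frequency[v] for v in values))

for v, rel, cum in zip(values, relative, cumulative):
    print(f"{v}: relative = {rel:.3f}, cumulative = {cum}")
```

The relative frequencies sum to 1 and the last cumulative frequency equals the
total number of observations.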
• Example: Suppose we take a sample of 200 Indian families and record the
number of people living in each household. We obtain the following:
1. Histogram
• Here the data values only take on integer values, but we still split the range of
values into intervals. In this case, the intervals are [1,2), [2,3), [3,4), etc. Notice
that this graph is also close to being bell-shaped. A symmetric, bell-shaped
distribution is called a normal distribution.
• Notice that all the rectangles are adjacent and they have no gaps between them
unlike a bar graph.
• If we had used the percentage to make the histogram, we would call the graph
a percentage histogram.
2. Frequency polygon
• Frequency polygons are a graphical device for understanding the shapes of
distributions. They serve the same purpose as histograms, but are especially
helpful for comparing sets of data. Frequency polygons are also a good choice
for displaying cumulative frequency distributions.
• We can say that frequency polygon depicts the shapes and trends of data. It
can be drawn with or without a histogram.
• Suppose we are given frequency and bins of the ages from another survey as
shown in Table 2.4.1.
• The midpoints will be used for the position on the horizontal axis and the
frequency for the vertical axis. From Table 2.4.1 we can then create the
frequency polygon as shown in Fig. 2.4.2.
(i) What is the frequency of the class interval whose class mark is 15?
• Solution:
• Stem and leaf diagrams allow us to display raw data visually. Each raw score is
divided into a stem and a leaf. The leaf is typically the last digit of the raw
value. The stem is the remaining digits of the raw value.
• Data points are split into a leaf (usually the ones digit) and a stem (the other
digits)
• To generate a stem and leaf diagram, first create a vertical column that
contains all of the stems. Then list each leaf next to the corresponding stem. In
these diagrams, all of the scores are represented in the diagram without the loss
of any information.
• A stem-and-leaf plot retains the original data. The leaves are usually the last
digit in each data value and the stems are the remaining digits.
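• A small sketch of this construction, assuming non-negative whole-number data
with the ones digit as the leaf. The data are the weekly pocket expenses from
Example 2.3.3, used here only for illustration; the function name is
hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} with leaves in ascending order."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # leaf = ones digit, stem = the rest
        plot[stem].append(leaf)
    return dict(plot)


expenses = [37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49,
            27, 37, 33, 38, 49, 45, 44, 37, 36]

plot = stem_and_leaf(expenses)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Every original value can be read back from the plot by joining a stem with one
of its leaves, which is why no information is lost.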
• Create a stem-and-leaf plot of the following test scores from a group of college
freshmen.
• There are a couple of graphs that are appropriate for qualitative data that has
no natural ordering.
1. Bar graphs
• Bar Graphs are like histograms, but the horizontal axis has the name of each
category and there are spaces between the bars.
• Usually, the bars are ordered with the categories in alphabetical order. One
variant of a bar graph is called a Pareto Chart. These are bar graphs with the
categories ordered by frequency, from largest to smallest.
• In bar graph, bars are used to represent the amount of data in each category;
one axis displays the categories of qualitative data and the other axis displays
the frequencies.
Misleading Graph
• It is a well known fact that statistics can be misleading. They are often used to
prove a point and can easily be twisted in favour of that point.
• Good graphs are extremely powerful tools for displaying large quantities of
complex data; they help turn the reams of information available today into
knowledge. But, unfortunately, some graphs deceive or mislead.
• This may happen because the designer chooses to give readers the impression
of better performance or results than is actually the situation. In other cases, the
person who prepares the graph may want to be accurate and honest, but may
mislead the reader by a poor choice of a graph form or poor graph construction.
1. Title
2. Labels on both axes of a line or bar chart and on all sections of a pie chart
4. Key to a pictograph
• A graph can be altered by changing the scale of the graph. For example, data
in the two graphs of Fig. 2.6.1 are identical, but scaling of the Y-axis changes
the impression of the magnitude of differences.
Example 2.6.1: Construct a frequency distribution for the number of
different residences occupied by graduating seniors during their college
career, namely: 1, 4, 2, 3, 3, 1, 6, 7, 4, 3, 3, 9, 2, 4, 2, 2, 3, 2, 3, 4, 4, 2, 3, 3, 5.
What is the shape of this distribution?
Solution:
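• As a quick check on this example, the tally can be reproduced with a short
script. This is only a sketch of the counting step, not the worked solution from
the notes.

```python
from collections import Counter

# Number of residences occupied by each of the 25 graduating seniors.
residences = [1, 4, 2, 3, 3, 1, 6, 7, 4, 3, 3, 9, 2, 4, 2, 2, 3, 2, 3, 4,
              4, 2, 3, 3, 5]

counts = Counter(residences)
for value in range(min(residences), max(residences) + 1):
    # Counter returns 0 for values that never occur (here, 8).
    print(f"{value}: {counts[value]}")
```

Most observations fall between 2 and 4, with a few large values trailing off
towards 9, so the distribution is positively skewed.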
• Averages consist of numbers (or words) about which the data are, in some
sense, centered. They are often referred to as measures of central tendency. It
is already covered in section 1.12.1.
1. Mean :
• The mean of a data set is the average of all the data values. The sample mean
x̄ is the point estimator of the population mean μ.
2. Median :
• The median of a data set is the value in the middle when the data items are
arranged in ascending order. Whenever a data set has extreme values, the
median is the preferred measure of central location.
• The median is the measure of location most often reported for annual income
and property value data. A few extremely large incomes or property values can
inflate the mean.
Median=19
3. Mode:
• The mode of a data set is the value that occurs with greatest frequency. The
greatest frequency can occur at two or more different values. If the data have
exactly two modes, the data are bimodal. If the data have more than two modes,
the data are multimodal.
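• The three averages can be illustrated with Python's standard `statistics`
module. The retirement ages are the ones used in Example 2.8.2 of these notes;
the small bimodal list is a hypothetical data set added for illustration.

```python
from statistics import mean, median, multimode

# Retirement ages from Example 2.8.2.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

mean_age = mean(ages)      # sum of the values divided by their count
median_age = median(ages)  # middle value of the sorted data
modes = multimode(ages)    # every value that has the highest frequency

print(round(mean_age, 2), median_age, modes)

# Hypothetical bimodal data: two values share the greatest frequency.
bimodal_modes = multimode([1, 1, 2, 2, 3])
print(bimodal_modes)
```

The low value 45 pulls the mean below the median, which shows why the median is
preferred when a data set has extreme values.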
• Trimmed mean: A major problem with the mean is its sensitivity to extreme
(e.g., outlier) values. Even a small number of extreme values can corrupt the
mean. The trimmed mean is the mean obtained after cutting off values at the
high and low extremes.
• For example, we can sort the values and remove the top and bottom 2 %
before computing the mean. We should avoid trimming too large a portion
(such as 20 %) at both ends as this can result in the loss of valuable information.
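• A minimal sketch of a trimmed mean, assuming that the same proportion of
values is removed from each end after sorting. The function name and the data
are hypothetical and serve only to illustrate the idea.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value (400).
data = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.10)  # drop the lowest and highest 10 %
print(plain, trimmed)
```

Here a single extreme value roughly doubles the ordinary mean, while the 10 %
trimmed mean stays close to the bulk of the data.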
• Holistic measure is a measure that must be computed on the entire data set as
a whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset.
Describing Variability
• Central tendency describes the central point of the distribution and variability
describes how the scores are scattered around that central point. Together,
central tendency and variability are the two primary values that are used to
describe a distribution of scores.
• Variability can be measured with the range, the interquartile range and the
standard deviation/variance. In each case, variability is determined by
measuring distance.
Range
• The range is the total distance covered by the distribution, from the highest
score to the lowest score (using the upper and lower real limits of the range).
Merits :
a) It is easier to compute.
Variance
• Variance is the expected value of the squared deviation of a random variable
from its mean. In short, it is the measurement of the distance of a set of random
numbers from their collective average value. Variance is used in statistics as a
way of better understanding a data set's distribution.
σ² = Σ (x - μ)² / N
• In the formula above, μ represents the mean of the data points, x is the value
of an individual data point and N is the total number of data points.
• Data scientists often use variance to better understand the distribution of a data
set. Machine learning uses variance calculations to make generalizations about a
data set, aiding in a neural network's understanding of data distribution.
Variance is often used in conjunction with probability distributions.
Standard Deviation
• Standard deviation is simply the square root of the variance. Standard
deviation measures the standard distance between a score and the mean.
Standard deviation=√Variance
• The standard deviation is a measure of how the values in data differ from one
another or how spread out data is. There are two types of variance and standard
deviation in terms of sample and population.
• The standard deviation measures how far apart the data points in observations
are from each other. We can calculate it by subtracting the mean from each data
point, squaring the differences and finding the mean of the squared differences;
this is called the variance. The square root of the variance gives us the
standard deviation.
b) The center of the distribution (the mean) changes, but the standard deviation
remains the same.
c) If each score is multiplied by a constant, the standard deviation will be
multiplied by the same constant.
• If we are given numerical values for the mean and the standard deviation, we
should be able to construct a visual image (or a sketch) of the distribution of
scores. As a general rule, about 70% of the scores will be within one standard
deviation of the mean and about 95% of the scores will be within a distance of
two standard deviations of the mean.
• Standard deviation distances always originate from the mean and are
expressed as positive deviations above the mean or negative deviations below
the mean.
SS = Σ (X - X̄)²
SS = ΣX² - (ΣX)² / n
Example 2.8.1: The heights of animals are: 600 mm, 470 mm, 170 mm, 430
mm and 300 mm. Find out the mean, the variance and the standard
deviation.
Solution:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
Variance = 21704 mm²
Standard deviation = √21704 = 147.32 ≈ 147 mm
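• The arithmetic of this example can be checked with a short sketch that uses
both sum-of-squares formulas given above. It treats the five heights as a whole
population, so it divides by n; this matches the variance quoted in the example.

```python
import math

# Heights in mm from Example 2.8.1.
heights = [600, 470, 170, 430, 300]
n = len(heights)

mean = sum(heights) / n

# Definitional formula: SS = sum of squared deviations from the mean.
ss_definitional = sum((x - mean) ** 2 for x in heights)

# Computational formula: SS = sum(x^2) - (sum(x))^2 / n.
ss_computational = sum(x * x for x in heights) - sum(heights) ** 2 / n

variance = ss_definitional / n  # population variance (divide by n)
std_dev = math.sqrt(variance)

print(mean, ss_definitional, variance, round(std_dev, 2))
```

Both formulas give the same sum of squares, 108520, and the square root of the
variance 21704 is about 147.32.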
• The interquartile range is the distance covered by the middle 50% of the
distribution (the difference between Q1 and Q3).
Example 2.8.2: Determine the values of the range and the IQR for the
following sets of data.
(a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
Solution:
a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
Range = 70 - 45 = 25
IQR:
Ordered data: 45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70
Median = 63
Q1 = 60, Q3 = 65
IQR = Q3 - Q1 = 65 - 60 = 5
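• A sketch of the same calculation, assuming the "median of each half" method
for quartiles, which is what the worked solution appears to use. Other
conventions for quartiles exist and can give slightly different values. The
function name is hypothetical.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

data_range = max(ages) - min(ages)
q1, med, q3 = quartiles_by_halves(ages)
iqr = q3 - q1
print(data_range, q1, med, q3, iqr)
```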
z Scores
z = (X - μ) / σ
a) Positive or negative sign indicating whether it's above or below the mean;
and
b) Number indicating the size of its deviation from the mean in standard
deviation units
(b) And enables us to compare two scores that are from different samples
(which may have different means and standard deviations).
• Using the z-score technique, one can now compare two different test results
based on relative performance, not individual grading scale.
Example 2.9.1: A class of 50 students wrote the science test last week. A
student, Rakshita, scored 93 in the test while the average score of the
class was 68. Determine the z-score for Rakshita's test mark if the standard
deviation is 13.
Solution: Given,
Rakshita's test score, X = 93, Mean (μ) = 68, Standard deviation (σ) = 13. The
z-score for Rakshita's test score can be calculated using the formula as,
z = (X - μ) / σ = (93 - 68) / 13 = 25 / 13 ≈ 1.923
(b) A score of 470 on the SAT math test, given a mean of 500 and a
standard deviation of 100.
Solution :
Given, Margaret's IQ (X) = 135, Mean (μ) = 100, Standard deviation (σ) = 15
z = (X - μ) / σ = (135 - 100) / 15 = 35 / 15 ≈ 2.33
b) A score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100
Given,
Score (X) = 470, Mean (μ) = 500, Standard deviation (σ) = 100
z = (X - μ) / σ = (470 - 500) / 100 = -0.30
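• The z-score calculations in these examples can be sketched as a single helper
function. The function name is hypothetical; the numbers are the ones given in
the examples above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations by which x lies above or below the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


z_test = z_score(93, 68, 13)    # Rakshita's science test score
z_iq = z_score(135, 100, 15)    # Margaret's IQ score
z_sat = z_score(470, 500, 100)  # SAT math score

print(round(z_test, 3), round(z_iq, 2), round(z_sat, 2))
```

The sign shows the direction: the first two scores lie above their means, while
the SAT score lies 0.3 standard deviations below its mean.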
• Although there is an infinite number of different normal curves, each with its
own mean and standard deviation, there is only one standard normal curve, with
a mean of 0 and a standard deviation of 1.