lOMoARcPSD|55371821
DSC-UNIT-3 - CSE
Computer Science and Engineering (Swarna Bharathi Institute of Science and
Technology)
Studocu is not sponsored or endorsed by any college or university
Downloaded by AISWARYA J (aiswaryaj@skasc.ac.in)
UNIT-III
Data analysis: Introduction, Terminology and concepts, Introduction to statistics, Central tendencies
and distributions, Variance, Distribution properties and arithmetic, Samples/CLT, Basic machine
learning algorithms, Linear regression, SVM, Naive Bayes.
1. a) What is the role of statistics in Data Analysis?
b) Describe central tendencies and distributions.
2. a) How does the shape of a distribution influence the Measures of Central Tendency? Explain.
b) Explain briefly about SVM.
3. What is the importance of Machine learning in Data Science? Explain with an example.
4. a) Write about sampling distribution.
b) Discuss the basic machine learning algorithms.
_______________________________________________________________________________________
Explain Central tendencies and various distribution techniques
Central tendency is a central (typical) value of a probability distribution or dataset. There are
three main measures of central tendency: the mode, the median and the mean.
Mean: The mean is the "average" value of the dataset.
Mean = Sum of all data values (s) / Total number of data values (n)
Median: The middle value of the sorted dataset is called the median.
Step 1. The dataset is arranged in either increasing or decreasing order.
Step 2. If the dataset has an odd number of data values (n is odd), the value at
position (n + 1)/2 is the median of the dataset.
Step 3. If the dataset has an even number of data values (n is even), the average
of the two middle values (positions n/2 and n/2 + 1) is the median.
Mode: The most frequently occurring value in the dataset is called mode.
For example, consider the weights (in kg) of 5 children: 36, 40, 32, 42, 30.
Mean = (36 + 40 + 32 + 42 + 30)/5 = 180/5 = 36 kg
Median: Arrange the data in ascending order: 30, 32, 36, 40, 42
The middle value is 36. So, median = 36 kg.
Mode: Every value occurs exactly once, so this dataset has no unique mode.
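The three measures can be checked with a short Python sketch. It uses the standard-library statistics module on the five weights from the example; the variable names are illustrative only. Since each weight appears exactly once, multimode reports all five values as tied for most frequent.

```python
import statistics

# Weights (in kg) of the 5 children from the example above.
weights_kg = [36, 40, 32, 42, 30]

mean_kg = statistics.mean(weights_kg)        # sum of values / number of values
median_kg = statistics.median(weights_kg)    # middle value of the sorted data
modes_kg = statistics.multimode(weights_kg)  # all values tied for highest frequency

print("mean:", mean_kg)
print("median:", median_kg)
print("modes:", modes_kg)
```

If one value were repeated (for example a second child weighing 36 kg), multimode would return only that value.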
Discuss variance and standard deviation:
Variance and Standard Deviation are two important measurements in statistics.
Variance: Variance is a measure of how far data points vary from the mean. Variance is
the average of the squared deviations from the mean. It is denoted as 'σ²'.
Properties of Variance
It is always non-negative, since each deviation is squared and therefore the
result is either positive or zero.
Variance always has squared units. For example, the variance of a set of weights
measured in kilograms will be given in kg squared.
Standard Deviation
Standard deviation is a measure of the spread (dispersion) of statistical data. Standard
Deviation is the square root of the variance. Standard deviation is denoted by the
symbol 'σ'.
Properties of Standard Deviation
It is the square root of the mean of the squared deviations of all values from the
mean, and is also called the root-mean-square deviation.
The smallest value of the standard deviation is 0 since it cannot be negative.
When the data values of a group are similar, the standard deviation will be
very low or close to zero. But when the data values differ widely from each other, the
standard deviation is high or far from zero.
Example: Consider two cricket players, Pant and Kartik, one of whom has to be selected
for the cricket world cup. The scores of both players in the last five T-20
matches are as follows:
Kartik    Pant
23        34
28        85
45        02
59        15
63        77
Answer: We find the SD of each player's scores; the one with the lesser value of SD is
the more consistent.
Case 1: Kartik
Mean = (23 + 28 + 45 + 59 + 63) / 5 = 218 / 5 = 43.6
Runs (xi)    Squared Deviation (xi - mean)²
23           (23 - 43.6)² = 424.36
28           (28 - 43.6)² = 243.36
45           (45 - 43.6)² = 1.96
59           (59 - 43.6)² = 237.16
63           (63 - 43.6)² = 376.36
Sum of Squared Deviations = 1283.2
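The notes stop after the sum of squared deviations for Kartik. The remaining steps can be sketched in Python as follows. This sketch assumes the population form of the standard deviation (divisor n), which the notes do not state; the function and variable names are illustrative. Using the sample form (divisor n - 1) would give larger values but the same ranking, since both players have five scores.

```python
import math

# Scores from the table above.
scores = {
    "Kartik": [23, 28, 45, 59, 63],
    "Pant": [34, 85, 2, 15, 77],
}

def population_sd(values):
    """Square root of the mean squared deviation from the mean (divisor n, an assumption)."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / len(values))

sd_by_player = {name: population_sd(runs) for name, runs in scores.items()}

# The player with the smaller SD is the more consistent one.
more_consistent = min(sd_by_player, key=sd_by_player.get)

for name, sd in sd_by_player.items():
    print(f"{name}: SD = {sd:.2f}")
print("More consistent:", more_consistent)
```

Under this assumption the SD is about 16.02 for Kartik and about 33.06 for Pant, so Kartik is the more consistent player.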
Explain the types of statistical analysis in data science:
Statistical analysis is done on data sets, and the analysis process can create different output
types from the input data. For example, the process can produce output data from the input,
present characteristics of the input data, or test a null hypothesis. The output type and format
vary with the type of analysis.
The types of statistical analysis:
Descriptive statistics: It refers to collecting, organizing, analyzing, and summarizing
data sets in an understandable format, like charts, graphs, and tables. It condenses a large
data set and reduces its complexity, to help analysts understand it.
Inferential statistics: Inferential statistics is used to draw conclusions about a large
population. It is based on the analysis and findings produced for sample data taken from that
population, which makes the process cost-efficient and time-efficient.
Predictive analysis: This analysis is used to predict future events based on past and
present data. It uses machine learning tools, data mining, big data, predictive modeling,
artificial intelligence, and simulations.
Prescriptive analysis: This analysis is used to find the best possible outcome based on the
data. It supports efficient decision-making.
Exploratory data analysis (EDA): This method studies data sets to highlight
their major features, frequently using statistical graphics and other data visualization
methods.
Causal analysis: It focuses on cause and effect. In simple terms, it looks for the
reasons behind an outcome; based on data analysis, it helps in understanding why
something didn't work out, such as failures in business and professional activities.
Sample/CLT:
A sample is a subset of elements selected from the whole population.
Total number of elements in the population, Population Size = N
Mean of the population, Population Mean (μ) = (Σ Xi) / N
Variance of the population, Population Variance (σ²) = Σ (Xi - μ)² / N
Sample Size = n
Mean of the sample, Sample Mean (x̄) = (Σ xi) / n
Variance of the sample, Sample Variance (S²) = Σ (xi - x̄)² / (n - 1)
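The only difference between the two variance formulas is the divisor: N for a population, n - 1 for a sample. A minimal Python sketch makes this concrete. It treats the five weights from the earlier example as a whole population and uses a hypothetical sample of three of them; the helper name variance_with_divisor is also illustrative. The results are cross-checked against the standard-library functions pvariance and variance.

```python
import statistics

population = [36, 40, 32, 42, 30]  # treated here as the whole population (N = 5)
sample = [36, 40, 30]              # hypothetical sample of size n = 3

def variance_with_divisor(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    centre = sum(values) / len(values)
    return sum((x - centre) ** 2 for x in values) / divisor

population_variance = variance_with_divisor(population, len(population))  # divide by N
sample_variance = variance_with_divisor(sample, len(sample) - 1)          # divide by n - 1

print("population variance:", population_variance)
print("sample variance:", sample_variance)
print("statistics.pvariance:", statistics.pvariance(population))
print("statistics.variance:", statistics.variance(sample))
```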
Sampling Distribution:
A sampling distribution is the probability distribution of a statistic (such as the mean)
obtained from many samples drawn from a population.
The sampling distribution's mean is denoted by μ_x̄.
μ_x̄ = (Sum of all the sample means)/(Total number of samples)
Sampling distribution's standard deviation (Standard error) = σ/√n
One way of drawing a sample is simple random sampling, where we have a complete list of
the elements of the population and we select elements randomly, so that there is
the same probability of obtaining each element of the population in the sample.
Another option is stratified sampling. Here we know something about our population:
we know that it consists of several homogeneous groups (strata) that should be represented
in our sample.
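The two sampling schemes can be sketched in Python as follows. The population, its two strata "A" and "B", and the function names are all hypothetical, chosen only to illustrate the idea. The stratified version samples each stratum separately in proportion to its size.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 80 records in stratum "A" and 20 in stratum "B".
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]

def simple_random_sample(items, n):
    """Every element has the same chance of being selected."""
    return rng.sample(items, n)

def stratified_sample(items, n):
    """Sample each stratum separately, in proportion to its share of the population."""
    strata = {}
    for item in items:
        strata.setdefault(item[0], []).append(item)
    chosen = []
    for members in strata.values():
        # Rounding may make the total differ slightly from n for awkward proportions.
        share = round(n * len(members) / len(items))
        chosen.extend(rng.sample(members, share))
    return chosen

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, 10)

print("simple random:", srs)
print("stratified:", strat)
```

With simple random sampling the number of "B" records in the sample varies from draw to draw, whereas the stratified sample here always contains 8 "A" records and 2 "B" records.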
Central Limit Theorem:
The Central Limit Theorem (CLT) states that for data from any distribution with finite
variance, provided a sufficiently large number of samples has been taken, the following
properties hold:
1. Sampling Distribution Mean (μ_x̄) = Population Mean (μ)
2. Sampling distribution's standard deviation (Standard error) = σ/√n ≈ S/√n
3. For large sample sizes (n > 30 is a common rule of thumb), the sampling distribution of
the mean is approximately a normal distribution.
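The first two properties can be checked empirically with a small simulation. The sketch below is only illustrative: the exponential population, the seed, the sample size n = 50 and the number of samples are all assumptions chosen for the demonstration. It draws many samples from a clearly non-normal population and compares the mean and standard deviation of the sample means with μ and σ/√n.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: exponential with mean 2.
population = [rng.expovariate(0.5) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50               # size of each sample (above the n > 30 rule of thumb)
num_samples = 2_000  # number of samples drawn

# Mean of each sample; choices() draws with replacement.
sample_means = [
    statistics.fmean(rng.choices(population, k=n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # should be close to mu
observed_se = statistics.pstdev(sample_means)   # should be close to sigma / sqrt(n)
predicted_se = sigma / math.sqrt(n)

print(f"population mean      = {mu:.3f}")
print(f"mean of sample means = {mean_of_means:.3f}")
print(f"observed SE          = {observed_se:.3f}")
print(f"predicted SE         = {predicted_se:.3f}")
```

A histogram of sample_means would look roughly bell-shaped even though the population itself is strongly skewed.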
Normal distribution:
In the standard normal distribution, the mean of the data is 0 and the standard
deviation is 1.
It forms a bell-shaped curve.
It tells us that most of the data lies close to the mean, and values become less
frequent as they move away from the mean.
The two major parameters are the mean & the standard deviation.
The mean, median & mode of such a distribution are equal.
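These properties can be illustrated with NormalDist from Python's standard-library statistics module. This is a small sketch; the second distribution, with mean 36 and standard deviation 5, is a hypothetical example of a normal distribution that is not the standard one.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Mean, median and mode coincide for a normal distribution.
print(standard_normal.mean, standard_normal.median, standard_normal.mode)

# Share of the data within 1 and 2 standard deviations of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(f"within 1 SD: {within_one_sd:.4f}")
print(f"within 2 SD: {within_two_sd:.4f}")

# Hypothetical non-standard normal distribution: same shape, different parameters.
weights = NormalDist(mu=36, sigma=5)
within_one_sd_weights = weights.cdf(41) - weights.cdf(31)
print(f"within 1 SD (mu=36, sigma=5): {within_one_sd_weights:.4f}")
```

Roughly 68% of the values fall within one standard deviation of the mean and roughly 95% within two, whatever the mean and standard deviation are.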
Basic machine learning algorithms: Refer to ML
Linear regression: Refer to ML
SVM: Refer to ML
Naive Bayes: Refer to ML