DSC Unit 3 Cse

This document covers Unit-3 of a Computer Science and Engineering course, focusing on data analysis, statistics, and machine learning concepts. It explains central tendencies, variance, standard deviation, sampling distributions, and the Central Limit Theorem, along with basic machine learning algorithms such as linear regression, SVM, and Naive Bayes. Additionally, it discusses various types of statistical analysis including descriptive, inferential, predictive, prescriptive, exploratory, and causal analysis.

Uploaded by

aiswaryaj

DSC-UNIT-3 - CSE

Computer Science and Engineering (Swarna Bharathi Institute of Science and Technology)



Downloaded by AISWARYA J (aiswaryaj@skasc.ac.in)

UNIT-III
Data analysis: Introduction, Terminology and concepts, Introduction to statistics, Central tendencies
and distributions, Variance, Distribution properties and arithmetic, Samples/CLT, Basic machine
learning algorithms, Linear regression, SVM, Naive Bayes.

1. a) What is the role of statistics in data analysis?
b) Describe central tendencies and distributions.
2. a) How does the shape of a distribution influence the measures of central tendency? Explain.
b) Explain briefly about SVM.
3. What is the importance of machine learning in data science? Explain with an example.
4. a) Write about sampling distributions.
b) Discuss the basic machine learning algorithms.
_______________________________________________________________________________________

Explain Central tendencies and various distribution techniques

Central tendency describes the central or typical value of a probability distribution. There are three main
measures of central tendency: the mode, the median and the mean.

Mean: Mean is the “Average” value of the dataset.

Mean = Sum of all data values (s)/Total number of data values(n)

Median: The middle value of the sorted dataset is called the median.

Step 1. Arrange the dataset in either increasing or decreasing order.

Step 2. If the dataset has an odd number of data values (n is odd), the value at
position (n + 1)/2 is the median of the dataset.

Step 3. If the dataset has an even number of data values (n is even), the median is
the average of the two middle values, at positions n/2 and n/2 + 1.

Mode: The most frequently occurring value in the dataset is called mode.


For example, suppose the weights (in kg) of 5 children are 36, 40, 32, 42, 30.

Mean = (36 + 40 + 32 + 42 + 30)/5 = 180/5 = 36 kg

Median: Arrange the data in ascending order: 30, 32, 36, 40, 42

The middle value is 36. So, median = 36 kg.

Mode: Every value occurs exactly once, so this dataset has no single mode.
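The three measures can be checked with a short Python sketch. This is only an illustration: the function names `mean`, `median` and `modes` are made up for this example, and the median function follows the three steps listed earlier.

```python
from collections import Counter


def mean(values):
    """Sum of all data values divided by the number of data values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)              # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # Step 2: n is odd, take position (n + 1)/2
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # Step 3: n is even


def modes(values):
    """All values that share the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


weights_kg = [36, 40, 32, 42, 30]
print("Mean:", mean(weights_kg))
print("Median:", median(weights_kg))
print("Most frequent values:", modes(weights_kg))
```

Because no weight is repeated, `modes` returns all five values, which is how the sketch signals that the data has no single mode.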

Discuss variance and standard deviation:

Variance and standard deviation are two important measures of spread in statistics.

Variance: Variance is a measure of how far data points vary from the mean. It is
the average of the squared deviations from the mean and is denoted by σ².

Properties of Variance
• It is always non-negative: each deviation is squared, so the
result is either positive or zero.
• Variance always has squared units. For example, the variance of a set of weights
measured in kilograms is given in kg².

Standard Deviation

Standard deviation is a measure of how spread out the data are around the mean. It is
the square root of the variance and is denoted by the symbol σ.

Properties of Standard Deviation

• It is the square root of the mean of the squared deviations of the values from
their mean, and is also called the root-mean-square deviation.
• The smallest possible value of the standard deviation is 0; it cannot be negative.
• When the data values of a group are similar, the standard deviation is
low or close to zero. When the data values differ widely from each other, the
standard deviation is high.


Example: Suppose there are two cricket players, Pant and Kartik, and you have to select
one for the cricket world cup. The scores of both players in their last five T-20
matches are as follows:

Kartik   Pant
23       34
28       85
45       2
59       15
63       77

Answer: We find the standard deviation (SD) of each player's scores; the player with the
smaller SD is the more consistent.

Case 1: Kartik

Mean = (23 + 28 + 45 + 59 + 63) / 5 = 218 / 5 = 43.6

Runs (xi)    Squared deviation (xi − mean)²
23           (23 − 43.6)² = 424.36
28           (28 − 43.6)² = 243.36
45           (45 − 43.6)² = 1.96
59           (59 − 43.6)² = 237.16
63           (63 − 43.6)² = 376.36

Sum of squared deviations = 1283.2
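The comparison can be completed with a small Python sketch that uses the scores from the table. The notes do not say whether to divide by n or by n − 1, so dividing by n (the population formula) is an assumption here; the helper name `spread` is also made up for this illustration.

```python
import math


def spread(scores):
    """Return (mean, sum of squared deviations, standard deviation) for a list of scores.

    Assumption: the population formula (divide by n) is used.
    """
    n = len(scores)
    mean = sum(scores) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in scores)
    return mean, sum_sq_dev, math.sqrt(sum_sq_dev / n)


kartik = [23, 28, 45, 59, 63]
pant = [34, 85, 2, 15, 77]

k_mean, k_ss, k_sd = spread(kartik)
p_mean, p_ss, p_sd = spread(pant)

print(f"Kartik: mean={k_mean:.1f}, sum of squared deviations={k_ss:.1f}, SD={k_sd:.2f}")
print(f"Pant:   mean={p_mean:.1f}, sum of squared deviations={p_ss:.1f}, SD={p_sd:.2f}")

# The player with the smaller SD is the more consistent one.
more_consistent = "Kartik" if k_sd < p_sd else "Pant"
print("More consistent player:", more_consistent)
```

Kartik's SD comes out at roughly half of Pant's, so Kartik is the more consistent player. Dividing by n − 1 instead changes both values but not their order.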

Explain the types of statistical analysis in data science:

Statistical analysis is done on data sets, and the analysis process can produce different types of
output from the input data. For example, it can derive new data from the input,
summarize the characteristics of the input data, or test a null hypothesis. The type and format
of the output vary from one kind of analysis to another.


The types of statistical analysis:

• Descriptive statistics: This refers to collecting, organizing, analyzing, and summarizing
data sets in an understandable format, such as charts, graphs, and tables. It condenses a large
data set and reduces its complexity, to help analysts understand it.
• Inferential statistics: Inferential statistics is used to draw conclusions about a large population. It is based on the
analysis and findings produced from sample data drawn from that population, which makes the
process cost-efficient and time-efficient.
• Predictive analysis: This analysis is used to predict future events based on past and
present data. It uses machine learning tools, data mining, big data, predictive modeling,
artificial intelligence, and simulations.
• Prescriptive analysis: This analysis is used to find the best possible outcome based on the
data. It supports efficient decision-making.
• Exploratory data analysis (EDA): In statistics, this method studies data sets to highlight
their major features, frequently using statistical graphics and data visualization
methods.
• Causal analysis: This focuses on cause and effect. In simple terms, it looks for the
cause of an outcome and the reasons behind it; based on data analysis, it helps in understanding why
something did not work out, such as failures in business and professional activities.

Sample/CLT:

A subset of elements selected from the whole population is known as a sample.


• Total number of elements in the population, Population Size = N

• Mean of the population, Population Mean (μ) = (Σ Xi)/N

• Variance of the population, Population Variance (σ²) = Σ(Xi − μ)²/N

• Sample Size = n

• Mean of the sample, Sample Mean (x̄) = (Σ xi)/n

• Variance of the sample, Sample Variance (S²) = Σ(xi − x̄)²/(n − 1)
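The only difference between the two variance formulas is the divisor: N for a population, n − 1 for a sample. A minimal Python sketch follows, reusing the five weights from the earlier example purely as illustrative data; the function names are made up, and the results are cross-checked against `pvariance` and `variance` from Python's standard `statistics` module.

```python
import statistics


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [36, 40, 32, 42, 30]   # illustrative data only

pop_var = population_variance(data)
samp_var = sample_variance(data)

print("Treated as a whole population:", pop_var)
print("Treated as a sample:          ", samp_var)

# Cross-check against the standard library.
print(statistics.pvariance(data), statistics.variance(data))
```

The sample variance is always the larger of the two for the same data, because the same sum of squared deviations is divided by a smaller number.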

Sampling Distribution:

A sampling distribution is the probability distribution of a statistic (such as the sample mean)
obtained from many samples drawn from a population.

The mean of the sampling distribution is denoted by μₓ̄.

μₓ̄ = (Sum of all the sample means)/(Total number of samples)

Standard deviation of the sampling distribution (standard error) = σ/√n

One way to draw a sample is simple random sampling, where we have a complete list of the elements
of the population and select elements at random, so that every element of the population has
the same probability of being included in the sample.

Another option is stratified sampling. Here we know something about our population:
it consists of several homogeneous groups (strata) that should each be represented
in our sample.
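The two sampling schemes can be sketched in Python as follows. Everything here is hypothetical: the population of 100 labelled elements, the group labels "A", "B" and "C", the 10% sampling fraction, and the helper names. Proportional allocation (each group contributes the same fraction) is one common way to stratify, assumed here for illustration.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(population, k, rng):
    """Pick k distinct elements; every element is equally likely to be chosen."""
    return rng.sample(population, k)


def stratified_sample(population, key, fraction, rng):
    """Sample the same fraction from each group so that every group is represented."""
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))   # at least one per group
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(0)   # fixed seed so the run is repeatable

# Hypothetical population: 60 elements in group A, 30 in B, 10 in C.
population = (
    [("A", i) for i in range(60)]
    + [("B", i) for i in range(30)]
    + [("C", i) for i in range(10)]
)

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, key=lambda item: item[0], fraction=0.1, rng=rng)

print("Simple random sample, group counts:", Counter(group for group, _ in srs))
print("Stratified sample, group counts:   ", Counter(group for group, _ in strat))
```

A simple random sample of 10 may miss the small group C entirely, while the stratified sample always contains 6, 3 and 1 elements from A, B and C.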

Central Limit Theorem:

The Central Limit Theorem (CLT) states that, for data from any distribution with finite variance,
provided a sufficiently large number of samples has been taken, the following properties hold:

1. Sampling Distribution Mean (μₓ̄) = Population Mean (μ)

2. Standard deviation of the sampling distribution (standard error) = σ/√n ≈ S/√n

3. For n > 30 (a common rule of thumb), the sampling distribution of the mean is approximately a normal distribution.
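The first two properties can be checked with a small simulation. This is a sketch under assumptions, not part of the notes: the population is taken to be exponential with μ = 1 and σ = 1 (deliberately not bell-shaped), each sample has n = 40 values, and 5000 samples are drawn.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
n = 40                    # size of each sample (n > 30)
num_samples = 5000        # number of samples drawn from the population

# rng.expovariate(1.0) draws from an exponential population with mu = 1 and sigma = 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)   # compare with mu = 1
sd_of_means = statistics.pstdev(sample_means)    # compare with sigma / sqrt(n)
theoretical_se = 1.0 / math.sqrt(n)

print(f"Mean of the sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of the sample means:   {sd_of_means:.3f} (sigma/sqrt(n) is {theoretical_se:.3f})")
```

Even though the population itself is strongly skewed, the sample means cluster around μ with a spread close to σ/√n, and a histogram of `sample_means` would look roughly bell-shaped.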


Normal distribution:
• In the standard normal distribution, the mean of the data is
0 and the standard deviation is 1.
• It forms a bell-shaped curve.
• Most of the data lies close to the mean, and values become less frequent the further they are
from the mean.
• The two major parameters are the mean and the standard deviation.
• The mean, median and mode of such a distribution are equal.
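These properties can be explored with `NormalDist` from Python's standard `statistics` module. The sketch below uses the standard normal distribution (mean 0, standard deviation 1); the intervals of one and two standard deviations are chosen only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard = NormalDist(mu=0, sigma=1)

# Mean, median and mode coincide for a normal distribution.
print(standard.mean, standard.median, standard.mode)

# Share of the data lying within 1 and within 2 standard deviations of the mean.
within_one = standard.cdf(1) - standard.cdf(-1)
within_two = standard.cdf(2) - standard.cdf(-2)
print(f"Within 1 SD of the mean: {within_one:.4f}")
print(f"Within 2 SD of the mean: {within_two:.4f}")

# The bell curve is highest at the mean and falls off on both sides.
print(standard.pdf(0) > standard.pdf(1) > standard.pdf(2))
```

About 68% of the values fall within one standard deviation of the mean and about 95% within two, which is what "most of the data lies close to the mean" means in numbers.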

Basic machine learning algorithms: Refer ML

Linear regression: Refer ML

SVM: Refer ML

Naive Bayes: Refer ML
