
Basic Statistics 1


Q1) Identify the Data type for the Following:

Activity                                 Data Type
Number of beatings from Wife             Discrete
Results of rolling a dice                Discrete
Weight of a person                       Continuous
Weight of Gold                           Continuous
Distance between two places              Continuous
Length of a leaf                         Continuous
Dog's weight                             Continuous
Blue Color                               Discrete
Number of kids                           Discrete
Number of tickets in Indian railways     Discrete
Number of times married                  Discrete
Gender (Male or Female)                  Discrete

Q2) Identify the Data type for each of the following, choosing from
Nominal, Ordinal, Interval, Ratio.
Data                             Data Type
Gender                           Nominal
High School Class Ranking        Ordinal
Celsius Temperature              Interval
Weight                           Ratio
Hair Color                       Nominal
Socioeconomic Status             Ordinal
Fahrenheit Temperature           Interval
Height                           Ratio
Type of living accommodation     Nominal
Level of Agreement               Ordinal
IQ (Intelligence Scale)          Interval
Sales Figures                    Ratio
Blood Group                      Nominal
Time Of Day                      Interval
Time on a Clock with Hands       Interval
Number of Children               Ratio
Religious Preference             Nominal
Barometer Pressure               Ratio
SAT Scores                       Interval
Years of Education               Ratio

Q3) Three Coins are tossed, find the probability that two heads and one tail are
obtained?
Sample space: {HHH, HHT, HTH, THH, TTT, TTH, THT, HTT} (8 equally likely outcomes)
Favourable outcomes: HHT, HTH, THH
P (2 Heads & 1 Tail) = 3/8 = 0.375
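The count can be checked by enumerating the sample space. This is a minimal sketch assuming three fair, independent coins; the variable names are illustrative only.

```python
from itertools import product
from fractions import Fraction

# All 2**3 = 8 equally likely outcomes of tossing three fair coins
outcomes = ["".join(toss) for toss in product("HT", repeat=3)]

# Outcomes with exactly two heads (and therefore exactly one tail)
favourable = [o for o in outcomes if o.count("H") == 2]

p_two_heads = Fraction(len(favourable), len(outcomes))
print(favourable, p_two_heads, float(p_two_heads))
```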
Q4) Two Dice are rolled, find the probability that sum is
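a) Equal to 1 = 0 (the smallest possible sum is 2)
b) Less than or equal to 4 = 6/36 = 1/6 (outcomes: (1,1), (1,2), (2,1), (1,3), (2,2), (3,1))
c) Sum is divisible by 2 and 3 = 6/36 = 1/6 (the sum must be 6 or 12: 5 + 1 = 6 outcomes)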
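The three cases can be verified by listing all 36 ordered outcomes of two dice. This is a small sketch assuming two fair six-sided dice; the helper `prob` is a hypothetical name introduced here.

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely ordered outcomes of rolling two fair dice
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on the sum of the two dice."""
    hits = sum(1 for a, b in rolls if event(a + b))
    return Fraction(hits, len(rolls))

p_sum_1 = prob(lambda s: s == 1)                           # a) sum equal to 1
p_sum_le_4 = prob(lambda s: s <= 4)                        # b) sum at most 4
p_div_2_and_3 = prob(lambda s: s % 2 == 0 and s % 3 == 0)  # c) sum divisible by 2 and 3
print(p_sum_1, p_sum_le_4, p_div_2_and_3)
```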
Q5) A bag contains 2 red, 3 green and 2 blue balls. Two balls are drawn at
random. What is the probability that none of the balls drawn is blue?

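Total balls = 2 + 3 + 2 = 7, of which 5 are not blue. The two balls are drawn together (without replacement), so:
Number of ways to draw any 2 balls = C(7,2) = 21
Number of ways to draw 2 non-blue balls = C(5,2) = 10
P (none of the balls is blue) = 10/21 ≈ 0.476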
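A quick check by counting combinations, assuming the two balls are drawn without replacement and every pair of balls is equally likely:

```python
from math import comb
from fractions import Fraction

red, green, blue = 2, 3, 2
total = red + green + blue   # 7 balls in the bag
non_blue = red + green       # 5 balls that are not blue

# favourable pairs (both non-blue) divided by all possible pairs
p_no_blue = Fraction(comb(non_blue, 2), comb(total, 2))
print(p_no_blue, float(p_no_blue))
```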
Q6) Calculate the Expected number of candies for a randomly selected child
Below are the probabilities of count of candies for children (ignoring the nature of
the child-Generalized view)
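CHILD   Candies count   Probability
A       1               0.015
B       4               0.20
C       3               0.65
D       5               0.005
E       6               0.01
F       2               0.120
Expected value = sum of (candies count * probability):
Child A: 1*0.015 = 0.015
Child B: 4*0.20 = 0.80
Child C: 3*0.65 = 1.95
Child D: 5*0.005 = 0.025
Child E: 6*0.01 = 0.06
Child F: 2*0.120 = 0.24
Expected number of candies = 0.015 + 0.80 + 1.95 + 0.025 + 0.06 + 0.24 = 3.09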
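The expected value is the probability-weighted sum of the candy counts. A short sketch using the table from the question (the dictionary layout is just one convenient way to hold it):

```python
from math import isclose

# child -> (candies count, probability), copied from the table in the question
candies = {
    "A": (1, 0.015),
    "B": (4, 0.20),
    "C": (3, 0.65),
    "D": (5, 0.005),
    "E": (6, 0.01),
    "F": (2, 0.120),
}

total_prob = sum(p for _, p in candies.values())         # should be 1
expected = sum(count * p for count, p in candies.values())
print(round(total_prob, 6), round(expected, 6))
```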
Q7) Calculate Mean, Median, Mode, Variance, Standard Deviation, Range &
comment about the values / draw inferences, for the given dataset
- For Points, Score, Weigh
In R: {the summary function shows the mean, median, 1st and 3rd quartile}
The individual statistics can also be computed as below:
> getmode<-function(x){   # creating a function for calculating mode
+ uniquv<-unique(x)
+ uniquv[which.max(tabulate(match(x,uniquv)))]
+}
Details on col. drat
> mean(psw$drat)
[1] 3.596563
> median(psw$drat)
[1] 3.695
> attach(psw)
> var(drat) # Variance
[1] 0.2858814
> sd(drat) # Standard Deviation
[1] 0.5346787
> sqrt(var(drat))
[1] 0.5346787
> range(drat)
[1] 2.76 4.93
> getmode(psw$drat) # mode using function
[1] 3.92

Details on col. Wt
> mean(wt)
[1] 3.21725
> median(wt)
[1] 3.325
> var(wt)
[1] 0.957379
> sd(wt)
[1] 0.9784574
> range(wt)
[1] 1.513 5.424
> getmode(psw$wt)
[1] 3.44

Details on col. qsec
> mean(qsec)
[1] 17.84875
> median(qsec)
[1] 17.71
> var(qsec)
[1] 3.193166
> sd(qsec)
[1] 1.786943
> range(qsec)
[1] 14.5 22.9
> getmode(psw$qsec)
[1] 17.02

> d<-boxplot(drat, horizontal = TRUE)
> d$stats
      [,1]
[1,] 2.760
[2,] 3.080
[3,] 3.695
[4,] 3.920
[5,] 4.930
> d$out
numeric(0)

> w<-boxplot(wt,horizontal =TRUE)
> w$stats
       [,1]
[1,] 1.5130
[2,] 2.5425
[3,] 3.3250
[4,] 3.6500
[5,] 5.2500
> w$out
[1] 5.424 5.345

> q<-boxplot(qsec, horizontal = TRUE)
> q$stats
       [,1]
[1,] 14.500
[2,] 16.885
[3,] 17.710
[4,] 18.900
[5,] 20.220
> q$out
[1] 22.9

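In all 3 columns the mean and median are close, so the data are only slightly skewed and roughly symmetric: drat and wt have a mean just below the median, while qsec has a mean just above the median (slightly positively skewed).
There are only a few outliers: 2 in the Wt col (5.424, 5.345), 1 in qsec (22.9) and none in drat.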

Q8) Calculate Expected Value for the problem below


a) The weights (X) of patients at a clinic (in pounds), are
108, 110, 123, 134, 135, 145, 167, 187, 199
Assume one of the patients is chosen at random. What is the Expected
Value of the Weight of that patient?
> pwt<-c(108, 110, 123, 134, 135, 145, 167, 187, 199)
> prob<-rep(1/length(pwt), length(pwt)) # each patient is equally likely to be chosen, so P(X)=1/9
> expected<-sum(pwt*prob)
> expected
[1] 145.3333 # this is the expected value {Expected Value µ=∑XP(X)}, the same as mean(pwt)
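When a patient is chosen at random, each of the 9 patients has probability 1/9, so the expected value reduces to the arithmetic mean of the weights. A minimal sketch under that equal-probability assumption:

```python
from fractions import Fraction

weights = [108, 110, 123, 134, 135, 145, 167, 187, 199]

# Assumption: uniform random choice, so P(X = x) = 1/9 for every patient
p = Fraction(1, len(weights))
expected = sum(w * p for w in weights)   # E[X] = sum of x * P(x)

print(expected, round(float(expected), 4))
```

Weighting each value by its share of the total weight (as `prop.table` does in R) is not a probability of selection and would overstate the result.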

Q9) Calculate Skewness, Kurtosis & draw inferences on the following data
Cars speed and distance

Skewness
{E[((X-µ)/σ)^3]}

Speed is slightly negatively skewed
> skewness(car$speed)
[1] -0.1139548

Dist is positively skewed
> skewness(car$dist)
[1] 0.7824835

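Kurtosis
{E[((X-µ)/σ)^4] - 3} (excess kurtosis)

Taking the kurtosis() output as the raw value, for which a normal distribution gives 3:

Speed has negative excess kurtosis (2.42 - 3 = -0.58), as it has more data spread across at the same level (flatter than a normal distribution)
> kurtosis(car$speed)
[1] 2.422853

Dist has slightly positive excess kurtosis (3.25 - 3 = 0.25)
> kurtosis(car$dist)
[1] 3.248019

SP and Weight(WT)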
Skewness
{E[((X-µ)/σ)^3]}

SP is positively skewed
> skewness(sp$SP)
[1] 1.581454

WT is moderately negatively skewed
> skewness(sp$WT)
[1] -0.6033099

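Kurtosis
{E[((X-µ)/σ)^4] - 3} (excess kurtosis)

Taking the kurtosis() output as the raw value, for which a normal distribution gives 3:

SP has positive excess kurtosis (5.72 - 3 = 2.72) and seems to have many outliers
> kurtosis(sp$SP)
[1] 5.723521

WT has positive excess kurtosis (3.82 - 3 = 0.82) and has outliers
> kurtosis(sp$WT)
[1] 3.819466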
Q10) Draw inferences about the following boxplot & histogram

This is positively Skewed data, positive kurtosis and has outliers

This is positively Skewed data, positive kurtosis and has many outliers
Q11) Suppose we want to estimate the average weight of an adult male in
Mexico. We draw a random sample of 2,000 men from a population of
3,000,000 men and weigh them. We find that the average person in our
sample weighs 200 pounds, and the standard deviation of the sample is 30
pounds. Calculate the 94%, 98% and 96% confidence intervals.
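One way to work this out is with a z-based interval, mean ± z * s/sqrt(n). The sketch below assumes the sample of 2,000 is large enough for the normal approximation and ignores the finite population correction, which is negligible for 2,000 out of 3,000,000. The function name `z_interval` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, sample_sd, n = 200, 30, 2000
std_err = sample_sd / sqrt(n)   # standard error of the sample mean

def z_interval(confidence):
    """Two-sided z confidence interval for the mean at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * std_err
    return sample_mean - margin, sample_mean + margin

intervals = {c: z_interval(c) for c in (0.94, 0.96, 0.98)}
for c, (lo, hi) in intervals.items():
    print(f"{c:.0%} CI: ({lo:.2f}, {hi:.2f})")
```

Under these assumptions the intervals come out to roughly (198.74, 201.26) for 94%, (198.62, 201.38) for 96% and (198.44, 201.56) for 98%.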
Q12) Below are the scores obtained by a student in tests

34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56
1) Find mean, median, variance, standard deviation.
> stud<-c(34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56)
> length(stud)
[1] 18
> mean(stud)
[1] 41
> median(stud)
[1] 40.5
> var(stud)
[1] 25.52941
> sd(stud)
[1] 5.052664

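2) What can we say about the student marks?

Most of the marks lie in a narrow band between 34 and 42, close to the mean
of 41 and the median of 40.5. The mean is slightly above the median, so the
marks are slightly positively skewed. Only 1 score (56) is high
comparatively; the remaining scores are almost the same.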
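The summary statistics for these scores can be cross-checked with Python's standard library. This sketch uses `statistics.variance` and `statistics.stdev`, which are the sample (n - 1) versions and so match R's `var` and `sd`.

```python
from statistics import mean, median, multimode, stdev, variance

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

summary = {
    "n": len(scores),
    "mean": mean(scores),
    "median": median(scores),
    "mode": multimode(scores),      # list of the most frequent value(s)
    "variance": variance(scores),   # sample variance, divisor n - 1
    "sd": stdev(scores),            # sample standard deviation
}
for name, value in summary.items():
    print(name, value)
```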

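Q13) What is the nature of skewness when mean, median of data are equal?
The skewness is zero: the distribution is symmetric (a normal distribution is one example)
Q14) What is the nature of skewness when mean > median ?
It is positively skewed (right-skewed) data
Q15) What is the nature of skewness when median > mean?
It is negatively skewed (left-skewed) data
Q16) What does a positive kurtosis value indicate for a data ?
Taking kurtosis as excess kurtosis, it means the data has a sharper peak and heavier tails than a normal distribution
Q17) What does a negative kurtosis value indicate for a data?
The data has a flatter peak and lighter tails than a normal distribution; the values are spread across more evenly.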
Q18) Answer the below questions using the below boxplot visualization.

What can we say about the distribution of the data?
It is not normally distributed and has most values between 10 and 18;
the median is approx. 15.
What is nature of skewness of the data?
It is negatively skewed data.
What will be the IQR of the data (approximately)?
Q1 = 10
Q2 (median) = 15
Q3 = 18
Lower Extreme = 1
Upper Extreme = 19
IQR = Q3 - Q1 = 18 - 10 = 8 (approximately)
Q19) Comment on the below Boxplot visualizations?

Draw an Inference from the distribution of data for Boxplot 1 with respect to
Boxplot 2.

Boxplot 2 is more spread out: its values cover a wider IQR compared to
boxplot 1.
Q 20) Calculate probability from the given dataset for the below cases
Data _set: Cars.csv
Calculate the probability of MPG of Cars for the below cases.
MPG <- Cars$MPG
a. P(MPG>38)
b. P(MPG<40)
c. P (20<MPG<50)
First find the mean and standard deviation, then calculate the probabilities assuming MPG is normally distributed:

> m<-mean(car1$MPG)
> m
[1] 34.42208
> s<-sd(car1$MPG)
> s
[1] 9.131445
a. > 1-pnorm(38,m,s)
[1] 0.3475939 => 34.76%
b. > pnorm(40,m,s)
[1] 0.7293499 => 72.93%
c. > z2<-pnorm(50,m,s) # 0.9559927
> z1<-pnorm(20,m,s) # 0.05712378
> z2-z1
[1] 0.8988689 => 89.89%
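The same three probabilities can be reproduced in Python from the reported mean and standard deviation. This sketch assumes, as the R calculation does, that MPG is approximately normal; the two constants are copied from the output above rather than recomputed from Cars.csv.

```python
from statistics import NormalDist

# Mean and standard deviation of MPG as reported in the R output
m, s = 34.42208, 9.131445
mpg = NormalDist(mu=m, sigma=s)

p_gt_38 = 1 - mpg.cdf(38)             # a. P(MPG > 38)
p_lt_40 = mpg.cdf(40)                 # b. P(MPG < 40)
p_20_50 = mpg.cdf(50) - mpg.cdf(20)   # c. P(20 < MPG < 50)

print(round(p_gt_38, 4), round(p_lt_40, 4), round(p_20_50, 4))
```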

Q 21) Check whether the data follows normal distribution


a) Check whether the MPG of Cars follows Normal Distribution
Dataset: Cars.csv
Code:

import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st

cd=pd.read_csv("C:\\Users\\DS0029tu\\Desktop\\Data Science\\CSV & EXL files\\Cars.csv")

# Q-Q plot of MPG against a normal distribution
st.probplot(cd['MPG'],dist="norm",plot=plt)
plt.show()

# histogram of MPG
plt.hist(cd['MPG'])
plt.show()

The MPG of cars does not follow a normal
distribution; it is more of a right skewed data.
b) Check Whether the Adipose Tissue (AT) and Waist Circumference(Waist)
from wc-at data set follows Normal Distribution
Dataset: wc-at.csv
QQ plot of Waist Circumference: QQ plot of AT:

Both Waist and AT do not follow a normal distribution

Q 22) Calculate the Z scores of 90% confidence interval, 94% confidence
interval, 60% confidence interval

st.norm.ppf(0.95,0,1) # 90% Confidence Interval
1.6448536269514722

st.norm.ppf(0.97,0,1) # 94% Confidence Interval
1.8807936081512509

st.norm.ppf(0.80,0,1) # 60% Confidence Interval
0.8416212335729143
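The percentile passed to `ppf` is 1 - (1 - confidence)/2, because a two-sided interval leaves half of the remaining probability in each tail. A small sketch of that mapping using only the standard library (the helper name `z_score` is hypothetical):

```python
from statistics import NormalDist

def z_score(confidence):
    """Critical z value for a two-sided confidence interval."""
    upper_percentile = 1 - (1 - confidence) / 2   # e.g. 0.90 -> 0.95
    return NormalDist().inv_cdf(upper_percentile)

z_values = {c: z_score(c) for c in (0.90, 0.94, 0.60)}
for c, z in z_values.items():
    print(f"{c:.0%}: z = {z:.4f}")
```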

Q 23) Calculate the t scores of 95% confidence interval, 96% confidence
interval, 99% confidence interval for sample size of 25
n=25; df=n-1=24
For a two-sided interval the t score is read at the 1-(1-confidence)/2 percentile:
T score of 95% Confidence Interval=2.064 (97.5th percentile, value from T score table)
T score of 96% Confidence Interval=2.172 (98th percentile; 2.492 is the 99th percentile, i.e. a 98% interval)
T score of 99% Confidence Interval=2.797 (99.5th percentile, value from T score table)

Q 24) A Government company claims that an average light bulb lasts 270
days. A researcher randomly selects 18 bulbs for testing. The sampled bulbs
last an average of 260 days, with a standard deviation of 90 days. If the
CEO's claim were true, what is the probability that 18 randomly selected
bulbs would have an average life of no more than 260 days

Hint:

rcode -> pt(tscore,df)
df -> degrees of freedom

n=18; df=17
S=90
t=(260-270)/(90/sqrt(18))=-10/21.213=-0.471

In python code:
import math
from scipy import stats
n=math.sqrt(18)
err=90/n
tscore=(260-270)/err

tscore
Out[10]: -0.4714045207910317

stats.t.cdf(tscore, df=17) # same as pt(tscore,17) in R

The required probability is P(T <= -0.471) with 17 degrees of freedom, which is
approx. 0.32, i.e. about a 32% chance that the sample average is no more than 260 days.
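The lower-tail probability can also be checked without a t table. The sketch below is not the `pt` or SciPy implementation: it integrates the Student t density numerically with Simpson's rule, using only the standard library, and the helper names `t_pdf` and `t_cdf` are hypothetical.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t) via composite Simpson's rule on [0, |t|] plus symmetry."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 <= T <= |t|)
    return 0.5 + area if t >= 0 else 0.5 - area

n, claimed_mean, sample_mean, sample_sd = 18, 270, 260, 90
t_score = (sample_mean - claimed_mean) / (sample_sd / sqrt(n))
p = t_cdf(t_score, df=n - 1)
print(round(t_score, 4), round(p, 4))
```

The t score is negative because the sample mean is below the claimed mean, and the probability is the area to the left of it, which comes out near 0.32.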
