Foundations of Data Analytics

1. The document contains exercises from a CS course involving loading and analyzing datasets in R. It loads an automobiles dataset, analyzes missing values, and calculates summary statistics like median price and mean price of four-door cars. It also loads an abalone dataset, plots relationships between variables, identifies outliers, and calculates Pearson correlations. 2. Key results include: the number of cars starting with M is 39; there are 36 combinations with missing values in the automobiles data; the median price of four-door cars is $11,245 and mean is $13,565.67. For the abalone data, outliers are identified and correlations above 0.95 are listed. 3. Empirical CDF

CS 910 Exercise Sheet 2: Trying out tools

Question 1
## Loading the .csv file into the data frame Auto
Auto <- read.csv("C:\\Users\\Akhilesh Pandey\\Desktop\\Automobiles.csv", header = FALSE,
sep = ",")
## Taking the first letter of the make column (V3) and tabulating it as a data frame
count <- as.data.frame(table(substr(Auto$V3, start = 1, stop = 1)))
## Keeping the rows whose first letter is M or m
subset(count, Var1 == "m" | Var1 == "M")
##   Var1 Freq
## 8    m   39
The number of cars whose make starts with M is 39.
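The same count can be taken directly, without building the frequency table first. The following is a minimal sketch on a made-up vector; `makes` is a hypothetical stand-in for `Auto$V3`, not the real data. It folds the first letter to lower case so that "M" and "m" are treated alike.

```r
# Hypothetical stand-in for Auto$V3 (the make column); not the real data
makes <- c("mazda", "Mercedes-benz", "audi", "mitsubishi", "bmw", "mercury")

# Extract the first character, fold case, and compare with "m"
first_letter <- tolower(substr(makes, 1, 1))

# Summing a logical vector counts the TRUE entries
n_m <- sum(first_letter == "m")
n_m
```

On the real data frame the same two lines would apply to `Auto$V3`, assuming that column holds the make as in the code above.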
Question 2 (a)
## Storing the required columns in a separate data frame count
count <- as.data.frame(table(Auto$V4, Auto$V5, Auto$V6, Auto$V7, Auto$V8, Auto$V9))
## removing the combinations with zero occurrences
count <- subset(count, as.numeric(count$Freq) > 0)
## taking count of rows
nrow(count)
## [1] 36
The total number of unique combinations that occur in the data, including those with a missing value ("?") in
one of the six columns, is 36.
(b)
## Storing the required columns in a separate data frame count
count <- as.data.frame(table(Auto$V4, Auto$V5, Auto$V6, Auto$V7, Auto$V8, Auto$V9))
## removing the combinations with zero occurrences
count <- subset(count, as.numeric(count$Freq) > 0)
## saving the list of cols and removing the rows with ? in any field
collist <- c("Var1", "Var2", "Var3", "Var4", "Var5", "Var6")
sel <- apply(count[, collist], 1, function(row) !"?" %in% row)
count <- count[sel, ]
## taking count of rows
nrow(count)
## [1] 34
The total number of unique combinations with no missing value ("?") in any of the six columns
is 34.
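The two counts can also be reproduced with `unique()` instead of a six-way `table()`, which avoids creating the many zero-frequency cells. This is only a sketch on a hypothetical three-column miniature of V4 to V9; the `toy` values are invented for illustration.

```r
# Hypothetical miniature of the categorical columns; "?" marks a missing value
toy <- data.frame(
  V4 = c("gas", "gas", "diesel", "gas"),
  V5 = c("std", "std", "turbo", "std"),
  V6 = c("two", "two", "?", "four"),
  stringsAsFactors = FALSE
)

# Distinct combinations that actually occur (analogue of part (a))
combos <- unique(toy)
n_all <- nrow(combos)

# Drop combinations with "?" in any column (analogue of part (b))
has_missing <- apply(combos == "?", 1, any)
n_complete <- sum(!has_missing)

c(all = n_all, complete = n_complete)
```

Here the repeated first row collapses into one combination, and the row containing "?" is excluded from the second count.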

Question 3
##Selecting cars with four doors
q3.Auto <- subset(Auto, Auto$V6 == "four")
##converting the column cost into numeric
q3.Auto$V26 <- as.numeric(as.character(q3.Auto$V26))
##Removing the NA values and displaying median
median(q3.Auto$V26, na.rm= TRUE)

## [1] 11245
The median price of four-door cars is 11245.
##Removing the NA values and displaying mean
mean(q3.Auto$V26, na.rm= TRUE)

## [1] 13565.67
The mean price of four-door cars is 13565.67.
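The `as.numeric(as.character(...))` step matters when the price column has been read as a factor, which can happen because the "?" entries prevent it from being parsed as numeric. The sketch below uses a made-up `price` vector (an assumption, not the real column) to show the difference.

```r
# Hypothetical price column stored as a factor, with "?" for a missing price
price <- factor(c("13495", "?", "16500", "6295"))

# as.numeric() on a factor returns the internal level codes, not the prices
codes <- as.numeric(price)

# Converting to character first recovers the numbers; "?" becomes NA
# (suppressWarnings hides the expected coercion warning)
values <- suppressWarnings(as.numeric(as.character(price)))

# na.rm = TRUE drops the NA before computing the statistics
median(values, na.rm = TRUE)
mean(values, na.rm = TRUE)
```

If the column is already a character vector, the `as.character()` call is harmless, so the same line works in both cases.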
Question 4
## Loading the .csv file into the data frame Abal
Abal <- read.csv("C:\\Users\\Akhilesh Pandey\\Desktop\\Abalone.csv", header = TRUE,
sep = ",")
## Plotting the graph of height and length columns
plot(Abal$Height, Abal$Length, main = "Scatterplot showing Height and Length of Abalone",
xlab = "Height", ylab = "Length", pch = 1, ylim = c(0, 1.2))
abline(lm(Abal$Length ~ Abal$Height), col = "red")
lines(lowess(Abal$Height, Abal$Length), col = "blue")

[Figure: Scatterplot showing Height and Length of Abalone. x-axis: Height; y-axis: Length.]

##Equation of the fitted regression line
lm(formula = Abal$Length ~ Abal$Height)->equation
equation
##
## Call:
## lm(formula = Abal$Length ~ Abal$Height)
##
## Coefficients:
## (Intercept)  Abal$Height
##      0.1925       2.3761
The fitted line is therefore Length = 0.1925 + 2.3761 * Height.

Outliers are values that differ markedly from the bulk of the dataset and therefore stand out. They can
arise for many reasons, e.g. data being entered incorrectly, missing values, etc. In our plot, the points
(0, 0.43) and (0, 0.315) are outliers because the Height has been entered as 0 for these observations.
The points (0.515, 0.705) and (1.13, 0.455) are also outliers, as they lie very far from the
regression line.
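Points far from the regression line can also be flagged numerically rather than by eye. The sketch below is one possible approach, not the method used above: it fits a line to invented data with one planted outlier and flags observations whose standardized residual exceeds 2 in absolute value. The data and the cutoff of 2 are assumptions chosen for illustration.

```r
# Hypothetical data: length is exactly linear in height, except for one point
height <- (1:10) * 0.05
len <- 0.2 + 2 * height
len[5] <- len[5] + 0.5   # planted outlier at observation 5

# Fit the regression line, as in the lm() call above
fit <- lm(len ~ height)

# Standardized residuals put every observation on a comparable scale
std_res <- rstandard(fit)

# Flag observations beyond the (assumed) cutoff of 2
flagged <- which(abs(std_res) > 2)
flagged
```

Note that a point with an extreme Height can pull the fitted line toward itself, so its residual may understate how unusual it is; checking for impossible values such as Height equal to 0 remains a separate step.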

Question 5
##Taking numeric columns
nAbal <- Abal[sapply(Abal, is.numeric)]
##making combinations of 2 columns
combn(colnames(nAbal),2)-> combo
##calculate PPCC
apply(combo, 2, function(x) cor(nAbal[,x[1]], nAbal[,x[2]])) -> PPCCnAbal
##Storing result as data frame
as.data.frame(PPCCnAbal)-> PPCCnAbal
##taking transpose to convert column into rows
t(combo) -> combo
##binding with result
cbind(combo, PPCCnAbal)-> soln
##filtering as per condition
subset(soln, as.numeric(as.character(soln$PPCCnAbal)) > 0.95)
##               1              2 PPCCnAbal
## 1        Length       Diameter 0.9868116
## 19 Whole.weight Shucked.weight 0.9694055
## 20 Whole.weight Viscera.weight 0.9663751
## 21 Whole.weight   Shell.weight 0.9553554

The pairs for which the Pearson product-moment correlation coefficient exceeds 0.95 are (Length, Diameter),
(Whole.weight, Shucked.weight), (Whole.weight, Viscera.weight) and (Whole.weight, Shell.weight).

Question 6
## taking rows with sex as Males
Abal_m <- subset(Abal, as.character(Abal$Sex) == "M")
## calculating the ecdf subset
ecdf.male.rings <- ecdf(Abal_m$Rings)
## taking rows with sex as Females
Abal_f <- subset(Abal, as.character(Abal$Sex) == "F")
## calculating the ecdf subset
ecdf.female.rings <- ecdf(Abal_f$Rings)
## taking rows with sex as Infants
Abal_i <- subset(Abal, as.character(Abal$Sex) == "I")
## calculating the ecdf subset
ecdf.infant.rings <- ecdf(Abal_i$Rings)
## Plotting the ECDF for males
plot(ecdf.male.rings, main = "Empirical CDF of various Sexes", ylab = "Quantiles of diff Sexes",
xlab = "Number of Rings", pch = 19, col = "blue")
## Adding female ECDF
lines(ecdf.female.rings, pch = 20, col = "red")
## Adding infant ECDF
lines(ecdf.infant.rings, pch = 20, col = "green")

[Figure: Empirical CDF of various Sexes. x-axis: Number of Rings; y-axis: Quantiles of diff Sexes.]
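The three ECDFs can also be built in one step by splitting the Rings column by Sex, which keeps each curve tied to its own subset. This is a minimal sketch on invented data; `toy` is a hypothetical miniature of the Sex and Rings columns.

```r
# Hypothetical miniature of the Sex and Rings columns
toy <- data.frame(
  Sex   = c("M", "M", "M", "F", "F", "F", "I", "I", "I"),
  Rings = c(9, 10, 12, 10, 11, 14, 5, 6, 8)
)

# One ECDF per sex, stored in a list named by the group label
ecdfs <- lapply(split(toy$Rings, toy$Sex), ecdf)

# Each element is a function: the proportion of the group with at most 9 rings
at_most_9 <- sapply(ecdfs, function(f) f(9))
at_most_9
```

Each element of `ecdfs` can be passed to `plot()` or `lines()` in the same way as the individual ECDF objects in the code above.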
