

Midsem I 31 03 2023

This document contains a mid-semester test for a data analytics course. It includes 20 multiple choice and numerical questions testing concepts like big data, Hadoop, probability, statistics, and data analysis techniques. It also includes 3 long answer questions asking students to discuss topics like Hadoop streams, comparing Spark and Kafka, and analyzing data in different sectors. The test assesses all 5 course outcomes of identifying big data techniques, applying statistics, analyzing data for machine learning, justifying algorithm performance in Python, and identifying real-life data applications.

Uploaded by

MUSHTAQ AHAMED

No. of Pages: 3
Name of the Faculty : Dr. S.VASANTHARATHNA

COIMBATORE INSTITUTE OF TECHNOLOGY: COIMBATORE-641 014.


(Government Aided Autonomous Institution)
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

B.E. (Electrical and Electronics Engineering) - V Semester


MID SEMESTER TEST - I
19EEE35 - DATA ANALYTICS
31-03-2023

COURSE OUTCOMES (% weightage in parentheses):

After completing the course successfully, the students will be able to:
CO1 (20%): Identify big data and its analysis techniques.
CO2 (20%): Apply probability and statistics in data analysis.
CO3 (20%): Analyze the data for machine learning algorithms.
CO4 (20%): Justify the performance measure of algorithms using the Python language.
CO5 (20%): Identify appropriate data in real-life applications.

TECHNOLOGY FOUNDATION FOR BIG DATA


Fundamentals of Big Data – Classification and characteristics of Big Data – Types of Data Analysis – Introduction
to open-source frameworks – Hadoop Environment – Mapper – Reducer – Combiner – Partitioner – Searching –
Sorting – Compression – Hive – Introduction to features of MongoDB, Cassandra, Spark, Kafka

MATHEMATICAL FOUNDATIONS
Population and Sample – Measures of Central Tendency – Measures of Deviation – Measures of Shape – Correlation
Analysis – Bayes Theorem – Probability Density Functions and Distributions – Hypothesis Testing – Dimensionality
Reduction – Principal Component Analysis – Analysis of Variance (ANOVA)

Q. No.   Question   Marks   CO   BT   PI
PART A ( 10 X 1 = 10)
1. Which of the following is an example of big data? 1 1 1 2.1.3
a) NETFLIX b) Oracle Database
c) Excel spreadsheet d) Library database
2. If a technical issue arises at random in a computer, the system can 1 1 2 2.2.1
recover by creating a backup from a copy of the data that it
had automatically saved, without the user knowing what had
happened. This describes
a) Yet Another Resource Negotiator (YARN)
b) Hadoop Distributed File System (HDFS)
c) Hadoop
d) Mapper-Reducer
3. Classification of cars is carried out by comparing a car with 1 1 2 4.1.4
others. Cluster cars into a tree structure based on the data.
Which characteristic of Big Data will the problem have?
a) Veracity b) Velocity c) Volume d) Variety
4. In Hadoop, ------ is the framework for job scheduling and 1 1 1 4.3.4
cluster resource management.
a) Yet Another Resource Negotiator (YARN)
b) Hadoop Distributed File System (HDFS)
c) NameNode and DataNode
d) Mapper-Reducer

CIT/ EEE/ VASANTHARATHNA/THEORY 1


5. _____ takes a set of data and converts it into another set of 1 1 1 4.3.3
data, where individual elements are broken down into tuples
(key/value pairs) and ______ task, which takes the output
from a map as an input and combines those data tuples into a
smaller set of tuples.
a) Reduce, Map b) Map, Reduce
c) Map, Stream d) Stream, Reduce
6. If the values of two variables move in the same direction, the 1 2 3 1.1.1
correlation is said to be
a) non-linear b) linear c) negative d) positive
7. The matrix $\begin{bmatrix} 0 & -4+i \\ 4+i & 0 \end{bmatrix}$ is 1 2 4 1.1.2
a) Symmetric b) Skew-Symmetric
c) Hermitian d) Skew-Hermitian
8. The faculty member handling the subject is interested in 1 2 1 1.1.1
checking the effectiveness of the teaching. The appropriate
test for checking the effectiveness is
a) Two population t-test with equal variance
b) Paired t-test
c) Two population t-test with unequal variance
d) Chi-Square test of independence
9. Pearson Correlation coefficient between two variables X and 1 2 3 1.1.1
Y is 0.85. Pearson Correlation Coefficient between 2+3X
and 4-5Y is
a) Less than 0.85 b) More than 0.85 c) 0.85 d) -0.85
10. Covariance between two random variables is 0.5. Correlation 1 2 2 1.1.1
between these two variables will be
a) At least 0.5 b) At most 0.5 c) 0.25 d) 0.75

PART B ( 5 X 2 = 10)
11. An airline operates several domestic flights from 4 major 2 2 4 4.2.2
airports of India: (a) New Delhi, (b) Mumbai, (c) Bangalore,
and (d) Hyderabad. The percentage of flights operated from
New Delhi, Mumbai, Bangalore, and Hyderabad are 40%,
25%, 25%, and 10%, respectively. The percentage of delayed
flights at these four airports are 10%, 8%, 7%, and 6%. If a
flight is delayed, what is the probability that the flight
originated from Bangalore airport?
12. Differentiate between data and big data. 2 1 2 5.1.2
13. With a detailed block diagram, show the MapReduce Process 2 2 3 4.3.2
for the following input words
Car, Van, Bus,
Lorry, Lorry, Bus,
Car, Lorry, Van
14. For 4 data points of two correlated variables x and y, it is 2 1 1 4.1.3
given that
$\sum x = 24,\ \sum y = 11,\ \sum x^2 = 202,\ \sum xy = 84,\ \sum y^2 = 39$
Fit a least-squares line to this data using x as the independent
variable.
15. List the different methods of acquiring data. 2 1 1 5.2.2

PART C ( 3 X 10 = 30)
16a) Discuss how Hadoop streams real-time data. 5 1 2 5.3.2

b) Compare Spark and Kafka 5 1 2 5.3.2



OR
17 Discuss any three data analyses that benefit each of the 5 1 3 6.1.1
following sectors, highlighting the data sources, the type of
analysis, and the impact on the sector:
i) Banking Sector ii) Industrial Automation Sector
18a) Two persons X and Y appear in an interview for two 5 2 2 2.1.3
vacancies in the same post. The probability of X’s selection is
1/5 and that of Y’s selection is 1/3. What is the probability
that:
i) both X and Y will be selected?
ii) only one of them will be selected?
iii) none of them will be selected?
b) X, Y, and Z each try independently to solve a problem. If 5 2 2 2.1.3
their individual probabilities for success are 1/4 , 1/2 , and
5/8, respectively, what is the probability that X and Y but not
Z will solve the problem?

OR
19a) A company faces the challenge that a few students accept the 4 2 2 2.3.1
job offer but do not join (renege). The percentage of reneging
is 20%. If the company offers 8 jobs,
i) What is the probability that all 8 candidates who accept
the offer join?
ii) What is the probability that exactly 2 candidates will
not join?
19b) The time-to-failure distribution of a lithium-ion battery unit 6 2 2 2.1.3
follows an exponential distribution with a mean time between
failures of 20000 hours.
i) What is the probability that the battery will survive for at
least 10000 hours?
ii) The manufacturer would like to decide on the warranty
such that not more than 5% of batteries fail during
the warranty period. What should the duration of the
warranty period be?
20a) Write the Hessian H_g and the discriminant D_g for the 5 2 3 2.4.2
following function:
g(x, y) = x^3 + 2y^2 + 3xy^2
What do the Hessian and the discriminant signify?
20b) Suggest and justify the type of data analytics to be 5 1 5 4.2.2
performed for the following use cases:
i. Manufacturing companies often record the
runtime, downtime, and work queue for various
machines and then analyze the data to better plan the
workloads so the machines operate closer to peak
capacity.
ii. Gaming companies use data analytics to set reward
schedules for players that keep the majority of
players active in the game.
iii. If the likelihood of a hot summer, measured as an
average of five weather models, is above 58%, it is
preferable to add fruit juices in all food servings and
rent an additional tank to increase storage of water.
OR
21a) Provide a sample that could correspond to this box plot: 5 2 5 4.2.2
[Figure: box plot, not reproduced in this copy]



21b) Suggest suitable plots for the following use cases: 5 2 4 1.1.1, 2.2.4
i. Website visitors per day for the past 30 days
ii. Distribution of grades on an exam
iii. Data distribution or clustering trends
iv. Comparing parts of a bigger set of data, highlighting
different categories, or showing change over time
v. Some total amount divided among distinct
categories

Q. No.   Question   Marks   CO   BT   PI
PART A ( 10 X 1 = 10)
1. Which of the following is an example of big data? 1 1 1 2.1.3
a) NETFLIX b) Oracle Database c) Excel spreadsheet d) Library database

Answer: a) NETFLIX
2. If a technical issue arises at random in a computer, the system can recover by 1 1 2 2.2.1
creating a backup from a copy of the data that it had automatically
saved, without the user knowing what had happened. This describes
a) Yet Another Resource Negotiator (YARN)
b) Hadoop Distributed File System (HDFS)
c) Hadoop
d) Mapper-Reducer
Answer: b) Hadoop Distributed File System (HDFS)



3. Classification of cars is carried out by comparing a car with others. 1 1 2 4.1.4
Cluster cars into a tree structure based on the data. Which
characteristic of Big Data will the problem have?

a) Veracity b) Velocity c) Volume d) Variety

Answer: d) Variety
4. In Hadoop, ------ is the framework for job scheduling and cluster 1 1 1 4.3.4
resource management.
a) Yet Another Resource Negotiator (YARN)
b) Hadoop Distributed File System (HDFS)
c) NameNode and DataNode
d) Mapper-Reducer
Answer: a) Yet Another Resource Negotiator (YARN)
5. _____ takes a set of data and converts it into another set of data, 1 1 1 4.3.3
where individual elements are broken down into tuples (key/value
pairs) and ______ task, which takes the output from a map as an input
and combines those data tuples into a smaller set of tuples.

a) Reduce, Map
b) Map, Reduce
c) Map, Stream
d) Stream, Reduce

Answer: b) Map, Reduce

6. If the values of two variables move in the same direction, the 1 2 3 1.1.1
correlation is said to be

a) non-linear b) linear c) negative d) positive

Answer: d) positive
7. The matrix $\begin{bmatrix} 0 & -4+i \\ 4+i & 0 \end{bmatrix}$ is 1 2 4 1.1.2
a) Symmetric b) Skew-Symmetric
c) Hermitian d) Skew-Hermitian

Answer: d) Skew-Hermitian
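
The answer to Question 7 can be checked numerically. The sketch below assumes the matrix is the 2 x 2 matrix with rows (0, -4+i) and (4+i, 0); the helper names `conjugate_transpose`, `negate`, and `classify` are illustrative only. A matrix A is skew-Hermitian when its conjugate transpose equals -A.

```python
def conjugate_transpose(m):
    """Return the conjugate transpose of a square matrix given as nested lists."""
    n = len(m)
    return [[m[j][i].conjugate() for j in range(n)] for i in range(n)]


def negate(m):
    """Return -m, element by element."""
    return [[-value for value in row] for row in m]


def classify(m):
    """Report which of the four properties from the question hold for m."""
    n = len(m)
    transpose = [[m[j][i] for j in range(n)] for i in range(n)]
    adjoint = conjugate_transpose(m)
    return {
        "symmetric": transpose == m,
        "skew_symmetric": transpose == negate(m),
        "hermitian": adjoint == m,
        "skew_hermitian": adjoint == negate(m),
    }


# Assumed layout of the matrix in Question 7.
A = [[0j, -4 + 1j], [4 + 1j, 0j]]
result = classify(A)
print(result)
```

Only the skew-Hermitian check holds for this matrix, which agrees with option d).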



8. The faculty member handling the subject is interested in checking the 1 2 1 1.1.1
effectiveness of the teaching. The appropriate test for checking the
effectiveness is
a) Two population t-test with equal variance
b) Paired t-test
c) Two population t-test with unequal variance
d) Chi-Square test of independence

Answer: b) Paired t-test

9. Pearson Correlation coefficient between two variables X and Y is 1 2 3 1.1.1
0.85. Pearson Correlation Coefficient between 2+3X and 4-5Y is
a) Less than 0.85 b) More than 0.85 c) 0.85 d) -0.85
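
Pearson's coefficient is unchanged by adding constants or by positive scaling, and it changes sign when exactly one variable is multiplied by a negative factor. For Question 9 this gives -0.85, option d). The sketch below illustrates the property on synthetic data; the data set, the seed, and the `pearson` helper are assumptions made for the demonstration and are not part of the paper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical positively correlated data, used only for illustration.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)]
ys = [0.85 * x + random.gauss(0, 0.5) for x in xs]

r_xy = pearson(xs, ys)
r_transformed = pearson([2 + 3 * x for x in xs], [4 - 5 * y for y in ys])

# The magnitude is preserved and the sign flips.
print(r_xy, r_transformed)
```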



10. Covariance between two random variables is 0.5. Correlation between 1 2 2 1.1.1
these two variables will be
a) At least 0.5 b) At most 0.5 c) 0.25 d) 0.75

PART B ( 5 X 2 = 10)
11. An airline operates several domestic flights from 4 major airports of 2 2 4 4.2.2
India: (a) New Delhi, (b) Mumbai, (c) Bangalore, and (d) Hyderabad.
The percentage of flights operated from New Delhi, Mumbai,
Bangalore, and Hyderabad are 40%, 25%, 25%, and 10%,
respectively. The percentage of delayed flights at these four airports
are 10%, 8%, 7%, and 6%. If a flight is delayed, what is the
probability that the flight originated from Bangalore airport?
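
Question 11 is a direct application of Bayes' theorem: P(Bangalore | delayed) = P(delayed | Bangalore) P(Bangalore) / P(delayed), where P(delayed) comes from the law of total probability. A minimal sketch with exact fractions follows; the variable names are illustrative.

```python
from fractions import Fraction

# Share of flights per airport and delay rate at each airport (from the question).
share = {
    "New Delhi": Fraction(40, 100),
    "Mumbai": Fraction(25, 100),
    "Bangalore": Fraction(25, 100),
    "Hyderabad": Fraction(10, 100),
}
delay_rate = {
    "New Delhi": Fraction(10, 100),
    "Mumbai": Fraction(8, 100),
    "Bangalore": Fraction(7, 100),
    "Hyderabad": Fraction(6, 100),
}

# Law of total probability: overall chance that a flight is delayed.
p_delayed = sum(share[a] * delay_rate[a] for a in share)

# Bayes' theorem: posterior probability of each origin given a delay.
posterior = {a: share[a] * delay_rate[a] / p_delayed for a in share}
p_bangalore = posterior["Bangalore"]

print(p_delayed, p_bangalore, float(p_bangalore))
```

With these inputs P(delayed) = 0.0835 and P(Bangalore | delayed) = 35/167, about 0.21.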

12. Differentiate between data and big data. 2 1 2 5.1.2
Answer: Data becomes big data when it is characterized by Volume, Velocity, Variety, and Veracity.

13. With a detailed block diagram, show the MapReduce process for the 2 2 3 4.3.2
following input words:

Car, Van, Bus,
Lorry, Lorry, Bus,
Car, Lorry, Van

MapReduce stages:

Input:        Car, Van, Bus, Lorry, Lorry, Bus, Car, Lorry, Van
Splitting:    Split 1: Car, Van, Bus
              Split 2: Lorry, Lorry, Bus
              Split 3: Car, Lorry, Van
Mapping:      Split 1: (Car, 1), (Van, 1), (Bus, 1)
              Split 2: (Lorry, 1), (Lorry, 1), (Bus, 1)
              Split 3: (Car, 1), (Lorry, 1), (Van, 1)
Shuffling:    Bus:   (Bus, 1), (Bus, 1)
              Car:   (Car, 1), (Car, 1)
              Lorry: (Lorry, 1), (Lorry, 1), (Lorry, 1)
              Van:   (Van, 1), (Van, 1)
Reducing:     (Bus, 2), (Car, 2), (Lorry, 3), (Van, 2)
Final result: Bus 2, Car 2, Lorry 3, Van 2
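
The stages in Question 13 can be mimicked in a few lines of plain Python. This is an in-memory sketch of the word-count flow, not Hadoop code; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative.

```python
from collections import defaultdict

# Splitting: one split per input line.
splits = [
    ["Car", "Van", "Bus"],
    ["Lorry", "Lorry", "Bus"],
    ["Car", "Lorry", "Van"],
]


def map_phase(split):
    """Mapping: emit a (word, 1) pair for every word in a split."""
    return [(word, 1) for word in split]


def shuffle_phase(mapped_splits):
    """Shuffling: group the emitted values by key, sorted by key."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reducing: sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


mapped = [map_phase(split) for split in splits]
shuffled = shuffle_phase(mapped)
counts = reduce_phase(shuffled)
print(counts)
```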

14. For 4 data points of two correlated variables x and y, it is given that 2 1 1 4.1.3

$\sum x = 24,\ \sum y = 11,\ \sum x^2 = 202,\ \sum xy = 84,\ \sum y^2 = 39$

Fit a least-squares line to this data using x as the independent variable.
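
For the line y = a + b x, the least-squares normal equations give b = (n Σxy - Σx Σy) / (n Σx² - (Σx)²) and a = (Σy - b Σx) / n. The sketch below applies these formulas to the sums in Question 14 with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Sums given in the question (n = 4 data points).
n = 4
sum_x, sum_y = 24, 11
sum_x2, sum_xy = 202, 84

# Slope and intercept from the normal equations of simple linear regression.
slope = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

print(f"y = {float(intercept):.4f} + {float(slope):.4f} x")
```

The fitted line is y = 103/116 + (9/29) x, approximately y = 0.888 + 0.310 x.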



15. List the different methods of acquiring data. 4 1 1 5.2.2
Answer: Human-generated, machine-generated, and gaming data
PART C ( 3 X 10 = 30)
16a) Discuss how Hadoop streams real-time data. 6 1 2 5.3.2

b) Compare Spark and Kafka. 6 1 2 5.3.2

Feature       | Apache Spark                                      | Kafka
Speed         | 100 times faster than Hadoop                      | Decent speed
Processing    | Real-time and batch processing                    | Real-time / window processing only
Difficulty    | Easy to learn because of high-level modules       | Easy to configure
Recovery      | Allows recovery of partitions using cache and RDD | Fault-tolerant / replication
Interactivity | Has interactive modes                             | No interactive mode / consumes the data

OR
17 Discuss any three data analyses that benefit each of the following sectors, 5 1 3 6.1.1
highlighting the data sources, the type of analysis, and the impact on the sector:
a) Banking Sector
b) Industrial Automation Sector

18a) Two persons X and Y appear in an interview for two vacancies in the same 5 2 2 2.1.3
post. The probability of X’s selection is 1/5 and that of Y’s selection is
1/3. What is the probability that: 1) both X and Y will be selected? 2) only one
of them will be selected? 3) none of them will be selected?

Worked solution for a closely related problem, with the two probabilities interchanged:

X and Y appear for an interview for two posts. The probability of X’s
selection is 1/3 and that of Y’s selection is 1/5. Find the probability
that at least one of them is selected.

The probability of non-selection of X is 1 - 1/3 = 2/3.

The probability of non-selection of Y is 1 - 1/5 = 4/5.

The probability that none of them is selected = (2/3)(4/5) = 8/15.

So, the probability that at least one of them is selected = 1 - 8/15 = 7/15.

Alternatively,

P(X ∪ Y) = P(X) + P(Y) - P(X ∩ Y) = 1/3 + 1/5 - (1/3)(1/5)

= 1/3 + 1/5 - 1/15 = (5 + 3 - 1)/15 = 7/15

So, the probability that at least one of them is selected = 7/15.
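
The three parts actually asked in Question 18a follow from the same product rule. The sketch below assumes, as the worked solution does, that the two selections are independent; the variable names are illustrative.

```python
from fractions import Fraction

# Selection probabilities as stated in Question 18a.
p_x = Fraction(1, 5)
p_y = Fraction(1, 3)

# Independence is assumed, so joint probabilities are products.
both = p_x * p_y
only_one = p_x * (1 - p_y) + (1 - p_x) * p_y
none = (1 - p_x) * (1 - p_y)
at_least_one = 1 - none

print(both, only_one, none, at_least_one)
```

This gives 1/15 for both, 2/5 for exactly one, and 8/15 for none, and it reproduces 7/15 for at least one.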
b) X, Y, and Z each try independently to solve a problem. If their 5 2 2 2.1.3
individual probabilities for success are 1/4, 1/2, and 5/8,
respectively, what is the probability that X and Y but not Z will
solve the problem?

Because X, Y, and Z are independent, we multiply the probabilities together.

Pr(X solves) = 1/4

Pr(Y solves) = 1/2

Pr(Z does not solve) = 1 - 5/8 = 3/8

Thus,

Pr(X and Y solve, Z does not) = (1/4)(1/2)(3/8) = 3/64

OR
19a) A company faces the challenge that a few students accept the job offer 4 2 2 2.3.1
but do not join (renege). The percentage of reneging is 20%. If the
company offers 8 jobs,
i) What is the probability that all 8 candidates who accept
the offer join?
ii) What is the probability that exactly 2 candidates will
not join?
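
Question 19a fits a binomial model if each of the 8 candidates is assumed to renege independently with probability 0.2. Under that assumption, a minimal sketch of the two probabilities is shown below; `binomial_pmf` is an illustrative helper.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n_offers = 8
p_renege = 0.2  # probability that a candidate accepts but does not join

# i) All 8 join, i.e. zero candidates renege.
p_all_join = binomial_pmf(0, n_offers, p_renege)

# ii) Exactly 2 candidates do not join.
p_two_renege = binomial_pmf(2, n_offers, p_renege)

print(round(p_all_join, 4), round(p_two_renege, 4))
```

The results are 0.8^8, about 0.1678, and C(8, 2)(0.2)^2(0.8)^6, about 0.2936.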
19b) The time-to-failure distribution of a lithium-ion battery unit follows an 6 2 2 2.1.3
exponential distribution with a mean time between failures of 20000
hours.
i) What is the probability that the battery will survive for at least
10000 hours?
ii) The manufacturer would like to decide on the warranty such
that not more than 5% of batteries fail during the warranty
period. What should the duration of the warranty period
be?
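
For an exponential lifetime with mean 20000 hours, the survival function is P(T ≥ t) = exp(-t / 20000), and the warranty period t solves 1 - exp(-t / 20000) = 0.05. A minimal sketch of both parts follows; the variable and function names are illustrative.

```python
import math

mean_life = 20000.0  # mean time between failures, in hours
rate = 1.0 / mean_life


def survival(t):
    """P(T >= t) for an exponential lifetime with the rate above."""
    return math.exp(-rate * t)


# i) Probability of surviving at least 10000 hours.
p_survive = survival(10000)

# ii) Longest warranty with at most 5% failures: solve 1 - exp(-rate * t) = 0.05.
max_failure_fraction = 0.05
warranty_hours = -mean_life * math.log(1 - max_failure_fraction)

print(round(p_survive, 4), round(warranty_hours, 1))
```

Part i) evaluates to exp(-0.5), about 0.6065, and part ii) to roughly 1026 hours.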



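20a) Write the Hessian H_g and the discriminant D_g for the following 5 2 3 2.4.2
function:
g(x, y) = x^3 + 2y^2 + 3xy^2
What do the Hessian and the discriminant signify?

For the function g(x, y), the first partial derivatives are g_x = 3x^2 + 3y^2 and g_y = 4y + 6xy, so

$$H_g = \begin{bmatrix} g_{xx} & g_{xy} \\ g_{yx} & g_{yy} \end{bmatrix} = \begin{bmatrix} 6x & 6y \\ 6y & 4 + 6x \end{bmatrix}, \qquad D_g = g_{xx}\, g_{yy} - g_{xy}^2 = 36x^2 + 24x - 36y^2$$

No conclusion can be drawn for the point (0, 0), since D_g(0, 0) = 0.

g_xx(1, 0) = 6 > 0 and D_g(1, 0) = 60 > 0, which is the sign pattern of a local minimum.

D_g(0, 1) = -36 < 0, which is the sign pattern of a saddle point.

g_xx(-1, 0) = -6 < 0 and D_g(-1, 0) = 12 > 0, which is the sign pattern of a local maximum.

Strictly, this classification holds only at critical points. Since g_x = 3x^2 + 3y^2 vanishes only at (0, 0), that is the only critical point of g; the evaluations at (1, 0), (0, 1), and (-1, 0) show the sign pattern of the Hessian there rather than actual extrema.

The Hessian and the corresponding discriminant are used to determine the local extreme points of a function. Evaluating them helps in understanding a function of several variables. Here are some important rules for a critical point (a, b), where both first partial derivatives vanish and the discriminant is D(a, b):

The function f has a local minimum if f_xx(a, b) > 0 and the discriminant D(a, b) > 0.

The function f has a local maximum if f_xx(a, b) < 0 and the discriminant D(a, b) > 0.

The function f has a saddle point if D(a, b) < 0.

No conclusion can be drawn if D(a, b) = 0; further tests are needed.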
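
The values quoted for g(x, y) = x^3 + 2y^2 + 3xy^2 can be reproduced from its second partial derivatives, g_xx = 6x, g_xy = 6y, and g_yy = 4 + 6x. The sketch below hard-codes those hand-derived partials; the function names are illustrative, and the gradient is included because the second-derivative test applies only where it vanishes.

```python
def gradient(x, y):
    """First partial derivatives (g_x, g_y) of g(x, y) = x^3 + 2y^2 + 3xy^2."""
    return (3 * x ** 2 + 3 * y ** 2, 4 * y + 6 * x * y)


def hessian(x, y):
    """Hessian matrix [[g_xx, g_xy], [g_yx, g_yy]] of g at (x, y)."""
    g_xx = 6 * x
    g_xy = 6 * y
    g_yy = 4 + 6 * x
    return [[g_xx, g_xy], [g_xy, g_yy]]


def discriminant(x, y):
    """Determinant of the Hessian: g_xx * g_yy - g_xy^2."""
    h = hessian(x, y)
    return h[0][0] * h[1][1] - h[0][1] * h[1][0]


points = [(0, 0), (1, 0), (0, 1), (-1, 0)]
summary = {
    p: {"gradient": gradient(*p), "g_xx": hessian(*p)[0][0], "D": discriminant(*p)}
    for p in points
}

for point, values in summary.items():
    print(point, values)
```

The discriminants come out as 0, 60, -36, and 12, and only (0, 0) has a zero gradient.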

20b) Suggest and justify the type of data analytics to be performed for the 5 1 5 4.2.2
following use cases:
i. Manufacturing companies often record the runtime,
downtime, and work queue for various machines and then
analyze the data to better plan the workloads so the machines
operate closer to peak capacity.
ii. Gaming companies use data analytics to set reward schedules
for players that keep the majority of players active in the
game.
iii. If the likelihood of a hot summer, measured as an average of
five weather models, is above 58%, it is preferable to add fruit
juices in all food servings and rent an additional tank to
increase storage of water.
OR
21a) Provide a sample that could correspond to this box plot: 5 2 5 4.2.2

[Figure: box plot, not reproduced in this copy]

There are infinitely many solutions to this problem. Two representative samples are:
{25, 40, 50, 52, 53, 55, 56, 58, 60, 68, 80} and
{25, 40, 48, 50, 51, 52, 53, 55, 56, 57, 58, 60, 65, 68, 80}
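
A box plot is drawn from the five-number summary (minimum, first quartile, median, third quartile, maximum), so two samples match the same plot when their summaries agree. The box plot itself is not available here, so the sketch below only checks that the two samples share one summary. It assumes the median-of-halves quartile convention, which is one of several in use; `five_number_summary` is an illustrative helper.

```python
from statistics import median


def five_number_summary(sample):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention.

    For an odd sample size the overall median is left out of both halves.
    """
    data = sorted(sample)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


sample_a = [25, 40, 50, 52, 53, 55, 56, 58, 60, 68, 80]
sample_b = [25, 40, 48, 50, 51, 52, 53, 55, 56, 57, 58, 60, 65, 68, 80]

summary_a = five_number_summary(sample_a)
summary_b = five_number_summary(sample_b)
print(summary_a, summary_b)
```

Under this convention both samples give (25, 50, 55, 60, 80).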

21b) Suggest suitable plots for the following use cases: 5 2 4 1.1.1, 2.2.4
i. Website visitors per day for the past 30 days: timeline chart
ii. Distribution of grades on an exam: histogram
iii. Data distribution or clustering trends: scatter plot
iv. Comparing parts of a bigger set of data, highlighting different
categories, or showing change over time: bar chart
v. Some total amount divided among distinct categories: pie chart
