DATA ANALYTICS FOR
BUILT ENVIRONMENT
ASSIGNMENT
NAME: ADARSH .R
ENROLLMENT NO.: A70059023054
BATCH: 2023-2025
Q1) Find the mean, median and mode of the data in Figure 1.
Figure 1: Histogram for Problem 1
ANSWER:
Mean (Average):
Mean = Sum of Values / Number of Values = 43 / 15 ≈ 2.87
The mean value is approximately 2.87.
Median:
Median = Middle Value of Sorted Dataset
The sorted dataset is [0,0,0,0,1,2,2,3,4,4,4,5,5,6,7].
With 15 values, the median is the 8th value of the sorted dataset, which is 3.
Mode:
Mode = Most Frequent Value
The value 0 appears 4 times, which is the highest frequency.
Thus, the mode is 0.
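The three results above can be cross-checked with a short Python sketch. It assumes the dataset read from the histogram is exactly the sorted list given in the answer, and it uses only the standard `statistics` module.

```python
from statistics import mean, median, multimode

# Dataset as listed in the answer above (assumed to match Figure 1).
data = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

data_mean = mean(data)        # sum of values / number of values = 43 / 15
data_median = median(data)    # middle (8th) value of the 15 sorted values
data_modes = multimode(data)  # list of every value with the highest frequency

print(round(data_mean, 2), data_median, data_modes)
```

`multimode` is used instead of `mode` so that a tie between two equally frequent values would be visible rather than silently resolved.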
Q2) What do you mean by Dispersion? State any two
measures of dispersion.
ANSWER:
Dispersion refers to the extent to which values in a distribution differ from the average of
the distribution. It describes how widely the individual items are scattered around a
central value, and therefore indicates how well that central value represents the data.
Two common measures of dispersion are:
1. Range: It is the simplest measure of dispersion and is defined as the difference
between the largest and the smallest item in a distribution. The range is
calculated by subtracting the smallest value from the largest value:
Range = Y max – Y min.
2. Standard Deviation: It is the square root of the arithmetic average of the squares of
the deviations measured from the mean. The standard deviation is given by the formula:
σ = √( Σ (Yi – Ȳ)² / N )
where Ȳ is the mean of the distribution and N is the number of items.
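As an illustration of the two measures, the sketch below computes the range and the population standard deviation. Reusing the Q1 dataset here is an assumption made purely for the example; the question itself supplies no data.

```python
from statistics import pstdev

# Q1 dataset, reused here only as illustrative input (an assumption).
values = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

# Range: largest item minus smallest item.
value_range = max(values) - min(values)

# Population standard deviation: square root of the average squared
# deviation from the mean (divides by N, matching the definition above).
sigma = pstdev(values)

print("Range:", value_range)
print("Standard deviation:", round(sigma, 2))
```

`pstdev` divides by N; `statistics.stdev` would divide by N - 1 and gives the sample standard deviation instead.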
Q3) Write any two assumptions of Normal Distribution.
ANSWER:
Two assumptions of the normal distribution are:
1. Symmetry: The normal distribution is symmetric around the mean, meaning that
the distribution is balanced on both sides of the mean. This is reflected in the
bell-shaped curve where most of the data points are clustered around the mean,
with fewer points at the extremes.
2. Kurtosis: The normal distribution has a kurtosis of 3 (an excess kurtosis of 0),
which means that the distribution is mesokurtic: it is neither more peaked nor
flatter than the reference bell curve. This contrasts with distributions that are
platykurtic (flatter, kurtosis below 3) or leptokurtic (more peaked, kurtosis above 3).
Q4) Draw and compare right skewed and left skewed
distribution. Discuss the behavior of mean for the above
two types of skewness.
ANSWER:
Right Skewed Distribution:
A right skewed distribution is characterized by a longer tail on the right side of the
distribution. This type of skewness is also known as positive skewness. In a right
skewed distribution, the mean is greater than the median, and the mode is often less
than the median.
Left Skewed Distribution:
A left skewed distribution is characterized by a longer tail on the left side of the
distribution. This type of skewness is also known as negative skewness. In a left
skewed distribution, the mean is less than the median, and the mode is often greater
than the median.
Comparison of Right and Left Skewed Distributions:
FACTORS    RIGHT SKEWED DISTRIBUTION        LEFT SKEWED DISTRIBUTION
MEAN       Greater than the median.         Less than the median.
MEDIAN     Less than the mean.              Greater than the mean.
MODE       Often less than the median.      Often greater than the median.
TAIL       Longer on the right side.        Longer on the left side.
EXAMPLES   Household income,                Age at death,
           individual income distribution.  scores on an easy exam.
Behavior of Mean in Right and Left Skewed Distributions:
Right Skewed Distribution: The mean is pulled away from the median by the
extreme values on the right side, making it greater than the median.
Left Skewed Distribution: The mean is pulled away from the median by the
extreme values on the left side, making it less than the median.
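The pull of the tail on the mean can be seen in a small numerical sketch. Both samples below are hypothetical values invented for illustration; the left skewed sample is simply the mirror image of the right skewed one.

```python
from statistics import mean, median

# Hypothetical samples (not from the assignment), chosen to have one long tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]         # one extreme value on the right
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]  # mirror image: extreme value on the left

right_mean, right_median = mean(right_skewed), median(right_skewed)
left_mean, left_median = mean(left_skewed), median(left_skewed)

# The extreme value drags the mean toward the tail; the median barely moves.
print("Right skewed: mean", right_mean, "median", right_median)
print("Left skewed:  mean", left_mean, "median", left_median)
```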
SECTION – B
Q5) When two balanced dice are rolled, Find the
probability that
a. the sum of the dice is 11
b. the sum of the dice is 1
ANSWER:
a. The sum of the dice is 11:
There are a total of 36 equally likely outcomes when two balanced dice are rolled, and
two of them give a sum of 11: (5,6) and (6,5). The probability is therefore:
P(sum = 11) = 2/36 = 1/18 ≈ 0.056
b. The sum of the dice is 1:
It is not possible to roll a sum of 1 with two balanced dice. The smallest possible sum is
2, which can be achieved by rolling a 1 on each die. The probability is therefore:
P(sum = 1) = 0/36 = 0
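Both parts can be verified by listing all 36 outcomes and counting the favourable ones. The helper name `prob_sum` is a hypothetical name introduced for this sketch.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die 1, die 2) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def prob_sum(target):
    """Probability that the two dice add up to `target`."""
    favourable = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(favourable, len(outcomes))

p11 = prob_sum(11)  # two favourable outcomes: (5, 6) and (6, 5)
p1 = prob_sum(1)    # no favourable outcomes

print(p11, p1)
```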
Q6) Case Description: Cyclone Wind Speeds. From Tropical
Cyclone Reports, published by the National Hurricane
Center, we obtained the data shown in the table below, in
miles per hour (mph), for one year’s tropical cyclones in
the Atlantic Basin.
a. Prepare the data for computing the three quartiles.
b. Calculate Q1, Q2 & Q3.
c. Find out IQR.
ANSWER:
1. List all the data points:
[60, 70, 85, 65, 100, 60, 110, 45, 80, 40, 105, 80, 115, 90, 50, 45, 90, 115, 50]
2. Sort the data:
[40, 45, 45, 50, 50, 60, 60, 65, 70, 80, 80, 85, 90, 90, 100, 105, 110, 115, 115]
3. Calculate Quartiles (n = 19 values):
Q1 (25th percentile): the value at position (n + 1)/4 = 5 in the sorted data.
Q1 = 50
Q2 (Median or 50th percentile): the middle value, at position (n + 1)/2 = 10.
Q2 = 80
Q3 (75th percentile): the value at position 3(n + 1)/4 = 15.
Q3 = 100
4. Calculate the IQR:
IQR = Q3 - Q1 = 100 - 50 = 50
The quartiles are therefore Q1 = 50, Q2 = 80 and Q3 = 100 mph, and the interquartile
range is 50 mph.
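The quartiles can be reproduced with the standard library. This sketch assumes the (n + 1) position convention, which is what `statistics.quantiles` uses by default (`method="exclusive"`); other conventions, such as the default linear interpolation in some numerical libraries, can give slightly different values for Q1 and Q3.

```python
from statistics import quantiles

# Wind speeds in mph, as listed in step 1 of the answer.
wind_speeds = [60, 70, 85, 65, 100, 60, 110, 45, 80, 40,
               105, 80, 115, 90, 50, 45, 90, 115, 50]

# n=4 splits the data into quarters and returns the three cut points.
# quantiles() sorts the data internally, so no manual sort is needed.
q1, q2, q3 = quantiles(wind_speeds, n=4, method="exclusive")
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```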
SECTION – C
Q7) Performing linear regression model for two variables
yielded the trend equation ŷ=345 + 77x with R-squared
value of 0.12. Estimate y for x = 15 considering x to be
within the data collection range.
ANSWER:
Given the linear regression equation ŷ = 345 + 77x, we can estimate the value of
y for x = 15.
Linear Regression Model:
The equation provided:
ŷ = 345 + 77x
This equation represents a straight line where:
ŷ is the estimated or predicted value of y.
345 is the y-intercept of the regression line.
77 is the slope of the regression line.
x is the independent variable.
Estimating y for x = 15:
Because x = 15 lies within the data collection range, the estimate is an interpolation.
Substituting x = 15 into the regression equation:
ŷ = 345 + 77 × 15 = 345 + 1155 = 1500
So, the estimated ŷ for x = 15 is 1500.
R-squared Value:
The R-squared value is 0.12. This indicates that approximately 12% of the
variability in the dependent variable y is explained by the independent
variable x. While this is low, it does not change the point estimate ŷ
for a given x; it indicates that the line explains little of the variation in y,
so the estimate should be treated with caution.
Conclusion:
Using the linear regression model ŷ = 345 + 77x, the estimated value of y
when x = 15 is 1500.
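The substitution can be written as a one-line function. The function name `predict` and its parameter names are hypothetical, chosen only for this sketch; the intercept and slope are the ones given in the question.

```python
def predict(x, intercept=345.0, slope=77.0):
    """Point estimate from the trend line y_hat = intercept + slope * x."""
    return intercept + slope * x

y_hat = predict(15)
print(y_hat)
```

The function returns only the point estimate; it says nothing about the uncertainty implied by the low R-squared value.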
Q8) A snack-food company produces a 454-g bag of
pretzels. Although the actual net weights deviate slightly
from 454 g and vary from one bag to another, the
company insists that the mean net weight of the bags be
454 g. As part of its program, the quality assurance
department periodically performs a hypothesis test to
decide whether the packaging machine is working
properly, that is, to decide whether the mean net weight
of all bags packaged is 454 g
a) Determine the null hypothesis for the hypothesis test.
b) Determine the alternative hypothesis for the
hypothesis test.
c) Classify the hypothesis test as two tailed, left tailed, or
right tailed.
ANSWER:
When conducting a hypothesis test to determine if the mean net weight of the bags is
454 grams as claimed by the snack-food company, we need to establish both the null
hypothesis (H0) and the alternative hypothesis (H1).
a) Determine the Null Hypothesis (H0)
The null hypothesis represents the status quo or the claim that is being tested. In this
case, the company claims that the mean net weight of the bags is 454 grams.
H0: μ=454 g
where μ represents the true mean net weight of the bags.
b) Determine the Alternative Hypothesis (H1)
The alternative hypothesis represents the statement that we are trying to find evidence
for. In this scenario, the quality assurance department is checking if there is a deviation
from the mean weight of 454 grams. This can be either an increase or decrease,
suggesting that the packaging machine may not be working properly.
H1: μ ≠ 454 g
c) Classify the Hypothesis Test
The hypothesis test examines whether the mean weight is different from 454 grams,
regardless of whether it is higher or lower. Therefore, we are looking for any deviation
from 454 grams, which means we are testing for both directions (greater than or less
than).
This makes the hypothesis test a two-tailed test.
Summary
Null Hypothesis (H0): The mean net weight of the bags is 454 grams.
H0: μ = 454 g
Alternative Hypothesis (H1): The mean net weight of the bags is not equal to
454 grams.
H1: μ ≠ 454 g
Type of Test: Two-tailed test, because we are interested in detecting deviations
in both directions (whether the mean is either less than or greater than 454
grams).
Q9) The Association of American Medical Colleges (AAMC)
compiles data on medical school faculty and publishes the
results in AAMC Faculty Roster. The following contingency
table cross-classifies medical school faculty by the
characteristics gender and rank.
a) Find P (R3)
b) Find P (R3 |G1)
c) Are events G1 and R3 independent? Explain your
answer.
ANSWER:
To analyze the data from the provided table, we will perform the following steps:
a) Find P(R3)
P(R3) represents the probability of selecting an Assistant Professor regardless of gender.
From the table:
The total number of Assistant Professors (R3) is 40,379.
The total number of faculty members is 98,993.
Thus, P(R3) is calculated as:
P(R3) = 40,379 / 98,993 ≈ 0.408
b) Find P(R3∣G1)
P(R3∣G1) represents the conditional probability of selecting an Assistant Professor given that the
faculty member is male.
From the table:
The number of male Assistant Professors (R3 and G1) is 25,888.
The total number of male faculty members (G1) is 70,000.
Thus, P(R3∣G1) is calculated as:
P(R3∣G1) = 25,888 / 70,000 ≈ 0.370
c) Are events G1 and R3 independent?
Two events A and B are independent if:
P(B∣A) = P(B)
For our problem, the events are:
A: Being male (G1).
B: Being an Assistant Professor (R3).
To determine whether G1 and R3 are independent, we therefore compare P(R3∣G1) with P(R3):
if P(R3∣G1) = P(R3), then G1 and R3 are independent.
From our calculations:
P(R3) ≈ 0.408
P(R3∣G1) ≈ 0.370
Since P(R3∣G1) is not equal to P(R3), the events G1 and R3 are not independent. This implies
that knowing a faculty member is male changes the probability of them being an Assistant
Professor compared to the overall probability.
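The three parts can be checked with the four counts quoted in the answer. The variable names are hypothetical, and the contingency table itself is not reproduced here, so the sketch relies only on those quoted totals.

```python
import math

# Counts quoted in the answer above (taken from the contingency table).
total_faculty = 98_993    # all faculty members
assistant_total = 40_379  # Assistant Professors (R3)
male_total = 70_000       # male faculty members (G1)
male_assistant = 25_888   # male Assistant Professors (R3 and G1)

p_r3 = assistant_total / total_faculty    # P(R3)
p_r3_given_g1 = male_assistant / male_total  # P(R3 | G1)

# G1 and R3 are independent only if the two probabilities are equal.
independent = math.isclose(p_r3, p_r3_given_g1, rel_tol=1e-9)

print(round(p_r3, 3), round(p_r3_given_g1, 3), independent)
```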