FOUNDATIONS OF DATA SCIENCE
ASSIGNMENT I
1 (a) (i) — Handling Variables with >30% Missing Values
When dealing with datasets where some variables have more than 30% missing values, it's
crucial to be cautious, as improper handling can lead to misleading conclusions. Here’s a
step-by-step strategy:
1. Evaluate the Cause of Missingness
MCAR (Missing Completely at Random): Missingness has no relationship to any
values, observed or unobserved.
MAR (Missing at Random): Missingness is related to observed data.
MNAR (Missing Not at Random): Missingness is related to the unobserved (missing)
values themselves.
Understanding this helps decide whether to impute, drop, or analyze separately.
2. Dropping the Variable (if justified)
If the variable is not critical to the analysis and has low correlation with the output, it
can be dropped.
Also consider:
o Data redundancy (are similar features present?)
o Sample size reduction impact
3. Imputation Techniques (if variable is important):
Technique | Description | Best For
Mean/Median/Mode | Replace missing values with a measure of central tendency | Numerical data
KNN Imputation | Finds k similar instances and uses their values | Structured data
MICE | Predicts missing values by modeling each feature | Complex datasets
Regression Imputation | Uses regression models to predict missing data | Continuous variables
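To make these concrete, here is a minimal sketch of mean and KNN imputation using scikit-learn's SimpleImputer and KNNImputer (the DataFrame and column names are illustrative placeholders; MICE-style imputation is available separately through scikit-learn's experimental IterativeImputer):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 47, 51, np.nan, 33],
                   "income": [42000, 55000, np.nan, 71000, 38000, 46000]})

# Mean imputation: replace each missing value with the column mean
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# KNN imputation: fill each gap from the k most similar rows
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                           columns=df.columns)
print(mean_imputed)
print(knn_imputed)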
4. Add a Missing Indicator Variable
Create a new binary column (e.g., is_missing_age) where:
o 1 → missing
o 0 → present
Helps models learn if the absence itself is meaningful.
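A short pandas sketch of this idea (the DataFrame df and the age column are hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, np.nan, 33]})
# 1 where age is missing, 0 where it is present
df["is_missing_age"] = df["age"].isna().astype(int)
print(df)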
5. Use External Sources or Domain Knowledge
Cross-reference other datasets or use expert input to intelligently fill gaps.
Example: Public records, customer databases.
6. Compare Models With and Without That Feature
Train a model including the feature and another excluding it.
Use performance metrics (accuracy, F1-score, etc.) to decide which to keep.
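A hedged sketch of this comparison, using cross-validation on scikit-learn's built-in breast-cancer dataset (the model and the dropped feature are illustrative choices, not a prescription):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=5000)

# Cross-validated accuracy with the feature, then without it
with_feature = cross_val_score(model, X, y, cv=5).mean()
without_feature = cross_val_score(model, X.drop(columns=["mean radius"]), y, cv=5).mean()
print(f"with: {with_feature:.3f}  without: {without_feature:.3f}")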
1 (a) (ii) — Feature Selection Methods
Feature selection is the process of identifying the most relevant variables that contribute
significantly to predicting the target variable. This improves model performance and reduces
overfitting.
Types of Feature Selection Methods:
Method Type | Description | Examples
Filter | Selects features based on statistical tests | Pearson, Chi-square, Mutual Info
Wrapper | Uses a predictive model to assess feature subsets | RFE, Forward/Backward Selection
Embedded | Feature selection during model training | LASSO, tree-based feature importance
Hybrid | Combines filter + wrapper | Multi-stage approaches
1. Filter Methods (In-depth)
Use statistical measures to score features.
Examples:
o Correlation Coefficient: High correlation with target = useful.
o Chi-Square Test: Used for categorical features.
o Information Gain / Entropy: In decision trees.
Advantages:
o Fast and scalable.
Disadvantages:
o Doesn’t consider feature interdependencies.
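As an illustration, a minimal filter-method sketch using scikit-learn's SelectKBest with the chi-square test on the built-in iris dataset:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
# Score every feature against the target and keep the 2 best
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))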
2. Wrapper Methods (In-depth)
Use a predictive model as a black box to evaluate feature subsets.
Examples:
o Recursive Feature Elimination (RFE): Recursively removes least important
feature.
o Forward Selection: Start with no features and add one at a time.
o Backward Elimination: Start with all features, remove one by one.
Advantages:
o Better performance (evaluates actual model output).
Disadvantages:
o Time-consuming, prone to overfitting if not careful.
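A minimal wrapper-method sketch using scikit-learn's RFE (the estimator choice is illustrative):
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# Recursively drop the least important feature until 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("Feature ranking (1 = kept):", rfe.ranking_)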
QUESTION 2
2 (a) (i): Skewness in Distributions
(1) Incomes of tax payers have a mean of $48,000 and a median of $43,000.
Interpretation:
o The mean is higher than the median, which typically suggests that there are a
few very high income values pulling the average (mean) up.
Conclusion:
o This type of distribution is positively skewed (right-skewed).
o In a positively skewed distribution, the tail on the right side is longer, and
the mass of the distribution is concentrated on the left.
Positive Skew = Mean > Median > Mode
(2) GPAs for all students at some college have a mean of 3.01 and a median of 3.20.
Interpretation:
o The mean is lower than the median, which implies that a few very low GPAs
are pulling the average (mean) downward.
Conclusion:
o This distribution is negatively skewed (left-skewed).
o In a negatively skewed distribution, the tail is longer on the left, and the bulk
of the values are on the right.
Negative Skew = Mean < Median < Mode
2 (a) (ii): Descriptive Statistics and Distribution Shape
Given Data:
2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3
(1) Mode, Median, and Mean
Step 1: Sort the data:
2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28
Mode:
o The most frequently occurring value is 5, which appears 3 times.
o ✅ Mode = 5
Median:
o Total values = 15 (odd number), so the middle value is at position (15+1)/2
= 8.
o The 8th number in sorted list is 5.
o ✅ Median = 5
Mean:
o Sum of values = 117
o Mean = 117 / 15 = 7.80
o ✅ Mean ≈ 7.80
(2) Distribution Shape Based on Measures
When the mean exceeds the median and mode, it generally indicates a positively
skewed distribution. Here the mean (7.80) is well above the median and mode (both 5).
The higher values such as 17 and 28 increase the mean, pulling it to the right.
The bulk of the data is on the lower side (2 to 8), and the tail is stretched to the right.
Conclusion: The distribution is positively skewed (right-skewed).
Final Conclusion:
Mean: 7.80
Median: 5
Mode: 5
Shape: Positively Skewed
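These results can be checked quickly with Python's built-in statistics module:
import statistics

data = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]
print(statistics.mean(data))    # 7.8
print(statistics.median(data))  # 5 (8th value of the 15 sorted values)
print(statistics.mode(data))    # 5 (appears 3 times)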
QUESTION 3
3 (a) (i): Impact When the Goodness-of-Fit Test Score is Low
The goodness-of-fit test is commonly used to determine how well a model's predicted values
match the observed data. It measures how accurately the model captures the pattern in the
data.
When the goodness-of-fit score is low, it indicates:
1. Model does not represent the data well
o The observed values differ significantly from the predicted values.
o The model fails to capture important patterns or trends in the data.
2. Model assumptions may be violated
o For example, linear regression assumes linearity, normal distribution of
residuals, homoscedasticity, and independence.
o A poor fit may indicate outliers or missing variables.
3. Forecasting becomes unreliable
o Any predictions made using a poor-fit model are likely to be inaccurate.
o The confidence intervals become wide, and error terms increase.
4. High residual errors
o The differences between actual and predicted values are large.
o This decreases the trust in the model's predictions.
5. Need for model improvement
o Data transformation (e.g., log scale), adding interaction terms, or choosing a
different model (e.g., polynomial regression) may improve the fit.
Example:
If we are predicting house prices based on size, but we ignore other important variables like
location or number of rooms, our model will have a poor fit and low accuracy.
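As a sketch, the R² score is one common goodness-of-fit measure; when the target is dominated by omitted variables, as in the house-price example above, it comes out low. The numbers here are illustrative only:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

size = np.array([800, 1200, 1500, 2000, 2400]).reshape(-1, 1)
price = np.array([95, 310, 150, 420, 230])  # driven largely by omitted factors (e.g., location)

model = LinearRegression().fit(size, price)
print("R^2:", r2_score(price, model.predict(size)))  # low value -> poor fit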
3 (a) (ii): Regression Analysis for Predicting Salary
Let's work through this step by step using simple linear regression.
Dataset Recap:
Age (x) Salary (y)
54 67000
42 43000
49 55000
57 71000
35 25000
We want to predict salary when age = 45
Step-by-Step Calculation:
Step 1: Calculate Means
x̄ = (54 + 42 + 49 + 57 + 35) / 5 = 237 / 5 = 47.4
ȳ = (67000 + 43000 + 55000 + 71000 + 25000) / 5 = 261000 / 5 = 52200
Step 2: Calculate Slope (b)
Using the formula:
b = ∑(xᵢ - x̄)(yᵢ - ȳ) / ∑(xᵢ - x̄)²
Let’s create a table for intermediate steps:
x y x - x̄ y - ȳ (x - x̄)(y - ȳ) (x - x̄)²
54 67000 6.6 14800 97680 43.56
42 43000 -5.4 -9200 49680 29.16
49 55000 1.6 2800 4480 2.56
57 71000 9.6 18800 180480 92.16
35 25000 -12.4 -27200 337280 153.76
Sum:
∑(x - x̄)(y - ȳ) = 669600
∑(x - x̄)² = 321.2
b = 669600 / 321.2 ≈ 2084.68
Step 3: Calculate Intercept (a)
a = ȳ - b·x̄ = 52200 - (2084.68 × 47.4) ≈ 52200 - 98814 ≈ -46614
Step 4: Predict Salary for Age 45
Salary = a + b·x = -46614 + 2084.68 × 45 ≈ -46614 + 93811 ≈ 47197
Final Predicted Salary ≈ ₹47,197
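The hand calculation can be verified with NumPy's least-squares fit (np.polyfit):
import numpy as np

age = np.array([54, 42, 49, 57, 35])
salary = np.array([67000, 43000, 55000, 71000, 25000])

# Degree-1 polynomial fit = simple linear regression; returns [slope, intercept]
slope, intercept = np.polyfit(age, salary, 1)
print(f"b = {slope:.2f}, a = {intercept:.2f}")
print("Predicted salary at age 45:", slope * 45 + intercept)  # ≈ 47197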
QUESTION 4
4 (a) Define Dictionary in Python. Do the following operations on dictionaries.
Definition of Dictionary in Python:
A dictionary in Python is a collection of key: value pairs, used to store data like a map.
Unlike data types that hold only a single value as an element, a dictionary holds
key: value pairs; since Python 3.7, dictionaries preserve insertion order.
Each key in a dictionary is unique, and it is used to access the corresponding value.
Syntax:
my_dict = {
    "key1": "value1",
    "key2": "value2"
}
(i) Initialize two dictionaries (D1 and D2) with key and value pairs.
D1 = {'a': 1, 'b': 2, 'c': 3}
D2 = {'b': 2, 'c': 4, 'd': 5}
(ii) Compare those two dictionaries with master key list ‘M’ and print the
missing keys.
M = ['a', 'b', 'c', 'd', 'e'] # master key list
missing_in_D1 = [key for key in M if key not in D1]
missing_in_D2 = [key for key in M if key not in D2]
print("Keys missing in D1:", missing_in_D1)
print("Keys missing in D2:", missing_in_D2)
Output:
Keys missing in D1: ['d', 'e']
Keys missing in D2: ['a', 'e']
(iii) Find keys that are in D1 but NOT in D2.
keys_only_in_D1 = D1.keys() - D2.keys()
print("Keys in D1 but not in D2:", keys_only_in_D1)
Output:
Keys in D1 but not in D2: {'a'}
(iv) Merge D1 and D2 and create D3 using expressions.
If both dictionaries have the same key, values from D2 will override values from D1.
D3 = {**D1, **D2}
print("Merged Dictionary D3:", D3)
Output:
Merged Dictionary D3: {'a': 1, 'b': 2, 'c': 4, 'd': 5}
Visual Explanation of Python Dictionary Operations
Step-by-step Flowchart Representation
START
|
v
Create D1 → {'a':1, 'b':2, 'c':3}
Create D2 → {'b':2, 'c':4, 'd':5}
|
v
Create master key list M → ['a','b','c','d','e']
|
v
[Find missing keys]
|
├── Missing in D1 → M - keys in D1 = ['d', 'e']
└── Missing in D2 → M - keys in D2 = ['a', 'e']
|
v
[Find keys in D1 but not in D2]
→ Compare D1.keys() - D2.keys() = {'a'}
|
v
[Merge dictionaries]
→ D3 = {**D1, **D2}
Result: {'a':1, 'b':2, 'c':4, 'd':5}
|
v
END
Key Notes:
The ** unpacking syntax is used to merge two dictionaries.
In D3 = {**D1, **D2}, if the same key exists in both, the value from D2 replaces the
one from D1.
For keys present in both, such as 'c' (3 in D1, 4 in D2), D2's value is taken.
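Note that since Python 3.9, the dictionary union operator | performs the same merge:
# Equivalent merge with the union operator (Python 3.9+); |= updates in place
D3_alt = D1 | D2
print(D3_alt)  # {'a': 1, 'b': 2, 'c': 4, 'd': 5}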
QUESTION 5
5 (a) - Python Code to Project the Globe on a 2D Flat Surface Using a Scatter Plot
Objective:
Use matplotlib to display the globe as a 2D map (cylindrical projection).
Plot three major Indian cities (e.g., Delhi, Mumbai, Chennai) on the map.
Python Code:
import matplotlib.pyplot as plt

# Coordinates for Indian cities (latitude, longitude)
cities = {
    "Delhi": (28.6139, 77.2090),
    "Mumbai": (19.0760, 72.8777),
    "Chennai": (13.0827, 80.2707)
}

# Separate latitudes and longitudes
lats = [coord[0] for coord in cities.values()]
lons = [coord[1] for coord in cities.values()]

# Plotting
plt.figure(figsize=(10, 6))
plt.title("Major Indian Cities on a 2D Globe Projection", fontsize=14)
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# Scatter the points
plt.scatter(lons, lats, color='blue', marker='o')

# Annotate cities
for city, (lat, lon) in cities.items():
    plt.text(lon + 0.5, lat + 0.5, city, fontsize=10)

# Set grid and limits
plt.grid(True)
plt.xlim(65, 90)  # Roughly India's longitudes
plt.ylim(5, 40)   # Roughly India's latitudes
plt.show()
Explanation:
Plotting latitude against longitude directly yields an equirectangular (simple
cylindrical) projection: a flattened 2D view of the globe.
Cities Plotted:
o Delhi (28.61°N, 77.21°E)
o Mumbai (19.08°N, 72.88°E)
o Chennai (13.08°N, 80.27°E)
The map is simple and does not require Basemap or other geospatial libraries,
making it suitable for academic purposes.