
FOUNDATIONS OF DATA SCIENCE

ASSIGNMENT I

1 (a) (i) — Handling Variables with >30% Missing Values

When dealing with datasets where some variables have more than 30% missing values, it's
crucial to be cautious, as improper handling can lead to misleading conclusions. Here’s a
step-by-step strategy:

1. Evaluate the Cause of Missingness

 MCAR (Missing Completely at Random): Missingness has no relationship to any values.
 MAR (Missing at Random): Missingness is related to observed data.
 MNAR (Missing Not at Random): Missingness is related to unobserved data.

Understanding this helps decide whether to impute, drop, or analyze separately.

2. Dropping the Variable (if justified)

 If the variable is not critical to the analysis and has low correlation with the output, it
can be dropped.
 Also consider:
o Data redundancy (are similar features present?)
o Sample size reduction impact

3. Imputation Techniques (if variable is important):

Technique              Description                                        Best For
Mean/Median/Mode       Replace missing values with central tendency       Numerical data
KNN Imputation         Finds k similar instances and uses their values    Structured data
MICE                   Predicts missing values by modeling each feature   Complex datasets
Regression Imputation  Uses regression models to predict missing data     Continuous variables
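The first two techniques in the table can be sketched with scikit-learn. A minimal sketch, assuming a small numeric array whose two columns stand in for age and salary (hypothetical values):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data: column 0 = age (one value missing), column 1 = salary
X = np.array([
    [25.0, 50000.0],
    [30.0, 60000.0],
    [np.nan, 62000.0],
    [40.0, 80000.0],
])

# Mean imputation: fill the NaN with the column mean, (25 + 30 + 40) / 3
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill the NaN with the mean age of the 2 nearest rows,
# where nearness is judged on the observed salary column
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

Mean imputation ignores the salary column entirely, while KNN imputation uses it to pick the two most similar rows, which is why the two fills generally differ.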

4. Add a Missing Indicator Variable

 Create a new binary column (e.g., is_missing_age) where:
o 1 → missing
o 0 → present
 Helps models learn if the absence itself is meaningful.
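The indicator column can be created in one line with pandas; the column names age and is_missing_age are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
df = pd.DataFrame({"age": [25, np.nan, 40, np.nan]})

# Binary flag: 1 where age is missing, 0 where it is present
df["is_missing_age"] = df["age"].isna().astype(int)

print(df["is_missing_age"].tolist())  # [0, 1, 0, 1]
```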

5. Use External Sources or Domain Knowledge


 Cross-reference other datasets or use expert input to intelligently fill gaps.
 Example: Public records, customer databases.

6. Compare Models With and Without That Feature

 Train a model including the feature and another excluding it.
 Use performance metrics (accuracy, F1-score, etc.) to decide which to keep.
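This comparison can be sketched with cross-validation; the synthetic dataset and the choice of logistic regression below are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; the last column plays the role of the doubtful feature
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000)

# Mean cross-validated accuracy with and without that feature
score_with = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
score_without = cross_val_score(clf, X[:, :-1], y, cv=5, scoring="accuracy").mean()

# Keep the feature only if it actually helps
keep_feature = score_with >= score_without
```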

1 (a) (ii) — Feature Selection Methods

Feature selection is the process of identifying the most relevant variables that contribute
significantly to predicting the target variable. This improves model performance and reduces
overfitting.

Types of Feature Selection Methods:

Method Type  Description                                        Examples
Filter       Selects features based on statistical tests        Pearson, Chi-square, Mutual Info
Wrapper      Uses a predictive model to assess feature subsets  RFE, Forward/Backward Selection
Embedded     Feature selection during model training            LASSO, Tree-based feature importance
Hybrid       Combines filter + wrapper                          Multi-stage approaches

1. Filter Methods (In-depth)

 Use statistical measures to score features.
 Examples:
o Correlation Coefficient: High correlation with target = useful.
o Chi-Square Test: Used for categorical features.
o Information Gain / Entropy: In decision trees.
 Advantages:
o Fast and scalable.
 Disadvantages:
o Doesn’t consider feature interdependencies.
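A filter method in scikit-learn, for example, scores every feature with a chi-square test and keeps the top k; the Iris dataset is used here only as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score each feature against the target and keep the 2 best
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)  # (150, 2)
```

Note that chi2 requires non-negative feature values, which holds for Iris measurements.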

2. Wrapper Methods (In-depth)

 Use a predictive model as a black box.
 Examples:
o Recursive Feature Elimination (RFE): Recursively removes least important
feature.
o Forward Selection: Start with no features and add one at a time.
o Backward Elimination: Start with all features, remove one by one.
 Advantages:
o Better performance (evaluates actual model output).
 Disadvantages:
o Time-consuming, prone to overfitting if not careful.
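Wrapper selection via RFE can be sketched the same way, again with Iris and logistic regression as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively drop the weakest feature until only 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the kept features
```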
QUESTION 2

2 (a) (i): Skewness in Distributions

(1) Incomes of tax payers have a mean of $48,000 and a median of $43,000.

 Interpretation:
o The mean is higher than the median, which typically suggests that there are a
few very high income values pulling the average (mean) up.
 Conclusion:
o This type of distribution is positively skewed (right-skewed).
o In a positively skewed distribution, the tail on the right side is longer, and
the mass of the distribution is concentrated on the left.

Positive Skew = Mean > Median > Mode

(2) GPAs for all students at some college have a mean of 3.01 and a median of 3.20.

 Interpretation:
o The mean is lower than the median, which implies that a few very low GPAs
are pulling the average (mean) downward.
 Conclusion:
o This distribution is negatively skewed (left-skewed).
o In a negatively skewed distribution, the tail is longer on the left, and the bulk
of the values are on the right.

Negative Skew = Mean < Median < Mode

2 (a) (ii): Descriptive Statistics and Distribution Shape

Given Data:
2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3

(1) Mode, Median, and Mean

 Step 1: Sort the data:
2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28
 Mode:
o The most frequently occurring value is 5, which appears 3 times.
o ✅ Mode = 5
 Median:
o Total values = 15 (odd number), so the middle value is at position (15+1)/2
= 8.
o The 8th number in sorted list is 5.
o ✅ Median = 5
 Mean:
o Sum of values = 117
o Mean = 117 / 15 = 7.80
o ✅ Mean = 7.80

(2) Distribution Shape Based on Measures

 When mean > median > mode, it generally indicates a positively skewed
distribution.
 The higher values such as 17 and 28 increase the mean, pulling it to the right.
 The bulk of the data is on the lower side (2 to 8), and the tail is stretched on the right.

Conclusion: The distribution is positively skewed (right-skewed).
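The three measures can be verified with Python's built-in statistics module:

```python
from statistics import mean, median, mode

data = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

print("Mode:", mode(data))            # 5 (appears 3 times)
print("Median:", median(data))        # 5 (8th value of the 15 sorted values)
print("Mean:", round(mean(data), 2))  # 7.8 (117 / 15)
```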

Final Conclusion:

 Mean: 7.80
 Median: 5
 Mode: 5
 Shape: Positively Skewed

QUESTION 3

3 (a) (i): Impact When the Goodness-of-Fit Test Score is Low

The goodness-of-fit test is commonly used to determine how well a model's predicted values
match the observed data. It measures how accurately the model captures the pattern in the
data.

When the goodness-of-fit score is low, it indicates:

1. Model does not represent the data well
o The observed values differ significantly from the predicted values.
o The model fails to capture important patterns or trends in the data.
2. Model assumptions may be violated
o For example, linear regression assumes linearity, normal distribution of
residuals, homoscedasticity, and independence.
o A poor fit may indicate outliers or missing variables.
3. Forecasting becomes unreliable
o Any predictions made using a poor-fit model are likely to be inaccurate.
o The confidence intervals become wide, and error terms increase.
4. High residual errors
o The differences between actual and predicted values are large.
o This decreases the trust in the model's predictions.
5. Need for model improvement
o Data transformation (e.g., log scale), adding interaction terms, or choosing a
different model (e.g., polynomial regression) may improve the fit.

Example:

If we are predicting house prices based on size, but we ignore other important variables like
location or number of rooms, our model will have a poor fit and low accuracy.
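One common goodness-of-fit score for regression is R². A minimal sketch with hypothetical values, showing how a model that ignores the trend scores far lower than one that tracks it:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # observed values
y_good = np.array([2.9, 5.1, 6.8, 9.2])   # predictions that track the trend
y_poor = np.array([6.0, 6.0, 6.0, 6.0])   # predictions that ignore the trend

print(r2_score(y_true, y_good))  # close to 1: good fit
print(r2_score(y_true, y_poor))  # 0: no better than predicting the mean
```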

3 (a) (ii): Regression Analysis for Predicting Salary

Let’s do this in more detail, as an easy step-by-step guide to using Simple Linear
Regression.
Dataset Recap:

Age (x) Salary (y)

54 67000

42 43000

49 55000

57 71000

35 25000

We want to predict salary when age = 45

Step-by-Step Calculation:

Step 1: Calculate Means

x̄ = (54 + 42 + 49 + 57 + 35) / 5 = 237 / 5 = 47.4

ȳ = (67000 + 43000 + 55000 + 71000 + 25000) / 5 = 261000 / 5 = 52200

Step 2: Calculate Slope (b)

Using formula:

b = ∑(xᵢ - x̄)(yᵢ - ȳ) / ∑(xᵢ - x̄)²

Let’s create a table for intermediate steps:

x y x - x̄ y - ȳ (x - x̄)(y - ȳ) (x - x̄)²

54 67000 6.6 14800 97680 43.56

42 43000 -5.4 -9200 49680 29.16

49 55000 1.6 2800 4480 2.56

57 71000 9.6 18800 180480 92.16

35 25000 -12.4 -27200 337280 153.76


Sum:

 ∑(x - x̄)(y - ȳ) = 669600
 ∑(x - x̄)² = 321.2

b = 669600 / 321.2 ≈ 2084.68

Step 3: Calculate Intercept (a)

a = ȳ - b·x̄ = 52200 - (2084.68 × 47.4) ≈ 52200 - 98814 ≈ -46614

Step 4: Predict Salary for Age 45

Salary = a + b·x = -46614 + 2084.68 × 45 ≈ -46614 + 93811 ≈ 47197

Final Predicted Salary ≈ ₹47,197
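The hand calculation can be checked with NumPy, using the same least-squares formulas:

```python
import numpy as np

x = np.array([54, 42, 49, 57, 35], dtype=float)
y = np.array([67000, 43000, 55000, 71000, 25000], dtype=float)

# Slope and intercept from the least-squares formulas above
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(round(b, 1))        # slope ≈ 2084.7
print(round(a))           # intercept ≈ -46614
print(round(a + b * 45))  # predicted salary at age 45 ≈ 47197
```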

QUESTION 4

4 (a) Define Dictionary in Python. Do the following operations on dictionaries.

Definition of Dictionary in Python:

A dictionary in Python is a collection of key: value pairs used to store data like a map
(dictionaries preserve insertion order since Python 3.7). Unlike sequence types, whose
elements are single values, a dictionary holds key: value pairs.

Each key in a dictionary is unique, and it is used to access the corresponding value.

Syntax:

my_dict = {
"key1": "value1",
"key2": "value2"
}

(i) Initialize two dictionaries (D1 and D2) with key and value pairs.
D1 = {'a': 1, 'b': 2, 'c': 3}
D2 = {'b': 2, 'c': 4, 'd': 5}

(ii) Compare those two dictionaries with master key list ‘M’ and print the
missing keys.
M = ['a', 'b', 'c', 'd', 'e'] # master key list

missing_in_D1 = [key for key in M if key not in D1]
missing_in_D2 = [key for key in M if key not in D2]
print("Keys missing in D1:", missing_in_D1)
print("Keys missing in D2:", missing_in_D2)

Output:

Keys missing in D1: ['d', 'e']
Keys missing in D2: ['a', 'e']

(iii) Find keys that are in D1 but NOT in D2.

keys_only_in_D1 = D1.keys() - D2.keys()
print("Keys in D1 but not in D2:", keys_only_in_D1)

Output:

Keys in D1 but not in D2: {'a'}

(iv) Merge D1 and D2 and create D3 using expressions.

If both dictionaries have the same key, values from D2 will override values from D1.

D3 = {**D1, **D2}
print("Merged Dictionary D3:", D3)

Output:

Merged Dictionary D3: {'a': 1, 'b': 2, 'c': 4, 'd': 5}

Visual Explanation of Python Dictionary Operations

Step-by-step Flowchart Representation

START
|
v
Create D1 → {'a':1, 'b':2, 'c':3}
Create D2 → {'b':2, 'c':4, 'd':5}
|
v
Create master key list M → ['a','b','c','d','e']
|
v
[Find missing keys]
|
├── Missing in D1 → M - keys in D1 = ['d', 'e']
└── Missing in D2 → M - keys in D2 = ['a', 'e']
|
v
[Find keys in D1 but not in D2]
→ Compare D1.keys() - D2.keys() = {'a'}
|
v
[Merge dictionaries]
→ D3 = {**D1, **D2}
Result: {'a':1, 'b':2, 'c':4, 'd':5}
|
v
END

Key Notes:

 The ** unpacking syntax is used to merge two dictionaries.
 In D3 = {**D1, **D2}, if the same key exists in both, the value from D2 replaces the
one from D1.
 For example, key 'c' maps to 3 in D1 and 4 in D2, so D2's value is taken.
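As an aside, Python 3.9+ also provides the | union operator for dictionaries, which behaves like {**D1, **D2}:

```python
D1 = {'a': 1, 'b': 2, 'c': 3}
D2 = {'b': 2, 'c': 4, 'd': 5}

# Union operator (Python 3.9+): later operand wins on duplicate keys
D3 = D1 | D2
print(D3)  # {'a': 1, 'b': 2, 'c': 4, 'd': 5}
```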

5 (a) - Python Code to Project the Globe on a 2D Flat Surface Using Scatter
Plot

Objective:

 Use matplotlib to display the globe as a 2D map (cylindrical projection).


 Plot three major Indian cities (e.g., Delhi, Mumbai, Chennai) on the map.

Python Code:
import matplotlib.pyplot as plt

# Coordinates for Indian cities (latitude, longitude)
cities = {
    "Delhi": (28.6139, 77.2090),
    "Mumbai": (19.0760, 72.8777),
    "Chennai": (13.0827, 80.2707)
}

# Separate latitudes and longitudes
lats = [coord[0] for coord in cities.values()]
lons = [coord[1] for coord in cities.values()]

# Plotting
plt.figure(figsize=(10, 6))
plt.title("Major Indian Cities on a 2D Globe Projection", fontsize=14)
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# Scatter the points
plt.scatter(lons, lats, color='blue', marker='o')

# Annotate cities
for city, (lat, lon) in cities.items():
    plt.text(lon + 0.5, lat + 0.5, city, fontsize=10)

# Set grid and limits
plt.grid(True)
plt.xlim(65, 90)  # Roughly India's longitudes
plt.ylim(5, 40)   # Roughly India's latitudes

plt.show()

Explanation:
 Cylindrical Projection: Flattened world map (2D view with latitudes vs. longitudes).
 Cities Plotted:
o Delhi (28.61°N, 77.20°E)
o Mumbai (19.07°N, 72.87°E)
o Chennai (13.08°N, 80.27°E)
 The map is simple and does not require basemap or geospatial libraries—suitable for
academic purposes.
