DS assignment COMPLETED DOC
ASSIGNMENT I
When dealing with datasets where some variables have more than 30% missing values, it's
crucial to be cautious, as improper handling can lead to misleading conclusions. Here’s a
step-by-step strategy:
If the variable is not critical to the analysis and has low correlation with the output, it
can be dropped.
Also consider:
o Data redundancy (are similar features present?)
o Sample size reduction impact
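As a rough illustration of the drop rule above, the sketch below computes the share of missing values per column and flags columns above the 30% mark. The toy dataset, its column names, and the helper `missing_fraction` are invented for this example; plain Python with `None` as the missing marker is used here, although a library such as pandas would typically be used in practice.

```python
# Hypothetical toy dataset: each key is a column, None marks a missing value.
data = {
    "age":    [25, None, 31, 47, None, 52, 38, 29, None, None],
    "income": [40, 52, None, 61, 58, 45, 39, 70, 66, 48],
    "city":   ["A", "B", "A", None, "C", "B", "A", "C", "B", "A"],
}

THRESHOLD = 0.30  # the 30% cut-off discussed in the text


def missing_fraction(values):
    """Return the share of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


fractions = {col: missing_fraction(vals) for col, vals in data.items()}

# Columns above the threshold are only *candidates* for dropping; relevance
# to the target and redundancy still need to be checked before removing them.
drop_candidates = [col for col, frac in fractions.items() if frac > THRESHOLD]

print(fractions)
print(drop_candidates)
```

Flagging rather than dropping immediately leaves room for the redundancy and sample-size checks listed above.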
Feature selection is the process of identifying the most relevant variables that contribute
significantly to predicting the target variable. This improves model performance and reduces
overfitting.
Method Type   Description                                      Examples
Filter        Selects features based on statistical tests      Pearson, Chi-square, Mutual Info
Wrapper       Uses predictive model to assess feature subset   RFE, Forward/Backward Selection
Embedded      Feature selection during model training          LASSO, Tree-based feature importance
Hybrid        Combines filter + wrapper                        Multi-stage approaches
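To make the filter approach concrete, here is a minimal sketch that scores each feature by its absolute Pearson correlation with the target and ranks the features. The feature names, the values, and the helper `pearson` are assumptions made up for illustration; wrapper and embedded methods would need a predictive model and are not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical features and target, invented for this illustration.
features = {
    "size":  [50, 60, 70, 80, 90, 100],
    "rooms": [2, 2, 3, 3, 4, 4],
    "noise": [5, 1, 4, 2, 5, 1],
}
target = [150, 180, 210, 240, 270, 300]

# Filter method: score each feature on its own, then rank by |correlation|.
scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(ranked)
```

Because each feature is scored independently of any model, this is cheap to run but cannot detect features that only matter in combination.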
(1) Incomes of taxpayers have a mean of $48,000 and a median of $43,000.
Interpretation:
o The mean is higher than the median, which typically suggests that there are a
few very high income values pulling the average (mean) up.
Conclusion:
o This type of distribution is positively skewed (right-skewed).
o In a positively skewed distribution, the tail on the right side is longer, and
the mass of the distribution is concentrated on the left.
(2) GPAs for all students at some college have a mean of 3.01 and a median of 3.20.
Interpretation:
o The mean is lower than the median, which implies that a few very low GPAs
are pulling the average (mean) downward.
Conclusion:
o This distribution is negatively skewed (left-skewed).
o In a negatively skewed distribution, the tail is longer on the left, and the bulk
of the values are on the right.
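The reasoning in both cases can be written as a small rule of thumb. The function name `skew_direction` is hypothetical, and the mean/median comparison is only a heuristic: it usually points in the right direction but is not a formal skewness measure.

```python
def skew_direction(mean, median):
    """Rule-of-thumb skew direction from the mean/median comparison."""
    if mean > median:
        return "positively skewed (right-skewed)"
    if mean < median:
        return "negatively skewed (left-skewed)"
    return "approximately symmetric"


print(skew_direction(48_000, 43_000))  # taxpayer incomes from case (1)
print(skew_direction(3.01, 3.20))      # student GPAs from case (2)
```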
Given Data:
2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3
When mean > median ≥ mode, it generally indicates a positively skewed
distribution. Here the mean (7.80) is greater than the median (5), which equals the mode (5).
The higher values such as 17 and 28 increase the mean, pulling it to the right.
The bulk of the data is on the lower side (2 to 8), and the tail stretches to the right.
Given data:
2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3
Mean:
Sum of values = 117, number of values = 15
Mean = 117 / 15 = 7.80
Median:
Sorted data: 2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28
Middle value = 8th value = 5
Mode:
Most frequent value = 5 (appears 3 times)
Final Conclusion:
Mean: 7.80
Median: 5
Mode: 5
Shape: Positively Skewed
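These values can be cross-checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are arbitrary.

```python
import statistics

data = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

mean = statistics.mean(data)      # 117 / 15
median = statistics.median(data)  # 8th of the 15 sorted values
mode = statistics.mode(data)      # most frequent value

print(f"Mean: {mean:.2f}, Median: {median}, Mode: {mode}")
# Mean: 7.80, Median: 5, Mode: 5
```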
QUESTION 3
The goodness-of-fit test is commonly used to determine how well a model's predicted values
match the observed data. It measures how accurately the model captures the pattern in the
data.
Example:
If we are predicting house prices based on size, but we ignore other important variables like
location or number of rooms, our model will have a poor fit and low accuracy.
The steps below work through Simple Linear Regression in more detail, one step at a
time.
Dataset Recap:
x    y
54   67000
42   43000
49   55000
57   71000
35   25000
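Step-by-Step Calculation:
Using formula:
b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²  and  a = ȳ - b·x̄, giving the fitted line ŷ = a + b·x
x̄ = 237 / 5 = 47.4, ȳ = 261000 / 5 = 52200
x     y       x - x̄    y - ȳ     (x - x̄)(y - ȳ)   (x - x̄)²
54    67000     6.6     14800      97680           43.56
42    43000    -5.4     -9200      49680           29.16
49    55000     1.6      2800       4480            2.56
57    71000     9.6     18800     180480           92.16
35    25000   -12.4    -27200     337280          153.76
Sum                               669600          321.20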
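The column layout above matches the deviation form of ordinary least squares, so the sketch below assumes that form: slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and intercept = ȳ - slope·x̄. It also computes R² as one common goodness-of-fit measure. The variable names are illustrative, and R² is an assumption about which fit measure is intended.

```python
# (x, y) pairs from the Dataset Recap
xs = [54, 42, 49, 57, 35]
ys = [67000, 43000, 55000, 71000, 25000]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of cross-deviations and squared deviations (last two table columns)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# R^2 as a simple goodness-of-fit measure: 1 - SS_res / SS_tot
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.4f}")
```

An R² close to 1 means the fitted line explains most of the variation in y for these five points. With so few observations, that should not be read as evidence the model generalizes.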
QUESTION 4
A dictionary in Python is a collection of key: value pairs that works like a map. Unlike
data types that hold only a single value per element, each dictionary element pairs a key
with a value. Since Python 3.7, dictionaries preserve insertion order.
Each key in a dictionary is unique, and it is used to access the corresponding value.
Syntax:
my_dict = {
    "key1": "value1",
    "key2": "value2"
}
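A short sketch of how keys are used to access values, and of what key uniqueness means in practice. The key `"key3"` and the value `"updated"` are placeholders chosen for this example.

```python
my_dict = {"key1": "value1", "key2": "value2"}

print(my_dict["key1"])            # look up a value by its key
print(my_dict.get("key3", None))  # .get() avoids a KeyError for absent keys

# Keys are unique: assigning to an existing key replaces its value
my_dict["key1"] = "updated"
print(len(my_dict))               # still two entries
```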
(i) Initialize two dictionaries (D1 and D2) with key and value pairs.
D1 = {'a': 1, 'b': 2, 'c': 3}
D2 = {'b': 2, 'c': 4, 'd': 5}
(ii) Compare those two dictionaries with master key list ‘M’ and print the
missing keys.
M = ['a', 'b', 'c', 'd', 'e'] # master key list
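One possible way to do the comparison is sketched below. The variable names `missing_in_D1` and `missing_in_D2` are illustrative, and list comprehensions are used so the result keeps the order of `M`.

```python
D1 = {'a': 1, 'b': 2, 'c': 3}
D2 = {'b': 2, 'c': 4, 'd': 5}
M = ['a', 'b', 'c', 'd', 'e']  # master key list

# A key is "missing" if it is in M but not in the dictionary
missing_in_D1 = [key for key in M if key not in D1]
missing_in_D2 = [key for key in M if key not in D2]

print("Missing in D1:", missing_in_D1)
print("Missing in D2:", missing_in_D2)
```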
Output:
Missing in D1: ['d', 'e']
Missing in D2: ['a', 'e']
If both dictionaries have the same key, values from D2 will override values from D1.
D3 = {**D1, **D2}
print("Merged Dictionary D3:", D3)
Output:
Merged Dictionary D3: {'a': 1, 'b': 2, 'c': 4, 'd': 5}
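For comparison, the same merge can be sketched with `dict.update()`, which makes the override order explicit. On Python 3.9 and later, `D1 | D2` is another equivalent spelling. This is an alternative illustration, not a replacement for the unpacking form used in the answer.

```python
D1 = {'a': 1, 'b': 2, 'c': 3}
D2 = {'b': 2, 'c': 4, 'd': 5}

D3 = dict(D1)   # shallow copy so D1 itself is left unchanged
D3.update(D2)   # values from D2 override those from D1 on shared keys

print(D3)
print(D3 == {**D1, **D2})  # same result as the unpacking form
```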
START
|
v
Create D1 → {'a':1, 'b':2, 'c':3}
Create D2 → {'b':2, 'c':4, 'd':5}
|
v
Create master key list M → ['a','b','c','d','e']
|
v
[Find missing keys]
|
├── Missing in D1 → M - keys in D1 = ['d', 'e']
└── Missing in D2 → M - keys in D2 = ['a', 'e']
|
v
[Find keys in D1 but not in D2]
→ Compare D1.keys() - D2.keys() = {'a'}
|
v
[Merge dictionaries]
→ D3 = {**D1, **D2}
Result: {'a':1, 'b':2, 'c':4, 'd':5}
|
v
END
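The "keys in D1 but not in D2" step of the flowchart relies on the fact that `dict.keys()` returns a set-like view. A minimal sketch follows; the names `only_in_D1`, `only_in_D2`, and `in_both` are chosen for illustration.

```python
D1 = {'a': 1, 'b': 2, 'c': 3}
D2 = {'b': 2, 'c': 4, 'd': 5}

# dict.keys() returns a set-like view, so set operators work directly on it
only_in_D1 = D1.keys() - D2.keys()
only_in_D2 = D2.keys() - D1.keys()
in_both = D1.keys() & D2.keys()

print(only_in_D1)        # {'a'}
print(only_in_D2)        # {'d'}
print(sorted(in_both))   # ['b', 'c']
```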
Key Notes:
5 (a) - Python Code to Project the Globe on a 2D Flat Surface Using Scatter Plot
Objective:
Python Code:
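import matplotlib.pyplot as plt

# City coordinates as (latitude, longitude), taken from the Explanation section
cities = {
    "Delhi": (28.61, 77.20),
    "Mumbai": (19.07, 72.87),
    "Chennai": (13.08, 80.27),
}
lats = [lat for lat, lon in cities.values()]
lons = [lon for lat, lon in cities.values()]

# Plotting: longitude on the x-axis, latitude on the y-axis
plt.figure(figsize=(10, 6))
plt.scatter(lons, lats, color="red")
plt.title("Major Indian Cities on a 2D Globe Projection", fontsize=14)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
# Annotate cities
for city, (lat, lon) in cities.items():
    plt.text(lon + 0.5, lat + 0.5, city, fontsize=10)
plt.show()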
Explanation:
Cylindrical Projection: Longitude is plotted directly on the x-axis and latitude on the y-axis (an equirectangular layout), which gives a flattened 2D view of the globe.
Cities Plotted:
o Delhi (28.61°N, 77.20°E)
o Mumbai (19.07°N, 72.87°E)
o Chennai (13.08°N, 80.27°E)
The map is simple and does not require Basemap or other geospatial libraries, which makes
it suitable for academic purposes.
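To show what "flattening" means numerically, the sketch below maps latitude and longitude to positions on a flat canvas using the equirectangular rule, assumed here to be the cylindrical projection meant above. The function name `equirectangular` and the default 360 x 180 canvas size are assumptions for this example.

```python
def equirectangular(lat, lon, width=360.0, height=180.0):
    """Map latitude/longitude in degrees to (x, y) on a width x height canvas.

    x grows eastward from longitude -180; y grows downward from latitude +90,
    as in image coordinates.
    """
    x = (lon + 180.0) / 360.0 * width
    y = (90.0 - lat) / 180.0 * height
    return x, y


# City coordinates as listed in the explanation above
cities = {
    "Delhi": (28.61, 77.20),
    "Mumbai": (19.07, 72.87),
    "Chennai": (13.08, 80.27),
}
for city, (lat, lon) in cities.items():
    print(city, equirectangular(lat, lon))
```

The mapping is linear in both coordinates, which is why a plain scatter plot of longitude against latitude already behaves like this projection. Distances and areas become increasingly distorted away from the equator.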