Unit 2
🔷 1. The Data Science Process - Case Study: RealDirect
RealDirect is an online real estate firm that helps users buy and sell homes by making
data-driven decisions.
📌 Steps in Data Science Process:
1. Define Objective: e.g., Improve home sale prediction
2. Data Collection: Property listings, user interactions
3. Data Cleaning: Remove duplicates, handle missing values
4. Exploratory Data Analysis (EDA): Understand features like price, location
5. Model Building: Predict time-to-sell or price
6. Evaluation: RMSE, R² scores
7. Deployment: Integrated into the RealDirect platform
📈 Diagram: Data Science Process
[Collect] → [Clean] → [EDA] → [Model] → [Evaluate] → [Deploy]
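The Evaluation step above can be sketched with toy numbers (the price lists here are made up for illustration):

```python
import math

# Hypothetical actual vs. predicted sale prices (in $1000s)
actual = [300, 450, 520, 610]
predicted = [310, 440, 500, 630]

# RMSE: square root of the mean squared prediction error
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# R²: 1 - (residual sum of squares / total sum of squares)
mean_a = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot
```

A lower RMSE and an R² closer to 1 indicate a better-fitting model.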
---
🔷 2. Three Basic Machine Learning Algorithms
✅ Linear Regression
Predicts a continuous outcome using a straight line.
Formula: y = mx + b
📉 Use Case: Predict house prices based on size.
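A minimal sketch of fitting y = mx + b by ordinary least squares, using made-up house sizes and prices:

```python
# Hypothetical data: house size (sq ft) vs. price ($1000s)
sizes = [1000, 1500, 2000, 2500]
prices = [200, 280, 360, 440]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Least-squares slope m and intercept b for y = mx + b
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
    sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - m * mean_x

predicted = m * 1800 + b  # predicted price for an 1800 sq ft house (≈ 328)
```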
✅ k-Nearest Neighbors (k-NN)
Classifies a point by the majority class among its 'k' closest data points.
No explicit training phase (lazy learning).
📌 Example: Classify a new home’s neighborhood based on similar homes nearby.
📍 Diagram:
New point → check k nearest neighbors → assign most common class
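The diagram above can be sketched from scratch; the homes and labels below are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical labeled homes: (size in sq ft, price in $1000s) -> neighborhood
homes = [((1000, 200), 'A'), ((1100, 210), 'A'),
         ((2000, 400), 'B'), ((2100, 420), 'B')]

def knn_classify(point, data, k=3):
    """Assign the majority class among the k nearest neighbors."""
    by_distance = sorted(data, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

label = knn_classify((1050, 205), homes)  # new point near the 'A' homes
```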
✅ k-Means Clustering
Unsupervised learning algorithm.
Groups data into k clusters by minimizing the distance of each point to its cluster centroid.
📊 Example: Segment customers into groups based on buying behavior.
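A minimal 1-D sketch of the k-means loop (assign points to the nearest centroid, then move each centroid to its cluster mean), on made-up customer spend values:

```python
import statistics

# Hypothetical customer spend values, clearly split into two groups
spend = [10, 12, 14, 90, 95, 100]

def kmeans_1d(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centroids = [statistics.mean(pts) if pts else c
                     for c, pts in clusters.items()]
    return sorted(centroids)

centroids = kmeans_1d(spend, centroids=[0, 50])  # converges to one centroid per group
```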
---
🔷 3. Motivating Application: Filtering Spam
📨 Goal: Classify emails as Spam or Not Spam.
❌ Why Linear Regression is a poor choice:
Predicts a continuous output, while spam detection is a binary classification problem.
Sensitive to outliers.
❌ Why k-NN is not ideal:
Emails are high-dimensional (one dimension per word), so computing distances is slow.
Requires a distance metric, which is hard to define meaningfully for text data.
---
🔷 4. Naive Bayes for Spam Filtering
✅ Why Naive Bayes works well:
Assumes conditional independence between features (words) given the class.
Calculates probability that an email is spam based on word occurrences.
Very fast and effective in high-dimensional data like text.
📌 Formula:
P(Spam | Words) ∝ P(Words | Spam) * P(Spam)
📩 Example: If words like “FREE” and “WIN” appear → High probability of spam.
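The formula above can be sketched from scratch on a tiny made-up corpus, using log-probabilities and Laplace smoothing (comparing unnormalized scores, as the ∝ indicates):

```python
import math
from collections import Counter

# Tiny hypothetical training corpus
spam = ["free win money", "win free prize"]
ham = ["meeting schedule today", "project schedule update"]

spam_words = Counter(w for msg in spam for w in msg.split())
ham_words = Counter(w for msg in ham for w in msg.split())
vocab = set(spam_words) | set(ham_words)

def log_score(words, counts, prior):
    """log P(class) + sum of log P(word | class), with Laplace (add-one) smoothing."""
    total = sum(counts.values())
    score = math.log(prior)
    for w in words:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

# Classify by comparing P(Words | Spam)·P(Spam) vs. P(Words | Ham)·P(Ham)
msg = "free win".split()
is_spam = log_score(msg, spam_words, 0.5) > log_score(msg, ham_words, 0.5)
```

Logs are used because multiplying many small word probabilities underflows floating point.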
---
🔷 5. Data Wrangling: APIs & Web Scraping
📌 APIs (Application Programming Interfaces):
Structured way to access online data.
Example: Twitter API, Google Maps API
import requests  # third-party HTTP client library
response = requests.get("https://api.twitter.com/...")  # URL elided; returns a Response object
📌 Web Scraping Tools:
Extracts data from HTML pages.
Tools: BeautifulSoup, Scrapy, Selenium
from bs4 import BeautifulSoup
html_data = "<span class='price'>$350,000</span>"  # example HTML snippet
soup = BeautifulSoup(html_data, 'html.parser')
price = soup.find('span', class_='price').text  # extracts '$350,000'
📊 Use Case: Collect housing prices from websites like Zillow.