
Unit 2

🔷 1. The Data Science Process - Case Study: RealDirect


RealDirect: An online real estate firm that helps users buy and sell homes using data-driven decisions.

📌 Steps in Data Science Process:


1. Define Objective: e.g., Improve home sale prediction

2. Data Collection: Property listings, user interactions

3. Data Cleaning: Remove duplicates, missing values

4. Exploratory Data Analysis (EDA): Understand features like price, location

5. Model Building: Predict time-to-sell or price

6. Evaluation: RMSE, R² scores

7. Deployment: Integrated into the RealDirect platform

📈 Diagram: Data Science Process


[Collect] → [Clean] → [EDA] → [Model] → [Evaluate] → [Deploy]
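The first three stages above can be sketched in Python with pandas. This is a minimal illustration on a toy in-memory table; the column names (`size_sqft`, `price`) and values are invented, not actual RealDirect data.

```python
# A minimal sketch of collect -> clean -> EDA on a toy listings table.
# Column names and values are invented, not actual RealDirect data.
import pandas as pd

# Collect: a tiny in-memory dataset standing in for gathered listings
listings = pd.DataFrame({
    "size_sqft": [800, 1200, 1200, None, 1500],
    "price": [100000, 150000, 150000, 90000, 200000],
})

# Clean: drop exact duplicates and rows with missing values
clean = listings.drop_duplicates().dropna()

# EDA: summary statistics for each feature
print(clean.describe())
```

In practice the collection step would pull from a database, an API, or scraped pages rather than a hard-coded table, but the clean/EDA calls stay the same.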

---

🔷 2. Three Basic Machine Learning Algorithms


✅ Linear Regression
Predicts a continuous outcome using a straight line.

Formula: y = mx + b

📉 Use Case: Predict house prices based on size.
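A least-squares fit of y = mx + b can be done in a few lines of NumPy. The (size, price) pairs below are made up so that price is exactly 200 × size, purely for illustration.

```python
# Fitting y = m*x + b by least squares on made-up (size, price) pairs.
import numpy as np

sizes = np.array([800.0, 1000.0, 1200.0, 1500.0])            # square feet
prices = np.array([160000.0, 200000.0, 240000.0, 300000.0])  # exactly 200 * size

m, b = np.polyfit(sizes, prices, deg=1)  # slope m and intercept b
predicted = m * 1100 + b                 # price estimate for an 1100 sq ft home
print(m, b, predicted)                   # slope ~200, intercept ~0
```

Because the toy data lies exactly on a line, the fit recovers m ≈ 200 and b ≈ 0; real housing data would leave residual error.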


✅ k-Nearest Neighbors (k-NN)
Classifies based on 'k' closest data points.

No explicit training phase; k-NN is a "lazy learner" that defers all computation to prediction time.

📌 Example: Classify a new home’s neighborhood based on similar homes nearby.


📍 Diagram:
New point → check k nearest neighbors → assign most common class
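The diagram above can be turned into a short hand-rolled classifier. The home coordinates and neighborhood labels below are invented for illustration.

```python
# Hand-rolled k-NN: classify a query point by majority vote of its
# k nearest labeled neighbors. Data below is invented for illustration.
from collections import Counter
import math

def knn_classify(points, labels, query, k=3):
    # sort labeled points by Euclidean distance to the query point
    by_distance = sorted(zip(points, labels),
                         key=lambda pl: math.dist(pl[0], query))
    # majority vote among the k nearest neighbors
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

homes = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8)]
areas = ["downtown", "downtown", "downtown", "suburb", "suburb"]
print(knn_classify(homes, areas, (2, 2)))  # → downtown
```

Note there is no fit step: the "model" is just the stored training points, which is exactly what makes k-NN lazy.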

✅ k-Means Clustering
Unsupervised learning algorithm.

Groups data into k clusters by minimizing each point's distance to its cluster centroid.

📊 Example: Segment customers into groups based on buying behavior.
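A minimal version of Lloyd's algorithm makes the assign/update loop concrete. The customer data is invented, and the initial centroids are taken deterministically as the first k points for reproducibility; real implementations usually pick them at random.

```python
# A minimal k-means (Lloyd's algorithm) on invented customer data.
# Initial centroids are the first k points, for reproducibility;
# real implementations usually initialize them at random.

def kmeans(points, k, iters=20):
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k), key=lambda c: (x - centroids[c][0]) ** 2
                                            + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids, clusters

# (visits per month, average spend) for six hypothetical customers
customers = [(1, 10), (2, 12), (1, 11), (9, 90), (10, 95), (8, 85)]
centroids, clusters = kmeans(customers, k=2)
print([len(c) for c in clusters])  # two segments of three customers each
```

The two obvious groups (low-spend and high-spend customers) are recovered after a couple of iterations.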


---

🔷 3. Motivating Application: Filtering Spam


📨 Goal: Classify emails as Spam or Not Spam.
❌ Why Linear Regression is a poor choice:
Predicts a continuous output, which is a poor fit for a binary label like spam/not spam.

Sensitive to outliers.

❌ Why k-NN is not ideal:


Emails are high-dimensional (one dimension per word), so distance computations are slow.

Requires a distance metric, which is hard to define well for text data.

---

🔷 4. Naive Bayes for Spam Filtering


✅ Why Naive Bayes works well:
Assumes independence between features (words).

Calculates probability that an email is spam based on word occurrences.

Very fast and effective on high-dimensional data like text.

📌 Formula:
P(Spam | Words) ∝ P(Spam) × ∏ P(wordᵢ | Spam)

The product over individual words is what the "naive" independence assumption buys: the joint likelihood P(Words | Spam) factorizes into per-word probabilities.

📩 Example: If words like “FREE” and “WIN” appear → High probability of spam.
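A toy version of this computation, working in log-space to avoid underflow. All the word counts below are invented; a real filter would estimate them from a labeled training corpus.

```python
# A toy Naive Bayes spam score. All counts below are invented.
import math

# word -> (emails containing it among 50 spam, among 50 ham)
word_counts = {"free": (40, 2), "win": (30, 1), "meeting": (1, 25)}
n_spam = n_ham = 50

def log_score(words, class_idx, n_class, n_total=100):
    # log P(class) + sum of log P(word | class), with add-one smoothing
    score = math.log(n_class / n_total)
    for w in words:
        score += math.log((word_counts[w][class_idx] + 1) / (n_class + 2))
    return score

email = ["free", "win"]
spam_score = log_score(email, 0, n_spam)
ham_score = log_score(email, 1, n_ham)
print("spam" if spam_score > ham_score else "not spam")  # → spam
```

Summing logs instead of multiplying raw probabilities is the standard trick: with thousands of words, the raw product would underflow to zero.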
---

🔷 5. Data Wrangling: APIs & Web Scraping


📌 APIs (Application Programming Interfaces):
Structured way to access online data.

Example: Twitter API, Google Maps API

import requests
response = requests.get("https://api.twitter.com/...")

📌 Web Scraping Tools:


Extracts data from HTML pages.

Tools: BeautifulSoup, Scrapy, Selenium

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_data, 'html.parser')

📊 Use Case: Collect housing prices from websites like Zillow.
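As a concrete sketch of that use case, the snippet below parses an inline HTML string standing in for a fetched listings page; the markup and prices are invented, not Zillow's actual page structure.

```python
# Extract listing prices from an inline HTML snippet with BeautifulSoup.
# The markup below is invented, standing in for a fetched listings page.
from bs4 import BeautifulSoup

html_data = """
<ul>
  <li class="listing"><span class="price">$350,000</span></li>
  <li class="listing"><span class="price">$420,000</span></li>
</ul>
"""

soup = BeautifulSoup(html_data, "html.parser")
prices = [tag.get_text() for tag in soup.select("li.listing span.price")]
print(prices)  # → ['$350,000', '$420,000']
```

In a real scraper, `html_data` would come from `requests.get(...).text`, and the CSS selectors would be read off the target site's actual markup.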
