Unit 1

Dmw

Uploaded by

srgimt485

Unit 1

🔷 1. Big Data and Data Science


📌 Big Data:
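Refers to datasets so large or complex that traditional data-processing software cannot store or process them efficiently.

Characterized by the 5 Vs:

Volume: Scale of the data collected and stored

Velocity: Speed at which data is generated, ingested, and processed

Variety: Different types and formats (text, images, video)

Veracity: Trustworthiness of the data, i.e., how much noise or uncertainty it carries

Value: Useful insights that can be derived from the data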

📈 Diagram: 5 V's of Big Data


Volume
|
Variety -- Value
|
Velocity
|
Veracity

📌 Data Science:
Interdisciplinary field combining:

Statistics

Computer Science

Domain Knowledge

Uses data to derive insights, make predictions, and aid decision-making.

Example: Using transaction data to recommend products on Amazon.


---

🔷 2. Datafication - Current Landscape of Perspectives


Datafication: Converting various aspects of life into data (e.g., social media behavior, health
records).

Many things, from emotions to business processes, can now be captured and measured as data, at least in part.

Example: Fitbit tracks physical activities → Converts to data → Used for fitness insights.

📊 Current Perspective:
Social Media → Behavioral Analysis

IoT Devices → Smart Homes

Healthcare → Predictive Diagnosis

---

🔷 3. Statistical Inference - Populations and Samples


📌 Populations:
Entire group we're interested in studying.

Example: All diabetics in India.

📌 Samples:
Subset of the population that is actually observed. Statistical inference uses the sample to draw conclusions about the whole population.

Example: 1,000 diabetics surveyed.

📈 Diagram:
Population
|________ Sample
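
The population/sample idea can be sketched in R. This is a minimal illustration on simulated numbers: the population size, the mean of 150, and the standard deviation of 30 are assumptions made up for the example, not real patient data.

```r
# Hypothetical population: one measurement for each of 100,000 people.
# The values are simulated for illustration only.
set.seed(42)
population <- rnorm(100000, mean = 150, sd = 30)

# Draw a simple random sample of 1,000 without replacement.
sample_1000 <- sample(population, size = 1000)

# The sample mean is used as an estimate of the population mean.
pop_mean    <- mean(population)
sample_mean <- mean(sample_1000)

# Standard error: typical distance between the sample mean and the population mean.
std_error <- sd(sample_1000) / sqrt(length(sample_1000))

cat("Population mean:", round(pop_mean, 2), "\n")
cat("Sample mean:    ", round(sample_mean, 2), "\n")
cat("Standard error: ", round(std_error, 2), "\n")
```

In practice the population mean is unknown; it is computed here only to show that the sample estimate lands close to it.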

---

🔷 4. Statistical Modeling, Probability Distributions, Fitting a Model


📌 Statistical Modeling:
Creating a mathematical model to represent real-world processes.

Helps in prediction and decision-making.

📌 Probability Distributions:
Describe how likely each possible outcome of a random variable is.

Types:

Discrete: Binomial, Poisson

Continuous: Normal, Exponential
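
Base R ships density and cumulative-probability functions for these four distributions. The sketch below shows one call for each; the parameter values (10 coin flips, a rate of 4, and so on) are arbitrary choices for illustration.

```r
# Discrete: P(exactly 3 heads in 10 fair coin flips), Binomial(size = 10, prob = 0.5).
p_binom <- dbinom(3, size = 10, prob = 0.5)

# Discrete: P(exactly 2 events) when the average count is 4, Poisson(lambda = 4).
p_pois <- dpois(2, lambda = 4)

# Continuous: P(X <= 1) for a standard Normal(mean = 0, sd = 1).
p_norm <- pnorm(1, mean = 0, sd = 1)

# Continuous: P(T <= 2) for an Exponential distribution with rate 0.5.
p_exp <- pexp(2, rate = 0.5)

print(c(binomial = p_binom, poisson = p_pois,
        normal = p_norm, exponential = p_exp))
```

For discrete distributions the `d*` functions return the probability of a single outcome. For continuous distributions a single point has probability zero, so the `p*` functions are used to get the probability of a range.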

📉 Normal Distribution:
Bell-shaped curve, symmetric about the mean.

📌 Fitting a Model:
Process of choosing a model (e.g., linear regression) and estimating its parameters so that it matches the observed data as closely as possible.

Example: Predicting house prices using square footage.
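
A minimal sketch of this example with R's `lm()`. The data are simulated, and the "true" relationship (a base price of 50,000 plus 120 per square foot) is an assumption invented for the illustration, not a real market figure.

```r
# Simulated data: price = 50000 + 120 * sqft + noise (assumed for illustration).
set.seed(1)
sqft  <- runif(200, min = 500, max = 3500)
price <- 50000 + 120 * sqft + rnorm(200, mean = 0, sd = 20000)
houses <- data.frame(sqft = sqft, price = price)

# Fit price as a linear function of square footage by least squares.
fit <- lm(price ~ sqft, data = houses)

# Estimated intercept and slope; the slope should land near 120.
print(coef(fit))

# Predict the price of a hypothetical 2,000 sq ft house.
new_house <- data.frame(sqft = 2000)
predicted <- predict(fit, newdata = new_house)
cat("Predicted price:", round(predicted), "\n")
```

Fitting here means estimating the two coefficients; `summary(fit)` would additionally report their standard errors and the R-squared of the fit.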

---

🔷 5. Intro to R (Programming Language)


R is a programming language used for statistical computing and graphics.

Popular for data analysis, visualization, and machine learning.

📌 Basic R Commands:
# Create vector
x <- c(1, 2, 3, 4)

# Summary statistics
summary(x)

# Plotting
plot(x)
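
A few more commands in the same style, offered as a small sketch of how vectors behave in R (the variable names are arbitrary):

```r
# Element-wise arithmetic: the operation applies to every element at once.
x <- c(1, 2, 3, 4)
doubled <- x * 2

# Indexing starts at 1; a logical condition selects the matching elements.
first  <- x[1]
larger <- x[x > 2]

# Common descriptive statistics.
avg    <- mean(x)
spread <- sd(x)

print(doubled)
print(larger)
cat("mean =", avg, " sd =", round(spread, 3), "\n")
```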

---

🔷 6. Exploratory Data Analysis (EDA) and Data Science Process


📌 EDA:
An early step in data analysis, done before modeling, to understand patterns in the data.

Uses graphs, charts, summary statistics.

📊 Basic Tools:
Histograms: Frequency of values

Box Plots: Five-number summary of a distribution, with outliers shown as separate points

Scatter Plots: Relation between two variables

📈 Diagram: Box Plot Components


Min | Q1 | Median | Q3 | Max
----|----|--------|----|----

Example: Exploring sales data before predicting future sales.
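
The tools listed above map directly onto base R functions. The sketch below runs them on simulated monthly sales; the data, the `ad_spend` variable, and the relationship between the two are assumptions made up for the example.

```r
# Hypothetical monthly sales data, simulated for illustration only.
set.seed(7)
ad_spend <- runif(60, min = 1000, max = 5000)
sales    <- 20000 + 4 * ad_spend + rnorm(60, mean = 0, sd = 3000)

# Summary statistics: min, Q1, median, mean, Q3, max.
print(summary(sales))

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
five <- fivenum(sales)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)

# Histogram: how often values fall in each range.
hist(sales, main = "Distribution of monthly sales", xlab = "Sales")

# Box plot: compact view of the distribution and any outliers.
boxplot(sales, main = "Monthly sales", ylab = "Sales")

# Scatter plot: relation between two variables.
plot(ad_spend, sales, main = "Sales vs. ad spend",
     xlab = "Ad spend", ylab = "Sales")
```

When run as a script rather than interactively, R writes the plots to a PDF file in the working directory instead of opening a window.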

📌 Data Science Process:


1. Define Problem

2. Collect Data

3. Clean Data

4. EDA

5. Modeling

6. Evaluation

7. Deployment

🔁 Diagram:
[Define] → [Collect] → [Clean] → [EDA] → [Model] → [Evaluate] → [Deploy]
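
The seven steps can be walked through end to end in a few lines of R. This is a toy sketch under stated assumptions: the data are simulated, the prediction task is hypothetical, and `predict_sales` is a made-up helper standing in for a real deployment.

```r
# 1. Define problem: predict sales from ad spend (hypothetical task).

# 2. Collect: simulated here; in practice this might be read.csv() or a database query.
set.seed(123)
n <- 200
raw <- data.frame(ad_spend = runif(n, 1000, 5000))
raw$sales <- 20000 + 4 * raw$ad_spend + rnorm(n, sd = 3000)
raw$sales[c(5, 50)] <- NA   # inject two missing values to clean up later

# 3. Clean: drop rows with missing values.
clean <- na.omit(raw)

# 4. EDA: quick look at the cleaned data.
print(summary(clean))

# 5. Model: fit on an 80% training split and hold out the rest.
train_idx <- sample(nrow(clean), size = floor(0.8 * nrow(clean)))
train   <- clean[train_idx, ]
holdout <- clean[-train_idx, ]
model <- lm(sales ~ ad_spend, data = train)

# 6. Evaluate: root mean squared error on the held-out rows.
pred <- predict(model, newdata = holdout)
rmse <- sqrt(mean((holdout$sales - pred)^2))
cat("Held-out RMSE:", round(rmse), "\n")

# 7. Deploy: wrap the model in a function other code can call.
predict_sales <- function(ad_spend) {
  unname(predict(model, newdata = data.frame(ad_spend = ad_spend)))
}
print(predict_sales(3000))
```

Evaluating on rows the model never saw during fitting gives a more honest error estimate than reusing the training data.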

---

🔷 7. Philosophy of EDA


Emphasizes letting data speak for itself.

Developed by John Tukey.

Focuses on:

Visual inspection

Discovering patterns

Detecting outliers

Hypothesis generation (not testing)
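
One common way to detect outliers is the 1.5 × IQR rule associated with Tukey's box plot. The sketch below applies it to a small made-up vector; the numbers are hypothetical and chosen so that two values stand out.

```r
# Hypothetical measurements with two unusual values mixed in.
values <- c(12, 14, 13, 15, 14, 16, 13, 15, 14, 48, 13, 2)

# Quartiles and interquartile range (IQR).
q1  <- quantile(values, 0.25, names = FALSE)
q3  <- quantile(values, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey's fences: points more than 1.5 * IQR beyond the quartiles are flagged.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- values[values < lower | values > upper]

print(outliers)
```

With these numbers the quartiles are 13 and 15, so the fences sit at 10 and 18 and the values 48 and 2 are flagged. A flagged point is a prompt to investigate, not proof of an error.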

📌 Key idea (paraphrasing Tukey):
EDA is detective work, not confirmation.
