Unit 1
🔷 1. Big Data and Data Science
📌 Big Data:
Refers to datasets so large or complex that traditional data-processing software cannot store or process them efficiently.
Characterized by 5 Vs:
Volume: Large amount of data
Velocity: Speed at which data is generated and processed
Variety: Different types (text, images, video)
Veracity: Trustworthiness of data (accuracy, noise, missing values)
Value: Insights derived from data
📈 Diagram: 5 V's of Big Data
Volume
|
Variety -- Value
|
Velocity
|
Veracity
📌 Data Science:
Interdisciplinary field combining:
Statistics
Computer Science
Domain Knowledge
Uses data to derive insights, make predictions, and aid decision-making.
Example: Using transaction data to recommend products on Amazon.
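The recommendation example can be sketched with simple co-purchase counts. This is only an illustration, not how Amazon's system works; the `transactions` data and the `recommend` helper are hypothetical.

```r
# Hypothetical purchase history: one row per (customer, product) pair.
transactions <- data.frame(
  customer = c("a", "a", "b", "b", "b", "c", "c", "d", "d"),
  product  = c("phone", "case", "phone", "case", "charger",
               "phone", "charger", "phone", "case")
)

# Customer-by-product matrix: 1 if the customer bought the product, else 0.
baskets <- unclass(table(transactions$customer, transactions$product))

# Product-by-product matrix counting customers who bought both items.
co_purchases <- crossprod(baskets)
diag(co_purchases) <- 0  # ignore a product's match with itself

# Recommend the item most often bought together with `product`.
recommend <- function(product) {
  names(which.max(co_purchases[product, ]))
}

recommend("phone")  # "case" (3 shared customers vs. 2 for "charger")
```

Real recommenders use far richer signals, but the idea is the same: past transactions become counts, and counts drive the suggestion.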
---
🔷 2. Datafication - Current Landscape of Perspectives
Datafication: Converting various aspects of life into data (e.g., social media behavior, health records).
Many aspects of life, from emotions to business processes, can now be measured and recorded.
Example: Fitbit tracks physical activities → Converts to data → Used for fitness insights.
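A minimal sketch of the tracker example, assuming made-up hourly step readings and a hypothetical goal of 5,000 steps per day. It shows raw activity becoming rows of data and then a simple fitness insight.

```r
# Hypothetical raw step-counter readings: one row per recorded hour.
readings <- data.frame(
  day   = c("Mon", "Mon", "Mon", "Tue", "Tue", "Wed"),
  steps = c(1200, 3400, 2900, 800, 1500, 6100)
)

# Activity -> data: total steps per day.
daily <- aggregate(steps ~ day, data = readings, FUN = sum)

# Data -> insight: did the user reach the (assumed) 5,000-step goal?
daily$met_goal <- daily$steps >= 5000
daily
```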
📊 Current Perspective:
Social Media → Behavioral Analysis
IoT Devices → Smart Homes
Healthcare → Predictive Diagnosis
---
🔷 3. Statistical Inference - Populations and Samples
📌 Populations:
Entire group we're interested in studying.
Example: All diabetics in India.
📌 Samples:
Subset of the population that is actually observed; statistical inference uses it to draw conclusions about the whole population.
Example: 1,000 diabetics surveyed.
📈 Diagram:
Population
|________ Sample
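The relationship between a population and a sample can be tried out in R. This sketch assumes a simulated population (the glucose values, mean, and spread are invented for illustration) and estimates the population mean from a random sample of 1,000.

```r
set.seed(42)  # make the random draws reproducible

# Simulated population: a glucose reading for 100,000 hypothetical patients.
population <- rnorm(100000, mean = 150, sd = 30)

# Simple random sample of 1,000 patients, drawn without replacement.
sample_1000 <- sample(population, size = 1000)

mean(population)   # population parameter (unknown in practice)
mean(sample_1000)  # sample statistic used to estimate it

# Standard error and an approximate 95% confidence interval for the mean.
se <- sd(sample_1000) / sqrt(length(sample_1000))
ci <- mean(sample_1000) + c(-1.96, 1.96) * se
ci
```

The sample mean will not equal the population mean exactly; the confidence interval expresses how far off it is likely to be.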
---
🔷 4. Statistical Modeling, Probability Distributions, Fitting a Model
📌 Statistical Modeling:
Creating a mathematical model to represent real-world processes.
Helps in prediction and decision-making.
📌 Probability Distributions:
Describes how likely each possible outcome of a random variable is.
Types:
Discrete: Binomial, Poisson
Continuous: Normal, Exponential
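Base R provides functions for each of these distributions (prefix `d` for the density or probability mass, `p` for the cumulative probability). The parameter values below are arbitrary choices for illustration.

```r
# Discrete distributions: probability of an exact count.
p_binom <- dbinom(3, size = 10, prob = 0.5)  # exactly 3 heads in 10 fair flips
p_pois  <- dpois(2, lambda = 4)              # exactly 2 events when 4 are expected

# Continuous distributions: density at a point, or probability of a range.
d_norm <- dnorm(0, mean = 0, sd = 1)         # height of the standard normal curve at 0
p_exp  <- pexp(1, rate = 2)                  # P(X <= 1) for an exponential with rate 2

p_binom  # 0.1171875 (that is, 120 / 1024)

# The probabilities of all outcomes of a discrete distribution sum to 1.
sum(dbinom(0:10, size = 10, prob = 0.5))
```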
📉 Normal Distribution:
Bell-shaped curve, symmetric about the mean (highest at the mean, with tails falling off on both sides)
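To see the normal curve and check how much probability lies near the mean, a short sketch using the standard normal (mean 0, standard deviation 1, chosen here only for convenience):

```r
mu    <- 0
sigma <- 1

# Probability of falling within 1 and 2 standard deviations of the mean.
within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
within_2sd <- pnorm(mu + 2 * sigma, mu, sigma) - pnorm(mu - 2 * sigma, mu, sigma)

within_1sd  # about 0.683
within_2sd  # about 0.954

# Draw the density curve.
curve(dnorm(x, mean = mu, sd = sigma),
      from = mu - 4 * sigma, to = mu + 4 * sigma,
      xlab = "x", ylab = "Density", main = "Normal distribution")
```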
📌 Fitting a Model:
Process of estimating a model's parameters (e.g., the slope and intercept of a linear regression) so that the model matches the observed data as closely as possible.
Example: Predicting house prices using square footage.
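The house-price example can be sketched with `lm()`, R's built-in linear regression function. The five houses below are invented data, so the fitted numbers mean nothing beyond illustration.

```r
# Hypothetical training data: size in square feet and sale price.
houses <- data.frame(
  sqft  = c(850, 1200, 1500, 1800, 2400),
  price = c(130000, 175000, 220000, 255000, 340000)
)

# Fit price = intercept + slope * sqft by least squares.
fit <- lm(price ~ sqft, data = houses)

coef(fit)  # estimated intercept and price increase per extra square foot

# Use the fitted model to predict the price of a 2,000 sq ft house.
pred <- predict(fit, newdata = data.frame(sqft = 2000))
pred
```

`summary(fit)` would additionally report R-squared and standard errors, which help judge how well the line fits.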
---
🔷 5. Intro to R (Programming Language)
R is a programming language used for statistical computing and graphics.
Popular for data analysis, visualization, and machine learning.
📌 Basic R Commands:
# Create vector
x <- c(1, 2, 3, 4)
# Summary statistics
summary(x)
# Plotting
plot(x)
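Beyond vectors, most data analysis in R works on data frames (tables with named columns). A small sketch with invented student marks:

```r
# Create a data frame (hypothetical data).
scores <- data.frame(
  student = c("Asha", "Ben", "Chen"),
  marks   = c(72, 85, 91)
)

str(scores)         # structure: column names and types
scores$marks        # access one column as a vector
mean(scores$marks)  # average of a column

# Keep only the rows where marks exceed 80.
top <- scores[scores$marks > 80, ]
top
```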
---
🔷 6. Exploratory Data Analysis (EDA) and Data Science Process
📌 EDA:
First step in data analysis, used to understand patterns in the data before modeling.
Uses graphs, charts, and summary statistics.
📊 Basic Tools:
Histograms: Frequency of values
Box Plots: Summary of distribution
Scatter Plots: Relation between two variables
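Each of these tools is a single call in base R. The sketch below uses `mtcars`, a small dataset that ships with R; the choice of columns is just for illustration.

```r
data(mtcars)  # built-in dataset of 32 cars

summary(mtcars$mpg)  # summary statistics for fuel economy

# Histogram: how often each range of values occurs.
hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon")

# Box plot: median, quartiles, and extremes at a glance.
boxplot(mtcars$mpg, horizontal = TRUE, xlab = "Miles per gallon")

# Scatter plot: relation between weight and fuel economy.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# A negative correlation backs up what the scatter plot shows.
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
wt_mpg_cor
```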
📈 Diagram: Box Plot Components
Min | Q1 | Median | Q3 | Max
----|----|--------|----|----
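The five box plot components can be computed directly. A sketch with invented weekly sales figures:

```r
# Hypothetical weekly sales (in thousands), already sorted for readability.
sales <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

# Five-number summary: Min, Q1, Median, Q3, Max.
five <- setNames(fivenum(sales), c("Min", "Q1", "Median", "Q3", "Max"))
five  # 2, 4, 7, 10, 15

# Interquartile range: the width of the box (Q3 - Q1).
box_width <- five[["Q3"]] - five[["Q1"]]
box_width  # 6
```

Note that `fivenum()` uses Tukey's hinges, which can differ slightly from the quartiles returned by `quantile()` for some sample sizes; for this data they agree.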
Example: Exploring sales data before predicting future sales.
📌 Data Science Process:
1. Define Problem
2. Collect Data
3. Clean Data
4. EDA
5. Modeling
6. Evaluation
7. Deployment
🔁 Diagram:
[Define] → [Collect] → [Clean] → [EDA] → [Model] → [Evaluate] → [Deploy]
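The seven steps can be walked through in miniature. Everything here is an assumption made for illustration: the question, the `raw` data, the choice of hold-out rows, and the `predict_sales` helper.

```r
# 1. Define problem: can advertising spend predict weekly sales?

# 2. Collect data (hypothetical, with one missing value).
raw <- data.frame(
  ad_spend = c(10, 15, NA, 25, 30, 35, 40, 45),
  sales    = c(25, 33, 41, 52, 61, 68, 83, 90)
)

# 3. Clean data: drop rows with missing values.
clean <- na.omit(raw)

# 4. EDA: summary statistics and strength of the relationship.
summary(clean)
cor(clean$ad_spend, clean$sales)

# 5. Modeling: fit on most rows, hold out two rows for evaluation.
holdout <- c(2, 5)
train   <- clean[-holdout, ]
test    <- clean[holdout, ]
model   <- lm(sales ~ ad_spend, data = train)

# 6. Evaluation: root mean squared error on the held-out rows.
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$sales - pred)^2))
rmse

# 7. Deployment: wrap the model in a function others can call.
predict_sales <- function(ad_spend) {
  unname(predict(model, newdata = data.frame(ad_spend = ad_spend)))
}
predict_sales(50)
```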
---
🔷 7. Philosophy of EDA
Emphasizes letting the data speak for itself.
Pioneered by John Tukey (Exploratory Data Analysis, 1977).
Focuses on:
Visual inspection
Discovering patterns
Detecting outliers
Hypothesis generation (not testing)
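Outlier detection is often done with the 1.5 × IQR rule used by box plots (Tukey's fences). A sketch on invented values, where one reading is clearly unusual:

```r
# Hypothetical measurements with one suspicious value.
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 58)

# Quartiles and interquartile range.
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: points beyond 1.5 * IQR from the quartiles are flagged.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outliers <- x[x < lower | x > upper]
outliers  # 58
```

A flagged point is a prompt for investigation (data-entry error or a genuinely rare case?), which fits EDA's role of generating hypotheses rather than testing them.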
📌 Quote:
"EDA is detective work, not confirmation." (paraphrasing Tukey)