
INTRODUCTION TO DATA SCIENCE

COURSE FILE

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
(DATA SCIENCE)
(2025-2026)

Faculty In-Charge: J. Ramesh                HOD-CSE
SYLLABUS

II Year I Semester

INTRODUCTION TO DATA SCIENCE


COURSE OBJECTIVES:

From this course the student will learn:

1. Knowledge and expertise needed to become a data scientist.
2. Essential concepts of statistics and machine learning that are vital for data science.
3. The significance of exploratory data analysis (EDA) in data science.
4. How to critically evaluate data visualizations presented on dashboards.
5. The suitability and limitations of tools and techniques related to the data science process.

UNIT I: Introduction to Data Science: benefits and uses, facets of data, the data science process in brief, the big data ecosystem and data science. The Data Science process: overview, defining goals and creating a project charter, retrieving data, cleansing, integrating and transforming data, exploratory analysis, model building, presenting findings and building applications on top of them.

UNIT II: Applications of machine learning in Data Science: the role of ML in DS, Python tools like sklearn, the modelling process for feature engineering, model selection, validation and prediction, types of ML, semi-supervised learning. Handling large data: problems and general techniques for handling large data, programming tips for dealing with large data, case studies on DS projects for predicting malicious URLs and for building recommender systems.

UNIT III: The NoSQL movement for handling Big Data: distributing data storage and processing with the Hadoop framework, case study on risk assessment for loan sanctioning, the ACID principle of relational databases, the CAP theorem, the BASE principle of NoSQL databases, types of NoSQL databases, case study on disease diagnosis and profiling.

UNIT IV: Tools and Applications of Data Science: introducing Neo4j for dealing with graph databases, the graph query language Cypher, applications of graph databases, Python libraries like nltk and SQLite for handling text mining and analytics, case study on classifying Reddit posts.

UNIT V: Data Visualization and Prototype Application Development: data visualization options, Crossfilter (the JavaScript MapReduce library), creating an interactive dashboard with dc.js, dashboard development tools. Applying the Data Science process to real-world problem-solving scenarios as a detailed case study.
Textbooks:
1) Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, "Introducing Data Science using Python Tools", Manning Publications Co. / Dreamtech Press, 2016.
2) Prateek Gupta, "Data Science with Jupyter", BPB Publications, 2019 (for basics).

Reference Books:
1) Joel Grus, "Data Science from Scratch", O'Reilly, 2019.
2) Cathy O'Neil and Rachel Schutt, "Doing Data Science: Straight Talk from the Frontline", 1st Edition, O'Reilly, 2013.

UNIT-I
Introduction to Data Science

What is Data Science?


Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured and unstructured data. It combines elements from statistics,
computer science, domain expertise, and data analysis to solve complex problems and make data-driven
decisions.

Key Components of Data Science


1. Mathematics & Statistics: Foundation for algorithms and data analysis

2. Programming: Tools to process and analyze data (Python, R, SQL)

3. Domain Knowledge: Understanding the specific field you're working in

4. Data Wrangling: Cleaning and preparing raw data for analysis


5. Machine Learning: Building predictive models from data

6. Data Visualization: Communicating insights through graphical representations

The Data Science Process


1. Problem Definition: Understanding the business problem or question

2. Data Collection: Gathering relevant data from various sources

3. Data Cleaning: Handling missing values, outliers, and inconsistencies

4. Exploratory Data Analysis (EDA): Understanding patterns and relationships

5. Feature Engineering: Creating meaningful variables for modeling

6. Model Building: Applying statistical and machine learning techniques

7. Model Evaluation: Assessing model performance

8. Deployment: Implementing the solution in real-world scenarios

9. Monitoring & Maintenance: Ensuring continued performance

Applications of Data Science


 Predictive analytics

 Recommendation systems (e.g., Netflix, Amazon)

 Fraud detection

 Natural language processing

 Image and speech recognition

 Healthcare diagnostics

 Financial modeling

 Supply chain optimization

Essential Tools and Technologies


 Programming languages: Python, R, SQL

 Data manipulation: Pandas, NumPy


 Visualization: Matplotlib, Seaborn, Tableau

 Machine learning: Scikit-learn, TensorFlow, PyTorch

 Big data tools: Hadoop, Spark

 Cloud platforms: AWS, Google Cloud, Azure

Data science continues to evolve rapidly, driving innovation across virtually all industries by turning raw data
into actionable insights and intelligent systems.

Benefits and Uses of Data Science


Key Benefits of Data Science
1. Data-Driven Decision Making

 Helps businesses make informed decisions based on data rather than intuition.

 Reduces guesswork and improves accuracy in strategic planning.

2. Automation and Efficiency

 Automates repetitive tasks (e.g., data cleaning, report generation).

 Optimizes operations in manufacturing, logistics, and supply chains.

3. Predictive Analytics

 Forecasts future trends (e.g., sales, stock prices, customer behavior).

 Used in weather prediction, risk assessment, and preventive healthcare.

4. Personalization & Improved Customer Experience

 Powers recommendation systems (e.g., Netflix, Amazon, Spotify).

 Enables targeted marketing and personalized user experiences.

5. Fraud Detection & Security

 Identifies unusual patterns in financial transactions (banks, credit cards).

 Enhances cybersecurity by detecting anomalies in network traffic.

6. Cost Reduction & Revenue Growth

 Identifies inefficiencies and reduces waste in business processes.

 Helps companies optimize pricing, inventory, and marketing spend.


7. Scientific & Medical Advancements

 Accelerates drug discovery and genomic research.

 Improves disease diagnosis through medical imaging analysis (AI in radiology).

8. Real-Time Analytics

 Enables instant insights in industries like finance (algorithmic trading) and IoT.

 Used in ride-sharing apps (Uber, Lyft) for dynamic pricing and route optimization.

Major Uses of Data Science Across Industries


1. Healthcare

 Predictive diagnostics (early disease detection)

 Drug discovery and personalized medicine

 Hospital resource optimization

2. Finance & Banking

 Fraud detection and risk management

 Algorithmic trading and stock market prediction

 Credit scoring and loan approval automation

3. Retail & E-commerce

 Demand forecasting and inventory management

 Customer segmentation and personalized recommendations

 Dynamic pricing strategies

4. Manufacturing & Logistics

 Predictive maintenance for machinery

 Supply chain optimization

 Quality control using computer vision

5. Entertainment & Social Media


 Content recommendation (YouTube, Netflix)

 Sentiment analysis (brand monitoring on Twitter, Facebook)

 User behavior analytics for engagement optimization

6. Transportation & Smart Cities

 Traffic prediction and route optimization (Google Maps, Waze)

 Autonomous vehicles (Tesla, Waymo)

 Smart infrastructure planning

7. Telecommunications

 Network optimization and outage prediction

 Customer churn prediction

 Fraud detection in call data

8. Government & Public Sector

 Crime prediction and prevention

 Disaster response planning

 Policy-making based on demographic and economic data

Conclusion

Data science transforms raw data into actionable insights, driving innovation across industries. Its benefits
include improved decision-making, automation, cost savings, and enhanced customer experiences. From
healthcare to finance, retail to smart cities, data science is revolutionizing how businesses and organizations
operate.


Facets of Data in Data Science


Data is the foundation of data science, and understanding its different facets is crucial for effective analysis. Here are the key dimensions
of data that influence how it is collected, processed, and analyzed:
1. By Structure
Data can be categorized based on its organization and format:

a) Structured Data

 Organized in a fixed schema (rows and columns).

 Easily stored and queried using SQL.

 Examples:
o Relational databases (MySQL, PostgreSQL)

o Spreadsheets (Excel, Google Sheets)

b) Unstructured Data

 No predefined format or schema.

 Requires advanced techniques (NLP, computer vision) for processing.

 Examples:
o Text (emails, social media posts)

o Images, videos, audio files

o PDFs, Word documents

c) Semi-Structured Data

 Not fully structured but has some organizational properties.

 Often stored in JSON, XML, or NoSQL formats.

 Examples:
o Web logs, sensor data

o Emails (metadata is structured, content is unstructured)
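
To make the structural distinction concrete, the short sketch below loads a structured CSV table with pandas and a semi-structured JSON record with Python's json module (the file names are hypothetical placeholders):

import json
import pandas as pd

# Structured data: fixed schema, rows and columns
df = pd.read_csv("sales.csv")                 # hypothetical CSV file
print(df.dtypes)                              # every column has a single declared type

# Semi-structured data: nested keys, schema may vary from record to record
with open("weblog.json") as f:                # hypothetical JSON file
    record = json.load(f)
print(record.get("user", {}).get("id"))       # fields may or may not be present

# Flatten the nested record into a tabular form when needed
flat = pd.json_normalize(record)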

2. By Source
a) Internal Data

 Generated within an organization (e.g., sales records, CRM data).


 Typically structured and well-documented.

b) External Data

 Collected from outside sources (e.g., social media, government datasets).

 May require cleaning and validation.

c) Public vs. Private Data

 Public: Open datasets (e.g., Kaggle, government data portals).

 Private: Proprietary data (e.g., customer transactions, internal reports).

3. By Nature
a) Quantitative Data

 Numerical and measurable.

 Types:
o Discrete (whole numbers, e.g., "number of customers").

o Continuous (infinite values, e.g., "temperature").

b) Qualitative (Categorical) Data

 Non-numerical, descriptive.

 Types:
o Nominal (no order, e.g., "gender," "country").

o Ordinal (ordered categories, e.g., "survey ratings: Poor, Good, Excellent").

4. By Time Dependency
a) Static (Batch) Data

 Collected at a specific point in time (e.g., census data).

b) Streaming (Real-Time) Data

 Continuously generated (e.g., stock market feeds, IoT sensors).


5. By Scale (Big Data 4 Vs)
a) Volume

 The sheer size of data (e.g., terabytes, petabytes).

b) Velocity

 Speed at which data is generated (e.g., social media posts per second).

c) Variety

 Different types of data (structured, unstructured, semi-structured).

d) Veracity

 Data quality, reliability, and noise.

6. By Use Case
a) Training Data

 Used to train machine learning models.

b) Test Data

 Used to evaluate model performance.

c) Production Data

 Real-world data used in deployed systems.

Conclusion

Understanding the different facets of data helps in:


✔ Choosing the right storage and processing tools.
✔ Applying appropriate preprocessing techniques.
✔ Selecting suitable machine learning models.


Data Science Process (Brief Overview)

The data science process is a structured approach to solving problems using data. Here’s a simplified breakdown:
1. Problem Definition

 Understand the business objective.

 Define key questions to answer with data.

2. Data Collection

 Gather relevant datasets (databases, APIs, web scraping, etc.).

3. Data Cleaning & Preprocessing

 Handle missing values, outliers, and inconsistencies.

 Format data for analysis (normalization, encoding).

4. Exploratory Data Analysis (EDA)

 Visualize data to uncover patterns, trends, and anomalies.

 Perform statistical summaries.

5. Feature Engineering

 Select, transform, or create meaningful variables for modeling.

6. Model Building

 Choose algorithms (regression, classification, clustering).

 Train machine learning models on the data.

7. Model Evaluation

 Test performance using metrics (accuracy, precision, RMSE, etc.).

 Optimize and fine-tune the model.

8. Deployment

 Integrate the model into production (APIs, dashboards, apps).

9. Monitoring & Maintenance

 Track model performance over time.

 Retrain with new data if needed.

Key Points:
 Iterative: Steps often repeat (e.g., revisiting EDA after model failure).

 Domain Knowledge Matters: Context improves data interpretation.

 End Goal: Deliver actionable insights or automated decisions.
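
A minimal end-to-end sketch of these steps using pandas and scikit-learn (the dataset, file name, and column names are hypothetical placeholders):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Steps 2-3: collect and clean the data
df = pd.read_csv("customers.csv")                       # hypothetical dataset
df = df.dropna(subset=["age", "income", "churned"])     # drop rows with missing values

# Steps 4-5: quick EDA and feature selection
print(df.describe())
X = df[["age", "income"]]
y = df["churned"]

# Steps 6-7: build and evaluate a model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 8: deployment would expose model.predict() behind an API or dashboard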



UNIT-II

Applications of Machine Learning in Data Science

Machine Learning (ML) is a core component of data science, enabling systems to learn from data and make predictions or
decisions. Below are key applications across industries:

1. Predictive Analytics
 Sales Forecasting: Predict future sales based on historical trends.

 Stock Market Prediction: Analyze market trends for trading strategies.

 Weather Forecasting: Improve accuracy using historical climate data.

2. Natural Language Processing (NLP)


 Chatbots & Virtual Assistants (e.g., Siri, Alexa)

 Sentiment Analysis: Classify emotions in social media/text.

 Machine Translation (e.g., Google Translate)

3. Computer Vision
 Facial Recognition (e.g., iPhone Face ID, surveillance)

 Medical Imaging: Detect tumors in X-rays/MRIs.

 Autonomous Vehicles: Self-driving cars (Tesla, Waymo).

4. Recommendation Systems
 E-commerce: Amazon’s "Customers who bought this also bought…"

 Streaming Services: Netflix/Spotify personalized recommendations.

5. Fraud Detection & Cybersecurity


 Credit Card Fraud: Detect unusual transactions in real-time.

 Network Intrusion Detection: Identify cyber threats.

6. Healthcare & Medicine


 Disease Prediction: Early diagnosis (e.g., diabetes, cancer).

 Drug Discovery: Accelerate research using AI models.

7. Customer Insights & Personalization


 Churn Prediction: Identify customers likely to leave.

 Dynamic Pricing: Adjust prices based on demand (Uber, airlines).

8. Manufacturing & IoT


 Predictive Maintenance: Detect machine failures before they happen.

 Quality Control: Automate defect detection in production lines.

9. Financial Services
 Credit Scoring: Assess loan eligibility using ML models.

 Algorithmic Trading: Automate stock trading strategies.

10. Social Media & Advertising


 Targeted Ads: Facebook/Google ad personalization.

 Content Moderation: Detect hate speech/fake news.

Why ML Matters in Data Science?

✔ Automation: Reduces manual analysis.


✔ Scalability: Handles large datasets efficiently.
✔ Continuous Learning: Models improve with more data.


The Role of Machine Learning in Data Science

Machine Learning (ML) is a core pillar of Data Science (DS), enabling systems to learn from data, identify patterns,
and make decisions with minimal human intervention. Below is a breakdown of its key roles:
1. Automating Data Analysis
 Replaces manual processes (e.g., statistical modeling, rule-based systems).

 Handles large datasets efficiently (Big Data applications).

2. Predictive Modeling
 Forecasting trends (sales, stock prices, weather).

 Risk assessment (fraud detection, loan approvals).

3. Pattern Recognition & Classification


 Image/Video Analysis (facial recognition, medical imaging).

 Text Classification (spam detection, sentiment analysis).

4. Personalization & Recommendations


 Recommender systems (Netflix, Amazon, Spotify).

 Dynamic pricing (Uber surge pricing, airline tickets).

5. Anomaly Detection
 Fraud detection (credit card transactions).

 Network security (identifying cyber threats).

6. Optimization & Decision-Making


 Supply chain/logistics (route optimization, demand forecasting).

 A/B testing automation (optimizing marketing campaigns).

7. Natural Language Processing (NLP)


 Chatbots & virtual assistants (Siri, ChatGPT).

 Text summarization & translation (Google Translate).

8. Reinforcement Learning (Advanced AI)


 Self-driving cars (Tesla, Waymo).

 Game AI (AlphaGo, Chess engines).


How ML Complements Data Science

Data Science                                     Machine Learning
Focuses on data collection, cleaning, and EDA    Focuses on building predictive models
Uses statistics & visualization                  Uses algorithms to learn from data
Solves business problems with insights           Automates decision-making & predictions

Key ML Techniques Used in Data Science

1. Supervised Learning (Regression, Classification)

2. Unsupervised Learning (Clustering, Dimensionality Reduction)

3. Deep Learning (Neural Networks for complex tasks)

4. Reinforcement Learning (AI systems that learn by trial & error)

Conclusion

Machine Learning enhances Data Science by:


✅ Automating repetitive analysis tasks.
✅ Improving accuracy in predictions.
✅ Enabling real-time decision-making.

Without ML, Data Science would rely more on manual statistical analysis and rule-based systems, limiting scalability and
efficiency.


Essential Python Tools for Machine Learning in Data Science

Python is the dominant language in Data Science (DS) and Machine Learning (ML), thanks to its rich
ecosystem of libraries. Below are the key Python tools (like scikit-learn) used for ML tasks in DS:

1. Machine Learning Frameworks


🔹 Scikit-learn (sklearn)
 Best for: Traditional ML (supervised/unsupervised learning).

 Key Features:
o Simple API for classification, regression, clustering.

o Built-in datasets & model evaluation tools.

o Integrates with NumPy, Pandas.

 Example Use Cases:


# X_train, y_train, and X_test are assumed to be prepared beforehand
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

🔹 TensorFlow & Keras

 Best for: Deep Learning (Neural Networks).

 Key Features:
o High-performance numerical computation.

o Keras provides a simpler interface for quick prototyping.

 Example Use Cases:


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

🔹 PyTorch

 Best for: Research & custom DL models.

 Key Features:
o Dynamic computation graphs (flexibility).
o Preferred in academia and cutting-edge AI.

 Example Use Cases:


import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 2)
)

2. Data Handling & Preprocessing


🔹 Pandas (pandas)

 Data manipulation (cleaning, filtering, aggregations).

 Example:
import pandas as pd

df = pd.read_csv('data.csv')
df.fillna(df.mean(numeric_only=True), inplace=True)  # Handle missing values

🔹 NumPy (numpy)

 Numerical computing (arrays, matrices, math operations).

 Example:
import numpy as np
X = np.array([[1, 2], [3, 4]])

🔹 SciPy (scipy)

 Advanced math & stats (optimization, signal processing).


3. Feature Engineering & Model Tuning
🔹 Feature-engine

 Automated feature selection & transformation.

🔹 Scikit-learn’s Pipeline & ColumnTransformer

 Streamline preprocessing + modeling in one workflow.

 Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])

🔹 Optuna / Hyperopt

 Hyperparameter tuning (automate model optimization).

4. Model Evaluation & Explainability


🔹 Scikit-learn’s metrics

 Accuracy, Precision, Recall, ROC-AUC, etc.

from sklearn.metrics import classification_report


print(classification_report(y_test, predictions))

🔹 SHAP / LIME

 Explain ML predictions (black-box interpretability).

🔹 Yellowbrick
 Visual model diagnostics (feature importance, residuals).

5. Deployment & Production


🔹 Flask / FastAPI

 Deploy ML models as APIs.

🔹 ONNX Runtime

 Optimize models for production.

🔹 MLflow

 Track experiments & manage model lifecycle.

Summary Table: Top Python ML Tools

Task Best Python Library

Classic ML Scikit-learn (sklearn)

Deep Learning TensorFlow / PyTorch

Data Wrangling Pandas, NumPy

Feature Engineering Feature-engine, Pipeline

Hyperparameter Tuning Optuna, Hyperopt

Model Explainability SHAP, LIME

Deployment Flask, FastAPI, MLflow

Final Thoughts

 For beginners: Start with scikit-learn + pandas.

 For deep learning: Learn TensorFlow/Keras or PyTorch.

 For real-world projects: Use MLflow for tracking & Flask for APIs.
Feature Engineering Modeling Process

Feature engineering is the process of transforming raw data into meaningful features to improve
ML model performance. Below is a structured step-by-step modeling process for feature
engineering in Python (using libraries like pandas, scikit-learn, and feature-engine).

Step 1: Understand the Data


 Perform Exploratory Data Analysis (EDA) to identify:
o Missing values

o Outliers

o Data distributions

o Correlations

Tools:
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")
print(df.info())                            # Check data types & missing values
sns.heatmap(df.corr(numeric_only=True))     # Correlation matrix

Step 2: Handle Missing Data


Options:

1. Drop missing values (if small %):

df.dropna(inplace=True)

2. Impute missing values (mean/median/mode):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
df[["Age"]] = imputer.fit_transform(df[["Age"]])

3. Advanced imputation (KNN, regression-based):

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
df[["Income"]] = imputer.fit_transform(df[["Income"]])

Step 3: Encode Categorical Variables


A) Label Encoding (for ordinal data)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Education"] = le.fit_transform(df["Education"])  # Maps each category to an integer

B) One-Hot Encoding (for nominal data)

df = pd.get_dummies(df, columns=["City"])  # Creates binary columns (e.g., City_NewYork)

C) Target Encoding (for high-cardinality categories)

from feature_engine.encoding import MeanEncoder

encoder = MeanEncoder(variables=["Category"])
df = encoder.fit_transform(df, df["Target"])
Step 4: Scale/Normalize Numerical Features
A) Standardization (for Gaussian-like data)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["Age", "Income"]] = scaler.fit_transform(df[["Age", "Income"]])

B) Min-Max Scaling (for neural networks)


from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["Age", "Income"]] = scaler.fit_transform(df[["Age", "Income"]])

Step 5: Feature Creation


A) Polynomial Features (for non-linear relationships)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = poly.fit_transform(df[["Age", "Income"]])

B) Binning (for converting numeric → categorical)

df["Age_Group"] = pd.cut(df["Age"], bins=[0, 18, 35, 60, 100],
                         labels=["Child", "Young", "Adult", "Senior"])

C) Date-Time Features

df["Year"] = pd.to_datetime(df["Date"]).dt.year
df["DayOfWeek"] = pd.to_datetime(df["Date"]).dt.dayofweek

Step 6: Feature Selection


A) Correlation-Based
corr_matrix = df.corr()
high_corr_features = corr_matrix[abs(corr_matrix["Target"]) > 0.5]

B) Model-Based (Feature Importance)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
importance = pd.Series(model.feature_importances_,
                       index=X.columns).sort_values(ascending=False)

C) Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE

rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=5)
rfe.fit(X_train, y_train)
selected_features = X.columns[rfe.support_]

Step 7: Automate with Pipelines


from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["Age", "Income"]),
    ("cat", categorical_transformer, ["Gender", "City"])
])

# Combine preprocessing with the model
full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier())
])

full_pipeline.fit(X_train, y_train)

Summary: Key Steps in Feature Engineering

Step                    Purpose                         Python Tools

Missing Data Handling   Fill/drop NaN values            SimpleImputer, KNNImputer

Categorical Encoding    Convert text → numbers          LabelEncoder, OneHotEncoder

Feature Scaling         Normalize numerical features    StandardScaler, MinMaxScaler

Feature Creation        Generate new features           PolynomialFeatures, pd.cut()

Feature Selection       Remove irrelevant features      RFE, feature_importances_

Pipeline                Automate workflow               sklearn.pipeline

Final Tips

✔ Domain knowledge helps create meaningful features.


✔ Iterate—feature engineering is trial-and-error.
✔ Monitor feature importance to avoid overfitting.

Model Selection in Machine Learning: A Structured Approach

Model selection is the process of choosing the best algorithm for a given dataset and problem
type. Below is a step-by-step guide with Python examples using scikit-learn.

1. Define the Problem Type


First, identify whether your task is:

 Supervised Learning (labeled data)


o Regression (predicting continuous values, e.g., house prices)

o Classification (predicting categories, e.g., spam detection)

 Unsupervised Learning (unlabeled data)


o Clustering (grouping similar data, e.g., customer segmentation)

o Dimensionality Reduction (e.g., PCA for visualization)

2. Select Candidate Models


A) For Regression Problems

Algorithm Use Case Scikit-Learn Class

Linear Regression Simple linear relationships LinearRegression()

Decision Tree Non-linear data DecisionTreeRegressor()

Random Forest High accuracy, avoids overfitting RandomForestRegressor()

XGBoost/LightGBM Best for structured data XGBRegressor()

SVR (Kernel SVM) Small datasets, non-linear SVR(kernel='rbf')

B) For Classification Problems

Algorithm Use Case Scikit-Learn Class

Logistic Regression Binary classification LogisticRegression()

Random Forest Robust, handles non-linearity RandomForestClassifier()

SVM High-dimensional data SVC()

XGBoost Imbalanced datasets XGBClassifier()

K-Nearest Neighbors Simple, small datasets KNeighborsClassifier()

C) For Clustering (Unsupervised)

Algorithm Use Case Scikit-Learn Class

K-Means General-purpose clustering KMeans()

DBSCAN Noise-resistant, irregular shapes DBSCAN()

Hierarchical Nested cluster analysis AgglomerativeClustering()

3. Train & Evaluate Models


A) Split Data (Train/Test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

B) Cross-Validation (Avoid Overfitting)


from sklearn.model_selection import cross_val_score
model = RandomForestClassifier()
scores = cross_val_score(model, X_train, y_train, cv=5) # 5-fold CV
print(f"Mean Accuracy: {scores.mean():.2f}")

C) Evaluation Metrics

For Classification:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))

For Regression:
from sklearn.metrics import mean_squared_error, r2_score
print("RMSE:", mean_squared_error(y_test, y_pred, squared=False))
print("R² Score:", r2_score(y_test, y_pred))

4. Hyperparameter Tuning
A) GridSearchCV (Exhaustive Search)
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20]
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Params:", grid.best_params_)
B) RandomizedSearchCV (Faster Alternative)
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=10,
    cv=5
)
random_search.fit(X_train, y_train)

C) Automated Tuning (Optuna, Hyperopt)


import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'max_depth': trial.suggest_int('max_depth', 3, 20)
    }
    model = RandomForestClassifier(**params)
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print("Best Params:", study.best_params)

5. Compare & Select the Best Model


A) Compare Multiple Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"{name}: {score:.3f}")

B) Select Based on Metrics


Model                 Accuracy   Training Speed   Interpretability

Logistic Regression   0.85       Fast             High

Random Forest         0.92       Medium           Medium

XGBoost               0.93       Slow             Low

Trade-offs to Consider:
✔ Accuracy vs. Speed (e.g., XGBoost vs. Logistic Regression)
✔ Interpretability (e.g., decision trees vs. neural networks)
✔ Overfitting Risk (simple models generalize better)

6. Finalize & Deploy the Best Model


best_model = RandomForestClassifier(n_estimators=100, max_depth=10)
best_model.fit(X_train, y_train)

# Save model for deployment


import joblib
joblib.dump(best_model, "model.pkl")

# Load & predict


loaded_model = joblib.load("model.pkl")
predictions = loaded_model.predict(new_data)

Summary: Model Selection Checklist

1. Define problem type (regression/classification/clustering).

2. Select candidate models based on data size & complexity.

3. Evaluate using cross-validation (avoid overfitting).

4. Tune hyperparameters (GridSearchCV/Optuna).

5. Compare models on accuracy, speed, and interpretability.

6. Deploy the best model (save as .pkl or use APIs).


Validation & Prediction in Machine Learning

Validation ensures your model generalizes well to unseen data, while prediction is the final step
where the model makes real-world inferences. Below is a structured breakdown with Python
examples.

1. Validation Techniques
A) Train-Test Split

 Simple validation (70-30% or 80-20% split).

 Best for: Large datasets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

B) K-Fold Cross-Validation

 More robust than a single train-test split.

 Best for: Small datasets.

from sklearn.model_selection import cross_val_score
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5) # 5-fold CV
print(f"Mean Accuracy: {scores.mean():.2f} (±{scores.std():.2f})")

C) Stratified K-Fold

 Preserves class distribution in each fold (critical for imbalanced datasets).

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

D) Time Series Cross-Validation

 For temporal data (prevents data leakage).

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

2. Model Evaluation Metrics


A) For Classification

Metric Usage Code Example

Accuracy Overall correctness accuracy_score(y_test, y_pred)

Precision False positives (e.g., spam filter) precision_score(y_test, y_pred)

Recall False negatives (e.g., fraud detection) recall_score(y_test, y_pred)

F1-Score Balance of precision/recall f1_score(y_test, y_pred)

ROC-AUC Model’s ranking ability roc_auc_score(y_test, y_pred_proba)

B) For Regression

Metric Usage Code Example

RMSE Error magnitude (sensitive to outliers) mean_squared_error(y_test, y_pred, squared=False)

MAE Average error mean_absolute_error(y_test, y_pred)

R² Score Variance explained (0 to 1) r2_score(y_test, y_pred)

3. Making Predictions
A) Single Prediction
# Train model
model = RandomForestClassifier().fit(X_train, y_train)

# Predict class (classification)


y_pred = model.predict(X_test)

# Predict probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probability of class=1

B) Batch Prediction (New Data)


new_data = pd.read_csv("new_samples.csv")
new_data_processed = preprocessor.transform(new_data) # Apply same preprocessing
predictions = model.predict(new_data_processed)

C) Uncertainty Estimation (Advanced)


# Bayesian libraries (e.g., pyro, pymc) can give full predictive distributions.
# A scikit-learn-only alternative is quantile regression with gradient boosting:
from sklearn.ensemble import GradientBoostingRegressor

lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X_train, y_train)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X_train, y_train)
confidence_intervals = list(zip(lower.predict(X_test), upper.predict(X_test)))

4. Validating Predictions
A) Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()

B) Calibration Curves

 Checks if predicted probabilities match real frequencies.

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_test, y_pred_proba, n_bins=10)
plt.plot(prob_pred, prob_true)

C) Residual Analysis (Regression)


residuals = y_test - y_pred
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red')
5. Common Pitfalls & Fixes
Issue Solution

Overfitting Use regularization, cross-validation, or simpler models.

Data Leakage Ensure preprocessing (e.g., scaling) is fit only on training data.

Class Imbalance Use stratified splits, SMOTE, or class weights.

Uncertain Predictions Use models with probability estimates (e.g., predict_proba).

Summary Workflow

1. Split data (train-test or cross-validation).

2. Train model on training set.

3. Validate using metrics (accuracy, RMSE, etc.).

4. Predict on new data (with preprocessing).

5. Monitor performance in production (e.g., drift detection).

Types of Machine Learning

Machine Learning (ML) can be broadly categorized into three main types, each with distinct
approaches and applications. Below is a clear breakdown with examples and use cases.

1. Supervised Learning
Definition: The model learns from labeled data (input-output pairs) to make predictions.

A) Classification

 Predicts discrete categories (e.g., spam/not spam).

 Algorithms:
o Logistic Regression

o Decision Trees

o Random Forest

o SVM (Support Vector Machines)

 Use Cases:
o Email spam detection

o Medical diagnosis (disease/no disease)

o Sentiment analysis

B) Regression

 Predicts continuous values (e.g., house prices).

 Algorithms:
o Linear Regression

o Polynomial Regression

o Ridge/Lasso Regression

 Use Cases:
o Stock price forecasting

o Weather prediction

o Sales trend analysis
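
A minimal, hedged illustration of supervised learning with scikit-learn's built-in iris dataset (a toy classification task, purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # labeled data: features and known classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)     # learn from input-output pairs
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))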

2. Unsupervised Learning
Definition: The model finds patterns in unlabeled data (no predefined outputs).

A) Clustering

 Groups similar data points.

 Algorithms:
o K-Means
o DBSCAN

o Hierarchical Clustering

 Use Cases:
o Customer segmentation

o Anomaly detection (fraud)

o Image compression

B) Dimensionality Reduction

 Reduces features while preserving key information.

 Algorithms:
o PCA (Principal Component Analysis)

o t-SNE

 Use Cases:
o Visualizing high-dimensional data

o Speeding up ML models

C) Association Rule Learning

 Discovers relationships between variables (e.g., "If A, then B").

 Algorithms:
o Apriori

o FP-Growth

 Use Cases:
o Market basket analysis (e.g., "Customers who buy X also buy Y")

o Recommendation systems
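
A minimal sketch of unsupervised learning on the same iris features (the labels are deliberately ignored; the clustering and dimensionality reduction here are purely illustrative):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)                           # unlabeled view of the data
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # group similar samples
X_2d = PCA(n_components=2).fit_transform(X)                 # reduce 4 features to 2
print(clusters[:10], X_2d.shape)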

3. Reinforcement Learning (RL)


Definition: The model learns by trial and error using rewards/penalties.

Key Components:
 Agent: The learner/decision-maker.

 Environment: The world the agent interacts with.

 Reward Signal: Feedback for actions (e.g., +1 for winning, -1 for losing).

Algorithms:

 Q-Learning

 Deep Q-Networks (DQN)

 Policy Gradient Methods

Use Cases:

 Game AI (AlphaGo, Chess engines)

 Robotics (self-learning robots)

 Autonomous vehicles
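
As a toy illustration of the reward-driven learning loop (not a production RL setup), here is a tabular Q-learning sketch on a tiny one-dimensional corridor where the agent earns +1 for reaching the rightmost state:

import random

n_states, actions = 5, [-1, +1]                    # states 0..4; move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2              # learning rate, discount, exploration

for episode in range(500):
    s = 2                                          # start in the middle
    while s != 4:                                  # episode ends at the goal state
        if random.random() < epsilon:              # explore
            a = random.choice(actions)
        else:                                      # exploit the best known action
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == 4 else 0.0
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# Greedy policy learned for each non-terminal state (should prefer moving right)
print({s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)})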

Other Emerging Types


1. Semi-Supervised Learning

 Mix of labeled + unlabeled data (reduces annotation costs).

 Use Cases:
o Speech recognition

o Medical imaging (limited labeled scans)

2. Self-Supervised Learning

 Generates labels from data itself (e.g., predicting missing parts of an image).

 Use Cases:
o Pretraining large language models (GPT, BERT)

3. Transfer Learning

 Reuses a pre-trained model for a new task (e.g., fine-tuning ResNet for medical images).

 Use Cases:
o Computer vision (object detection)
o NLP (text classification)

Summary Table

Type Data Needed Example Algorithms Use Cases

Supervised Labeled (X → y) Random Forest, SVM Spam detection, Forecasting

Unsupervised Unlabeled K-Means, PCA Customer segmentation

Reinforcement Rewards/Penalties Q-Learning, DQN Game AI, Robotics

Semi-Supervised Partial labels Self-training models Speech recognition

Transfer Learning Pre-trained models Fine-tuning BERT, ResNet Medical imaging

Key Takeaways

 Supervised: Best for predictive tasks with labeled data.

 Unsupervised: Ideal for exploratory analysis (no labels).

 Reinforcement: Used for sequential decision-making.

 Hybrids (Semi-Supervised, Transfer): Reduce data dependency.

Semi-Supervised Learning: Bridging the Gap Between Labeled and Unlabeled Data

Semi-supervised learning (SSL) is a hybrid approach that leverages both small amounts of
labeled data and large amounts of unlabeled data to improve model performance. It’s
especially useful when labeling data is expensive or time-consuming (e.g., medical imaging, speech
recognition).

1. Why Use Semi-Supervised Learning?


 Cost-Effective: Reduces reliance on expensive labeled data.

 Improved Accuracy: Unlabeled data helps uncover hidden patterns.

 Widely Applicable: Works in domains like NLP, CV, and bioinformatics.


Example Scenario:

 Labeled data: 1,000 annotated medical images.

 Unlabeled data: 10,000 unannotated images.

 SSL uses both to train a better model than supervised learning alone.

2. Key Techniques in Semi-Supervised Learning


A) Self-Training

1. Train a model on the labeled data.

2. Use the model to predict pseudo-labels for unlabeled data.

3. Retrain the model on the combined dataset (labeled + high-confidence pseudo-labels).

Python Example (Scikit-Learn):


import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier

# X_labeled, y_labeled = 1,000 labeled samples
# X_unlabeled          = 10,000 unlabeled samples
# SelfTrainingClassifier expects one combined dataset in which the
# unlabeled targets are marked with -1.
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])

base_model = RandomForestClassifier()
ssl_model = SelfTrainingClassifier(base_model)
ssl_model.fit(X_all, y_all)

B) Co-Training

 Two models train on different feature subsets and teach each other.

 Requires two independent feature sets (e.g., image pixels + text captions).

C) Label Propagation

 Uses graph-based methods to propagate labels from labeled to unlabeled data.

 Works well when data has a clear manifold structure (e.g., clusters).

Python Example:
from sklearn.semi_supervised import LabelPropagation

# As above, unlabeled points are included with target -1
model = LabelPropagation(kernel='knn', n_neighbors=10)
model.fit(X_all, y_all)

D) Consistency Regularization

 Forces the model to produce similar outputs for slightly perturbed inputs (e.g., image rotations).

 Used in deep learning (e.g., FixMatch, Mean Teacher).

Framework Example (TensorFlow):


# Pseudocode for a consistency loss on unlabeled images
def consistency_loss(unlabeled_images):
    augmented_1 = augment(unlabeled_images)
    augmented_2 = augment(unlabeled_images)
    predictions_1 = model(augmented_1)
    predictions_2 = model(augmented_2)
    return mse(predictions_1, predictions_2)

3. Applications of Semi-Supervised Learning


Domain Use Case Example

Computer Vision Medical image segmentation Tumor detection with few labeled MRI scans

NLP Text classification Sentiment analysis with limited labeled reviews

Speech Automatic speech recognition (ASR) Transcribing rare dialects

Bioinformatics Protein structure prediction Leveraging unlabeled genomic data

4. Challenges & Solutions


Challenge Solution

Noisy pseudo-labels Use confidence thresholds (e.g., discard low-probability predictions).

Class imbalance Balance pseudo-labels across classes.

Overfitting to labels Regularize the model (e.g., dropout, weight decay).


5. When to Use Semi-Supervised Learning?
✔ Limited labeled data but abundant unlabeled data.
✔ Data labeling is expensive (e.g., medical experts needed).
✔ The unlabeled data has useful structure (e.g., clusters).

When NOT to Use It:


❌ Unlabeled data is noisy or irrelevant.
❌ Labeled data is sufficient for supervised learning.

Summary

 Semi-supervised learning combines the best of supervised and unsupervised learning.

 Key methods: Self-training, co-training, label propagation, and consistency regularization.

 Best for: Cost-sensitive domains like healthcare, NLP, and speech processing.

UNIT-III

NoSQL Movement for Handling Big Data

The NoSQL (Not Only SQL) movement emerged as a response to the limitations of
traditional relational databases (RDBMS) in handling Big Data (high volume, velocity, and
variety). NoSQL databases provide scalability, flexibility, and high performance for modern
data-intensive applications.

1. Why NoSQL for Big Data?


Challenge with RDBMS NoSQL Solution

Scalability Issues Horizontal scaling (sharding)

Rigid Schema Schema-less or dynamic schema

Slow for Unstructured Data Handles JSON, XML, graphs, etc.

High Latency in Distributed Systems Optimized for distributed clusters
2. Types of NoSQL Databases
A) Document Stores (e.g., MongoDB, CouchDB)

 Data Model: JSON-like documents.

 Use Cases:
o Content management systems (CMS)

o Real-time analytics (e.g., user profiles)

Example (MongoDB Query):


db.users.insertOne({ name: "Alice", age: 30, hobbies: ["hiking", "coding"] });

B) Key-Value Stores (e.g., Redis, DynamoDB)

 Data Model: Simple key → value pairs.

 Use Cases:
o Caching (Redis)

o Session management

Example (Redis CLI):


SET user:1001 "Alice"
GET user:1001 # Returns "Alice"

C) Column-Family Stores (e.g., Cassandra, HBase)

 Data Model: Columnar storage (optimized for read/write speed).

 Use Cases:
o Time-series data (IoT sensors)

o High-write applications (e.g., logs)

Example (Cassandra CQL):


CREATE TABLE users (id UUID PRIMARY KEY, name TEXT, email TEXT);
D) Graph Databases (e.g., Neo4j, Amazon Neptune)

 Data Model: Nodes + edges (relationships).

 Use Cases:
o Social networks (friend recommendations)

o Fraud detection (transaction links)

Example (Neo4j Cypher Query):


CREATE (Alice:Person {name: "Alice"})-[:FRIENDS_WITH]->(Bob:Person {name: "Bob"});
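
The same graph can be built and queried from Python with the official neo4j driver; the sketch below assumes a local Neo4j instance and placeholder credentials:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder credentials

with driver.session() as session:
    session.run(
        'CREATE (a:Person {name: $a})-[:FRIENDS_WITH]->(b:Person {name: $b})',
        a="Alice", b="Bob"
    )
    result = session.run("MATCH (p:Person)-[:FRIENDS_WITH]->(f) RETURN p.name, f.name")
    for record in result:
        print(record["p.name"], "->", record["f.name"])

driver.close()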

3. NoSQL vs. SQL for Big Data


Feature NoSQL SQL (RDBMS)

Schema Dynamic (flexible) Fixed (rigid)

Scaling Horizontal (distributed) Vertical (single-server)

ACID Compliance Often sacrificed for speed Strictly enforced

Best For Unstructured/semi-structured data Structured data with relations

4. NoSQL in Big Data Ecosystems


A) Hadoop Integration

 HBase (column-family DB for Hadoop) enables random read/write access.

 MongoDB Connector for Hadoop allows querying NoSQL data in MapReduce/Spark.

B) Spark & NoSQL

 Spark SQL can query NoSQL databases (e.g., Cassandra, MongoDB).

 Example (PySpark + MongoDB):

df = spark.read.format("mongo").load()
df.filter(df["age"] > 25).show()

C) Cloud NoSQL Services

 AWS: DynamoDB, DocumentDB

 Google Cloud: Firestore, Bigtable

 Azure: Cosmos DB

5. Challenges of NoSQL
 No Standard Query Language (varies by DB type).

 Eventual Consistency (not all DBs guarantee ACID).

 Limited Joins (requires denormalization).

6. When to Use NoSQL?


✔ Big Data (TB/PB-scale datasets).
✔ High-velocity data (real-time analytics).
✔ Unstructured/semi-structured data (JSON, logs, social media).

When NOT to Use NoSQL:


❌ Complex transactions (e.g., banking systems).
❌ Strong consistency requirements.

Conclusion

The NoSQL movement revolutionized Big Data handling by offering:

 Scalability (distributed architectures).

 Flexibility (schema-less designs).

 Performance (optimized for read/write-heavy workloads).


Popular NoSQL databases like MongoDB, Cassandra, and Redis power modern apps (Netflix,
Uber, Facebook). For Big Data pipelines, NoSQL integrates seamlessly with Hadoop, Spark, and
cloud platforms.

Distributing Data Storage & Processing with the Hadoop Framework

The Hadoop framework is a distributed system designed to store and process Big Data across
clusters of commodity hardware. It provides scalability, fault tolerance, and parallel
processing for handling massive datasets (TB/PB scale).

1. Core Components of Hadoop


A) Hadoop Distributed File System (HDFS)

 Distributed storage that splits large files into blocks (default: 128MB/256MB).

 Key Features:
o Fault Tolerance: Replicates blocks across nodes (default: 3 copies).

o Scalability: Adds nodes to scale storage/processing.

o High Throughput: Optimized for batch processing.

HDFS Architecture:

Component Role

NameNode Master server managing metadata (file→block mapping).

DataNode Worker nodes storing actual data blocks.

Secondary NameNode Performs periodic checkpoints (not a backup!).

B) Yet Another Resource Negotiator (YARN)

 Manages cluster resources (CPU, memory) and schedules jobs.

 Key Components:
o ResourceManager (RM): Global cluster resource manager.

o NodeManager (NM): Per-node agent managing resources.


o ApplicationMaster (AM): Oversees execution of a single job.

C) MapReduce

 Parallel processing model for batch jobs.

 Two Phases:
1. Map: Processes input data (filtering, sorting).

2. Reduce: Aggregates results (e.g., word count).

Example (Word Count in Java):


// Mapper
public void map(LongWritable key, Text value, Context context) {
    String[] words = value.toString().split(" ");
    for (String word : words) {
        context.write(new Text(word), new IntWritable(1));
    }
}

// Reducer
public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
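
The same word count can also be written in Python and run through Hadoop Streaming, which pipes HDFS data through any executable; a minimal mapper/reducer pair might look like this (file names are illustrative):

# mapper.py - emits "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py - sums the counts for each word (Hadoop delivers input sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")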

2. How Hadoop Distributes Data & Processing


Step 1: Data Ingestion

 Data is uploaded to HDFS (e.g., via hadoop fs -put).

 HDFS splits files into blocks and distributes them across DataNodes.

Step 2: Job Submission

 A MapReduce job (or Spark/Flink job) is submitted to YARN.

 YARN’s ResourceManager assigns resources.


Step 3: Data Locality Optimization

 Map tasks run on nodes where the data resides (minimizes network transfer).

 Reduce tasks aggregate results from multiple mappers.

Step 4: Fault Tolerance

 If a DataNode fails, HDFS recovers data from replicas.

 If a task fails, YARN reschedules it on another node.

3. Hadoop Ecosystem Tools


Tool Purpose

Hive SQL-like querying (HiveQL) for Hadoop.

Pig High-level scripting (Pig Latin) for ETL.

HBase NoSQL database for real-time read/write access.

Spark In-memory processing (faster than MapReduce).

ZooKeeper Coordination service for distributed systems.

Sqoop Imports/exports data between Hadoop and RDBMS.

Flume Collects streaming data (e.g., logs) into HDFS.

4. Hadoop vs. Modern Alternatives


Feature Hadoop (MapReduce) Apache Spark

Processing Batch-only Batch + real-time (streaming)

Speed Disk-based (slower) In-memory (100x faster)

Ease of Use Low-level (Java-heavy) High-level APIs (Python/Scala)

Fault Tolerance Recomputes failed tasks Uses RDDs/DAGs for recovery

When to Use Hadoop?


✔ Batch processing of huge datasets (historical analytics).
✔ Cost-effective (runs on commodity hardware).
When to Use Spark/Flink?
✔ Real-time analytics (e.g., fraud detection).
✔ Iterative algorithms (e.g., ML training).

5. Hadoop in the Cloud


 AWS: EMR (Elastic MapReduce)

 Google Cloud: Dataproc

 Azure: HDInsight

Example (AWS EMR):


aws emr create-cluster --name "Hadoop Cluster" \
    --release-label emr-6.5.0 \
    --applications Name=Hadoop Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3

6. Limitations of Hadoop
 Not for real-time processing (use Spark/Flink instead).

 Complex to manage (manual tuning required).

 Slower for small datasets (high overhead).

Conclusion

Hadoop revolutionized Big Data with:


✅ Distributed storage (HDFS) for scalability.
✅ Parallel processing (MapReduce/YARN) for batch jobs.
✅ Fault tolerance via replication.

While Spark and cloud data lakes (e.g., S3 + Athena) are replacing some Hadoop use cases, it
remains vital for large-scale batch processing.
UNIT-I
Introduction: What Is Data Science?

Data science combines math and statistics, specialized programming, advanced analytics,
artificial intelligence (AI), and machine learning with specific subject matter expertise to
uncover actionable insights hidden in an organization’s data. These insights can be used to
guide decision making and strategic planning.

The accelerating volume of data sources, and subsequently of data, has made data science one of the fastest growing fields across every industry. It is increasingly critical to businesses: the insights that data science generates help organizations increase operational efficiency, identify new business opportunities, and improve marketing and sales programs, among other benefits. Ultimately, they can lead to competitive advantages over business rivals.

Big Data and Data Science Hype


Data Science vs. Big Data

 Data Science is an area of study; Big Data is a technique to collect, maintain, and process huge amounts of information.
 Data Science is about the collection, processing, analysis, and use of data in various operations, and is more conceptual; Big Data is about extracting vital and valuable information from a huge amount of data.
 Data Science is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics; Big Data is a technique for tracking and discovering trends in complex data sets.
 The goal of Data Science is to build data-dominant products for a venture; the goal of Big Data is to make data more vital and usable, i.e. by extracting only the important information from the huge data within existing traditional aspects.
 Tools mainly used in Data Science include SAS, R, Python, etc.; tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
 Data Science is a superset of Big Data, since it consists of data scraping, cleaning, visualization, statistics, and many more techniques; Big Data is a subset of Data Science, as its mining activities sit in one stage of the Data Science pipeline.
 Data Science is mainly used for scientific purposes; Big Data is mainly used for business purposes and customer satisfaction.
 Data Science broadly focuses on the science of the data; Big Data is more involved with the processes of handling voluminous data.

So, what is eyebrow-raising about Big Data and data science? Let's count the ways:

1. There's a lack of definitions around the most basic terminology. What is "Big Data" anyway? What does "data science" mean? What is the relationship between Big Data and data science? Is data science the science of Big Data? Is data science only the stuff going on in companies like Google and Facebook and tech companies? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous, they're well-nigh meaningless.

2. There's a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the media describes it, machine learning algorithms were just invented last week and data was never "big" until Google came along. This is simply not the case. Many of the methods and techniques we're using, and the challenges we're facing now, are part of the evolution of everything that's come before. This doesn't mean that there's not new and exciting stuff going on, but we think it's important to show some basic respect for everything that came before.

3. The hype is crazy: people throw around tired phrases straight out of the height of the pre-financial-crisis era like "Masters of the Universe" to describe data scientists, and that doesn't bode well. In general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what's good underneath it all, if anything.

4. Statisticians already feel that they are studying and working on the "Science of Data." That's their bread and butter. Maybe you, dear reader, are not a statistician and don't care, but imagine that for the statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound like it's simply statistics or machine learning in the context of the tech industry. People have said to us, "Anything that has to call itself a science isn't." Although there might be truth in there, that doesn't mean that the term "data science" itself represents nothing, but of course what it represents may not be science but more of a craft.

GettingPasttheHype

Rachel’s experience going from getting a PhD in statistics to working at Google is a great
example to illustrate why we thought, in spite of the aforementioned reasons to be dubious,
there might be some meat in the data science sandwich. In her words:

It was clear to me pretty quickly that the stuff I was working on at Google was different than
anything I had learned at school when I got my PhD in statistics. This is not to say that my
degreewasuseless;farfrom it—whatI’d learnedinschoolprovidedaframeworkandwayof
thinkingthatIreliedondaily,andmuchoftheactualcontentprovidedasolidtheoreticaland practical
foundation necessary to do my work.

But there were also many skills I had to acquire on the job at Google that I hadn't learned in school. Of course, my experience is specific to me in the sense that I had a statistics background and picked up more computation, coding, and visualization skills, as well as domain expertise while at Google. Another person coming in as a computer scientist or a social scientist or a physicist would have different gaps and would fill them in accordingly. But what is important here is that, as individuals, we each had different strengths and gaps, yet we were able to solve problems by putting ourselves together into a data team well-suited to solve the data problems that came our way.

Here's a reasonable response you might have to this story. It's a general truism that, whenever you go from school to a real job, you realize there's a gap between what you learned in school and what you do on the job. In other words, you were simply facing the difference between academic statistics and industry statistics.

We have a couple replies to this:

 Sure, there's a difference between industry and academia. But does it really have to be that way? Why do many courses in school have to be so intrinsically out of touch with reality?
 Even so, the gap doesn't represent simply a difference between industry statistics and academic statistics. The general experience of data scientists is that, at their job, they have access to a larger body of knowledge and methodology, as well as a process, which we now define as the data science process (details in Chapter 2), that has foundations in both statistics and computer science.
Around all the hype, in other words, there is a ring of truth: this is something new. But at the same time, it's a fragile, nascent idea at real risk of being rejected prematurely. For one thing, it's being paraded around as a magic bullet, raising unrealistic expectations that will surely be disappointed.

Rachel gave herself the task of understanding the cultural phenomenon of data science and how others were experiencing it. She started meeting with people at Google, at startups and tech companies, and at universities, mostly from within statistics departments.

From those meetings she started to form a clearer picture of the new thing that's emerging. She ultimately decided to continue the investigation by giving a course at Columbia called "Introduction to Data Science," which Cathy covered on her blog.

Datafication
In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor Mayer-Schoenberger wrote an article called "The Rise of Big Data". In it they discuss the concept of datafication, and their example is how we quantify friendships with "likes": it's the way everything we do, online or otherwise, ends up recorded for later examination in someone's data storage units. Or maybe multiple storage units, and maybe also for sale.

They define datafication as a process of "taking all aspects of life and turning them into data." As examples, they mention that "Google's augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks."

Datafication is an interesting concept and led us to consider its importance with respect to people's intentions about sharing their own data. We are being datafied, or rather our actions are, and when we "like" someone or something online, we are intending to be datafied, or at least we should expect to be. But when we merely browse the Web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors, cameras, or Google glasses.

This spectrum of intentionality ranges from us gleefully taking part in a social media experiment
we are proud of, to all-out surveillance and stalking. But it’s all datafication. Our intentions
may run the gamut, but the results don’t.
They follow up their definition in the article with a line that speaks volumes about their
perspective:

Once we datafy things, we can transform their purpose and turn the information into new forms of value.

Here's an important question that we will come back to throughout the book: who is "we" in that case? What kinds of value do they refer to? Mostly, given their examples, the "we" is the modelers and entrepreneurs making money from getting people to buy stuff, and the "value" translates into something like increased efficiency through automation.

The Current Landscape
So, what is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is it pure hype? And if it's new and if it's real, what does that mean?

This is an ongoing discussion, but one way to understand what's going on in this industry is to look online and see what current discussions are taking place. This doesn't necessarily tell us what data science is, but it at least tells us what other people think it is, or how they're perceiving it. For example, on Quora there's a discussion from 2010 about "What is Data Science?" and here's Metamarket CEO Mike Driscoll's answer:

Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics. But data science is not merely hacking—because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics. And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it. Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what's possible.

Driscoll then refers to Drew Conway's Venn diagram of data science from 2010, shown in Figure 1-1.
He also mentions the sexy skills of data geeks from Nathan Yau's 2009 post, "Rise of the Data Scientist", which include:

• Statistics (traditional analysis you're used to thinking about)

• Data munging (parsing, scraping, and formatting data)

• Visualization (graphs, tools, etc.)

But wait, is data science just a bag of tricks? Or is it the logical extension of other fields like statistics and machine learning? For one argument, see Cosma Shalizi's posts here and here, and Cathy's posts here and here, which constitute an ongoing discussion of the difference between a statistician and a data scientist. Cosma basically argues that any statistics department worth its salt does all the stuff in the descriptions of data science that he sees, and therefore data science is just a rebranding and unwelcome takeover of statistics. For a slightly different perspective, see ASA President Nancy Geller's 2011 Amstat News article, "Don't shun the 'S' word", in which she defends statistics:

We need to tell people that Statisticians are the ones who make sense of the data deluge occurring in science, engineering, and medicine; that statistics provides methods for data analysis in all fields, from art history to zoology; that it is exciting to be a Statistician in the 21st century because of the many challenges brought about by the data explosion in all of these fields.

Though we get her point—the phrase "art history to zoology" is supposed to represent the concept of A to Z—she's kind of shooting herself in the foot with these examples because they don't correspond to the high-tech world where much of the data explosion is coming from. Much of the development of the field is happening in industry, not academia. That is, there are people with the job title data scientist in companies, but no professors of data science in academia. (Though this may be changing.)
Not long ago, DJ Patil described how he and Jeff Hammerbacher—then at LinkedIn and Facebook, respectively—coined the term "data scientist" in 2008. So that is when "data scientist" emerged as a job title. (Wikipedia finally gained an entry on data science in 2012.) It makes sense to us that once the skill set required to thrive at Google—working with a team on problems that required a hybrid skill set of stats and computer science paired with personal characteristics including curiosity and persistence—spread to other Silicon Valley tech companies, it required a new job title. Once it became a pattern, it deserved a name. And once it got a name, everyone and their mother wanted to be one. It got even worse when Harvard Business Review declared data scientist to be the "Sexiest Job of the 21st Century".

The Role of the Social Scientist in Data Science

Both LinkedIn and Facebook are social network companies. Oftentimes a description or definition of data scientist includes hybrid statistician, software engineer, and social scientist. This made sense in the context of companies where the product was a social product and still makes sense when we're dealing with human or user behavior. But if you think about Drew Conway's Venn diagram, data science problems cross disciplines—that's what the substantive expertise is referring to. In other words, it depends on the context of the problems you're trying to solve.

If they're social science-y problems like friend recommendations or people you know or user segmentation, then by all means, bring on the social scientist! Social scientists also do tend to be good question askers and have other good investigative qualities, so a social scientist who also has the quantitative and programming chops makes a great data scientist. But it's almost a "historical" (historical is in quotes because 2008 isn't that long ago) artifact to limit your conception of a data scientist to someone who works only with online user behavior data.

There's another emerging field out there called computational social sciences, which could be thought of as a subset of data science. But we can go back even further. In 2001, William Cleveland wrote a position paper about data science called "Data Science: An action plan to expand the field of statistics." So data science existed before data scientists? Is this semantics, or does it make sense? This all begs a few questions: can you define data science by what data scientists do? Who gets to define the field, anyway? There's lots of buzz and hype—does the media get to define it, or should we rely on the practitioners, the self-appointed data scientists? Or is there some actual authority? Let's leave these as open questions for now, though we will return to them throughout the book.

Data Science Jobs
Columbia just decided to start an Institute for Data Sciences and Engineering with Bloomberg's help. There are 465 job openings in New York City alone for data scientists last time we checked. That's a lot. So even if data science isn't a real field, it has real jobs. And here's one thing we noticed about most of the job descriptions: they ask data scientists to be experts in computer science, statistics, communication, data visualization, and to have extensive domain expertise. Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise—together, as a team, they can specialize in all those things. We'll talk about this more after we look at the composite set of skills in demand for today's data scientists.

Statistical Inference
The world we live in is complex, random, and uncertain. At the same time, it's one big data-generating machine. As we commute to work on subways and in cars, as our blood moves through our bodies, as we're shopping, emailing, procrastinating at work by browsing the Internet and watching the stock market, as we're building things, eating things, talking to our friends and family about things, while factories are producing products, this all at least potentially produces data.

Imagine spending 24 hours looking out the window, and for every minute, counting and recording the number of people who pass by. Or gathering up everyone who lives within a mile of your house and making them tell you how many email messages they receive every day for the next year. Imagine heading over to your local hospital and rummaging around in the blood samples looking for patterns in the DNA. That all sounded creepy, but it wasn't supposed to. The point here is that the processes in our lives are actually data-generating processes.

We'd like ways to describe, understand, and make sense of these processes, in part because as scientists we just want to understand the world better, but many times, understanding these processes is part of the solution to problems we're trying to solve. Data represents the traces of the real-world processes, and exactly which traces we gather are decided by our data collection or sampling method. You, the data scientist, the observer, are turning the world into data, and this is an utterly subjective, not objective, process. After separating the process from the data collection, we can see clearly that there are two sources of randomness and uncertainty. Namely, the randomness and uncertainty underlying the process itself, and the uncertainty associated with your underlying data collection methods. Once you have all this data, you have somehow captured the world, or certain traces of the world. But you can't go walking around with a huge Excel spreadsheet or database of millions of transactions and look at it and, with a snap of a finger, understand the world and process that generated it. So you need a new idea, and that's to simplify those captured traces into something more comprehensible, to something that somehow captures it all in a much more concise way, and that something could be mathematical models or functions of the data, known as statistical estimators.

This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference. More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
Populations and Samples

Let's get some terminology and concepts in place to make sure we're all talking about the same thing. In classical statistical literature, a distinction is made between the population and the sample. The word population immediately makes us think of the entire US population of 300 million people, or the entire world's population of 7 billion people. But put that image out of your head, because in statistical inference population isn't used to simply describe only people. It could be any set of objects or units, such as tweets or photographs or stars. If we could measure the characteristics or extract characteristics of all those objects, we'd have a complete set of observations, and the convention is to use N to represent the total number of observations in the population. Suppose your population was all emails sent last year by employees at a huge corporation, BigCorp. Then a single observation could be a list of things: the sender's name, the list of recipients, date sent, text of email, number of characters in the email, number of sentences in the email, number of verbs in the email, and the length of time until first reply.

When we take a sample, we take a subset of the units of size n in order to examine the observations to draw conclusions and make inferences about the population. There are different ways you might go about getting this subset of data, and you want to be aware of this sampling mechanism because it can introduce biases into the data, and distort it, so that the subset is not a "mini-me" shrunk-down version of the population. Once that happens, any conclusions you draw will simply be wrong and distorted.

In the BigCorp email example, you could make a list of all the employees and select 1/10th of those people at random and take all the email they ever sent, and that would be your sample. Alternatively, you could sample 1/10th of all email sent each day at random, and that would be your sample. Both these methods are reasonable, and both methods yield the same sample size. But if you took them and counted how many email messages each person sent, and used that to estimate the underlying distribution of emails sent by all individuals at BigCorp, you might get entirely different answers. So if even getting a basic thing down like counting can get distorted when you're using a reasonable-sounding sampling method, imagine what can happen to more complicated algorithms and models if you haven't taken into account the process that got the data into your hands.
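To make the point concrete, here is a small R sketch (with made-up, hypothetical data) that simulates the two sampling schemes and compares the per-person email counts they produce; the employee names and the heavy-sender pattern are invented purely for illustration.

# Hypothetical email log: a few employees send far more email than the rest.
set.seed(1)
senders <- sample(paste0("emp", 1:1000), size = 100000, replace = TRUE,
                  prob = c(rep(10, 50), rep(1, 950)))   # 50 heavy senders
emails  <- data.frame(sender = senders,
                      day = sample(1:365, 100000, replace = TRUE))

# Scheme 1: sample 1/10th of the employees, keep all of their email.
sampled_emps <- sample(unique(emails$sender), size = 100)
sample1 <- emails[emails$sender %in% sampled_emps, ]

# Scheme 2: sample 1/10th of all emails at random.
sample2 <- emails[sample(nrow(emails), size = nrow(emails) / 10), ]

# Estimated emails sent per person under each scheme.
mean(table(sample1$sender))
mean(table(sample2$sender))

The two averages differ by roughly a factor of ten, because the second scheme only sees a tenth of each person's mail; naively comparing the raw counts would badly misestimate how much a typical employee sends.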

Modeling
In the next chapter, we’ll look at how we build models from the data we collect, but first we
want to discuss what we even mean by this term. Rachel had a recent phone conversation with someone about a modelling workshop, and several minutes into it she realized the word "model" meant completely different things to them. He was using it to mean data models—the representation one is choosing to store one's data, which is the realm of database managers—whereas she was talking about statistical models, which is what much of this book is about. One of Andrew Gelman's blog posts on modeling was recently tweeted by people in the fashion industry, but that's a different issue. Even if you've used the terms statistical model or mathematical model for years, is it even clear to yourself and to the people you're talking to what you mean? What makes a model a model? Also, while we're asking fundamental questions like this, what's the difference between a statistical model and a machine learning algorithm? Before we dive deeply into that, let's add a bit of context with this deliberately provocative Wired magazine piece, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," published in 2008 by Chris Anderson, then editor-in-chief. Anderson equates massive amounts of data to complete information and argues no models are necessary and "correlation is enough"; e.g., that in the context of massive amounts of data, "they [Google] don't have to settle for models at all."

Really? We don't think so, and we don't think you'll think so either by the end of the book. But the sentiment is similar to the Cukier and Mayer-Schoenberger article we just discussed about N=ALL, so you might already be getting a sense of the profound confusion we're witnessing all around us. To their credit, it's the press that's currently raising awareness of these questions and issues, and someone has to do it. Even so, it's hard to take when the opinion makers are people who don't actually work with data. Think critically about whether you buy what Anderson is saying; where you agree, disagree, or where you need more information to form an opinion. Given that this is how the popular press is currently describing and influencing public perception of data science and modeling, it's incumbent upon us as data scientists to be aware of it and to chime in with informed comments. With that context, then, what do we mean when we say models? And how do we use them as data scientists? To get at these questions, let's dive in.

What is a model?

Humans try to understand the world around them by representing it in different ways. Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions. Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids. Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself. A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.

A model is an artificial construction where all extraneous detail has been removed or abstracted. Attention must always be paid to these abstracted details after a model has been analyzed to see what might have been overlooked. In the case of proteins, a model of the protein backbone with side chains by itself is removed from the laws of quantum mechanics that govern the behavior of the electrons, which ultimately dictate the structure and actions of proteins. In the case of a statistical model, we may have mistakenly excluded key variables, included irrelevant ones, or assumed a mathematical structure divorced from reality.
Statistical modelling

Before you get too involved with the data and start coding, it's useful to draw a picture of what you think the underlying process might be with your model. What comes first? What influences what? What causes what? What's a test of that? But different people think in different ways. Some prefer to express these kinds of relationships in terms of math. The mathematical expressions will be general enough that they have to include parameters, but the values of these parameters are not yet known. In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data. So, for example, if you have two columns of data, x and y, and you think there's a linear relationship, you'd write down y = β0 + β1x. You don't know what β0 and β1 are in terms of actual numbers yet, so they're the parameters.

Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time. This gives them an abstract picture of the relationships before choosing equations to express them.

Probability distributions

Probability distributions are the foundation of statistical models. When we get to linear regression and Naive Bayes, you will see how this happens in practice. One can take multiple semesters of courses on probability theory, and so it's a tall challenge to condense it down for you in a small section.

Back in the day, before computers, scientists observed real-world phenomena, took measurements, and noticed that certain mathematical shapes kept reappearing. The classical example is the height of humans, following a normal distribution—a bell-shaped curve, also called a Gaussian distribution, named after Gauss. Other common shapes have been named after their observers as well (e.g., the Poisson distribution and the Weibull distribution), while other shapes such as Gamma distributions or exponential distributions are named after associated mathematical objects.

Natural processes tend to generate measurements whose empirical shape could be approximated by mathematical functions with a few parameters that could be estimated from the data. Not all processes generate data that looks like a named distribution, but many do. We can use these functions as building blocks of our models. It's beyond the scope of the book to go into each of the distributions in detail, but we provide them in Figure 2-1 as an illustration of the various common shapes, and to remind you that they only have names because someone observed them enough times to think they deserved names. There is actually an infinite number of possible distributions. They are to be interpreted as assigning a probability to a subset of possible outcomes, and have corresponding functions. For example, the normal distribution is written as:

p(x) = (1 / (σ√(2π))) · exp( −(x − μ)² / (2σ²) )

The parameter μ is the mean and median and controls where the distribution is centered (because this is a symmetric distribution), and the parameter σ controls how spread out the distribution is. This is the general functional form, but for specific real-world phenomena, these parameters have actual numbers as values, which we can estimate from the data.

A random variable denoted by x or y can be assumed to have a corresponding probability distribution, p(x), which maps x to a positive real number. In order to be a probability density function, we are restricted to the set of functions such that if we integrate p(x) to get the area under the curve, it is 1, so it can be interpreted as probability.
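As a quick, hedged illustration of these two facts (the density integrates to 1, and the parameters can be estimated from data), the following R snippet uses the built-in dnorm() and rnorm() functions; the chosen values μ = 170 and σ = 10 are arbitrary examples, not anything from the text.

# The normal density with mean 170 and sd 10 integrates to (approximately) 1.
integrate(dnorm, lower = -Inf, upper = Inf, mean = 170, sd = 10)

# Draw a sample from that distribution and estimate mu and sigma from the data.
heights <- rnorm(10000, mean = 170, sd = 10)
mean(heights)   # estimate of mu, close to 170
sd(heights)     # estimate of sigma, close to 10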

In addition to denoting distributions of single random variables with functions of one variable, we use multivariate functions called joint distributions to do the same thing for more than one random variable. So in the case of two random variables, for example, we could denote our distribution by a function p(x, y), and it would take values in the plane and give us nonnegative values. In keeping with its interpretation as a probability, its (double) integral over the whole plane would be 1.

We also have what is called a conditional distribution, p(x|y), which is to be interpreted as the density function of x given a particular value of y.

When we observe data points, i.e., (x1, y1), (x2, y2), ..., (xn, yn), we are observing realizations of a pair of random variables. When we have an entire dataset with n rows and k columns, we are observing n realizations of the joint distribution of those k random variables.

Fitting a model

Fitting a model means that you estimate the parameters of the model using the observed data. You are using your data as evidence to help approximate the real-world mathematical process that generated the data. Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to help get the parameters.

Fitting the model is when you start actually coding: your code will read in the data, and you'll specify the functional form that you wrote down on the piece of paper. Then R or Python will use built-in optimization methods to give you the most likely values of the parameters given the data. As you gain sophistication, or if this is one of your areas of expertise, you'll dig around in the optimization methods yourself. Initially you should have an understanding that optimization is taking place and how it works, but you don't have to code this part yourself—it underlies the R or Python functions.
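For instance, a minimal sketch of this workflow in R, assuming simulated x and y columns and the linear form y = β0 + β1x written down earlier, might look like the following; lm() does the estimation (ordinary least squares here, which coincides with maximum likelihood under Gaussian errors).

# Simulate data from a known linear process: beta0 = 2, beta1 = 3.
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 1)

# Fit the model y = beta0 + beta1 * x to the observed data.
fit <- lm(y ~ x)

# The estimated parameters should be close to 2 and 3.
coef(fit)
summary(fit)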

Overfitting

Throughout the book you will be cautioned repeatedly about overfitting, possibly to the point you will have nightmares about it. Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data. You might know this because you have tried to use it to predict labels for another set of data that you didn't use to fit the model, and it doesn't do a good job, as measured by an evaluation metric such as accuracy.
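A small, hedged sketch of how this shows up in practice: fit a very flexible model and a simple one on the same training data, then compare their errors on held-out data. The simulated sine-shaped data and the degree-15 polynomial below are just illustrative choices.

set.seed(7)
x <- runif(60, 0, 6)
y <- sin(x) + rnorm(60, sd = 0.3)

# Hold out part of the data that the models never see during fitting.
train <- 1:40
test  <- 41:60

simple  <- lm(y ~ x, subset = train)              # a rigid model
complex <- lm(y ~ poly(x, 15), subset = train)    # flexible enough to chase noise

mse <- function(model, idx) mean((y[idx] - predict(model, data.frame(x = x[idx])))^2)

mse(simple, train);  mse(simple, test)
mse(complex, train); mse(complex, test)   # a big gap between train and test error signals overfitting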

Basics of R

Introduction

 R is a programming language and software environment for statistical analysis, graphics representation and reporting.
 R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
 R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.
 This programming language was named R, based on the first letter of the first name of the two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language S.
 The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions.
 R allows integration with the procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
 A large group of individuals has contributed to R by sending code and bug reports.
 Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.

Features of R
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R:

 R is a well-developed, simple and effective programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
 R has an effective data handling and storage facility.
 R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
 R provides a large, coherent and integrated collection of tools for data analysis.
 R provides graphical facilities for data analysis and display, either directly at the computer or for printing on paper.

R - Environment Setup

1. Installation of R and RStudio in Windows.

In Linux: (Through Terminal)

 Press Ctrl+Alt+T to open Terminal
 Then execute sudo apt-get update
 After that, sudo apt-get install r-base

In Windows:

Install R on Windows

Step – 1: Go to the CRAN R project website. (Comprehensive R Archive Network)
Step – 2: Click on the Download R for Windows link.
Step – 3: Click on the base subdirectory link or install R for the first time link.
Step – 4: Click Download R X.X.X for Windows (X.X.X stands for the latest version of R, e.g. 3.6.1) and save the executable .exe file.
Step – 5: Run the .exe file and follow the installation instructions.
Select the desired language and then click Next.
Read the license agreement and click Next.
Select the components you wish to install (it is recommended to install all the components). Click Next.
Enter/browse the folder/path you wish to install R into and then confirm by clicking Next.
Select additional tasks like creating desktop shortcuts etc. then click Next.
Wait for the installation process to complete.
Click on Finish to complete the installation.

Install RStudio on Windows

Step – 1: With R-base installed, let's move on to installing RStudio. To begin, go to download RStudio and click on the download button for RStudio desktop.
Step – 2: Click on the link for the Windows version of RStudio and save the .exe file.
Step – 3: Run the .exe and follow the installation instructions.
Click Next on the welcome window.
Enter/browse the path to the installation folder and click Next to proceed.
Select the folder for the start menu shortcut or click on do not create shortcuts and then click Next.
Wait for the installation process to complete.
Click Finish to end the installation.

Programming with R
R - Basic Syntax
To output text in R, use single or double quotes:
"Hello World!"

To output numbers, just type the number (without quotes):
5
10
25

To do simple calculations, add numbers together:
2 + 3

Output: 5

R Print Output
Print

Unlike many other programming languages, you can output code in R without using a print function:

Example
"Hello!"

However, R does have a print() function available if you want to use it.

Example
print("Hello!")

And there are times you must use the print() function to output code, for example:

x <- 10
print(x)

R Comments
Comments

Comments can be used to explain R code, and to make it more readable. They can also be used to prevent execution when testing alternative code.

Comments start with a #. When executing code, R will ignore anything that starts with #. This example uses a comment before a line of code:

Example
# This is a comment
"Hello!"

This example uses a comment at the end of a line of code:

Example
"Hello World!" # This is a comment

Multiline Comments

Unlike other programming languages, such as Java, there is no syntax in R for multiline comments. However, we can just insert a # for each line to create multiline comments:

Example
# This is a comment
# written in
# more than just one line
"Hello!"

R Variables
Creating Variables in R

Variables are containers for storing data values.

R does not have a command for declaring a variable. A variable is created the moment you first assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type the variable name:

Example
name <- "John"
age <- 40

name   # output "John"
age    # output 40

In other programming languages, it is common to use = as an assignment operator. In R, we can use both = and <- as assignment operators.

However, <- is preferred in most cases because the = operator can be forbidden in some contexts in R.

Print/Output Variables

Compared to many other programming languages, you do not have to use a function to print/output variables in R. You can just type the name of the variable:

Example
name <- "John"
name   # auto-print the value of the name variable

Example
name <- "John"
print(name)   # print the value of the name variable

Concatenate Elements

You can also concatenate, or join, two or more elements, by using the paste() function.

To combine both text and a variable, R uses comma (,):

Example
text <- "awesome"
paste("R is", text)

Example
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)

For numbers, the + character works as a mathematical operator:

Example
num1 <- 5
num2 <- 10
num1 + num2

If you try to combine a string (text) and a number, R will give you an error:

Example
num <- 5
text <- "Some text"
num + text

Result:
Error in num + text : non-numeric argument to binary operator

Multiple Variables

R allows you to assign the same value to multiple variables in one line:

Example
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"

# Print variable values
var1
var2
var3

Variable Names
A variable can have a short name (like x and y) or a more descriptive name (age, carname, total_volume). Rules for R variables are:

 A variable name must start with a letter and can be a combination of letters, digits, period (.) and underscore (_).
 If it starts with a period (.), it cannot be followed by a digit.
 A variable name cannot start with a number or underscore (_).
 Variable names are case-sensitive (age, Age and AGE are three different variables).
 Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...).

# Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"

# Illegal variable names:
2myvar <- "John"
my-var <- "John"
my var <- "John"
_my_var <- "John"
my_v@ar <- "John"
TRUE <- "John"
R Data Types

Variables can store data of different types, and different types can do different things.

In R, variables do not need to be declared with any particular type, and can even change type after they have been set:

Example
my_var <- 30
my_var <- "raghul"

Basic Data Types

Basic data types in R can be divided into the following types:

 numeric - (10.5, 55, 787)
 integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)
 complex - (9+3i, where "i" is the imaginary part)
 character (string) - ("k", "R is exciting", "FALSE", "11.5")
 logical (boolean) - (TRUE or FALSE)

We can use the class() function to check the data type of a variable:

Example
#numeric
x <- 10.5
class(x)

# integer
x <-1000L
class(x)

#complex
x <-9i + 3
class(x)

# character/string
x <- "R is exciting"
class(x)

#logical/boolean
x <- TRUE
class(x)
R Numbers

Numbers

There are three number types in R:

 numeric
 integer
 complex

Variables of number types are created when you assign a value to them:

Example
x <- 10.5   # numeric
y <- 10L    # integer
z <- 1i     # complex

Numeric

A numeric data type is the most common type in R, and contains any number with or without a decimal, like: 10.5, 55, 787.

Example
x <- 10.5
y <- 55

# Print values of x and y
x
y

# Print the class name of x and y
class(x)
class(y)

Integer

Integers are numeric data without decimals. This is used when you are certain that you will never create a variable that should contain decimals. To create an integer variable, you must use the letter L after the integer value:
Example
x <- 1000L
y <- 55L

# Print values of x and y
x
y

# Print the class name of x and y
class(x)
class(y)

Complex

A complex number is written with an "i" as the imaginary part:

Example
x <- 3+5i
y <- 5i

# Print values of x and y
x
y

# Print the class name of x and y
class(x)
class(y)

Type Conversion

You can convert from one type to another with the following functions:

 as.numeric()
 as.integer()
 as.complex()

Example
x <- 1L   # integer
y <- 2    # numeric

# convert from integer to numeric:
a <- as.numeric(x)

# convert from numeric to integer:
b <- as.integer(y)

# print values of x and y
x
y

# print the class name of a and b
class(a)
class(b)
UNIT-2
At the most basic level, attributes are not about numbers or symbols. However, to discuss and more precisely analyze the characteristics of objects, we assign numbers or symbols to them. To do this in a well-defined way, we need a measurement scale.

Definition 2. A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object. Formally, the process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object. While this may seem a bit abstract, we engage in the process of measurement all the time.
For instance, we step on a bathroom scale to determine our weight, we classify someone as male or female, or we count the number of chairs in a room to see if there will be enough to seat all the people coming to a meeting. In all these cases, the "physical value" of an attribute of an object is mapped to a numerical or symbolic value. With this background, we can now discuss the type of an attribute, a concept that is important in determining if a particular data analysis technique is consistent with a specific type of attribute.

The Type of an Attribute

In other words, the values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa. This is illustrated with two examples.

Example 1 (Employee Age and ID Number). Two attributes that might be associated with an employee are ID and age (in years). Both of these attributes can be represented as integers. However, while it is reasonable to talk about the average age of an employee, it makes no sense to talk about the average employee ID.

Indeed, the only aspect of employees that we want to capture with the ID attribute is that they are distinct. Consequently, the only valid operation for employee IDs is to test whether they are equal. There is no hint of this limitation, however, when integers are used to represent the employee ID attribute. For the age attribute, the properties of the integers used to represent age are very much the properties of the attribute. Even so, the correspondence is not complete since, for example, ages have a maximum, while integers do not.

Consider the figure below, which shows some objects (line segments) and how the length attribute of these objects can be mapped to numbers in two different ways. Each successive line segment, going from the top to the bottom, is formed by appending the topmost line segment to itself. Thus, the second line segment from the top is formed by appending the topmost line segment to itself twice, the third line segment from the top is formed by appending the topmost line segment to itself three times, and so forth. In a very real (physical) sense, all the line segments are multiples of the first. This fact is captured by the measurements on the right-hand side of the figure, but not by those on the left-hand side.

More specifically, the measurement scale on the left-hand side captures only the ordering of the length attribute, while the scale on the right-hand side captures both the ordering and additivity properties. Thus, an attribute can be measured in a way that does not capture all the properties of the attribute. The type of an attribute should tell us what properties of the attribute are reflected in the values used to measure it. Knowing the type of an attribute is important because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, and therefore, it allows us to avoid foolish actions, such as computing the average employee ID.
Note that it is common to refer to the type of an attribute as the type of a measurement scale.

Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio. Table 2.2 gives the definitions of these types, along with information about the statistical operations that are valid for each type. Each attribute type possesses all of the properties and operations of the attribute types above it. Consequently, any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the definition of the attribute types is cumulative. However, this does not mean that the operations appropriate for one attribute type are appropriate for the attribute types above it.

The Different Types of Attributes

A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers. It makes sense to compare and order objects by length, as well as to talk about the differences and ratios of length. The following properties (operations) of numbers are typically used to describe attributes.
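As a hedged illustration of how these attribute types show up in R (the employee data below is made up for the example), a nominal attribute maps naturally to a factor, an ordinal attribute to an ordered factor, and a ratio attribute such as age to a plain numeric vector; only the last of these supports a meaningful mean.

# Hypothetical employee records, one column per attribute type.
emp <- data.frame(
  id   = factor(c(101, 102, 103, 104)),                 # nominal: only == / != make sense
  perf = factor(c("low", "high", "medium", "high"),
                levels = c("low", "medium", "high"),
                ordered = TRUE),                        # ordinal: ordering is meaningful
  age  = c(34, 41, 29, 50)                              # ratio: differences and ratios make sense
)

mean(emp$age)                           # meaningful: average age
emp$perf[2] > emp$perf[3]               # meaningful: "high" > "medium"
mean(as.numeric(as.character(emp$id)))  # computable, but a meaningless "average employee ID"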
R - Vectors
Vectors in R are the same as arrays in the C language, which are used to hold multiple data values of the same type. One major key point is that in R the indexing of a vector starts from 1, not from 0. We can create numeric vectors and character vectors as well.

Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector types.

# Atomic vector of type character.
print("abc")

# Atomic vector of type double.
print(12.5)

# Atomic vector of type integer.
print(63L)

# Atomic vector of type logical.
print(TRUE)

# Atomic vector of type complex.
print(2+3i)

# Atomic vector of type raw.
print(charToRaw('hello'))
Multiple Elements Vector
Using colon operator with numeric data

# Creating a sequence from 5 to 13.
v <- 5:13
print(v)

# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)

# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)

Using sequence (seq) operator
# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))

When we execute the above code, it produces the following result −

[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0

Using the c() function

The non-character values are coerced to character type if one of the elements is a character.
# The logical and numeric values are converted to characters.
s <- c('apple', 'red', 5, TRUE)
print(s)

When we execute the above code, it produces the following result −
[1] "apple" "red"   "5"     "TRUE"
Accessing Vector Elements
Elements of a vector are accessed using indexing. The [ ] brackets are used for indexing. Indexing starts with position 1. Giving a negative value in the index drops that element from the result. TRUE, FALSE or 0 and 1 can also be used for indexing.
# Accessing vector elements using position.
t <-c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <-t[c(2,3,6)]
print(u)

# Accessing vector elements using logical indexing.
v <- t[c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)]
print(v)

# Accessing vector elements using negative indexing.
x <- t[c(-2, -5)]
print(x)

# Accessing vector elements using 0/1 indexing.
y <- t[c(0, 0, 0, 0, 0, 0, 1)]
print(y)

When we execute the above code, it produces the following result −

[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
Vector Manipulation
Vector arithmetic
Two vectors of the same length can be added, subtracted, multiplied or divided, giving the result as a vector output.
# Create two vectors.
v1 <- c(3, 8, 4, 5, 0, 11)
v2 <- c(4, 11, 0, 8, 1, 2)

# Vector addition.
add.result <- v1 + v2
print(add.result)

# Vector subtraction.
sub.result <- v1 - v2
print(sub.result)

# Vector multiplication.
multi.result <- v1 * v2
print(multi.result)

# Vector division.
divi.result <- v1 / v2
print(divi.result)

Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operations.
v1 <- c(3, 8, 4, 5, 0, 11)
v2 <- c(4, 11)
# V2 becomes c(4, 11, 4, 11, 4, 11)
add.result <- v1 + v2
print(add.result)

sub.result <- v1 - v2
print(sub.result)

Vector Element Sorting
Elements in a vector can be sorted using the sort() function.
v <- c(3, 8, 4, 5, 0, 11, -9, 304)

# Sort the elements of the vector.
result <- sort(v)
print(result)

# Sort the elements in the reverse order.
result <- sort(v, decreasing = TRUE)
print(result)

# Sorting character vectors.
v <- c("Red", "Blue", "yellow", "violet")
result <- sort(v)
print(result)

# Sorting character vectors in reverse order.
result <- sort(v, decreasing = TRUE)
print(result)

Types of vectors
Vectors are of different types which are used in R. Following are some of the types of vectors:
 Numeric vectors
Numeric vectors are those which contain numeric values such as integer, float, etc.

# R program to create numeric vectors
# creation of vectors using c() function.
v1 <- c(4, 5, 6, 7)
# display type of vector
typeof(v1)

# by using 'L' we can specify that we want integer values.
v2 <- c(1L, 4L, 2L, 5L)
# display type of vector
typeof(v2)

Output:
[1] "double"
[1] "integer"
 Character vectors
Character vectors contain alphanumeric values and special characters.

# R program to create character vectors
# by default numeric values are converted into characters
v1 <- c('geeks', '2', 'hello', 57)
# Displaying type of vector
typeof(v1)

Output:
[1] "character"
 Logical vectors
Logical vectors contain boolean values such as TRUE, FALSE and NA for Null values.

# R program to create logical vectors
# Creating logical vector using c() function
v1 <- c(TRUE, FALSE, TRUE, NA)
# Displaying type of vector
typeof(v1)

Output:
[1] "logical"

Modifying a vector
Modification of a vector is the process of applying some operation on an individual element of a vector to change its value in the vector.

X <- c(2, 7, 9, 7, 8, 2)

# modify a specific element
X[3] <- 1
X[2] <- 9
cat('subscript operator', X, '\n')

# Modify using different logics.
X[X > 5] <- 0
cat('Logical indexing', X, '\n')

# Modify by specifying the position or elements.
X <- X[c(3, 2, 1)]
cat('combine() function', X)

Output
subscript operator 2 9 1 7 8 2
Logical indexing 2 0 1 0 0 2
combine() function 1 0 2

Deleting a vector
Deletion of a vector is the process of deleting all of the elements of the vector. This can be done by assigning it to a NULL value.

M <- c(8, 10, 2, 5)

# set NULL to the vector
M <- NULL
cat('Output vector', M)

Output:
Output vector NULL

Sorting elements of a Vector
The sort() function is used to sort the values in ascending or descending order.

# R program to sort elements of a vector

# Creation of vector
X <- c(8, 2, 7, 1, 11, 2)

# Sort in ascending order
A <- sort(X)
cat('ascending order', A, '\n')

# Sort in descending order by setting decreasing as TRUE
B <- sort(X, decreasing = TRUE)
cat('descending order', B)

Output:
ascending order 1 2 2 7 8 11
descending order 11 8 7 2 2 1

Creating named vectors
A named vector can be created in several ways. With c():

xc <- c('a' = 5, 'b' = 6, 'c' = 7, 'd' = 8)
which results in:

> xc
a b c d
5 6 7 8

With the setNames function, two vectors of the same length can be used to create a named vector:

x <- 5:8
y <- letters[1:4]

xy <- setNames(x, y)
which results in a named integer vector:

> xy
a b c d
5 6 7 8

You may also use the names function to get the same result:
xy <- 5:8
names(xy) <- letters[1:4]
# With such a vector it is also possible to select elements by name:
xy["a"]

Vector sub-setting
In R Programming Language, subsetting allows the user to access elements from an object. It takes out a portion from the object based on the condition provided.
Method 1: Subsetting in R Using [ ] Operator
Using the '[ ]' operator, elements of vectors and observations from data frames can be accessed. To neglect some indexes, '-' is used to access all other indexes of a vector or data frame.
x <- 1:15
# Print vector
cat("Original vector: ", x, "\n")
# Subsetting vector
cat("First 5 values of vector: ", x[1:5], "\n")
cat("Without values present at index 1, 2 and 3: ", x[-c(1, 2, 3)], "\n")

Method 4: Subsetting in R Using subset() Function
The subset() function in R programming is used to create a subset of vectors, matrices, or data frames based on the conditions provided in the parameters.
q <- subset(airquality, Temp < 65, select = c(Month))
print(q)

Matrices
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix, as we know, rows are the ones that run horizontally and columns are the ones that run vertically.
Creating and Naming a Matrix
To create a matrix in R you need to use the function called matrix(). The arguments to this matrix() are the set of elements in the vector. You have to pass how many numbers of rows and how many numbers of columns you want to have in your matrix.

Note: By default, matrices are in column-wise order.

A=matrix(

# Taking sequence of elements


c(1, 2, 3, 4, 5, 6, 7, 8, 9),

#Noofrows
nrow = 3,

#Noofcolumns
ncol = 3,
#Bydefaultmatricesareincolumn-wiseorder
# So this parameter decides how to arrange the matrix
byrow = TRUE
)

#Namingrows
rownames(A)=c("r1","r2","r3")

#Namingcolumns
colnames(A)=c("c1","c2","c3")

cat("The 3x3 matrix:\n")


print(A)

Creatingspecialmatrices
R allows creation of various different types of matrices with the use of arguments passed to
the matrix() function.

 Matrix where all rows and columns are filled by a single constant 'k':
To create such a matrix the syntax is given below:

Syntax: matrix(k, m, n)
Parameters:
k: the constant
m: no of rows
n: no of columns

print(matrix(5,3,3))
Diagonal matrix:
A diagonal matrix is a matrix in which the entries outside the main diagonal are all zero. To
create such a matrix the syntax is given below:
print(diag(c(5,3,3),3,3))
Identity matrix:
A square matrix in which all the elements of the principal diagonal are ones and all other
elements are zeros. To create such a matrix the syntax is given below:
print(diag(1,3,3))
Matrix metrics
Matrix metrics mean that, once a matrix is created, the following are the questions we generally want to answer:

 How can you know the dimension of the matrix?
 How can you know how many rows are there in the matrix?
 How many columns are in the matrix?
 How many elements are there in the matrix?

A = matrix(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  nrow = 3,
  ncol = 3,
  byrow = TRUE
)
cat("The 3x3 matrix:\n")
print(A)

cat("Dimension of the matrix:\n")
print(dim(A))

cat("Number of rows:\n")
print(nrow(A))

cat("Number of columns:\n")
print(ncol(A))

cat("Number of elements:\n")
print(length(A))
# OR
print(prod(dim(A)))

Matrix subsetting
A matrix is subset with two arguments within single brackets, [ ], separated by a comma. The first argument specifies the rows, and the second the columns.

M_new<-matrix(c(25,23,25,20,15,17,13,19,25,24,21,19,20,12,30,17),ncol=4)

#M_new<-matrix(1:16,4)
M_new
colnames(M_new)<-c("C1","C2","C3","C4")
rownames(M_new)<-c("R1","R2","R3","R4")

M_new[,1,drop=FALSE] # all rows with 1st column

M_new[1,,drop=FALSE] #1st row with all column

M_new[1,1,drop=FALSE]#display1strowand1stcolumn,cellvalue

M_new[1:2,2:3]#display 1st ,2ndrows and 2nd ,3rdcolumn

M_new[1:2,c(2,4)] #display 1st ,2ndrows and 2nd ,4thcolumn

Arrays
Arrays are the R data objects which can store data in more than two dimensions. For example, if we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns. Arrays can store only one data type.

An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.
Example
The following example creates an array of two 3x3 matrices each with 3 rows and 3 columns.

# Create two vectors of different lengths.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)

# Take these vectors as input to the array.
result <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(result)

Naming Columns and Rows
We can give names to the rows, columns and matrices in the array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
column.names <- c("COL1", "COL2", "COL3")
row.names <- c("ROW1", "ROW2", "ROW3")
matrix.names <- c("Matrix1", "Matrix2")

# Take these vectors as input to the array.
result <- array(c(vector1, vector2), dim = c(3, 3, 2),
                dimnames = list(row.names, column.names, matrix.names))
print(result)

Accessing arrays
The arrays can be accessed by using indices for the different dimensions, separated by commas. Different components of an array can be accessed as shown in the following examples.
Accessing Uni-Dimensional Array
The elements can be accessed by using the indexes of the corresponding elements.

vec <- c(1:10)

# accessing entire vector
cat("Vector is: ", vec)

# accessing elements
cat("Third element of vector is: ", vec[3])

Accessing Array Elements
# Create two vectors of different lengths.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
column.names <- c("COL1", "COL2", "COL3")
row.names <- c("ROW1", "ROW2", "ROW3")
matrix.names <- c("Matrix1", "Matrix2")

# Take these vectors as input to the array.
result <- array(c(vector1, vector2), dim = c(3, 3, 2),
                dimnames = list(row.names, column.names, matrix.names))

# Print the third row of the second matrix of the array.
print(result[3, , 2])

# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1, 3, 1])

# Print the 2nd Matrix.
print(result[, , 2])

Calculations across Array Elements
We can do calculations across the elements in an array using the apply() function.
Syntax
apply(x, margin, fun)
Following is the description of the parameters used −
x is an array.
margin specifies the dimension(s) over which the function is applied (1 for rows, 2 for columns).
fun is the function to be applied across the elements of the array.
Example
We use the apply() function below to calculate the sum of the elements in the rows of an array across all the matrices.
# Create two vectors of different lengths.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)

# Take these vectors as input to the array.
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(new.array)

# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result −

,,1
[,1] [,2] [,3]

[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
,,2
[,1] [,2] [,3]

[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15

[1] 56 68 60
Accessing subset of array elements
A smaller subset of the array elements can be accessed by defining a range of row or column limits.

row_names <- c("row1", "row2")
col_names <- c("col1", "col2", "col3", "col4")
mat_names <- c("Mat1", "Mat2")
arr = array(1:15, dim = c(2, 4, 2),
            dimnames = list(row_names, col_names, mat_names))

# print elements of both the rows and columns 2 and 3 of matrix 1
print(arr[, c(2, 3), 1])

Adding elements to array
Elements can be appended at different positions in the array. The sequence of elements is retained in order of their addition to the array. The time complexity required to add new elements is O(n) where n is the length of the array. The length of the array increases by the number of element additions. There are various in-built functions available in R to add new values:

c(vector, values): the c() function allows us to append values to the end of the array. Multiple values can also be added together.
append(vector, values): this method allows the values to be appended at any position in the vector. By default, this function adds the element at the end.
append(vector, values, after = length(vector)) adds new values after the specified length of the array specified in the last argument of the function.

Using the length function of the array:
Elements can be added at length + x indices where x > 0.
# creating a uni-dimensional array
x <- c(1, 2, 3, 4, 5)

# addition of element using c() function
x <- c(x, 6)
print("Array after 1st modification")
print(x)

# addition of element using append function
x <- append(x, 7)
print("Array after 2nd modification")
print(x)

# adding elements after computing the length
len <- length(x)
x[len + 1] <- 8
print("Array after 3rd modification")
print(x)

# adding on length + 3 index
x[len + 3] <- 9
print("Array after 4th modification")
print(x)

# append a vector of values to the array after length + 3 of array
print("Array after 5th modification")
x <- append(x, c(10, 11, 12), after = length(x) + 3)
print(x)

# adds new elements after 3rd index
print("Array after 6th modification")
x <- append(x, c(-1, -1), after = 3)
print(x)
[1] "Array after 1st modification"
[1] 1 2 3 4 5 6
[1] "Array after 2nd modification"
[1] 1 2 3 4 5 6 7
[1] "Array after 3rd modification"
[1] 1 2 3 4 5 6 7 8
[1] "Array after 4th modification"
[1]  1  2  3  4  5  6  7  8 NA  9
[1] "Array after 5th modification"
[1]  1  2  3  4  5  6  7  8 NA  9 10 11 12
[1] "Array after 6th modification"
[1]  1  2  3 -1 -1  4  5  6  7  8 NA  9 10 11 12
Removing Elements from Array
Elements can be removed from arrays in R, either one at a time or multiple together. These elements are specified as indexes to the array, wherein the array values satisfying the conditions are retained and the rest removed. The comparison for removal is based on array values. Multiple conditions can also be combined together to remove a range of elements. Another way to remove elements is by using the %in% operator, wherein the set of element values belonging to the TRUE values of the operator are displayed as result and the rest are removed.

# creating an array of length 9
m <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
print("Original Array")
print(m)

# remove a single value element: 3 from array
m <- m[m != 3]
print("After 1st modification")
print(m)

Class in R
A class is the blueprint that helps to create an object and contains its member variables along with the attributes. As discussed earlier in the previous section, there are two basic classes in R, S3 and S4.
S3 Class
 The S3 class is somewhat primitive in nature. It lacks a formal definition and an object of this class can be created simply by adding a class attribute to it.
 This simplicity accounts for the fact that it is widely used in the R programming language. In fact most of R's built-in classes are of this type.
Example 1: S3 class
# create a list with required components
s <- list(name = "John", age = 21, GPA = 3.5)
# name the class appropriately
class(s) <- "student"

S4 Class
 S4 classes are an improvement over the S3 class. They have a formally defined structure which helps in making objects of the same class look more or less similar.
 Class components are properly defined using the setClass() function and objects are created using the new() function.
Example 2: S4 class
setClass("student", slots = list(name = "character", age = "numeric", GPA = "numeric"))

Reference Class
 Reference classes were introduced later, compared to the other two. They are more similar to the object-oriented programming we are used to seeing in other major programming languages.
 Reference classes are basically S4 classes with an environment added to them.
Example 3: Reference class
setRefClass("student")

Factors
Introduction to Factors:
Factors in R Programming Language are data structures that are implemented to categorize the data or represent categorical data and store it on multiple levels.

They can be stored as integers with a corresponding label for every unique integer. Though factors may look similar to character vectors, they are integers, and care must be taken while using them as strings. A factor accepts only a restricted number of distinct values. For example, a data field such as gender may contain values only from female, male.
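A small sketch of the point about integer storage (the numbers used here are just an example): converting a factor of number-like labels directly with as.integer() returns the internal level codes, not the labels, so the usual safe route is to go through as.character() first.

f <- factor(c("3", "5", "7"))

as.integer(f)                 # 1 2 3  -> the internal level codes, not the labels
as.numeric(as.character(f))   # 3 5 7  -> the values the labels actually represent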
Creating a Factor in R Programming Language
The command used to create or modify a factor in the R language is factor(), with a vector as input.
The two steps to creating a factor are:
 Creating a vector
 Converting the vector created into a factor using the function factor()

Example:

#Createavectoras input.
data<-c("East","West","East","North","North","East","West","West","West","East","North")

print(data)
print(is.factor(data))

#Applythefactorfunction.
factor_data <- factor(data)

print(factor_data)
print(is.factor(factor_data))

Changing the Order of Levels
The order of the levels in a factor can be changed by applying the factor function again with a new order of the levels.
Example:
data <- c("East", "West", "East", "North", "North", "East", "West", "West", "West", "East", "North")

# Create the factors
factor_data <- factor(data)
print(factor_data)

# Apply the factor function with the required order of the levels.
new_order_data <- factor(factor_data, levels = c("East", "West", "North"))
print(new_order_data)

Accessing elements of a Factor in R
Like we access elements of a vector, the same way we access the elements of a factor. If gender is a factor, then gender[i] would mean accessing the ith element in the factor.
Example:
gender <- factor(c("female", "male", "male", "female"))
gender[3]

Generating Factor Levels
We can generate factor levels by using the gl() function. It takes two integers as input which indicate how many levels and how many times each level.
Syntax
gl(n, k, labels)
Following is the description of the parameters used −
 n is an integer giving the number of levels.
 k is an integer giving the number of replications.
 labels is a vector of labels for the resulting factor levels.

Summarizing a Factor
The summary function in R returns the results of basic statistical calculations (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) for a numerical vector. The general way to write the R summary function is summary(x, na.rm = FALSE/TRUE), where x refers to a numerical vector and na.rm specifies whether to remove missing values from the calculation. When summary() is applied to a factor, it instead returns the count of each level.
Example:
v<-gl(3,4,labels=c("A","B","C"))
print(v)
summary(v)

Level Ordering of Factors
Factors are data objects used to categorize data and store it as levels. They can store a string as well as an integer. They are well suited for columns with a limited number of unique values. Factors in R can be created using the factor() function. It takes a vector as input. The c() function is used to create a vector with explicitly provided values.
Example:
x <- c("Pen", "Pencil", "Brush", "Pen",
       "Brush", "Brush", "Pencil", "Pencil")

print(x)
print(is.factor(x))

# Apply the factor function.
factor_x = factor(x)
levels(factor_x)

In the above code, x is a vector with 8 elements. To convert it to a factor the function factor() is used. Here the factor has 8 elements and 3 levels. Levels are the unique elements in the data and can be found using the levels() function.

Ordering Factor Levels

Ordered factors are an extension of factors. They arrange the levels in increasing order. We use the factor() function along with the argument ordered = TRUE.

Syntax: factor(data, levels = c(""), ordered = TRUE)
Parameter:
data: input vector with explicitly defined values.
levels: the list of levels, given with the c() function.
ordered: set to TRUE to enable ordering.

Example:

size = c("small","large","large","small","medium","large","medium","medium")
# converting to factor
size_factor <- factor(size)
print(size_factor)

# ordering the levels
ordered.size <- factor(size, levels = c("small","medium","large"), ordered = TRUE)
print(ordered.size)
In the above code, the size vector is created using the c() function. Then it is converted to a factor. For ordering, the factor() function is used with the levels argument and ordered = TRUE.
Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. Data frames can also be interpreted as matrices where each column can be of a different data type.
Following are the characteristics of a data frame.

• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or character type.
• Each column should contain the same number of data items.

Creating a Data Frame
friend.data <- data.frame(
  friend_id = c(1:5),
  friend_name = c("Sachin", "Sourav", "Dravid", "Sehwag", "Dhoni")
)
# print the data frame
print(friend.data)

Output:
  friend_id friend_name
1         1      Sachin
2         2      Sourav
3         3      Dravid
4         4      Sehwag
5         5       Dhoni

SummaryofDatainDataFrame
The statistical summary and nature of the data can be obtained by
applying summary() function.
#Createthedataframe.
emp.data<- data.frame(
emp_id =c (1:5),
emp_name=c("Rick","Dan","Michelle","Ryan","Gary"), salary
= c(623.3,515.2,611.0,729.0,843.25),

start_date=as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors=FALSE
)
#Printthesummary.

print(summary(emp.data))

Whenwe execute theabove code, it producesthe followingresult−


     emp_id   emp_name             salary        start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27

Extract Data from Data Frame
Extract a specific column from a data frame using the column name.
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27"))
)
# Extract specific columns.
result <- data.frame(emp.data$emp_name, emp.data$salary)
print(result)

Whenwe execute theabove code, it producesthe followingresult−


emp.data.emp_nameemp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25

Extract the first two rows and then all columns
result <- emp.data[1:2, ]

Extract the 3rd and 5th rows with the 2nd and 4th columns
result <- emp.data[c(3,5), c(2,4)]

Expand Data Frame / Extending Data Frame
A data frame can be expanded by adding columns and rows.

Add Column
Just add the column vector using a new column name.

# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
  stringsAsFactors = FALSE
)

# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
When we execute the above code, it produces the following result −
  emp_id emp_name salary start_date       dept
1      1     Rick 623.30 2012-01-01         IT
2      2      Dan 515.20 2013-09-23 Operations
3      3 Michelle 611.00 2014-11-15         IT
4      4     Ryan 729.00 2014-05-11         HR
5      5     Gary 843.25 2015-03-27    Finance

Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.

In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.

# Create the first data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
  dept = c("IT","Operations","IT","HR","Finance")
)

# Create the second data frame
emp.newdata <- data.frame(
  emp_id = c(6:8),
  emp_name = c("Rasmi","Pranab","Tusar"),
  salary = c(578.0,722.5,632.8),
  start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
  dept = c("IT","Operations","Finance")
)

# Bind the two data frames.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)

Remove Rows and Columns
Use the c() function to remove rows and columns in a Data Frame:
Example
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Remove the first row and column
Data_Frame_New <- Data_Frame[-c(1), -c(1)]

# Print the new data frame
Data_Frame_New

  Pulse Duration
2   150       30
3   120       45

Create Subsets of a Data frame
The subset() function in the R Programming Language is used to create subsets of a data frame. It can also be used to drop columns from a data frame.

emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27"))
)
emp.data

subset(emp.data,emp_id==3)

subset(emp.data,emp_id==c(1:3))

  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27

  emp_id emp_name salary start_date
3      3 Michelle    611 2014-11-15

  emp_id emp_name salary start_date
1      1     Rick  623.3 2012-01-01
2      2      Dan  515.2 2013-09-23
3      3 Michelle  611.0 2014-11-15

Sorting Data
To sort a data frame in R, use the order() function. By default, sorting is ASCENDING. Prepend the sorting variable with a minus sign (or set decreasing = TRUE) to indicate DESCENDING order. Here are some examples.
data = data.frame(rollno = c(1, 5, 4, 2, 3),
                  subjects = c("java", "python", "php", "sql", "c"))

print(data)
print("sort the data in decreasing order based on subjects")
print(data[order(data$subjects, decreasing = TRUE), ])

print("sort the data in decreasing order based on rollno")
print(data[order(data$rollno, decreasing = TRUE), ])

Output:
  rollno subjects
1      1     java
2      5   python
3      4      php
4      2      sql
5      3        c
[1] "sort the data in decreasing order based on subjects"
  rollno subjects
4      2      sql
2      5   python
3      4      php
1      1     java
5      3        c
[1] "sort the data in decreasing order based on rollno"
  rollno subjects
2      5   python
3      4      php
5      3        c
4      2      sql
1      1     java

Lists
Lists are one-dimensional, heterogeneous data structures. A list can be a list of vectors, a list of matrices, a list of characters, a list of functions, and so on.
A list is a vector but with heterogeneous data elements. A list in R is created with the use of the list() function. R allows accessing elements of a list with the use of the index value. In R, the indexing of a list starts with 1 instead of 0 as in other programming languages.

Creating a List
To create a list in R you need to use the function called list(). In other words, a list is a generic vector containing other objects. To illustrate how a list looks, we take an example here. We want to build a list of employees with the details. So for this, we want attributes such as ID, employee name, and the number of employees.
empId=c(1,2,3,4)

empName = c("Debi", "Sandeep", "Subham", "Shiba")

numberOfEmp = 4

empList = list(empId, empName, numberOfEmp)

print(empList)

or
list_data<-list("Red","Green",c(21,32,11),TRUE,51.23,119.1)
print(list_data)

Accessing components of a list
We can access components of a list in two ways.

Access components by names: All the components of a list can be named, and we can use those names to access the components of the list.

empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
print(empList)

# Accessing components by names
cat("Accessing name components using $ command\n")
print(empList$Names)

Access components by indices: We can also access the components of the list using indices. To access the top-level components of a list we have to use the double slicing operator "[[ ]]", which is two square brackets, and if we want to access the lower or inner level components of a list we have to use another square bracket "[ ]" along with the double slicing operator "[[ ]]".
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
print(empList)

# Accessing top level components by indices
cat("Accessing name components using indices\n")
print(empList[[2]])

# Accessing inner level components by indices
cat("Accessing Sandeep from name using indices\n")
print(empList[[2]][2])

# Accessing another inner level component by indices
cat("Accessing 4 from ID using indices\n")
print(empList[[1]][4])

Modifying components of a list
A list can also be modified by accessing the components and replacing them with the ones which you want.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
cat("Before modifying the list\n")
print(empList)

# Modifying the top-level component
empList$`TotalStaff` = 5

# Modifying inner level components
empList[[1]][5] = 5
empList[[2]][5] = "Kamala"

cat("After modifying the list\n")
print(empList)

Merging lists
We can merge lists by placing all the lists into a single list.

lst1<-list(1,2,3)
lst2<-list("Sun","Mon","Tue")

# Merge the two lists.


new_list<-c(lst1,lst2)

# Print themerged list.


print(new_list)

Deleting components of a list
To delete components of a list, first of all, we need to access those components and then insert a negative sign (negative index) before those components. It indicates that we want to delete that component.
empId=c(1,2,3,4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
"ID"=empId,
"Names"=empName,
"TotalStaff"=numberOfEmp
)
cat("Before deletion the list is\n")
print(empList)

# Deleting a top level component
cat("After deleting TotalStaff component\n")
print(empList[-3])

# Deleting an inner level component
cat("After deleting Sandeep from Names\n")
print(empList[[2]][-2])

Converting List to Vector
Here we are going to convert the list to a vector. For this we will create a list first and then unlist it into a vector.
# Create a list.
lst <- list(1:5)
print(lst)

# Convert the list to a vector.
vec <- unlist(lst)
print(vec)
Unit-4
Conditionals and control flow
R - Operators
An operator is a symbol that tells the compiler to perform specific mathematical or logical manipulations. The R language is rich in built-in operators and provides the following types of operators.

Types of Operators

We have the following types of operators in R programming −

• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators

Arithmetic Operators
The following table shows the arithmetic operators supported by the R language. The operators act on each element of the vector.

Operator    Description    Example

+    Adds two vectors
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v + t)
it produces the following result −
[1] 10.0  8.5 10.0

−    Subtracts the second vector from the first
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v - t)
it produces the following result −
[1] -6.0  2.5  2.0

*    Multiplies both vectors
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v * t)
it produces the following result −
[1] 16.0 16.5 24.0

/    Divides the first vector by the second
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v / t)
When we execute the above code, it produces the following result −
[1] 0.250000 1.833333 1.500000

%%    Gives the remainder of the first vector with the second
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v %% t)
it produces the following result −
[1] 2.0 2.5 2.0

%/%    The result of division of the first vector by the second (quotient)
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v %/% t)
it produces the following result −
[1] 0 1 1

^    The first vector raised to the exponent of the second vector
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v ^ t)
it produces the following result −
[1]  256.000  166.375 1296.000
Relational Operators
The following table shows the relational operators supported by the R language. Each element of the first vector is compared with the corresponding element of the second vector. The result of the comparison is a Boolean value.

Operator    Description    Example

>    Checks if each element of the first vector is greater than the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v > t)
it produces the following result −
[1] FALSE  TRUE FALSE FALSE

<    Checks if each element of the first vector is less than the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v < t)
it produces the following result −
[1]  TRUE FALSE  TRUE FALSE

==    Checks if each element of the first vector is equal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v == t)
it produces the following result −
[1] FALSE FALSE FALSE  TRUE

<=    Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v <= t)
it produces the following result −
[1]  TRUE FALSE  TRUE  TRUE

>=    Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v >= t)
it produces the following result −
[1] FALSE  TRUE FALSE  TRUE

!=    Checks if each element of the first vector is unequal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v != t)
it produces the following result −
[1]  TRUE  TRUE  TRUE FALSE

Logical Operators
The following table shows the logical operators supported by the R language. They are applicable only to vectors of type logical, numeric or complex. All non-zero values are considered as the logical value TRUE, and zero as FALSE.
Each element of the first vector is compared with the corresponding element of the second vector. The result of the comparison is a Boolean value.

Operator    Description    Example

&    It is called the Element-wise Logical AND operator. It combines each element of the first vector with the corresponding element of the second vector and gives an output of TRUE if both the elements are TRUE.
v <- c(3, 1, TRUE, 2+3i)
t <- c(4, 1, FALSE, 2+3i)
print(v & t)
it produces the following result −
[1]  TRUE  TRUE FALSE  TRUE

|    It is called the Element-wise Logical OR operator. It combines each element of the first vector with the corresponding element of the second vector and gives an output of TRUE if one of the elements is TRUE.
v <- c(3, 0, TRUE, 2+2i)
t <- c(4, 0, FALSE, 2+3i)
print(v | t)
it produces the following result −
[1]  TRUE FALSE  TRUE  TRUE

!    It is called the Logical NOT operator. It takes each element of the vector and gives the opposite logical value.
v <- c(3, 0, TRUE, 2+2i)
print(!v)
it produces the following result −
[1] FALSE  TRUE FALSE FALSE

The logical operators && and || consider only the first element of the vectors and give a vector of a single element as output.

Operator    Description    Example

&&    Called the Logical AND operator. Takes the first element of both the vectors and gives TRUE only if both are TRUE.
v <- c(3, 0, TRUE, 2+2i)
t <- c(1, 3, TRUE, 2+3i)
print(v && t)
it produces the following result −
[1] TRUE

||    Called the Logical OR operator. Takes the first element of both the vectors and gives TRUE if one of them is TRUE.
v <- c(0, 0, TRUE, 2+2i)
t <- c(0, 3, TRUE, 2+3i)
print(v || t)
it produces the following result −
[1] FALSE
Assignment Operators
These operators are used to assign values to vectors.

Operator    Description    Example

<−  or  =  or  <<−    Called Left Assignment
v1 <- c(3, 1, TRUE, 2+3i)
v2 <<- c(3, 1, TRUE, 2+3i)
v3 = c(3, 1, TRUE, 2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

->  or  ->>    Called Right Assignment
c(3, 1, TRUE, 2+3i) -> v1
c(3, 1, TRUE, 2+3i) ->> v2
print(v1)
print(v2)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical computation.

Operator    Description    Example

:    Colon operator. It creates a series of numbers in sequence for a vector.
v <- 2:8
print(v)
it produces the following result −
[1] 2 3 4 5 6 7 8

%in%    This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE

%*%    This operator performs matrix multiplication; here it is used to multiply a matrix with its transpose.
M = matrix(c(2, 6, 5, 1, 10, 4), nrow = 2, ncol = 3, byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result −
     [,1] [,2]
[1,]   65   82
[2,]   82  117

R - Decision making / Conditional statements
Decision making structures require the programmer to specify one or more conditions to be evaluated or tested by the program, along with a statement or statements to be executed if the condition is determined to be true, and optionally, other statements to be executed if the condition is determined to be false.
R provides the following types of decision making statements.

Sr.No.    Statement & Description

1    if statement
An if statement consists of a Boolean expression followed by one or more statements.

2    if...else statement
An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.

3    switch statement
A switch statement allows a variable to be tested for equality against a list of values.
R - If Statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax
The basic syntax for creating an if statement in R is −
if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to true, then the block of code inside the if statement will be executed. If the Boolean expression evaluates to false, then the first set of code after the end of the if statement (after the closing curly brace) will be executed.
Flow Diagram
Example
x <- 30L
if(is.integer(x)) {
   print("X is an Integer")
}
When the above code is compiled and executed, it produces the following result −
[1] "X is an Integer"

R - If...Else Statement
An if statement can be followed by an optional else statement which executes when the Boolean expression is false.
Syntax
The basic syntax for creating an if...else statement in R is −
if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
} else {
   # statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to true, then the if block of code will be executed, otherwise the else block of code will be executed.
Flow Diagram

Example
x <- c("what", "is", "truth")

if("Truth" %in% x) {
   print("Truth is found")
} else {
   print("Truth is not found")
}

When the above code is compiled and executed, it produces the following result −

[1] "Truth is not found"
Here "Truth" and "truth" are two different strings.
The if...else if...else Statement
An if statement can be followed by an optional else if...else statement, which is very useful to test various conditions using a single if...else if statement.
When using if, else if, else statements there are a few points to keep in mind.
• An if can have zero or one else and it must come after any else if's.
• An if can have zero to many else if's and they must come before the else.
• Once an else if succeeds, none of the remaining else if's or else's will be tested.
Syntax
The basic syntax for creating an if...else if...else statement in R is −
if(boolean_expression 1) {
   # Executes when the boolean expression 1 is true.
} else if(boolean_expression 2) {
   # Executes when the boolean expression 2 is true.
} else if(boolean_expression 3) {
   # Executes when the boolean expression 3 is true.
} else {
   # Executes when none of the above conditions is true.
}
Example
x <- c("what", "is", "truth")

if("Truth" %in% x) {
   print("Truth is found the first time")
} else if("truth" %in% x) {
   print("truth is found the second time")
} else {
   print("No truth found")
}

When the above code is compiled and executed, it produces the following result −

[1] "truth is found the second time"

Nested If Statements

You can also have if statements inside if statements; this is called nested if statements.

Example
x <- 41

if (x > 10) {
  print("Above ten")
  if (x > 20) {
    print("and also above 20!")
  } else {
    print("but not above 20.")
  }
} else {
  print("below 10.")
}
AND

The & symbol (and) is a logical operator, and is used to combine conditional statements:

Example
Test if a is greater than b, AND if c is greater than a:

a<-200
b<-33
c<-500

if(a>b&c>a){
print("Bothconditionsare true")
}
OR

The | symbol (or) is a logical operator, and is used to combine conditional statements:

Example
Test if a is greater than b, or if a is greater than c:

a<-200
b<-33
c<-500

if(a>b|a>c){
print("Atleast one oftheconditions is true")
}

R - Switch Statement
A switch statement allows a variable to be tested for equality against a list of values. Each value is called a case, and the variable being switched on is checked for each case.
Syntax
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3 ...)
The following rules apply to a switch statement −
• If the value of expression is not a character string it is coerced to integer.
• You can have any number of case statements within a switch.
• If the value of the integer is between 1 and nargs()−1 (the max number of arguments), then the corresponding element of the case condition is evaluated and the result returned.
• If expression evaluates to a character string then that string is matched (exactly) to the names of the elements.
• If there is more than one match, the first matching element is returned.
• No default argument is available.
• In the case of no match, if there is an unnamed element of ... its value is returned.
Flow Diagram

Example
x <- switch(
   3,
   "first",
   "second",
   "third",
   "fourth"
)
print(x)

When the above code is compiled and executed, it produces the following result −

[1] "third"

Example2:

# Following is a simple R program
# to demonstrate the syntax of switch.

# Mathematical calculation

val1= 6
val2 = 7
val3="s"
result=switch( val3,
"a"= cat("Addition =", val1 + val2),
"d"=cat("Subtraction=",val1-val2),
"r"= cat("Division = ", val1 / val2),
"s"=cat("Multiplication=",val1*val2),
"m"= cat("Modulus =", val1 %% val2),
"p"= cat("Power =", val1 ^ val2)
)

print(result)

Iterative Programming in R
R - Loops

Introduction:
There may be a situation when you need to execute a block of code several times. In general, statements are executed sequentially. The first statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple times. The following is the general form of a loop statement in most programming languages.

The R programming language provides the following kinds of loops to handle looping requirements.

Sr.No.    Loop Type & Description

1    repeat loop
Executes a sequence of statements repeatedly until a break condition inside the loop body is met.

2    while loop
Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.

3    for loop
Executes a sequence of statements a fixed number of times, iterating over the elements of a vector or sequence.

R - For Loop
A for loop is a repetition control structure that allows you to efficiently write a loop that needs to execute a specific number of times.

Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
   statements
}
Flow Diagram

R's for loops are particularly flexible in that they are not limited to integers, or even numbers, in the input. We can pass character vectors, logical vectors, lists or expressions.
Example
v <- LETTERS[1:4]
for (i in v) {
   print(i)
}
When the above code is compiled and executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"

Example
for (x in 1:10) {
   print(x)
}

Example 2: Program to display days of a week.
week <- c('Sunday',
          'Monday',
          'Tuesday',
          'Wednesday',
          'Thursday',
          'Friday',
          'Saturday')

for (day in week)
{
   print(day)
}

R - While Loop
The while loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
   statement
}
Flow Diagram

Here the key point of the while loop is that the loop might not ever run. When the condition is tested and the result is false, the loop body will be skipped and the first statement after the while loop will be executed.
Example 1

val = 1

while (val <= 5)
{
   print(val)
   val = val + 1
}

Example 2
n <- 5
factorial <- 1
i <- 1
while (i <= n)
{
   factorial = factorial * i
   i = i + 1
}
print(factorial)
R-Repeat Loop

It is a simple loop that will run the same statement or a group of


statements repeatedly until the stop condition has been
encountered. Repeat loop does not have any condition to terminate
the loop, a programmer must specifically place a condition within
the loop’s body and use the declaration of a break statement to
terminate this loop. If no condition is present in the body of the
repeat loop then it will iterate infinitely.

Syntax
The basic syntax for creating a repeat loop in R is −
repeat
{
   statement

   if(condition)
   {
      break
   }
}
Flow Diagram

Example 1
val = 1
repeat
{
   print(val)
   val = val + 1

   if(val > 5)
   {
      break
   }
}

Example 2:

i <- 0

repeat
{
   print("Geeks4geeks!")
   i = i + 1
   if(i == 5)
   {
      break
   }
}

Loop Control Statements / Jump statements
Loop control statements change execution from its normal sequence. R supports the following control statements.

Sr.No.    Control Statement & Description

1    break statement
Terminates the loop statement and transfers execution to the statement immediately following the loop.

2    next statement
The next statement skips the current iteration of a loop and moves control to the next iteration.

R - Break Statement
The break statement in the R programming language has the following two usages −
• When the break statement is encountered inside a loop, the loop is immediately terminated and program control resumes at the next statement following the loop.
• It can be used to terminate a case in a switch statement.

Syntax
The basic syntax for creating a break statement in R is −
break
Flow Diagram

Example
for (val in 1:5)
{
  # checking condition
  if (val == 3)
  {
    # using break keyword
    break
  }

  # displaying items in the sequence
  print(val)
}

R - Next Statement
The next statement in the R programming language is useful when we want to skip the current iteration of a loop without terminating it. On encountering next, the R parser skips further evaluation and starts the next iteration of the loop.
Syntax
The basic syntax for creating a next statement in R is −
next
Flow Diagram

Example
for (val in 1:5)
{
  # checking condition
  if (val == 3)
  {
    # using next keyword
    next
  }

  # displaying items in the sequence
  print(val)
}

Loop over a list
A for loop is very valuable when we need to iterate over a list of elements or a range of numbers. A loop can be used to iterate over a list, data frame, vector, matrix or any other object. The braces and square brackets are compulsory.
For Loop in R Example 1: We iterate over all the elements of a vector and print the current value.

# Create fruit vector
fruit <- c('Apple', 'Orange', 'Passionfruit', 'Banana')
# Create the for statement
for (i in fruit) {
  print(i)
}
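Since the text above notes that a loop can also iterate over a list, data frame or matrix, here is a small illustrative sketch (the list contents are made up for illustration):

# Create a list with mixed element types
info <- list(name = "Asha", marks = c(80, 92, 75), passed = TRUE)

# Iterate over the components of the list and print each one
for (item in info) {
  print(item)
}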

R - Functions
Functions are useful when you want to perform a certain task multiple times. A function accepts input arguments and produces the output by executing valid R commands that are inside the function. In the R Programming Language, when you are creating a function, the function name and the file in which you are creating the function need not be the same, and you can have one or more function definitions in a single R file.

Types of functions in the R Language

Built-in Functions: Built-in functions in R include sqrt(), mean() and max(); these functions are directly called in the program by users.

User-defined Functions: The R language allows us to write our own functions.

Functions in the R Language
Functions are created in R by using the command function(). The general structure of the function file is as follows:
Built-in Functions in R Programming Language
Here we will use built-in functions like sum(), max() and min().
print(sum(4:6))

# Find the max of the numbers 4 to 6.
print(max(4:6))

# Find the min of the numbers 4 to 6.
print(min(4:6))

User-defined Functions in R Programming Language
R provides built-in functions like print(), cat(), etc., but we can also create our own functions. These functions are called user-defined functions.
evenOdd = function(x)
{ if(x %% 2 == 0)
return("even")
else
return("odd")
}

print(evenOdd(4))
print(evenOdd(3))

Single Input Single Output
Now create a function in R that will take a single input and give us a single output.
areaOfCircle = function(radius)
{ area = pi*radius^2
return(area)
}

print(areaOfCircle(2))

Multiple Input Multiple Output
Now create a function in the R Language that will take multiple inputs and give us multiple outputs using a list.
The functions in the R Language take multiple input objects but return only one object as output. This is, however, not a limitation, because you can create a list of all the outputs which you want to create, and once the list is created you can access its elements and get the answers which you want.

Rectangle = function(length, width)


{ area = length * width
perimeter=2*(length+width)

# create an object called result which is


# a list of area and perimeter
result = list("Area" = area, "Perimeter" = perimeter)
return(result)
}

resultList = Rectangle(2, 3)
print(resultList["Area"])
print(resultList["Perimeter"])

Inline Functions in R Programming Language
Sometimes creating an R script file, loading it and executing it is a lot of work when you want to just create a very small function. So, what we can do in this kind of situation is use an inline function.
To create an inline function you have to use the function command with the argument x and then the expression of the function.
f = function(x)x*100

print(f(4))

Passing arguments to Functions in R Programming Language
There are several ways you can pass arguments to a function:
• Case 1: Generally in R, the arguments are passed to the function in the same order as in the function definition.
• Case 2: If you do not want to follow any order, you can pass the arguments using the names of the arguments in any order.
• Case 3: If the arguments are not passed, the default values are used to execute the function.

Rectangle=function(length=5,width=4){ area
= length * width
return(area)
}

#Case1:
print(Rectangle(2,3))
#Case2:
print(Rectangle(width= 8,length=4))

#Case3:
print(Rectangle())

Lazy evaluation of Functions in R Programming Language
In R the functions are executed in a lazy fashion. When we say lazy, what it means is that if some arguments are missing the function is still executed as long as the execution does not involve those arguments.

Example1:
Cal= function(a,b,c)
{ v = a*b
return(v)
}

print(Cal(5,10))

Example2:
Cal=function(a,b,c){
v=a*b*c
return(v)
}

print(Cal(5,10))

Function Arguments in R Programming
Arguments are the parameters provided to a function to perform operations in a programming language. In R programming, we can use as many arguments as we want, separated by commas. There is no limit on the number of arguments in a function in R. In this article, we'll discuss different ways of adding arguments in a function in R programming.

Adding Arguments in R
We can pass an argument to a function while calling the function by simply giving the value as an argument inside the parentheses. Below is an implementation of a function with a single argument.
divisbleBy5 <- function(n)
{ if(n %% 5 == 0)
{
return("numberisdivisibleby5")
}
else
{
return("numberisnotdivisibleby5")
}
}
# Function call
divisbleBy5(100)
Adding Multiple Arguments in R
A function in R programming can have multiple arguments too. Below is an implementation of a function with multiple arguments.
divisible <- function(a, b)
{ if(a %% b == 0)
{
return(paste(a,"isdivisibleby",b))
}
else
{
return(paste(a,"isnotdivisibleby",b))
}
}

#Functioncall
divisible(7, 3)

AddingDefaultValueinR
Default value in a function is a value that is not required to specify each time the function is
called. If the value is passed by the user, then the user-defined value is used by the function
otherwise, the default value is used. Below is an implementation of function with default
value.
divisible <- function(a, b = 3)
{ if(a %% b == 0)
{
return(paste(a,"isdivisibleby",b))
}
else
{
return(paste(a,"isnotdivisibleby",b))
}
}

# Function call
divisible(10,5)
divisible(12)

Dots Argument
The dots argument (…) is also known as an ellipsis; it allows the function to take an undefined number of arguments, i.e. an arbitrary number of arguments. Below is an example of a function with an arbitrary number of arguments.
fun<-function(n,...){
l<-list(n,...)
paste(l,collapse="")
}
#Functioncall
fun(5,1L,6i,15.2,TRUE)

RecursiveFunctionsinRProgramming

The recursive function uses the concept of recursion to perform iterative tasks they call
themselves, again and again, which acts as a loop. These kinds of functions need a stopping
condition so that they can stop looping continuously.
Recursive functions call themselves. Theybreak down the problem into smaller components.
The function() calls itself within the original function() on each of the smaller components.
After this, the results will be put together to solve the original problem.
Example1:
fac<- function(x){ if(x==0||x==1)
{
return(1)
}
else
{
return(x*fac(x-1))
}
}

fac(3)

Nested Functions
There are two ways to create a nested function:

• Call a function within another function.
• Write a function within a function.

Example

Call a function within another function:

Nested_function <- function(x, y) {


a<-x+y
return(a)
}

Nested_function(Nested_function(2,2),Nested_function(3,3))
 Writeafunction within afunction.

Example

Writeafunction within afunction:

Outer_func <- function(x)


{ Inner_func<-function(y){
a <- x + y
return(a)
}
return(Inner_func)
}
output <- Outer_func(3)   # To call the Outer_func
output(5)

Loading an R package

Packages
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation. Once installed, packages have to be loaded into the session to be used.

.libPaths()   # get library location

library()     # see all packages installed

search()      # see packages currently loaded

Adding Packages
You can expand the types of analyses you do by adding other packages. A complete list of contributed packages is available from CRAN.

Follow these steps:

1. Download and install a package (you only need to do this once).

2. To use the package, invoke the library(package) command to load it into the current session. (You need to do this once in each session, unless you customize your environment to automatically load it each time.)

On MS Windows:
1. Choose Install Packages from the Packages menu.
2. Select a CRAN Mirror. (e.g. Norway)
3. Select a package. (e.g. boot)
4. Then use the library(package) function to load it for use. (e.g. library(boot))

Load an R Package

There are basically two extremely important functions when it comes down to R packages:

• install.packages(), which, as you can expect, installs a given package.

• library(), which loads packages, i.e. attaches them to the search list on your R workspace.

To install packages, you need administrator privileges. This means that install.packages() will thus not work in the DataCamp interface. However, almost all CRAN packages are installed on our servers. You can load them with library().

In this exercise, you'll be learning how to load the ggplot2 package, a powerful package for data visualization. You'll use it to create a plot of two variables of the mtcars data frame. The data has already been prepared for you in the workspace.

Before starting, execute the following commands in the console:

• search(), to look at the currently attached packages and
• qplot(mtcars$wt, mtcars$hp), to build a plot of two variables of the mtcars data frame.
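A minimal sketch of the two steps described above, assuming the ggplot2 package and the built-in mtcars data frame used in the exercise:

# install once (requires internet access and write permission to the library)
install.packages("ggplot2")

# load the package in each new session
library(ggplot2)

# quick plot of weight vs. horsepower from the mtcars data frame
qplot(mtcars$wt, mtcars$hp)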

Mathematical Functions in R

R provides various mathematical functions to perform mathematical calculations. These mathematical functions are very helpful to find the absolute value, the square root and much more. In R, the following functions are used:

S. No    Function    Description    Example

1.    abs(x)    It returns the absolute value of input x.
x <- -4
print(abs(x))
Output
[1] 4

2.    sqrt(x)    It returns the square root of input x.
x <- 4
print(sqrt(x))
Output
[1] 2

3.    ceiling(x)    It returns the smallest integer which is larger than or equal to x.
x <- 4.5
print(ceiling(x))
Output
[1] 5

4.    floor(x)    It returns the largest integer which is smaller than or equal to x.
x <- 2.5
print(floor(x))
Output
[1] 2

5.    trunc(x)    It returns the truncated value of input x.
x <- c(1.2, 2.5, 8.1)
print(trunc(x))
Output
[1] 1 2 8

6.    round(x, digits=n)    It returns the value of input x rounded to n decimal places.
x <- 2.567
print(round(x, digits = 2))
Output
[1] 2.57

7.    cos(x), sin(x), tan(x)    It returns the cos(x), sin(x), tan(x) values of input x.
x <- 4
print(cos(x))
print(sin(x))
print(tan(x))
Output
[1] -0.6536436
[1] -0.7568025
[1] 1.157821

8.    log(x)    It returns the natural logarithm of input x.
x <- 4
print(log(x))
Output
[1] 1.386294

9.    log10(x)    It returns the common logarithm of input x.
x <- 4
print(log10(x))
Output
[1] 0.60206

10.    exp(x)    It returns the exponential of x.
x <- 4
print(exp(x))
Output
[1] 54.59815
Unit-5
Data Reduction
Imagine that you have selected data from the AllElectronics data warehouse for analysis. The data set will likely be huge! Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression.

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration.
• Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
• Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.

Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation.
• These techniques may be parametric or nonparametric.
• For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples.
• Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.

In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.
• If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
• There are several lossless algorithms for string compression; however, they typically allow only limited data manipulation.
• Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
• There are many other ways of organizing methods of data reduction. The computational time spent on data reduction should not outweigh or "erase" the time saved by mining on a reduced data set size.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.
• The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
• "How can this technique be useful for data reduction if the wavelet transformed data are of the same length as the original data?" The usefulness lies in the fact that the wavelet transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients.

• The technique also works to remove noise without smoothing out the main features of the data, making it effective for data cleaning as well. Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.

• The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. In general, however, the DWT achieves better lossy compression.
• Unlike the DFT, wavelets are quite localized in space, contributing to the conservation of local detail.

There is only one DFT, yet there are several families of DWTs. Figure 3.4 shows some wavelet families. Popular wavelet transforms include the Haar-2, Daubechies-4, and Daubechies-6. The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.

The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L >= n).
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i, x2i+1). This results in two data sets of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.
• Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients, where the matrix used depends on the given DWT.
• The matrix must be orthonormal, meaning that the columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose. By factoring the matrix used into a product of a few sparse matrices, the resulting "fast DWT" algorithm has a complexity of O(n) for an input vector of length n.

• Wavelet transforms can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on.
• Lossy compression by wavelets is reportedly better than JPEG compression, the current commercial standard.
• Wavelet transforms have many real-world applications, including the compression of fingerprint images, computer vision, analysis of time-series data, and data cleaning.
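As a rough illustration of the pairwise smoothing and differencing steps described above, the following sketch performs one level of a simple Haar-style transform in base R. The averaging/differencing convention and the sample vector are assumptions for illustration only, not the exact normalization used by a particular DWT implementation.

# One level of a Haar-style transform: pairwise averages (smoothing)
# and pairwise differences (detail). Input length should be a power of 2.
haar_step <- function(x) {
  odd  <- x[seq(1, length(x), by = 2)]
  even <- x[seq(2, length(x), by = 2)]
  list(smooth = (odd + even) / 2,    # low-frequency approximation of the data
       detail = (odd - even) / 2)    # high-frequency detail coefficients
}

x <- c(2, 2, 0, 2, 3, 5, 4, 4)
step1 <- haar_step(x)
print(step1$smooth)   # length-4 smoothed version of the input
print(step1$detail)   # small coefficients here could be truncated for compression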

Principal Components Analysis
Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.

The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains will not dominate attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data.
3. The principal components are sorted in order of decreasing "significance" or strength.
4. Because the components are sorted in decreasing order of "significance," the data size can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
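A minimal sketch of these steps in base R using prcomp(); the built-in iris measurements are used only as a convenient illustrative data set:

# Normalize the attributes (center and scale) and compute the principal components
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)            # proportion of variance explained by each component

# Keep only the two strongest components as the reduced representation
reduced <- pca$x[, 1:2]
head(reduced)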
Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. For example, if the task is to classify customers based on whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer's telephone number are likely to be irrelevant, unlike attributes such as age or music taste.
Although it may be possible for a domain expert to pick out some of the useful attributes, this can be a difficult and time-consuming task, especially when the data's behavior is not well known. (Hence, a reason behind its analysis!) Leaving out relevant attributes or keeping irrelevant attributes may be detrimental, causing confusion for the mining algorithm employed. This can result in discovered patterns of poor quality. In addition, the added volume of irrelevant or redundant attributes can slow down the mining process.

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection. These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time. Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution. Such greedy methods are effective in practice and may come close to estimating an optimal solution.
The "best" (and "worst") attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another. Many other attribute evaluation measures can be used, such as the information gain measure used in building decision trees for classification.
Basic heuristic methods of attribute subset selection include the techniques that follow, some of which are illustrated in Figure 3.6.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the "best" attribute to partition the data into individual classes. When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes.
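As an illustrative sketch (not the textbook's exact procedure), stepwise forward selection can be approximated in base R with step() on a linear model, where the AIC criterion stands in for a statistical-significance test; the mtcars data frame and the chosen predictors are assumptions made only for the example:

# Start from an empty model and, at each step, add the predictor that
# most improves the AIC criterion (stepwise forward selection).
null_model <- lm(mpg ~ 1, data = mtcars)

forward <- step(null_model,
                scope = ~ cyl + disp + hp + wt + qsec,
                direction = "forward",
                trace = 0)

summary(forward)   # the retained predictors form the reduced attribute subset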

Regression and Log-Linear Models: Parametric Data Reduction

Regression and log-linear models can be used to approximate the given data. In (simple) linear regression, the data are modeled to fit a straight line. For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation

y = wx + b

where the variance of y is assumed to be constant. In the context of data mining, x and y are numeric database attributes. The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively.

Multiple linear regression is an extension of (simple) linear regression, which allows a


response variable, y, to be modeled as a linear function of two or more predictor variables.

Log-linear models approximate discrete multidimensional probability distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space.
• Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from lower-dimensional spaces.
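A small sketch of parametric numerosity reduction with simple linear regression in base R; the built-in cars data set is only illustrative. Instead of storing all the (x, y) points, only the fitted coefficients w and b would be kept.

# Fit a simple linear regression y = w*x + b (dist modeled from speed)
fit <- lm(dist ~ speed, data = cars)

coef(fit)    # the two regression coefficients replace the raw data:
             # "(Intercept)" plays the role of b, "speed" the role of w

# Approximate values can be reconstructed from the model when needed
approx_dist <- predict(fit, newdata = data.frame(speed = c(10, 20)))
print(approx_dist)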

Histograms
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each bucket represents only a single attribute–value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
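A quick sketch in base R (the price values below are made up for illustration), showing an attribute partitioned into equal-width buckets whose frequencies replace the raw values:

# Hypothetical values of an attribute "price"
price <- c(1, 5, 5, 8, 10, 12, 14, 15, 15, 18, 18, 20, 21, 25, 25, 28, 30)

# Partition the distribution into equal-width buckets of width 10
h <- hist(price, breaks = seq(0, 30, by = 10), plot = FALSE)

print(h$breaks)   # bucket boundaries
print(h$counts)   # frequency stored per bucket instead of the raw values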
Clustering
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters.
• Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function.
• The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.
• Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
• In data reduction, the cluster representations of the data are used to replace the actual data, as illustrated in the sketch below.
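A minimal sketch of cluster-based reduction using base R's kmeans(); the simulated data and the choice of three clusters are illustrative assumptions:

set.seed(42)

# 100 objects described by two numeric attributes
x <- matrix(rnorm(200), ncol = 2)

# Partition the objects into 3 clusters
km <- kmeans(x, centers = 3)

print(km$centers)   # 3 centroids can stand in for the 100 original tuples
print(km$size)      # how many objects each centroid represents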

Sampling
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample (or subset). Suppose that a large data set, D, contains N tuples. Let's look at the most common ways that we could sample D for data reduction, as illustrated in Figure 3.9.

Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.

Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
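Both schemes map directly onto base R's sample() function; a brief sketch (the data set D is simulated here for illustration):

D <- 1:1000    # pretend these are the N = 1000 tuples of data set D
s <- 10        # desired sample size

# Simple random sample WITHOUT replacement (SRSWOR)
srswor <- sample(D, size = s, replace = FALSE)

# Simple random sample WITH replacement (SRSWR): a tuple may be drawn again
srswr <- sample(D, size = s, replace = TRUE)

print(srswor)
print(srswr)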
