Introduction to Data Science - II-I Course File 2025-26
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
(DATA SCIENCE)
(2025-2026)
Faculty In-Charge    HOD-CSE
J. Ramesh
SYLLABUS
II Year I Semester
UNIT I: Introduction to Data Science, benefits and uses, facets of data, data science process in brief, big data ecosystem and data science. Data Science process: overview, defining goals and creating a project charter, retrieving data, cleansing, integrating and transforming data, exploratory analysis, model building, presenting findings and building applications on top of them.
UNIT II: Applications of machine learning in Data Science, role of ML in DS, Python tools like sklearn, modelling process for feature engineering, model selection, validation and prediction, types of ML, semi-supervised learning. Handling large data: problems and general techniques for handling large data, programming tips for dealing with large data, case studies on DS projects for predicting malicious URLs and for building recommender systems.
UNIT III: NoSQL movement for handling Big Data: distributing data storage and processing with the Hadoop framework, case study on risk assessment for loan sanctioning, ACID principle of relational databases, CAP theorem, BASE principle of NoSQL databases, types of NoSQL databases, case study on disease diagnosis and profiling.
UNIT IV: Tools and Applications of Data Science: introducing Neo4j for dealing with graph databases, the graph query language Cypher, applications of graph databases, Python libraries like nltk and SQLite for handling text mining and analytics, case study on classifying Reddit posts.
UNIT V: Data Visualization and Prototype Application Development: data visualization options, Crossfilter, the JavaScript MapReduce library, creating an interactive dashboard with dc.js, dashboard development tools. Applying the Data Science process to real-world problem-solving scenarios as a detailed case study.
Textbooks:
1) Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, "Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools", Manning Publications Co., Dreamtech Press, 2016
2) Prateek Gupta, "Data Science with Jupyter", BPB Publications, 2019 (for basics)
Reference Books:
1) Joel Grus, "Data Science from Scratch", O'Reilly, 2019
2) Cathy O'Neil and Rachel Schutt, "Doing Data Science: Straight Talk from the Frontline", 1st Edition, O'Reilly, 2013
UNIT-I
Introduction to Data Science
Common application areas include fraud detection, healthcare diagnostics, and financial modeling.
Data science continues to evolve rapidly, driving innovation across virtually all industries by turning raw data
into actionable insights and intelligent systems.
Key benefits and applications include:
Improved decision-making: helps businesses make informed decisions based on data rather than intuition.
Predictive analytics: anticipates future outcomes based on historical data.
Real-time analytics: enables instant insights in industries like finance (algorithmic trading) and IoT, and is used in ride-sharing apps (Uber, Lyft) for dynamic pricing and route optimization.
Industries transformed range from finance to telecommunications.
Conclusion
Data science transforms raw data into actionable insights, driving innovation across industries. Its benefits
include improved decision-making, automation, cost savings, and enhanced customer experiences. From
healthcare to finance, retail to smart cities, data science is revolutionizing how businesses and organizations
operate.
1. By Structure
a) Structured Data
Examples:
o Relational databases (MySQL, PostgreSQL)
b) Unstructured Data
Examples:
o Text (emails, social media posts)
c) Semi-Structured Data
Examples:
o Web logs, sensor data
2. By Source
a) Internal Data
b) External Data
3. By Nature
a) Quantitative Data
Types:
o Discrete (whole numbers, e.g., "number of customers").
b) Qualitative Data
Non-numerical, descriptive.
Types:
o Nominal (no order, e.g., "gender," "country").
4. By Time Dependency
a) Static (Batch) Data
b) Streaming (Real-Time) Data
5. By Big Data Characteristics (the V's)
a) Volume
Scale of data generated and stored.
b) Velocity
Speed at which data is generated (e.g., social media posts per second).
c) Variety
d) Veracity
6. By Use Case
a) Training Data
b) Test Data
c) Production Data
The Data Science Process
The data science process is a structured approach to solving problems using data. Here's a simplified breakdown:
1. Problem Definition
2. Data Collection
3. Data Cleaning
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Model Building
7. Model Evaluation
8. Deployment
Key Points:
Iterative: Steps often repeat (e.g., revisiting EDA after model failure).
Machine Learning (ML) is a core component of data science, enabling systems to learn from data and make predictions or
decisions. Below are key applications across industries:
1. Predictive Analytics
Sales Forecasting: Predict future sales based on historical trends.
3. Computer Vision
Facial Recognition (e.g., iPhone Face ID, surveillance)
4. Recommendation Systems
E-commerce: Amazon’s "Customers who bought this also bought…"
9. Financial Services
Credit Scoring: Assess loan eligibility using ML models.
Machine Learning (ML) is a core pillar of Data Science (DS), enabling systems to learn from data, identify patterns,
and make decisions with minimal human intervention. Below is a breakdown of its key roles:
1. Automating Data Analysis
Replaces manual processes (e.g., statistical modeling, rule-based systems).
2. Predictive Modeling
Forecasting trends (sales, stock prices, weather).
5. Anomaly Detection
Fraud detection (credit card transactions).
Conclusion
Without ML, Data Science would rely more on manual statistical analysis and rule-based systems, limiting scalability and
efficiency.
Python is the dominant language in Data Science (DS) and Machine Learning (ML), thanks to its rich
ecosystem of libraries. Below are the key Python tools (like scikit-learn) used for ML tasks in DS:
🔹 Scikit-learn (sklearn)
Key Features:
o Simple API for classification, regression, clustering.
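As an illustrative sketch of a typical scikit-learn workflow (the dataset and model choices here are arbitrary placeholders, not prescribed by the course material):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier().fit(X_train, y_train)  # train on the training split
print(model.score(X_test, y_test))  # accuracy on held-out data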
🔹 TensorFlow (tensorflow)
Key Features:
o High-performance numerical computation.
🔹 PyTorch
Key Features:
o Dynamic computation graphs (flexibility).
o Preferred in academia and cutting-edge AI.
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 2)
)
🔹 Pandas (pandas)
Example:
df = pd.read_csv('data.csv')
df.fillna(df.mean(), inplace=True) # Handle missing values
🔹 NumPy (numpy)
Example:
import numpy as np
X = np.array([[1, 2], [3, 4]])
🔹 SciPy (scipy)
Example:
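A minimal sketch of typical SciPy usage, here a two-sample t-test from scipy.stats (the arrays are synthetic sample data invented for illustration):
from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 100)  # simulated control group
b = rng.normal(0.5, 1.0, 100)  # simulated treatment group
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)  # small p-value suggests the group means differ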
🔹 Optuna / Hyperopt
🔹 SHAP / LIME
🔹 Yellowbrick
Visual model diagnostics (feature importance, residuals).
🔹 ONNX Runtime
🔹 MLflow
Hyperparameter Tuning: Optuna, Hyperopt
Final Thoughts
For real-world projects: Use MLflow for tracking & Flask for APIs.
Modeling Process for Feature Engineering
Feature engineering is the process of transforming raw data into meaningful features to improve
ML model performance. Below is a structured step-by-step modeling process for feature
engineering in Python (using libraries like pandas, scikit-learn, and feature-engine).
Step 1: Exploratory Data Analysis (EDA)
Inspect:
o Outliers
o Data distributions
o Correlations
Tools:
import pandas as pd
import seaborn as sns
df = pd.read_csv("data.csv")
print(df.info()) # Check data types & missing values
sns.heatmap(df.corr()) # Correlation matrix
Step 2: Handle Missing Values
1. Drop rows with missing values:
df.dropna(inplace=True)
2. Impute missing values (mean/median/mode):
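A minimal sketch using scikit-learn's SimpleImputer (the column names follow the examples used elsewhere in this section):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")  # or "mean" / "most_frequent"
df[["Age", "Income"]] = imputer.fit_transform(df[["Age", "Income"]])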
Step 3: Encode Categorical Variables
A) Label Encoding (for ordinal categories):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Education"] = le.fit_transform(df["Education"]) # High=2, Medium=1, Low=0
B) One-Hot Encoding (for nominal categories):
df = pd.get_dummies(df, columns=["City"])  # Creates binary columns (e.g., City_NewYork)
C) Target (Mean) Encoding:
from feature_engine.encoding import MeanEncoder
encoder = MeanEncoder(variables=["Category"])
df = encoder.fit_transform(df, df["Target"])
Step 4: Scale/Normalize Numerical Features
A) Standardization (for Gaussian-like data)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["Age", "Income"]] = scaler.fit_transform(df[["Age", "Income"]])
B) Normalization (min-max scaling to [0, 1]):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["Age", "Income"]] = scaler.fit_transform(df[["Age", "Income"]])
Step 5: Create New Features
A) Polynomial / Interaction Features:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = poly.fit_transform(df[["Age", "Income"]])
B) Binning (discretizing continuous values):
df["Age_Group"] = pd.cut(df["Age"], bins=[0, 18, 35, 60, 100], labels=["Child",
"Young", "Adult", "Senior"])
C) Date-Time Features
df["Year"] = pd.to_datetime(df["Date"]).dt.year
df["DayOfWeek"] = pd.to_datetime(df["Date"]).dt.dayofweek
Step 6: Feature Selection
A) Correlation Filter:
corr_matrix = df.corr()
high_corr_features = corr_matrix[abs(corr_matrix["Target"]) > 0.5]
B) Model-Based Feature Importance:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
C) Recursive Feature Elimination (RFE):
from sklearn.feature_selection import RFE
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=5)
rfe.fit(X_train, y_train)
selected_features = X.columns[rfe.support_]
Step 7: Assemble a Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["Age", "Income"]),
    ("cat", categorical_transformer, ["Gender", "City"])
])
full_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier())
])
full_pipeline.fit(X_train, y_train)
Missing Data Handling: fill or drop NaN values (SimpleImputer, KNNImputer)
Final Tips
Model selection is the process of choosing the best algorithm for a given dataset and problem
type. Below is a step-by-step guide with Python examples using scikit-learn.
Hierarchical: nested cluster analysis (AgglomerativeClustering())
A) Train-Test Split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
B) Cross-Validation:
from sklearn.model_selection import cross_val_score
model = RandomForestClassifier()
scores = cross_val_score(model, X_train, y_train, cv=5) # 5-fold CV
print(f"Mean Accuracy: {scores.mean():.2f}")
C) Evaluation Metrics
For Classification:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))
For Regression:
from sklearn.metrics import mean_squared_error, r2_score
print("RMSE:", mean_squared_error(y_test, y_pred, squared=False))
print("R² Score:", r2_score(y_test, y_pred))
4. Hyperparameter Tuning
A) GridSearchCV (Exhaustive Search)
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20]
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Params:", grid.best_params_)
B) RandomizedSearchCV (Faster Alternative)
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=10,
    cv=5
)
random_search.fit(X_train, y_train)
C) Optuna (Advanced Hyperparameter Optimization):
import optuna
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'max_depth': trial.suggest_int('max_depth', 3, 20)
    }
    model = RandomForestClassifier(**params)
    return cross_val_score(model, X_train, y_train, cv=5).mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print("Best Params:", study.best_params)
5. Compare Multiple Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier()
}
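One common way to compare these candidates, assuming X_train and y_train from the earlier split, is a small cross-validation loop (a sketch, not the only approach):
from sklearn.model_selection import cross_val_score
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold CV per model
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")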
Trade-offs to Consider:
✔ Accuracy vs. Speed (e.g., XGBoost vs. Logistic Regression)
✔ Interpretability (e.g., decision trees vs. neural networks)
✔ Overfitting Risk (simple models generalize better)
6. Train the Final Model
best_model = RandomForestClassifier(n_estimators=100, max_depth=10)
best_model.fit(X_train, y_train)
Validation ensures your model generalizes well to unseen data, while prediction is the final step
where the model makes real-world inferences. Below is a structured breakdown with Python
examples.
1. Validation Techniques
A) Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
B) K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5) # 5-fold CV
print(f"Mean Accuracy: {scores.mean():.2f} (±{scores.std():.2f})")
C) Stratified K-Fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
D) Time-Series Split (for temporal data):
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
B) For Regression
3. Making Predictions
A) Single Prediction
# Train model
model = RandomForestClassifier().fit(X_train, y_train)
# Predict probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probability of class=1
B) Batch Prediction:
new_data = pd.read_csv("new_samples.csv")
new_data_processed = preprocessor.transform(new_data) # Apply same preprocessing
predictions = model.predict(new_data_processed)
C) Uncertainty Estimates (prediction intervals):
# For Bayesian models (e.g., with scikit-learn-compatible libraries like pyro or pymc3);
# `predict_quantiles` is illustrative here; the exact method name depends on the library
confidence_intervals = model.predict_quantiles(X_test, quantiles=[0.05, 0.95])
4. Validating Predictions
A) Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
B) Calibration Curves
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_test, y_pred_proba, n_bins=10)
plt.plot(prob_pred, prob_true)
C) Residual Analysis (for regression):
residuals = y_test - y_pred
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red')
5. Common Pitfalls & Fixes
Data Leakage: ensure preprocessing (e.g., scaling) is fit only on training data.
Uncertain Predictions: use models with probability estimates (e.g., predict_proba).
Summary Workflow
Machine Learning (ML) can be broadly categorized into three main types, each with distinct
approaches and applications. Below is a clear breakdown with examples and use cases.
1. Supervised Learning
Definition: The model learns from labeled data (input-output pairs) to make predictions.
A) Classification
Algorithms:
o Logistic Regression
o Decision Trees
o Random Forest
Use Cases:
o Email spam detection
o Sentiment analysis
B) Regression
Algorithms:
o Linear Regression
o Polynomial Regression
o Ridge/Lasso Regression
Use Cases:
o Stock price forecasting
o Weather prediction
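To make the idea concrete, a minimal (toy) supervised regression sketch; the numbers below are invented purely for illustration:
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4]])   # inputs (features)
y = np.array([2.1, 3.9, 6.2, 8.1])   # labeled outputs
model = LinearRegression().fit(X, y)  # learns from input-output pairs
print(model.predict([[5]]))           # predicts roughly 10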
2. Unsupervised Learning
Definition: The model finds patterns in unlabeled data (no predefined outputs).
A) Clustering
Algorithms:
o K-Means
o DBSCAN
o Hierarchical Clustering
Use Cases:
o Customer segmentation
o Image compression
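A small sketch of clustering with K-Means on toy data containing two obvious groups (no labels are supplied):
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)  # two discovered groups, e.g., [1 1 1 0 0 0]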
B) Dimensionality Reduction
Algorithms:
o PCA (Principal Component Analysis)
o t-SNE
Use Cases:
o Visualizing high-dimensional data
o Speeding up ML models
C) Association Rule Learning
Algorithms:
o Apriori
o FP-Growth
Use Cases:
o Market basket analysis (e.g., "Customers who buy X also buy Y")
o Recommendation systems
3. Reinforcement Learning
Definition: An agent learns by interacting with an environment, receiving rewards or penalties for its actions.
Key Components:
Agent: The learner/decision-maker.
Reward Signal: Feedback for actions (e.g., +1 for winning, -1 for losing).
Algorithms:
Q-Learning
Use Cases:
Autonomous vehicles
Other Learning Paradigms
1. Semi-Supervised Learning
Combines a small amount of labeled data with a large amount of unlabeled data (detailed in the next section).
Use Cases:
o Speech recognition
2. Self-Supervised Learning
Generates labels from data itself (e.g., predicting missing parts of an image).
Use Cases:
o Pretraining large language models (GPT, BERT)
3. Transfer Learning
Reuses a pre-trained model for a new task (e.g., fine-tuning ResNet for medical images).
Use Cases:
o Computer vision (object detection)
o NLP (text classification)
Summary Table
Transfer Learning: fine-tuning pre-trained models (BERT, ResNet); example use case: medical imaging.
Key Takeaways
Semi-supervised learning (SSL) is a hybrid approach that leverages both small amounts of
labeled data and large amounts of unlabeled data to improve model performance. It’s
especially useful when labeling data is expensive or time-consuming (e.g., medical imaging, speech
recognition).
SSL uses both to train a better model than supervised learning alone.
A) Self-Training
The model trains on the labeled data, then iteratively adds its own most confident predictions on unlabeled data as pseudo-labels. A minimal sketch (scikit-learn marks unlabeled samples with y = -1):
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier
base = RandomForestClassifier()
model = SelfTrainingClassifier(base)
model.fit(X_mixed, y_mixed)  # y_mixed uses -1 for unlabeled samples
B) Co-Training
Two models train on different feature subsets and teach each other.
Requires two independent feature sets (e.g., image pixels + text captions).
C) Label Propagation
Works well when data has a clear manifold structure (e.g., clusters).
Python Example:
from sklearn.semi_supervised import LabelPropagation
model = LabelPropagation(kernel='knn', n_neighbors=10)
# y should mark unlabeled samples with -1 so labels can propagate to them
model.fit(X_labeled, y_labeled)
D) Consistency Regularization
Forces the model to produce similar outputs for slightly perturbed inputs (e.g., image rotations).
# Pseudocode for consistency loss
def consistency_loss(unlabeled_images):
    augmented_1 = augment(unlabeled_images)
    augmented_2 = augment(unlabeled_images)
    predictions_1 = model(augmented_1)
    predictions_2 = model(augmented_2)
    return mse(predictions_1, predictions_2)
Computer Vision: medical image segmentation, e.g., tumor detection with few labeled MRI scans.
Summary
Best for: Cost-sensitive domains like healthcare, NLP, and speech processing.
UNIT-III
The NoSQL (Not Only SQL) movement emerged as a response to the limitations of
traditional relational databases (RDBMS) in handling Big Data (high volume, velocity, and
variety). NoSQL databases provide scalability, flexibility, and high performance for modern
data-intensive applications.
A) Document Databases (e.g., MongoDB, CouchDB)
Use Cases:
o Content management systems (CMS)
db.users.insertOne({ name: "Alice", age: 30, hobbies: ["hiking", "coding"] });
B) Key-Value Stores (e.g., Redis, DynamoDB)
Use Cases:
o Caching (Redis)
o Session management
SET user:1001 "Alice"
GET user:1001 # Returns "Alice"
C) Column-Family Stores (e.g., Cassandra, HBase)
Use Cases:
o Time-series data (IoT sensors)
CREATE TABLE users (id UUID PRIMARY KEY, name TEXT, email TEXT);
D) Graph Databases (e.g., Neo4j, Amazon Neptune)
Use Cases:
o Social networks (friend recommendations)
CREATE (Alice:Person {name: "Alice"})-[:FRIENDS_WITH]->(Bob:Person {name: "Bob"});
NoSQL vs. relational databases:
ACID Compliance: often sacrificed for speed in NoSQL; strictly enforced in relational databases.
Best For: NoSQL suits unstructured/semi-structured data; relational databases suit structured data with relations.
df = spark.read.format("mongo").load()
df.filter(df["age"] > 25).show()
Azure: Cosmos DB
5. Challenges of NoSQL
No Standard Query Language (varies by DB type).
Conclusion
Distributing Data Storage and Processing with Hadoop
A) HDFS (Hadoop Distributed File System)
Distributed storage that splits large files into blocks (default: 128 MB/256 MB).
Key Features:
o Fault Tolerance: Replicates blocks across nodes (default: 3 copies).
HDFS Architecture:
NameNode: stores file-system metadata and coordinates access.
DataNode: stores the actual data blocks.
B) YARN (Resource Management)
Key Components:
o ResourceManager (RM): Global cluster resource manager.
o NodeManager (NM): Per-node agent that launches and monitors containers.
C) MapReduce
Two Phases:
1. Map: Processes input data (filtering, sorting).
2. Reduce: Aggregates the intermediate results (e.g., summing counts).
Example (Java word count):
// Mapper: emits (word, 1) for each word in the input line
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] words = value.toString().split(" ");
    for (String word : words) {
        context.write(new Text(word), new IntWritable(1));
    }
}
// Reducer: sums the counts emitted for each word
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
HDFS splits files into blocks and distributes them across DataNodes.
Map tasks run on nodes where the data resides (minimizes network transfer).
Hadoop MapReduce vs. Spark:
Ease of Use: Hadoop MapReduce is low-level (Java-heavy); Spark offers high-level APIs (Python/Scala).
Fault Tolerance: Hadoop recomputes failed tasks; Spark uses RDDs/DAGs for recovery.
Azure: HDInsight
Example: creating a cluster on AWS EMR from the command line:
aws emr create-cluster --name "Hadoop Cluster" \
--release-label emr-6.5.0 \
--applications Name=Hadoop Name=Hive \
--instance-type m5.xlarge \
--instance-count 3
6. Limitations of Hadoop
Not for real-time processing (use Spark/Flink instead).
Conclusion
While Spark and cloud data lakes (e.g., S3 + Athena) are replacing some Hadoop use cases, it
remains vital for large-scale batch processing.
UNIT-1
Introduction: What Is Data Science?
Data science combines math and statistics, specialized programming, advanced analytics,
artificial intelligence (AI), and machine learning with specific subject matter expertise to
uncover actionable insights hidden in an organization’s data. These insights can be used to
guide decision making and strategic planning.
The accelerating volume of data sources, and subsequently data, has made data science one of the fastest growing fields across every industry. It is increasingly critical to businesses: the insights that data science generates help organizations increase operational efficiency, identify new business opportunities and improve marketing and sales programs, among other benefits. Ultimately, they can lead to competitive advantages over business rivals.
1. There’s a lack of definitions around the most basic terminology. What is “Big Data”
anyway?Whatdoes“datascience”mean?WhatistherelationshipbetweenBigDataanddata
science? Is data science the science of Big Data? Is data science only the stuff going on in
companies like Google and Facebook and tech companies? Whydo manypeople refer to Big
Data ascrossingdisciplines (astronomy, finance,tech, etc.)and todata science as onlytaking
place intech?Justhow bigisbig? Oris itjusta relative term?These termsare soambiguous,
they’re well-nigh meaningless.
2. There’sadistinctlackofrespectfortheresearchersinacademiaandindustrylabswho
havebeenworkingonthiskindofstuffforyears,andwhoseworkisbasedondecades(insome cases,
centuries) of work bystatisticians, computer scientists, mathematicians, engineers, and
scientistsof alltypes. From the waythe mediadescribesit,machinelearningalgorithmswere just
invented last week and data was never “big” until Google came along. This is simplynot the
case. Many of the methods and techniques we’re using—and the challenges we’re facing now
—arepartoftheevolutionofeverythingthat’scomebefore.Thisdoesn’tmeanthatthere’s
notnewandexcitingstuffgoingon,butwethinkit’simportanttoshowsomebasicrespectfor
everything that came before.
3. The hype is crazy: people throw around tired phrases straight out of the height of the pre-financial crisis era like "Masters of the Universe" to describe data scientists, and that doesn't bode well. In general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what's good underneath it all, if anything.
4. Statisticians already feel that they are studying and working on the "Science of Data." That's their bread and butter. Maybe you, dear reader, are not a statistician and don't care, but imagine that for the statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it's simply statistics or machine learning in the context of the tech industry. People have said to us, "Anything that has to call itself a science isn't." Although there might be truth in there, that doesn't mean that the term "data science" itself represents nothing, but of course what it represents may not be science but more of a craft.
Getting Past the Hype
Rachel’s experience going from getting a PhD in statistics to working at Google is a great
example to illustrate why we thought, in spite of the aforementioned reasons to be dubious,
there might be some meat in the data science sandwich. In her words:
It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had learned at school when I got my PhD in statistics. This is not to say that my degree was useless; far from it: what I'd learned in school provided a framework and way of thinking that I relied on daily, and much of the actual content provided a solid theoretical and practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn't learned in school. Of course, my experience is specific to me in the sense that I had a statistics background and picked up more computation, coding, and visualization skills, as well as domain expertise, while at Google. Another person coming in as a computer scientist or a social scientist or a physicist would have different gaps and would fill them in accordingly. But what is important here is that, as individuals, we each had different strengths and gaps, yet we were able to solve problems by putting ourselves together into a data team well-suited to solving the data problems that came our way.
Rachel gave herself the task of understanding the cultural phenomenon of data science and how others were experiencing it. She started meeting with people at Google, at startups and tech companies, and at universities, mostly from within statistics departments. From those meetings she started to form a clearer picture of the new thing that's emerging. She ultimately decided to continue the investigation by giving a course at Columbia called "Introduction to Data Science," which Cathy covered on her blog.
Datafication
In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor Mayer-Schoenberger wrote an article called "The Rise of Big Data". In it they discuss the concept of datafication, and their example is how we quantify friendships with "likes": it's the way everything we do, online or otherwise, ends up recorded for later examination in someone's data storage units. Or maybe multiple storage units, and maybe also for sale.
Datafication is an interesting concept and led us to consider its importance with respect to people's intentions about sharing their own data. We are being datafied, or rather our actions are, and when we "like" someone or something online, we are intending to be datafied, or at least we should expect to be. But when we merely browse the Web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors, cameras, or Google glasses.
This spectrum of intentionality ranges from us gleefully taking part in a social media experiment we are proud of, to all-out surveillance and stalking. But it's all datafication. Our intentions may run the gamut, but the results don't.
They follow up their definition in the article with a line that speaks volumes about their perspective:
Once we datafy things, we can transform their purpose and turn the information into new forms of value.
Here's an important question that we will come back to throughout the book: who is "we" in that case? What kinds of value do they refer to? Mostly, given their examples, the "we" is the modelers and entrepreneurs making money from getting people to buy stuff, and the "value" translates into something like increased efficiency through automation.
The Current Landscape
So, what is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is it pure hype? And if it's new and if it's real, what does that mean?
Driscoll’sanswer:
Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010, shown in
Figure 1-1.
He also mentions the sexy skills of data geeks from Nathan Yau's 2009 post, "Rise of the Data Scientist", which include:
• Statistics (traditional analysis you're used to thinking about)
• Data munging (parsing, scraping, and formatting data)
• Visualization (graphs, tools, etc.)
But wait, is data science just a bag of tricks? Or is it the logical extension of other fields like statistics and machine learning? For one argument, see Cosma Shalizi's posts here and here, and Cathy's posts here and here, which constitute an ongoing discussion of the difference between a statistician and a data scientist. Cosma basically argues that any statistics department worth its salt does all the stuff in the descriptions of data science that he sees, and therefore data science is just a rebranding and unwelcome takeover of statistics. For a slightly different perspective, see ASA President Nancy Geller's 2011 Amstat News article, "Don't shun the 'S' word", in which she defends statistics:
We need to tell people that Statisticians are the ones who make sense of the data deluge occurring in science, engineering, and medicine; that statistics provides methods for data analysis in all fields, from art history to zoology; that it is exciting to be a Statistician in the 21st century because of the many challenges brought about by the data explosion in all of these fields.
Though we get her point (the phrase "art history to zoology" is supposed to represent the concept of A to Z), she's kind of shooting herself in the foot with these examples because they don't correspond to the high-tech world where much of the data explosion is coming from.
Much of the development of the field is happening in industry, not academia. That is, there are people with the job title data scientist in companies, but no professors of data science in academia. (Though this may be changing.)
Not long ago, DJ Patil described how he and Jeff Hammerbacher, then at LinkedIn and Facebook, respectively, coined the term "data scientist" in 2008. So that is when "data scientist" emerged as a job title. (Wikipedia finally gained an entry on data science in 2012.) It makes sense to us that once the skill set required to thrive at Google, working with a team on problems that required a hybrid skill set of stats and computer science paired with personal characteristics including curiosity and persistence, spread to other Silicon Valley tech companies, it required a new job title. Once it became a pattern, it deserved a name. And once it got a name, everyone and their mother wanted to be one. It got even worse when Harvard Business Review declared data scientist to be the "Sexiest Job of the 21st Century".
The Role of the Social Scientist in Data Science
Both LinkedIn and Facebook are social network companies. Oftentimes a description or definition of data scientist includes hybrid statistician, software engineer, and social scientist. This made sense in the context of companies where the product was a social product and still makes sense when we're dealing with human or user behavior. But if you think about Drew Conway's Venn diagram, data science problems cross disciplines; that's what the substantive expertise is referring to. In other words, it depends on the context of the problems you're trying to solve.
If they’re social science-yproblems like friend recommendations or people you know or user
segmentation,then byall means, bringon the social scientist!Social scientists alsodo tend to be
good question askers and have other good investigative qualities, so a social scientist who
also has the quantitative and programming chops makes a great data scientist. But it’s almost
a “historical” (historical is in quotes because 2008 isn’t that long ago) artifact to limit your
conception of a data scientist to someone who works only with online user behavior data.
Data Science Jobs
Columbia just decided to start an Institute for Data Sciences and Engineering with Bloomberg's help. There are 465 job openings in New York City alone for data scientists last time we checked. That's a lot. So even if data science isn't a real field, it has real jobs. And here's one thing we noticed about most of the job descriptions: they ask data scientists to be experts in computer science, statistics, communication, data visualization, and to have extensive domain expertise. Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise; together, as a team, they can specialize in all those things. We'll talk about this more after we look at the composite set of skills in demand for today's data scientists.
Statistical Inference
The world we live in is complex, random, and uncertain. At the same time, it's one big data-generating machine. As we commute to work on subways and in cars, as our blood moves through our bodies, as we're shopping, emailing, procrastinating at work by browsing the Internet and watching the stock market, as we're building things, eating things, talking to our friends and family about things, while factories are producing products, this all at least potentially produces data.
Imagine spending 24 hours looking out the window, and for every minute, counting and recording the number of people who pass by. Or gathering up everyone who lives within a mile of your house and making them tell you how many email messages they receive every day for the next year. Imagine heading over to your local hospital and rummaging around in the blood samples looking for patterns in the DNA. That all sounded creepy, but it wasn't supposed to. The point here is that the processes in our lives are actually data-generating processes.
We'd like ways to describe, understand, and make sense of these processes, in part because as scientists we just want to understand the world better, but many times, understanding these processes is part of the solution to problems we're trying to solve. Data represents the traces of the real-world processes, and exactly which traces we gather are decided by our data collection or sampling method. You, the data scientist, the observer, are turning the world into data, and this is an utterly subjective, not objective, process. After separating the process from the data collection, we can see clearly that there are two sources of randomness and uncertainty. Namely, the randomness and uncertainty underlying the process itself, and the uncertainty associated with your underlying data collection methods. Once you have all this data, you have somehow captured the world, or certain traces of the world. But you can't go walking around with a huge Excel spreadsheet or database of millions of transactions and look at it and, with a snap of a finger, understand the world and the process that generated it. So you need a new idea, and that's to simplify those captured traces into something more comprehensible, into something that somehow captures it all in a much more concise way, and that something could be mathematical models or functions of the data, known as statistical estimators.
This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference. More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
Populations and Samples
Let's get some terminology and concepts in place to make sure we're all talking about the same thing. In classical statistical literature, a distinction is made between the population and the sample. The word population immediately makes us think of the entire US population of 300 million people, or the entire world's population of 7 billion people. But put that image out of your head, because in statistical inference population isn't used to simply describe only people. It could be any set of objects or units, such as tweets or photographs or stars. If we could measure the characteristics or extract characteristics of all those objects, we'd have a complete set of observations, and the convention is to use N to represent the total number of observations in the population. Suppose your population was all emails sent last year by employees at a huge corporation, BigCorp. Then a single observation could be a list of things: the sender's name, the list of recipients, date sent, text of email, number of characters in the email, number of sentences in the email, number of verbs in the email, and the length of time until first reply.
When we take a sample, we take a subset of the units of size n in order to examine the observations to draw conclusions and make inferences about the population. There are different ways you might go about getting this subset of data, and you want to be aware of this sampling mechanism because it can introduce biases into the data, and distort it, so that the subset is not a "mini-me" shrunk-down version of the population. Once that happens, any conclusions you draw will simply be wrong and distorted.
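A toy numeric illustration of the population/sample distinction (all numbers below are simulated, not real data):
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=120.0, size=1_000_000)  # pretend N = 1,000,000 email lengths
sample = rng.choice(population, size=1000, replace=False)  # a simple random sample of size n = 1,000
print(population.mean(), sample.mean())  # an unbiased sample mean lands near the population mean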
Modeling
In the next chapter, we'll look at how we build models from the data we collect, but first we want to discuss what we even mean by this term. Rachel had a recent phone conversation with someone about a modelling workshop, and several minutes into it she realized the word "model" meant completely different things to them. He was using it to mean data models, the representation one is choosing to store one's data, which is the realm of database managers, whereas she was talking about statistical models, which is what much of this book is about.
OneofAndrewGelman’sblogpostsonmodelingwasrecentlytweetedbypeopleinthefashion
industry, but that’s a different issue. Even if you’ve used the terms statistical model or
mathematical model for years, is it even clear to yourself and to the people you’re talking to
what you mean? What makes a model a model? Also, while we’re asking fundamental
questions like this, what’s the difference between a statistical model and a machine learning
algorithm? Before we dive deeply into that, let’s add a bit of context with this deliberately
provocativeWiredmagazinepiece,“TheEndofTheory:TheDataDelugeMakestheScientific
Method Obsolete,” published in 2008 by Chris Anderson, then editor-in-chief. Anderson
equatesmassiveamountsofdatatocompleteinformation andarguesnomodelsarenecessary
and“correlationisenough”;e.g.,thatinthecontextofmassiveamountsofdata,“they[Google] don’t
have to settle for models at all.”
Really? We don’t think so, and we don’t think you’ll think so either by the end of the book.
But the sentiment is similar to the Cukier and Mayer-Schoenberger article we just discussed
about N=ALL, so you might already be getting a sense of the profound confusion we’re
witnessing all around us. To their credit, it’s the press that’s currently raising awareness of
thesequestionsandissues,andsomeonehastodoit.Evenso,it’shardtotakewhentheopinion makers
are people who don’t actuallywork with data. Thinkcriticallyabout whether you buy what
Anderson is saying; where you agree, disagree, or where you need more information to
formanopinion.Giventhatthisishowthepopularpressiscurrentlydescribingandinfluencing
publicperceptionofdatascienceandmodeling,it’sincumbent uponusasdatascientiststobe
awareofitandtochimeinwithinformedcomments.Withthatcontext,then,whatdowemean when
we say models? And how do we use them as data scientists? To get at these questions, let’s
dive in.
What is a model?
Humans try to understand the world around them by representing it in different ways. Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions. Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids. Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself. A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.
A model is an artificial construction where all extraneous detail has been removed or abstracted. Attention must always be paid to these abstracted details after a model has been analyzed to see what might have been overlooked. In the case of proteins, a model of the protein backbone with side chains by itself is removed from the laws of quantum mechanics that govern the behavior of the electrons, which ultimately dictate the structure and actions of proteins. In the case of a statistical model, we may have mistakenly excluded key variables, included irrelevant ones, or assumed a mathematical structure divorced from reality.
Statistical modelling
Before you get too involved with the data and start coding, it's useful to draw a picture of what you think the underlying process might be with your model. What comes first? What influences what? What causes what? What's a test of that? But different people think in different ways. Some prefer to express these kinds of relationships in terms of math. The mathematical expressions will be general enough that they have to include parameters, but the values of these parameters are not yet known. In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data. So, for example, if you have two columns of data, x and y, and you think there's a linear relationship, you'd write down y = β0 + β1x. You don't know what β0 and β1 are in terms of actual numbers yet, so they're the parameters.
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time. This gives them an abstract picture of the relationships before choosing equations to express them.
Probability distributions
Probability distributions are the foundation of statistical models. When we get to linear regression and Naive Bayes, you will see how this happens in practice. One can take multiple semesters of courses on probability theory, and so it's a tall challenge to condense it down for you in a small section.
Back in the day, before computers, scientists observed real-world phenomena, took measurements, and noticed that certain mathematical shapes kept reappearing. The classical example is the height of humans, following a normal distribution, a bell-shaped curve, also called a Gaussian distribution, named after Gauss. Other common shapes have been named after their observers as well (e.g., the Poisson distribution and the Weibull distribution), while other shapes such as Gamma distributions or exponential distributions are named after associated mathematical objects.
In addition to denoting distributions of single random variables with functions of one variable, we use multivariate functions called joint distributions to do the same thing for more than one random variable. So in the case of two random variables, for example, we could denote our distribution by a function p(x, y), and it would take values in the plane and give us nonnegative values. In keeping with its interpretation as a probability, its (double) integral over the whole plane would be 1.
When we observe data points, i.e., (x1, y1), (x2, y2), ..., (xn, yn), we are observing realizations of a pair of random variables. When we have an entire dataset with n rows and k columns, we are observing n realizations of the joint distribution of those k random variables.
Fitting a model
Fitting a model means that you estimate the parameters of the model using the observed data. You are using your data as evidence to help approximate the real-world mathematical process that generated the data. Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to help get the parameters.
Fitting the model is when you start actually coding: your code will read in the data, and you'll specify the functional form that you wrote down on the piece of paper. Then R or Python will use built-in optimization methods to give you the most likely values of the parameters given the data. As you gain sophistication, or if this is one of your areas of expertise, you'll dig around in the optimization methods yourself. Initially you should have an understanding that optimization is taking place and how it works, but you don't have to code this part yourself; it underlies the R or Python functions.
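As a concrete sketch, fitting the linear model y = β0 + β1x by ordinary least squares (which coincides with maximum likelihood under Gaussian noise); the data is simulated here so the true parameters are known:
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 200)  # true beta0 = 2, beta1 = 3, plus noise
beta1, beta0 = np.polyfit(x, y, deg=1)     # polyfit returns coefficients, highest degree first
print(beta0, beta1)                        # estimates close to 2 and 3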
Overfitting
Throughout the book you will be cautioned repeatedly about overfitting, possibly to the point you will have nightmares about it. Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data. You might know this because you have tried to use it to predict labels for another set of data that you didn't use to fit the model, and it doesn't do a good job, as measured by an evaluation metric such as accuracy.
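A small sketch of how overfitting shows up in practice, using a deliberately unconstrained decision tree on a noisy synthetic dataset (all names and parameters are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)  # a deep tree memorizes the training set
print("train accuracy:", model.score(X_train, y_train))  # typically near 1.0
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower: overfitting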
Basics of R
Introduction
R is a programming language and software environment for statistical analysis, graphics representation and reporting.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.
This programming language was named R, based on the first letter of the first name of the two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language S.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions.
R allows integration with procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R:
R is a well-developed, simple and effective programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display either directly at the computer or for printing on paper.
R - Environment Setup
1. Installation of R and RStudio.
In Linux (through Terminal):
Press Ctrl+Alt+T to open Terminal.
Then execute sudo apt-get update
After that, sudo apt-get install r-base
In Windows:
Install R on Windows
Install RStudio on Windows
Programming with R
R - Basic Syntax
To output text in R, use single or double quotes:
"Hello World!"
To output numbers, just type the number (without quotes):
5
10
25
To do simple calculations, add numbers together:
2 + 3
Output: 5
R Print Output
Print
Unlike many other programming languages, you can output values in R without using a print function:
Example
"Hello!"
However, R does have a print() function available if you want to use it.
Example
print("Hello!")
R Comments
Comments
Comments start with a #:
# This is a comment
"Hello!"
"Hello World!" # This is a comment
Multiline Comments
Unlike other programming languages, such as Java, there is no syntax in R for multiline comments. However, we can just insert a # for each line to create multiline comments:
Example
# This is a comment
# written in
# more than just one line
"Hello!"
R Variables
Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable. A variable is created the moment you first assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type the variable name:
Example
name <- "John"
age <- 40
name # output "John"
age # output 40
However, <- is preferred in most cases because the = operator can be forbidden in some contexts in R.
Print / Output Variables
Example
name <- "John"
name
Concatenate Elements
Example
text <- "awesome"
paste("R is", text)
Example
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
For numbers, the + character works as a mathematical operator:
Example
num1 <- 5
num2 <- 10
num1 + num2
If you try to combine a string (text) and a number, R will give you an error:
Example
num <- 5
text <- "Some text"
num + text
Result:
Error in num + text : non-numeric argument to binary operator
Multiple Variables
R allows you to assign the same value to multiple variables in one line:
Example
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
# Print variable values
var1
var2
var3
Variable Names
A variable can have a short name (like x and y) or a more descriptive name (age, carname, total_volume). Rules for R variables are:
A variable name must start with a letter and can be a combination of letters, digits, period (.) and underscore (_).
If it starts with a period (.), it cannot be followed by a digit.
A variable name cannot start with a number or underscore (_).
Variable names are case-sensitive (age, Age and AGE are three different variables).
Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...).
# Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"
# Illegal variable names:
2myvar <- "John"
my-var <- "John"
my var <- "John"
_my_var <- "John"
my_v@ar <- "John"
TRUE <- "John"
R Data Types
Variables can store data of different types, and different types can do different things.
Example
my_var <- 30
my_var <- "raghul"
Basic Data Types
Basic data types in R can be divided into the following types: numeric, integer, complex, character (string) and logical (boolean).
We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
R Numbers
Numbers
There are three number types in R:
numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example
x <- 10.5 # numeric
y <- 10L  # integer
z <- 1i   # complex
Numeric
Example
x <- 10.5
y <- 55
# Print values of x and y
x
y
# Print the class name of x and y
class(x)
class(y)
Integer
Integers are numeric data without decimals. This is used when you are certain that you will never create a variable that should contain decimals. To create an integer variable, you must use the letter L after the integer value:
Example
x <- 1000L
y <- 55L
# Print values of x and y
x
y
# Print the class name of x and y
class(x)
class(y)
Complex
Example
x <- 3+5i
y <- 5i
# Print values of x and y
x
y
# Print the class name of x and y
class(x)
class(y)
Type Conversion
You can convert from one type to another with the following functions:
as.numeric()
as.integer()
as.complex()
Example
x <- 1L # integer
y <- 2  # numeric
# convert from integer to numeric:
a <- as.numeric(x)
# convert from numeric to integer:
b <- as.integer(y)
# print values of x and y
x
y
# print the class name of a and b
class(a)
class(b)
UNIT-2
At the most basic level, attributes are not about numbers or symbols. However, to discuss and more precisely analyze the characteristics of objects, we assign numbers or symbols to them. To do this in a well-defined way, we need a measurement scale.
The Type of an Attribute
In other words, the values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa. This is illustrated with two examples.
Example 1 (Employee Age and ID Number). Two attributes that might be associated with an employee are ID and age (in years). Both of these attributes can be represented as integers. However, while it is reasonable to talk about the average age of an employee, it makes no sense to talk about the average employee ID.
Indeed, the only aspect of employees that we want to capture with the ID attribute is that they are distinct. Consequently, the only valid operation for employee IDs is to test whether they are equal. There is no hint of this limitation, however, when integers are used to represent the employee ID attribute. For the age attribute, the properties of the integers used to represent age are very much the properties of the attribute. Even so, the correspondence is not complete since, for example, ages have a maximum, while integers do not.
Consider the figure below, which shows some objects (line segments) and how the length attribute of these objects can be mapped to numbers in two different ways. Each successive line segment, going from the top to the bottom, is formed by appending the topmost line segment to itself. Thus, the second line segment from the top is formed by appending the topmost line segment to itself twice, the third line segment from the top is formed by appending the topmost line segment to itself three times, and so forth. In a very real (physical) sense, all the line segments are multiples of the first. This fact is captured by the measurements on the right-hand side of the figure, but not by those on the left-hand side.
More specifically, the measurement scale on the left-hand side captures only the ordering of the length attribute, while the scale on the right-hand side captures both the ordering and additivity properties. Thus, an attribute can be measured in a way that does not capture all the properties of the attribute. The type of an attribute should tell us what properties of the attribute are reflected in the values used to measure it. Knowing the type of an attribute is important because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, and therefore, it allows us to avoid foolish actions, such as computing the average employee ID.
Note that it is common to refer to the type of an attribute as the type of a measurement scale. Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio. Table 2.2 gives the definitions of these types, along with information about the statistical operations that are valid for each type. (Briefly: nominal values support only equality tests; ordinal values add ordering; interval values add meaningful differences; ratio values add meaningful ratios.) Each attribute type possesses all of the properties and operations of the attribute types above it. Consequently, any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the definition of the attribute types is cumulative. However, this does not mean that the operations appropriate for one attribute type are appropriate for the attribute types above it.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector types.
print(TRUE)
print(2+3i)
print(charToRaw('hello'))
Multiple Elements Vector
Using colon operator with numeric data:
v <- 5:13
print(v)
Using sequence (seq) operator:
# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))
Using the c() function:
The non-character values are coerced to character type if one of the elements is a character.
# The logical and numeric values are converted to characters.
s <- c('apple', 'red', 5, TRUE)
print(s)
When we execute the above code, it produces the following result:
[1] "apple" "red" "5" "TRUE"
Accessing Vector Elements
Elements of a vector are accessed using indexing. The [ ] brackets are used for indexing. Indexing starts with position 1. Giving a negative value in the index drops that element from the result. TRUE, FALSE or 0 and 1 can also be used for indexing.
# Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)
# Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
# Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)
# Accessing vector elements using 0/1 indexing.
y <- t[c(0,0,0,0,0,0,1)]
print(y)
Output:
[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
Vector Manipulation
Vector arithmetic
Two vectors of the same length can be added, subtracted, multiplied or divided, giving the result as a vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector subtraction.
sub.result <- v1 - v2
print(sub.result)
# Vector multiplication.
multi.result <- v1 * v2
print(multi.result)
# Vector division.
divi.result <- v1 / v2
print(divi.result)
Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
add.result <- v1 + v2
print(add.result)
sub.result <- v1 - v2
print(sub.result)
Vector Element Sorting
Elements in a vector can be sorted using the sort() function.
v <- c(3,8,4,5,0,11,-9,304)
# Sort the elements in the reverse order.
result <- sort(v, decreasing = TRUE)
print(result)
# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
result <- sort(v)
print(result)
# Sorting character vectors in reverse order.
result <- sort(v, decreasing = TRUE)
print(result)
Types of vectors
Vectors are of different types which are used in R. Following are some of the types of vectors:
Numeric vectors
Numeric vectors are those which contain numeric values such as integer, float, etc.
# R program to create numeric vectors
# by default, numbers are stored as double
v1 <- c(4, 5.6, 7)
typeof(v1)
# the L suffix creates integer values
v2 <- c(4L, 5L, 7L)
typeof(v2)
Output:
[1] "double"
[1] "integer"
Character vectors
Character vectors contain alphanumeric values and special characters.
# R program to create Character Vectors
# by default numeric values are converted into characters
v1 <- c('geeks', '2', 'hello', 57)
# Displaying type of vector
typeof(v1)
Output:
[1] "character"
Logical vectors
Logical vectors contain boolean values such as TRUE, FALSE and NA for Null values.
# R program to create Logical Vectors
# using c() function
v1 <- c(TRUE, FALSE, TRUE, NA)
# Displaying type of vector
typeof(v1)
Output:
[1] "logical"
Modifying a vector
Modification of a vector is the process of applying some operation on an individual element of a vector to change its value in the vector.
X <- c(2, 7, 9, 7, 8, 2)
# modify a specific element
X[3] <- 1
X[2] <- 9
cat('subscript operator', X, '\n')
# Modify using different logics.
X[X > 5] <- 0
cat('Logical indexing', X, '\n')
# Modify by specifying the position or elements.
X <- X[c(3, 2, 6)]
cat('combine() function', X)
Output:
subscript operator 2 9 1 7 8 2
Logical indexing 2 0 1 0 0 2
combine() function 1 0 2
Deleting a vector
Deletion of a vector is the process of deleting all of the elements of the vector. This can be done by assigning it to a NULL value.
X <- c(8, 10, 2, 5)
# Delete the vector by assigning it to NULL
X <- NULL
cat('Output vector', X)
Output:
Output vector NULL
Sorting elements of a Vector
The sort() function is used to sort the values in ascending or descending order.
# Creation of vector
X <- c(8, 2, 7, 1, 11, 2)
# Sort in ascending order
A <- sort(X)
cat('ascending order', A, '\n')
# Sort in descending order
B <- sort(X, decreasing = TRUE)
cat('descending order', B)
Output:
ascending order 1 2 2 7 8 11
descending order 11 8 7 2 2 1
Creating named vectors
A named vector can be created in several ways. With c():
xc <- c('a' = 5, 'b' = 6, 'c' = 7, 'd' = 8)
which results in:
> xc
a b c d
5 6 7 8
With setNames():
x <- 5:8
y <- letters[1:4]
xy <- setNames(x, y)
which results in a named integer vector:
> xy
a b c d
5 6 7 8
You may also use the names() function to get the same result:
xy <- 5:8
names(xy) <- letters[1:4]
# With such a vector it is also possible to select elements by name:
xy["a"]
Vector sub-setting
In R programming language, subsetting allows the user to access elements from an object. It takes out a portion from the object based on the condition provided.
Method 1: Subsetting in R Using [ ] Operator
Using the '[ ]' operator, elements of vectors and observations from data frames can be accessed. To neglect some indexes, '-' is used to access all other indexes of a vector or data frame.
x <- 1:15
# Print vector
cat("Original vector:", x, "\n")
# Subsetting vector
cat("First 5 values of vector:", x[1:5], "\n")
cat("Without values present at index 1, 2 and 3:", x[-c(1, 2, 3)], "\n")
Method 2: Subsetting in R Using subset() Function
The subset() function in R programming is used to create a subset of vectors, matrices, or data frames based on the conditions provided in the parameters.
q <- subset(airquality, Temp < 65, select = c(Month))
print(q)
Matrices
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix, as we know, rows are the ones that run horizontally and columns are the ones that run vertically.
Creating and Naming a Matrix
To create a matrix in R you need to use the function called matrix(). The arguments to this matrix() are the set of elements in the vector. You have to pass how many numbers of rows and how many numbers of columns you want to have in your matrix.
Note: By default, matrices are in column-wise order.
A = matrix(
  # Taking a sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  # No of rows
  nrow = 3,
  # No of columns
  ncol = 3,
  # By default matrices are in column-wise order
  # So this parameter decides how to arrange the matrix
  byrow = TRUE
)
# Naming rows
rownames(A) = c("r1", "r2", "r3")
# Naming columns
colnames(A) = c("c1", "c2", "c3")
print(A)
Creating special matrices
R allows creation of various different types of matrices with the use of arguments passed to the matrix() function.
Matrix where all rows and columns are filled by a single constant 'k':
To create such a matrix the syntax is given below:
Syntax: matrix(k, m, n)
Parameters:
k: the constant
m: no of rows
n: no of columns
print(matrix(5, 3, 3))
Diagonal matrix:
A diagonal matrix is a matrix in which the entries outside the main diagonal are all zero. To create such a matrix the syntax is given below:
print(diag(c(5, 3, 3), 3, 3))
Identity matrix:
A square matrix in which all the elements of the principal diagonal are ones and all other elements are zeros. To create such a matrix the syntax is given below:
print(diag(1, 3, 3))
Matrix metrics
Matrix metrics mean that, once a matrix is created, the questions we generally want to answer are:
How can you know the dimension of the matrix?
How can you know how many rows are there in the matrix?
How many columns are in the matrix?
How many elements are there in the matrix?
A = matrix(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  nrow = 3,
  ncol = 3,
  byrow = TRUE
)
cat("The 3x3 matrix:\n")
print(A)
cat("Dimension of the matrix:\n")
print(dim(A))
cat("Number of rows:\n")
print(nrow(A))
cat("Number of columns:\n")
print(ncol(A))
cat("Number of elements:\n")
print(length(A))
# OR
print(prod(dim(A)))
Matrix subsetting
A matrix is subset with two arguments within single brackets, [ ], separated by a comma. The first argument specifies the rows, and the second the columns.
M_new <- matrix(c(25,23,25,20,15,17,13,19,25,24,21,19,20,12,30,17), ncol = 4)
# M_new <- matrix(1:16, 4)
M_new
colnames(M_new) <- c("C1","C2","C3","C4")
rownames(M_new) <- c("R1","R2","R3","R4")
M_new[1, 1, drop = FALSE]  # display 1st row and 1st column, cell value
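Continuing the example, a short sketch of further subsetting (all of these use the M_new matrix defined above):
M_new[2, ]            # the whole 2nd row
M_new[, "C3"]         # the whole 3rd column, selected by name
M_new[1:2, c(2, 4)]   # rows 1 and 2 of columns 2 and 4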
Arrays
Arrays are the R data objects which can store data in more than two dimensions. For example − if we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns. Arrays can store only one data type.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
result <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(result)
Naming Columns and Rows
We can give names to the rows, columns and matrices in the array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array, naming its dimensions.
result <- array(c(vector1, vector2), dim = c(3, 3, 2),
                dimnames = list(row.names, column.names, matrix.names))
print(result)
Accessing arrays
The arrays can be accessed by using indices for the different dimensions, separated by commas. Different components of an array can be accessed in any combination of indices.
Accessing Uni-Dimensional Array
The elements can be accessed by using indexes of the corresponding elements.
vec <- c(1:10)
# Accessing entire vector
cat("Vector is:", vec)
# Accessing elements
cat("Third element of vector is:", vec[3])
Accessing Array Elements
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1, vector2), dim = c(3, 3, 2),
                dimnames = list(row.names, column.names, matrix.names))
# Print the third row of the second matrix of the array.
print(result[3,,2])
Calculations across Array Elements
We can do calculations across the elements in an array using the apply() function.
Syntax
apply(x, margin, fun)
Following is the description of the parameters used −
x is an array.
margin specifies the dimension(s) over which the function is applied (1 for rows, 2 for columns).
fun is the function to be applied across the elements of the array.
Example
We use the apply() function below to calculate the sum of the elements in the rows of an array across all the matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(new.array)
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
, , 1
     [,1] [,2] [,3]
[1,]    5   10   13
[2,]    9   11   14
[3,]    3   12   15
, , 2
     [,1] [,2] [,3]
[1,]    5   10   13
[2,]    9   11   14
[3,]    3   12   15
[1] 56 68 60
Accessing subset of array elements
A smaller subset of the array elements can be accessed by defining a range of row or column limits.
row_names <- c("row1","row2")
col_names <- c("col1","col2","col3","col4")
mat_names <- c("Mat1","Mat2")
arr = array(1:15, dim = c(2, 4, 2),
            dimnames = list(row_names, col_names, mat_names))
# Print elements of both the rows and columns 2 and 3 of matrix 1.
print(arr[, c(2, 3), 1])
Adding elements to array
Elements can be appended at different positions in the array. The sequence of elements is retained in the order of their addition to the array. The time complexity required to add new elements is O(n), where n is the length of the array. The length of the array increases by the number of element additions. There are various in-built functions available in R to add new values:
Using the length function of the array:
Elements can be added at length + x indices where x > 0.
# Creating a uni-dimensional array
x <- c(1, 2, 3, 4, 5)
# Addition of element using c() function
x <- c(x, 6)
print("Array after 1st modification")
print(x)
# Addition of element using append function
x <- append(x, 7)
print("Array after 2nd modification")
print(x)
# Adding elements after computing the length
len <- length(x)
x[len + 1] <- 8
print("Array after 3rd modification")
print(x)
# Adding on length + 3 index
x[len + 3] <- 9
print("Array after 4th modification")
print(x)
# Appending multiple elements
x <- c(x, 10, 11, 12)
print("Array after 5th modification")
print(x)
# Adds new elements after 3rd index
x <- append(x, c(-1, -1), after = 3)
print("Array after 6th modification")
print(x)
[1] "Arrayafter 1st modification"
[1] 1 2 3 45 6
[1] "Arrayafter2ndmodification"
[1] 1 2 3 45 6 7
[1] "Arrayafter3rdmodification"
[1] 1 2 3 45 6 7 8
[1]"Arrayafter4thmodification"
[1]12345678 NA9
[1]"Arrayafter5thmodification"
[1]12345678 NA9 10 11 12
[1]"Arrayafter6thmodification"
[1]123-1-145678 NA9 10 1112
Removing Elements from Array
Elements can be removed from arrays in R, either one at a time or multiple together. These elements are specified as indexes to the array, wherein the array values satisfying the conditions are retained and the rest removed. The comparison for removal is based on array values. Multiple conditions can also be combined together to remove a range of elements. Another way to remove elements is by using the %in% operator, wherein the set of element values belonging to the TRUE values of the operator are retained and the rest are removed.
# Creating an array of length 9
m <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
print("Original Array")
print(m)
# Remove a single value element: 3 from array
m <- m[m != 3]
print("After 1st modification")
print(m)
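Continuing the example, a minimal sketch of removal with the %in% operator (the comparison set c(1, 4, 7, 8, 9) is illustrative):
# Keep only the values found in the comparison set; the rest are removed.
m <- m[m %in% c(1, 4, 7, 8, 9)]
print("After 2nd modification")
print(m)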
Class in R
Class is the blueprint that helps to create an object and contains its member variables along with the attributes. As discussed earlier in the previous section, there are two main classes of R: S3 and S4.
S3 Class
S3 class is somewhat primitive in nature. It lacks a formal definition, and an object of this class can be created simply by adding a class attribute to it.
This simplicity accounts for the fact that it is widely used in the R programming language. In fact, most of the R built-in classes are of this type.
Example 1: S3 class
# Create a list with required components.
s <- list(name = "John", age = 21, GPA = 3.5)
# Name the class appropriately.
class(s) <- "student"
S4 Class
S4 classes are an improvement over the S3 class. They have a formally defined structure which helps in making objects of the same class look more or less similar.
Class components are properly defined using the setClass() function and objects are created using the new() function.
Example 2: S4 class
setClass("student", slots = list(name = "character", age = "numeric", GPA = "numeric"))
# Create an object using new() by passing the class name and slot values.
s <- new("student", name = "John", age = 21, GPA = 3.5)
Reference Class
Reference classes were introduced later, compared to the other two. They are more similar to the object-oriented programming we are used to seeing in other major programming languages.
Reference classes are basically S4 classes with an environment added to them.
Example 3: Reference class
student <- setRefClass("student")
Factors
Introduction to Factors:
Factors in R programming language are data structures that are implemented to categorize the data or represent categorical data and store it on multiple levels.
They can be stored as integers with a corresponding label for every unique integer. Though factors may look similar to character vectors, they are integers, and care must be taken while using them as strings. A factor accepts only a restricted number of distinct values. For example, a data field such as gender may contain values only from female, male.
Creating a Factor in R Programming Language
The command used to create or modify a factor in the R language is factor(), with a vector as input.
The two steps to creating a factor are:
Creating a vector
Converting the vector created into a factor using the function factor()
Example:
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
Changing the Order of Levels
The order of the levels in a factor can be changed by applying the factor function again with the new order of the levels.
Example:
data <- c("East","West","East","North","North","East","West","West","West","East","North")
# Create the factors.
factor_data <- factor(data)
print(factor_data)
# Apply the factor function with the required order of the levels.
new_order_data <- factor(factor_data, levels = c("East","West","North"))
print(new_order_data)
Accessing elements of a Factor in R
Like we access elements of a vector, the same way we access the elements of a factor. If gender is a factor, then gender[i] would mean accessing the ith element of the factor.
Example:
gender <- factor(c("female", "male", "male", "female"))
gender[3]
Generating Factor Levels
We can generate factor levels by using the gl() function. It takes two integers as input which indicate how many levels there are and how many times each level is repeated.
Syntax
gl(n, k, labels)
Following is the description of the parameters used −
n is an integer giving the number of levels.
k is an integer giving the number of replications.
labels is a vector of labels for the resulting factor levels.
Example:
v <- gl(3, 4, labels = c("A", "B", "C"))
print(v)
Summarizing a Factor
The summary() function in R returns the results of basic statistical calculations (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) for a numerical vector. The general way to write the R summary function is summary(x, na.rm = FALSE/TRUE). Here, x refers to a numerical vector, while na.rm = FALSE/TRUE specifies whether to remove missing values from the calculation. Applied to a factor, summary() instead returns the count of each level.
Example:
v <- gl(3, 4, labels = c("A", "B", "C"))
print(v)
summary(v)
Level Ordering of Factors
Factors are data objects used to categorize data and store it as levels. They can store a string as well as an integer. They represent columns as they have a limited number of unique values. Factors in R can be created using the factor() function. It takes a vector as input. The c() function is used to create a vector with explicitly provided values.
Example:
x <- c("Pen","Pencil","Brush","Pen",
       "Brush","Brush","Pencil","Pencil")
print(x)
print(is.factor(x))
In the above code, x is a vector with 8 elements. To convert it to a factor the function factor() is used. Here there are 8 values and 3 levels. Levels are the unique elements in the data and can be found using the levels() function.
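Continuing the example, a short sketch of the conversion and of reading off the levels:
# Convert the vector to a factor and inspect its levels.
y <- factor(x)
print(y)
print(levels(y))   # "Brush" "Pen" "Pencil"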
Ordering Factor Levels
Ordered factors are an extension of factors. They arrange the levels in increasing order. We use the factor() function along with the argument ordered = TRUE.
Syntax: factor(data, levels = c(""), ordered = TRUE)
Parameters:
data: input vector with explicitly defined values.
levels: the list of levels, given with the c() function.
ordered: set to TRUE to enable ordering.
Example:
# Create a size vector (the values here are illustrative).
size <- c("small", "large", "large", "small", "medium")
# Ordering the levels.
ordered.size <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
print(ordered.size)
In the above code, the size vector is created using the c() function. It is then converted to a factor, and for ordering, the factor() function is used with the levels argument and ordered = TRUE.
Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. Data frames can also be interpreted as matrices where each column of the matrix can be of a different data type.
Following are the characteristics of a data frame.
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain the same number of data items.
Creating a Data Frame
friend.data <- data.frame(
  friend_id = c(1:5),
  friend_name = c("Sachin", "Sourav", "Dravid", "Sehwag", "Dhoni")
)
# Print the data frame.
print(friend.data)
Output:
  friend_id friend_name
1         1      Sachin
2         2      Sourav
3         3      Dravid
4         4      Sehwag
5         5       Dhoni
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying the summary() function.
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
  stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))
Extract Data from Data Frame
Extract a specific column from a data frame using the column name.
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27"))
)
# Extract specific columns.
result <- data.frame(emp.data$emp_name, emp.data$salary)
print(result)
Extract the first two rows and then all columns:
result <- emp.data[1:2, ]
Extract the 3rd and 5th row with the 2nd and 4th column:
result <- emp.data[c(3,5), c(2,4)]
Expand Data Frame / Extending Data Frame
A data frame can be expanded by adding columns and rows.
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
  stringsAsFactors = FALSE
)
# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
When we execute the above code, it produces the following result −
  emp_id emp_name salary start_date       dept
1      1     Rick 623.30 2012-01-01         IT
2      2      Dan 515.20 2013-09-23 Operations
3      3 Michelle 611.00 2014-11-15         IT
4      4     Ryan 729.00 2014-05-11         HR
5      5     Gary 843.25 2015-03-27    Finance
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.
# Create the second data frame with the same structure.
emp.newdata <- data.frame(
  emp_id = c(6:8),
  emp_name = c("Rasmi","Pranab","Tusar"),
  salary = c(578.0,722.5,632.8),
  start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
  dept = c("IT","Operations","Finance"),
  stringsAsFactors = FALSE
)
# Bind the two data frames.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
Remove Rows and Columns
Use the c() function to remove rows and columns in a data frame:
Example
Data_Frame <- data.frame(
  Training = c("Strength","Stamina","Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
# Remove the first row and column.
Data_Frame_New <- Data_Frame[-c(1), -c(1)]
# Print the new data frame.
Data_Frame_New
  Pulse Duration
2   150       30
3   120       45
Create Subsets of a Data frame
The subset() function in R programming language is used to create subsets of a data frame. It can also be used to drop rows and columns.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27"))
)
emp.data
subset(emp.data, emp_id == 3)
subset(emp.data, emp_id %in% c(1:3))
  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27
  emp_id emp_name salary start_date
3      3 Michelle    611 2014-11-15
  emp_id emp_name salary start_date
1      1     Rick  623.3 2012-01-01
2      2      Dan  515.2 2013-09-23
3      3 Michelle  611.0 2014-11-15
Sorting Data
To sort a data frame in R, use the order() function. By default, sorting is ascending. Prepend the sorting variable with a minus sign to indicate descending order. Here are some examples.
data = data.frame(
  rollno = c(1, 5, 4, 2, 3),
  subjects = c("java", "python", "php", "sql", "c")
)
print(data)
print("sort the data in decreasing order based on subjects")
print(data[order(data$subjects, decreasing = TRUE), ])
print("sort the data in decreasing order based on rollno")
print(data[order(data$rollno, decreasing = TRUE), ])
Output:
  rollno subjects
1      1     java
2      5   python
3      4      php
4      2      sql
5      3        c
[1] "sort the data in decreasing order based on subjects"
  rollno subjects
4      2      sql
2      5   python
3      4      php
1      1     java
5      3        c
[1] "sort the data in decreasing order based on rollno"
  rollno subjects
2      5   python
3      4      php
5      3        c
4      2      sql
1      1     java
Lists
Lists are one-dimensional, heterogeneous data structures. A list can be a list of vectors, a list of matrices, a list of characters, a list of functions, and so on.
A list is a vector, but with heterogeneous data elements. A list in R is created with the use of the list() function. R allows accessing elements of a list with the use of the index value. In R, the indexing of a list starts with 1 instead of 0 as in other programming languages.
Creating a List
To create a list in R you need to use the function called list(). In other words, a list is a generic vector containing other objects. To illustrate how a list looks, we take an example here. We want to build a list of employees with the details. So for this, we want attributes such as ID, employee name, and the number of employees.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
print(empList)
or
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
Accessing components of a list
We can access components of a list in two ways.
Access components by names: all the components of a list can be named, and we can use those names to access the components.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
print(empList)
# Accessing components by names
cat("Accessing name components using $ command\n")
print(empList$Names)
Access components by indices: we can also access the components of the list using indices. To access the top-level components of a list we use the double slicing operator "[[ ]]", which is two square brackets; if we want to access the lower or inner level components of a list, we use another square bracket "[ ]" along with the double slicing operator "[[ ]]".
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
print(empList)
# Accessing top level components by indices
cat("Accessing name components using indices\n")
print(empList[[2]])
# Accessing inner level components by indices
cat("Accessing Sandeep from names using indices\n")
print(empList[[2]][2])
# Accessing another inner level component by indices
cat("Accessing 4 from ID using indices\n")
print(empList[[1]][4])
Modifying components of a list
A list can also be modified by accessing the components and replacing them with the ones which you want.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
cat("Before modifying the list\n")
print(empList)
# Modifying the top-level component
empList$TotalStaff = 5
# Modifying inner level components
empList[[1]][5] = 5
empList[[2]][5] = "Kamala"
cat("After modifying the list\n")
print(empList)
Merging list
We can merge lists by placing all the lists into a single list.
lst1 <- list(1, 2, 3)
lst2 <- list("Sun", "Mon", "Tue")
# Merge the two lists.
merged.list <- c(lst1, lst2)
print(merged.list)
Deleting components of a list
To delete components of a list, first of all, we need to access those components and then insert a negative sign before those components. It indicates that we want to delete that component.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "TotalStaff" = numberOfEmp
)
cat("Before deletion the list is\n")
print(empList)
# Deleting a top level component
cat("After deleting the TotalStaff component\n")
print(empList[-3])
# Deleting an inner level component
cat("After deleting Sandeep from names\n")
print(empList[[2]][-2])
Converting List to Vector
Here we are going to convert the list to a vector. For this we will create a list first and then unlist the list into a vector.
# Create list.
lst <- list(1:5)
print(lst)
# Convert the list to a vector.
vec <- unlist(lst)
print(vec)
Unit-4
Conditionals and control flow
R - Operators
An operator is a symbol that tells the compiler to perform specific mathematical or logical manipulations. R language is rich in built-in operators and provides the following types of operators.
Types of Operators
We have the following types of operators in R programming −
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
The following table shows the arithmetic operators supported by R language. The operators act on each element of the vector.
+ Adds two vectors
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v + t)
it produces the following result −
[1] 10.0  8.5 10.0
^ The first vector raised to the exponent of the second vector
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v ^ t)
it produces the following result −
[1]  256.000  166.375 1296.000
Relational Operators
The following table shows the relational operators supported by R language. Each element of the first vector is compared with the corresponding element of the second vector. The result of comparison is a Boolean value.
> Checks if each element of the first vector is greater than the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v > t)
it produces the following result −
[1] FALSE  TRUE FALSE FALSE
< Checks if each element of the first vector is less than the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v < t)
it produces the following result −
[1]  TRUE FALSE  TRUE FALSE
== Checks if each element of the first vector is equal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v == t)
it produces the following result −
[1] FALSE FALSE FALSE  TRUE
<= Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v <= t)
it produces the following result −
[1]  TRUE FALSE  TRUE  TRUE
>= Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v >= t)
it produces the following result −
[1] FALSE  TRUE FALSE  TRUE
!= Checks if each element of the first vector is unequal to the corresponding element of the second vector.
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v != t)
it produces the following result −
[1]  TRUE  TRUE  TRUE FALSE
Logical Operators
The following table shows the logical operators supported by R language. They are applicable only to vectors of type logical, numeric or complex. All non-zero numbers are treated as the logical value TRUE.
Each element of the first vector is compared with the corresponding element of the second vector. The result of comparison is a Boolean value.
! Called Logical NOT operator. Takes each element of the vector and gives the opposite logical value.
v <- c(3, 0, TRUE, 2+2i)
print(!v)
it produces the following result −
[1] FALSE  TRUE FALSE FALSE
The logical operators && and || consider only the first element of the vectors and give a vector of a single element as output.
&& Called Logical AND operator. Takes the first element of both the vectors and gives TRUE only if both are TRUE.
v <- c(3, 0, TRUE, 2+2i)
t <- c(1, 3, TRUE, 2+3i)
print(v && t)
it produces the following result −
[1] TRUE
|| Called Logical OR operator. Takes the first element of both the vectors and gives TRUE if one of them is TRUE.
v <- c(0, 0, TRUE, 2+2i)
t <- c(0, 3, TRUE, 2+3i)
print(v || t)
it produces the following result −
[1] FALSE
Assignment Operators
These operators are used to assign values to vectors.
<-, =, <<- (left assignment)
v1 <- c(3, 1, TRUE, 2+3i)
v2 <<- c(3, 1, TRUE, 2+3i)
v3 = c(3, 1, TRUE, 2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
: Colon operator. It creates a series of numbers in sequence for a vector.
v <- 2:8
print(v)
it produces the following result −
[1] 2 3 4 5 6 7 8
%in% This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE
%*% This operator is used to multiply a matrix with its transpose.
M = matrix(c(2, 6, 5, 1, 10, 4), nrow = 2, ncol = 3, byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result −
     [,1] [,2]
[1,]   65   82
[2,]   82  117
R - Decision making / Conditional statements
Decision making structures require the programmer to specify one or more conditions to be evaluated or tested by the program, along with a statement or statements to be executed if the condition is determined to be true, and optionally, other statements to be executed if the condition is determined to be false.
R provides the following types of decision making statements.
Sr.No. Statement & Description
1 if statement
An if statement consists of a Boolean expression followed by one or more statements.
2 if...else statement
An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
3 switch statement
A switch statement allows a variable to be tested for equality against a list of values.
R - If Statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax
The basic syntax for creating an if statement in R is −
if(boolean_expression) {
   // statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to be true, then the block of code inside the if statement will be executed. If the Boolean expression evaluates to be false, then the first set of code after the end of the if statement (after the closing curly brace) will be executed.
Example
x <- 30L
if(is.integer(x)) {
   print("X is an Integer")
}
When the above code is compiled and executed, it produces the following result −
[1] "X is an Integer"
R - If...Else Statement
An if statement can be followed by an optional else statement which executes when the Boolean expression is false.
Syntax
The basic syntax for creating an if...else statement in R is −
if(boolean_expression) {
   // statement(s) will execute if the boolean expression is true.
} else {
   // statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to be true, then the if block of code will be executed; otherwise the else block of code will be executed.
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
   print("Truth is found")
} else {
   print("Truth is not found")
}
The if...else if...else statement
Example
if("Truth" %in% x) {
   print("Truth is found the first time")
} else if("truth" %in% x) {
   print("truth is found the second time")
} else {
   print("No truth found")
}
Nested If Statements
Example
x <- 41
if (x > 10) {
   print("Above ten")
   if (x > 20) {
      print("and also above 20!")
   } else {
      print("but not above 20.")
   }
} else {
   print("below 10.")
}
AND
Example
Test if a is greater than b, AND if c is greater than a:
a <- 200
b <- 33
c <- 500
if (a > b & c > a) {
   print("Both conditions are true")
}
OR
Example
Test if a is greater than b, OR if a is greater than c:
a <- 200
b <- 33
c <- 500
if (a > b | a > c) {
   print("At least one of the conditions is true")
}
R - Switch Statement
A switch statement allows a variable to be tested for equality against a list of values. Each value is called a case, and the variable being switched on is checked for each case.
Syntax
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3 ...)
The following rules apply to a switch statement −
If the value of expression is not a character string it is coerced to integer.
You can have any number of case statements within a switch.
If the value of the integer is between 1 and nargs() − 1 (the max number of arguments), then the corresponding element of the case conditions is evaluated and the result returned.
If expression evaluates to a character string then that string is matched (exactly) to the names of the elements.
If there is more than one match, the first matching element is returned.
No default argument is available.
In the case of no match, if there is an unnamed element of ..., its value is returned.
Example
x <- switch(
   3,
   "first",
   "second",
   "third",
   "fourth"
)
print(x)
Example 2: Mathematical calculation
val1 = 6
val2 = 7
val3 = "s"
result = switch(
   val3,
   "a" = cat("Addition =", val1 + val2),
   "d" = cat("Subtraction =", val1 - val2),
   "r" = cat("Division = ", val1 / val2),
   "s" = cat("Multiplication =", val1 * val2),
   "m" = cat("Modulus =", val1 %% val2),
   "p" = cat("Power =", val1 ^ val2)
)
print(result)
Iterative Programming in R
R - Loops
Introduction:
There may be a situation when you need to execute a block of code several times. In general, statements are executed sequentially: the first statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple times. R provides the following kinds of loop statements.
Sr.No. Loop Type & Description
1 repeat loop
Executes a sequence of statements repeatedly; the stop condition is tested inside the loop body, so the body runs at least once before a break can occur.
2 while loop
Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
3 for loop
Executes a sequence of statements once for each element of a vector or sequence.
R - For Loop
A for loop is a repetition control structure that allows you to efficiently write a loop that needs to execute a specific number of times.
Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
   statements
}
R's for loops are particularly flexible in that they are not limited to integers, or even numbers, in the input. We can pass character vectors, logical vectors, lists or expressions.
Example
v <- LETTERS[1:4]
for (i in v) {
   print(i)
}
When the above code is compiled and executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"
Example
for (x in 1:10) {
   print(x)
}
Example 2: Program to display days of a week.
week <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
for (day in week) {
   print(day)
}
R - While Loop
The while loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
   statement
}
Here the key point of the while loop is that the loop might not ever run. When the condition is tested and the result is false, the loop body will be skipped and the first statement after the while loop will be executed.
Example 1
val = 1
while (val <= 5) {
   print(val)
   val = val + 1
}
Example 2
n <- 5
factorial <- 1
i <- 1
while (i <= n) {
   factorial = factorial * i
   i = i + 1
}
print(factorial)
R - Repeat Loop
The repeat loop executes the same code again and again until a stop condition is met; the exit has to be triggered with a break statement inside the loop body.
Syntax
The basic syntax for creating a repeat loop in R is −
repeat {
   statement
   if (condition) {
      break
   }
}
Example 1
val = 1
repeat {
   print(val)
   val = val + 1
   if (val > 5) {
      break
   }
}
Example 2:
i <- 0
repeat {
   print("Geeks4geeks!")
   i = i + 1
   if (i == 5) {
      break
   }
}
Loop Control Statements / Jump statements
Loop control statements change execution from its normal sequence.
R supports the following control statements.
1 break statement
Terminates the loop statement and transfers execution to the statement immediately following the loop.
2 next statement
The next statement skips the remainder of the current iteration of a loop and starts the next iteration.
R-Break Statement
ThebreakstatementinRprogramminglanguagehasthefollowingtwousages−
Whenthebreakstatementisencounteredinsidealoop,theloopisimmedia
telyterminated and program control resumes at the next statement
following the loop.
It canbeusedtoterminateacaseintheswitchstatement
Syntax
The basic syntaxfor creatinga breakstatementin R is−break
FlowDiagram
Example
for(valin1:5)
{
#checkingcondition if
(val == 3)
{
#usingbreakkeyword
break
}
#displayingitemsinthesequence
print(val)
}
R - Next Statement
The next statement in R programming language is useful when we want to skip the current iteration of a loop without terminating it. On encountering next, the R parser skips further evaluation and starts the next iteration of the loop.
Syntax
The basic syntax for creating a next statement in R is − next
Example
for (val in 1:5)
{
   # checking condition
   if (val == 3)
   {
      # using next keyword
      next
   }
   # displaying items in the sequence
   print(val)
}
Loop over a list
A for loop is very valuable when we need to iterate over a list of elements or a range of numbers. A loop can be used to iterate over a list, data frame, vector, matrix or any other object. The braces and square bracket are compulsory.
For Loop in R, Example 1: We iterate over all the elements of a vector and print the current value.
# Create fruit vector
fruit <- c('Apple', 'Orange', 'Passionfruit', 'Banana')
# Create the for statement
for (i in fruit) {
   print(i)
}
R - Functions
Functions are useful when you want to perform a certain task multiple times. A function accepts input arguments and produces the output by executing valid R commands that are inside the function. In R programming language, when you are creating a function, the function name and the file in which you are creating the function need not be the same, and you can have one or more function definitions in a single R file.
Types of function in R Language
Built-in functions: examples are sqrt(), mean() and max(); these functions can be called directly in the program by users.
User-defined functions: the R language allows us to write our own functions.
Functions in R Language
Functions are created in R by using the command function(). The general structure of the function definition is as follows:
function_name <- function(arg1, arg2, ...) {
   # function body
   return(value)
}
Built-in Function in R Programming Language
Here we will use built-in functions like sum(), max() and min().
print(sum(4:6))
print(max(4:6))
print(min(4:6))
User-defined Functions in R Programming Language
R provides built-in functions like print(), cat(), etc., but we can also create our own functions. These functions are called user-defined functions.
evenOdd = function(x) {
   if (x %% 2 == 0)
      return("even")
   else
      return("odd")
}
print(evenOdd(4))
print(evenOdd(3))
Single Input Single Output
Now create a function in R that will take a single input and give us a single output.
areaOfCircle = function(radius) {
   area = pi * radius^2
   return(area)
}
print(areaOfCircle(2))
Multiple Input Multiple Output
Now create a function in R language that will take multiple inputs and give us multiple outputs using a list.
The functions in R language take multiple input objects but return only one object as output. This is, however, not a limitation, because you can create a list of all the outputs which you want to create, and once the list is created you can access them as elements of the list and get the answers which you want.
Rectangle = function(length, width) {
   area = length * width
   perimeter = 2 * (length + width)
   # create a list with the results and return it
   result = list("Area" = area, "Perimeter" = perimeter)
   return(result)
}
resultList = Rectangle(2, 3)
print(resultList["Area"])
print(resultList["Perimeter"])
Inline Functions in R Programming Language
Sometimes creating an R script file, loading it and executing it is a lot of work when you want to just create a very small function. So, what we can do in this kind of situation is create an inline function.
To create an inline function you have to use the function command with the argument x and then the expression of the function.
f = function(x) x * 100
print(f(4))
Passing arguments to Functions in R Programming Language
There are several ways you can pass arguments to a function:
Case 1: Generally in R, the arguments are passed to the function in the same order as in the function definition.
Case 2: If you do not want to follow any order, you can pass the arguments using the names of the arguments in any order.
Case 3: If the arguments are not passed, the default values are used to execute the function.
Rectangle = function(length = 5, width = 4) {
   area = length * width
   return(area)
}
# Case 1:
print(Rectangle(2, 3))
# Case 2:
print(Rectangle(width = 8, length = 4))
# Case 3:
print(Rectangle())
Lazy evaluation of Functions in R Programming Language
In R the functions are executed in a lazy fashion. When we say lazy, what it means is that if some arguments are missing, the function is still executed as long as the execution does not involve those arguments.
Example 1:
Cal = function(a, b, c) {
   v = a * b
   return(v)
}
# c is never used in the body, so this call succeeds.
print(Cal(5, 10))
Example 2:
Cal = function(a, b, c) {
   v = a * b * c
   return(v)
}
# Here the body uses c, so this call raises an error: argument "c" is missing.
print(Cal(5, 10))
Function Arguments in R Programming
Arguments are the parameters provided to a function to perform operations in a programming language. In R programming, we can use as many arguments as we want, separated by commas. There is no limit on the number of arguments in a function in R. In this article, we'll discuss different ways of adding arguments in a function in R programming.
Adding Arguments in R
We can pass an argument to a function while calling the function by simply giving the value as an argument inside the parentheses. Below is an implementation of a function with a single argument.
divisbleBy5 <- function(n) {
   if (n %% 5 == 0) {
      return("number is divisible by 5")
   } else {
      return("number is not divisible by 5")
   }
}
# Function call
divisbleBy5(100)
Adding Multiple Arguments in R
A function in R programming can have multiple arguments too. Below is an implementation of a function with multiple arguments.
divisible <- function(a, b) {
   if (a %% b == 0) {
      return(paste(a, "is divisible by", b))
   } else {
      return(paste(a, "is not divisible by", b))
   }
}
# Function call
divisible(7, 3)
Adding Default Value in R
A default value in a function is a value that is not required to be specified each time the function is called. If the value is passed by the user, then the user-defined value is used by the function; otherwise, the default value is used. Below is an implementation of a function with a default value.
divisible <- function(a, b = 3) {
   if (a %% b == 0) {
      return(paste(a, "is divisible by", b))
   } else {
      return(paste(a, "is not divisible by", b))
   }
}
# Function call
divisible(10, 5)
divisible(12)
Dots Argument
The dots argument (...) is also known as the ellipsis, which allows the function to take an undefined, arbitrary number of arguments. Below is an example of a function with an arbitrary number of arguments.
fun <- function(n, ...) {
   l <- list(n, ...)
   paste(l, collapse = "")
}
# Function call
fun(5, 1L, 6i, 15.2, TRUE)
Recursive Functions in R Programming
Recursive functions use the concept of recursion to perform iterative tasks: they call themselves, again and again, which acts as a loop. These kinds of functions need a stopping condition so that they do not loop forever.
Recursive functions call themselves. They break down the problem into smaller components. The function() calls itself within the original function() on each of the smaller components. After this, the results are put together to solve the original problem.
Example 1:
fac <- function(x) {
   if (x == 0 || x == 1) {
      return(1)
   } else {
      return(x * fac(x - 1))
   }
}
fac(3)
Nested Functions
There are two ways to create a nested function:
Call a function within another function.
Write a function within a function.
Example: call a function within another function.
Nested_function <- function(x, y) {
   # calculate the sum of the two parameters
   a <- x + y
   return(a)
}
# Call a function within another function:
Nested_function(Nested_function(2, 2), Nested_function(3, 3))
Example: write a function within a function.
Outer_func <- function(x) {
   Inner_func <- function(y) {
      a <- x + y
      return(a)
   }
   return(Inner_func)
}
output <- Outer_func(3)  # the outer function returns the inner function
output(5)
Loading an R package
Packages
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation. Once installed, a package has to be loaded into the session to be used.
.libPaths()  # get library location
library()    # see all packages installed
search()     # see packages currently loaded
Adding Packages
You can expand the types of analyses you do by adding other packages. A complete list of contributed packages is available from CRAN.
Follow these steps:
1. Download and install the package (you only need to do this once), e.g. with install.packages().
2. To use the package, invoke the library(package) command to load it into the current session.
Load an R Package
In this exercise, you'll be learning how to load the ggplot2 package, a powerful package for data visualization. You'll use it to create a plot of two variables of the mtcars data frame. The data has already been prepared for you in the workspace. You will use:
library(ggplot2), to load the ggplot2 package,
search(), to look at the currently attached packages, and
qplot(mtcars$wt, mtcars$hp), to build a plot of two variables of the mtcars data frame.
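A minimal runnable sketch of the exercise above (it assumes ggplot2 is already installed, e.g. via install.packages("ggplot2")):
# Load the ggplot2 package for data visualization.
library(ggplot2)
# Confirm that ggplot2 is now attached to the session.
search()
# Build a quick plot of weight vs. horsepower from the built-in mtcars data frame.
qplot(mtcars$wt, mtcars$hp)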
Mathematical Functions in R
R provides various mathematical functions to perform mathematical calculations. These mathematical functions are very helpful to find the absolute value, the square value and many more calculations. Commonly used built-ins include abs(x), sqrt(x), ceiling(x), floor(x), trunc(x), round(x, digits), exp(x), log(x), and the trigonometric functions such as cos(x), sin(x) and tan(x).
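A short sketch demonstrating a few of these built-in functions (the input values are illustrative):
print(abs(-5.3))           # absolute value: 5.3
print(sqrt(16))            # square root: 4
print(ceiling(4.1))        # smallest integer not less than x: 5
print(floor(4.9))          # largest integer not greater than x: 4
print(round(3.14159, 2))   # round to 2 decimal places: 3.14
print(exp(1))              # e raised to the power 1
print(log(100, base = 10)) # logarithm of 100 to base 10: 2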
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration.
Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients.
The technique also works to remove noise without smoothing out the main features of the data, making it effective for data cleaning as well. Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. In general, however, the DWT achieves better lossy compression.
Unlike the DFT, wavelets are quite localized in space, contributing to the conservation of local detail.
There is only one DFT, yet there are several families of DWTs. Figure 3.4 shows some wavelet families. Popular wavelet transforms include the Haar-2, Daubechies-4, and Daubechies-6. The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.
The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L >= n).
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i, x2i+1). This results in two data sets of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data. A sketch of one smoothing/differencing step is shown after this list.
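A minimal sketch of the pairwise smoothing and differencing of step 3, using Haar-style averages and half-differences (the function name haar_step is illustrative, not from the text):
# One level of a Haar-style transform: pairwise averages (smoothed part)
# followed by pairwise half-differences (detail part).
haar_step <- function(x) {
  even <- x[seq(1, length(x), by = 2)]   # x1, x3, ... (first of each pair)
  odd  <- x[seq(2, length(x), by = 2)]   # x2, x4, ... (second of each pair)
  smooth <- (even + odd) / 2             # low-frequency / smoothed version
  detail <- (even - odd) / 2             # high-frequency / detail content
  c(smooth, detail)
}
haar_step(c(2, 4, 6, 8))   # returns 3 7 -1 -1
Applying haar_step recursively to the smoothed half implements the hierarchical pyramid algorithm described above.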
Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients, where the matrix used depends on the given DWT.
The matrix must be orthonormal, meaning that the columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose. By factoring the matrix used into a product of a few sparse matrices, the resulting "fast DWT" algorithm has a complexity of O(n) for an input vector of length n.
Wavelet transforms can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on.
Lossy compression by wavelets is reportedly better than JPEG compression, the current commercial standard.
Wavelet transforms have many real-world applications, including the compression of fingerprint images, computer vision, analysis of time-series data, and data cleaning.
Principal Components Analysis
Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains will not dominate attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data.
3. The principal components are sorted in order of decreasing "significance" or strength.
4. Because the components are sorted in decreasing order of "significance," the data size can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
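A minimal sketch of this procedure using R's built-in prcomp() from the stats package and the built-in USArrests data set (both chosen here for illustration, not taken from the text):
# Step 1: normalize the input (scale. = TRUE standardizes each attribute).
pca <- prcomp(USArrests, scale. = TRUE)
# Steps 2-3: the components are already sorted by decreasing variance explained.
summary(pca)
# Step 4: keep only the two strongest components as the reduced representation.
reduced <- pca$x[, 1:2]
head(reduced)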
Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. For example, if the task is to classify customers based on whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer's telephone number are likely to be irrelevant, unlike attributes such as age or music taste.
Although it may be possible for a domain expert to pick out some of the useful attributes, this can be a difficult and time-consuming task, especially when the data's behavior is not well known. (Hence, a reason behind its analysis!) Leaving out relevant attributes or keeping irrelevant attributes may be detrimental, causing confusion for the mining algorithm employed. This can result in discovered patterns of poor quality. In addition, the added volume of irrelevant or redundant attributes can slow down the mining process.
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
For n attributes, there are 2^n possible subsets, so an exhaustive search for the optimal subset can be prohibitively expensive. Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection. These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time. Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution. Such greedy methods are effective in practice and may come close to estimating an optimal solution.
The "best" (and "worst") attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another. Many other attribute evaluation measures can be used, such as the information gain measure used in building decision trees for classification.
Basic heuristic methods of attribute subset selection include the techniques that follow, some of which are illustrated in Figure 3.6.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes. (A sketch of stepwise selection in R follows this list.)
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.
When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes.
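A minimal sketch of stepwise selection using R's built-in step() function on the built-in mtcars data (an illustrative assumption; note that step() ranks attributes by AIC rather than by the significance tests described above):
# Full model (all attributes) and null model (no attributes).
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
# Stepwise forward selection: start empty, add the best attribute at each step.
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)
# Stepwise backward elimination: start full, drop the worst attribute at each step.
bwd <- step(full, direction = "backward", trace = 0)
# The retained attributes form the reduced subset.
print(formula(fwd))
print(formula(bwd))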
Regression and Log-Linear Models: Parametric Data Reduction
Regression and log-linear models can be used to approximate the given data. In (simple) linear regression, the data are modeled to fit a straight line. For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation
y = wx + b,
where the variance of y is assumed to be constant. In the context of data mining, x and y are numeric database attributes. The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively.
Log-linear models approximate discrete multidimensional probability distributions.
Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space.
Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
This allows a higher-dimensional data space to be constructed from lower-dimensional spaces.
Histograms
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
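A minimal sketch in R, using hist() to reduce 1,000 illustrative price values to roughly ten equal-width buckets:
set.seed(1)   # illustrative data
prices <- round(runif(1000, min = 1, max = 30))
h <- hist(prices, breaks = 10, plot = FALSE)
h$breaks   # bucket boundaries (continuous equal-width ranges)
h$counts   # one frequency per bucket stands in for the raw values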
Clustering
Clustering techniques consider data tuples as objects.
They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters.
Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function.
The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
In data reduction, the cluster representations of the data are used to replace the actual data.
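A minimal sketch with R's built-in kmeans(), where three centroids stand in for 100 raw points (the data here are illustrative):
set.seed(2)
data <- matrix(rnorm(200), ncol = 2)   # 100 two-dimensional points
km <- kmeans(data, centers = 3)
km$centers          # cluster centroids used to replace the actual data
table(km$cluster)   # how many points each centroid represents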
Sampling
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample (or subset). Suppose that a large data set, D, contains N tuples. Let's look at the most common ways that we could sample D for data reduction, as illustrated in Figure 3.9.
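A minimal sketch of drawing a simple random sample of s tuples from a data set D, using the built-in iris data frame as a stand-in for D:
N <- nrow(iris)   # N tuples in D
s <- 10           # desired sample size
# Simple random sample without replacement (each tuple drawn at most once).
srswor <- iris[sample(N, s), ]
# Simple random sample with replacement (a tuple may be drawn more than once).
srswr <- iris[sample(N, s, replace = TRUE), ]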