Introduction to decision trees and random forests
Ned Horning
American Museum of Natural History's
Center for Biodiversity and Conservation
horning@amnh.org
What are decision trees?
A predictive model that applies a set of binary rules to calculate a target value
Can be used for classification (categorical variables) or regression (continuous variables) applications
Rules are developed using software available in many statistics packages
Different algorithms are used to determine the best split at a node
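As a concrete illustration, here is a minimal sketch (not from the original slides) of fitting a classification tree in R with the rpart package, using the built-in iris data set as stand-in training data.

library(rpart)
# Fit a classification tree: the formula names the target (Species)
# and the predictors (all remaining columns)
fit <- rpart(Species ~ ., data = iris, method = "class")
# Apply the binary rules to predict a class for new observations
predict(fit, newdata = iris[1:5, ], type = "class")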
Example classification tree
How do classification trees work?
Uses training data to build the model
The tree generator determines:
Which variable to split at a node and the value of the split
The decision to stop (make a terminal node) or split again
Which class to assign to each terminal node
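Printing and plotting a fitted rpart tree shows exactly these decisions: the split variable and value at each node and the class assigned to each terminal node. A minimal sketch, again using iris as stand-in training data:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
# Each line lists the split rule, the number of samples at the node,
# and the assigned class; terminal nodes are marked with *
print(fit)
# Draw the tree and label the splits and terminal nodes
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)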
Dividing feature space: recursive partitioning
Blue = water, Green = forest, Yellow = shrub, Brown = non-forest, Gray = cloud/shadow
A constant (class or predicted function value) is assigned to each rectangle
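For a tree built on two predictors, the rectangles can be drawn directly, for example with partition.tree() from the tree package mentioned later in these slides. The sketch below (not from the original slides) uses two iris measurements as stand-ins for image bands.

library(tree)
# Fit a tree on just two predictors so the feature space is 2-D
fit <- tree(Species ~ Petal.Length + Petal.Width, data = iris)
# Plot the training samples, then overlay the rectangular partitions;
# each rectangle is labeled with its assigned (constant) class
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species,
     xlab = "Petal.Length", ylab = "Petal.Width")
partition.tree(fit, add = TRUE)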
Editing (pruning) the tree
Overfitting is common since individual pixels can end up as terminal nodes
Classification trees can have hundreds or thousands of nodes; pruning removes nodes to simplify the tree
Parameters such as minimum node size and maximum standard deviation of samples at a node can restrict tree size
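In rpart, tree size can be limited up front with control parameters and reduced afterwards with prune(); a minimal sketch (the parameter values are arbitrary choices):

library(rpart)
# Grow a tree, restricting node size so single samples are less likely
# to become terminal nodes
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20, cp = 0.001))
# Cross-validated error for each subtree size
printcp(fit)
# Prune back to the complexity (cp) value with the lowest error
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best.cp)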
Regression trees
Regression calculates the relationship between predictor and response variables
Structure is similar to a classification tree
Terminal nodes hold predicted function (model) values
Predicted values are limited to the values in the terminal nodes
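A regression tree is fit the same way with method = "anova"; the sketch below uses the built-in airquality data (chosen only for illustration) and shows that predictions can only take the constant values stored in the terminal nodes.

library(rpart)
# Regression tree: predict Ozone from the other weather variables
fit <- rpart(Ozone ~ Solar.R + Wind + Temp, data = airquality,
             method = "anova")
# Predictions are limited to the terminal-node values
pred <- predict(fit, newdata = airquality)
sort(unique(pred))   # a small set of distinct values, one per terminal node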
Decision tree advantages
Easy to interpret the decision rules
Nonparametric, so it is easy to incorporate a range of numeric or categorical data layers and there is no need to select unimodal training data
Robust with regard to outliers in training data
Classification is fast once rules are developed
Drawbacks of decision trees
Decision trees tend to overfit training data, which can give poor results when applied to the full data set
Splitting perpendicular to feature space axes is not always efficient
Not possible to predict beyond the minimum and maximum limits of the response variable in the training data
Packages in R
tree: the original decision tree package
rpart: a slightly newer and more actively maintained package
What are ensemble models?
Combines the results from different models
The models can be of a similar type or different types
The result from an ensemble model is usually better than the result from any one of the individual models
What is random forests?
An ensemble classifier that uses many decision tree models
Can be used for classification or regression
Accuracy and variable importance information is provided with the results
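A minimal sketch of fitting a random forest in R with the randomForest package (using iris as stand-in training data); printing the fit reports the out-of-bag accuracy, and importance = TRUE requests variable importance:

library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, importance = TRUE)
print(rf)          # OOB error estimate and confusion matrix
importance(rf)     # per-variable importance measures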
How random forests work
A different subset of the training data is selected (~2/3), with replacement, to train each tree
The remaining training data (out-of-bag, OOB) are used to estimate error and variable importance
Class assignment is made by the number of votes from all of the trees; for regression the average of the tree results is used
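The fitted randomForest object exposes these pieces directly; a short sketch (refitting the iris model so it runs on its own):

library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf$oob.times[1:10]        # how often each training sample was out of bag
head(rf$votes)            # fraction of trees voting for each class
rf$err.rate[rf$ntree, ]   # OOB error after all trees (classification)
# For regression, predictions are the average over all trees rather than votes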
Use a subset of variables
A randomly selected subset of variables is used to split each node
The number of variables used is decided by the user (the mtry parameter in R)
A smaller subset produces less correlation between trees (lower error rate) but lower predictive power for each tree (higher error rate)
The optimum range of values is often quite wide (see the tuning sketch below)
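The helper tuneRF() in the randomForest package searches over mtry values using the OOB error; a minimal sketch with iris as stand-in data (the step factor, improvement threshold, and tree count are arbitrary choices):

library(randomForest)
set.seed(42)
# Search for an mtry value with low OOB error, starting from the default
tuneRF(x = iris[, 1:4], y = iris$Species,
       stepFactor = 1.5, improve = 0.01, ntreeTry = 200)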
Common variables for random forests
Input data (predictor and response)
Number of trees
Number of variables to use at each split
Options to calculate error and variable importance information
Sampling with or without replacement
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
                 max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
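A call that sets the most common of these arguments explicitly might look like the sketch below (the values are arbitrary; the iris columns stand in for your own predictor matrix x and response y):

library(randomForest)
rf <- randomForest(x = iris[, 1:4], y = iris$Species,
                   ntree = 1000,      # number of trees
                   mtry = 2,          # variables tried at each split
                   replace = TRUE,    # sample training data with replacement
                   importance = TRUE) # compute variable importance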
Proximity measure
Proximity measures how frequently unique pairs of training samples (in and out of bag) end up in the same terminal node
Used to fill in missing data and to calculate outliers
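Proximities are only kept when requested; with proximity = TRUE the matrix can then feed outlier() and rfImpute() from the randomForest package. A minimal sketch (the outlier threshold is an arbitrary choice):

library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)
dim(rf$proximity)    # pairwise proximity of training samples
out <- outlier(rf)   # outlier measure per training sample, by class
which(out > 10)      # flag unusually isolated samples (threshold is arbitrary)
# rfImpute(Species ~ ., data) uses proximities to fill missing predictor values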
Outliers for classification
Information from Random Forests
Classification accuracy
Variable importance
Outliers (classification)
Missing data estimation
Error rates for random forest objects (error rate vs. number of trees)
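These outputs are all available from the fitted object; a short sketch of pulling each one, again using the iris fit for illustration:

library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   importance = TRUE, proximity = TRUE)
rf$confusion     # classification accuracy (OOB confusion matrix)
varImpPlot(rf)   # variable importance
outlier(rf)      # outlier measure (classification, needs proximity)
plot(rf)         # error rate vs. number of trees
# Missing predictor values can be estimated with rfImpute()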
Advantages of random forests
No need to prune trees
Accuracy and variable importance are generated automatically
Overfitting is not a problem
Not very sensitive to outliers in training data
Easy to set parameters
Limitations of random forests
Regression can't predict beyond the range of the response values in the training data
In regression, extreme values are often not predicted accurately: the model tends to underestimate high values and overestimate low values
Common remote sensing
applications of random forests
Classification: land cover classification, cloud/shadow screening
Regression: continuous fields (percent cover) mapping, biomass mapping
Resources to learn more about
random forests
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
http://en.wikipedia.org/wiki/Random_forest
The randomForest Package (for R) description