SAS, Python & R: A Cross Reference Guide

Preface: Code Comparison Cheat Sheet 10

Overview 15
Programming Environments 18
SAS OnDemand for Academics 21
SAS Studio IDE 22
Log 24
Results and Output 25
Python Spyder 27
Python Spyder IDE 28
Editor 29
Console 29
Help Panel 30
Variable Explorer 32
RStudio 33
RStudio IDE 34
Global Environment 35
Packages 36
Chapter 1: Programming Environments – Conclusion and Transition 38
Chapter 1 Summary: Programming Environments 39
Chapter 1 Quiz 43
Chapter 1 Cheat Sheet 45
Overview 48
Project Overview 49
Import Data 50
Data Dictionary 51
Sampling 55
Chapter 2: Gathering Data – Conclusion and Transition 58
Chapter 2 Summary: Gathering Data 59
Chapter 2 Quiz 62
Chapter 2 Cheat Sheet 64
Overview 66
Art and Science 67
Project Overview 68
Target Variable 69
Classification vs. Regression Models 70
Multi-Target Models 73
Creating a Target Variable 74
Target Variable Analysis 78
Predictive Variables 82
Exploratory Data Analysis (EDA) 83
How to Detect Outliers 88
Winsorizing 89
Inter-Quartile Range (IQR) 90
Dealing with Outliers 91
Capping and Flooring Values – Winsorizing Example 92
Data Inference 95
Data Rejection 96
Feature Selection and Engineering 96
Filtering with Correlation Analysis 98
Filtering with Wrapper Methods 101
Filtering with Embedded Methods 109
Creating New Features Through Feature Engineering 114
Feature Scaling 115
Encoding Categorical Variables: Dummy Variables 119
Feature Interaction 123
Reducing the Dimensionality of the Data Set 126
Principal Component Analysis (PCA) 127
Data Balancing 133
Modeling Scenarios 136
Machine Learning Algorithms Sensitivity 137
Chapter 3: Creating a Modeling Data Set – Conclusion and Transition 140
Chapter 3 Summary: Creating a Modeling Data Set 142
Chapter 3 Quiz 145
Chapter 3 Cheat Sheet 147
Overview 149
Data Preparation 150
Order of Operations 151
Data Set Preparation for Regression Models 152
Comparing with Tree-Based Models 153
Data Segmentation 154
Different Types of Data Segmentation 155
Cross-Validation 156
Comparison of Data Segmentation Techniques 157
Data Segmentation Approach for the Lending Club Data Set 159
Modeling Pipeline 160
Model Pipeline Steps 161
Model Pipeline Code 163
Modeling Pipeline Output 171
Processing Time 173
Chapter 4: Model Pipeline – Conclusion and Transition 174
Chapter 4 Summary: Model Pipeline 175
Chapter 4 Quiz 178
Chapter 4 Cheat Sheet 180
Overview 182
Modeling Data 182
Machine Learning Models 183
Difference Between Linear and Logistic Regression 185
Logistic Regression 186
Step-by-Step Construction of a Logistic Regression Model 187
Loss Function 189
Odds Ratio 189
Link Function 191
Assumptions 192
Model Fit 194
Interpretation 195
Regularization 197
Data Set Balancing 198
Dummy Variable Trap 199
Pros and Cons of Logistic Regression 200
Best Practices 200
Logistic Regression Model Code 200
Decision Trees 212
Comparing Logistic Regression and Decision Trees 213
Feature Selection Methods 215
Decision Tree Construction 217
Pruning Decision Trees 218
Performance Metrics 219
Hyperparameter Tuning 221
Pros and Cons 223
Best Practices 223
Decision Tree Model Code 224
Chapter 5: Predictive Modeling Part I – Foundation – Conclusion and Transition 235
Chapter 5 Summary: Predictive Modeling Part I – Foundation 236
Chapter 5 Quiz 239
Chapter 5 Cheat Sheet 241
Overview 243
Random Forest 243
Step-by-Step Construction of a Random Forest Model 244
Comparing Decision Trees and Random Forests 245
The Role of the Out-of-Bag (OOB) Metric in Random Forests 246
Feature Selection Methods 247
Pruning Random Forests 247
Performance Metrics 248
Pros and Cons 248
Best Practices 248
Random Forest Model Code 249
Gradient Boosting 258
Step-by-Step Construction of a Gradient Boosting Model 258
Different Types of Gradient Boosting Models 259
Comparing Random Forests and Gradient Boosting 260
Out-of-Bag vs. Validation Data Sets 261
Feature Selection Methods 262
Log Loss 263
Pruning and Regularization in Gradient Boosting Models 264
Performance Metrics 264
Pros and Cons 264
Best Practices 265
Gradient Boosting Model Code 265
Chapter 6: Predictive Modeling Part II – Ensemble Methods – Conclusion and Transition 274
Chapter 6 Summary: Predictive Modeling Part II – Ensemble Methods 276
Chapter 6 Quiz 279
Chapter 6 Cheat Sheet 281
Overview 284
Support Vector Machines 284
Step-by-Step Construction of an SVM Model 286
Comparing Ensemble Tree-Based Models to SVMs 287
Understanding Kernel Functions and the C Parameter in SVMs 289
Standardization in Support Vector Machines 290
Performance Metrics 291
Pros and Cons 291
Best Practices 292
Data Preparation for SVM Models 292
Support Vector Machine Model Code 293
Possible Training Time Issues 295
Neural Networks 305
Step-by-Step Construction of a Neural Network Model 308
Types of Neural Network Models 310
Comparing Baseline Models to Neural Network Models 310
Pros and Cons 311
Best Practices for Neural Networks 312
Issues to Address in Neural Network Development 313
Chapter 7: Predictive Modeling Part III – Advanced Techniques – Conclusion and Transition 327
Chapter 7 Summary: Predictive Modeling Part III – Advanced Techniques 328
Chapter 7 Quiz 331
Chapter 7 Cheat Sheet 333
Overview 336
Introduction to Performance Metrics for Classification vs. Regression Models 337
Classification Metrics: Introduction and Example 340
Establishing the Example: Creating the Data Set 341
Classification Metrics Rules of Thumb 348
Understanding Classification Thresholds 351
Interpreting the Thresholds 353
Business Goals and Threshold Setting 354
Analyzing the Model Score Distribution 355
Understanding the Confusion Matrix 355
AUC, Gini, and KS Statistic 359
Lift Table 362
Lift Table Interpretation 363
Lift and Gain Charts 368
Regression Metrics: Introduction and Example 370
Establishing the Example: Creating the Data Set 371
Residuals Analysis: A Key Tool in Regression Metrics 377
Visualizing Residuals 377
Using Residuals in Conjunction with Performance Metrics 382
Regression Metrics Rules of Thumb 386
Comparison of Classification and Regression Performance Metrics 389
Introduction to Model Implementation 391
Development vs. Production Environments 391
Model Packages 392
Implementing in Different Environments 394
Third-Party Implementation Tools: Docker and Kubernetes 395
Model Monitoring: Ensuring Long-Term Model Performance 396
Overview of Model Monitoring Approaches 397
Threshold Analysis for Classification Model Performance Metrics 398
PSI (Population Stability Index) and VSI (Variance Stability Index) 400
PSI and VSI Thresholds 401
Implementing a Robust Model Monitoring Framework 402
Chapter 8: Performance Metrics Implementation and Model Monitoring – Conclusion and Transition 402
Chapter 8 Summary: Performance Metrics, Implementation, and Model Monitoring 404
Chapter 8 Quiz 406
Chapter 8 Cheat Sheet 408
Overview 411
Python Spyder IDE 412
Running R Code in Python Spyder 412
Using R Libraries 414
Subprocess Module 414
Jupyter Notebook R Implementation 416
Running SAS Code in Python Spyder 418
Method 1: Using the Subprocess Module 418
Method 2: Using SAS Kernel for Jupyter 419
Method 3: Using SASPy 421
RStudio IDE: A Hub for Multilingual Data Science 423
Implementing Python in RStudio 423
Method 1: Using Reticulate for Python Integration 424
Method 2: Running Python Scripts 425
Method 3: Jupyter Notebooks Integration 425
Implementing SAS in RStudio 427
Method 1: Using the RSASSA Package 427
Method 2: Using RMarkdown with SAS Chunk 428
Method 3: Using System2 to Call SAS 429
SAS Studio: Where the Power of Python and R Meets SAS Strength 431
Balancing Two Worlds: SAS Studio and SAS Enterprise Guide 431
Implementation Differences between SAS Studio and SAS Enterprise Guide 432
Method 1: Using the Python Code Node in SAS Studio 434
Method 2: Using the X Python Statement 435
Method 3: Using Jupyter Notebooks in SAS 436
Chapter 9: Language Fusion – Conclusion and Transition to Your Data Science Journey 437
Chapter 9 Summary: Language Fusion 439
Chapter 9 Quiz 441
Chapter 9 Cheat Sheet 443
Appendix A: Further Learning Resources 445
Appendix B: Open-Source Data Sources 453
Appendix C: GitHub Repositories 457
Appendix D: Chapter Quiz and Answers 464
Appendix E: Glossary of Data Science and AI/ML Terms 499
Index of Terms 513

Acknowledgments
This book would not have been possible without the invaluable contributions and
unwavering support of numerous individuals and teams. I would like to take this
opportunity to extend my heartfelt gratitude to everyone who played a role in its
creation.

To the SAS Team: I am deeply grateful to Carrie Vetter, Suzanne Morgen, and
Catherine Connolly, whose expertise and guidance were instrumental in shaping the
technical content of this book. Your dedication to the advancement of data science
has been a constant source of inspiration.

To the Wells Fargo Team: A special thank you to Paul Davis, Cem Isin, Chris Challis,
Abdulaziz Gebril, Jie Chen, Michael Luo, Nijan Balan, Debjyoti Sadhu, Jordan
Eustaquio, Ferda Ozcakir Yilmaz, Vinothdumar Venkataraman, Todd Anderson,
Swapnesh Sanghavi, Weishun Chen, Tom Zhu, Boobalaganesh E., Dip Chatterjee,
Ibro Mujacic, Justin Gaiski, Soumyaa Mukherjee, Prabhat Vashishth, Kumar
Nityanand, Saurabh Chauhan, Anna Koltsova, Andrew Wolschlag, Yusuf Qaddura,
Tinhang Hong, Sara Slovensky, Satish Komirishetti, Sidharth Thakur, Insiya Hanif,
Jody Zhang, Nageswara Reddy, Anupam Chatterjee, Anoop Kamath, Nicole Chhabra,
Deeksha Tembhare, Sruthy V., Ashish Agrawal, Shyamasis Guchhait, Patrick Hook,
Brian Abrams, Mirela Mulaj, Christina Fang, Travis Alexander, Kathy Cunningham,
and Howard Kim. Your insights, feedback, and collaborative spirit have significantly
enriched the quality of this work. It has been an honor to work alongside such a
talented and dedicated group of professionals.

To the ELVTR Team: I am also incredibly thankful to the team at ELVTR, the online
data science platform, for their support throughout this journey. Anna Ansulene Van
Zyl, Yana Dovgopolova, Olga Viun, Saul Mora, and Olia Tsvek, your commitment to
education and innovation in the field of data science has been truly inspiring. Thank
you for providing a platform that fosters learning and growth for so many.

To My Family: Everything that I have ever done or ever will do is credited to the love
and support of my wife and son, Tanya and Jake. I don’t know if we’re the three
musketeers, a three-man jam band or a skeleton pirate crew, but we’re in it ‘till the
fuckin wheels come off.

Preface: Code Comparison Cheat Sheet

Important Note on Whitespace Character Sensitivity

Before you dive into the code cheat sheet, it's important to understand a key
difference between the programming languages featured in this book: whitespace
character sensitivity.

● SAS: The SAS programming language is not sensitive to whitespace
characters. This means that you can format your SAS code with spaces and
line breaks as you like, without affecting how the code runs. For example,
you can freely add spaces or new lines for readability without worrying
about the code execution being affected.

● Python and R: Python, on the other hand, is sensitive to whitespace
characters: the indentation level defines code blocks, and incorrect
indentation can lead to syntax errors or incorrect program logic. R does not
use indentation to define code blocks, but line breaks affect how expressions
are parsed, so proper formatting still matters, especially in conditional
statements and loops.

Due to publishing limitations, you may notice that some Python and R code snippets
in this cheat sheet are wrapped or formatted to fit within the page. When writing or
copying this code into your development environment, ensure that you adhere to
the correct indentation and spacing rules specific to each language. This will prevent
errors and ensure that the code executes as intended.
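As a minimal illustration of the Python point (the variable names here are just for the example), indentation determines which statements belong to a loop:

values = [1, 2, 3]
total = 0
for v in values:
    total += v      # indented: runs once per list element
print(total)        # not indented: runs once after the loop and prints 6

If the print statement were indented to match the line above it, it would run on every pass through the loop and print the running total three times instead of once.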

Each task below lists the equivalent code in SAS, Python (pandas, sklearn, etc.), and R (tidyverse, caret, etc.).

Data Import
SAS:    PROC IMPORT DATAFILE='file.csv' OUT=mydata DBMS=CSV REPLACE; GETNAMES=YES; RUN;
Python: import pandas as pd; df = pd.read_csv('file.csv')
R:      mydata <- read.csv('file.csv')

Data Exploration
SAS:    PROC CONTENTS DATA=mydata; RUN; PROC MEANS DATA=mydata; RUN;
Python: df.info(); df.describe()
R:      str(mydata); summary(mydata)

Data Visualization
SAS:    PROC SGPLOT DATA=mydata; HISTOGRAM variable; RUN;
Python: import matplotlib.pyplot as plt; df['variable'].hist()
R:      library(ggplot2); ggplot(mydata, aes(x=variable)) + geom_histogram()

Imputing Missing Values
SAS:    PROC STDIZE DATA=mydata OUT=mydata_repl MISSING=MEAN; RUN;
Python: df.fillna(df.mean(), inplace=True)
R:      mydata[is.na(mydata)] <- mean(mydata, na.rm = TRUE)

Outlier Detection
SAS:    PROC UNIVARIATE DATA=mydata; VAR variable; RUN;
Python: import numpy as np; Q1 = df['variable'].quantile(0.25); Q3 = df['variable'].quantile(0.75); IQR = Q3 - Q1
R:      boxplot.stats(mydata$variable)$out

Feature Engineering
SAS:    DATA mydata; SET mydata; new_var = old_var**2; RUN;
Python: df['new_var'] = df['old_var']**2
R:      mydata$new_var <- mydata$old_var^2

Standardizing
SAS:    PROC STANDARD DATA=mydata OUT=mydata_std; RUN;
Python: from sklearn.preprocessing import StandardScaler; scaler = StandardScaler(); df_scaled = scaler.fit_transform(df)
R:      mydata_std <- scale(mydata)

Linear Regression
SAS:    PROC REG DATA=mydata; MODEL target = var1 var2; RUN;
Python: from sklearn.linear_model import LinearRegression; model = LinearRegression(); model.fit(X, y)
R:      model <- lm(target ~ var1 + var2, data=mydata)

Logistic Regression
SAS:    PROC LOGISTIC DATA=mydata; MODEL target(event='1') = var1 var2; RUN;
Python: from sklearn.linear_model import LogisticRegression; model = LogisticRegression(); model.fit(X, y)
R:      model <- glm(target ~ var1 + var2, family=binomial(link='logit'), data=mydata)

Decision Trees
SAS:    PROC HPSPLIT DATA=mydata; CLASS target; MODEL target = var1 var2; RUN;
Python: from sklearn.tree import DecisionTreeClassifier; model = DecisionTreeClassifier(); model.fit(X, y)
R:      library(rpart); model <- rpart(target ~ var1 + var2, data=mydata)

Random Forest
SAS:    PROC HPFOREST DATA=mydata; TARGET target; INPUT var1 var2; RUN;
Python: from sklearn.ensemble import RandomForestClassifier; model = RandomForestClassifier(); model.fit(X, y)
R:      library(randomForest); model <- randomForest(target ~ var1 + var2, data=mydata)

Gradient Boosting Machines
SAS:    PROC GRADBOOST DATA=mydata; TARGET target; INPUT var1 var2; RUN;
Python: from sklearn.ensemble import GradientBoostingClassifier; model = GradientBoostingClassifier(); model.fit(X, y)
R:      library(gbm); model <- gbm(target ~ var1 + var2, data=mydata, distribution="bernoulli")

Support Vector Machines
SAS:    PROC SVMOD DATA=mydata; TARGET target; INPUT var1 var2; RUN;
Python: from sklearn.svm import SVC; model = SVC(); model.fit(X, y)
R:      library(e1071); model <- svm(target ~ var1 + var2, data=mydata)

Neural Networks
SAS:    PROC NEURAL DATA=mydata; INPUT var1 var2; TARGET target; RUN;
Python: from keras.models import Sequential; model = Sequential(); model.add(Dense(10, input_dim=10, activation='relu')); model.compile()
R:      library(nnet); model <- nnet(target ~ var1 + var2, data=mydata, size=10)

AUC
SAS:    PROC LOGISTIC DATA=mydata PLOTS=ROC; MODEL target(event='1') = var1 var2; ROC; RUN;
Python: from sklearn.metrics import roc_auc_score; auc = roc_auc_score(y_true, y_pred)
R:      library(pROC); auc <- roc(response=mydata$target, predictor=model$fitted.values)$auc

Gini
SAS:    PROC LOGISTIC DATA=mydata; MODEL target(event='1') = var1 var2; OUTPUT OUT=gini PREDPROBS=I; RUN;
Python: gini = 2 * roc_auc_score(y_true, y_pred) - 1
R:      gini <- 2 * auc - 1

KS Statistic
SAS:    PROC NPAR1WAY DATA=mydata KS; CLASS target; VAR score; RUN;
Python: from scipy.stats import ks_2samp; ks_stat, p_value = ks_2samp(y_true, y_pred)
R:      ks_stat <- ks.test(mydata$target, model$fitted.values)$statistic

Confusion Matrix
SAS:    PROC FREQ DATA=mydata; TABLES target*predicted / OUTPCT OUT=cmat; RUN;
Python: from sklearn.metrics import confusion_matrix; cm = confusion_matrix(y_true, y_pred)
R:      cm <- table(mydata$target, predict(model, mydata, type="response"))

Precision, Recall, F1 Score
SAS:    PROC LOGISTIC DATA=mydata; MODEL target(event='1') = var1 var2; OUTPUT OUT=pred PREDPROBS=I; RUN; PROC MEANS DATA=pred; VAR _prob_; RUN;
Python: from sklearn.metrics import precision_score, recall_score, f1_score; precision = precision_score(y_true, y_pred)
R:      precision <- sum(true_positive) / (sum(true_positive) + sum(false_positive)); recall <- sum(true_positive) / (sum(true_positive) + sum(false_negative)); F1 <- 2 * (precision * recall) / (precision + recall)

RMSE, MAE, MSE
SAS:    PROC MEANS DATA=mydata NMISS MIN MAX MEAN STD VAR; VAR target; RUN;
Python: from sklearn.metrics import mean_squared_error, mean_absolute_error; mse = mean_squared_error(y_true, y_pred)
R:      rmse <- sqrt(mean((mydata$target - predict(model, mydata))^2))

Lift Table
SAS:    PROC RANK DATA=mydata OUT=ranks GROUPS=10; VAR score; RANKS rank; RUN; PROC MEANS DATA=ranks; CLASS rank; VAR target; RUN;
Python: from sklearn.metrics import roc_curve; roc_curve(y_true, y_pred)
R:      library(caret); lift_table <- lift(target ~ score, data = mydata)

AUC Chart
SAS:    PROC LOGISTIC DATA=mydata PLOTS(ONLY)=ROC; MODEL target(event='1') = var1 var2; RUN;
Python: import matplotlib.pyplot as plt; from sklearn.metrics import roc_curve; fpr, tpr, _ = roc_curve(y_true, y_pred); plt.plot(fpr, tpr)
R:      library(pROC); roc_curve <- roc(target, model$fitted.values); plot(roc_curve)

Chapter 1: Programming Environments

Overview

You may not be aware of it, but data science touches nearly every aspect of your
daily life. If you are reading a book about computer programming, you are probably
already aware of the prominent and most common applications of data science.
These include how Netflix uses data science to recommend movies based on your
demographics and history, how Amazon upsells products, how dating sites
recommend matches, and how TikTok populates your video stream.

We generally know that online systems such as Netflix, Amazon, and TikTok gather
large volumes of data and subject it to machine learning algorithms to customize
content and sell products. However, data science has escaped the confines of the
tech giants and now drives nearly every aspect of our lives. Examples include the
formulation of medications, police patrol routes, fashion trends, distribution of aid
in times of crisis, housing development locations, wildfire prevention and control,
military actions, pet food ingredients, and much, much more.

“Data science” is a broad term that encompasses a wide array of disciplines,
including computer science, statistics, machine learning, and artificial intelligence.
The concepts of data science are generally expressed through the manipulation of
data using computer programming languages. Of course, several different computer
programming languages are available for developing and implementing data science
projects.

The programming language that you choose to focus on should be determined by
the type of work that you want to do. If you are a software engineer, you might
want to learn Java, C++, or Python. If you are a web developer, you might want to
learn JavaScript, PHP, or Python. However, this book will focus on the programming
languages and methodologies that support the field of data science.

Among the top programming languages in data science, SAS, Python, and R stand
out. SAS has a long-standing presence in the industry, offering a comprehensive
suite of tools for data analysis and modeling. Python has gained significant
popularity due to its versatility, ease of use, and extensive libraries for data
manipulation and machine learning. R, a language designed for statistical analysis,
provides powerful statistical and graphical capabilities. These languages, along with
their respective environments (SAS Studio, Python Spyder, and RStudio), offer data
scientists a rich ecosystem to tackle complex data problems.

Although SAS, Python, and R are among the top programming languages in data
science,1 there are other programming languages and analytical environments
commonly used for data science, such as:

● Julia
● C++
● Scala
● MATLAB
● Octave
● SQL
● Java
● Alteryx
● Tableau
● and even Microsoft Excel

Each of these programming languages is powerful and flexible and can perform a
wide array of data science functions. However, this book aims to provide you with a
practical, hands-on guide to the top three programming languages used in the
business world.

Truth be told, in the 2024 Kaggle Machine Learning and Data Science Survey, the top
programming languages included both #1 Python and #3 R (#2 was SQL), but SAS
came in at #12 on the list. I believe that this is a result of selection bias. The survey
results show that nearly half of the respondents have less than five years of
programming experience and that almost 40% of the respondents were students.
SAS is generally not taught in schools because the full SAS product line is a license-
based product. Universities and students prefer freeware for obvious reasons.

1 Yes, I realize that Python is listed as a programming language for software engineering, web
development and data science. It is a very versatile language with a lot of applications and a huge
support community. This is one of the main reasons that it is used extensively in academia. You should
definitely at least learn the basics of working with this programming language. However, many
corporations do not support Python due to a legacy codebase developed in other languages or they
feel that other analytical product offerings support their business needs in a more integrated manner.

However, SAS does offer a free student version of its product, which contains all the
capabilities of a fully licensed product, subject to a 5GB file size limit. For most students,
this SAS freeware version will meet 90% of their needs.

So, including Python and R as part of this book is obvious, but why include SAS when
it is listed as #12 on the data science popularity list? The reason is that in the real
world, SAS is still a dominant analytical environment where senior data scientists
have developed analytical products, automated processes, and bleeding-edge
artificial intelligence codebases that contain over five decades of knowledge. SAS is
still the leading programming language throughout the financial services, insurance,
and biotech industries and all of government services. If you want to work in any of
these fields, you will need to transform your freeware knowledge into SAS.

The Gartner Magic Quadrant reviews and analyzes all the major data science and
machine learning (DS/ML) software products. This independent evaluation has
placed SAS as the top DS/ML software provider for its entire review history, lasting
over 15 years. Figure 1.1 shows the Gartner Magic Quadrant for DS/ML software
products. This quadrant analysis plots DS/ML software companies where the X axis
represents the company’s “completeness of vision,” and the Y axis represents the
company’s “ability to execute.”

Figure 1.1: Gartner Magic Quadrant for DS/ML Software Products

Although Python and R are not specifically identified in the quadrant analysis, they
are the programming languages that drive the data science and machine learning
capabilities of most of the other DS/ML software providers represented in Figure 1.1.

This chapter will provide you with an overview of the Integrated Development
Environment (IDE) for Python Spyder, RStudio, and SAS Studio. Although the
underlying constructs of the environments are quite different, the user experience
and overall look of the environments are surprisingly similar.

Programming Environments

Every analytical programming environment has its own look and feel that can be
customized in a multitude of ways. The arrangement of the coding window, the
output window, the data set view, and the graphics display, along with the look and
feel of the font and background colors – all of this can be customized. Some
programmers prefer “dark mode,” while others prefer a white background. In order
to cut through all the confusion that can be brought on by customization, I will use
the default settings in the programming environments in this book.

The following sections will provide you with the links to either access or download
each of the respective software programs. These sections will not include step-by-
step installation procedures because these procedures vary depending on which
operating system you are using and which version of the software you choose to
install. Fortunately, each of these software programs has highly detailed support
documentation and large user communities that will quickly guide you through the
installation process.

Before we install these programs, let’s quickly review each programming
environment in terms of its functionality, performance, and ease of use.

SAS Studio:

● Functionality: SAS Studio provides a comprehensive suite of tools for data
manipulation, analysis, and modeling. It offers a wide range of statistical
procedures, data management capabilities, and advanced analytics
techniques. SAS Studio excels in handling large data sets and has robust
support for data cleaning, transformation, and statistical modeling.

● Performance: SAS Studio is known for its efficient and optimized processing,
making it suitable for handling large-scale data and complex analyses. It
leverages the SAS platform's powerful engine to deliver high-performance
computing capabilities.

● Ease of use: SAS Studio has a user-friendly interface with a point-and-click
environment, making it accessible to users who do not have extensive
programming knowledge. It also supports the SAS programming language
for those who prefer coding.

Python Spyder:

● Functionality: Python Spyder is a versatile IDE specifically designed for
scientific computing and data analysis. It offers a wide range of data
manipulation libraries, statistical packages, and machine learning
frameworks. Spyder integrates seamlessly with popular Python libraries like
NumPy, Pandas, matplotlib, and scikit-learn.

● Performance: Python is a highly efficient programming language that excels
at various data processing tasks. Spyder benefits from the optimized Python
ecosystem and can efficiently manage large data sets.

● Ease of use: Spyder provides an intuitive interface with interactive consoles,
variable explorers, and a powerful code editor. It supports code completion,
debugging, and other features that enhance the development experience.

RStudio:

● Functionality: RStudio is a powerful IDE dedicated to the R programming
language, which is widely used in statistical analysis and data science.
RStudio provides a range of tools and packages for data manipulation,
visualization, statistical modeling, and machine learning. It has extensive
support for statistical techniques and offers specialized libraries for specific
domains.

● Performance: R is known for its excellent statistical capabilities and
performs well in statistical modeling tasks. However, due to its inherent
single-threaded nature, it may face some performance challenges when
dealing with large data sets.

● Ease of use: RStudio offers a user-friendly environment with an intuitive
interface, integrated help documentation, and various features to enhance
productivity. It has a strong community and a vast repository of packages,
making it easy to find and implement specific data science functionalities.

Comparison:

● SAS Studio stands out in its comprehensive suite of statistical procedures
and data management capabilities, making it suitable for advanced analytics
tasks. It excels in handling large data sets and offers efficient processing.

● Python Spyder, with its versatile libraries and extensive ecosystem, provides
flexibility for various data analysis and machine learning tasks. It benefits
from Python's efficiency and performance.

● RStudio shines in its rich statistical capabilities and extensive library support.
It is widely adopted in the academic and research community, offering
specialized packages for specific statistical domains.

● SAS Studio and RStudio have point-and-click interfaces, making them more
accessible to users without programming expertise. Python Spyder caters to
both coding and point-and-click approaches.

● Python Spyder and RStudio have larger communities and a broader range of
open-source libraries available, while SAS Studio offers the advantage of
being a comprehensive, integrated environment.

Ultimately, the choice of programming environment depends on your specific
needs, background, and preferences. All three options provide powerful tools for
data science, and it can be beneficial to have familiarity with multiple environments
to leverage their strengths in different scenarios.

SAS OnDemand for Academics

SAS OnDemand for Academics: https://welcome.oda.sas.com/login

SAS OnDemand for Academics (SODA) is a powerful and versatile web-based
environment that provides free access to SAS software for educational purposes. It
offers the full functionality of SAS Studio, enabling students, educators, and
researchers to leverage the power of SAS tools for data analysis, statistical
modeling, and predictive analytics.

With SAS OnDemand for Academics, users can perform complex data
manipulations, explore large data sets, and conduct advanced statistical analyses.
The environment provides an intuitive interface that enables users to write and
execute SAS code seamlessly, with features such as syntax highlighting,
autocompletion, and error checking to enhance the programming experience.

One key advantage of SAS OnDemand for Academics is its extensive library of SAS
procedures and functions, which covers a wide range of statistical techniques and
data analysis methods. Users can easily access and use these procedures to perform
tasks such as data cleaning, regression analysis, clustering, and more. Additionally,
SAS offers comprehensive documentation and resources to support users in learning
and applying statistical methods effectively.

SAS OnDemand for Academics also provides a collaborative environment where
users can share and collaborate on SAS code and projects. This is particularly
beneficial for educational institutions and research teams working on data-driven
projects, as it enables seamless collaboration and knowledge sharing among team
members.

Furthermore, SAS OnDemand for Academics offers the advantage of cloud-based
computing, eliminating the need for local installation and configuration of SAS
software. This allows users to access their SAS projects and analyses from any
device with an internet connection, providing flexibility and convenience.

Overall, SAS OnDemand for Academics is a valuable resource for students,
educators, and researchers in academia who want to leverage the power of SAS for
their data analysis and research projects. It provides a user-friendly interface,
extensive statistical capabilities, and a collaborative environment, making it a
valuable tool for learning and applying data science techniques using SAS.

The process to get started in SODA is easy. Once you access the link provided above,
you will register as a user and receive an email. Just follow the guided process, and
you will be able to access SODA within 5 minutes. I will not bore you with
screenshots of how to fill out a registration form or accept a license agreement.
We’ve all done these many times before.

Once you register and sign in, you will be presented with the SODA dashboard. For
our current purposes, we will select the first item, SAS Studio. In later chapters, we
will explore SASPy access to SAS-hosted servers along with several other ways to
integrate SAS, Python, and R.

Figure 1.2: SAS OnDemand for Academics Dashboard

SAS Studio IDE

Once we select the SAS Studio option, we are connected to the SAS Studio IDE
(Integrated Development Environment). This is the primary programming
environment for SAS users. Figure 1.3 demonstrates the SAS Studio IDE.

The IDE consists of two main sections. On the left is the navigation pane. This
section contains a series of drop-down menus that enable you to select point-and-
click procedures to perform a variety of tasks. These tasks can range from simply
importing data to creating graphs or even developing advanced machine learning
models.

The navigation pane also contains code snippets that provide you with the building
blocks of SAS code to create various data manipulation procedures and graphical
output. The combination of the point-and-click and code snippet features provides
data scientists with a powerful arsenal of prepared data wrangling and analytical
techniques that will get you started in SAS Studio very quickly. These are the types
of features that you cannot find in most freeware products.

The second section of the SAS Studio IDE is the work area. This is the section where
you will perform most of your work. SAS code is developed in the code window.
Once the code has been submitted, the log window will provide you with a variety
of information, including the number of observations included in each section of the
submitted code, along with runtime information and will also contain an error log.
SAS error logs are generally informative and provide you with common language
information that helps you debug your code (unlike some programming languages’ error
logs that only add to your confusion about what went wrong).

The work area also contains a results window. This window displays output
requested as part of the code, such as frequency distributions, database contents,
metadata, and graphical output.

Figure 1.3: SAS Studio IDE



Log

In the example above, I created a simple import statement that accesses a .csv file
that I stored locally and imported into the SAS Studio environment. Once the
program has completed running, SAS Studio automatically generates a log that
provides you with information about the completed process. Figure 1.4
demonstrates that the log will contain three sections in the “Log Tabs” area.

The first section will identify any errors that occurred during the processing. These
are often hard stops that will prevent the rest of the program from running. SAS
error logs are the most useful error logs among the most popular analytical software
packages. SAS error logs will show you exactly which part of the submitted code has
a problem and provide you with suggestions on how to resolve the problem. For
example, the error log will specify if you are missing a semicolon to close a
statement or if there is a misalignment due to variable formatting, or many other
issues that arise when debugging code. The combination of specifying exactly where
the problem is occurring along with common language descriptions of what the
problem is and suggestions on how to resolve the problem can save you hours of
debugging time.

The second section is the warning section. This will provide information about
possible issues that need to be addressed. Warnings will not terminate a program,
but they will provide you with information about the program that you might want
to address. These issues can relate to long processing times, truncated variable
names, calculation issues, or a variety of other items that you may need to be aware of.

The final section is the notes section. This item will provide general information
about the completed process. It will commonly include the number of observations
and variables in a data set, the total processing time, how much memory was used,
along with several other items depending on the nature of the program that was
run.

Figure 1.4: SAS Studio Log Output

Results and Output

Once the program has successfully run, the results can be found in the results and
output data sections. These sections provide you with information concerning the
completed process and the final output of that process. In the example shown in
Figure 1.5, the program imported data from a local .csv file and loaded it into the
SAS Studio environment. The results window contains the output of a PROC
CONTENTS statement (we will cover exactly what this is in a later section). The
information displayed in the results window will vary depending on the nature of
the program that was run. For example, if you ran a program designed to create a
histogram, then the visual output of the histogram would be contained in the results
window.

Figure 1.5: SAS Studio Results Output

The final piece of information provided with a successful program run is the output
data. This tab provides you with a spreadsheet-like view of the resulting data set
created from a program run. Figure 1.6 shows you the two main pieces of
information in the output data tab. First, the output contains a list of all the
variables contained in the data table, along with the properties of each of the
variables. These properties include the variable label, name, length, type, format,
and informat.

The second piece of information is a view of the data table. This view is incredibly
useful in understanding the contents of the data set and its completeness. The data
table shows all the variables and observations in the data set in a spreadsheet-like
view.

Figure 1.6: SAS Studio Output Data Tab

SAS Studio provides the data scientist with lots of features that save development
time and reduce debugging headaches. The prepared code snippets provide you
with valuable instruction on a variety of topics such as data preparation, data
manipulation, analytics, machine learning, and graphical output. The log information
provides you with specific information that reduces debugging time, and the file
management system allows you to organize your project workflow.

SAS Studio is also the interface for the latest SAS product, SAS Viya. Although this
product is not the focus of this book, it is a fantastic data science and machine
learning environment that provides cutting-edge machine learning algorithms along
with highly efficient distributed database storage and processing.

Python Spyder

Python Spyder download: https://www.spyder-ide.org/

or

Anaconda download: https://www.anaconda.com/products/individual



Python is a broad programming language that can be used for software engineering,
data science, web development and many other things. Because of the language’s
versatility, the Python Spyder IDE is not set up specifically for data science and
machine learning. When compared to the analytical environment of SAS Studio,
Python Spyder can look a bit basic on its surface. However, this could not be further
from the truth.

The power of the Python Spyder environment is not in its “bells and whistles” but its
inherent flexibility. Since Python is an open-source product, data scientists have a
wide array of libraries that they can download and use for free. These libraries can
range from mathematical formulas to machine learning algorithms to graphical
displays to web development to just about anything that you can think of.

Python Spyder IDE

Python Spyder is a dedicated integrated development environment (IDE) for the
Python programming language, widely used in data science and machine learning.
Spyder combines a powerful code editor, an interactive console, and various tools to
support efficient development and analysis workflows. It offers a comprehensive set
of features tailored specifically for scientific computing and data exploration.

Spyder provides a user-friendly interface that enables data scientists to write and
execute Python code with ease. The IDE includes features such as syntax
highlighting, code completion, and error detection, which enhance the coding
experience and improve productivity. Spyder also integrates with popular Python
libraries and frameworks commonly used in data science, such as NumPy, pandas,
matplotlib, and scikit-learn, allowing seamless integration of these tools into the
workflow.

One of the key strengths of Spyder is its interactive console, which provides a
convenient environment for exploring and manipulating data. Data scientists can
execute Python commands and view the results instantly, making it easy to test and
iterate code snippets. The IDE also supports advanced debugging features, profiling
tools, and variable exploration, which are valuable for troubleshooting and
optimizing code performance.

Spyder offers extensive support for data visualization, with built-in plotting
capabilities and integration with popular data visualization libraries like Matplotlib
and Seaborn. This enables data scientists to generate insightful visualizations to gain
a deeper understanding of the data. Spyder also provides a comprehensive data
editor that allows users to view and manipulate data sets directly within the IDE.

With its rich set of features and focus on scientific computing, Spyder is a valuable
tool for data scientists and researchers working with Python. Its user-friendly
interface, extensive library integration, and interactive console make it an excellent
choice for exploratory data analysis, model development, and scientific computing
tasks. Spyder's versatility and flexibility make it suitable for both beginners and
experienced Python programmers in the field of data science.

There are three main components to the Spyder IDE. Figure 1.7 shows the three
sections as the editor, the console, and the help panel.

Editor
The editor is where you will create most of your code. This section allows you to
type directly into the editor panel to create your code. You can then run the code by
either highlighting the section and selecting Ctrl + Enter, or you can highlight the
section that you want to run and then select the green arrow button above the
editor panel. Finally, you can also right-click within the editor panel and select the
run function from the list of features that pop up. The editor panel allows you to
open existing programs or create new ones within the editor panel.

Console
The console panel allows you to type directly into the console interface, just like if
you were to create and run programs in your computer’s command panel. This
panel will enable you to work interactively within the console panel or run code in
the editor. It also provides some high-level information, such as which version of
Python you are using.
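If you would rather confirm the version programmatically than read it from the console banner, a one-line check (shown here only as an illustration) is:

import sys
print(sys.version)   # prints the same Python version information the console reports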

Figure 1.7 shows a small program developed in the editor that takes a base input
and raises it to a specified power. In this case, eight is raised to the power of two,
and the program prints the result of the calculation. The output of the code is
provided in the console panel.
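The exact code from Figure 1.7 is not reproduced here, but a minimal sketch of the program described above (the variable names are illustrative) could look like this:

base = 8
power = 2
result = base ** power   # raise the base to the specified power
print(result)            # the value 64 appears in the console panel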

Figure 1.7: Python Spyder IDE

Help Panel
The help panel has four sections. These sections include files, plots, help, and the
variable explorer. Figure 1.7 shows the file section. This section allows you to
explore and structure your project files. Figure 1.8 shows the plot area of the help
section. This figure demonstrates some simple code that reads a .csv file and plots
the data points on a line graph. The graphical results are displayed in the plots area.
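The code in the figure is not reproduced here, but a minimal sketch of that kind of task, assuming a local file named data.csv with a numeric column named value, might look like this:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')     # read the local .csv file into a DataFrame
df['value'].plot(kind='line')    # plot the data points as a line graph
plt.show()                       # the chart is rendered in the Plots area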

Figure 1.8: Python Spyder Plots

The help section provides documentation for any object with a docstring. A
docstring functions like an embedded comment that contains information about
modules, classes, functions, and methods. If you have a question about any feature
of Python, you can simply place your cursor after the item in question and press Ctrl
+ I. This will bring up the documentation within the help panel.
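To make the idea concrete, here is a minimal example of a user-defined function with its own docstring (the function and its name are purely illustrative); placing the cursor after the function name and pressing Ctrl + I, or calling help(), surfaces this text:

def cap_value(x, lower, upper):
    """Cap a value between a lower and an upper bound.

    This triple-quoted text is the docstring that the help panel displays.
    """
    return max(lower, min(x, upper))

help(cap_value)   # prints the docstring in the console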

Figure 1.9 provides an example. In this example, I want to know more information
about DataFrames. I placed the cursor just after the DataFrame statement and
pressed Ctrl + I. The documentation for DataFrames appears in the help section. You
can also manually enter an object’s name directly into the “Object” area just above
the help window.

Finally, you can enable automatic help by changing your settings under Preferences
– Help – Automatic Connections. Once this setting is selected, you can turn it on and
off by using the lock icon directly to the right of the Object area above the help
window. With automatic help enabled, you can simply type an open left
parenthesis after a function or method name and the associated information will
appear in the help window. For example, if I wanted more information about arrays,
I would type np.array( into the editor, and the documentation for arrays will appear
in the help window.

Figure 1.9: Python Spyder Help

Variable Explorer
The final area in Python Spyder is the variable explorer area. This section provides
metadata information for all data sets in each project. Figure 1.10 shows the results
from an import statement. The work area has the code to pull a local data set into
Spyder. The console section provides information concerning the data pull, and the
variable explorer section provides metadata about the imported data set. This
metadata includes information about what type of data set it is, the variables
contained in the data set, the data types, the data size, and each variable’s values.

The variable explorer section is interactive. You can click the DataFrame’s name to
view the whole data set. You can even click on the individual variable’s name to see
an individual column of that variable’s values. The variable explorer section will also
let you right-click on a variable and provide you with options to delete or modify its
values.
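The same kind of metadata can also be pulled up programmatically from the console; a short sketch, again assuming a local file named data.csv:

import pandas as pd

df = pd.read_csv('data.csv')

print(type(df))    # the object type (a pandas DataFrame)
print(df.shape)    # the number of rows and columns
print(df.dtypes)   # the data type of each variable
print(df.head())   # the first few observations and their values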

Figure 1.10: Python Spyder Variable Explorer

The Python Spyder IDE is a powerful and flexible tool that provides data scientists
with several features to organize, develop and deploy various projects. The IDE
provides detailed information about coding features, graphical displays, object
documentation, data set and variable information, and interactive console and data
set features.

RStudio

RStudio download: https://www.rstudio.com/products/rstudio/download/

RStudio is a popular integrated development environment (IDE) specifically
designed for the R programming language, which is widely used in statistical analysis
and data science. It provides a powerful and user-friendly environment for data
exploration, visualization, and modeling. RStudio combines a code editor, a console
for interactive R programming, and various tools to enhance the data science
workflow.

With RStudio, data scientists and statisticians can leverage R's extensive capabilities
to perform complex analyses and build sophisticated models. The IDE offers an
intuitive interface with features such as syntax highlighting, code completion, and
error checking, facilitating efficient coding and debugging. RStudio also provides
seamless integration with R's vast ecosystem of packages, which cover a wide range
of statistical techniques, machine learning algorithms, and visualization tools.

One of RStudio's notable strengths is its emphasis on reproducibility and


collaboration. It allows users to organize their work into projects that encapsulate
data, code, and outputs, making reproducing and sharing analyses easier. RStudio
also supports version control integration, enabling collaborative workflows and
ensuring the traceability of code changes.

Furthermore, RStudio provides a range of additional features to enhance the data
science process, including integrated documentation and help resources, interactive
data visualization tools, and support for creating interactive web applications. The
IDE's vibrant and supportive community contributes to the wealth of available
resources, tutorials, and packages, making it an ideal environment for learning and
advancing in the field of data science.

Whether you are a beginner or an experienced data scientist, RStudio offers a
versatile and powerful platform for exploring, analyzing, and visualizing data,
building predictive models, and conducting statistical analyses. Its user-friendly
interface, extensive package ecosystem, and emphasis on reproducibility make it a
valuable tool for data science projects of all scales and complexities.

RStudio IDE

Figure 1.11 shows the RStudio IDE. Notice how similarly all three interfaces have
been constructed. The RStudio interface is not much different than the SAS Studio
and Python Spyder interfaces. There are always some changes in the location of
certain sections and features, but the four main sections are present. RStudio has a
code editor in the top left section, a console and terminal section in the bottom left
area, a global environment section in the top right, and a tab selection section of
files, plots, R packages, and views in the bottom right area. The location of each of
these sections is changeable, so you can configure them in a manner that best suits
you.

Figure 1.11: RStudio IDE

Global Environment

All the data sets you have created are available for inspection in the global
environment area. On the far right side of the section containing the data set name,
you will find a spreadsheet-like icon that you can select. This will bring up a
spreadsheet view of your data set. Figure 1.12 demonstrates this view.

The spreadsheet view provides a great amount of insight into your data set. You can
easily see the number of variables, the data types, value ranges and many other
valuable pieces of information.

Figure 1.12: RStudio – Data Viewer

Packages

One of the main features that separates the RStudio interface from the other IDEs is
the R packages tab located in the bottom right area of my configuration. Figure 1.13
shows the RStudio Package Editor. This is an interface that allows the programmer
to easily search, install, or delete any of the available R packages.

We will review libraries and packages in the next section of the book, but for now,
understand that these packages contain logic that allows a programmer to develop
data science models, web pages, documentation, graphics, and many other
possibilities. They are essentially expansion packs for your R language.

To find a new R package, you simply select the “Install” button just above the
packages area. An interface will come up, and you can type the name of the package
that you want to install. Figure 1.13 shows the view with the package installer
interface. In this example, I typed the package name “caret” into the package
installer. Once the “Install” button is selected, RStudio automatically downloads the
package and all its dependencies and installs them in the area that you have
selected.

Once the package is installed, you will find all the installation notes in your console
section and the newly downloaded and installed package will be at the top of your
packages list. This is by far the easiest interface to search, download, install and
update libraries/packages across all the IDEs that we have reviewed.

Figure 1.13: RStudio Package Editor

The RStudio IDE is a flexible and powerful interface that extends far beyond the
original design of calculating statistics and performing data analysis. With the
extension of packages such as “Shiny,” RStudio is a fully integrated website and web
app development tool. Although these features are beyond the scope of this book,
they are powerful and awesome tools that are definitely worth exploring.

Chapter 1: Programming Environments – Conclusion and Transition

In this chapter, we explored the three primary programming environments that you
will use throughout this book: SAS Studio, Python Spyder, and RStudio. We
compared their interfaces, functionalities, and unique features, setting the stage for
our journey through data science and machine learning across these platforms.

Now that you're familiar with the environments in which we'll be working, the next
step is to delve into the data itself. After all, data is the foundation of any data
science project, and how we gather, clean, and prepare it will significantly influence
the outcomes of our analyses.

In the upcoming chapter, we'll shift our focus to the process of gathering data. You'll
learn how to import data from various sources into SAS, Python, and R, and we'll
discuss best practices for ensuring the quality and integrity of the data you work
with. Understanding how to gather and prepare your data efficiently is crucial as it
sets the stage for the subsequent steps in your data science workflow.

So, with your programming environment ready, let's move on to exploring the
diverse world of data collection and preparation. This is where the real work begins,
and mastering these techniques will empower you to build robust, reliable models
in the chapters to come.

Chapter 1 Summary: Programming Environments

1. Introduction to Data Science and Programming Languages

● Overview: Data science is a pervasive force that influences many aspects of
our daily lives, from personalized recommendations on streaming platforms
like Netflix to decisions made in sectors such as healthcare, finance, and
government. This chapter emphasizes the importance of programming
languages in executing data science projects, specifically focusing on SAS,
Python, and R.

● Context: The chapter sets the stage for the book’s central theme – cross-
referencing SAS, Python, and R – by highlighting their significance in the
data science community, even though their popularity may vary across
different user bases, as evidenced by the 2024 Kaggle Machine Learning and
Data Science Survey.

2. Importance of SAS, Python, and R in Data Science

● SAS: Despite ranking lower in some popularity surveys, SAS remains a
dominant tool in industries like financial services, insurance, biotech, and
government services due to its comprehensive suite of data analysis and
management tools. The chapter discusses the historical and current
relevance of SAS, particularly in high-stakes, data-intensive environments.

● Python: Python’s rise to prominence in data science is attributed to its
versatility, ease of use, and extensive library support for machine learning,
data manipulation, and visualization. The chapter underscores Python's
widespread adoption, especially in academia and among novice
programmers.

● R: As a language designed specifically for statistical analysis, R excels in


providing powerful statistical and graphical capabilities. The chapter
highlights R’s strong presence in academic research and specialized domains
requiring robust statistical computations.

3. Introduction to Integrated Development Environments (IDEs)

● Overview of IDEs: This chapter introduces the three primary IDEs associated with SAS, Python, and R: SAS Studio, Python Spyder, and RStudio. Each IDE is described in terms of its functionality, performance, and ease of use, setting the foundation for their cross-referencing in the rest of the book.

● Comparison of IDEs:

o SAS Studio: Known for its powerful data management capabilities


and a user-friendly point-and-click interface, SAS Studio is ideal for
handling large data sets and conducting advanced analytics.

o Python Spyder: A versatile IDE designed for scientific computing,


Spyder offers robust features like an interactive console, code
completion, and extensive library integration, making it a favorite
among Python data scientists.

o RStudio: Tailored specifically for R, RStudio supports a wide range of


statistical techniques and provides an intuitive environment that
fosters reproducibility and collaboration through features like
version control and project management.

4. Detailed Exploration of Each IDE

● SAS Studio:

o Functionality: The chapter delves into SAS Studio’s capabilities,


including data manipulation, advanced statistical procedures, and
the integration of machine learning algorithms. It highlights the
IDE’s support for both coding and non-coding users through its
point-and-click interface.

o Performance: SAS Studio’s efficient processing of large-scale data


and complex analyses is emphasized, particularly its optimized
engine that ensures high-performance computing.

o Ease of Use: The chapter discusses how SAS Studio’s interface


caters to users with varying levels of programming expertise,
making it accessible while retaining powerful analytical tools.

● Python Spyder:

o Functionality: Python Spyder is a comprehensive IDE that supports scientific computing with seamless integration of Python libraries like NumPy, pandas, and scikit-learn. The chapter explains how these libraries facilitate a wide range of data science tasks, from data cleaning to machine learning.

o Performance: The chapter notes Python’s efficiency in handling


data processing tasks within Spyder, leveraging Python’s strong
performance across different computing environments.

o Ease of Use: Spyder’s interactive console, debugging tools, and


user-friendly code editor are discussed, highlighting how they
enhance the development experience for Python programmers.

● RStudio:

o Functionality: RStudio is described as an IDE that not only supports


statistical modeling and data visualization but also encourages
reproducibility and collaboration. The chapter details RStudio’s
extensive package ecosystem, which includes specialized libraries
for various statistical domains.

o Performance: While R is noted for its statistical strengths, the


chapter also addresses potential performance challenges when
working with large data sets due to R’s single-threaded nature.

o Ease of Use: The chapter emphasizes RStudio’s user-friendly


environment, which includes integrated help documentation, data
visualization tools, and support for creating interactive web
applications.

5. Comparison and Choosing the Right IDE

● Overview: The chapter concludes with a comparative analysis of SAS Studio,


Python Spyder, and RStudio, summarizing their strengths and identifying the
scenarios in which each might be most effective.

● Choosing an IDE: The chapter advises readers on selecting an IDE based on


their specific needs, such as the size of the data set, the complexity of the
analysis, and their familiarity with the programming language.

6. Looking Ahead

● Transition to Analytical Concepts: The chapter prepares the reader for the
transition from understanding the programming environments to applying
analytical concepts across SAS, Python, and R in subsequent chapters.

● Cross-Referencing: It sets the stage for the book’s core theme –


demonstrating how the same analytical tasks can be executed across
different programming languages, allowing data scientists to leverage the
strengths of each language and environment.

Chapter 1 Quiz

Questions:

1. What are the primary advantages of using SAS in industries such as financial
services and biotechnology?

2. How does Python’s open-source nature benefit data scientists in terms of


library availability and community support?

3. In what ways does R excel in statistical analysis and visualization compared


to other programming languages?

4. Explain how SAS Studio’s point-and-click interface can benefit data scientists
who may not have extensive programming experience.

5. Discuss the role of Integrated Development Environments (IDEs) in


enhancing the productivity of data scientists.

6. How does Python Spyder facilitate the integration of data manipulation and
machine learning libraries such as NumPy, pandas, and scikit-learn?

7. What are the key features of RStudio that make it suitable for reproducible
research and collaboration in data science?

8. Describe how SAS Studio handles large data sets and why this is important
for advanced analytics.

9. What advantages does Python Spyder offer for exploratory data analysis in
scientific computing?

10. How does RStudio’s package management system contribute to its flexibility
in data science projects?

11. Explain the significance of cross-referencing programming languages like


SAS, Python, and R in a data science project.

12. How can knowledge of multiple IDEs and programming languages enhance a
data scientist’s ability to tackle complex data problems?

13. Describe how SAS Studio’s log output assists in debugging and optimizing
code.

14. What role does the interactive console in Python Spyder play in iterative
code development?

15. How does RStudio support the creation of interactive data visualizations and
what are some use cases?

16. Why is it important for data scientists to be familiar with the different
programming environments discussed in this chapter?

17. Discuss the impact of the long-standing presence of SAS in the industry on
the development of analytical products and automated processes.

18. How do Python’s extensive libraries and frameworks support the


implementation of machine learning models?

19. What are the benefits of using RStudio for statistical modeling in academic
and research settings?

20. How does understanding the similarities and differences between SAS,
Python, and R improve a data scientist’s versatility and problem-solving
capabilities?

Chapter 1 Cheat Sheet

Brief Description
● SAS Studio: A comprehensive analytical platform widely used in industries such as finance, insurance, and government for advanced data management and statistical analysis.
● Python Spyder: A versatile, open-source IDE designed for scientific computing, offering seamless integration with Python libraries.
● RStudio: A powerful IDE specifically designed for statistical computing and graphics, widely used in academia and research.

Main Use Cases
● SAS Studio: Data management; advanced analytics; reporting and compliance in regulated industries.
● Python Spyder: Data manipulation; machine learning; scientific computing.
● RStudio: Statistical analysis; data visualization; reproducible research.

Primary Strengths
● SAS Studio: Robust data management capabilities; extensive built-in statistical procedures; highly optimized for large data sets.
● Python Spyder: Wide range of libraries (e.g., NumPy, pandas, scikit-learn); interactive development environment; flexibility with open-source tools.
● RStudio: Rich package ecosystem for statistical analysis; strong community support; emphasis on reproducibility and collaboration.

Key Features
● SAS Studio: Point-and-click interface for non-programmers; detailed log outputs for debugging; integration with SAS Viya for advanced analytics.
● Python Spyder: Interactive console for real-time code execution; integrated help and documentation; variable explorer for easy data inspection.
● RStudio: Integrated help system and package management; support for version control and projects; interactive data visualization tools.

Ease of Use
● SAS Studio: Accessible to both novice and experienced users due to the combination of a point-and-click interface and coding capabilities.
● Python Spyder: User-friendly, with features like code completion, syntax highlighting, and debugging tools.
● RStudio: Intuitive interface with support for both basic and advanced users, especially in statistical modeling.

Performance
● SAS Studio: Highly efficient in processing large-scale data, particularly in regulated environments requiring rigorous data handling.
● Python Spyder: Efficient, with Python's strong performance across data processing tasks; benefits from an optimized open-source ecosystem.
● RStudio: Strong in statistical modeling but may face performance challenges with very large data sets due to R's single-threaded nature.

Installation
● SAS Studio: Access via SAS OnDemand for Academics (web-based) or licensed installations.
● Python Spyder: Available via direct download or through the Anaconda distribution.
● RStudio: Available as a free download, with extensive documentation and community support.

Best Practices
● SAS Studio: Utilize the built-in code snippets for common tasks; leverage the extensive SAS procedures for advanced analytics.
● Python Spyder: Regularly update Python libraries to access the latest features; use Spyder's variable explorer for data inspection during development.
● RStudio: Organize projects using RStudio's project management features; explore the wide range of packages for specialized tasks.

Resources
● SAS Studio: Comprehensive SAS documentation; active user community forums; training and certification programs available.
● Python Spyder: Extensive online documentation and community support; Python Package Index (PyPI) for accessing libraries; tutorials and webinars.
● RStudio: CRAN (Comprehensive R Archive Network) for accessing packages; RStudio's online resources and webinars; active community and Stack Overflow support.

Chapter 2: Gathering Data

Overview

Garbage in, garbage out (GIGO). This is perhaps the most important concept of data
science. Here’s why:

● Poor-quality data results in poor-quality outputs, regardless of the power of


your machine learning algorithm.
● The volume of data does not compensate for poor data quality.
● High-quality data is essential for building reliable and robust models.

GIGO refers to the quality of the data you use to build your data science projects. If
you have poor-quality data as your input, then you will undoubtedly have a poor-
quality output. It doesn’t matter how powerful your machine learning algorithm is
or how many terabytes of data you have; if you put garbage in, you will get garbage
out. Imagine trying to build a house on sand with a foundation made of balsa wood.
It doesn’t matter how sophisticated the design of the house is; I would not live in it.

This book aims to provide you with a tool that will allow you to utilize your existing
knowledge of a programming language and expand it to other programming
languages. If you already know SAS but have a new project that requires you to
develop it in R, this book will provide you with a cross-reference guide where you
can look up a SAS procedure that you already know and convert it into R code.

As useful as this cross-reference guide can be, the real goal is to learn each of these
programming languages to a point where we do not need to look up an existing
procedure we already know. Instead, given time and practice, you will learn two
new programming languages and expand the skills of your existing programming
knowledge.

The structure of this book will be based on a single project. We will import data,
analyze and transform the data, perform feature engineering, create several
machine learning models, and evaluate the effectiveness of these models. For
consistency and to build our skills throughout each project phase, we will focus on a
single data set with a single objective. You should be able to transfer each model

development step to nearly any data science project that you create. So, even
though we will focus on a single project in a specific industry, you can apply these
skills to various projects in any industry.

We will not discuss the underlying foundations of data science or the inner workings of the different modeling algorithms in great detail. For a more in-depth study of these concepts, check out my previous book, End-to-End Data Science with SAS.

Project Overview

The project focuses on building a risk model using the Lending Club data set to
predict loan default based on application information. The Lending Club data set is a
comprehensive collection of loan data from the Lending Club platform, a peer-to-
peer lending marketplace. It provides valuable insights into borrower characteristics
and loan details, making it an ideal data set for risk assessment.

The structure of the Lending Club data set follows a common format for supervised
machine learning tasks:

● It consists of rows and columns, where each row represents a loan


application, and each column represents a specific attribute or feature of
that loan application.

● The data set includes various information such as borrower demographics,


loan purpose, loan amount, interest rate, employment details, credit
history, and more.

● However, one crucial aspect is that the data set does not include a pre-
existing target variable indicating loan default. Therefore, we need to derive
this target variable ourselves based on the available information in the data
set.

Creating a target variable is an essential step in this project. We will define loan
default based on specific criteria such as past due payments, charged-off status, or
other relevant indicators available in the data set. Creating the target variable from
the available information is crucial in accurately training our risk model.

Through data exploration, feature engineering, and applying suitable machine


learning techniques, we will leverage the Lending Club data set to develop a risk
model that can effectively predict loan default using the information available at the
point of application. This project highlights the importance of data manipulation and
feature engineering in extracting meaningful patterns from raw data to build
predictive models.

In the context of this project, let's define two key terms:

● Target Variable: The target variable, also known as the dependent variable,
is the variable that we aim to predict. In this case, the target variable is loan
default, which represents whether a borrower will default on their loan. It is
typically binary, taking on values like "default" or "non-default," "1" or "0,"
or "yes" or "no." Creating an accurate prediction for the target variable is
the primary objective of our risk model.

● Predictor Variables: Predictor variables, also known as independent


variables or features, are used to make predictions or classifications. They
provide information about the borrower's characteristics, financial history,
and loan details. In our risk model, these predictor variables will be used to
assess the likelihood of loan default.

Exploratory data analysis and manipulation techniques will be discussed in detail in


the next chapter, "Creating a Modeling Data Set." These techniques will help us
understand the patterns and relationships within the data set, manage missing
values, address data inconsistencies, and prepare the data for modeling purposes.
By combining feature engineering with machine learning techniques, we will aim to
develop a robust risk model that can effectively predict loan default based on the
available information in the Lending Club data set.

Import Data

The data set we will work with is the Lending Club loan database. This data set
represents loan accounts sourced from the Lending Club website. It provides
anonymized account-level information with data attributes representing the
borrower, loan type, loan duration, loan performance and loan status. The raw data
can be found at:

https://www.kaggle.com/wordsforthewise/lending-club

Users are required to register on Kaggle before they can download the data set.
However, the remainder of this chapter focuses on getting the data and performing
some filtering and minor data manipulation. The resulting data set is a sample of the
overall data set with fewer features. To ensure we are all working from the same
data set, I have included the reduced data set in the GitHub repository for this book.
The data set is labeled “Loan_Samp” and can be found at:

https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide

Since we will be creating risk models based on historical customer data, we will only
need to focus on the data set that contains accepted applications that result in
active accounts. We certainly cannot use rejected data because you cannot default
on a loan for which you were never approved. Therefore, we will only need to
download the “accepted_2007_to_2018Q4” file. This file should contain all the
available variables at the point of application and the critical information necessary
to create our target variable.

*Note: GitHub has a size limitation on items that you are allowed to keep in your
repository. Unfortunately, this data set exceeds the max size restriction. However, if
you apply the code shown in the next section, you should be able to download the
data. By using the “seed” option, you should also be able to generate the same
sample as I will be using throughout this book. Also, the final reduced data set can
be found in the GitHub repository for this book.

Data Dictionary

One of the nice things about the Lending Club data set is that it comes with a data
dictionary. A data dictionary is a document that contains definitions for each field in
the data set. The level of detail can vary greatly for these types of documents. This
particular one only lists the variables in the data set and their common language
definition. The full data dictionary can be found in my GitHub repository:

https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide

However, due to the size of the data and the number of attributes, we will not use all the available data. The raw data set has 2.2 million observations and 151 variables. We will limit the data to the following attributes and reduce the number of observations. Table 2.1 shows the variable type, name, and description for all variables that we will use as part of this project. Notice that “loan_status” will be used to construct our target variable, and the remaining data attributes will be our predictors.

Table 2.1: Data Dictionary of Limited Data Set


Target variable:
● loan_status: Current status of the loan.

Predictive variables:
● loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
● term: The number of payments on the loan. Values are in months and can be either 36 or 60.
● int_rate: Interest rate on the loan.
● installment: The monthly payment owed by the borrower if the loan originates.
● grade: LC assigned loan grade.
● sub_grade: LC assigned loan subgrade.
● emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
● home_ownership: The homeownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER.
● annual_inc: The self-reported annual income provided by the borrower during registration.
● verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified.
● issue_d: The month which the loan was funded.
● purpose: A category provided by the borrower for the loan request.
● dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
● earliest_cr_line: The month the borrower's earliest reported credit line was opened.
● open_acc: The number of open credit lines in the borrower's credit file.
● pub_rec: Number of derogatory public records.
● revol_bal: Total credit revolving balance.
● revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
● total_acc: The total number of credit lines currently in the borrower's credit file.
● application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers.
● mort_acc: Number of mortgage accounts.
● pub_rec_bankruptcies: Number of public record bankruptcies.

Our first step is to download the raw data to our laptop (or put it on a server if you
have that kind of setup). Once the data is downloaded, we will import it into our
analytical tool.

Although we could connect directly to the data set and read it into our analytical
program, this can lead to unforeseen data consistency issues. Websites routinely
drop data sets, update existing data sets with new data, or corrupt those data sets.
There are plenty of ways that the data can change from the original data file, so it is
a good practice to download a copy of the data set and refer to that consistent
copy.

The raw data set contains 151 variables, including the “loan_status” variable, which
we will use to develop our target variable. Many of these variables are not relevant
or useful to our project and take up a lot of space. So, we will only import a limited
set of variables that we have selected using our prior business knowledge. These
variables will focus on items that we know are related to loan default risk.

Program 2.1 shows how to import a limited set of variables with all observations
into your analytical environment. You will only need to change the highlighted
sections that specify the file pathway of where you placed the raw data file.

Program 2.1: Import Data into the Analytical Environment

R Programming:

library(data.table)

# Read in the entire data set
loan <- fread("/...INPUT YOUR FILE PATHWAY.../accepted_2007_to_2018Q4.csv")

# Select specific columns
loan <- loan[, .(id, loan_status, loan_amnt, term, int_rate, grade, sub_grade,
                 emp_length, home_ownership, annual_inc, verification_status,
                 issue_d, purpose, dti, earliest_cr_line, open_acc, pub_rec,
                 revol_bal, revol_util, total_acc, application_type, mort_acc,
                 pub_rec_bankruptcies)]

Python Programming:

import pandas as pd

# Specify the columns you want to read
cols = ['id', 'loan_status', 'loan_amnt', 'term', 'int_rate', 'grade',
        'sub_grade', 'emp_length', 'home_ownership', 'annual_inc',
        'verification_status', 'issue_d', 'purpose', 'dti', 'earliest_cr_line',
        'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
        'application_type', 'mort_acc', 'pub_rec_bankruptcies']

# Read specific columns of the CSV file using pandas
loan = pd.read_csv(r'/...INPUT YOUR FILE PATHWAY.../accepted_2007_to_2018Q4.csv', usecols=cols)

SAS Programming:

LIBNAME MYDATA BASE "/...INPUT YOUR FILE PATHWAY.../";

FILENAME REFFILE '/...INPUT YOUR FILE PATHWAY.../accepted_2007_to_2018Q4.csv';

%let vars = id loan_status loan_amnt term int_rate grade sub_grade emp_length
            home_ownership annual_inc verification_status issue_d purpose dti
            earliest_cr_line open_acc pub_rec revol_bal revol_util total_acc
            application_type mort_acc pub_rec_bankruptcies;

PROC IMPORT DATAFILE=REFFILE
    DBMS=CSV
    OUT=MYDATA.Loan (keep=&vars.);
    GETNAMES=YES;
RUN;

Notes on Program 2.1:

● Both Python and R are sensitive to line breaks inside a quoted file path. Even though the programming code above might look like the filename is spread out over a couple of lines (due to publishing standards), breaking the path across lines will either cause an error or corrupt the pathway. Remember to place all pathway specifications on the same line when using Python or R.

● SAS is not sensitive to whitespace characters, so it doesn’t matter how you spread out your code in that environment.

● Python and R require you to import specific libraries to handle the data properly, while SAS does not require you to import any additional libraries to manipulate data.

● The raw data is in a .csv format. Python uses the “read_csv” function and R uses the “fread” function to import delimited files, while SAS relies on the DBMS=CSV option within the PROC IMPORT procedure to process the delimited file.

● Each programming language has its own way of limiting the selection of variables during the data import step.

o In Python, the programmer specifies the list of variables and then uses the “usecols” option during the data import statement.

o In R, the code above reads the full file with “fread” and then keeps specific columns using data.table’s column-selection syntax; alternatively, the “select” argument of “fread” can limit the variables during the import itself.

o In SAS, the variables are placed into a macro variable, and those variables are retained using a “keep=” data set option.

● Each programming language has several methods for limiting variables during the data import process. Alternative methods provide additional flexibility and different functions.

● The Python “read_csv” function returns the data as a pandas DataFrame. This format provides several benefits that we will examine in the following chapters.

● All three programming languages create a final data set named “loan” in their respective environments. A short Python check of the imported data follows these notes.
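Before moving on, it is worth confirming that the import produced what you expected. The following is a minimal Python sketch of that kind of post-import check; the relative file path and the shortened column list are illustrative assumptions rather than the project's exact code (Program 2.1 shows the full column list and the placeholder pathway).

import pandas as pd

# Minimal post-import check (file location and abbreviated column list are
# assumptions made for illustration).
cols = ['id', 'loan_status', 'loan_amnt', 'term', 'int_rate']
loan = pd.read_csv('accepted_2007_to_2018Q4.csv', usecols=cols, low_memory=False)

print(loan.shape)    # number of rows and columns actually imported
print(loan.dtypes)   # confirm each column was read with a sensible type
print(loan.head())   # eyeball the first few records

Catching a surprise at this stage, such as a numeric field that was read in as text, is far easier than untangling it after the modeling steps have begun.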

Sampling

The Lending Club data set is not particularly large compared to data science projects that use petabytes of data to make predictions. However, to make data processing more efficient and to ensure that all of our computers can easily handle the data manipulation and machine learning techniques in the following chapters, we will create a random sample of the data set.

There are four main sampling techniques that we could perform on a data set (a brief Python sketch of stratified sampling follows this list). These include:

● Simple Random Sampling: Each member of the population has an equal


chance of being selected. This method ensures that every possible sample
has an equal probability of being chosen.

● Stratified Random Sampling: The population is divided into subgroups


(strata) based on a specific characteristic, and random samples are taken
from each stratum. This technique ensures that each subgroup is
adequately represented.

● Cluster Sampling: The population is divided into clusters, usually based on


geographical or natural groupings. A random sample of clusters is selected,
and all members within the chosen clusters are included in the sample. This
method is useful when a population is spread over a large area.

● Systematic Sampling: Every nth member of the population is selected after


a random starting point. This technique is straightforward and easy to
implement, especially with large populations.
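Of these four techniques, only simple random sampling is used in this project (Program 2.2 below). As a quick illustration of one of the alternatives, the following is a minimal Python sketch of stratified random sampling on the loan_status field. It is an assumption-based example rather than part of the project code, and the grouped .sample() call requires a reasonably recent version of pandas.

import pandas as pd

# Minimal sketch (not part of the project code): stratified random sampling by
# loan_status, so every status keeps roughly its original share of the data.
# Assumes a pandas DataFrame named `loan` with a 'loan_status' column.
frac = 100_000 / len(loan)   # overall sampling fraction

loan_strat = (
    loan.groupby('loan_status', group_keys=False)
        .sample(frac=frac, random_state=42)   # same fraction within each stratum
)

# The proportions in the stratified sample should closely match the full data.
print(loan['loan_status'].value_counts(normalize=True).round(3))
print(loan_strat['loan_status'].value_counts(normalize=True).round(3))

Simple random sampling, by contrast, draws rows without regard to loan_status, which is exactly what Program 2.2 implements in all three languages.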

For this project, we will use simple random sampling to reduce the data set's size
from 2.2 million observations to 100K observations. In later chapters, we will decide
whether to oversample or undersample the population, but for now, we will simply
create a simple random sample of the full data set to reduce its overall size while
maintaining the original data distributions of each variable.

One of the main reasons that we are not going to create a balanced data set at this
stage is that we will have to impute some missing values, adjust for outliers, and
perform a few other data manipulation techniques to prepare the data for
modeling. To perform these adjustments properly, we must base any adjustments
on the original data distributions.

Program 2.2 shows how to perform simple random sampling with a seed value. The
seed value ensures that if we were to rerun the sampling program, we would
achieve the same results (because in computer science and life, nothing is truly
random). I have chosen the seed value of 42. It doesn’t matter what value you
choose for the seed value; however, since Douglas Adams taught us that the answer
to life, the universe and everything is 42, that value is good enough for us.

Program 2.2: Simple Random Sample with Seed Value


R Programming:

set.seed(42)
loan_samp <- loan[sample(nrow(loan), size=100000)]

Python Programming:

loan_samp = loan.sample(n=100000, replace=False, random_state=42)

SAS Programming:

PROC SURVEYSELECT DATA=MYDATA.LOAN METHOD=srs N=100000 SEED=42 OUT=MYDATA.LOAN_SAMP;
RUN;

Notes on Program 2.2:

● All three programming languages explicitly set the seed value to 42.

o R uses the set.seed(42) function.

o Python uses the random_state parameter.

o SAS uses the SEED= option.

● Each program explicitly states the number of observations to retain in the final sample.

o R uses the size argument.

o Python uses the n parameter.

o SAS uses the N= option.

● All three programming languages create a final data set named “loan_samp” in their respective environments. A short Python check that the sample preserves the original distributions follows these notes.
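Because the stated goal of the sample is to preserve the original data distributions, it is worth verifying that directly. The following is a minimal Python sketch of such a check; it assumes the loan and loan_samp DataFrames from Programs 2.1 and 2.2 are still in memory and is not one of the book's program listings.

import pandas as pd

# Minimal check (assumes `loan` and `loan_samp` from Programs 2.1 and 2.2):
# compare loan_status proportions before and after simple random sampling.
full_dist = loan['loan_status'].value_counts(normalize=True).round(4)
samp_dist = loan_samp['loan_status'].value_counts(normalize=True).round(4)

comparison = pd.concat([full_dist, samp_dist], axis=1, keys=['full', 'sample'])
print(comparison)   # the two columns should be very close for every status

The same idea applies to the numeric predictors: comparing means and quartiles between the full data and the sample is a quick way to confirm that nothing was distorted by the sampling step.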

Chapter 2: Gathering Data – Conclusion and Transition

In this chapter, we explored the essential steps involved in gathering data, from
importing data from various sources into SAS, Python, and R, to ensuring that the
data is clean and ready for analysis. We discussed best practices for handling
different types of data and emphasized the importance of data quality in laying a
strong foundation for your data science projects.

With your data successfully gathered and prepared, the next step in your journey is
to transform this raw data into a format suitable for modeling. This is where the
concept of a modeling data set comes into play. In the next chapter, we will dive
into the critical process of creating a modeling data set, which involves selecting the
right features, handling missing data, and preparing the data for the specific
requirements of your predictive models.

Understanding how to create a modeling data set is crucial because it bridges the
gap between raw data and the machine learning algorithms that will analyze it. The
decisions you make in this phase will directly impact the performance and accuracy
of your models.

So, with your data ready to go, let’s move on to Chapter 3, where you’ll learn how
to craft a data set that will maximize the potential of your models and set the stage
for successful predictive analysis.

Chapter 2 Summary: Gathering Data

1. Introduction to Data Quality in Data Science

● Overview: The chapter emphasizes the foundational concept of "Garbage


In, Garbage Out" (GIGO), which underscores the critical importance of data
quality in data science. It explains that no matter how powerful a machine
learning algorithm is, poor-quality data will result in poor-quality outputs.

● Context: The chapter introduces the importance of gathering high-quality


data for building reliable and robust models. It sets the stage for the entire
data science process, from data acquisition to model evaluation,
highlighting that data quality is the cornerstone of successful data science
projects.

2. Importance of High-Quality Data

● GIGO Principle: The chapter stresses that the volume of data does not
compensate for poor data quality. High-quality data is essential for building
models that are accurate, reliable, and generalizable.

● Project Overview: The Lending Club data set is introduced as the primary
data set for the book’s ongoing project – a risk model to predict loan
default. The chapter describes how the data set is structured, the types of
variables it includes, and the need to create a target variable for the model.

3. Importing Data into Analytical Environments

● Overview of Data Importing: The chapter provides detailed instructions on


how to import the Lending Club data set into SAS, Python, and R. It
emphasizes the importance of importing a consistent copy of the data set to
avoid issues related to data consistency and changes in the raw data source.

● Comparison of Import Methods:

o SAS: The chapter discusses using the PROC IMPORT procedure to


import data and limit the variables during the data import process.

o Python: The pandas library is highlighted for its read_csv function,


which allows users to specify and import only the necessary
columns.

o R: The fread function from the data.table package is used to import


data with a focus on selecting specific variables during the import
process.

4. Sampling Techniques

● Overview of Sampling: The chapter introduces the concept of sampling and


its importance in reducing the data set to a manageable size while
maintaining the integrity of the data distributions. The Lending Club data set
is sampled down from 2.2 million observations to 100,000 observations.

● Types of Sampling:

o Simple Random Sampling: Each member of the population has an


equal chance of being selected.

o Stratified Random Sampling: The population is divided into


subgroups, and random samples are taken from each subgroup to
ensure representation.

o Cluster Sampling: The population is divided into clusters, and a


random sample of clusters is selected, with all members of the
selected clusters included in the sample.

o Systematic Sampling: Every nth member of the population is


selected after a random starting point.

● Implementation: The chapter provides code examples in SAS, Python, and R


for performing simple random sampling with a seed value to ensure
reproducibility.

5. Comparison and Choosing the Right Data Import and Sampling Techniques

● Overview: The chapter concludes with a comparative analysis of the data


import and sampling techniques in SAS, Python, and R, summarizing their
strengths and identifying scenarios where each technique might be most
effective.

● Choosing a Method: The chapter advises readers on selecting appropriate


data import and sampling methods based on their specific project needs,
such as data set size, available resources, and desired analysis outcomes.

6. Looking Ahead

● Transition to Data Transformation: The chapter prepares the reader for the
next steps in the data science process, which involve transforming the
sampled data set into a modeling data set by performing data
transformations and imputations.

● Cross-Referencing: It reinforces the importance of understanding how to


perform similar tasks across different programming languages, enabling
data scientists to leverage the strengths of SAS, Python, and R.

Chapter 2 Quiz

Questions:

1. What is the significance of the "Garbage In, Garbage Out" (GIGO) principle
in data science?

2. Why is high-quality data essential for building reliable and robust models?

3. Describe the structure of the Lending Club data set used in this project.

4. Explain the process of creating a target variable for the Lending Club risk
model.

5. How does the PROC IMPORT procedure in SAS facilitate data import and
variable selection?

6. What is the role of the pandas library in Python for importing data?

7. How does R's fread function from the data.table package assist in data
import and selection?

8. Why is it important to download a consistent copy of the data set before


importing it into an analytical environment?

9. What is simple random sampling, and why is it used in this project?

10. How does stratified random sampling ensure representation of subgroups in


a data set?

11. Describe the process of cluster sampling and its applications.

12. What is systematic sampling, and when is it most useful?

13. How does the use of a seed value in sampling ensure reproducibility?

14. Compare the data import and sampling techniques in SAS, Python, and R.

15. What factors should be considered when choosing a data import method?

16. Why might a data scientist choose to sample a data set before analysis?

17. How does understanding different sampling techniques benefit a data


scientist?

18. What are the benefits of reducing the size of a data set through sampling?

19. How does cross-referencing programming languages enhance a data


scientist’s ability to work with different tools?

20. What are the next steps after importing and sampling data in a data science
project?

Chapter 2 Cheat Sheet

Data Import
● SAS: PROC IMPORT for importing CSV files; use DBMS=CSV to specify the file type; use GETNAMES=YES to automatically recognize headers.
● Python: pandas.read_csv() for reading CSV files; use the usecols parameter to select specific columns during import.
● R: fread() from the data.table package for fast CSV import; use the select parameter to specify columns during import.

Variable Selection
● SAS: Use the keep= statement in PROC IMPORT to retain only the necessary variables.
● Python: Use the usecols parameter in pandas.read_csv() to import selected columns.
● R: Use select within fread() for selecting specific variables.

Sampling Techniques
● SAS: Use PROC SURVEYSELECT for sampling; METHOD=SRS for simple random sampling; SEED= to ensure reproducibility.
● Python: Use the .sample() method for sampling rows; random_state= to set the seed for reproducibility.
● R: Use the sample() function for creating a random sample; set.seed() for reproducibility.

Handling Large Data Sets
● SAS: Efficiently handles large data sets during import and sampling.
● Python: pandas handles large data but may need optimization for very large data sets.
● R: fread() is optimized for speed, making it effective for large data sets.

Best Practices
● SAS: Ensure consistency by downloading a static version of the data set before import; always set a seed value for reproducibility.
● Python: Use pandas.read_csv() with selective column import for efficiency; set random_state for consistent sampling.
● R: Use fread() for efficient data import and column selection; set set.seed() for reproducibility in sampling.

Chapter 3: Creating a Modeling Data Set

Overview

In data science, we often encounter a stark contrast between the pristine academic
data sets used for education and the messy nature of real-world data used for
decision making. While academic examples like the well-known “Iris” data set for
clustering, the “MNIST” data set for classification, or the “Real Estate Price
Prediction” data set for regression modeling offer clean and structured data, real-
world data sets present unique challenges and complexities. These data sets are not
meticulously constructed for machine learning; rather, they serve as repositories of
events and observations from various domains.

Consider the diverse scenarios that real-world data encompasses, such as bank
transactions, car accidents, disease conditions, forest fires, lost luggage, or even the
historical pricing of Van Halen tickets. These data sets reflect the intricacies and
nuances of our world, carrying invaluable insights waiting to be unlocked through
the lens of data science.

In this chapter, we embark on a journey to harness the power of real-world data by


creating a modeling data set that suits our objectives. We navigate the intricacies of
data preparation, manipulation, and transformation, understanding these steps'
vital role in the data science workflow. We aim to bridge the gap between the raw
data and the modeling stage, paving the way for accurate predictions, informed
decisions, and meaningful insights.

Throughout this chapter, we will explore an array of fundamental techniques to


craft a modeling data set that unlocks the true potential of the data. We dive into
target variable analysis, predictive variable explanation, exploratory data analysis,
outlier detection, feature selection and engineering, dimensionality reduction, and
data balancing. These techniques serve as powerful tools in our data science
arsenal, allowing us to distill the essence of the data and shape it into a form that
facilitates effective modeling.

By combining the art and science of data science, we transform messy real-world
data into a refined and structured modeling data set. We equip ourselves with the

ability to extract hidden patterns, build accurate models, and make informed
decisions.

Art and Science

Data science is a field that combines both art and science to extract insights and
value from data. On the one hand, there is the science of data science, which
involves using statistical methods, machine learning algorithms, and computer
science techniques to analyze and model data. This requires a solid foundation in
mathematics, programming, and data manipulation skills. On the other hand, the art
of data science involves creativity, intuition, and domain expertise to identify
meaningful patterns and relationships in the data that may not be immediately
obvious.

Every data science project will have inflection points where a decision that will
impact the entire project must be made. These decision points could be as
foundational as defining your business problem or deciding what data set you will
use. Other decision points will concern the actual data of your modeling approach:

● Are you going to infer missing values?


● How do you plan on handling outliers?
● What date range are you going to use?
● How interpretable does the result need to be?
● What modeling algorithm will you use?
● Are you going to balance the data set?

These decision points are part of the “art” of data science. Although we have been
trained in the mathematical and programming skills essential to understand
machine learning algorithms and code them from scratch, if necessary, these skills
cannot tell you how to frame your business question or if you should cap your
outliers. The holistic view, strategy, framework, and approach are all part of the Art
of Data Science.

Throughout this book, I will point out the critical decision points data scientists must make given our sample project. These decision points and critical pieces of information are flagged with special callouts, such as the "Decision Time" box later in this chapter.

To be successful in data science, one must possess both the technical skills and the
ability to think critically and creatively. It is not enough to simply apply algorithms
and models to data; one must also have a deep understanding of the problem
domain and be able to ask the right questions to extract the insights that matter.
Furthermore, data science is a constantly evolving field, and practitioners must be
willing to adapt and learn new skills to keep up with the latest developments.

Ultimately, the art and science of data science come together to create a powerful
approach to solving complex problems and driving innovation. By combining
analytical rigor with creative thinking, data scientists can uncover new opportunities
and insights to help organizations make better decisions and improve their
operations.

Project Overview

In the previous chapter, we identified the Lending Club data set as the data set we
will use to answer our business questions. Now, we need a more formal definition
of our business problem to ensure we correctly frame the issue.

Lending Club is a peer-to-peer lending platform allowing individual lenders to


provide loans to borrowers in exchange for interest income. This interest income
can be significant. The Lending Club data set contains a maximum loan interest rate
of 30.99%. It would be difficult to achieve these kinds of returns from other
investments. However, there is always a certain level of risk with any investment.
For lending platforms, that risk is primarily default risk. This is the risk that after the
borrower receives the funds, they will not be able to pay back the loan, and the
lender will lose the issued funds.

Default risk is a significant issue for lending platforms. How can a potential lender
decide who they should lend money to? The highest interest rates also come with

the highest probability of default. That is essentially what loan interest is. It is the
trade-off between risk and income. This is why people with long and stable credit
histories pay low interest rates, because they have proved that they are low-risk
borrowers, and the lender has a high probability of getting their money back with
interest.

The goal of this project is to design a predictive model that calculates the probability
of a borrower defaulting on a loan. This information can be used in two ways. First,
we can use the probability of default metric at the application stage and deny
applicants with a high probability of default.

Secondly, this information may help identify existing customers who are about to
default on their debts. The lender could help these clients by identifying them and
providing assistance in the form of postponed payments, loan restructuring, or
lowered interest rates. These steps may reduce losses from loan defaults and
provide proactive customer support to borrowers who are having trouble with their
loan payments.

In the previous chapter, we accessed the Lending Club data set, downloaded it to
our local PC, and fed it into our statistical programs. We must create a target
variable and explore the relationship between the target and the predictors.

Remember to use the data set found in the GitHub repository for this book:

https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide

Target Variable

The target variable is a critical component of modeling because it defines the


objective of the analysis and determines the type of statistical model appropriate
for the data. The target variable is the variable that the model is attempting to
predict or estimate, and it is the most important variable in the model because it
guides the entire modeling process.

There are several reasons why the target variable is so important in modeling:

● Defines the problem: The target variable defines the problem the model
attempts to solve. By identifying the target variable, we can clearly
articulate the research questions and the objective of the analysis.

● Determines the model type: The model used for analysis depends on the
type of target variable. If the target variable is categorical, we would use
classification models. If the target variable is continuous, we would use
regression models.

● Guides feature selection: Feature selection is the process of selecting the


most relevant variables to include in the model. The target variable guides
the feature selection process by identifying the variables that are most
likely to be associated with the target variable.

● Guides data preparation and cleaning: The choice of the target variable
influences the quality and structure of the data required for the analysis.
Data preparation and cleaning techniques should be chosen with the target
variable in mind.

● Evaluates model performance: The target variable is used to evaluate the


model's performance. By comparing the predicted values of the target
variable to the actual values, we can determine the overall model accuracy
and identify observations where the model underperformed.

Classification vs. Regression Models

Classification and regression are two types of statistical models commonly used to
analyze data with distinct target variables. Table 3.1 provides some general
descriptions of regression and classification models.

Table 3.1: Regression and Classification Models

General Description
● Regression: Regression means to predict the output value using training data.
● Classification: Classification means to output the group into a class.

Example
● Regression: Regression to predict the value ($ amount) of an insurance claim using training data.
● Classification: Classification to predict the type of an insurance claim (fraud vs. non-fraud) using training data.

Target Variable
● Regression: If it is a real number/continuous, then it is a regression problem.
● Classification: If it is a discrete/categorical variable, then it is a classification problem.

In classification, the target variable is categorical, meaning it takes on a limited


number of discrete values. Examples of categorical variables include binary variables
(e.g., yes or no, 0 or 1) or variables with multiple categories (e.g., red, blue, green).
A classification model aims to predict the class or category a new observation
belongs to based on the values of one or more predictor variables.

In regression, the target variable is a continuous variable, which means it takes on a


range of numerical values. Examples of continuous variables include temperature,
income, and time. In a regression model, the goal is to predict the value of the
target variable for a new observation based on the values of one or more predictors.

There is no strict division of machine learning techniques exclusively specific to


either regression or classification models. However, certain algorithms are
commonly used for each task due to their characteristics and performance.

Regression models typically focus on predicting continuous numeric values. Some


commonly used machine learning techniques for regression include:

1. Linear Regression: A basic technique that uses a linear equation to model


the relationship between input variables and a continuous output.

2. Decision Trees: Tree-based models that recursively split the data based on
input features to make predictions.

3. Random Forest: An ensemble method that combines multiple decision trees


to improve prediction accuracy.

4. Support Vector Regression (SVR): An extension of Support Vector Machines


(SVM) for regression tasks.

5. Gradient Boosting: Algorithms like XGBoost and LightGBM that sequentially


combine weak models to create a more robust predictive model.

On the other hand, classification models focus on predicting discrete class labels or
categories. Some popular machine learning techniques for classification include:

1. Logistic Regression: A technique that models the relationship between


input variables and a binary or multi-class outcome using the logistic
function.

2. Decision Trees: Similar to regression, decision trees are also commonly used
for classification tasks. This technique uses recursive partitioning of the data
to categorize a prediction.

3. Random Forest: Ensemble method that combines multiple decision trees


for classification purposes.

4. Support Vector Machines (SVM): A powerful binary and multi-class


classification algorithm that finds optimal ways of separating hyperplanes in
high-dimensional space.

5. Naive Bayes: A probabilistic algorithm based on Bayes' theorem that


assumes feature independence.

6. Neural Networks: Deep learning models with multiple layers of


interconnected nodes, commonly used for classification tasks.

While these techniques are frequently associated with regression or classification, it


is important to note that many algorithms can be adapted or modified to manage
both types of tasks. The choice of technique ultimately depends on the nature of
the problem, the available data, and the desired performance.
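To make the distinction concrete, here is a small Python sketch using scikit-learn. The data is synthetic and the variable names are illustrative assumptions; it is not part of the Lending Club project code.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                      # three numeric predictors

# Regression: the target is continuous (e.g., a dollar amount).
y_amount = 100 + 25 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=500)
reg = LinearRegression().fit(X, y_amount)
print(reg.predict(X[:3]).round(2))                 # predicted dollar amounts

# Classification: the target is a discrete class (e.g., default vs. non-default).
y_default = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
clf = LogisticRegression().fit(X, y_default)
print(clf.predict(X[:3]))                          # predicted class labels (0 or 1)
print(clf.predict_proba(X[:3])[:, 1].round(3))     # predicted probability of class 1

The same predictors can support either framing; what changes is the nature of the target variable and, consequently, the family of algorithms and evaluation metrics that are appropriate.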

Multi-Target Models

A single machine learning model can have two separate target variables. This
scenario is known as multi-output or multi-target regression/classification. Instead
of predicting a single target variable, the model aims to predict multiple target
variables simultaneously.

The multi-output modeling approach depends on the problem and the relationships
between the target variables. Some techniques that can be used for multi-output
modeling include:

1. Multi-output Regression: This involves extending traditional regression


models, such as linear regression or decision trees, to manage multiple
target variables. The model learns to predict each target variable
independently or jointly, depending on the problem.

2. Multi-label Classification: Each instance can be associated with multiple


labels or classes. The model predicts the presence or absence of each label
for a given input.

3. Ensemble Methods: Ensemble methods like Random Forest or Gradient


Boosting can naturally manage multi-output problems by training multiple
trees or models, each targeting different output variables.

4. Neural Networks: Deep learning architectures can be designed with


multiple output nodes, each corresponding to a different target variable.
These architectures can be customized to manage multi-output tasks.

The choice of technique depends on the nature of the problem, the relationships
between the target variables, and the available data. When designing the model, it
is important to consider the dependencies or correlations between the target
variables, as they can impact the model's performance and interpretability.
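As a brief illustration of the multi-output idea, the following Python sketch uses scikit-learn's RandomForestRegressor, which accepts a target matrix with more than one column and fits both targets in a single call. The data and the target names are synthetic assumptions, not part of this book's project.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))                            # four predictors

# Two continuous targets predicted simultaneously by one model.
y1 = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=400)     # e.g., a loss amount
y2 = -1.5 * X[:, 1] + rng.normal(scale=0.3, size=400)    # e.g., months until an event
Y = np.column_stack([y1, y2])                            # shape (n_samples, 2)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, Y)                                          # multi-output fit in one call

print(model.predict(X[:3]).round(2))                     # one row per sample, one column per target

When the targets are strongly related, a joint model like this can share structure between them; when they are unrelated, fitting separate single-output models is often simpler to build and interpret.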

Creating a Target Variable

Defining the target variable is a crucial step in creating a modeling data set.
Sometimes, a data set already contains a variable that can serve as the target,
particularly when working with historical data. For example, in marketing
campaigns, we might know who responded positively or negatively to previous
campaigns, and in financial domains, there may be records of account defaults or
loan applications. These pre-existing target variables are typically binary (e.g., 1/0 or
Yes/No), providing a clear objective to predict or classify.
Using these pre-existing variables enables you to:

● Guide Data Exploration: Focus your analysis on uncovering relevant


patterns and relationships that influence the outcome.

● Enhance Model Accuracy: Develop predictive models that are more likely to
yield accurate results and actionable insights.

However, not all data sets come with a pre-existing target variable. In such cases,
you’ll need to create one based on the available data. This process involves selecting
and transforming relevant variables and applying your domain knowledge to define
the desired outcome.

Example:

Consider a scenario where you're working with a data set that tracks customer
behavior, but it lacks a predefined target variable for predicting customer churn. In
this case, you would examine variables such as customer activity, purchase history,
and engagement metrics. By setting specific criteria or thresholds based on data
patterns, you can define a target variable that indicates the likelihood of customer
churn.
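A minimal Python sketch of this idea is shown below. The column names (days_since_last_purchase, purchases_last_90d) and the thresholds are hypothetical, chosen purely to illustrate how simple business rules can be combined into a binary target.

import pandas as pd

# Hypothetical customer-activity table; the columns and thresholds are
# illustrative assumptions, not part of the Lending Club project.
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'days_since_last_purchase': [12, 180, 45, 400],
    'purchases_last_90d': [5, 0, 2, 0],
})

# Define churn with simple rules: long inactivity and no recent purchases.
customers['churn'] = (
    (customers['days_since_last_purchase'] > 120) &
    (customers['purchases_last_90d'] == 0)
).astype(int)

print(customers[['customer_id', 'churn']])   # customers 2 and 4 are flagged as churn

Once such a flag exists, it can be explored and modeled exactly like a pre-existing target variable.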

Decision Time: Identifying or Creating the Target Variable


A crucial decision a data scientist needs to make is identifying or
creating the target variable. The Lending Club data set does not come
with a predefined target variable. However, it does contain a variable
called “loan_status.” The data dictionary describes this variable as the
“Current state of the loan.”

However, this definition alone does not fully explain the nature of the
variable or how we can use it to construct a meaningful target
variable for modeling. Understanding and defining the target variable
is essential to ensuring the success of your predictive models.

Creating Target Variables from Scratch

When constructing your own target variable, you need a deep understanding of
both the data and the problem domain. This involves:

● Feature Engineering: Carefully select and transform variables to ensure the target variable captures the essence of the outcome that you aim to predict.

● Exploring Relationships: Analyze the target variable’s relationship with other variables to uncover patterns and identify key factors.

By taking charge of crafting your target variable, you empower yourself to tackle
prediction and classification tasks where no predefined target variable exists,
unlocking the data's predictive potential and building models that provide valuable
insights.

Example: Lending Club Data Set

A key decision in data science is identifying or creating the target variable. The
Lending Club data set, for instance, doesn't include a predefined target variable.
However, it does have a variable called “loan_status,” which describes the current
state of the loan.
The loan status field includes categories such as paid, current, late, charged off, or in
default. By running a frequency distribution on this field, you can see the
distribution of accounts across these categories. This step is essential in
understanding the data and deciding how to use it to construct a meaningful target
variable.

Program 3.1 demonstrates how to generate a frequency distribution for each programming language. While there are alternative methods for achieving similar results, this program provides a straightforward approach.

Program 3.1: Frequency Distribution Code

R Programming:

library("data.table")
freq <- table(loan_samp$loan_status)
print(freq)

Python Programming:

loan_samp['loan_status'].value_counts()

SAS Programming:

PROC FREQ DATA=LOAN_SAMP; TABLES loan_status; RUN;

Table 3.2 shows the SAS output of the frequency distribution. We can see that the
“loan_status” variable consists of eight categories and has only one missing value.
If we are trying to create a target variable that indicates if someone is a credit risk,
then we would want to identify the “Charged Off” and the “Does not meet the
credit policy. Status: Charged Off” categories as positive indicators for credit risk
and all other categories as a negative indicator for credit risk. An account with a
loan status of “Late” does not meet our definition of risk; therefore, those
categories will not be included in our risk target variable.

Table 3.2a: Frequency Distribution of Loan Status

loan_status                                            Frequency   Percent   Cumulative Frequency   Cumulative Percent
Charged Off                                              11880      11.88          11880                 11.88
Current                                                  38826      38.83          50706                 50.71
Does not meet the credit policy. Status:Charged Off         31       0.03          50737                 50.74
Does not meet the credit policy. Status:Fully Paid          92       0.09          50829                 50.83
Fully Paid                                               47609      47.61          98438                 98.44
In Grace Period                                            370       0.37          98808                 98.81
Late (16-30 days)                                          223       0.22          99031                 99.03
Late (31-120 days)                                         968       0.97          99999                100.00
Frequency Missing = 1

Program 3.2 shows the development of the target variable in each of our programming languages. We are simply creating a binary target variable defined as: if the loan status is either “Charged Off” or “Does not meet the credit policy. Status:Charged Off,” then the binary indicator will be 1. For all other categories, the binary indicator will be 0. The name of the target variable will be “bad.”

Program 3.2: Target Variable Development

R Programming:

loan_samp$bad <- ifelse(loan_samp$loan_status=='Charged Off' |
                        loan_samp$loan_status=='Does not meet the credit policy. Status:Charged Off', 1, 0)

Python Programming:

import numpy as np
loan_samp['bad'] = np.where((loan_samp['loan_status']=='Charged Off') |
                            (loan_samp['loan_status']=='Does not meet the credit policy. Status:Charged Off'), '1', '0')

SAS Programming:

DATA LOAN_SAMP; SET LOAN_SAMP;
    IF loan_status in ("Charged Off", "Does not meet the credit policy. Status:Charged Off") THEN bad = 1;
    ELSE bad = 0;
RUN;

A final frequency distribution of the newly created target variable shows that the
event rate is 11.91%. However, this metric represents the overall event rate across
the entire data set. Sometimes, the event rate will vary widely across different time
periods, and this can tell us a lot about the data.

Table 3.2b: Target Variable Frequency Distribution

bad   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0       88089      88.09          88089                  88.09
1       11911      11.91         100000                 100.00

Target Variable Analysis

Warning: Temporal Stability of Target Variable


If you have a date value in your data set, it is important to examine the
stability of the target variable over time. Binary target variables are
generally assessed by creating an event rate calculated as (# of positive events / total # of observations). A cross-tabulation frequency distribution of
the variables “issue_d” and “bad” should give us the number of
positive and negative events for each month. Program 3.3 shows the
crosstab code for each programming language.

Program 3.3: Cross-Tabulation Frequency Distribution Code

R Programming:

table(loan_samp$issue_d, loan_samp$bad)

Python Programming:

pd.crosstab(loan_samp['issue_d'], loan_samp['bad'], dropna=False)

SAS Programming:

PROC FREQ DATA=LOAN_SAMP; TABLES issue_d*bad; RUN;

Table 3.3 below shows the difference between the output generated by the three
programming languages and environments. The Python and R output looks
identical, with the “issue_d” variable being interpreted as a character variable. The
SAS environment read the “issue_d” variable as a date variable and placed the
output in the correct logical order for a date field.

Table 3.3: Cross-Tabulation Frequency Distribution Output

For Python and R, we would need to convert the character-formatted “issue_d” variable to a date format. Program 3.4 shows the simple conversion code for each programming language.

Program 3.4: Convert Character Variable to Date

R Programming:

loan_samp$issue_d <- as.Date(loan_samp$issue_d, format = "%m/%d/%Y")

Python Programming:

loan_samp['issue_d'] = pd.to_datetime(loan_samp['issue_d'])

SAS Programming:

issue_d_num = INPUT(issue_d, MMDDYY10.);
FORMAT issue_d_num MMDDYY10.;

Please note that both the R and Python programming languages required us to convert the “issue_d” and the “earliest_cr_line” variables into a date format. However, SAS identified the proper format for these fields as date fields. The code contained in Program 3.4 shows how to convert the R and Python “issue_d” fields into date formats. The SAS example shows how to convert a character field into a new numeric date field; this code would be included in a DATA step if necessary.

Once we have correctly formatted the “issue_d” field, we can create a chart that
plots the account volume as bars and the event rate as a line, using the newly
formatted “issue_d” field as the X axis.
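The plotting code for this chart is not included in the programs above. One possible Python sketch, assuming pandas, matplotlib, and the sampled data frame loan_samp with the datetime-formatted “issue_d” and the “bad” flag, is shown below; the exact styling of Figure 3.1 may differ.

# Sketch of the account-volume / event-rate chart (illustrative, not the exact code behind Figure 3.1).
import matplotlib.pyplot as plt
import pandas as pd

# Aggregate to one row per issue month: account volume and event (default) rate
monthly = (loan_samp
           .assign(bad_num=loan_samp['bad'].astype(int))
           .groupby(loan_samp['issue_d'].dt.to_period('M'))['bad_num']
           .agg(volume='size', event_rate='mean'))

x = monthly.index.to_timestamp()           # convert monthly periods back to dates

fig, ax1 = plt.subplots(figsize=(10, 5))
ax1.bar(x, monthly['volume'], width=20, color='lightgray')   # account volume as bars
ax1.set_ylabel('Account Volume')

ax2 = ax1.twinx()                          # second y-axis for the event rate line
ax2.plot(x, monthly['event_rate'], color='blue')
ax2.set_ylabel('Event Rate')

plt.title('Account Volume and Default Rate by Month')
plt.show()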

Figure 3.1 shows the account volumes and default rates by month. This chart shows us two critical pieces of information. First, the account volume starts very low, resulting in wild swings in the event rate. Second, after a long stable period, the event rate begins to trail off toward the end of the data. This tells us that borrowers generally need about a year and a half of credit availability before they default.

Figure 3.1: Account Volume and Default Rates by Month

Decision Time: Balancing Data Science with Intuition

The beginning of this chapter spoke of the art and science of data science. The “art” part of data science is rooted in creativity, domain expertise, and intuition. When deciding on issues such as data segmentation or selecting a specific date range, there are limited data analysis techniques that can help you evaluate and visualize the issue. Instead, decisions such as selecting an acceptable date range are often guided by business knowledge and intuition, and they form part of the story the data scientist is constructing.

Figure 3.1 shows us that the Lending Club data set has a significant ramp-up period that starts in July 2007 and goes up to about January 2014. Notice how much the event rate fluctuates during this time period. These fluctuations are due to low monthly account volume. Around January 2014, the monthly account volume was large enough that the event rate stabilized. This leads us to believe we should not use data before January 2014.

Around January 2017, the event rate began to decrease steadily. This behavior seems strange because the account volume was still high, and there had been stability for years prior to this point. Since no external factors that would have affected this population had been introduced, such as federal interest rate cuts or consumer bailout programs, we can conclude that the steady reduction in default rates is a result of what we can call “runway time.”

In this context, “runway time” means the general amount of time that must pass for
an event to occur. In our example, people do not generally default on their loans
within the first couple of months of acquiring them. The general “runway time” for
loan default can be anywhere from six months to five years, depending on the type of loan.

By examining the event rate over time, we can determine that the runway time for this event is about 18 months. The steady decline in the event rate began around January 2017, which leads us to believe we should not use data after January 2017.

Program 3.5 shows the code to limit the data set to the January 2014 through December 2017 time period.

Program 3.5: Limit Date Range of Modeling Data Set

R Programming:

loan_data <- subset(loan_samp, issue_d >= as.Date("2014-01-01") &
                               issue_d <= as.Date("2017-12-31"))

Python Programming:

loan_data = loan_samp[(loan_samp['issue_d'] >= '2014-01-01') &
                      (loan_samp['issue_d'] <= '2017-12-31')]

SAS Programming:

DATA loan_data; SET loan_samp;
    WHERE '01JAN2014'd le issue_d le '31DEC2017'd;
RUN;

Table 3.4 shows that filtering the data set to the specified date range has decreased
the overall size of the data set by about a third while significantly increasing the
overall default rate.

Table 3.4: Target Variable Frequency Distribution and Account Volume

bad   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0       58065      85.44          58065                  85.44
1        9897      14.56          67962                 100.00

We needed to make these up-front decisions about date-range filtering first because, within the selected window, the predictor variables should be much more stable over time due to the increased account volume. Also, if we need to impute any missing values, we will have a stable data set on which to base those imputations.

Now that we have limited the data set to a specified date range, it consists of 67,962
observations and 24 variables.

Predictive Variables

Predictive variables are points of information that you are using to predict the state
of the target variable. In the previous chapter, we limited the number of predictive
variables to a group that made sense from a business perspective. Table 3.5 shows
the SAS output that displays the metadata content for the data set. We can easily
see that there are 24 variables in the data set. These variables are a combination of
character and numeric variables. They contain a unique identifier (“id”), two date
variables (“issue_d” and “earliest_cr_line”), and a binary target variable that we
developed labeled “bad.”

Table 3.5: SAS PROC Contents Output – Metadata Content for the Data Set
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
1 Id Num 8 BEST9. BEST9.
2 loan_amnt Num 8 BEST5. BEST5.
3 Term Char 9 $CHAR9. $CHAR9.
4 int_rate Num 8 BEST5. BEST5.
5 Grade Char 1 $CHAR1. $CHAR1.
6 sub_grade Char 2 $CHAR2. $CHAR2.
7 emp_length Char 9 $CHAR9. $CHAR9.
8 home_ownership Char 8 $CHAR8. $CHAR8.
9 annual_inc Num 8 BEST9. BEST9.
10 verification_status Char 15 $CHAR15. $CHAR15.
11 issue_d Num 8 MMDDYY10. MMDDYY10.
12 loan_status Char 51 $CHAR51. $CHAR51.
13 purpose Char 18 $CHAR18. $CHAR18.
14 dti Num 8 BEST6. BEST6.
15 earliest_cr_line Num 8 MMDDYY10. MMDDYY10.
16 open_acc Num 8 BEST2. BEST2.
17 pub_rec Num 8 BEST2. BEST2.
18 revol_bal Num 8 BEST7. BEST7.
19 revol_util Num 8 BEST5. BEST5.
20 total_acc Num 8 BEST3. BEST3.
21 application_type Char 10 $CHAR10. $CHAR10.
22 mort_acc Num 8 BEST2. BEST2.
23 pub_rec_bankruptcies Num 8 BEST1. BEST1.
24 bad Num 8

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a critical phase in the data science journey,
allowing us to unravel the secrets hidden within the data. It involves
comprehensively examining the data set and uncovering patterns, trends, and
relationships that provide valuable insights. EDA provides us with a deep
understanding of the data's characteristics, guiding subsequent steps in data
preparation, feature engineering, and modeling.

The importance of EDA cannot be overstated. It allows us to gain familiarity with the
data, identifying potential issues such as missing values, outliers, or inconsistencies.
Visualizing the data through plots, charts, and statistical summaries lets us grasp its
distribution, central tendencies, and dispersion. Moreover, EDA helps us explore
relationships between variables, highlighting correlations or dependencies that
inform feature selection and modeling decisions.

EDA serves as the foundation upon which data-driven decisions are made. It enables
us to ask the right questions, validate assumptions, and generate hypotheses.
Through visualizations and statistical techniques, we gain insights that go beyond
mere numbers, fostering a deeper understanding of the underlying mechanisms
driving the data. Armed with these insights, we can make informed decisions,
identify pitfalls, and unlock the data's true potential.

The first step of exploratory data analysis is to create a statistical summary of the
numeric variables of the data set. Program 3.6 shows how to create a summary
overview of the numeric values in your data set.

Program 3.6: Summary Statistics for Numeric Data

R Programming:

summary(loan_data)

Python Programming:

pd.set_option('display.max_rows', len(loan_data.index))
pd.set_option('display.max_columns', len(loan_data.columns))
loan_data.describe()

SAS Programming:

PROC MEANS DATA=loan_data N NMISS MIN P1 P25 P75 P99 MAX MEAN MEDIAN Skewness; RUN;

Notes on Program 3.6:

● The length of the programming statement varies by programming language. The R statement is the simplest, with a single line of code; the Python and SAS statements could also be contained in a single line of code, but I wanted to adjust the standard output for those languages.

● The standard Python output would display the first and last few variables but not all the summary statistics unless you force it to show everything. The two “option” statements that precede the describe statement tell Python that we want to see everything.

● The SAS statement appears verbose, but I wanted additional statistics in the output. I could have run the PROC MEANS statement without specifying exactly which statistics I wanted to display, and it would have output the standard default statistics of N, MEAN, MIN, STD, and MAX. However, the next section will discuss outliers, and I wanted to show the information necessary for identifying outliers.

Table 3.6: SAS Output – Summary Statistics

Variable               N       N Miss  MIN    1st Pctl  25th Pctl  75th Pctl  99th Pctl  MAX       Mean       Median  Skew
loan_amnt              67962   0       1000   1600      8000       20000      36400      40000     14,968.40  12975   0.74
int_rate               67962   0       5.32   5.32      9.49       15.61      27.31      30.99     13.09      12.69   0.83
annual_inc             67962   0       0      18000     47000      94000      261813     8900000   78,349.63  65000   50.66
dti                    67937   25      0      2.03      12.27      24.77      39.58      999       18.92      18.23   23.45
open_acc               67962   0       1      3         8          15         30         60        11.77      11      1.31
pub_rec                67962   0       0      0         0          0          3          46        0.23       0       10.35
revol_bal              67962   0       0      226       6131       20493      103632     805550    16,949.58  11458   8.53
revol_util             67911   51      0      1.6       33.3       70.1       98.6       154.3     51.53      51.4    (0.02)
total_acc              67962   0       2      5         16         31         61         126       24.61      23      1.01
mort_acc               67962   0       0      0         0          3          8          26        1.59       1       1.61
pub_rec_bankruptcies   67962   0       0      0         0          0          1          6         0.14       0       3.19
bad                    67962   0       0      0         0          0          1          1         0.15       0       2.01

This simple table of information tells us a lot about the data set. Starting from the
top, the “loan_amnt” variable shows us that the minimum loan available through
Lending Club is $1000, and the maximum amount is $40K. The mean value is slightly
higher than the median value. That tells us that although there are some outliers on
the high end, the average loan amount is about $15K across all observations. The
skewness is slightly positive, confirming our suspicion of high-end outliers.

If we want to dig in even further, we can develop a histogram of the “loan_amnt” variable and see that the distribution is slightly skewed to the right.

Program 3.7 shows the programming code to develop histograms for each language.

Program 3.7: Histogram Development with 50 Bins

R Programming:

hist(loan_data$loan_amnt, breaks=50, col='blue')

Python Programming:

import matplotlib.pyplot as plt
plt.hist(loan_data['loan_amnt'], bins=50, color='blue')

SAS Programming:

PROC UNIVARIATE DATA=loan_data; VAR loan_amnt; HISTOGRAM; RUN;

Figure 3.2: “LOAN_AMNT” Histogram – R Output

You can investigate each of the variables to have a much better understanding of
the data. This is often much more beneficial than complicated statistical techniques.
Understanding the range, mean, and skewness of the data is essential. For example,
let’s look at the “annual_inc” variable. When using financial data, there are often
outliers on the high and low ends. The table of information above shows us that the
“annual_inc” variable has a minimum value of 0 and a maximum value of $8.9MM.
Both of these values appear suspect.

Generally, loans are not distributed to individuals with no income, and if you have
an annual income of $8.9MM, then you are not usually using a crowd-sourced loan
platform such as Lending Club. I have a feeling that the high-end outlier values were
erroneously entered. Their annual income is probably $89K, but they included .00
to represent cents, and the intake field does not recognize the period. Although we
cannot verify it, that is probably true for values over $250K.

Figure 3.3 shows a histogram for the “annual_inc” variable. This graphic confirms
that only a few high outliers are skewing the data set.

Figure 3.3: “ANNUAL_INC” Histogram – SAS Output

Outliers are a critical issue in data science and general data analysis. It is important
to know how to identify outliers and what to do with them (if anything) once you
do.

Decision Time: Defining and Handling Outliers

The standard definition of an outlier is simply an observation that lies outside the expected range of values. This is not a hard-and-fast rule where every data scientist will classify outliers similarly. It can depend on which methodology you use to identify outliers and can be influenced by business knowledge. Decisions on how to detect outliers and what to do with them once you’ve detected them can significantly impact the overall performance of your model.

How to Detect Outliers

Outliers in a data set can significantly impact our analysis and modeling, warranting
careful attention and treatment. Untreated outliers can introduce various concerns
and distort our understanding of the underlying patterns and relationships within
the data. Here are the key concerns associated with having untreated outliers:

1. Skewed statistical measures: Outliers can heavily influence statistical measures such as the mean and standard deviation, leading to skewed and unreliable summaries of the data's central tendency and dispersion.

2. Biased model performance: Outliers can disproportionately influence model fitting, leading to biased parameter estimates and predictions. This can result in poor model performance and inaccurate insights.

3. Distorted relationships: Outliers can distort the relationships between variables, misleading us about the true associations and correlations. Failing to address outliers may lead to erroneous conclusions and misguided decision-making.

4. Increased model complexity: Outliers can introduce noise and unnecessary complexity to the model, hindering its ability to capture the underlying patterns. This may result in overfitting, where the model performs well on the training data but fails to generalize to new observations.

5. Impaired interpretability: Outliers can hinder the model's interpretability, making it challenging to extract meaningful insights and understand the true impact of the predictors. This can undermine the model's usefulness for decision-making and analysis.

Addressing outliers through appropriate techniques, such as trimming, winsorizing, or robust modeling, is essential to mitigate these concerns and ensure accurate analysis. Treating outliers allows us to uncover the genuine patterns, relationships, and insights hidden within the data, enabling more reliable modeling and informed decision-making.

Winsorizing
Winsorizing is a technique for managing outliers that replaces extreme values with
less extreme values. Instead of removing or transforming outliers completely,
winsorizing truncates the extreme values at a specified percentile and replaces
them with the nearest values within that percentile. This approach helps reduce the
impact of outliers without completely eliminating their influence on the data.

In winsorizing, the lower and upper tails of the data distribution are trimmed to a
certain percentile, often chosen as a small percentage like 1% or 5%. Any values
below the lower percentile are replaced with the value at that percentile, and any
values above the upper percentile are replaced with those at that percentile. This
ensures that extreme outliers are "trimmed" or "capped" to a more moderate
range.

By winsorizing the data, we retain the overall shape and characteristics of the
distribution while mitigating the influence of extreme values. This approach is
particularly useful when the outliers are considered valid data points but are
believed to be measurement errors or extreme observations due to rare
circumstances.

Winsorizing strikes a balance between completely removing outliers and retaining their information in the analysis. It allows us to manage outliers more robustly, improving the stability and reliability of statistical measures and model estimates.
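As a rough illustration (not the exact approach used later in this chapter), a percentile-based winsorization in Python might cap a column at its 1st and 99th percentiles:

# Percentile-based winsorizing sketch: cap a column at its 1st and 99th percentiles.
import pandas as pd

def winsorize_column(s: pd.Series, lower_pct=0.01, upper_pct=0.99) -> pd.Series:
    # Find the percentile cutoffs, then clip anything outside them
    lower, upper = s.quantile(lower_pct), s.quantile(upper_pct)
    return s.clip(lower=lower, upper=upper)

# Example: winsorize the annual income column of the loan data
loan_data['annual_inc_w'] = winsorize_column(loan_data['annual_inc'])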

Inter-Quartile Range (IQR)


Another way to identify outliers is to use the 1.5 interquartile range (IQR) rule. This rule is very straightforward and easy to understand. For any continuous variable, you simply multiply the interquartile range by 1.5. You then add that number to the third quartile; any values above that threshold are suspected of being outliers. You can also perform the same calculation on the low end: subtract the value of IQR x 1.5 from the first quartile to find low-end outliers.

For example, to find the outlier threshold for the annual income variable
(“annual_inc”) in our data set, we can refer to Table 3.6 above and see that the 25th
percentile value for annual income is $47,000 while the 75th percentile value is
$94,000. The interquartile range would be ($94,000 - $47,000 = $47,000). We would
then multiply $47,000 by 1.5 to get $70,500. Finally, we would add this new value to
the third quartile to get the upper threshold to identify outliers ($94,000 + $70,500
= $164,500). We can interpret this value as any observation with an annual income
above $164,500 is considered an outlier.
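The same arithmetic can be reproduced directly from the data; a short Python check on the “annual_inc” column of loan_data might look like this:

# Reproduce the 1.5 * IQR thresholds for annual income directly from the data.
q1 = loan_data['annual_inc'].quantile(0.25)   # roughly $47,000 in this sample
q3 = loan_data['annual_inc'].quantile(0.75)   # roughly $94,000 in this sample
iqr = q3 - q1

upper_threshold = q3 + 1.5 * iqr              # roughly $164,500
lower_threshold = max(q1 - 1.5 * iqr, 0)      # floored at zero for income
print(upper_threshold, lower_threshold)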

Table 3.7 below shows each piece of the IQR methodology and the calculations for the annual income and revolving balance variables. It is important to note that although the low-end adjusted values are below zero, we often set a floor for those variables at zero. It does not make sense for the data to contain values below zero for these variables.

Table 3.7: IQR Outlier Adjustment Examples

Variable            25th Percentile   75th Percentile   IQR                             Low-End Adjustment                     High-End Adjustment
Annual Income       $47,000           $94,000           $47,000 = ($94,000 - $47,000)   -$23,500 = $47,000 - (1.5 * $47,000)   $164,500 = $94,000 + (1.5 * $47,000)
Revolving Balance   $6,131            $20,493           $14,362 = ($20,493 - $6,131)    -$15,412 = $6,131 - (1.5 * $14,362)    $42,036 = $20,493 + (1.5 * $14,362)

The table above shows that for the “Annual Income” variable, the high-end adjusted
value would be capped at $164,500. Any values over this amount would be replaced
with $164,500. However, the low-end adjusted value is negative. Therefore, we
would simply set the floor to zero. Any values less than zero are set to zero. This
makes sense in this business problem because it is not appropriate to have negative
values for annual income (especially if you are applying for a loan).

Dealing with Outliers

When we detect outliers, we have a few options. First, we can do nothing. Depending on the final goal, we can leave the data set alone and use the data with outliers as is. For example, if you already know that you will either transform the data with a Weight of Evidence (WOE) transformation or use a tree-based model for prediction, you might not need to adjust outliers. Both of these processes can handle outliers without affecting the output.

However, if you are going to create a dashboard report that shows each variable’s average value, or if you are going to create a linear regression model for prediction, then these processes will be significantly impacted by outliers. This would require us to transform these outliers.

There are two main approaches to transforming outliers:

● capping and flooring, and
● inference.

Capping and Flooring Values – Winsorizing Example

We can identify high and low-end outliers by running our descriptive statistics program and finding low-end and high-end thresholds, such as the 1% and 99% or 5% and 95% range. Then, we can replace any values under the low-end or over the high-end thresholds with the appropriate values. In this example, the 1% threshold for the “annual_inc” variable is $18,000 while the 99% threshold is roughly $260,000. For any values below $18,000, we will assign a value of $18,000 to those observations. For any values above $260,000, we will assign a value of $260,000 to those observations.

However, to be consistent with the IQR methodology described previously, let’s develop the code for this process in all three programming languages.

This simple logic to adjust for outliers using the IQR technique can result in some verbose code because there are several steps in the process:

1.) Select numeric data only.
2.) Calculate the 1st and 3rd quartiles.
3.) Calculate the interquartile range.
4.) Calculate the upper and lower bounds for the outliers.
5.) Replace the outlier values with the 1.5 IQR values.

This is an excellent example of comparing the structure of the programming languages. The programs below show the functions to adjust for outliers for each language.

Program 3.8: Outlier Adjustment

R Programming:

# Select numeric data only
filter_numeric_columns <- function(df) {
  numeric_cols <- sapply(df, is.numeric)
  df[, numeric_cols, with = FALSE]
}

numeric_data <- filter_numeric_columns(loan_data)

# Infer outliers with the 1.5 IQR rule
replace_outliers <- function(x) {
  # Calculate the first and third quartiles
  q1 <- quantile(x, probs = 0.25, na.rm = TRUE)
  q3 <- quantile(x, probs = 0.75, na.rm = TRUE)

  # Calculate the interquartile range (IQR)
  iqr <- q3 - q1

  # Calculate the lower and upper bounds for the outliers
  upper <- q3 + 1.5 * iqr
  lower <- q1 - 1.5 * iqr

  # Replace outlier values with the 1.5 IQR rule values
  x[!is.na(x) & x > upper] <- upper
  x[!is.na(x) & x < lower] <- lower
  x
}

library(dplyr)

# Replace outliers in every numeric column
outliers <- mutate(numeric_data, across(everything(), replace_outliers))
summary(outliers)

Python Programming:

import numpy as np

# Select numeric data only
def filter_numeric_columns(df):
    numeric_cols = df.select_dtypes(include='number').columns
    return df[numeric_cols]

numeric_data = filter_numeric_columns(loan_data)

# Infer outliers with the 1.5 IQR rule
def replace_outliers(arr):
    # Calculate the first and third quartiles
    q1 = np.nanpercentile(arr, 25)
    q3 = np.nanpercentile(arr, 75)

    # Calculate the interquartile range (IQR)
    iqr = q3 - q1

    # Calculate the lower and upper bounds for the outliers
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    # Replace outlier values with the 1.5 IQR rule values
    arr[arr < lower_bound] = lower_bound
    arr[arr > upper_bound] = upper_bound
    return arr

# Apply the adjustment column by column
outliers = numeric_data.apply(replace_outliers)
outliers.describe()

SAS Programming:

/* Select numeric data only */
DATA numeric_data;
    SET loan_data (KEEP=_NUMERIC_);
RUN;

/* Calculate the 1st and 3rd quartiles */
PROC MEANS DATA=numeric_data NOPRINT;
    VAR loan_amnt -- pub_rec_bankruptcies;
    OUTPUT OUT=quartiles (DROP=_TYPE_ _FREQ_) Q1= Q3= / AUTONAME;
RUN;

DATA outliers;
    SET numeric_data;
    IF _N_ = 1 THEN SET quartiles;

    ARRAY vars(*) loan_amnt -- pub_rec_bankruptcies;
    ARRAY q1s(*)  loan_amnt_Q1 -- pub_rec_bankruptcies_Q1;
    ARRAY q3s(*)  loan_amnt_Q3 -- pub_rec_bankruptcies_Q3;

    DO i = 1 TO dim(vars);
        /* Calculate the interquartile range (IQR) */
        iqr = q3s(i) - q1s(i);

        /* Calculate the lower and upper bounds for the outliers */
        low  = q1s(i) - 1.5 * iqr;
        high = q3s(i) + 1.5 * iqr;

        /* Replace outlier values with the 1.5 IQR rule values */
        IF vars(i) < low THEN vars(i) = low;
        ELSE IF vars(i) > high THEN vars(i) = high;
    END;

    DROP i iqr low high loan_amnt_Q1 -- pub_rec_bankruptcies_Q3;
RUN;

Each program accesses the current data set LOAN_DATA and applies the five steps to adjust for outliers using the 1.5 IQR rule. The R and Python programs use a function to make the adjustments in the examples provided. For SAS, the first and third quartiles are calculated up front with PROC MEANS, and the remaining steps are then applied in a single DATA step, because DATA step functions cannot compute sample quantiles across observations.

Data Inference
Data inference is a statistical technique that uses patterns observed in a sample to
make inferences about a larger population. One way to use data inference to deal
with outliers in a modeling data set is to use imputation methods. Imputation
involves replacing values with estimated values based on the patterns observed in
the data.

To use imputation for handling outliers, you can first identify the outliers in the data set using visualization or statistical methods. Then, you can treat those outlier values as missing and impute them using a method that is robust to extreme values, such as the median or mode.
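A hedged Python sketch of this idea: flag values outside the 1.5 IQR bounds as missing and then impute them with the column median (one reasonable implementation, not the only one):

# Treat IQR-flagged outliers as missing and impute them with the median (illustrative sketch).
import numpy as np

def impute_outliers_with_median(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    s = s.where((s >= lower) & (s <= upper), np.nan)  # flag outliers (and keep existing NaN) as missing
    return s.fillna(s.median())                       # impute the missing values with the median

loan_data['annual_inc_imp'] = impute_outliers_with_median(loan_data['annual_inc'])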

Another approach to using data inference to deal with outliers is predictive modeling. Instead of directly modeling the outcome variable using the original data set, you can build a model to predict the outcome variable using the non-outlier data points. Then, you can use this model to predict the outcome variable for the outlier data points. This can help reduce the impact of the outliers on the model and improve its accuracy.

We will review the programming code to develop these predictive models in the
predictive modeling chapter of this book.

Data Rejection
Finally, another option is to remove the observations where outliers occur. This is obviously not a data transformation like our other two options, and we are always hesitant to throw out data; it is generally better to use one of the adjustment methods described above. However, there are some circumstances where it doesn’t make sense to keep outliers because they represent bad-quality data that makes the whole observation useless. In those cases, feel free to drop the observations.
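If you do decide to reject rows, the filter itself is simple. For example, in Python, and assuming for illustration that zero reported annual income is treated as bad-quality data:

# Drop observations considered bad-quality data (here: zero reported annual income).
rows_before = len(loan_data)
loan_data = loan_data[loan_data['annual_inc'] > 0]
print(f"Dropped {rows_before - len(loan_data)} observations")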

Feature Selection and Engineering

Feature selection is a crucial step in data science that involves selecting relevant
variables to include in a model. However, the need for feature selection depends on
the specific machine learning problem and the algorithms being used.

In general, feature selection techniques can improve model performance and reduce overfitting by removing irrelevant or redundant features. However, some algorithms are less sensitive to irrelevant or redundant features and can perform well even with a large number of features.

For example, decision trees and random forests can handle many features and have
built-in feature selection methods that can automatically select the most
informative features. In contrast, linear models such as linear regression and logistic
regression can benefit from feature selection techniques to reduce the impact of
irrelevant or redundant features.

It is important to note that feature selection should always be done cautiously and
with a good understanding of the problem domain. Blindly removing features can
sometimes result in the loss of important information, and it is essential to evaluate
the impact of feature selection on model performance before making a final
decision.

For regression-type models, feature selection is important in machine learning for several reasons:

1. Reducing Overfitting: When a machine learning model is trained on many features, it may learn to fit the noise in the data rather than the underlying patterns. This can lead to overfitting and poor generalization to new data. By selecting only the most relevant features, we can reduce the complexity of the model and improve its ability to generalize.

2. Improving Model Performance: Removing irrelevant or redundant features can improve the performance of a machine learning model. This is because irrelevant features can add noise to the data, and redundant features may provide the same information as other features. By selecting only the most informative features, we can improve the accuracy and efficiency of the model.

3. Reducing Training Time and Cost: Training a machine learning model on a large number of features can be computationally expensive and time-consuming. By selecting only the most relevant features, we can reduce the time and cost required to train the model.

4. Improving Model Interpretability: Machine learning models can be difficult to interpret, especially when trained on many features. By selecting only the most relevant features, we can improve the model's interpretability and gain insight into the underlying patterns in the data.

Overall, feature selection is an important step in machine learning that can improve
the models' performance, efficiency, and interpretability.

There are several techniques for feature selection, including filtering, wrapper, and
embedded methods.

Filtering with Correlation Analysis


One common filtering method is using correlation matrices to identify highly
correlated predictors and remove one of them from the model. For example, the
following code calculates the correlation matrix for a data set and removes any
variables with a correlation above a specified threshold.

Program 3.9: Correlation Analysis

R Programming:

# Load the necessary libraries
library(corrplot)
library(caret)

# Calculate the correlation matrix
cor_matrix <- cor(numeric_data)

# Plot the correlation matrix to visualize the correlations
corrplot(cor_matrix, method = "circle")

# Find highly correlated variables with a threshold of 0.8
high_cor <- findCorrelation(cor_matrix, cutoff = 0.8)

# Drop highly correlated vars and create the output data set
CORR_limit <- numeric_data[, -high_cor]

Python Programming:

# Load the necessary packages
import pandas as pd

# Perform correlation analysis
corr_matrix = numeric_data.corr(method='pearson')

# Set the threshold for correlation
corr_threshold = 0.8

# Find the highly correlated variables and remove them
high_corr = set()
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > corr_threshold:
            high_corr.add(corr_matrix.columns[i])

CORR_limit = numeric_data.drop(columns=list(high_corr))

SAS Programming:

/* Perform correlation analysis */
PROC CORR DATA=numeric_data OUTP=corr_matrix NOPRINT;
RUN;

/* Set the threshold for correlation */
%LET corr_threshold = 0.8;

/* Find the highly correlated variables */
DATA high_corr;
    SET corr_matrix;
    WHERE _TYPE_ = 'CORR';
    ARRAY _vars(*) _NUMERIC_;

    /* Find the position of the current row's variable in the array */
    row_pos = 0;
    DO i = 1 TO dim(_vars);
        IF vname(_vars(i)) = _NAME_ THEN row_pos = i;
    END;

    /* Check only the lower triangle so one variable of each correlated pair is kept */
    DO j = 1 TO row_pos - 1;
        IF abs(_vars(j)) >= &corr_threshold THEN DO;
            var1 = _NAME_;          /* variable to remove */
            var2 = vname(_vars(j)); /* variable it is correlated with */
            OUTPUT;
        END;
    END;
RUN;

PROC SQL NOPRINT;
    SELECT DISTINCT var1 INTO :high_corr_cols separated by ' '
    FROM high_corr;
QUIT;

DATA CORR_limit; SET numeric_data (DROP=&high_corr_cols); RUN;

In the example provided in Program 3.9, the data was limited to numeric values only, and the correlation threshold was set to 0.8. Any correlated features with a correlation value greater than 0.8 are eliminated from the output data set, which is labeled “CORR_limit.”

Figure 3.4 below shows the R output for correlation visualization. This figure shows
us that there are a few variables that are significantly correlated with one another.
The “total_acc” and “open_acc” variables are correlated. This makes sense because
the “open_acc” variable is a natural subcategory of the “total_acc” variable.
However, if we look at the correlation table, we can see that the Pearson correlation statistic between these two variables is 0.7145. This falls below our 0.8 threshold, so neither variable is removed by the filter.

The reason that the R output contains “?” values for the variables “dti” and
“revol_util” is that they both have a small number of missing values. The “dti”
variable has 25 missing values, while the “revol_util” variable has 51 missing values.

If we want R to ignore the missing values and provide the correlation charts and tables without “?” or “NA” values, we simply need to update our correlation code with a USE parameter. For example: cor(numeric_data, use="complete.obs").

Figure 3.4: Correlation Visualization – R Output

Table 3.8: Correlation Matrix – R Output

Filtering with Wrapper Methods


Wrapper methods involve repeatedly training and evaluating a model using
different subsets of features to determine the optimal set of features that yields the
best performance. Wrapper methods typically use a specific learning algorithm as a
"wrapper" around the feature selection process.

Wrapper methods can employ different strategies, such as forward selection, backward elimination, or recursive feature elimination. These methods assess the learning algorithm's performance on various subsets of features and select the subset that maximizes a specific evaluation metric, such as accuracy or AIC (Akaike Information Criterion). By iteratively evaluating subsets of features, wrapper methods aim to find the subset that optimizes the model's predictive performance.

Wrapper methods can be applied in various machine learning algorithms and programming languages, and they are used to enhance the feature selection process by incorporating the model's performance directly into the selection criteria.

All three programming languages, SAS, Python, and R, share commonalities and
differences regarding wrapper methods in feature selection.

Commonalities:

● Wrapper methods concept: All three languages recognize wrapper methods, which involve selecting subsets of features and evaluating their performance using iterative model training and testing.

● Iterative model evaluation: SAS, Python, and R offer functionality for repeatedly training and testing models with different feature subsets to assess their impact on model performance.

● Performance evaluation criteria: Each language provides options to specify evaluation criteria, such as AIC, BIC, or cross-validation, for comparing and selecting the best feature subsets.

Differences:

● Syntax and implementation: The languages differ in syntax and implementation approaches. SAS utilizes procedures and functions specific to the language, like PROC GLMSELECT or PROC HPGENSELECT. Python uses libraries like scikit-learn, Boruta, and FeatureSelector, which provide a Pythonic way of implementing wrapper methods. R relies on packages like caret, mlr, and glmnet that offer specialized functions for feature selection and wrapper methods.

● Ecosystem and community support: Each language has its own ecosystem of packages, libraries, and community support. Python boasts a rich ecosystem with a wide range of machine learning and data science libraries, making it highly versatile and widely adopted. R strongly focuses on statistical analysis and provides a comprehensive collection of packages for various modeling tasks. As a proprietary software, SAS has its own ecosystem and tools tailored for data analysis and modeling.

While the core principles of wrapper methods remain the same across the three
languages, the differences lie in the specific syntax, implementation options,
available packages, and the overall ecosystem supporting feature selection and
wrapper methods.

Implementation of the Wrapper Method

One of the wrapper methods commonly implemented and available across all three
programming languages is Recursive Feature Elimination (RFE). RFE is a popular
wrapper method for feature selection in machine learning.

RFE works by recursively eliminating features and evaluating their impact on model
performance. It starts with the complete set of features and iteratively eliminates
less important features until a specified number or a desired subset of features
remains. During each iteration, the model is trained and evaluated, typically using
cross-validation or a performance metric, and the feature with the lowest
importance is removed. This process continues until the desired number of features
is reached.

RFE is widely used because it can be applied to various machine learning algorithms
and is agnostic to the specific modeling technique being used. It focuses on the
model's predictive performance and provides a ranking or selection of features
based on their importance.

The implementation of RFE may vary across programming languages and libraries, but the underlying concept remains the same. SAS, Python (through libraries like scikit-learn), and R (using packages like caret or mlr) provide functionalities to perform RFE and utilize it as a wrapper method for feature selection in machine learning tasks.

Therefore, RFE is an example of a wrapper method that can be commonly implemented across SAS, Python, and R, allowing for consistent feature selection approaches across different programming languages.

Program 3.10 below shows filtering with wrapper methods implemented in all three programming languages.

Program 3.10: Filtering with Wrapper Method – RFE

R Programming:

library(caret)

# Filter numeric columns including the target variable
numeric_columns <- names(loan_data)[sapply(loan_data, is.numeric)]
numeric_columns <- unique(c(numeric_columns, "bad"))  # Include the target variable

# Create a copy of the subset to avoid modifying the original data frame
loan_data_numeric <- as.data.frame(loan_data)[, numeric_columns]

# Remove rows with missing values
loan_data_numeric <- na.omit(loan_data_numeric)

# Separate the features (X) and target variable (y)
X <- loan_data_numeric[, !(names(loan_data_numeric) %in% "bad")]
y <- as.factor(loan_data_numeric$bad)

# Perform Recursive Feature Elimination (RFE) with a logistic regression wrapper
rfe_result <- rfe(
  X,
  y,
  sizes = 5,  # Evaluate a top-5 feature subset
  rfeControl = rfeControl(functions = lrFuncs, method = "cv", number = 5)
)

# Get the selected features
selected_features <- predictors(rfe_result)
print(selected_features)

Python Programming:

import pandas as pd
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Filter numeric columns including the target variable
numeric_columns = loan_data.select_dtypes(include=[np.number]).columns
numeric_columns = numeric_columns.append(pd.Index(['bad']))  # Include the target variable
loan_data_numeric = loan_data[numeric_columns].copy()

# Drop rows with missing values from the copied subset
loan_data_numeric.dropna(inplace=True)

# Separate the features (X) and target variable (y)
X = loan_data_numeric.drop('bad', axis=1)
y = loan_data_numeric['bad']

# Create the model to be used for feature selection
model = LogisticRegression()

# Perform Recursive Feature Elimination (RFE)
rfe = RFE(model, n_features_to_select=5)  # Select top 5 features
rfe.fit(X, y)

# Get the selected features
selected_features = X.columns[rfe.support_]
print(selected_features)

SAS Programming:

/* Create global variable for numeric variables */
PROC CONTENTS NOPRINT DATA=loan_data (KEEP=_NUMERIC_ DROP=id bad)
    OUT=var (KEEP=name); RUN;
PROC SQL NOPRINT; SELECT name INTO :num separated by " " FROM var; QUIT;

/* Example using PROC GLMSELECT */
ODS SELECT ALL;
ODS GRAPHICS ON / WIDTH=800px HEIGHT=600px; /* Adjust dimensions as needed */

PROC GLMSELECT DATA=loan_data;
    MODEL bad = &num. / SELECTION=backward;
RUN;

Table 3.9 below shows the SAS output for the GLMSELECT procedure. This output
shows that the method incorporated the backward selection methodology, which
removed five variables from the list of predictors. The algorithm selected eight
variables with a significant relationship with the target variable.

It’s important to note that these variables have not been adjusted from the
correlation or outlier analysis introduced previously. This raw data set results in a
poor-quality model; however, this technique was not employed to create a finalized
model. Instead, we are using this technique to identify variables significantly related
to the target variable and filter out those not significantly related to the target
variable.

The SAS output contains separate sections for the effects removed, an ANOVA table,
performance estimates, and parameter estimates. You can get these same pieces of
information from both Python and R; however, generating the additional output will
require more coding.

Table 3.9: Filtering with Wrapper Method – SAS Output – PROC GLMSELECT

Backward Selection Summary

Step   Effect Removed          Number Effects In   SBC
0                              14                  -145776.95
1      earliest_cr_line        13                  -145785.79
2      pub_rec_bankruptcies    12                  -145791.97
3      total_acc               11                  -145795.87
4      annual_inc              10                  -145797.9
5      revol_util               9                  -145798.12*

* Optimal Value of Criterion

Analysis of Variance

Source            DF      Sum of Squares   Mean Square   F Value
Model                 8      533.50893       66.68862     571.96
Error             67877     7914.24501       0.1166
Corrected Total   67885     8447.75394

Root MSE          0.34146
Dependent Mean    0.14566
R-Square          0.0632
Adj R-Sq          0.063
AIC               -77992
AICC              -77992
SBC               -145798

Parameter Estimates

Parameter   DF   Estimate        Standard Error   t Value
Intercept   1     1.807715        0.068935         26.22
dti         1     0.000833        0.00012           6.96
int_rate    1     0.015374        0.000284         54.22
issue_d     1    -0.000092186     0.000003345     -27.56
loan_amnt   1     0.000000604     0.000000161       3.76
mort_acc    1    -0.008414        0.000721        -11.67
open_acc    1     0.001791        0.000247          7.25
pub_rec     1     0.010701        0.002024          5.29
revol_bal   1    -0.000000384     6.39E-08         -6.01

Filtering with Embedded Methods


Embedded methods are a type of feature selection technique that incorporates
feature selection within the model training process itself. These methods aim to
automatically determine the importance or relevance of features during model
training, effectively embedding the feature selection process within the model
building process. Embedded methods typically use regularization techniques or
model-specific attributes to identify and prioritize key features.

All three programming languages, SAS, Python, and R, share commonalities and
differences when it comes to filtering with embedded methods.

Commonalities:

● Integration within the modeling process: Embedded methods are integrated directly into the model training process, automatically evaluating and selecting features during the training phase.

● Feature importance ranking: Embedded methods assign importance or relevance scores to features based on their impact on the model's performance or regularization penalties.

● Iterative feature selection: These methods iteratively assess and update feature importance during model training to optimize performance.

● Automated feature selection: Embedded methods automate the feature selection process, reducing the need for separate feature selection steps.

Differences:

● SAS: In SAS, embedded feature selection methods are often implemented using procedures like PROC GLMSELECT and PROC HPGENSELECT. These procedures allow for the incorporation of regularization techniques such as LASSO, ridge regression, or elastic net within the modeling process. By specifying appropriate options and penalties, SAS provides a flexible approach to embedded feature selection.

● Python: Python's scikit-learn library is a popular choice for implementing embedded feature selection methods. The library offers various algorithms and techniques, such as L1 regularization (Lasso), L2 regularization (Ridge), and elastic net, which can be seamlessly integrated into the modeling process. The library provides extensive support for embedded feature selection across different machine learning models, including linear regression, logistic regression, and support vector machines.

● R: R offers several packages for implementing embedded feature selection methods. For instance, the glmnet package provides functionalities for fitting linear models with L1 (Lasso) or L2 (Ridge) regularization, which can be used for embedded feature selection. The caret package also supports implementing embedded feature selection with various machine learning algorithms, using techniques such as recursive feature elimination.

Embedded feature selection methods integrate feature selection directly into the
model training process. While the core principles are similar across SAS, Python, and
R, the implementation specifics differ based on each language's available libraries,
procedures, and packages. Data scientists can leverage the respective functionalities
provided by each language to implement embedded feature selection techniques in
their machine learning models.

Program 3.11 below shows the filtering with embedded methods implemented in all
three programming languages.

Program 3.11: Filtering with Embedded Method – Random Forest Classifier

R Programming:

library(caret)

# Filter numeric columns including the target variable
numeric_columns <- names(loan_data)[sapply(loan_data, is.numeric)]
numeric_columns <- unique(c(numeric_columns, "bad"))  # Include the target variable

# Create a copy of the data set to avoid modifying the original data frame
filtered_data <- as.data.frame(loan_data)[, numeric_columns]

# Remove rows with missing values
filtered_data <- na.omit(filtered_data)

# Separate the features (X) and target variable (y)
X <- filtered_data[, !(names(filtered_data) %in% "bad")]
y <- as.factor(filtered_data$bad)

# Create the Random Forest classification model
model <- train(
  X,
  y,
  method = "rf",
  trControl = trainControl(method = "none"),
  tuneGrid = data.frame(mtry = floor(sqrt(ncol(X))))
)

# Perform feature selection using embedded methods (Random Forest importance)
importance <- varImp(model)$importance
selected_features <- rownames(importance)[order(importance$Overall, decreasing = TRUE)]
print(selected_features)

Python Programming:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Filter numeric columns including the target variable
numeric_columns = loan_data.select_dtypes(include=[np.number]).columns
numeric_columns = numeric_columns.append(pd.Index(['bad']))  # Include the target variable

# Create a copy of the data set to avoid modifying the original DataFrame
filtered_data = loan_data[numeric_columns].copy()

# Remove rows with missing values
filtered_data.dropna(inplace=True)

# Separate the features (X) and target variable (y)
X = filtered_data.drop('bad', axis=1)
y = filtered_data['bad']

# Create the Random Forest classification model
model = RandomForestClassifier()

# Perform feature selection using embedded methods (Random Forest)
selector = SelectFromModel(model)
selector.fit(X, y)

# Get the selected features
selected_features = X.columns[selector.get_support()]
print(selected_features)

SAS Programming:

/* Create global variable for numeric variables */
PROC CONTENTS NOPRINT DATA=loan_data (KEEP=_NUMERIC_ DROP=id bad)
    OUT=var (KEEP=name); RUN;
PROC SQL NOPRINT; SELECT name INTO :num separated by " " FROM var; QUIT;

/* Create a copy of the data set to avoid modifying the original data */
DATA work.filtered_data;
    SET work.loan_data;
RUN;

/* Remove rows with a missing target value */
DATA work.filtered_data;
    SET work.filtered_data;
    IF NOT missing(BAD) THEN OUTPUT;
RUN;

/* Specify the Random Forest classification model */
PROC HPFOREST DATA=work.filtered_data;
    TARGET BAD / LEVEL=binary;
    INPUT &num. / LEVEL=interval;
    ODS OUTPUT VariableImportance=var_importance;
RUN;

Table 3.10 shows the Python output. This output shows the list of features selected
from the random forest classifier algorithm. Additional output options are available
for both the Python and R models; however, these options require additional code
to display specific output.

Table 3.10: Filtering with Embedded Method – Python Output – Random Forest
Classifier

print(selected_features)
Index(['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'revol_bal', 'revol_util',
'total_acc'], dtype='object')

Creating New Features Through Feature Engineering

Feature engineering is a critical step in the data science workflow. It empowers us to transform raw data into meaningful features that enhance the performance and interpretability of machine learning models. By harnessing domain knowledge and creative techniques, feature engineering enables us to extract valuable insights, uncover hidden patterns, and optimize the representation of our data.

In this section, we will delve into the art of feature engineering and explore its vital
role in building robust and accurate models. We will examine several fundamental
techniques that allow us to engineer features effectively, providing our models with
the best possible representation of the underlying data.

Key points to consider in feature engineering:

● Feature Scaling:
  - Scaling numerical features to a consistent range (e.g., normalization or standardization) ensures that no single feature dominates the learning process. This is particularly important for algorithms sensitive to feature scales, such as K-nearest neighbors (KNN) and support vector machines (SVM).
  - For tree-based models like random forests or gradient boosting, feature scaling may not be necessary since these models operate based on decision rules rather than feature magnitudes.

● Encoding Categorical Variables:
  - Converting categorical variables into numerical representations is essential for most machine learning algorithms. Common encoding techniques include one-hot encoding, label encoding, and target encoding.
  - When using one-hot encoding, remember the "dummy variable trap" in regression models. One of the dummy variables must be excluded to avoid multicollinearity, which can impact model stability. This exclusion is typically achieved by either dropping one dummy variable explicitly or utilizing other encoding schemes like effect encoding or sum encoding (see the sketch following this list).

● Feature Interaction:
  - Creating new features by combining or interacting with existing ones can capture complex relationships and interactions that individual features may fail to represent adequately.
  - Feature interaction can be achieved through techniques like polynomial features, which introduce higher-order terms or interactions between features.
  - Feature interaction is particularly beneficial for models that aim to capture non-linear relationships and complex patterns.

These points provide a foundation for your exploration of feature engineering. As


you dive deeper into this field, you will encounter various advanced techniques and
strategies tailored to your specific data and modeling objectives. Remember, the
key is to extract and engineer features that encapsulate the most salient
information within the data, ultimately unlocking the full potential of your machine
learning models.

Feature Scaling
Feature scaling is a preprocessing technique that aims to bring all data set features
onto a similar scale. It involves transforming the values of different features to a
consistent range, which helps improve the performance of many machine learning
algorithms. By scaling the features, we ensure that no single feature dominates the
learning process due to differences in their magnitudes.

When to Use Feature Scaling:


Feature scaling is typically recommended in the following scenarios:

● Algorithms that Rely on Distance Measures:


- Machine learning algorithms that calculate distances between data
points, such as K-nearest neighbors (KNN) and support vector
machines (SVM), can be influenced by the scale of the features.

Scaling the features helps ensure that all features contribute equally
to the distance calculations.

● Gradient Descent Optimization:


- Algorithms that use gradient descent optimization, like linear
regression, logistic regression, and neural networks, can benefit
from feature scaling. Scaling the features can help the optimization
process converge faster and reach the global minimum more
effectively.

● Regularization Techniques:
- Regularization techniques, such as L1 and L2 regularization, prevent
overfitting in regression and classification models. Feature scaling
ensures that the regularization terms have a similar effect on all
features, preventing any particular feature from dominating the
regularization process.

Common Feature Scaling Techniques:

● Normalization (Min-Max Scaling):


- Normalization scales the values of a feature to a range between 0
and 1. It is achieved by subtracting the feature's minimum value and
dividing the result by the difference between the maximum and
minimum values.

- This technique preserves the relative relationships between the


feature values but may be sensitive to outliers.

● Standardization (Z-score Scaling):

- Standardization transforms the feature values to have zero mean
and unit variance. It involves subtracting the mean and dividing the
result by the standard deviation of the feature.

- Standardization maintains the shape of the distribution and is less
affected by outliers than normalization. (Both formulas are sketched
in the example below.)
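
To make the two techniques concrete, here is a minimal Python sketch of both formulas applied to a single column. It assumes the pandas DataFrame loan_data and the numeric column loan_amnt used elsewhere in this chapter; it is an illustration, not one of the book's program listings.

import pandas as pd

# Work on one numeric column of the modeling data set
x = loan_data["loan_amnt"]

# Normalization (min-max scaling): rescales values to the 0-1 range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score scaling): zero mean and unit variance
x_zscore = (x - x.mean()) / x.std()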

Determining which features need to be scaled depends on the nature of the features and the machine learning algorithm that you plan to use. Here are some general guidelines to consider when deciding whether to scale features:

● Numerical Features:
- Continuous numerical features usually require scaling. Examples
include age, income, and temperature.

- Scaling might also benefit discrete numerical features with a


meaningful order, such as a feature representing rating scores from
1 to 10.

● Categorical Features:
- Categorical features already encoded as numeric values, such as
binary variables (0/1), do not typically require scaling.

- Scaling may be necessary if the categorical feature is represented


using arbitrary numeric values (e.g., color: 1=red, 2=blue, 3=green).

● Ordinal Features:
- Ordinal features have a natural order or ranking. Scaling might not
be necessary if the order is already preserved.
- However, if the range of values in ordinal features is large, scaling
can help improve model performance.

● Tree-Based Models:
- Tree-based models, such as decision trees and random forests, are
not sensitive to feature scaling. Scaling may not be required for
these algorithms.

● Distance-Based Models:
- Models that rely on distance measures, such as K-nearest neighbors
(KNN) and support vector machines (SVM), often require feature
scaling for optimal performance.

Ultimately, the decision to scale features depends on the specific data set, the
characteristics of the features, and the algorithms being employed. Experimenting

with and without scaling is recommended to observe the impact on model


performance and make an informed decision based on the results.

Program 3.12 provides the code examples for feature scaling in SAS, Python, and R.

Program 3.12: Feature Scaling

R Programming:

library(caret)

numeric_cols <- sapply(loan_data, is.numeric)
preprocess_params <- preProcess(loan_data[, numeric_cols],
                                method = c("center", "scale"))
loan_data_scaled <- predict(preprocess_params, loan_data[, numeric_cols])

Python Programming:

import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric_cols = loan_data.select_dtypes(include=['float64', 'int64']).columns
scaler = StandardScaler()
loan_data_scaled = loan_data.copy()
loan_data_scaled[numeric_cols] = scaler.fit_transform(loan_data[numeric_cols])

SAS Programming:

PROC STANDARD DATA=loan_data OUT=loan_data_scaled MEAN=0 STD=1;
    VAR _NUMERIC_; /* Specify the numeric variables to scale */
RUN;

Binary variables, also known as indicator variables, typically do not need to be included in feature scaling. This is because binary variables already have a limited range of values (0 or 1) and do not require scaling to a specific range or distribution.

Feature scaling techniques such as standardization (mean centering and scaling to unit variance) or normalization (scaling to a specific range) are commonly applied to continuous variables with a wide range of values or different units of measurement. These techniques aim to bring all variables to a similar scale to avoid dominance by variables with larger magnitudes.

However, it is important to note that the decision to exclude binary variables from
feature scaling depends on the specific context and the machine learning algorithm
being used. Some algorithms may be sensitive to the scaling of binary variables,
while others may not be affected.

In general, if you use an algorithm that is not sensitive to feature scaling, such as
tree-based models (e.g., decision trees, random forests), you can include binary
variables in the feature scaling process without significant impact. On the other
hand, if you're using algorithms sensitive to scaling, such as logistic regression or
support vector machines, it is advisable to exclude binary variables from feature
scaling to prevent unnecessary transformations.

Always consider the requirements and assumptions of the specific algorithm being
used and adjust the feature scaling approach accordingly.
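
As one concrete illustration, a common Python approach is to restrict the scaler to numeric columns with more than two distinct values, leaving 0/1 indicators untouched. This is a minimal sketch that assumes the loan_data DataFrame and the StandardScaler shown in Program 3.12.

from sklearn.preprocessing import StandardScaler

# Identify numeric columns, then drop binary (two-value) indicators from the scaling list
numeric_cols = loan_data.select_dtypes(include=['float64', 'int64']).columns
cols_to_scale = [col for col in numeric_cols if loan_data[col].nunique() > 2]

scaler = StandardScaler()
loan_data_scaled = loan_data.copy()
loan_data_scaled[cols_to_scale] = scaler.fit_transform(loan_data[cols_to_scale])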

Encoding Categorical Variables: Dummy Variables

Categorical variables are commonly encountered in many real-world data sets, and
they require appropriate encoding to be used effectively in machine learning
models. One popular technique for encoding categorical variables is the use of
dummy variables. Dummy variables create binary features for each unique category
within a categorical variable. Here are some key points to consider about dummy
variables:

● Importance of Dummy Variables: Dummy variables are crucial in capturing


categorical information within a data set and making it usable for machine
learning algorithms. By converting categorical variables into numerical
features, dummy variables enable algorithms to operate on the data
effectively. They enable the incorporation of categorical information as
binary indicators, avoiding any ordinal assumptions that may not hold for
the categories.

● When to Use Dummy Variables: Dummy variables are typically used when
dealing with nominal or unordered categorical variables. These are variables
with categories that have no inherent order or ranking. Examples include
variables like color (red, blue, green), city (New York, London, Paris), or
product type (A, B, C). By creating dummy variables, each category becomes
a separate binary feature, preserving the distinctiveness of the categories
without imposing any numerical order.

● Dummy Variables in Regression Models: When using dummy variables in


regression models, it is crucial to manage multicollinearity. Multicollinearity
occurs when two or more dummy variables are highly correlated because
they represent categories of the same categorical variable. To avoid this
issue, one of the dummy variables must be excluded as a reference
category. The excluded category serves as a baseline against which the
effects of the other categories are compared. This is known as the "dummy
variable trap."

● Dealing with High Cardinality: When dealing with categorical variables with
many unique categories (high cardinality), creating individual dummy
variables for each category can lead to an explosion in the feature space. In
such cases, it may be necessary to apply additional techniques like feature
hashing or entity embeddings to reduce dimensionality while still capturing
useful information from the categorical variable.

● Consideration for Tree-Based Models: For tree-based models like decision


trees and random forests, managing categorical variables using dummy
variables is not always necessary. These models can handle categorical
variables directly without the need for explicit encoding. However, it may
still be beneficial to encode categorical variables as dummy variables in
certain scenarios, such as using tree-based models in conjunction with other
algorithms or using gradient boosting frameworks that require numerical
inputs.

By understanding the significance of dummy variables in encoding categorical


variables, their appropriate usage, and considerations for different types of models,
data scientists can effectively incorporate categorical information into their machine
learning pipelines.

As an example, let’s look at the variable “home_ownership”. This is a categorical


variable with four levels. Table 3.11 below shows the breakout of the
home_ownership variable.

Table 3.11: Home Ownership Frequency Distribution – SAS Output

home_ownership    Frequency    Percent    Cumulative Frequency    Cumulative Percent
ANY                      19       0.03                      19                  0.03
MORTGAGE              33484      49.27                   33503                  49.3
OWN                    7695      11.32                   41198                 60.62
RENT                  26764      39.38                   67962                   100

We want to create a binary numeric indicator for each level of this categorical
variable. We could perform this manually with explicit code as demonstrated in
Program 3.13:

Program 3.13: Manual Categorical Encoding – SAS Code

DATA dummy;
SET loan_data;
IF home_ownership = 'ANY' THEN ANY_IND = 1; ELSE ANY_IND = 0;
IF home_ownership = 'MORTGAGE' THEN MORT_IND = 1; ELSE MORT_IND = 0;
IF home_ownership = 'OWN' THEN OWN_IND = 1; ELSE OWN_IND = 0;
IF home_ownership = 'RENT' THEN RENT_IND = 1; ELSE RENT_IND = 0;
RUN;

Although this approach works perfectly fine, it can become burdensome when a
categorical variable has several levels. A more efficient approach is to use the
methods and procedures available in each programming language. Program 3.14
shows how to create dummy variables efficiently in each programming language.

Program 3.14: Encoding Categorical Variables

R Programming:

dummy_vars <- model.matrix(~ home_ownership - 1, data = loan_data)
loan_data <- cbind(loan_data, dummy_vars)
# Drop only the original character variable, keeping the new dummy columns
loan_data <- loan_data[, !grepl("^home_ownership$", colnames(loan_data))]

Python Programming:

dummy_vars = pd.get_dummies(loan_data['home_ownership'], prefix='home_ownership')
loan_data = pd.concat([loan_data, dummy_vars], axis=1)
loan_data.drop('home_ownership', axis=1, inplace=True)

SAS Programming:

PROC GLMSELECT DATA=loan_data OUTDESIGN=design;
    CLASS home_ownership;
    MODEL bad = home_ownership / SELECTION=STEPWISE;
RUN;
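
If you plan to feed these dummy variables into a regression model, recall the dummy variable trap discussed earlier: one level should be dropped as the reference category. In Python, pandas supports this directly; the sketch below is one way to do it and assumes the same loan_data DataFrame used above.

import pandas as pd

# drop_first=True omits one level (here ANY, the first level in sorted order) as the
# reference category, avoiding perfect multicollinearity among the dummy columns
dummy_vars = pd.get_dummies(loan_data['home_ownership'],
                            prefix='home_ownership', drop_first=True)
loan_data = pd.concat([loan_data, dummy_vars], axis=1).drop(columns='home_ownership')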

Remember that it is important to know how many levels there are for each of your
categorical variables. If you have a variable encoded as a character variable that has
hundreds of levels, such as “job category” or “ZIP code,” then this will result in a
large number of dummy variables, which can lead to issues with high cardinality.

High cardinality refers to a categorical variable with a large number of distinct


values or categories. Here are four chief concerns related to high cardinality:

● Increased dimensionality: High cardinality variables can significantly


increase the dimensionality of the data set. Each distinct category becomes
a separate feature or dummy variable, leading to more input features. This
can impact the efficiency and computational complexity of modeling
algorithms.

● Sparse data: When dealing with high cardinality variables, some categories
may have very few observations or occur infrequently in the data set. This
leads to sparse data, where many dummy variables have a low frequency of
occurrence. Sparse data can pose challenges for certain modeling
techniques that assume a sufficient number of observations in each
category.

● Model overfitting: High cardinality variables can cause overfitting,


especially if the number of observations is limited. Instead of generalizing

well to unseen data, the model may learn patterns specific to each category,
including noise or random fluctuations. Overfitting can lead to poor model
performance and a lack of generalization ability.

● Interpretability and feature importance: Interpreting the impact of high


cardinality variables on the model's predictions becomes challenging. With
numerous dummy variables, understanding the individual contribution of
each category to the model's output becomes less straightforward.
Identifying and interpreting the most influential categories or extracting
meaningful insights from the model may be difficult.

Managing high cardinality variables requires careful consideration and


preprocessing techniques to address these concerns. Some approaches include
dimensionality reduction techniques, feature selection methods, or grouping
categories based on domain knowledge or statistical criteria.
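
As one illustration of the grouping approach, infrequent levels of a high-cardinality variable can be collapsed into a single "OTHER" bucket before creating dummy variables. The Python sketch below is a minimal example; the column name is hypothetical and the 1% threshold is an arbitrary choice for illustration.

# Collapse levels that appear in less than 1% of rows into a single "OTHER" level
cat_col = "job_category"   # hypothetical high-cardinality column used for illustration
freq = loan_data[cat_col].value_counts(normalize=True)
rare_levels = freq[freq < 0.01].index
loan_data[cat_col + "_grouped"] = loan_data[cat_col].where(
    ~loan_data[cat_col].isin(rare_levels), "OTHER")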

Feature Interaction

Feature interaction refers to the phenomenon where the combined effect of two or
more features in a machine learning model significantly impacts the target variable.
It recognizes that the relationship between features and the target variable may not
be linear or independent but can be influenced by their interactions. By capturing
these interactions, we can improve the model's predictive power and gain deeper
insights into the data.

Here are four major topics related to feature interaction in machine learning:

1. Polynomial Features: Creating polynomial features involves generating new


features by taking powers and interactions of existing features. This allows
the model to capture non-linear relationships between the features and the
target variable. Polynomials can uncover complex patterns and provide a
more flexible data representation.

2. Feature Crosses: Feature crosses combine two or more features to create


new composite features. This technique is beneficial when the interaction
between specific features is expected to significantly impact the target

variable. Feature crosses enable the model to capture synergistic effects


that individual features may not represent independently.

3. Non-linear Transformations: Non-linear transformations involve applying


mathematical functions (e.g., logarithm, exponential, square root) to
features to reveal hidden patterns and relationships. These transformations
can help uncover non-linear dependencies and enhance the model's ability
to capture complex interactions.

4. Feature Importance: Evaluating the importance of features, including


interaction terms, can provide insights into the contribution of individual
and combined features to the model's predictive performance.
Understanding which interactions are most influential can guide feature
selection and engineering efforts.

Program 3.15 provides the code examples for developing polynomial features for
the numeric feature “loan_amnt” in each of the programming languages:

Program 3.15: Creating Polynomial Variables

R Programming:

loan_amnt_poly <- poly(loan_data$loan_amnt, degree = 2, raw = TRUE)
loan_amnt_poly <- data.frame(loan_amnt_poly)

Python Programming:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
loan_amnt_poly = poly.fit_transform(X[['loan_amnt']])

SAS Programming:

DATA poly;
    SET loan_data;
    loan_amnt_poly = loan_amnt**2;
RUN;

When deciding which variables to combine for feature interaction, it's important to consider the underlying relationships in the data and your own domain knowledge. Here are a couple of potential feature interactions that you could explore based on the numeric variables in the data set:

● Income and Debt-to-Income Ratio: The interaction between income and


debt-to-income ratio can capture the impact of an individual's income level
on their ability to manage their debts. It may provide insights into how
income influences the debt burden and repayment capacity.

● Number of Open Accounts and Loan Amount: The interaction between the
number of open accounts and the loan amount can capture the relationship
between an individual's credit activity and the size of the loan they are
applying for. It may reveal whether individuals with more open accounts
tend to apply for larger loans.

These are just a few examples, and the choice of feature interactions ultimately
depends on the specific context and goals of your analysis. It's recommended to
explore different combinations and evaluate their impact on the model's
performance and interpretability. Additionally, consulting with domain experts or
conducting thorough exploratory data analysis can help guide the selection of
meaningful feature interactions.

Let’s take the “income and debt-to-income ratio” feature interaction example and
look at the code in all three programming languages.

Program 3.16: Feature Interactions

R Programming:

loan_data$DTI_INC_INTERACTION <- loan_data$dti * loan_data$annual_inc

Python Programming:

import pandas as pd

loan_data['DTI_INC_INTERACTION'] = loan_data['dti'] * loan_data['annual_inc']

SAS Programming:

DATA loan_data;
    SET loan_data;
    DTI_INC_INTERACTION = dti * annual_inc;
RUN;

Reducing the Dimensionality of the Data Set

Reducing the dimensionality of a data set is a crucial step in data preprocessing and
feature engineering. As data sets grow in size and complexity, they often contain a
large number of features, making it challenging to extract meaningful insights or
build efficient models. Dimensionality reduction techniques aim to overcome this
problem by transforming the data into a lower-dimensional space while preserving
essential information. By reducing the number of features, we can improve
computational efficiency, alleviate the curse of dimensionality, enhance model
interpretability, and mitigate issues such as overfitting. This section will explore
various dimensionality reduction techniques and their practical applications.

Key Concepts:
1. Principal Component Analysis (PCA): PCA is a widely used linear
dimensionality reduction technique. It identifies the orthogonal axes, known
as principal components, which capture the maximum variance in the data.
By projecting the data onto a subset of these components, we can represent
the data set in a lower-dimensional space while retaining the most valuable
information.

2. Feature Selection: Feature selection involves identifying a subset of the


most relevant features from the original data set. This can be done through
statistical techniques, such as univariate selection or recursive feature
elimination, or by using model-based approaches that evaluate the
importance of features in the context of a specific machine learning
algorithm.

3. Manifold Learning: Manifold learning techniques aim to preserve the


intrinsic structure of the data by mapping it onto a lower-dimensional
space. These methods, such as t-SNE (t-distributed Stochastic Neighbor
Embedding) or Isomap, are particularly useful for visualizing high-

dimensional data or capturing non-linear relationships that may be missed


by linear methods like PCA.

4. Autoencoders: Autoencoders are neural network architectures used for


unsupervised learning and dimensionality reduction. They consist of an
encoder that compresses the input data into a lower-dimensional
representation and a decoder that reconstructs the original input from this
compressed representation. By training the autoencoder to minimize the
reconstruction error, we can effectively learn a compressed representation
that captures the most essential features of the data.

These concepts provide a foundation for understanding and implementing


dimensionality reduction techniques. By applying these methods judiciously, data
scientists can reduce the complexity of their data sets while preserving critical
information, enabling more efficient analysis and modeling.
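
PCA, the first of these techniques, is demonstrated in detail in the next section. For the manifold learning approach, the sketch below shows one way it might look in Python using scikit-learn's t-SNE; it assumes a numeric, imputed version of loan_data and is intended only as a two-dimensional visualization aid, not a definitive workflow.

from sklearn.manifold import TSNE

# Standardize the numeric features (t-SNE cannot handle missing values,
# so this assumes imputation has already been performed)
numeric_cols = loan_data.select_dtypes(include='number').columns
X_std = (loan_data[numeric_cols] - loan_data[numeric_cols].mean()) / loan_data[numeric_cols].std()

# Project the observations onto two dimensions for visualization
tsne = TSNE(n_components=2, random_state=42)
embedding = tsne.fit_transform(X_std)   # one (x, y) coordinate pair per observation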

Principal Component Analysis (PCA)

Dimensionality reduction techniques are crucial in tackling high-dimensional data


sets by extracting essential information and reducing the number of variables. One
prominent method for dimensionality reduction is Principal Component Analysis
(PCA). PCA aims to transform the original variables into a new set of uncorrelated
variables called principal components, sorted in descending order of their variances.
By retaining a subset of the principal components that capture the most significant
variability, PCA allows us to condense the information in the data while preserving
its essential structure. It facilitates data visualization, pattern recognition, and
subsequent analysis by focusing on the most influential components. PCA effectively
manages large feature spaces and uncovers hidden relationships, enabling us to gain
insights and make informed decisions from complex data sets.

When selecting variables for Principal Component Analysis (PCA), the primary
consideration is to choose numerical variables that are relevant and informative for
capturing the underlying variance in the data. Here are some guidelines to help
determine which variables to use for PCA analysis:

1. Variable Type: PCA is suitable for numerical variables rather than


categorical or ordinal variables. Ensure that the variables you select for PCA
are continuous or discrete numeric variables.

2. Data Scale: Normalize or standardize the variables to have comparable


scales before performing PCA. This step is important because PCA is
sensitive to the relative magnitudes of variables. Scaling the variables helps
avoid dominance by variables with larger ranges.

3. Variability: Consider variables that exhibit significant variability or variance


in the data set. Variables with low variability may not contribute much to
the principal components and can be less informative for dimensionality
reduction.

4. Correlation: Look for variables that are correlated with each other. High
correlation indicates a potential redundancy in the information captured by
those variables. Including highly correlated variables can result in a
multicollinearity problem, affecting the principal components'
interpretability.

5. Domain Knowledge: Consider variables that are conceptually meaningful


and relevant to the project. Expert knowledge about the data and the
domain can guide the selection of variables that are likely to capture
important aspects of the underlying data structure.

It's essential to strike a balance between including enough informative variables to


capture the essence of the data and avoiding variables that introduce noise or
redundancy. Exploratory data analysis, correlation analysis, and domain knowledge
can guide the selection process.

PCA Variable Selection

When conducting a PCA analysis, it is generally recommended to exclude binary


indicators (dummy variables) that represent categorical or binary features. This is
because PCA assumes continuous variables and operates based on the covariance or
correlation matrix of the input variables. Including binary indicators can lead to
distorted results as these variables have limited variability and do not adhere to the
assumptions of PCA.

Including binary indicators in PCA can result in their domination over other variables
and may produce misleading outcomes. Therefore, excluding binary indicators or

categorical variables from the PCA analysis is advisable. However, if there are
ordinal variables with multiple levels, they can be included in PCA after appropriate
scaling or transformation. It is essential to carefully consider the nature of the
variables and their suitability for PCA to ensure accurate dimensionality reduction
and meaningful results.

When conducting a PCA analysis, it is generally recommended to include only


numeric continuous variables. PCA operates on the covariance or correlation matrix,
and it assumes continuous variables to effectively capture their variability and
relationships. Categorical variables or binary indicators, as mentioned earlier, should
be excluded.

Regarding variable selection, it is common practice to consider variables with high


correlation and variability for PCA. Variables with high correlation tend to capture
similar information and including them in PCA can lead to redundancy. On the other
hand, variables with low variability may not contribute significantly to the principal
components and can be less informative.

Therefore, it is beneficial to select variables based on their correlation and


variability. Variables with higher correlations and greater variability are more likely
to have a significant impact on the principal components. This helps preserve
important information and reduce the data set's dimensionality effectively.
However, the final selection should be based on domain knowledge and the specific
goals of the analysis.

In our PCA analysis, we are including the variables "total accounts" (total_acc) and
"open accounts" (open_acc), as well as "public records" (pub_rec) and "public
bankruptcies" (pub_rec_bankruptcies). These variables have a high correlation
above 70% and exhibit a wide standard deviation. By including these variables, we
aim to capture the underlying patterns and relationships in the data represented by
these correlated and variable-rich features.
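
The correlation and variability screening described above can be verified directly before running PCA. The short Python sketch below is one way to check the four candidate variables in loan_data.

# Inspect pairwise correlations and standard deviations of the PCA candidates
pca_vars = ['total_acc', 'open_acc', 'pub_rec', 'pub_rec_bankruptcies']

print(loan_data[pca_vars].corr())   # look for pairs correlated above roughly 0.70
print(loan_data[pca_vars].std())    # confirm the variables show meaningful variability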

Now, let's provide coding examples in each of the three programming languages for
performing PCA analysis on these variables:

Program 3.17: Dimensionality Reduction – PCA Analysis

R Programming:

library(dplyr)
library(factoextra)

data <- loan_data %>%
    select(total_acc, open_acc, pub_rec, pub_rec_bankruptcies)

# Standardize the data
data_std <- scale(data)

# Perform PCA
pca_result <- prcomp(data_std)

# Access the principal components
principal_components <- pca_result$rotation

Python Programming:

from sklearn.decomposition import PCA
import pandas as pd

data = loan_data[['total_acc', 'open_acc', 'pub_rec', 'pub_rec_bankruptcies']]

# Standardize the data
data_std = (data - data.mean()) / data.std()

# Perform PCA
pca = PCA()
pca.fit(data_std)

# Access the principal components
principal_components = pca.components_

SAS Programming:

/* Standardize the data */
PROC STANDARD DATA=loan_data OUT=loan_data_std MEAN=0 STD=1;
    VAR total_acc open_acc pub_rec pub_rec_bankruptcies;
RUN;

/* Perform PCA analysis on standardized data */
PROC PRINCOMP DATA=loan_data_std OUT=loan_data_pca;
    VAR total_acc open_acc pub_rec pub_rec_bankruptcies;
RUN;

The output of the SAS PCA analysis automatically generates a scree plot and a
variance explained plot. The Python and R algorithms can also generate these plots
with additional code. The scree plot and variance explained plot are commonly used
to interpret the results of a PCA analysis. Here's a brief explanation of each plot:

1. Scree Plot:
The scree plot displays the eigenvalues of each principal component in
descending order. The eigenvalues represent the amount of variance
explained by each principal component. The scree plot helps determine the
number of principal components to retain in the analysis. The plot usually
shows a steep drop in eigenvalues at the beginning, followed by a leveling
off. The point at which the drop levels off is considered a cutoff point for
retaining principal components. Components before this point are typically
retained as they explain a significant amount of variance, while those after
the point contribute less to the overall variance.

2. Variance Explained Plot:


The variance explained plot shows the cumulative proportion of variance
explained by the retained principal components. It helps understand how
much of the total variance in the data is captured by including a certain
number of principal components. The plot displays a line graph that starts at
zero and gradually increases as more principal components are added. The
steepness of the curve indicates the amount of variance explained by each

component. Ideally, we want to retain enough principal components to


capture a substantial portion of the total variance (e.g., 80% or more).

By examining these plots, you can determine the optimal number of principal
components to retain based on the steep drop in eigenvalues in the scree plot and
the cumulative proportion of variance explained in the variance explained plot.
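
Because the Python and R versions of Program 3.17 do not produce these plots automatically, the following is a minimal Python sketch that builds both plots from the fitted pca object in Program 3.17. It assumes matplotlib is available and shows only one way to draw them.

import numpy as np
import matplotlib.pyplot as plt

eigenvalues = pca.explained_variance_               # one eigenvalue per principal component
cum_var = np.cumsum(pca.explained_variance_ratio_)  # cumulative proportion of variance explained
components = np.arange(1, len(eigenvalues) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scree plot: eigenvalues in descending order
ax1.plot(components, eigenvalues, marker="o")
ax1.set(title="Scree Plot", xlabel="Principal Component", ylabel="Eigenvalue")

# Variance explained plot: cumulative proportion of variance
ax2.plot(components, cum_var, marker="o")
ax2.set(title="Variance Explained", xlabel="Principal Component",
        ylabel="Cumulative Proportion")

plt.tight_layout()
plt.show()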

Figure 3.5 shows the scree and variance explained plots generated from the SAS
code.

Figure 3.5: Scree and Variance Explained Plots – SAS Output

The eigenvalues in the PCA analysis scree plot represent the variance explained by
each principal component. In our example, the scree plot shows that the
eigenvalues for the first two principal components are relatively high (1.73 and
1.65), indicating that they capture significant variance in the data.
The drop in eigenvalues for the third and fourth principal components (0.35 and
0.26) suggests that these components explain less variance compared to the first
two components. Typically, a significant drop in eigenvalues indicates that the
corresponding principal components may not contribute much to the overall
variability in the data.

To determine the number of principal components to retain, you can consider the
"elbow" rule, which suggests selecting the components before the significant drop
in eigenvalues. In our example, you may choose to retain the first two principal
components as they have relatively high eigenvalues and capture a substantial
portion of the variance in the data.

Decision Time: Selecting Principal Components


It's important to note that interpreting eigenvalues in the scree plot
is just one criterion for determining the number of principal
components to retain. You may also consider other factors, such as
the cumulative proportion of variance explained and your specific
analysis goals, when deciding about the number of components to
retain.

Data Balancing

Balancing a data set is a crucial step in data preprocessing when working with
imbalanced data sets. In many real-world scenarios, data sets often exhibit class
imbalance, where one class significantly outweighs the other(s). This can pose
challenges in machine learning models, as they tend to be biased toward the
majority class, leading to poor performance in predicting the minority class.
Balancing the data set helps address this issue and allows the model to learn
effectively from all classes present in the data. In this section, we will explore the
importance of balancing data sets, discuss situations where it is necessary, and
outline practical strategies to achieve balanced data sets.

Why and when to balance a data set:

1. Overcoming class imbalance: Balancing a data set is essential to mitigate


the impact of class imbalance, where the minority class has limited
representation. By balancing the data set, we ensure that the model
receives sufficient training examples from all classes, improving its ability to
learn patterns and make accurate predictions for both majority and minority
classes.

2. Enhanced model performance: Balancing the data set can improve model
performance, particularly in scenarios where accurate predictions for the
minority class are critical. Models trained on imbalanced data tend to
prioritize the majority class, resulting in lower recall and precision for the
minority class. Balancing the data set can help address this bias and provide
more balanced performance across all classes.

3. Unbiased evaluation of model performance: When evaluating the


performance of a model, it is crucial to have a representative data set that
reflects the real-world class distribution. Balancing the data set ensures that
the evaluation metrics accurately reflect the model's ability to generalize to
new data, as it is not skewed by the disproportionate representation of
classes.

4. Addressing specific business requirements: Balancing a data set becomes


especially important in certain applications where the cost of
misclassification for different classes varies significantly. For instance, in
medical diagnosis, correctly identifying rare diseases may be more critical
than accurately predicting common conditions. Balancing the data set helps
ensure that the model is trained to make informed decisions for all classes
based on their respective importance or impact.

While there is no hard-and-fast rule regarding the specific class imbalance threshold
that requires balancing the data set, a general guideline is to consider balancing
when the class distribution is significantly skewed. The 80/20 ratio (80% for one
class, 20% for the other) can be considered as a potential threshold, but it's not a
definitive rule.

Decision Point: Balancing the Data Set

The decision to balance the data set depends on several key factors:

● Problem Domain: Consider the context – certain fields, like


healthcare or fraud detection, may require balancing to better
identify minority classes.
● Class Importance: Assess the significance of each class –
misclassifying a critical class could justify balancing.
● Model Performance: Evaluate your model’s effectiveness on
imbalanced data – if it struggles to detect the minority class,
balancing may be necessary.

Balancing strategies include oversampling, undersampling, or using


techniques like SMOTE. Choose based on your specific data set and
analysis goals.
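
Program 3.18 later in this section demonstrates undersampling. For the SMOTE option mentioned above, a minimal Python sketch using the third-party imbalanced-learn package (a separate install from scikit-learn) might look like the following; it assumes numeric, imputed predictors.

from imblearn.over_sampling import SMOTE

# X holds the numeric predictors and y the binary target "bad";
# SMOTE synthesizes new minority-class observations until the classes are balanced
X = loan_data.drop(columns='bad').select_dtypes(include='number')
y = loan_data['bad']

smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)
print(y_balanced.value_counts())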

Here are some considerations:

1. Severity of class imbalance: Assess the extent of class imbalance. If the


class distribution is heavily skewed, such as 90/10 or 95/5, it indicates a
significant imbalance that may require balancing.

2. Importance of minority class: Evaluate the significance of correctly


predicting the minority class. If misclassifying the minority class has serious
consequences or the minority class represents critical instances, balancing
the data set becomes more crucial.

3. Performance on imbalanced data: Examine the model's performance on


the imbalanced data set. If the model struggles to accurately predict the
minority class or shows significantly lower recall or precision for the
minority class, it suggests the need for balancing.

4. Business requirements and domain knowledge: Consider the specific


requirements and domain knowledge of your problem. Balancing the data
set may be necessary if there are legal, ethical, or practical reasons to
ensure fair representation of all classes.
The decision to balance the data set should be based on carefully assessing the class
distribution, the importance of accurate predictions for each class, and the model's
performance on the imbalanced data. It's important to strike a balance between

addressing class imbalance and avoiding overcorrection that might introduce biases
or distort the underlying data characteristics.

Modeling Scenarios
Data set balancing is not limited to classification models alone. While class
imbalance is often a concern in classification tasks, balancing data can also be
relevant in other modeling scenarios. Here are a few examples:

● Classification models: Class imbalance is a common challenge in binary or


multi-class classification problems, where one or more classes have
significantly fewer instances than others. Balancing the data set by
equalizing the representation of classes can help improve the model's
performance and mitigate bias toward the majority class.

● Anomaly detection: In anomaly detection tasks, where the objective is to


identify rare or abnormal instances, imbalanced data can arise when
anomalies are scarce compared to normal instances. Balancing the data set
by ensuring a more equitable representation of anomalies can assist in
accurately identifying and capturing these rare instances.

● Regression models: While class imbalance is not typically a concern in


regression problems, there may be cases where the target variable exhibits
extreme skewness or has outliers that disproportionately impact the
model's performance. In such situations, applying techniques to address the
imbalance or skewness in the target variable can be beneficial.

● Recommender systems: In recommendation systems, the user-item


interaction data can exhibit significant skewness, where some items receive
many ratings while others have very few. Balancing the data by adjusting
the weight or relevance of items based on their popularity or other factors
can help provide fair and diverse recommendations.

While class imbalance is commonly associated with classification models, data set
balancing can be relevant in various modeling scenarios to address data skewness,
improve model performance, and ensure fairness and accuracy in predictions.

Machine Learning Algorithms Sensitivity

Certain machine learning algorithms can be sensitive to imbalanced data. Class


imbalance can particularly impact models that rely on accurate estimation of class
probabilities or decision boundaries. Here are a few examples:

● Logistic Regression: Logistic regression is often sensitive to class imbalance,


especially when the minority class is underrepresented. Since logistic
regression estimates class probabilities, imbalanced data can lead to biased
predictions and a tendency to favor the majority class. Balancing the data
can help mitigate this issue and improve the model's performance.

● Support Vector Machines (SVM): SVMs can be affected by class imbalance,


as they aim to find an optimal decision boundary. When one class is
dominant, the SVM may prioritize the majority class, leading to suboptimal
performance for the minority class. Balancing the data can help the SVM
capture patterns from both classes more effectively.

● Decision Trees: Decision trees are less sensitive to class imbalance than
some other algorithms. However, when combined with ensemble methods
like Random Forest or Gradient Boosting, imbalanced data can still impact
their performance. In ensemble models, the individual decision trees can be
influenced by the class distribution, and balancing the data can help prevent
bias toward the majority class.

● Gradient Boosting: Gradient Boosting models, such as XGBoost and


LightGBM, are generally robust to class imbalance due to their ensemble
nature and the ability to assign higher weights to minority class samples.
However, severe class imbalance can still pose challenges, and balancing the
data can be beneficial to ensure more reliable predictions.

While various algorithms can be affected by class imbalance, the sensitivity may
vary. Logistic regression and SVMs are typically more sensitive, while decision trees
and gradient boosting models are relatively more robust. However, it is important
to note that the impact of imbalance can still vary based on the severity of class
imbalance, data set characteristics, and other factors. Balancing the data can help
improve the model's performance, especially when the class imbalance is
substantial or when the minority class is of particular interest.
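
For the more sensitive algorithms listed above, resampling is not the only remedy: many implementations also accept class weights that raise the penalty for misclassifying the minority class. The minimal Python sketch below shows this option for scikit-learn's logistic regression; it is an alternative to, not a replacement for, the balancing approach in Program 3.18, and it assumes predictors X and target y prepared as elsewhere in this chapter.

from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights observations inversely to their class frequencies,
# so the minority class contributes more to the loss without resampling the data
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X, y)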

Program 3.18 demonstrates the process of data balancing, specifically achieving a


50-50 split between positive and negative cases through a technique known as
undersampling. By selecting all the positive cases and randomly sampling an equal
number of negative cases, the resulting balanced data set represents an equal
representation of both classes. This approach helps address the issue of class
imbalance, ensuring that both classes have equal influence during model training
and evaluation. By employing undersampling, we can mitigate the impact of the
majority class and improve the performance of machine learning algorithms in
handling imbalanced data.

Program 3.18: Data Set Balancing – 50/50 Split

R Programming:

library(dplyr)

# Separate positive and negative cases
positive_cases <- loan_data %>% filter(bad == 1)
negative_cases <- loan_data %>% filter(bad == 0)

# Sample the negative cases to match the number of positive cases
set.seed(42)
negative_sampled <- negative_cases %>%
    sample_n(nrow(positive_cases), replace = FALSE)

# Combine the positive and sampled negative cases
balanced_data <- bind_rows(positive_cases, negative_sampled)

# Check frequency distribution of the target variable
table(balanced_data$bad)

Python Programming:

import pandas as pd
from sklearn.utils import resample

# Separate positive and negative cases
positive_cases = loan_data[loan_data['bad'] == 1]
negative_cases = loan_data[loan_data['bad'] == 0]

# Sample the negative cases to match the number of positive cases
negative_sampled = resample(negative_cases, replace=False,
                            n_samples=len(positive_cases), random_state=42)

# Concatenate the positive and sampled negative cases
balanced_data = pd.concat([positive_cases, negative_sampled])

# Check frequency distribution of the target variable
balanced_data['bad'].value_counts()

SAS Programming:

/* Separate the positive and negative cases */
DATA pos_cases neg_cases;
    SET loan_data;
    IF bad = 1 THEN OUTPUT pos_cases;
    ELSE OUTPUT neg_cases;
RUN;

/* Get the number of positive cases */
PROC SQL NOPRINT;
    SELECT count(*) INTO :pos_count
    FROM pos_cases;
QUIT;

/* Randomly sample negative cases to match the number of positive cases */
PROC SURVEYSELECT DATA=neg_cases OUT=neg_sample METHOD=srs
    SAMPSIZE=&pos_count;
RUN;

/* Combine the positive and sampled negative cases */
DATA balanced_data;
    SET pos_cases neg_sample;
RUN;

/* Check the frequency distribution of the target variable */
PROC FREQ DATA=balanced_data;
    TABLES bad;
RUN;

Table 3.12 below shows the SAS output of frequency distribution for the balanced
data set.

Table 3.12: Data Set Balancing Freq Dist. – SAS Output

Balanced Data Set - 50/50 Split

bad    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0           9897         50                    9897                    50
1           9897         50                   19794                   100

Chapter 3: Creating a Modeling Data Set – Conclusion and Transition

In this chapter, you’ve learned how to transform raw data into a structured
modeling data set, a critical step that involves selecting and engineering features,
handling missing data, and preparing the data for analysis. These tasks ensure that
the data you input into your models is clean, relevant, and ready to produce reliable
results.

Now that you have a well-prepared modeling data set, the next step is to
understand how to use this data set within a model pipeline. A model pipeline is a
sequence of steps that takes your data through various transformations, modeling
processes, and evaluations. This pipeline is essential for ensuring that your models
are built, tested, and deployed in a systematic and efficient manner.

In Chapter 4, we will explore the components of a model pipeline in detail. You’ll


learn how to set up a pipeline in SAS, Python, and R, integrate your data with the
appropriate machine-learning algorithms, and perform the necessary steps to fine-
tune and validate your models. This chapter will provide you with a comprehensive
understanding of how to streamline your workflow, from data input to model
output.

So, with your modeling data set ready, let’s proceed to Chapter 4 and delve into the
intricacies of constructing an effective model pipeline that will serve as the
backbone of your data science projects.

Chapter 3 Summary: Creating a Modeling Data Set

1. Introduction to Real-World Data

● Overview: This chapter begins by contrasting academic data sets with


real-world data, highlighting the complexities and challenges of the
latter. Unlike the clean and structured data often used in educational
examples, real-world data is messy and requires significant preparation
before it can be used for modeling.

● Context: The chapter sets the stage for the creation of a modeling data
set, emphasizing the importance of data preparation, manipulation, and
transformation in the data science workflow.

2. The Art and Science of Data Science

● Art and Science: The chapter explores the dual nature of data science as
both an art and a science. It discusses how data scientists must combine
technical skills with creativity and domain expertise to make informed
decisions throughout the data science process.

● Critical Decision Points: The chapter identifies key decision points in


data science, such as handling missing values, managing outliers,
selecting date ranges, and choosing modeling algorithms. These
decisions are part of the "art" of data science, requiring intuition and
domain knowledge.

3. Project Overview and Business Problem Definition

● Lending Club Data Set: The Lending Club data set is reintroduced as the
primary data set for the ongoing project. The chapter formally defines
the business problem – designing a predictive model to calculate the
probability of a borrower defaulting on a loan.

● Target Variable Creation: The chapter discusses the importance of the


target variable in modeling and how it defines the objective of the
analysis. The Lending Club data set's "loan_status" field is used to create
a binary target variable, "bad," indicating whether a loan is at risk of
default.

4. Target Variable Analysis

● Importance of Stability: The chapter emphasizes the importance of


examining the stability of the target variable over time. It introduces the
concept of "runway time," the period it typically takes for an event (like
loan default) to occur.

● Limiting the Data Set: The chapter describes how to limit the data set to
a specific date range to ensure stability and reliability in the target
variable, resulting in a more manageable and accurate data set.

5. Exploratory Data Analysis (EDA)

● Overview of EDA: EDA is presented as a critical phase in data


preparation, allowing data scientists to understand the data's
distribution, central tendencies, and relationships between variables.
The chapter provides examples of summary statistics and visualizations,
such as histograms, to explore the data.

● Outlier Detection: The chapter discusses how to identify and handle


outliers, introducing techniques like Winsorizing and the Interquartile
Range (IQR) method. It emphasizes the impact of outliers on model
performance and the importance of addressing them appropriately.

6. Feature Selection and Engineering

● Importance of Feature Selection: Feature selection is presented as a


crucial step in data science, helping to reduce overfitting, improve
model performance, and enhance interpretability. The chapter
introduces techniques like correlation analysis, wrapper methods, and
embedded methods for feature selection.

● Feature Engineering: The chapter also discusses feature engineering,


which involves creating new features from existing ones to improve
model performance. Techniques such as feature scaling, encoding
categorical variables, and creating feature interactions are explored.

7. Comparison and Choosing the Right Methods

● Overview: The chapter concludes with a comparative analysis of the


various methods and techniques discussed, helping readers choose the
most appropriate approaches for their specific project needs.

● Looking Ahead: The chapter prepares readers for the next steps in the
data science process, which involve building and evaluating predictive
models using the prepared data set.

Chapter 3 Quiz

Questions:

1. What are the key differences between academic data sets and real-world
data sets?

2. Explain the importance of the target variable in a data science project.

3. How does the "art" of data science influence the decision-making process
during data preparation?

4. What is the business problem defined for the Lending Club data set in this
chapter?

5. How is the "loan_status" field in the Lending Club data set used to create
the target variable "bad"?

6. Why is it important to examine the stability of the target variable over time?

7. What is "runway time," and how does it affect the analysis of the target
variable?

8. Describe the process of limiting the data set to a specific date range and its
impact on the modeling process.

9. What is Exploratory Data Analysis (EDA), and why is it important in data


science?

10. How can summary statistics and visualizations like histograms help in
understanding the data set?

11. What are outliers, and why is it important to address them in data science?

12. Explain the Winsorizing technique for handling outliers.

13. What is the Interquartile Range (IQR) method, and how is it used to identify
outliers?

14. Why is feature selection important in machine learning, and what are its
benefits?

15. How can correlation analysis be used as a feature selection method?

16. What are wrapper methods in feature selection, and how do they differ
from filtering methods?

17. Explain the concept of embedded methods in feature selection and their
implementation.

18. What is feature engineering, and how can it improve model performance?

19. Describe the process of feature scaling and its importance in machine
learning.

20. How can encoding categorical variables impact the performance of a


machine learning model?

Chapter 3 Cheat Sheet

Target Variable Creation
    SAS:    Use IF-THEN statements in DATA steps to create binary target variables.
            Example: if loan_status = "Charged Off" then bad = 1; else bad = 0;
    Python: Use np.where() to create binary target variables.
            Example: df['bad'] = np.where(df['loan_status'] == 'Charged Off', 1, 0)
    R:      Use ifelse() to create binary target variables.
            Example: data$bad <- ifelse(data$loan_status == "Charged Off", 1, 0)

Exploratory Data Analysis (EDA)
    SAS:    PROC MEANS for summary statistics; PROC UNIVARIATE for histograms and outlier detection.
    Python: df.describe() for summary statistics; matplotlib and seaborn for histograms and visualizations.
    R:      summary() for summary statistics; hist() and boxplot() for visual exploration.

Outlier Detection
    SAS:    Use PROC UNIVARIATE to identify outliers; Winsorizing and the IQR method for handling outliers.
    Python: Use np.percentile() to identify outliers; implement Winsorizing and IQR methods manually or with custom functions.
    R:      Use quantile() and custom functions for identifying and handling outliers using the IQR method.

Feature Selection
    SAS:    PROC CORR for correlation analysis; PROC GLMSELECT for implementing feature selection methods.
    Python: pandas .corr() for correlation analysis; sklearn.feature_selection.RFE for wrapper-based feature selection.
    R:      cor() for correlation analysis; the caret package for feature selection with various methods.

Feature Engineering
    SAS:    Use PROC STANDARD for feature scaling; create interaction terms with DATA steps.
    Python: StandardScaler and MinMaxScaler from sklearn for scaling; use PolynomialFeatures for creating interaction terms.
    R:      Use scale() for feature scaling; create interaction terms with the poly() function.

Chapter 4: Model Pipeline

Overview

In data science and machine learning, a model pipeline is a critical concept that
ensures the systematic and repeatable processing of data, leading to the
development of robust models.

A model pipeline is a sequence of steps, each representing a crucial stage in the


machine learning process. These steps typically include data preprocessing, model
development, performance evaluation, and hyperparameter tuning. The pipeline
concludes with the model being applied to new data for final performance
evaluation.

The importance of a model pipeline lies in its ability to provide consistency, efficiency, and repeatability in machine learning projects. It allows for the automation of repetitive tasks, ensuring that each step is performed consistently across different data sets or iterations. This consistency is vital in maintaining the integrity of the results and ensuring that the insights drawn from the model are reliable.

Each step in the model pipeline plays a pivotal role:

● Data preprocessing prepares the raw data for analysis and modeling. It
includes steps such as outlier treatment, missing value imputation, and
categorical variable encoding.
● Model development involves selecting an appropriate algorithm and
training it on the preprocessed data.
● Hyperparameter tuning optimizes the parameters of the model to improve
its performance.
● Performance evaluation assesses how well the model performs on unseen
data.

By understanding and implementing these concepts, you'll be well-equipped to handle any data science project efficiently and precisely.
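
Chapter 4 walks through each of these stages in SAS, Python, and R. As a quick preview of the idea, a bare-bones Python sketch of such a pipeline using scikit-learn might look like the following; it assumes numeric predictors X and a binary target y, and the specific steps and settings are illustrative placeholders rather than the configuration used later in the book.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Chain preprocessing and model development into a single, repeatable object
pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),   # data preprocessing
    ('scale', StandardScaler()),                    # data preprocessing
    ('model', LogisticRegression(max_iter=1000)),   # model development
])

# Hold out data for performance evaluation, then fit every step consistently
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))   # performance evaluation on unseen data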

Data Preparation

In the previous chapter, we explored the process of creating a modeling data set,
focusing on techniques such as outlier treatment, feature engineering, missing value
imputation, and data balancing. These steps are essential to preparing our data for
predictive modeling and ensuring we have a clean and reliable data set. In this
section, we will outline the specific techniques we applied to the Lending Club data
set, setting the foundation for the predictive models that we will develop in the
following sections.

1. Outlier Treatment: We employed the Winsorizing method to handle outliers,


which involves capping and flooring extreme values. By setting upper and
lower limits, we effectively minimize the impact of outliers on our predictive
models, ensuring more robust and accurate results.

2. Feature Engineering: Feature engineering is a crucial step in predictive


modeling, where we transform and create new features to enhance the
predictive power of our models. In the case of the Lending Club data set, we
used polynomials to capture non-linear relationships between variables.
Additionally, we explored feature interactions to uncover synergistic effects
between variables, enabling us to capture more nuanced patterns and
improve model performance.

3. Missing Value Imputation: Missing values can pose challenges in predictive


modeling, leading to biased results or reduced model performance. We
addressed this by employing suitable imputation techniques to infer missing
values in the data set. By leveraging statistical methods or predictive
models, we were able to fill in the gaps and ensure a complete and
informative data set for modeling.

4. Categorical Variable Encoding: Categorical variables are common in real-


world data sets and require appropriate encoding for predictive modeling.
In our approach, we created dummy variables for the categorical variables
in the Lending Club data set. This technique allows us to represent
categorical information in a numeric format, facilitating the inclusion of
these variables in our models.

5. Data Balancing: Class imbalance is often observed in binary classification


problems, where one class significantly outweighs the other. We employed
data balancing techniques to ensure that the dominant class does not skew
model performance. By carefully balancing the positive and negative cases
in the target variable, we create a more representative data set, allowing
our models to learn effectively from both classes.
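
As noted in item 1 above, here is a minimal Winsorizing sketch in Python; the column name and the 5th/95th percentile cutoffs are illustrative.

# Winsorizing sketch: cap a numeric column at its 5th and 95th percentiles.
import numpy as np
import pandas as pd

df = pd.DataFrame({'annual_inc': [18000, 42000, 55000, 61000, 75000, 980000]})  # toy data
lower, upper = np.percentile(df['annual_inc'], [5, 95])
df['annual_inc_w'] = df['annual_inc'].clip(lower=lower, upper=upper)
print(df)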

Order of Operations

While the specific order in which you should perform these adjustments may vary
depending on the data set and the particular problem at hand, there is a generally
recommended order for performing these data treatments to optimize the data set:

1. Outlier Treatment: It is generally advisable to start by addressing outliers in


the data. Outliers can significantly impact the distribution and statistical
properties of the variables. Removing or capping outliers ensures that
extreme values do not disproportionately influence the subsequent steps in
the data processing pipeline.

2. Missing Value Imputation: Once outliers have been addressed, the next
step is to handle missing values. Missing data can be imputed using various
techniques such as mean imputation, median imputation, or more advanced
methods like multiple imputation or predictive modeling-based imputation.
Imputing missing values allows for a more complete data set and avoids
biases caused by omitting incomplete observations.

3. Feature Engineering: Feature engineering can be performed after


addressing outliers and missing values. This step involves transforming and
creating new features to enhance the predictive power of the models.
Techniques such as polynomial features, interaction terms, scaling, log
transformations, or domain-specific transformations can be applied to
capture non-linear relationships and interactions or make the data more
suitable for modeling.

4. Categorical Variable Encoding: Once the numeric variables are processed,


the focus shifts to encoding categorical variables. This step involves
converting categorical variables into a numeric representation suitable for


modeling. Common methods include one-hot encoding, label encoding, or
target encoding, depending on the nature of the categorical variables and
the algorithms being used.

5. Data Balancing: Finally, data balancing techniques can be applied if there is


class imbalance in the target variable. This step addresses situations where
one class is significantly underrepresented compared to the other.
Techniques such as undersampling, oversampling, or synthetic sampling can
create a balanced data set, enabling the models to learn from both classes
equally.

Decision Time: Adapting Your Data Preparation Strategy


It is important to note that the order provided is a general guideline,
and there may be cases where specific considerations or domain
knowledge suggest a different order or a combination of steps.
Flexibility and adapting the order based on the data set's
characteristics and the problem at hand are crucial for optimal data
preparation.
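
Putting the recommended order together, the following sketch walks a toy data frame through the five steps in sequence. The columns, thresholds, and the simple oversampling used for balancing are illustrative choices rather than the only options.

# A minimal sketch of the recommended order of operations on a toy data frame.
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'dti': rng.normal(18, 6, 200),
    'annual_inc': rng.normal(70000, 25000, 200),
    'home_ownership': rng.choice(['RENT', 'OWN', 'MORTGAGE'], 200),
    'bad': (rng.random(200) < 0.15).astype(int),
})
df.loc[rng.random(200) < 0.10, 'home_ownership'] = np.nan   # inject some missing values

# Step 1 - Outlier treatment: winsorize numeric columns at the 5th/95th percentiles
for col in ['dti', 'annual_inc']:
    lo, hi = np.percentile(df[col], [5, 95])
    df[col] = df[col].clip(lo, hi)

# Step 2 - Missing value imputation: fill the categorical column with its mode
df['home_ownership'] = df['home_ownership'].fillna(df['home_ownership'].mode()[0])

# Step 3 - Feature engineering: interaction between dti and annual income
df['dti_x_inc'] = df['dti'] * df['annual_inc']

# Step 4 - Categorical variable encoding: one-hot (dummy) variables
df = pd.get_dummies(df, columns=['home_ownership'])

# Step 5 - Data balancing: oversample the minority class to match the majority
minority = df[df['bad'] == 1]
majority = df[df['bad'] == 0]
balanced = pd.concat([majority,
                      resample(minority, replace=True,
                               n_samples=len(majority), random_state=1)])
print(balanced['bad'].value_counts())

In practice the fitted values from each step (percentile cutoffs, imputation values, dummy columns) should be saved and reapplied to any validation or out-of-time data, a point we return to later in this chapter.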

Data Set Preparation for Regression Models


In predictive modeling, data set preparation plays a vital role in ensuring the
accuracy and reliability of regression models. As regression models aim to estimate
continuous numerical values, treating missing values, outliers, and categorical
variables requires careful consideration to avoid biased or erroneous results. Let's
explore the impact of these factors on regression models and compare it to their
treatment in tree-based models.

1. Missing Values: Missing values can significantly affect regression models.


The presence of missing values leads to incomplete information, potentially
distorting the relationship between predictors and the target variable. In
regression, imputation methods such as mean imputation or multiple
imputation can replace missing values with reasonable estimates. However,
it is crucial to consider the potential biases introduced by imputation and
carefully evaluate the impact of imputed data on regression results.

2. Outliers: Extreme values deviating from most of the data can heavily
influence regression models. In regression analysis, outliers can distort the
estimated relationships between predictors and the target variable, leading
to biased coefficient estimates and compromised model performance.
Addressing outliers through techniques like Winsorization, truncation, or
robust regression helps mitigate their impact on regression models, allowing
for more reliable estimates.

3. Categorical Variables: Categorical variables pose unique challenges in


regression models. Unlike in classification models, where categorical
variables are typically encoded using dummy variables, regression models
require careful treatment to incorporate categorical information effectively.
This involves transforming categorical variables into a numerical format
suitable for regression analysis. Techniques like one-hot encoding, effect
coding, or target encoding can be employed to represent categorical
variables appropriately in regression models (a brief encoding sketch follows this list).
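
As a quick illustration of item 3 above, the snippet below creates dummy variables for a categorical predictor with pandas. Setting drop_first=True keeps one level as the reference category so the dummies are not perfectly collinear with a regression intercept; the columns shown are hypothetical.

# One-hot (dummy) encoding sketch for a regression model.
import pandas as pd

df = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT'],
                   'loan_amnt': [12000, 8000, 20000, 15000]})

# drop_first=True drops one level as the reference category,
# avoiding perfect collinearity with the model intercept
encoded = pd.get_dummies(df, columns=['home_ownership'], drop_first=True)
print(encoded)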

Comparing with Tree-Based Models

Tree-based models, such as decision trees, random forests, or gradient boosting


machines, have built-in mechanisms that handle missing values, outliers, and
categorical variables more naturally. These models can automatically handle missing
values by effectively partitioning the data based on available predictors. Outliers
have minimal impact on tree-based models as they are less sensitive to extreme
values. Additionally, tree-based models can directly handle categorical variables
without the need for explicit encoding, as they can split the data based on different
categories.

However, it is important to note that despite the advantages of tree-based models


in handling certain data characteristics, regression models offer interpretability,
simplicity, and a clearer understanding of the relationship between predictors and
the continuous target variable.

In summary, data set preparation for regression models requires careful


consideration of missing values, outliers, and categorical variables. Imputation
techniques, outlier treatment methods, and appropriate encoding approaches are
essential to ensure accurate and reliable regression results. While tree-based
models have inherent capabilities to handle some of these challenges, regression


models offer unique advantages in understanding the relationships between
predictors and the continuous target variable.

Data Segmentation

Data segmentation, also known as data splitting or partitioning, is a fundamental


step in predictive modeling projects. It involves dividing the available data into
distinct subsets for training, validation, and evaluation purposes. This process is
crucial because it enables us to assess our models' performance and generalization
capabilities accurately. By separating the data into different segments, we can
mimic real-world scenarios and obtain unbiased estimates of how well our models
will perform on unseen data.

The primary goal of data segmentation is to create subsets that are representative
of the underlying population and maintain the integrity of the evaluation process.
Typically, a common approach is to divide the data into three main segments: the
training set, the validation set, and the testing set.

● The training set is used to build and train the model.


● The validation set is employed to fine-tune the model's hyperparameters
and evaluate its performance.
● The testing set serves as an independent data set for the final evaluation.

By segmenting the data, we can detect potential issues such as overfitting, where
the model performs exceedingly well on the training data but fails to generalize to
new data. It allows us to iteratively refine the model, validate its performance on
unseen data, and make necessary adjustments to ensure its reliability.

Furthermore, an additional segment called the out-of-time validation set is often


used in scenarios where time is a critical factor, such as time series data. This
segment simulates the model's future deployment by evaluating its performance
based on data after the training period. This helps us assess how well the model
generalizes over time and ensures that it remains accurate and reliable in real-world
situations.

Data segmentation is a crucial step in predictive modeling projects. It enables us to


accurately evaluate and fine-tune our models, assess their generalization
capabilities, identify potential issues, and ensure they perform well on unseen data.
We can build robust and reliable models that deliver accurate predictions and
insights by following appropriate data segmentation techniques.

Different Types of Data Segmentation

In predictive modeling, the most common and widely used approach for data
segmentation involves dividing the data into three main subsets: the training set,
the validation set, and the testing set. This approach is commonly referred to as the
train-validation-test split. The training set is used to build and train the model, the
validation set is used to fine-tune the model and optimize its parameters, and the
testing set is used for the final evaluation of the model's performance.

On the other hand, the term "out-of-time" validation set is used specifically in time
series data analysis. It refers to a validation set that contains data from a time
period that is later than the training period, simulating real-world scenarios where
the model needs to make predictions on future data. This allows us to evaluate the
model's ability to generalize and make accurate predictions on unseen future data.

Figure 4.1: Data Segmentation

To reconcile the terminology, we can say that the train-validation-test split is the standard approach for general predictive modeling tasks. However, in the context of time series data analysis, an additional segment called the out-of-time validation set is often used to assess the model's performance on future data.

It's important to note that different sources or individuals may use slightly different
terms or variations, but the key idea remains the same: dividing the data into
subsets for training, validation, and evaluation purposes to ensure robust and
reliable model performance.
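
A minimal sketch of the standard split with scikit-learn, assuming a generic data set: train_test_split is called twice, first to carve off the final test (or hold-out) portion and then to divide the remainder into training and validation sets. The proportions and toy data are illustrative.

# Standard train / validation / test split sketch (proportions are illustrative).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({'x1': rng.normal(size=500), 'x2': rng.normal(size=500)})
y = pd.Series(rng.integers(0, 2, 500), name='target')

# First split: hold out 20% for final testing
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Second split: 75/25 of the remainder -> 60% train, 20% validation overall
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 300, 100, 100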

Cross-Validation

Cross-validation is an alternative approach to data segmentation that is particularly


useful when the available data is limited and we want to make the most efficient
use of it. Instead of splitting the data into distinct training, validation, and OOT sets,
cross-validation involves repeatedly dividing the data into subsets and using them
for training and validation in a systematic way.

The most common form of cross-validation is k-fold cross-validation, where the data
is divided into k equally sized folds. The model is trained on k-1 folds and validated
on the remaining fold. This process is repeated k times, with each fold serving as the
validation set exactly once. The model's performance is then averaged across the k
iterations for an overall assessment.

Figure 4.2: K-Fold Cross-validation

(The figure shows five rounds of 5-fold cross-validation: in each round a different fold serves as the test data while the remaining folds serve as the training data.)

Benefits of cross-validation include:

● Cross-validation helps to address the limitation of having a limited amount


of data by maximizing its utility.

● It provides a more comprehensive evaluation of the model's performance


by using different combinations of training and validation data.

● It helps to mitigate the potential bias or variance introduced by a single


train-validation split.

● It provides a more robust estimate of the model's generalization


performance.

Cross-validation is especially beneficial in situations where the data set is small and
more reliable model evaluation is needed. It allows for better estimation of the
model's performance on unseen data and provides insights into its stability across
different training-validation splits. Cross-validation also helps in hyperparameter
tuning, allowing us to assess how well the model generalizes with different
parameter settings.

While cross-validation is a powerful technique, it can be computationally expensive,


especially for large data sets or complex models. Due to computational constraints,
a train-validation-OOT split may still be preferred. The choice between cross-
validation and train-validation-OOT split depends on the specific context, available
data, computational resources, and the importance of model evaluation accuracy.
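
The following sketch runs 5-fold cross-validation with scikit-learn on synthetic data. StratifiedKFold keeps the class mix similar across folds, and cross_val_score returns one AUC per fold; the estimator and settings are illustrative.

# 5-fold cross-validation sketch: one AUC score per fold, then the average.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='roc_auc')

print('Fold AUCs:', np.round(scores, 3))
print('Mean AUC :', round(scores.mean(), 3))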

Comparison of Data Segmentation Techniques

Data segmentation techniques play a crucial role in model development and


evaluation. Two common approaches for data segmentation are the standard
development/validation/out-of-time split and cross-validation. Let's explore the
benefits of each approach:

● Standard Development/Validation/Out-of-Time Split:


- In this approach, the data set is divided into three distinct subsets: a
development set, a validation set, and an out-of-time set.

- The development set is used to train and fine-tune the model


parameters. The validation set is used to assess the performance
and make adjustments to the model. The out-of-time set, typically
representing future data, is used for final evaluation to simulate


real-world performance.

Benefits:
- Simple and straightforward to implement.

- Mimics the real-world scenario where the model is trained on


historical data and evaluated on future data.

- Allows for assessing the model's generalization ability to unseen


data.

● Cross-Validation:
- Cross-validation involves partitioning the data set into multiple
subsets (folds) and iteratively using different subsets for training
and testing.

- Common types of cross-validation include k-fold cross-validation


and stratified k-fold cross-validation.

Benefits:
- Uses the entire data set for both training and testing, which can
result in more robust model evaluation.

- Reduces the dependence on a single train-test split and provides a


more reliable estimate of model performance.

- Cross-validation is helpful when the data set size is limited, as it


maximizes the use of available data for evaluation.

The choice between these techniques depends on various factors:

- Data Availability: If the data set is large enough, standard


development/validation/out-of-time split can be a suitable choice.
However, if the data set is small, cross-validation allows for better
utilization of available data.
- Time Dependence: If the data is strongly time-dependent, such as
changing patterns or evolving dynamics, the standard split with an
out-of-time set is preferred to assess the model's performance on


future data.

- Computational Resources: Cross-validation can be computationally


expensive, especially with large data sets or complex models. In
such cases, the standard split may be more practical.

Decision Time: Choosing Your Segmentation Strategy


It is worth noting that both approaches have pros and cons, and the
choice ultimately depends on the project's specific requirements and
constraints. In practice, it is often beneficial to explore and compare
the results obtained from different segmentation techniques to gain
a comprehensive understanding of the model's performance and
generalizability.

Data Segmentation Approach for the Lending Club Data Set

Given that the Lending Club data set is a time series with a clear temporal order, it is
generally more appropriate to use a development, validation, and out-of-time split
rather than cross-validation. Time series data often exhibit temporal dependencies,
meaning past observations can influence future observations. Therefore, it is
essential to maintain the temporal order in the data segmentation to reflect the
real-world scenario.

The development, validation, and out-of-time split approach enables you to


simulate the model's real-world deployment. The model is trained on historical
data, validated on a separate set, and finally evaluated on future data (out-of-time
set). This setup helps assess the model's ability to generalize and make predictions
on unseen data, which closely resembles the intended use case.

Warning: Risks of Temporal Data Leakage in Cross-Validation


On the other hand, traditional cross-validation techniques may not
preserve the temporal order and can introduce leakage of future
information into the training process. This can lead to over-optimistic
performance estimates and poor model performance when applied to
real-world scenarios.

Therefore, for time series data like the Lending Club data set, it is recommended to
use a development, validation, and out-of-time split, ensuring that the data subsets
are ordered chronologically. This approach allows for a more realistic evaluation of
the model's performance and helps capture the temporal dynamics in the data.
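
A short, hedged illustration of such a chronological split is shown below; the cutoff dates and the issue_d column mirror the ones used in the full pipeline code later in this chapter, and the toy data frame simply stands in for the Lending Club data.

# Chronological split sketch: partition by issue date instead of random sampling.
import pandas as pd

loan_toy = pd.DataFrame({'issue_d': ['2016-12-01', '2017-02-15', '2017-05-10', '2017-06-20'],
                         'bad': [0, 1, 0, 1]})

oot_mask = (loan_toy['issue_d'] >= '2017-04-01') & (loan_toy['issue_d'] <= '2017-06-30')
development = loan_toy[~oot_mask]   # earlier loans: used for training and validation
out_of_time = loan_toy[oot_mask]    # later loans: held out for final evaluation
print(len(development), len(out_of_time))

If train_test_split is later used to carve a validation set out of the development window, one option is to sort by date and pass shuffle=False so that split remains chronological as well.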

Modeling Pipeline

A modeling pipeline is a systematic and efficient approach used in data science to


streamline the process of building and deploying predictive models. It is a structured
framework that guides data scientists through the various steps required for data
preparation, feature engineering, model training, and evaluation. Organizing these
steps into a pipeline makes it easier to manage complex workflows, maintain
consistency, and improve the reproducibility of results.

The primary goal of a modeling pipeline is to simplify and automate repetitive tasks,
reducing the risk of human error and saving valuable time. It ensures that data
manipulation and model building processes are consistent across different iterations
of model development and across different projects. As a result, the pipeline
enhances the scalability of the modeling process, allowing data scientists to apply
the same methodology to different data sets and projects efficiently.

Moreover, a modeling pipeline is highly reusable, meaning that once it is


established, it can be applied to similar problems with minimal adjustments. This
reusability encourages best practices and standardization across the organization,
fostering collaboration among data scientists and facilitating knowledge sharing.

A modeling pipeline is a powerful tool in the data science toolkit that empowers
data scientists to efficiently create robust and reliable predictive models. Providing a
clear and organized framework for the model-building process enables data
scientists to focus on extracting insights and driving value from data, ultimately
leading to more informed and impactful decision-making.

Model Pipeline Steps

This chapter will employ a well-structured approach to developing predictive


models, ensuring clarity and consistency throughout the process. To achieve this,
we will first divide the data set into two key parts: the development data and the
out-of-time data. The development data will be used for data preparation, including
outlier treatment, missing value imputation, feature engineering, and other
necessary manipulations. Once the development data is ready, we will further split
it into a development and validation data set.

On the other hand, the out-of-time (OOT) data will remain untouched until the
modeling process is complete. We will set it aside, preserving its integrity to serve as
an independent data set for final model evaluation. After conducting model training,
hyperparameter tuning, and model selection using the development and validation
data, we will focus on the final model evaluation phase.

Once we have identified the champion model, we will apply the same imputation
pipeline used on the development data to the out-of-time data. After scoring the
out-of-time data with the final model, we will carefully evaluate its performance
using various metrics to ensure robustness and accuracy.

Figure 4.3 below shows a graphical representation of the overall machine learning
workflow.

Figure 4.3: Machine Learning Workflow

This systematic approach allows us to maintain the integrity of the out-of-time data,
ensuring that our final evaluation is free from data leakage or bias. By adhering to a
clear sequence of steps, we can confidently build predictive models and assess their
real-world performance with the utmost accuracy and reliability.

Let's break it down step-by-step:

1. Data Split: Split the entire data set into two parts: the model-building data
set and the out-of-time data set (usually representing future data).

2. Data Preparation on Model-Building Data: On the developmental data set,


perform data imputations, outlier treatment, missing value imputations,
feature engineering, and categorical variable encoding. This is where you
manipulate and prepare the data for model training.

3. Data Segmentation for Model-Building Data: After preparing the model-


building data set, split it into development and validation data sets. This is
generally performed at a 70/30 or an 80/20 ratio, depending on how much
data you have and how robust you need your model to be. The
development data set will be used for model training, while the validation
data set will be used for model evaluation and hyperparameter tuning.

4. Model Training and Selection: Build and train different models on the
development data set. Evaluate their performance on the validation data
set and select the best-performing model as the champion model.

5. Hyperparameter Tuning: Fine-tune the champion model's hyperparameters


using the validation data set to optimize its performance.

6. Data Preparation on Out-of-Time Data: Now, take the out-of-time data set
(which has not been touched yet) and apply the same data preparation
steps that were performed on the model-building data set. This includes
imputations, outlier treatment, feature engineering, and categorical
variable encoding.

7. Model Scoring and Evaluation: Once the out-of-time data set is prepped,
use the champion model to score the data and generate predictions.
Evaluate the model's performance on the out-of-time data set using the
desired performance metrics.

Following this process ensures that the model is trained and evaluated on separate
data sets, preventing data leakage and providing a more accurate representation of
how the model will perform on unseen data. The out-of-time data set serves as a
true test set, simulating how the model would perform in a real-world scenario
where future data becomes available after model deployment.

Model Pipeline Code

This section will demonstrate the implementation of each step in the data science
modeling pipeline using SAS, Python, and R programming languages. Each language
has its own strengths and unique features that make it well-suited for certain tasks
in the modeling pipeline. By providing code examples in all three languages, we can
compare and contrast their approaches and gain a deeper understanding of the
concepts and techniques discussed in the modeling pipeline workflow.

Warning: Complete Model Pipeline Code Available on GitHub


In light of the extensive nature of the model pipeline code for SAS,
Python, and R, it is not included directly in the book. However, this
does not limit access to these valuable resources. The complete code is
readily available for review and use in a dedicated GitHub repository.
The full model pipeline code for each of the three programming
languages can be found at:

https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide

Although the code for all three programming languages could not be included directly in the text due to space limitations, Program 4.1 shows the model pipeline code for the Python programming language. This program provides an example of a complete model pipeline.

Program 4.1: Model Pipeline – Python Code

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt

# Load the data and split into developmental and OOT data set
#loan_data = pd.read_csv('loan_data.csv')
oot_start_date = '2017-04-01'
oot_end_date = '2017-06-30'
mask = (loan_data['issue_d'] >= oot_start_date) & (loan_data['issue_d'] <= oot_end_date)
X_oot = loan_data[mask].drop('bad', axis=1)
y_oot = loan_data[mask]['bad']
X_dev = loan_data[~mask].drop('bad', axis=1)
y_dev = loan_data[~mask]['bad']

# Identify numeric and categorical features
numeric_features = ['loan_amnt', 'int_rate', 'annual_inc', 'dti',
                    'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
                    'mort_acc', 'pub_rec_bankruptcies']
categorical_features = ['emp_length', 'term', 'grade', 'sub_grade',
                        'home_ownership', 'verification_status', 'purpose',
                        'application_type']

# Perform data imputations on developmental data set
# Missing value imputation using median for numeric features and
# most frequent for categorical features
imp_median = SimpleImputer(strategy='median')
imp_most_frequent = SimpleImputer(strategy='most_frequent')
X_dev[numeric_features] = imp_median.fit_transform(X_dev[numeric_features])
X_dev[categorical_features] = imp_most_frequent.fit_transform(X_dev[categorical_features])

# Winsorization of outliers at 5% and 95% thresholds for numeric features only
for col in numeric_features:
    lower, upper = np.percentile(X_dev[col], [5, 95])
    X_dev[col] = np.where(X_dev[col] < lower, lower, X_dev[col])
    X_dev[col] = np.where(X_dev[col] > upper, upper, X_dev[col])

# Apply one-hot encoding to categorical features
encoder = OneHotEncoder(handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_dev[categorical_features])
X_dev_encoded = pd.concat([X_dev[numeric_features],
                           pd.DataFrame(X_cat_encoded.toarray(), index=X_dev.index)],
                          axis=1)

# Feature engineering: create interaction variable between dti and annual_inc
X_dev_encoded['DTI_INC_interaction'] = X_dev['dti'] * X_dev['annual_inc']

# Split developmental data set into development and validation data sets (80/20 ratio)
X_train, X_val, y_train, y_val = train_test_split(X_dev_encoded, y_dev, test_size=0.2)

# Make sure that the target variable is numeric
y_val = y_val.astype(int)
y_train = y_train.astype(int)

# Develop machine learning models on development data set
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC(probability=True),
    'Gradient Boosting Machine': GradientBoostingClassifier(),
    'Neural Network': MLPClassifier()
}
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_val)[:, 1]
    fpr, tpr, _ = roc_curve(y_val, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    auc_scores[name] = roc_auc

# Plot AUC for each model
plt.bar(range(len(auc_scores)), list(auc_scores.values()), align='center')
plt.xticks(range(len(auc_scores)), list(auc_scores.keys()), rotation=45)
plt.ylabel('AUC')
plt.show()

# Select champion model based on highest AUC on validation data set
champion_model_name = max(auc_scores, key=auc_scores.get)
champion_model = models[champion_model_name]
print(f'Champion model: {champion_model_name}')

# Tune hyperparameters of champion model using grid search
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
grid_search = GridSearchCV(champion_model, param_grid=param_grid)
grid_search.fit(X_train, y_train)
champion_model_tuned = grid_search.best_estimator_

# Apply champion model to OOT data set and calculate performance statistics
X_oot_encoded = pd.concat([X_oot[numeric_features],
                           pd.DataFrame(encoder.transform(X_oot[categorical_features]).toarray(),
                                        index=X_oot.index)],
                          axis=1)
X_oot_encoded['DTI_INC_interaction'] = X_oot['dti'] * X_oot['annual_inc']
y_oot_pred_proba = champion_model_tuned.predict_proba(X_oot_encoded)[:, 1]
fpr_oot, tpr_oot, _ = roc_curve(y_oot, y_oot_pred_proba)
roc_auc_oot = auc(fpr_oot, tpr_oot)
print(f'OOT AUC: {roc_auc_oot:.3f}')
y_oot_pred = champion_model_tuned.predict(X_oot_encoded)
cm = confusion_matrix(y_oot, y_oot_pred)
print(f'OOT Confusion Matrix:\n{cm}')

# Create a table with performance metrics for each ML model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make sure that the target variable is numeric
y_oot = y_oot.astype(int)

# Impute missing values in numeric features of OOT data set
X_oot[numeric_features] = imp_median.transform(X_oot[numeric_features])

# Winsorization of outliers at 5% and 95% thresholds for numeric features only
for col in numeric_features:
    lower, upper = np.percentile(X_dev[col], [5, 95])
    X_oot[col] = np.where(X_oot[col] < lower, lower, X_oot[col])
    X_oot[col] = np.where(X_oot[col] > upper, upper, X_oot[col])

# Apply one-hot encoding to categorical features
X_oot_encoded = pd.concat([X_oot[numeric_features],
                           pd.DataFrame(encoder.transform(X_oot[categorical_features]).toarray(),
                                        index=X_oot.index)],
                          axis=1)
X_oot_encoded['DTI_INC_interaction'] = X_oot['dti'] * X_oot['annual_inc']

# Impute missing values in categorical features of OOT data set
X_oot[categorical_features] = imp_most_frequent.transform(X_oot[categorical_features])

# Create a DataFrame to store performance statistics for each model
performance_df = pd.DataFrame(columns=['Model', 'AUC', 'Accuracy',
                                       'Precision', 'Recall', 'F1 Score'])

# Calculate AUC for champion model on OOT data set
y_oot_pred_proba = champion_model_tuned.predict_proba(X_oot_encoded)[:, 1]
fpr_oot, tpr_oot, _ = roc_curve(y_oot, y_oot_pred_proba)
roc_auc_oot = auc(fpr_oot, tpr_oot)

# Calculate performance statistics for each model
for name, model in models.items():
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    if np.sum(y_pred) == 0:
        precision = 0
        f1 = 0
    else:
        precision = precision_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)
    performance_df = performance_df.append({'Model': name, 'AUC': auc_scores[name],
                                            'Accuracy': accuracy, 'Precision': precision,
                                            'Recall': recall, 'F1 Score': f1},
                                           ignore_index=True)

# Check if champion_model_tuned is defined
if 'champion_model_tuned' not in globals():
    champion_model_tuned = champion_model

# Preprocess OOT data set before applying champion model to it
X_oot_encoded = pd.concat([X_oot[numeric_features],
                           pd.DataFrame(encoder.transform(X_oot[categorical_features]).toarray(),
                                        index=X_oot.index)],
                          axis=1)
X_oot_encoded['DTI_INC_interaction'] = X_oot['dti'] * X_oot['annual_inc']

# Calculate performance statistics for champion model on OOT data set
y_oot_pred = champion_model_tuned.predict(X_oot_encoded)
accuracy_oot = accuracy_score(y_oot, y_oot_pred)
if np.sum(y_oot_pred) == 0:
    precision_oot = 0
    f1_oot = 0
else:
    precision_oot = precision_score(y_oot, y_oot_pred)
    f1_oot = f1_score(y_oot, y_oot_pred)
recall_oot = recall_score(y_oot, y_oot_pred)
performance_df = performance_df.append({'Model': f'{champion_model_name} (OOT)',
                                        'AUC': roc_auc_oot, 'Accuracy': accuracy_oot,
                                        'Precision': precision_oot, 'Recall': recall_oot,
                                        'F1 Score': f1_oot},
                                       ignore_index=True)

# Display the performance table
display(performance_df)

Step-by-step walkthrough of the model pipeline code:

Prepare Data

● Data Ingestion: The code reads in the loan data and formats the date
variable. It also identifies numeric and categorical features.

● Data Partitioning: The code partitions the data into out-of-time (OOT) and
development data sets based on the date of issue. It further splits the
development data set into training and validation data sets.

● Data Imputation: The code imputes missing values in numeric features


using the median and in categorical features using the mode. It also handles
outliers by winsorizing numeric features at 5% and 95% thresholds.

● One-Hot Encoding: The code applies one-hot encoding to categorical


variables to create dummy variables for each level of the categorical
variables.

● Feature Engineering: The code creates an interaction variable between dti


and annual_inc.

● Prepare OOT Data: All preprocessing steps, including data imputation,


outlier treatment, one-hot encoding, and feature engineering, are applied
to the OOT data to prepare it for scoring.

Build and Train Models

● Model Building: The code builds various machine learning models, including
logistic regression, decision tree, random forest, support vector machine,
gradient boosting machine, and neural network.

● Model Evaluations: The code calculates the AUC value for each candidate
model and compares them in a bar chart. A champion model is selected.

Model Selection and Evaluation

● Model Selection: The code performs hyperparameter tuning for the


champion model by looping through a range of values for each tuning
parameter. The optimal hyperparameters are chosen based on the
validation data set AUC values.

● Model Performance: The code applies the final model with optimal
hyperparameters to score the OOT data set.

Deploying and Monitoring Model

● After final evaluation, the champion model with optimal hyperparameters is


ready to be deployed for scoring new data. Once deployed, it’s important to
continuously monitor its performance to ensure it maintains its predictive
power over time. If a significant drop in performance is observed, it may be
necessary to retrain or even rebuild the model with new data.

This pipeline covers all major steps in a typical machine learning project, from data
preparation to feature engineering, model building, selection, evaluation,
deployment, and monitoring.

Modeling Pipeline Output

The modeling pipeline code results in two pieces of output: a graphic chart and a
table. The graphic chart shows a comparison of the AUC of each of the machine
learning models. AUC, or Area Under the Curve, is a measure of the performance of
a binary classifier. An AUC of 1 indicates a perfect classifier, while an AUC of 0.5
indicates a random classifier. The graphic shows that the Gradient Boosting Machine
model has the highest AUC, indicating that it is the best-performing model among
those tested.

Figure 4.4: AUC Comparison of ML Models



The table displays several performance metrics for each machine learning model,
including AUC, accuracy, precision, recall, and F1 score. These metrics provide a
more detailed view of the performance of each model. Some key points to note
from the table are:

● The Gradient Boosting Machine model has the highest AUC, indicating that
it is the best-performing model among those tested.
● The Neural Network model has the highest recall, indicating that it can
correctly identify a high proportion of positive instances.
● The Logistic Regression and Support Vector Machine models have a
precision of 0, indicating that they cannot correctly identify any positive
instances.

The champion model is the Gradient Boosting Machine model, as it has the highest
AUC on the validation data set. The final data set, which contains the predictions of
the champion model on the OOT data set, is stored in the variable y_oot_pred. This
variable contains the predicted class labels for each observation in the OOT data set.

Table 4.1: Performance Metrics Comparison of ML Models



Processing Time

It’s not uncommon for machine learning models to take a long time to train,
especially when working with large data sets or complex models. Several factors
could contribute to long runtimes in the modeling section of your code. Some
possible reasons include:

● Large data set: If the data set you are working with is very large, the models
may take a long time to process all of the data.
● Complex models: Some of the models you are using, such as Support Vector
Machines, Gradient Boosting Machines, and Neural Networks, can be
computationally expensive and may take longer to train than simpler
models like Logistic Regression or Decision Trees.
● Hyperparameter tuning: If you perform hyperparameter tuning using grid
search, this can significantly increase the runtime as the model needs to be
trained multiple times for each combination of hyperparameters.

To speed up the modeling process, you could try some of the following approaches:

● Reduce the data set size: You could try using a smaller sample of the data
for model training and validation. This can help reduce the runtime while
still providing a representative sample of the data for model development.
● Simplify the models: You could try using simpler models that are less
computationally expensive to train. For example, you could try using Logistic
Regression or Decision Tree instead of Support Vector Machine or Neural
Network.
● Use faster hyperparameter tuning methods: Instead of using grid search for
hyperparameter tuning, you could use faster methods such as random
search or Bayesian optimization.

Among the machine learning algorithms used in this pipeline, the one that
typically takes the longest to train is the Support Vector Machine (SVM). SVMs are
known to have high computational complexity, especially when working with large
data sets or when using certain kernel functions. However, the actual training time
can vary depending on factors such as the size and dimensionality of the data set,
the choice of kernel function, and the values of the hyperparameters. Other
algorithms, such as Gradient Boosting Machines and Neural Networks, can also take
a long time to train, but typically not as long as SVM.
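
As one example of the faster tuning options mentioned above, the sketch below swaps grid search for random search, which samples a fixed number of hyperparameter combinations rather than trying every one. The parameter ranges, estimator, and iteration count are illustrative.

# Random search sketch: sample 10 hyperparameter combinations instead of a full grid.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(2, 6),
    'learning_rate': [0.01, 0.05, 0.1],
}
search = RandomizedSearchCV(GradientBoostingClassifier(), param_distributions,
                            n_iter=10, scoring='roc_auc', cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))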

By following this structured approach to model development, we can build robust


and reliable predictive models that not only perform well on unseen data but also
provide valuable insights into our data. The next chapter will delve deeper into
specific machine learning models, exploring their strengths, weaknesses, and
suitable applications.

Chapter 4: Model Pipeline – Conclusion and Transition

In this chapter, we explored the concept of a model pipeline, a structured process


that guides your data from preparation through to the final stages of model
evaluation and deployment. You’ve learned how to set up pipelines in SAS, Python,
and R, ensuring that your workflow is both efficient and reproducible. By now, you
should have a clear understanding of how the various components of a pipeline
work together to streamline the modeling process.

With a solid model pipeline in place, it’s time to dive into the core of predictive
modeling. The next chapters will introduce you to the foundational techniques used
to build predictive models, starting with some of the most widely used algorithms
like linear regression and logistic regression. Understanding these foundational
models is crucial because they form the basis upon which more complex models are
built.

In Chapter 5, we will begin with these fundamental techniques, exploring their


mechanics, applications, and how to implement them across SAS, Python, and R.
This chapter will set the stage for deeper exploration into more advanced modeling
techniques, but first, mastering the basics will ensure you have a strong foundation
to build upon.

So, with your model pipeline ready to go, let’s move on to Chapter 5 and start
building your first predictive models, setting the stage for the advanced techniques
to come.

Chapter 4 Summary: Model Pipeline

1. Introduction to Model Pipelines

● Overview: The chapter introduces the concept of a model pipeline in data


science and machine learning. A model pipeline is a sequence of steps
designed to ensure systematic, consistent, and repeatable data processing,
leading to the development of robust models. The chapter emphasizes the
importance of pipelines in maintaining consistency, efficiency, and
repeatability across different data sets and iterations of a project.

● Context: The chapter sets the stage for understanding how model pipelines
streamline the machine learning process by automating repetitive tasks and
maintaining the integrity of results.

2. Data Preparation

● Outlier Treatment: The chapter explains the importance of addressing outliers in the data. Techniques such as Winsorizing are discussed, where extreme values are capped and floored to minimize their impact on model performance.

● Feature Engineering: Feature engineering is highlighted as a critical step in


enhancing the predictive power of models. The chapter covers techniques
such as creating polynomial features and exploring feature interactions to
capture complex relationships in the data.

● Missing Value Imputation: The chapter discusses various methods for


handling missing data, including mean, median, and predictive model-based
imputations, ensuring a complete and informative data set for modeling.

● Categorical Variable Encoding: The chapter explains the process of


converting categorical variables into a numeric format suitable for
modeling, using techniques like one-hot encoding.

● Data Balancing: Addressing class imbalance in binary classification problems


is discussed, with techniques such as undersampling, oversampling, and
synthetic sampling to create a balanced data set.

3. Order of Operations in Data Preparation

● General Guidelines: The chapter provides a recommended sequence for


data preparation steps: starting with outlier treatment, followed by missing
value imputation, feature engineering, categorical variable encoding, and
finally data balancing. This order ensures that the data is prepared optimally
for modeling.

4. Data Segmentation

● Importance of Data Segmentation: The chapter discusses the necessity of


dividing the data set into training, validation, and testing sets to assess
model performance and generalization capabilities accurately.

● Types of Data Segmentation:

o Train-Validation-Test Split: The standard approach for most


predictive modeling tasks, ensuring robust and reliable model
performance.

o Out-of-Time (OOT) Validation: Specifically used in time series data


to simulate real-world scenarios by evaluating the model on future
data.

● Cross-Validation: The chapter introduces cross-validation as an alternative


approach, particularly useful when data is limited. The k-fold cross-
validation technique is discussed in detail, along with its benefits in
providing a more robust estimate of model performance.

5. Model Pipeline Implementation

● Steps in a Model Pipeline: The chapter outlines the key steps in a model
pipeline, from data splitting, data preparation, model training, and
selection, to model scoring and evaluation. It highlights the importance of
preserving the integrity of out-of-time data to avoid data leakage and


ensure accurate model evaluation.

● Model Pipeline Code: While the chapter discusses the conceptual steps in
building a model pipeline, it references a GitHub repository where readers
can access the complete code examples in SAS, Python, and R.

6. Modeling Pipeline Output

● Evaluation and Output: The chapter covers how the final model’s
performance is evaluated using metrics such as AUC, accuracy, precision,
recall, and F1 score. It also discusses the visualization of model performance
using bar charts and confusion matrices to compare different models.

Chapter 4 Quiz

Questions:
1. What is a model pipeline, and why is it important in machine learning
projects?

2. Explain the role of outlier treatment in data preparation.

3. What is feature engineering, and how can it enhance model performance?

4. Describe different techniques for handling missing values in a data set.

5. What is categorical variable encoding, and why is it necessary for predictive


modeling?

6. How does data balancing address class imbalance in binary classification


problems?

7. What is the recommended sequence for data preparation steps in a model


pipeline?

8. Why is data segmentation critical in assessing model performance and


generalization?

9. Describe the train-validation-test split and its purpose in predictive


modeling.

10. What is out-of-time validation, and when is it most useful?

11. Explain the concept of cross-validation and its benefits in model evaluation.

12. How does k-fold cross-validation work, and why is it beneficial?

13. What are the key steps in a model pipeline?

14. Why is it important to preserve the integrity of out-of-time data during


model evaluation?

15. What metrics are commonly used to evaluate model performance?

16. How does the AUC metric help in assessing the performance of a binary
classifier?

17. Why is it important to compare model performance using visualizations like


bar charts and confusion matrices?

18. Describe the process of hyperparameter tuning in a model pipeline.

19. How does one-hot encoding facilitate the inclusion of categorical variables
in machine learning models?

20. What are the benefits of using a systematic model pipeline in data science
projects?

Chapter 4 Cheat Sheet

Outlier Treatment
- SAS: Use PROC UNIVARIATE for outlier detection; apply Winsorization via DATA steps to cap extreme values
- Python: Use np.percentile() to identify outliers; Winsorization with np.where() or custom functions
- R: Use quantile() and ifelse() for identifying and capping outliers using Winsorization

Missing Value Imputation
- SAS: Use PROC STDIZE for imputation methods (mean, median); PROC MI for multiple imputations
- Python: SimpleImputer from sklearn for mean, median, and mode imputation; IterativeImputer for advanced methods
- R: Use impute() or na.omit() for basic imputation; mice package for multiple imputations

Feature Engineering
- SAS: PROC TRANSREG for polynomial features; use DATA steps for creating interaction terms
- Python: PolynomialFeatures and custom functions for creating new features; FeatureUnion for combining features
- R: poly() for polynomial features; use custom functions or model.matrix() for interaction terms

Categorical Variable Encoding
- SAS: PROC GLMMOD for one-hot encoding; use CLASS statements in modeling procedures
- Python: OneHotEncoder from sklearn for one-hot encoding; pd.get_dummies() for quick encoding
- R: model.matrix() or dummyVars() from caret for one-hot encoding

Data Balancing
- SAS: Use PROC SURVEYSELECT for oversampling/undersampling; create synthetic samples via PROC GENMOD
- Python: RandomOverSampler and SMOTE from imblearn for oversampling and synthetic sampling
- R: ROSE package for oversampling, undersampling, and synthetic sampling

Model Training
- SAS: PROC LOGISTIC, PROC TREESPLIT for training models; use PROC HPFOREST for ensemble models
- Python: LogisticRegression, RandomForestClassifier from sklearn; XGBoost for gradient boosting
- R: glm() for logistic regression; randomForest() for ensemble models

Model Evaluation
- SAS: PROC ROC for AUC calculation; PROC FREQ and PROC MEANS for confusion matrix and performance metrics
- Python: roc_curve and auc from sklearn.metrics for AUC; confusion_matrix, precision_score, recall_score for metrics
- R: pROC package for AUC and ROC; confusionMatrix() from caret for metrics

Chapter 5: Predictive Modeling Part I – Foundation

Overview

Predictive modeling is a cornerstone of data science, allowing us to harness the


capabilities of machine learning to predict outcomes and glean insights from data. In
this chapter, we'll delve into the fundamentals of predictive modeling, focusing on
two pivotal branches: regression and classification models. These models are the
backbone of data-driven decision-making, steering us through the complexities of
real-world scenarios.

At the heart of our exploration are linear and logistic regression, fixed-form models
that are crucial in predicting numerical and categorical outcomes, respectively.
These models provide a structured framework, allowing us to understand the
relationships within our data. On the other hand, we encounter the decision tree, a
non-parametric model that operates without predefined assumptions. The beauty
of non-parametric models lies in their flexibility, enabling them to capture complex
patterns in the data without strict adherence to a predetermined structure.

Understanding the difference between fixed-form and non-parametric models is key


to choosing the right tool for the task at hand. Our journey through these
foundational models aims to demystify their intricacies and equip you with the
knowledge to make informed decisions in your data science endeavors. Let's dive
into the world of predictive modeling and explore its significance in data-driven
insights.

Modeling Data

The significance of the modeling data set cannot be overstated in predictive


modeling. This data set, which has undergone rigorous preprocessing, forms the
basis for the construction of our machine learning models. The data set’s quality
directly impacts the predictive models’ accuracy and reliability, making it an integral
part of the machine learning process.

The model pipeline is a systematic series of transformations applied to raw data to


prepare it for modeling. The data prep commences with outlier detection, a process
that identifies and addresses anomalous values that could potentially distort our
model’s performance. This is followed by missing value imputation, a step that
ensures completeness of information for every observation. The pipeline also
includes the creation of dummy variables, a method that enables the effective
incorporation of categorical data into our models. The final step in the data
preprocessing step is feature engineering. This process involves creating new
variables from existing ones to potentially reveal hidden relationships and enhance
model performance.

The culmination of this comprehensive pipeline process is the final modeling data
set. This data set, which encapsulates all preprocessing steps, is structured to feed
directly into our machine learning models. You can access this data set in the GitHub
repository for this book at:

https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide

As we explore each machine learning model in subsequent chapters, we will use this
data set to develop these models and provide an in-depth explanation of their
workings.

Machine Learning Models

In Chapter 3, we delved into the distinction between regression and classification


models and how the structure of the target variable determines which type of
model to use. The target variable plays a pivotal role in predictive modeling, and its
nature – whether binary, multi-level, or continuous – dictates the choice of model.
For a more detailed discussion on this topic, please refer to Chapter 3.

Given that our modeling data set is based on a binary target variable, we will focus
on building a series of classification models. However, it’s important to note that
each algorithm we discuss can be used for either regression or classification tasks
with only minor modifications to the programming code.

Machine learning algorithms are inherently flexible. For instance, despite its name,
logistic regression is used for binary classification tasks. By changing the function
used to map input features to output predictions, it can be adapted for regression
tasks. Similarly, decision trees can be used for both classification and regression by
altering the criteria for splitting nodes and determining leaf node values. Support
Vector Machines (SVMs), primarily used for classification, can also be used for
regression (Support Vector Regression) by introducing an alternative loss function.
This versatility extends to most machine learning algorithms, making them
adaptable tools in a data scientist’s arsenal.

This chapter will focus on creating the classification model versions of the following
algorithms:

● Logistic Regression
● Decision Tree
● Random Forest
● Gradient Boosting Machine
● Support Vector Machines
● Neural Networks

Each of these models has unique characteristics, advantages, and disadvantages.


They all have specific use cases where they excel. Understanding these models and
knowing when to apply each is a crucial data science skill.

We will start with Logistic Regression, a simple yet powerful algorithm for binary
classification tasks. Then, we will explore Decision Trees and their ensemble
versions: Random Forests and Gradient Boosting Machines. We will then move on
to Support Vector Machines, a robust classifier capable of handling high-
dimensional data. Finally, we will delve into Neural Networks, the backbone of most
modern artificial intelligence applications.

By the end of this chapter, you will have a solid understanding of these algorithms
and be able to implement them in SAS, Python, and R.

Difference Between Linear and Logistic Regression

The distinction between linear and logistic regression primarily lies in the nature of
the dependent variable. Linear regression is predicated on a quantitative numeric
variable, whereas logistic regression is predicated on a qualitative categorical target
variable. Figure 5.1 illustrates a comparison of value distributions for these
modeling types.

Figure 5.1: Linear vs. Logistic Regression Data Distributions

In an example where the target variable is expressed as a percentage, the target


value can assume any value between zero and one for a linear regression model. For
instance, if we consider the target value to represent “the percentage of tickets sold
(Y) given a certain discount rate (X),” this value could be 0%, 31%, 73%, 100%, or any
value in between. Conversely, for a logistic regression model, the target value is
binary. For example, it could represent “sale or no sale,” with no intermediate
values between “sale” (represented as a 1) and “no sale” (represented as a 0) given
a certain discount rate (X).

Attempting to model a binary outcome with a linear regression model could yield
perplexing results, as demonstrated in Figure 5.2.

Figure 5.2: Linear vs. Logistic Models on a Binary Outcome

The linear regression model seeks to minimize the data's Residual Sum of Squares
(RSS), assuming the dependent variable (Y) is a continuous value. This assumption
leads to some issues when applied to classification problems:

1. Probabilities being greater than one or less than zero: As the value of X
approaches 0% or 100%, the predicted value of a purchase either becomes
negative or exceeds one, both of which are outside the [0,1] range.
2. Homoscedasticity: Linear regression assumes that the variance of Y is
constant across values of X. However, for binary models, the variance is
calculated as the product of the proportions of positive values (P) and
negative values (Q), i.e., PQ, which varies across the logistic regression line.
3. Normal distribution of residuals: Linear regression assumes that residuals
are normally distributed. This assumption is difficult to justify, given that
classification problems have binary targets.

For these reasons, it’s advisable not to use linear regression for classification
problems.

Logistic Regression
A more suitable approach for binary outcome variables is to model them as a
probability of the event occurring. This probability is depicted as an S-curved

regression line in Figure 5.2. While the formula for this line differs from linear
regression, some shared elements exist.

Equation 5.1: Linear Regression

Y \approx \beta_0 + \beta_1 X_1

Equation 5.2: Logistic Regression


p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}

The logistic regression equation (Equation 5.2) comprises four main components:

1. p(X) denotes the probability of a positive value (1), which is also the
proportion of 1s. This is equivalent to the mean value of Y in linear
regression.
2. The base value of the natural logarithm (e) is present in the model, as
expected in a logistic model.
3. 𝛽0 represents P when X is zero, similar to the intercept value in linear
regression.
4. 𝛽1 adjusts the rate at which the probability changes with a single unit
change in X, akin to the 𝛽1 value in linear regression.

The logistic regression equation is known as the logistic function, and it’s used
because it produces outputs between 0 and 1 for all values of X and allows the
variance of Y to vary across values of X.

Step-by-Step Construction of a Logistic Regression Model

The process of training a logistic regression model involves several iterative steps. It
begins with initializing the model parameters, which include the weights and biases
for your model. These parameters are then used to calculate the predicted outcome
for each observation in your data set. The discrepancy between these predictions
and the actual outcomes is quantified using a loss function, often binary cross-
entropy loss, in the case of logistic regression. This loss is then used to update the
model parameters using a learning algorithm such as gradient descent. The process

of calculating predictions, computing loss, and updating parameters is repeated


until the model performs well on your training data or until a stopping criterion is
met. This iterative process allows the model to “learn” from the data and improve
its predictions over time.

Let’s delve deeper into each step of developing a logistic regression model:

1. Initialize the Model Parameters: The first step in logistic regression is to


initialize your model's weights (coefficients) and bias (intercept). These
parameters will be updated during the training process. The weights are
typically initialized with small random values, while the bias can be
initialized to zero. The choice of initialization can impact the speed of
convergence and the final model performance.

2. Calculate the Predicted Outcome: Once the parameters are initialized, you
can calculate the predicted outcome for each observation in your data set.
Logistic regression involves calculating the log odds of the outcome using
the logistic function. The logistic function transforms the linear combination
of your predictors (i.e., the dot product of your predictors and weights plus
the bias) into a probability between 0 and 1.

3. Compute the Loss: After calculating the predicted outcomes, you need to
compute the loss, which measures how well your model’s predictions match
the actual outcomes. In logistic regression, we often use binary cross-
entropy loss (also known as log loss), which measures the error between
our predictions and the actual classes. It considers the predicted
probabilities for the actual class and the inverse probabilities for the other
class.

4. Update the Parameters: Once you’ve computed the loss, you can update
your parameters using a learning algorithm such as gradient descent. This
involves computing the derivative (gradient) of the loss concerning each
parameter, which indicates how much changing that parameter would
change the loss. You then adjust each parameter by a small step in the
direction that reduces the loss (i.e., in the opposite direction of the
gradient). The size of this step is determined by your learning rate.

5. Repeat Steps 2–4: You continue iterating through steps 2-4 until your model
performs well on your training data or until a stopping criterion is met (such

as a maximum number of iterations or a minimum change in loss). Each


iteration through these steps is called an epoch.

It’s important to note that these steps represent one cycle of training a logistic
regression model. In practice, you would typically divide your data into batches and
repeat these steps for each batch, updating your parameters after each batch rather
than after each epoch. This approach, known as mini-batch gradient descent, can
lead to faster and more stable convergence.

Remember, logistic regression is an iterative process that gradually improves the


model’s predictions by adjusting its parameters based on the computed loss.
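
To make these steps concrete, the sketch below walks through one training run in plain NumPy: small random initialization, sigmoid predictions, binary cross-entropy loss, and gradient descent updates. The tiny data set, learning rate, and number of epochs are illustrative assumptions, not values taken from this book's project.

import numpy as np

# Illustrative data: 6 observations, 2 features, binary target (made-up values)
X = np.array([[0.5, 1.2], [1.5, 0.3], [2.0, 2.5],
              [0.1, 0.4], [1.8, 1.9], [0.7, 2.2]])
y = np.array([0, 0, 1, 0, 1, 1])

rng = np.random.default_rng(42)
weights = rng.normal(scale=0.01, size=X.shape[1])   # Step 1: small random weights
bias = 0.0                                          # Step 1: bias initialized to zero
learning_rate = 0.1

for epoch in range(1000):
    # Step 2: predicted probabilities via the logistic (sigmoid) function
    z = X @ weights + bias
    p = 1.0 / (1.0 + np.exp(-z))

    # Step 3: binary cross-entropy (log) loss
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Step 4: gradient descent update of weights and bias
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    weights -= learning_rate * grad_w
    bias -= learning_rate * grad_b

# Step 5 is the loop itself: repeat until the loss stops improving or a
# maximum number of epochs is reached
print("Final loss:", round(loss, 4))
print("Weights:", weights, "Bias:", bias)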

Loss Function

In contrast to linear regression, where the β values are determined by minimizing


the cost function of RSS through gradient descent, logistic regression deals with an
S-curved logistic regression line. It lacks a mathematical solution that will minimize
the sum of squares. For logistic regression, the β values are determined through
maximum likelihood.

Maximum likelihood represents a conditional probability of P(Y|X), i.e., the


probability of Y given X. The parameters of the logistic regression function (𝛽0 and
𝛽1 ) are determined by selecting values such that the predicted probability p̂ (𝑥𝑖 ) of a
value for each observation corresponds as closely as possible to the observation’s
actual value.

Therefore, maximum likelihood implies selecting values for 𝛽0 and 𝛽1 that result in a
probability close to one for all ones and a probability close to zero for all zero
values. This approach is termed maximum likelihood because it seeks to maximize
the likelihood (conditional probability) of matching the model estimate with the
sample data.

Odds Ratio

The odds ratio is a fundamental concept in logistic regression. It represents the ratio
of the odds of an event occurring in one group to the odds of it occurring in another
group. In the context of logistic regression, the odds ratio can be calculated as 𝑒 𝛽1 .

This means that for a unit increase in X, we can expect the odds of Y=1 (event
happening) to change by a factor of 𝑒 𝛽1 . This is particularly useful when interpreting
the results of a logistic regression, as it measures how each independent variable
impacts the odds of the dependent variable.

The odds ratio quantifies how a change in an independent variable affects the
dependent variable while holding other variables constant. While the odds ratio can
give us valuable insights, it’s not always easy to interpret, especially for continuous
variables or when there are interactions between variables.

Odds Ratio Example


Let’s say we have a logistic regression model trained on the Lending Club data set,
where the target variable is a binary classifier indicating loan defaults (1 for default,
0 for no default). Let’s consider one of the predictors, say annual_income.

The coefficient (β) associated with annual_income in the logistic regression model
represents the change in log odds of the outcome target variable (“bad”) for a unit
increase in annual_income. To calculate the odds ratio, we take the exponent of this
coefficient, i.e., 𝑒 𝛽1 .

For example, if β for annual_income is -0.03, the odds ratio would be 𝑒 −0.03 = 0.97.
This means that for each additional unit increase in annual_income, we can expect
the odds of a loan default (Y=1) to be multiplied by 0.97, assuming all other
variables are held constant. In other words, higher annual income is associated with
lower odds of loan default.

It’s important to note that this is a multiplicative effect on the odds, not the
probabilities, and it assumes all other variables in the model are held constant. Also,
this interpretation is valid for a unit change in annual_income, so depending on
how annual_income is measured (e.g., dollars, thousands of dollars), this would
affect the interpretation.
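
The arithmetic behind this interpretation is a single exponentiation. The sketch below uses the hypothetical coefficient of -0.03 from the example above; it is illustrative only, not a fitted estimate.

import numpy as np

beta_annual_income = -0.03                        # hypothetical coefficient from the example
print(round(np.exp(beta_annual_income), 2))       # 0.97: odds of default multiplied by ~0.97 per unit increase

# The effect compounds multiplicatively for larger changes, e.g. a 10-unit increase:
print(round(np.exp(10 * beta_annual_income), 3))  # ~0.741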

Remember that while odds ratios can be very informative, they require careful
interpretation and understanding of your data and model.

Link Function

Logistic regression is a generalized linear model (GLM) that uses a logit link function.
The link function provides the relationship between the linear predictor and the
mean of the distribution function. In logistic regression, this is the natural log of the
odds (i.e., logit function). The link function transforms the probability into a linear
combination of the predictors. This transformation allows us to model a binary
response using methods similar to those used in linear regression.

The choice of link function is critical as it determines how we model the relationship
between our predictors and our response variable. The logit link function ensures
that our predicted probabilities stay between 0 and 1 and allows us to interpret our
regression coefficients as changes in log odds.

Link Function Example


The link function in logistic regression is the logit function, which is used to
transform the probability p of the positive class (in this case, loan default) into a
linear function of the predictors. The logit function is defined as follows:

Equation 5.3: Link Function


\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right)
where ln is the natural logarithm. This function transforms the probability p, which
ranges between 0 and 1, into a value ranging from negative to positive infinity.
Let’s say we have a logistic regression model with two predictors: annual_income
and credit_score, with respective coefficients 𝛽1 and 𝛽2 , and an intercept term 𝛽0 .
The linear predictor for this model would be:

Equation 5.4: Link Function Example

\eta = \beta_0 + \beta_1 \times \text{annual\_income} + \beta_2 \times \text{credit\_score}

The link function transforms this linear predictor into a probability using the inverse
of the logit function, which is the logistic function:

Equation 5.5: Logistic Function


p = \frac{1}{1 + e^{-\eta}}

So, if we have a specific borrower with an annual_income of $50,000 and


a credit_score of 700, we would plug these values into our linear predictor η, then
use the logistic function to transform η into a probability p. This probability p
represents the predicted probability of loan default for a borrower with this income
and credit score.

The choice of the logit link function ensures that our predicted probabilities stay
between 0 and 1, no matter the values of our predictors. It also allows us to
interpret our regression coefficients as changes in log odds. For example, 𝛽1
represents the change in log odds of default for each additional dollar of annual
income.
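
A minimal sketch of that calculation is shown below. The coefficient values are made up purely for illustration (they are not fitted estimates), and annual income is expressed in dollars as in the example.

import numpy as np

# Hypothetical coefficients (illustration only, not fitted values)
beta0 = -2.0        # intercept
beta1 = -0.00001    # per dollar of annual income
beta2 = -0.004      # per point of credit score

annual_income = 50_000
credit_score = 700

eta = beta0 + beta1 * annual_income + beta2 * credit_score   # linear predictor (log odds)
p = 1.0 / (1.0 + np.exp(-eta))                               # inverse logit: probability of default

print("Linear predictor (log odds):", round(eta, 3))
print("Predicted probability of default:", round(p, 4))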

Assumptions

Logistic regression does not require a linear relationship between the dependent
and independent variables, which differentiates it from linear regression. However,
it does require that the independent variables are linearly related to the log odds.
This assumption allows us to use a linear combination of predictors to model a
binary outcome.

Logistic regression also assumes that errors are independently and identically
distributed and that there is no perfect multicollinearity among independent
variables. When building a logistic regression model, it’s important to check these
assumptions, as violations can lead to biased or inefficient estimates.

Let’s delve deeper into the assumptions of logistic regression and how they apply to
the Lending Club data set.

1. Linearity of independent variables and log odds: Although logistic


regression does not require the dependent and independent variables to be
related linearly, the independent variables must be linearly related to the

log odds. This means that a unit change in the predictor variable will result
in a constant change in the log odds of the response variable, assuming all
other variables are held constant.

For example, consider a predictor variable from the Lending Club data set,
such as annual_income. The assumption here is that for each additional
dollar of annual income, the log odds of loan default (assuming the target
variable “bad” is coded as 1) changes by a constant amount, given that all
other variables in the model are held constant. This assumption can be
checked by looking at the relationship between each predictor and the log
odds of the response variable. If this assumption is violated, transformations
of the predictors or reparameterization of the model might be necessary.

2. Independence of errors: This assumption implies that the observations


should not influence each other. In other words, the error term of one
observation should not predict the error term of another. This is crucial
because logistic regression uses maximum likelihood estimation (MLE) to
estimate the model parameters, and MLE assumes that the observations are
independent.

In the context of the Lending Club data set, this would mean that the error
term for one loan application does not predict the error term for another
loan application. Violations of this assumption might occur if there is serial
correlation between observations. For instance, if loans are funded based
on time-series data (e.g., loans funded earlier influence those funded later),
this assumption might be violated. If this is the case, techniques such as
time-series or random effects models might be more appropriate.

3. Absence of multicollinearity: Logistic regression assumes that there is no


perfect multicollinearity among independent variables. Multicollinearity is
when two or more independent variables are highly correlated. High
correlation between predictor variables can lead to unstable estimates of
regression coefficients and make it difficult to determine the individual
effects of predictors on the response variable.

For instance, in the Lending Club data set, if annual_income and


employment_length were highly correlated (since income often increases
with years of employment), this could pose a problem for our logistic

regression model. To detect multicollinearity, we can calculate Variance


Inflation Factors (VIFs) or examine correlation matrices. If multicollinearity is
detected, we might need to drop one of the correlated variables or combine
them into a single predictor.

4. Large sample size: Logistic regression requires a large sample size because
maximum likelihood estimates are less powerful at small sample sizes than
ordinary least squares (used in linear regression). A common rule of thumb
is that you need at least 10 cases with the least frequent outcome for each
predictor in your model. For example, if you have 5 predictors and expect
about 10% positive responses, you would need at least 500 observations for
logistic regression.

This assumption is likely to be met in the Lending Club data set, which has
thousands of observations across multiple predictors. However, it is always
good practice to check whether your data set is large enough relative to
your model complexity.

Model Fit

Goodness-of-fit measures for logistic regression include -2 Log-likelihood, Akaike’s


Information Criterion (AIC), and the Hosmer-Lemeshow test. These measures help
assess how well your model fits your data. For example, AIC is a measure based on
in-sample fit with a penalty for complexity (to avoid overfitting). The Hosmer-
Lemeshow test specifically measures whether or not observed event rates match
expected event rates in subgroups of the model population. These metrics provide
different ways to evaluate your model’s performance and should be used in
conjunction with each other to get a comprehensive understanding of your model’s
fit.

Evaluating the fit of a logistic regression model is a crucial step in the modeling
process. Here’s an expanded explanation of the goodness-of-fit measures, along
with how they apply to the Lending Club data set:

1. -2 Log-likelihood (-2LL): The log-likelihood is a measure of model fit. The


value of -2LL is used to compare the fit of different models, with smaller
values indicating better fit. However, -2LL is sensitive to sample size and
tends to decrease as more predictors are added to the model, regardless of

whether they are meaningful. Therefore, it’s often used in combination with
other measures like AIC and BIC that penalize model complexity.

In the Lending Club data set context, suppose you have two models: one
that includes annual_income and credit_score as predictors and another
that adds home_ownership as an additional predictor. If the -2LL decreases
significantly when home_ownership is added, this suggests
that home_ownership improves the model fit.

2. Akaike’s Information Criterion (AIC): AIC is a measure based on in-sample


fit with a penalty for complexity to avoid overfitting. It balances model fit
(as measured by the likelihood) and model complexity (as measured by the
number of predictors). Lower AIC values indicate better-fitting models.

For example, using the Lending Club data set, if adding a predictor like
home_ownership to a model decreases the AIC, this suggests that despite
increasing model complexity, home_ownership improves the model’s ability
to predict loan default.

3. Hosmer-Lemeshow test: The Hosmer-Lemeshow test specifically measures


whether or not observed event rates match expected event rates in
subgroups of the model population. A significant p-value (usually less than
0.05) indicates a poor fit to the data.

In relation to the Lending Club data set, if you divide your data into deciles
based on predicted default probabilities and run the Hosmer-Lemeshow
test, a non-significant p-value would suggest that your model’s predicted
probabilities align well with actual default rates in these groups.

Remember, these metrics provide different ways to evaluate your model’s


performance and should be used in conjunction with each other to get a
comprehensive understanding of your model’s fit. It’s also important to validate
your model using out-of-sample data to ensure it generalizes well to new data.
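
To see where these numbers come from in practice, the minimal statsmodels sketch below fits a logistic regression and reports -2LL, AIC, and BIC. The data is randomly generated to stand in for the Lending Club predictors (annual income is in thousands here); with the real project you would fit on the prepared training data instead, and the Hosmer-Lemeshow test would typically be computed separately by binning the predicted probabilities.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "annual_income": rng.normal(60, 15, n),    # in thousands of dollars
    "credit_score": rng.normal(690, 40, n),
})
true_logit = -1.0 - 0.02 * (df["annual_income"] - 60) - 0.01 * (df["credit_score"] - 690)
df["bad"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(df[["annual_income", "credit_score"]])
result = sm.Logit(df["bad"], X).fit(disp=0)

print("-2 Log-likelihood:", round(-2 * result.llf, 2))
print("AIC:", round(result.aic, 2))
print("BIC:", round(result.bic, 2))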

Interpretation

The coefficients in a logistic regression model are odds ratios. Because of this, we
interpret the coefficients as the change in odds for a one-unit increase in our

predictor variable, holding all other variables constant. This interpretation is more
intuitive in terms of probability. It is one of the reasons why logistic regression is
popular in fields like medicine, where understanding odds ratios can be crucial.
However, interpreting these coefficients can be challenging, especially when dealing
with multiple predictors or interaction terms.

Interpreting the coefficients in a logistic regression model is a crucial step in


understanding the implications of your model. Here’s an expanded explanation of
how to interpret these coefficients, along with examples using the Lending Club
data set:

1. Interpretation of Coefficients as Odds Ratios: In logistic regression, the


coefficients represent the log odds of the outcome variable per unit
increase in the predictor variable, assuming all other variables are held
constant. By exponentiating these coefficients, we can interpret them as
odds ratios. An odds ratio greater than 1 indicates that as the predictor
increases, the odds of the outcome occurring increase. Conversely, an odds
ratio less than 1 indicates that as the predictor increases, the odds of the
outcome occurring decrease.

For example, consider a predictor variable from the Lending Club data set,
such as annual_income. If the coefficient for annual_income is -0.02, then
the odds ratio is 𝑒 −0.02 = 0.98. This means that for each additional dollar of
annual income, we can expect the odds of a loan default (assuming loan
default is coded as 1) to be multiplied by 0.98, assuming all other variables
are held constant.

2. Dealing with Multiple Predictors: When dealing with multiple predictors,


interpretation can become more complex because each coefficient is
adjusted for all other variables in the model. However, the basic
interpretation remains the same: each coefficient represents the change in
log odds for a unit increase in that predictor, holding all other predictors
constant.

For instance, if credit_score and annual_income are both predictors in your


model, their coefficients represent the change in log odds of default for a
unit increase in credit_score and annual_income, respectively, assuming all
other variables are held constant.

3. Interaction Terms: If your model includes interaction terms (e.g.,


annual_income * credit_score), interpretation becomes even more
complex. The coefficient of an interaction term represents the change in log
odds due to a one-unit increase in the product of those variables, holding all
other variables constant.

For example, suppose there’s an interaction term between annual_income


and credit_score. In that case, its coefficient represents how much the log
odds of default change for a one-unit increase in the product of
annual_income and credit_score, assuming that all other variables are held
constant. This allows us to model situations where the effect of one variable
on the outcome depends on the level of another variable.

Remember that interpreting logistic regression coefficients requires careful


consideration and understanding of your data and research question.

Regularization

Regularization is a critical concept in the field of machine learning. It's the practice
of adding a penalty term to the loss function during the training of a model. In
machine learning, we often encounter the problem of overfitting. This occurs when
a model becomes too complex, capturing not only the underlying patterns in the
training data but also the noise. As a result, the model performs exceptionally well
on the training data but fails to generalize to unseen data. This is where
regularization steps in. Regularization techniques help prevent overfitting by
discouraging the model from fitting the noise in the data. Instead, it encourages the
model to find simpler patterns that generalize better. In essence, regularization is a
form of "complexity control" for machine learning models.

Several methods exist to implement regularization in machine learning, each suited


to different types of models and scenarios.

● L1 and L2 Regularization: L1 regularization (Lasso) adds the absolute values


of the model's coefficients to the loss function, while L2 regularization
(Ridge) adds the squared values of the coefficients. These techniques are
particularly useful in linear regression and logistic regression models.

● Dropout in Neural Networks: In deep learning, dropout is a regularization


technique that randomly deactivates a fraction of neurons during each
training iteration. This prevents over-reliance on specific neurons and
encourages the network to learn more robust features.

● Early Stopping: Another simple form of regularization, early stopping,


monitors the model's performance on a validation set during training.
Training is halted when the validation performance starts to degrade,
preventing the model from overfitting the training data.
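
The sketch below shows two of these ideas applied to logistic models with scikit-learn: L1 and L2 penalties through LogisticRegression, and early stopping through SGDClassifier (which fits a logistic model when loss="log_loss"; this assumes a reasonably recent scikit-learn version). The synthetic data and penalty strengths are illustrative choices, not tuned values.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=42)

# L2 (Ridge) penalty: shrinks coefficients toward zero but rarely to exactly zero
ridge = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000).fit(X, y)

# L1 (Lasso) penalty: can drive some coefficients exactly to zero (implicit feature selection)
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000).fit(X, y)

print("Non-zero coefficients with L2:", int(np.sum(ridge.coef_ != 0)))
print("Non-zero coefficients with L1:", int(np.sum(lasso.coef_ != 0)))

# Early stopping: hold out a validation fraction and stop once it stops improving
sgd = SGDClassifier(loss="log_loss", early_stopping=True, validation_fraction=0.2,
                    n_iter_no_change=5, random_state=42).fit(X, y)
print("Epochs run before stopping:", sgd.n_iter_)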

Decision Time: Choosing the Right Regularization


The decision to apply regularization is critical for preventing
overfitting. As a data scientist, you need to choose the right
regularization method based on the specific machine learning
algorithm and the characteristics of your data. Consider how much
complexity you’re willing to sacrifice for better generalization. The
goal is to find the optimal balance, ensuring your model performs well
not just on the training data, but also on new, unseen data.

Data Set Balancing

Warning: Balancing the Data Set for Logistic Regression


Balancing a data set is crucial when preparing data for logistic
regression. Logistic regression is particularly sensitive to class
imbalances, often favoring the majority class in an imbalanced data
set. This can lead to poor model performance, especially in correctly
classifying the minority class, which is often of greater importance.
Ensuring a balanced data set helps improve the model's accuracy and
reliability.

The issues of imbalance in logistic regression stem from the way the algorithm
learns. It aims to find the optimal decision boundary that minimizes the logistic loss
function. In imbalanced data sets, the model tends to favor the majority class, as the
loss function is dominated by the abundant class. This bias results in poor sensitivity,
specificity, and overall predictive power, particularly for the minority class.

Balancing the data set by either oversampling the minority class or undersampling
the majority class helps mitigate these issues. By presenting the model with a more
equal distribution of classes, logistic regression can learn better from both
categories. As a result, it can help make more informed decisions and provide
improved predictions. Creating a balanced data set for logistic regression is crucial
because it rectifies the algorithm's susceptibility to imbalanced data, ultimately
leading to more accurate and equitable model outcomes.
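
As a concrete illustration, the short pandas sketch below undersamples the majority class of a toy frame so that both classes end up the same size; this mirrors what the program listings later in this chapter do on the actual training data. The frame and its 'bad' values are made up.

import pandas as pd

# Toy imbalanced frame using the project's target name 'bad' (values are made up)
df = pd.DataFrame({"x": range(10),
                   "bad": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]})

# The minority class size determines how many rows to keep from each class
minority_n = df["bad"].value_counts().min()

# Undersample: draw the same number of rows from every class
balanced = pd.concat([grp.sample(minority_n, random_state=42)
                      for _, grp in df.groupby("bad")])

print(balanced["bad"].value_counts())   # both classes now have minority_n rows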

Dummy Variable Trap

Multicollinearity Risk in One-Hot Encoding


When performing one-hot encoding on a categorical variable, each
category is transformed into a binary (dummy) variable indicating the
presence (1) or absence (0) of that category. However, including all
dummy variables in a regression model can introduce multicollinearity,
a problem that occurs when these variables are highly correlated with
each other. This can distort the model's estimates and reduce its
predictive power.

Multicollinearity occurs when one predictor variable in a regression model can be


linearly predicted from the others with a high degree of accuracy. In the case of
dummy variables, if we include all levels, we can perfectly predict one level from all
the others. For example, if we have a categorical variable with three levels, A, B, and
C, and we include dummy variables for all three levels in our model, knowing that A
and B are both 0 immediately tells us that C must be 1. This perfect correlation
between the predictors leads to multicollinearity.

The consequences of multicollinearity can be severe: it can cause the regression


model to produce unstable and unreliable estimates of the regression coefficients. It
can make it harder to interpret the coefficients, as the effect of each predictor on
the response variable is confounded with the effects of the other predictors. It can
also inflate the standard errors of the coefficients, leading to a loss of statistical
power. To avoid these issues, one level of the categorical variable is typically left out
when creating dummy variables – this is known as the reference level. The
coefficients for the remaining levels are then interpreted relative to this reference
level.
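
In pandas, the usual safeguard is to drop one level per categorical variable when creating the dummies. The sketch below uses a made-up home_ownership column for illustration.

import pandas as pd

df = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT", "OWN"]})

# drop_first=True omits one level (the reference level), avoiding the dummy variable trap
dummies = pd.get_dummies(df["home_ownership"], prefix="home_ownership", drop_first=True)
print(dummies)

# The dropped level (here MORTGAGE, the first alphabetically) becomes the reference;
# coefficients on the remaining dummies are interpreted relative to it.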

Pros and Cons of Logistic Regression

Pros:

● Easy to implement and interpret.


● Doesn’t require a linear relationship between dependent and independent
variables.
● Can handle both continuous and categorical variables.

Cons:

● Requires large sample sizes because maximum likelihood estimates are less
powerful at small sample sizes than ordinary least squares.
● Doesn’t handle multicollinearity well. If variables are highly correlated, it
can lead to unstable estimates of coefficients.

Best Practices

● Always check for multicollinearity among independent variables. If high


correlation is present, consider removing or combining correlated variables.
● Make sure your sample size is large enough for logistic regression to provide
meaningful results.
● Always check the goodness-of-fit measures to assess how well your model
fits your data.
● Remember that logistic regression coefficients are in terms of log odds, so
interpret them carefully.

Logistic Regression Model Code

In this section, we will dig into the practical aspect of logistic regression by
developing a model using Python, R, and SAS. We will follow a step-by-step
approach to ensure that each part of the process is clearly understood. The steps
include importing the data, splitting it into training and validation data sets,
performing Variance Inflation Factor (VIF) analysis to check for multicollinearity,
selecting variables based on the VIF analysis, building and tuning the logistic
regression model, applying the model to an out-of-time (OOT) data set, and finally
evaluating the model’s performance using appropriate metrics.

Program 5.1: Python Logistic Regression Script

# Import necessary libraries


import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
import statsmodels.api as sm

# Function to calculate performance metrics


def performance_metrics(y_true, y_pred):
    print("Classification Report:")
    print(classification_report(y_true, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    print("ROC AUC Score:")
    print(roc_auc_score(y_true, y_pred))

# Load the data


train_encoded = pd.read_csv('/...INPUT YOUR FILE PATHWAY.../train_encoded.csv')
OOT_encoded = pd.read_csv('/...INPUT YOUR FILE PATHWAY.../oot_encoded.csv')

print(train_encoded.columns)

# Define the variables to exclude


excluded_variables = ['id', 'emp_length_3years', 'term_36months',
'grade_G', 'sub_grade_B4',
'verification_status_SourceVerifi',
'purpose_home_improvement',
'home_ownership_RENT',
'application_type_JointApp', 'bad']

# Define predictors excluding the specified variables and the target variable
predictors = [col for col in train_encoded.columns if col not in excluded_variables]

# Calculate VIF
X = sm.add_constant(train_encoded[predictors])
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in
range(X.shape[1])]
vif["features"] = X.columns

# Identify variables with VIF greater than 5 (indicative of multicollinearity)
high_vif_predictors = vif[vif["VIF Factor"] > 5]["features"].tolist()

# Remove predictors with high VIF from the list of predictors


predictors = [pred for pred in predictors if pred not in
high_vif_predictors]

# Store the target variable temporarily


OOT_bad = OOT_encoded['bad']

# Check if excluded variables are in OOT_encoded columns, if not then add them
missing_cols = set(predictors) - set(OOT_encoded.columns)
for c in missing_cols:
    OOT_encoded[c] = 0

# Ensure the order of columns in the OOT set is the same order as in the train set
OOT_encoded = OOT_encoded[predictors]

# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded['bad'] = OOT_bad

# Split the data into training and validation sets (80/20 split) using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])

# Remove constant columns


X_train = X_train.loc[:, (X_train != X_train.iloc[0]).any()]

# Remove duplicate columns


X_train = X_train.loc[:,~X_train.columns.duplicated()]

# Combine predictors and target variable into one DataFrame for undersampling
train_data = pd.concat([X_train, y_train], axis=1)

# Count the number of samples in the minority class


minority_class_count = train_data['bad'].value_counts().min()

# Perform undersampling on the majority class


undersampled_data = pd.concat([train_data[train_data['bad'] ==
label].sample(minority_class_count, random_state=42) for label in
train_data['bad'].unique()])

# Get the resampled predictors and target variable


X_train_resampled = undersampled_data[predictors]
y_train_resampled = undersampled_data['bad']

# Build the Lasso regression model


lasso = Lasso(random_state=42)

# Define hyperparameters to tune


param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}  # Expand hyperparameters as needed

# Tune hyperparameters using GridSearchCV


grid_search = GridSearchCV(lasso, param_grid, cv=5)
grid_search.fit(X_train_resampled[predictors], y_train_resampled)

# Get the best model


best_model = grid_search.best_estimator_

# Separate predictors and target variable


OOT_X = OOT_encoded.drop('bad', axis=1)
OOT_y = OOT_encoded['bad']

# Apply the model to the OOT data set


OOT_predictions = best_model.predict(OOT_X)

# Convert predicted values to classes


threshold = 0.5 # You can adjust this threshold as needed
OOT_predictions_classes = [1 if pred > threshold else 0 for pred in
OOT_predictions]

# Create performance metrics using the function defined earlier


performance_metrics(OOT_y, OOT_predictions_classes)

# Output parameter estimates, AIC, BIC


print("Parameter Estimates:")
print(best_model.coef_)
print("AIC:", best_model.score(X_train_resampled[predictors],
y_train_resampled))
print("BIC:", best_model.score(X_train_resampled[predictors],
y_train_resampled) - best_model.coef_.shape[0])

# Plot ROC curve


fpr, tpr, _ = roc_curve(OOT_y, OOT_predictions)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Program 5.2: R Logistic Regression Script

# Load necessary libraries


library(caret)
library(glmnet)
library(car)
library(pROC)

# Load the data


train_encoded <- read.csv('/...INPUT YOUR FILE PATHWAY.../train_encoded.csv')
OOT_encoded <- read.csv('/...INPUT YOUR FILE PATHWAY.../oot_encoded.csv')

colnames(train_encoded)

# Define the variables to exclude


excluded_variables <- c('id', 'emp_length_3years', 'term_36months',
'grade_G', 'sub_grade_B4', 'verification_status_SourceVerifi',

'purpose_home_improvement', 'home_ownership_RENT',
'application_type_JointApp')

# Define predictors excluding the specified variables and the target variable
predictors <- setdiff(colnames(train_encoded), c(excluded_variables,
"bad"))

# Calculate VIF
vif_fit <- lm(bad ~ ., data = train_encoded[, c(predictors, "bad")])
if (vif_fit$rank == length(coefficients(vif_fit))) {
vif_values <- car::vif(vif_fit)
high_vif_predictors <- names(vif_values[vif_values > 5]) # Change this threshold as needed
} else {
high_vif_predictors <- character(0)
}

# Remove predictors with high VIF from the list of predictors


predictors <- setdiff(predictors, high_vif_predictors)

# Split the data into training and validation sets (80/20 split) using stratified sampling
trainIndex <- createDataPartition(train_encoded$bad, p = .8, list =
FALSE)
X_train <- train_encoded[trainIndex, predictors]
y_train <- train_encoded[trainIndex, "bad"]
X_val <- train_encoded[-trainIndex, predictors]
y_val <- train_encoded[-trainIndex, "bad"]

# Perform undersampling on the majority class


minority_class_count <- min(table(train_encoded$bad))
undersampled_data <- train_encoded[train_encoded$bad == 0, ]
undersampled_data <- rbind(undersampled_data,
train_encoded[train_encoded$bad ==
1, ][sample(minority_class_count), ])

# Get the resampled predictors and target variable


X_train_resampled <- undersampled_data[, predictors]
y_train_resampled <- undersampled_data[, "bad"]

# Check if predictors are in OOT_encoded columns, if not then add them

missing_cols <- setdiff(predictors, colnames(OOT_encoded))


for (c in missing_cols) {
OOT_encoded[[c]] <- 0
}

# Ensure the order of columns in the OOT set is the same order as in the train set
OOT_encoded <- OOT_encoded[c(predictors, "bad")]


# Build the Lasso regression model


cvfit <- cv.glmnet(as.matrix(X_train_resampled), y_train_resampled,
family = "binomial", alpha = 1)

# Get the best model


best_lambda <- cvfit$lambda.min
best_model <- glmnet(as.matrix(X_train_resampled),
y_train_resampled, family = "binomial", alpha = 1, lambda =
best_lambda)

# Apply the model to the OOT data set


OOT_X <- OOT_encoded[, predictors]
OOT_y <- OOT_encoded[, "bad"]
OOT_predictions_prob <- predict(best_model, newx = as.matrix(OOT_X),
type = "response")
OOT_predictions_classes <- ifelse(OOT_predictions_prob > 0.5, 1, 0)

# Output parameter estimates


print(coef(best_model))

# Ensure OOT_predictions_classes is a factor with levels matching OOT_y
OOT_predictions_classes <- factor(OOT_predictions_classes, levels =
levels(OOT_y))

# Ensure OOT_predictions_classes and OOT_y are factors with the same levels

OOT_predictions_classes <- factor(OOT_predictions_classes, levels =


c(0, 1))
OOT_y <- factor(OOT_y, levels = c(0, 1))

# Calculate predicted probabilities


OOT_predictions_prob <- predict(best_model, newx = as.matrix(OOT_X),
type = "response")

# Ensure OOT_predictions_prob is numeric


OOT_predictions_prob <- as.numeric(OOT_predictions_prob)

# Plot ROC curve


roc_obj <- roc(OOT_y ~ OOT_predictions_prob)
plot(roc_obj)

# Calculate sensitivity and specificity for each threshold


roc_obj <- roc(OOT_y ~ OOT_predictions_prob)
sensitivities <- roc_obj$sensitivities
specificities <- roc_obj$specificities
thresholds <- roc_obj$thresholds

# Calculate Youden's J statistic for each threshold


J <- sensitivities + specificities - 1

# Find the optimal threshold


optimal_threshold <- thresholds[which.max(J)]

# Convert probabilities to class predictions using the optimal threshold
OOT_predictions_classes <- ifelse(OOT_predictions_prob >
optimal_threshold, 1, 0)

# Convert to factors and ensure they have the same levels


OOT_predictions_classes <- factor(OOT_predictions_classes, levels =
c(0, 1))
OOT_y <- factor(OOT_y, levels = c(0, 1))

# Print confusion matrix


print(confusionMatrix(OOT_predictions_classes, OOT_y))

Program 5.3: SAS Logistic Regression Script

LIBNAME james 'INPUT FILE PATHWAY';

/* Load the data */


DATA train_encoded;
SET james.train_encoded;
RUN;

DATA oot_encoded;
SET james.oot_encoded;
RUN;

%let target = bad;

/* Define the variables to exclude to avoid the dummy variable trap


*/
%LET excluded_variables = id bad emp_length_3years term_36months
grade_g sub_grade_b4 verification_status_sourceverifi
purpose_home_improvement home_ownership_rent
application_type_jointapp;

/* Create global variable for predictors */


PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id &target.
&excluded_variables.) OUT = var (KEEP = name); RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM
var; QUIT;
%PUT &predictors; RUN;

/*Fit regression model and calculate VIF values*/


PROC REG DATA=train_encoded ;
MODEL &target. = &predictors. / VIF;
RUN;

%let high_corr = GRADE_B int_rate;

/* Remove correlated variables and create final global variable for


predictors */
PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id &target.
&excluded_variables. &high_corr.) OUT = var (KEEP = name); RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM
var; QUIT;
%PUT &predictors; RUN;

/* Variables in train_encoded but not in oot_encoded */


PROC SQL NOPRINT;
CREATE TABLE train_vars AS SELECT name FROM dictionary.columns
WHERE libname="WORK" AND memname="TRAIN_ENCODED";
CREATE TABLE oot_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="OOT_ENCODED";
CREATE TABLE in_train_not_oot AS
SELECT * FROM train_vars
EXCEPT
SELECT * FROM oot_vars;
QUIT;

/* Print the differences */


PROC PRINT DATA=in_train_not_oot; RUN;

/* Create final train and OOT data sets */


DATA train_encoded_final;
SET train_encoded (KEEP=&predictors bad);
RUN;

PROC FREQ DATA=train_encoded_final; TABLE bad; RUN;

/*Create missing variables from TRAIN data set*/


DATA oot_encoded_final;
SET oot_encoded ;
home_ownership_ANY = 0;
purpose_wedding = 0;
KEEP &predictors bad home_ownership_ANY purpose_wedding;
RUN;

PROC FREQ DATA=oot_encoded_final; TABLE bad; RUN;

/* Split the modeling data set by a 80/20 ratio using a random seed
*/
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2
SEED=42; RUN;

PROC FREQ DATA= class2; TABLES selected; RUN;

/* Create TRAIN and VAL data sets */


DATA train_data val_data;
SET class2 ;
IF selected = 1 THEN OUTPUT train_data; ELSE OUTPUT val_data;
RUN;

/* Assess target rate in train, val and OOT data sets */


PROC FREQ DATA = train_data; TABLES bad; RUN;
PROC FREQ DATA = val_data; TABLES bad; RUN;
PROC FREQ DATA = oot_encoded_final; TABLES bad; RUN;

/* Balance the data by undersampling the majority class */


PROC FREQ DATA=train_data ;
TABLES bad ;
RUN;

DATA pos neg;


SET train_data;
IF bad = 1 THEN OUTPUT pos; ELSE OUTPUT neg;
RUN;

DATA train_bal;
SET pos neg(obs=6901); /*Input the number of positive cases*/
RUN;

PROC FREQ DATA=train_bal; TABLES bad; RUN;

/* Perform logistic regression with LASSO (L1) regularization */


PROC HPGENSELECT DATA=train_bal;
MODEL bad(event='1') = &predictors / DIST=BINARY LINK=LOGIT;
SELECTION METHOD=LASSO(CHOOSE=SBC);
CODE FILE='/OneDrive/Documents/DS_Project/log_model.sas';
OUTPUT OUT=predicted PRED=pred_val;
RUN;

/* Apply the best model to the OOT data set */


DATA oot_score;
SET oot_encoded_final;
%INCLUDE '/OneDrive/Documents/DS_Project/log_model.sas';
IF p_bad1 GE 0.5 THEN pred = 1; ELSE pred = 0;
RUN;

/* Create a confusion matrix */


PROC FREQ DATA=oot_score; TABLE pred*bad / NOCOL NOROW NOPERCENT;
RUN;

/* Create an ROC chart */


ODS GRAPHICS ON;
PROC LOGISTIC DATA=oot_score PLOTS(ONLY)=ROC;

MODEL bad(event='1') = p_bad1;


RUN;

Here is a final step-by-step description of the logistic regression pipeline:

1. Import necessary libraries: Import the necessary Python and R libraries. SAS
does not require additional libraries.

2. Define performance metrics function: Define a function to calculate and


print performance metrics such as classification report, confusion matrix,
and ROC-AUC score.

3. Load data: Load the preprocessed training and out-of-time (OOT) data sets.

4. Exclude one dummy variable from each group: To avoid the “dummy
variable trap,” remove one dummy variable created from each categorical
variable.

5. Remove highly correlated variables: Use VIF analysis to identify highly


correlated variables.

6. Define predictors: Define the predictor variables by excluding certain


specified variables from the data set. This avoids the dummy variable trap
and removes the highly correlated variables.

7. Check missing columns: Check if any columns are missing in the OOT data
set compared to the training data set and add them if necessary.

8. Split data: Use stratified sampling to split the training data into training and
validation sets.

9. Remove constant and duplicate columns: Remove any constant or


duplicate columns from the training set.

10. Undersample majority class: Perform undersampling on the majority class


to balance the classes in the training data.

11. Build lasso regression model: Build a Lasso regression model and tune its
hyperparameters using grid search.

12. Apply model to OOT data set: Apply the best model obtained from the grid
search to the OOT data set to make predictions.

13. Calculate performance metrics: Calculate performance metrics on the OOT


data set for final model evaluation.

This pipeline provides a comprehensive workflow for building a Lasso regression


model, tuning its hyperparameters, and applying it to an out-of-time validation data
set.

Decision Trees

Decision trees are a unique type of non-parametric model that offers a high degree
of flexibility, making them a powerful tool in data science. Unlike parametric
models, which assume a specific form for the underlying relationships between
variables, non-parametric models like decision trees allow the data itself to dictate
the model's structure. This flexibility enables decision trees to effectively handle a
wide variety of modeling issues, especially when relationships between variables are
complex or non-linear.

Structurally, decision trees are fundamentally different from models like logistic
regression. They employ a hierarchical and recursive partitioning of data, resulting
in a tree-like structure. This structure comprises nodes and leaves. Nodes represent
decisions based on specific conditions on input features, and leaves signify output
classes or final predictions.

To illustrate, consider a simple decision tree for predicting whether a loan will be
approved or not:

Figure 5.3: Decision Tree Structure

In this example, the root node splits the data based on whether income is greater
than 50K. If yes, it leads to a child node that further splits the data based on credit
score. If no, it leads directly to a leaf node predicting that the loan will be denied.

This tree-like structure makes decision trees highly interpretable. Each path from
the root to a leaf represents a decision rule. For instance, in the above example, one
rule could be “If income is greater than 50K and credit score is above 700, then the
loan is approved.”
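
The scikit-learn sketch below fits a tiny tree on invented income and credit score data (income in thousands, echoing the figure) and prints its decision rules; each printed root-to-leaf path reads as one such rule.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data echoing the figure: income (in thousands) and credit score
X = pd.DataFrame({"income": [30, 45, 60, 80, 55, 90, 40, 75],
                  "credit_score": [620, 650, 710, 730, 680, 760, 600, 700]})
y = [0, 0, 1, 1, 0, 1, 0, 1]   # 1 = approved, 0 = denied

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Print the fitted tree as human-readable if/else rules
print(export_text(tree, feature_names=list(X.columns)))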

Moreover, decision trees form the basis for more complex ensemble methods like
Random Forests and Gradient Boosting Machines (GBMs), often providing superior
predictive performance by combining predictions from multiple decision trees.

Comparing Logistic Regression and Decision Trees


In machine learning, models are often categorized as parametric or non-parametric.
Parametric models, such as logistic regression, rely on assumptions about the
functional form of the relationship between variables. For example, logistic
regression assumes a linear relationship between the input features and the log-
odds of the target variable. In contrast, non-parametric models like decision trees
make no such assumptions, instead allowing the data to determine the structure of

the model. This makes decision trees more flexible but can also make them prone to
overfitting without proper tuning.

On the other hand, non-parametric models like decision trees do not make strong
assumptions about the functional forms of relationships between variables. Instead,
they allow the data to dictate the model’s structure. This flexibility enables non-
parametric models to handle a variety of modeling issues effectively.

Structurally, decision trees and logistic regression are fundamentally different.


Decision trees employ a hierarchical and recursive partitioning of data. They
determine data splits based on specific conditions on input features, resulting in a
tree-like structure. This structure comprises nodes, which represent decisions, and
leaves, which signify output classes.

In contrast, logistic regression uses a different approach. It employs linear


combinations of features to model the output's log odds, making certain
assumptions about the functional form of relationships between variables.

To further illustrate these differences, let’s look at a comparison table that shows
how logistic regression and decision trees handle various modeling issues:

Table 5.1: Comparing Logistic Regression and Decision Tree Models

Model Issue: Multicollinearity
    Logistic Regression: Sensitive. Requires variance inflation factor (VIF) or similar methods to detect and address multicollinearity.
    Decision Trees: Not sensitive. Can handle multicollinearity well.

Model Issue: Outliers
    Logistic Regression: Sensitive. Outliers can significantly impact the model. Requires outlier detection and handling methods.
    Decision Trees: Less sensitive. Extreme outliers could affect the splits but are generally robust to outliers.

Model Issue: Non-linearity
    Logistic Regression: Cannot handle non-linear relationships unless feature engineering (like polynomial features) is done.
    Decision Trees: Can handle non-linear relationships well as they partition the space based on feature values.

Model Issue: Categorical Variables
    Logistic Regression: Requires dummy variables for categorical features, which can lead to the dummy variable trap if not handled correctly.
    Decision Trees: Can handle categorical variables directly without the need for dummy variables.

Model Issue: Missing Values
    Logistic Regression: Cannot handle missing values. Requires imputation or removal of missing values.
    Decision Trees: Some implementations can handle missing values, but others may require imputation or removal of missing values.

Model Issue: Feature Selection
    Logistic Regression: Does not inherently perform feature selection. May require methods like stepwise regression for feature selection.
    Decision Trees: Inherently performs feature selection by choosing the most informative features for splitting.

Model Issue: Imbalanced Data
    Logistic Regression: May require techniques like oversampling, undersampling, or SMOTE to handle imbalanced data effectively.
    Decision Trees: Can handle imbalanced data but might be biased toward the majority class. Balancing techniques may still be beneficial.

Decision trees excel when you need a model that is easy to interpret and explain.
Their tree-like structure provides clear insights into their decision-making process.
Depending on the implementation, they can also handle numerical and categorical
data, as well as missing values.

Feature Selection Methods

In the context of decision trees, impurity and information gain are two fundamental
concepts that guide the construction of the tree.

● Impurity measures the homogeneity or purity of the data in a node. If a


node is completely homogeneous (i.e., it contains instances of only one

class), it has an impurity of zero. On the other hand, if a node is equally split
between multiple classes, it has a high impurity. The goal of a decision tree
algorithm is to partition the data in a way that minimizes impurity and thus
increases the purity of the data subsets. Two commonly used measures of
impurity are entropy and Gini impurity.

● Information gain is another key concept in decision trees. It measures how


much information we gain about the target variable by splitting the data
based on a feature. In other words, it quantifies the reduction in impurity
achieved by partitioning the data according to a given feature. The feature
that provides the highest information gain is chosen for the split.

These concepts play a pivotal role in building decision trees. They help determine
the conditions for splitting the data at each node, aiming to maximize information
gain (reduce impurity) and create decision trees that can accurately classify new
instances.

● Entropy quantifies the impurity or disorder within a data set. For binary
classification, it’s computed as:

Equation 5.6: Entropy


I_H = -\sum_{j=1}^{C} p_j \log_2(p_j)

● Gini Impurity gauges how often a randomly chosen element would be


incorrectly classified. In binary classification, it’s expressed as:

Equation 5.7: Gini Impurity


I_G = 1 - \sum_{j=1}^{C} p_j^2

These measures play a pivotal role in building decision trees. They help determine
the conditions for splitting the data at each node, aiming to maximize information

gain (in the case of entropy) or minimize impurity (in the case of Gini impurity). By
doing so, they help create decision trees that can accurately classify new instances.

Entropy and Gini impurity are fundamental to understanding how decision trees
work. They provide a quantitative basis for deciding how to split the data at each
node, guiding the construction of an effective and accurate decision tree model.
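
The sketch below implements both impurity measures and the information gain of a candidate split for a binary target; the class counts in the example are invented.

import numpy as np

def entropy(p):
    """Entropy of a node given its class proportions p (zero proportions are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity of a node given its class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right, impurity=entropy):
    """Reduction in impurity from splitting a parent node into left/right children.
    Arguments are class counts, e.g. [goods, bads]."""
    def props(counts):
        counts = np.asarray(counts, dtype=float)
        return counts / counts.sum()
    n, n_left, n_right = sum(parent), sum(left), sum(right)
    weighted_children = (n_left / n) * impurity(props(left)) + (n_right / n) * impurity(props(right))
    return impurity(props(parent)) - weighted_children

# Invented example: a 50/50 node split into child nodes with counts (40, 10) and (10, 40)
print(round(information_gain([50, 50], [40, 10], [10, 40]), 4))        # gain measured with entropy
print(round(information_gain([50, 50], [40, 10], [10, 40], gini), 4))  # gain measured with Gini impurity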

Decision Tree Construction

The process to construct a classification decision tree consists of six main steps:

1. Selecting the Root Node: The process begins with all training samples at the
root node. The feature and threshold that best divide the data are selected
as the root node. This is done by either maximizing information gain or
minimizing impurity, such as Gini impurity or entropy. Information gain is
calculated as the difference in impurity before and after a split. The feature
with the highest information gain is chosen for the split.

2. Creating Child Nodes: Once a feature is selected for the root node, the data
set is split into two subsets based on the chosen feature’s threshold. This
process is then recursively applied to each subset, creating child nodes by
selecting the feature and threshold that minimize impurity within each
subset.

3. Recursive Splitting: The process of creating child nodes continues


recursively until a stopping criterion is met. This results in a full decision tree
where each internal node represents a decision based on a feature, and
each leaf node represents an output class.
4. Stopping Criteria: Several criteria can be used to stop the tree from growing
too deep, which could lead to overfitting. These include reaching a
maximum tree depth, reaching a minimum number of samples in a leaf
node, or when no further improvement can be made.

5. Pruning: After building a full decision tree, it may be pruned to avoid


overfitting. Pruning involves removing branches of the tree that contribute
little to prediction accuracy. This can be done using various strategies, such
as reduced error pruning or cost complexity pruning.

6. Tree Evaluation: Finally, the decision tree's performance is evaluated using


appropriate metrics, such as accuracy for classification tasks or mean
squared error for regression tasks.

By following these steps, you can construct a decision tree model that can make
accurate predictions while also being interpretable and easy to understand.

Pruning Decision Trees

One of the biggest concerns with decision trees is their tendency to overfit the
model to the training data. This occurs because the algorithm will continue building
the tree and constructing very fine leaf nodes if the model is unrestrained. These
are labeled as “deep trees” because several layers of internal nodes lead to the leaf
node. We need a method of finding the optimal tree within the vast space of all
possible trees. One way of limiting the algorithm and preventing deep trees is to
place a constraint on the model. This constraint should reduce the size of the
decision tree without reducing the predictive accuracy.

There are two widely used methods of pruning:

1. Reduced Error Pruning: For each leaf of the decision tree, the leaf is
replaced with a node that represents the prevalent class. If the prediction
accuracy does not decrease, then the node is retained.

2. Cost Complexity Pruning: This technique makes trade-offs between the size
of the tree and its fit to the training data, helping prevent overfitting. A
complexity parameter labeled alpha (α) is introduced and represents the
cost of each new leaf. The tuning parameter α controls a trade-off between
the subtree’s complexity and its fit to the training data.

Equation 5.8: Cost Complexity Pruning:

Cost Complexity = Error Rate + α × (# of leaves)



The cost complexity metric penalizes the development of additional leaves that do
not significantly reduce the error rate. The error rate is defined differently
depending on the type of model. For regression models, the residual sum of squares
is used as the error rate. For classification models, the misclassification rate is used
as the error rate.

When α = 0, this is equivalent to having a full tree where no pruning is performed. As α increases, subtrees decrease in complexity. To find the optimal subtree, we need to determine the optimal value of α. This process involves using a holdout method such as a validation sample or cross-validation. The subtree with the minimum cost metric on the holdout data set (either validation sample or cross-validation) has been constructed with the optimal value of α.
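
As a concrete illustration of this search for the optimal α, the sketch below uses scikit-learn's cost_complexity_pruning_path to generate the candidate alpha values and then cross-validates a pruned tree at each one. The synthetic data from make_classification is only a stand-in for a real modeling data set.

# Minimal sketch: choose the cost complexity parameter (alpha) by cross-validation
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own training predictors and target
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Candidate alphas come from the pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
ccp_alphas = np.clip(path.ccp_alphas, 0, None)   # guard against tiny negative values

# Cross-validate a pruned tree at each alpha and keep the best one
scores = [cross_val_score(DecisionTreeClassifier(random_state=42, ccp_alpha=a),
                          X, y, cv=5).mean() for a in ccp_alphas]
best_alpha = ccp_alphas[int(np.argmax(scores))]
print("Optimal alpha:", round(float(best_alpha), 5))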

Performance Metrics

Evaluating the performance of a decision tree model is a crucial step in the machine
learning process. It allows us to understand how well our model is doing and where
it might fall short. We can use several metrics to evaluate a decision tree model,
each providing a different perspective on the model’s performance.

A confusion matrix is a table layout that visualizes an algorithm's performance. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. The name comes from the fact that it makes it easy to see if the system is confusing two classes (i.e., commonly mislabeling one as another).

Another important metric is the Receiver Operating Characteristic – Area Under the Curve (ROC-AUC). It measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1). ROC curves plot the true positive rate (TPR) versus the false positive rate (FPR) at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives.

The Gini coefficient is another popular metric, especially in economics. It measures inequality among values of a frequency distribution (for example, income levels). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100%) expresses maximal inequality among values.

Here are some commonly used performance metrics:

● Accuracy: The proportion of correctly classified instances. It’s calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For instance, in the Lending Club data set, this would be the percentage of
loans correctly classified as approved or denied.

● Precision: Measures the accuracy of positive predictions. It’s calculated as:

Precision = TP / (TP + FP)

For example, it helps evaluate how many of the approved loans were truly
creditworthy.

● Recall (Sensitivity): Measures the proportion of actual positives that were correctly predicted. It’s calculated as:

Recall = TP / (TP + FN)

In the Lending Club context, it tells us how many of the creditworthy loans
were correctly identified.

● F1 Score: Harmonic mean of precision and recall. It’s calculated as:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Useful when there’s an uneven class distribution.

● ROC-AUC: Receiver Operating Characteristic – Area Under the Curve. It assesses the model’s ability to distinguish between positive and negative classes, which is crucial in credit approval scenarios.

● Gini Coefficient: It’s calculated from the ROC-AUC as:

Gini Coefficient = (2 × ROC-AUC) − 1

In this context, the Gini coefficient ranges between -1 and 1, where 0 indicates a model with no discriminatory power (equivalent to random guessing) and 1 indicates perfect separation of the positive and negative classes. It’s especially useful as a single number summarizing how much better a model ranks cases than random chance.

For example, if we’re using these metrics to evaluate a decision tree model on the
Lending Club data set:

● We might first calculate accuracy by counting the number of loans correctly classified as either fully paid or charged off.
● Precision would tell us how many of the loans that we predicted would
be fully paid actually ended up being fully paid.
● Recall would tell us how many of the loans that ended up being fully
paid were correctly identified by our model.
● The F1 score would give us a single metric that combines both precision
and recall.
● The ROC-AUC score would tell us how well our model can distinguish
between loans that will be fully paid and those that will be charged off.
● The Gini coefficient would summarize how much better our model ranks loans than a random allocation would.

For more information on performance metrics, please see Chapter 8: Performance Metrics, Implementation, and Model Monitoring.
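
Before moving on, here is a small Python sketch that computes each of the metrics above with scikit-learn. The y_true and y_prob vectors are made-up values standing in for actual outcomes and model scores (1 = charged off, 0 = fully paid); they are not drawn from the Lending Club data.

# Minimal sketch: the performance metrics above, computed with scikit-learn
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]                       # actual outcomes (illustrative)
y_prob = [0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.2, 0.1]   # model scores (illustrative)
y_pred = [1 if p > 0.5 else 0 for p in y_prob]                # labels at a 0.5 threshold

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

auc = roc_auc_score(y_true, y_prob)    # AUC is computed from the scores, not the labels
print("ROC-AUC  :", auc)
print("Gini     :", 2 * auc - 1)       # Gini coefficient = (2 x ROC-AUC) - 1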

Hyperparameter Tuning

The Importance of Hyperparameters

Hyperparameters are a critical aspect of machine learning models that often determine their performance. Unlike model parameters, which are learned from the training data, hyperparameters are settings or configurations that must be specified before training begins. These settings control various aspects of how a machine learning algorithm learns from data and makes predictions. Selecting appropriate values for hyperparameters is essential, as it can significantly impact a model's ability to generalize and perform well on unseen data.

The Role of Hyperparameter Tuning

Hyperparameter tuning is a crucial step in the machine learning model development process. It can be compared to adjusting the dials on a finely tuned instrument to produce harmonious music. In machine learning, we adjust hyperparameters to strike the right balance between underfitting and overfitting:

● Underfitting: An inadequately tuned model may fail to capture the complexities of the data, leading to poor performance.
● Overfitting: Conversely, a model with overly tuned hyperparameters may fit
the training data too perfectly, causing it to generalize poorly to new,
unseen data.

The goal of hyperparameter tuning is to find the sweet spot that maximizes a
model's predictive power while ensuring it doesn't become overly complex or
specialized to the training data.

Approaches to Hyperparameter Tuning

Hyperparameter tuning is a systematic process that typically involves searching through a range of hyperparameter values to find the combination that produces the best model performance. There are several approaches to hyperparameter tuning:

● Manual Tuning: Data scientists adjust hyperparameters based on domain knowledge and intuition, then evaluate the model's performance.
● Grid Search and Random Search: These are more systematic methods
where various combinations of hyperparameters are evaluated to identify
the best settings.
● Advanced Techniques: Recent advances in machine learning, such as
Bayesian optimization and genetic algorithms, offer more efficient and
automated ways to search for optimal hyperparameters.

The choice of tuning approach often depends on the complexity of the model and
the size of the hyperparameter search space. Hyperparameter tuning is an iterative
and essential part of building accurate and reliable machine learning models.
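
As a sketch of the second approach, the code below runs a randomized search over decision tree hyperparameters; it complements the exhaustive GridSearchCV used later in Program 5.4. The synthetic data set is only a placeholder for your own training data.

# Minimal sketch: RandomizedSearchCV over decision tree hyperparameters
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own modeling data set
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.85, 0.15],
                           random_state=42)

param_distributions = {"max_depth": randint(3, 21),         # sample depths 3-20
                       "min_samples_split": randint(2, 11),
                       "min_samples_leaf": randint(1, 6)}

random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                                   param_distributions=param_distributions,
                                   n_iter=25, scoring="roc_auc", cv=5,
                                   random_state=42)
random_search.fit(X, y)
print(random_search.best_params_)
print("Cross-validated AUC:", round(random_search.best_score_, 4))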

Pros and Cons


Decision trees offer several advantages, such as interpretability, handling non-
linearity well, and being robust to outliers. However, they can easily overfit and may
not perform optimally on complex data.

Pros:

1. Decision trees are easy to understand and interpret.


2. They can handle both categorical and numerical data.
3. Decision trees implicitly perform feature selection.
4. They can capture complex relationships in data.

Cons:

1. Decision trees can easily overfit or underfit if not properly tuned.


2. They can be unstable, as small changes in data can lead to different trees.
3. They are biased toward features with more levels.
4. Their decision boundaries are axis-parallel, which can be restrictive.

Best Practices

1. Feature Engineering: Invest time in selecting and engineering relevant features as they profoundly impact decision tree performance.
2. Pruning: Apply pruning techniques like cost complexity pruning to avoid
overfitting.
3. Ensemble Methods: Consider ensemble methods like Random Forests or
Gradient Boosting to improve model robustness and accuracy.
4. Data Splitting: Use techniques like cross-validation to ensure your model
generalizes well to unseen data.

5. Tune Hyperparameters: Experiment with different hyperparameters, such as tree depth and minimum samples per leaf, to find the right balance between bias and variance.

By unifying these concepts, data scientists gain the proficiency to employ decision
trees effectively, from their construction to pruning. This enables them to make
informed choices when faced with modeling decisions and ultimately enhances the
quality of their predictive models.

Decision Tree Model Code


In this section, we will delve into the practical aspect of decision tree modeling by
developing a model using Python, R, and SAS. We will follow a step-by-step
approach to ensure that each part of the process is clearly understood. The steps
include importing the data, splitting it into training and validation data sets,
balancing the data due to our binary target variable, building the decision tree
model, tuning hyperparameters using methods like GridSearchCV or
RandomizedSearchCV, applying the model to an out-of-time (OOT) data set, and
finally evaluating the model’s performance using appropriate metrics. We will also
discuss the importance of each step and how it contributes to the overall
methodology of decision tree development.

Program 5.4: Python Decision Tree Script

# Import necessary libraries


import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix,
roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Function to calculate performance metrics


def performance_metrics(y_true, y_pred):
print("Classification Report:")
print(classification_report(y_true, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print("ROC AUC Score:")


print(roc_auc_score(y_true, y_pred))

# Load the data


train_encoded = pd.read_csv('/...INPUT YOUR FILE
PATHWAY.../train_encoded.csv')
OOT_encoded = pd.read_csv('/...INPUT YOUR FILE
PATHWAY.../oot_encoded.csv')

print(train_encoded.columns)

# Define the variables to exclude


excluded_variables = ['id', 'emp_length_3years', 'term_36months',
'grade_G', 'sub_grade_B4',
'verification_status_SourceVerifi',
'purpose_home_improvement',
'home_ownership_RENT',
'application_type_JointApp', 'bad']

# Define predictors excluding the specified variables and the target


variable
predictors = [col for col in train_encoded.columns if col not in
excluded_variables]

# Store the target variable temporarily


OOT_bad = OOT_encoded['bad']

# Check if excluded variables are in OOT_encoded columns, if not


then add them
missing_cols = set(predictors) - set(OOT_encoded.columns)
for c in missing_cols:
OOT_encoded[c] = 0

# Ensure the order of column in the OOT set is the same order as in
train set
OOT_encoded = OOT_encoded[predictors]

# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded['bad'] = OOT_bad

# Split the data into training and validation sets (80/20 split)
using stratified sampling

X_train, X_val, y_train, y_val =


train_test_split(train_encoded[predictors], train_encoded['bad'],
test_size=0.2, random_state=42, stratify=train_encoded['bad'])

# Remove constant columns


X_train = X_train.loc[:, (X_train != X_train.iloc[0]).any()]

# Remove duplicate columns


X_train = X_train.loc[:,~X_train.columns.duplicated()]

# Combine predictors and target variable into one DataFrame for


undersampling
train_data = pd.concat([X_train, y_train], axis=1)

# Count the number of samples in the minority class


minority_class_count = train_data['bad'].value_counts().min()

# Perform undersampling on the majority class


undersampled_data = pd.concat([train_data[train_data['bad'] ==
label].sample(minority_class_count, random_state=42) for label in
train_data['bad'].unique()])

# Get the resampled predictors and target variable


X_train_resampled = undersampled_data[predictors]
y_train_resampled = undersampled_data['bad']

# Build the Decision Tree model


dtree = DecisionTreeClassifier(random_state=42)

# Define hyperparameters to tune


param_grid = {'max_depth': [None, 5, 10, 15, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 5]}

# Tune hyperparameters using GridSearchCV


grid_search = GridSearchCV(dtree, param_grid, cv=5)
grid_search.fit(X_train_resampled[predictors], y_train_resampled)

# Get the best model


best_model = grid_search.best_estimator_

# Apply the model to the OOT data set


OOT_predictions = best_model.predict(OOT_encoded[predictors])

# Create performance metrics using the function defined earlier


performance_metrics(OOT_encoded['bad'], OOT_predictions)

# Plot ROC curve


fpr, tpr, _ = roc_curve(OOT_encoded['bad'], OOT_predictions)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Program 5.5: R Decision Tree Script

# Import necessary libraries


library(rpart)
library(caret)
library(pROC)

# Load the data


train_encoded <- read.csv('/...INPUT YOUR FILE
PATHWAY.../train_encoded.csv')
OOT_encoded <- read.csv('/...INPUT YOUR FILE
PATHWAY.../oot_encoded.csv')

# Define the variables to exclude


excluded_variables <- c('id', 'bad', 'emp_length_3years',
'term_36months', 'grade_G', 'sub_grade_B4',
'verification_status_SourceVerifi',
'purpose_home_improvement',
'home_ownership_RENT',
'application_type_JointApp')

# Define predictors excluding the specified variables


predictors <- setdiff(colnames(train_encoded), excluded_variables)

# Check if predictors are in OOT_encoded columns, if not then add


them
missing_cols <- setdiff(predictors, colnames(OOT_encoded))
for (c in missing_cols) {

OOT_encoded[[c]] <- 0
}

# Store the target variable temporarily before reordering columns
OOT_bad <- OOT_encoded$bad

# Ensure the order of columns in the OOT set is in the same order as
# in the train set
OOT_encoded <- OOT_encoded[predictors]

# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded$bad <- OOT_bad

# Perform undersampling on the majority class


minority_class_count <- min(table(train_encoded$bad))
majority_class_samples <- train_encoded[train_encoded$bad ==
0, ][sample(minority_class_count), ]
minority_class_samples <- train_encoded[train_encoded$bad == 1, ]
undersampled_data <- rbind(majority_class_samples,
minority_class_samples)

# Get the resampled predictors and target variable


X_train_resampled <- undersampled_data[, predictors]
y_train_resampled <- undersampled_data[, "bad"]

# Check the frequency of each class in the resampled data


freq <- table(undersampled_data$bad)
print(freq)

# Build the Decision Tree model using rpart (recursive partitioning


for regression)
fitControl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(.cp = seq(0.01, 0.5, 0.01)) # complexity
parameter for rpart corresponds to ccp_alpha in sklearn's
DecisionTreeClassifier
dtree_grid_search <- train(x = X_train_resampled, y =
as.factor(y_train_resampled), method = "rpart", trControl =
fitControl, tuneGrid = grid)

# Get the best model


best_model <- dtree_grid_search$finalModel

# Apply the model to the OOT data set


OOT_predictions <- predict(best_model, newdata = OOT_encoded[,
predictors], type = "class")

# Create performance metrics using caret's confusionMatrix function



print(confusionMatrix(as.factor(OOT_predictions),
as.factor(OOT_encoded$bad)))

# Plot ROC curve using pROC package


roc_obj <- roc(OOT_encoded$bad,
as.numeric(as.character(OOT_predictions)))
plot(roc_obj, main="ROC Curve")
abline(a=0, b=1)

Program 5.6: SAS Decision Tree Script

/*LIBNAME james 'INPUT FILE PATHWAY';*/

/* Load the data */


DATA train_encoded;
SET james.train_encoded;
RUN;

DATA oot_encoded;
SET james.oot_encoded;
RUN;

/* Define the variables to exclude to avoid the dummy variable trap


*/
%LET excluded_variables = id bad emp_length_3years term_36months
grade_g sub_grade_b4
verification_status_sourceverifi purpose_home_improvement
home_ownership_rent application_type_jointapp;

/* Create global variable for predictors */


PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id bad
&excluded_variables.) OUT = var (KEEP = name); RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM
var; QUIT;
%PUT &predictors; RUN;

/* Variables in train_encoded but not in oot_encoded */


PROC SQL NOPRINT;
CREATE TABLE train_vars AS SELECT name FROM dictionary.columns
WHERE libname="WORK" AND memname="TRAIN_ENCODED";
CREATE TABLE oot_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="OOT_ENCODED";
CREATE TABLE in_train_not_oot AS
SELECT * FROM train_vars

EXCEPT
SELECT * FROM oot_vars;
QUIT;

/* Print the differences */


PROC PRINT DATA=in_train_not_oot; RUN;

/* Create final train and OOT data sets */


DATA train_encoded_final;
SET train_encoded (KEEP=&predictors bad);
RUN;

PROC FREQ DATA=train_encoded_final; TABLE bad; RUN;

/*Create missing variables from TRAIN data set*/


DATA oot_encoded_final;
SET oot_encoded ;
home_ownership_ANY = 0;
purpose_wedding = 0;
KEEP &predictors bad home_ownership_ANY purpose_wedding;
RUN;

PROC FREQ DATA=oot_encoded_final; TABLE bad; RUN;

/* Split the modeling data set by a 80/20 ratio using a random seed
*/
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2
SEED=42; RUN;

PROC FREQ DATA= class2; TABLES selected; RUN;

/* Create TRAIN and VAL data sets */


DATA train_data val_data;
SET class2 ;
IF selected = 1 THEN OUTPUT train_data; ELSE OUTPUT val_data;
RUN;

/* Assess target rate in train, val and OOT data sets */


PROC FREQ DATA = train_data; TABLES bad; RUN;
PROC FREQ DATA = val_data; TABLES bad; RUN;
PROC FREQ DATA = oot_encoded_final; TABLES bad; RUN;

/* Balance the data by undersampling the majority class */


PROC FREQ DATA=train_data ;

TABLES bad ;
RUN;

DATA pos neg;


SET train_data;
IF bad = 1 THEN OUTPUT pos; ELSE OUTPUT neg;
RUN;

DATA train_bal;
SET pos neg(obs=6901); /*Input the number of positive cases*/
RUN;

PROC FREQ DATA=train_bal; TABLES bad; RUN;

/*Create hyperparameter tuning macro that will loop through a range


of values for each of the tuning parameters */

/* Create an empty summary_table data set */


PROC DELETE DATA=summary_table;
DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;

%LET target = bad;

%MACRO tune(iteration, maxdepth, minleafsize);


%put "Iteration: &iteration";
%put "MAXDEPTH: &maxdepth";
%put "MINLEAFSIZE: &minleafsize";

/* Perform decision tree with hyperparameter tuning */


PROC HPSPLIT DATA=train_bal
MAXDEPTH = &maxdepth
MINLEAFSIZE = &minleafsize;
INPUT &predictors / LEVEL=INTERVAL;
TARGET bad / LEVEL=NOMINAL;
PRUNE COSTCOMPLEXITY;
code file="/OneDrive/Documents/DS_Project/decision_tree.sas";
RUN;

data test_score_&iteration.;
set val_data;
%include '/OneDrive/Documents/DS_Project/decision_tree.sas';

keep id &target. P_bad1;


run;

/* Calculate AUC value on validation data set */


proc logistic data = test_score_&iteration.;
class &target.;
model &target. = P_bad1 / outroc=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
run;

data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;

/* Append the results to a summary table */


proc append base=summary_table data=outstat;
run;

%mend;

/* Call the macro with different hyperparameters */


%tune(1, 5, 1);
%tune(2, 5, 2);
%tune(3, 5, 3);
%tune(4, 10, 1);
%tune(5, 10, 2);
%tune(6, 10, 3);
%tune(7, 20, 1);
%tune(8, 20, 2);
%tune(9, 20, 3);

PROC SORT DATA=summary_table; BY DESCENDING Area; RUN;

/*Apply optimal hyperparameters to the OOT data set for final


evaluation*/

PROC HPSPLIT DATA=train_bal


MAXDEPTH = 5 /*Input optimal hyperparameter value*/
MINLEAFSIZE = 1; /*Input optimal hyperparameter value*/
INPUT &predictors / LEVEL=INTERVAL;

TARGET bad / LEVEL=NOMINAL;


PRUNE COSTCOMPLEXITY;
code file="/OneDrive/Documents/DS_Project/decision_tree.sas";
RUN;

/* Apply the final model to the OOT data set */


DATA oot_score;
SET oot_encoded;
%INCLUDE '/OneDrive/Documents/DS_Project/decision_tree.sas';
KEEP id &target. P_bad1;
RUN;

/* Calculate AUC value on validation data set */


PROC LOGISTIC DATA = oot_score;
CLASS &target.;
MODEL &target. = P_bad1 / OUTROC=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
RUN;

Here is a step-by-step description of the decision tree workflow code:

1. Import necessary libraries: The code begins by importing necessary libraries such as pandas, sklearn, and matplotlib. SAS does not require additional libraries.

2. Function to calculate performance metrics: A function named performance_metrics is defined to print the classification report, confusion matrix, and ROC-AUC score.

3. Load the data: The data sets train_encoded and OOT_encoded are loaded
from the specified file paths.

4. Define the variables to exclude: To protect against the dummy variable trap, a list of variables to exclude from the analysis is defined.

5. Define predictors: The predictors are all columns in train_encoded that are
not in the excluded variables list.

6. Check for missing columns in OOT_encoded: If any predictors are missing in the OOT_encoded data set, they are added with a value of 0.

7. Split the data into training and validation sets: The train_encoded data set
is split into training and validation sets using an 80/20 split with stratified
sampling.
8. Remove constant and duplicate columns: Any constant or duplicate
columns in the training set are removed.

9. Combine predictors and target variable for undersampling: The predictors and target variable are combined into one DataFrame for undersampling.

10. Perform undersampling on the majority class: Undersampling is performed on the majority class to balance the classes in the training data.

11. Build the Decision Tree model: A Decision Tree model is initialized with a
random state of 42.

12. Define hyperparameters to tune: A dictionary of hyperparameters to tune is defined for use in a Grid Search.

13. Tune hyperparameters using Grid Search: Grid Search finds the best
hyperparameters for the Decision Tree model.

14. Get the best model: The best model from Grid Search is stored
in best_model.

15. Apply the model to the OOT data set: The best model is used to predict
the OOT_encoded data set.

16. Create final performance metrics: Develop an AUC and ROC chart.

This workflow provides a comprehensive guide to building a decision tree model, from data loading and preprocessing to model evaluation. It’s important to remember that each step may require additional fine-tuning depending on the specifics of your data set and problem statement.

Chapter 5: Predictive Modeling Part I – Foundation – Conclusion and Transition

In this chapter, we delved into the foundational techniques of predictive modeling, exploring key algorithms like linear regression and logistic regression. These models
are the cornerstone of predictive analytics, providing you with the essential tools to
make accurate predictions and understand relationships within your data. By
mastering these techniques, you've laid a solid groundwork for building more
sophisticated models.

With a strong grasp of these foundational models, you are now ready to explore
more advanced techniques that can significantly enhance model performance. In
the next chapter, we’ll dive into ensemble methods – a powerful approach that
combines multiple models to create a stronger, more robust predictive model.
Ensemble methods like bagging, boosting, and random forests can help you
overcome the limitations of individual models by leveraging the strengths of
multiple algorithms.

In Chapter 6, you will learn how to implement these ensemble methods across SAS,
Python, and R. We’ll discuss how they work, why they’re effective, and how they
can be tuned to achieve optimal performance. By the end of the chapter, you’ll have
a comprehensive understanding of how to use ensemble techniques to take your
predictive modeling to the next level.

So, with your foundation firmly in place, let’s move on to Chapter 6 and explore the
world of ensemble methods, where the power of combining models can lead to
more accurate and reliable predictions.

Chapter 5 Summary: Predictive Modeling Part I – Foundation

1. Introduction to Predictive Modeling

● Overview: This chapter lays the groundwork for understanding predictive modeling in data science, focusing on regression and classification models.
These models are essential for making data-driven decisions and predicting
outcomes across various domains. The chapter provides a comprehensive
introduction to the foundational models, including linear regression, logistic
regression, and decision trees, each of which plays a critical role in different
types of predictive tasks.

2. Modeling Data

● Importance of Modeling Data: The chapter emphasizes the significance of the modeling data set, which is derived from rigorous preprocessing. The
quality of this data set directly impacts the accuracy and reliability of
predictive models. The chapter introduces the concept of a model pipeline,
a systematic series of data transformations designed to prepare raw data
for modeling. Key steps include outlier detection, missing value imputation,
dummy variable creation, and feature engineering.

3. Machine Learning Models

● Regression vs. Classification: The chapter distinguishes between regression and classification models, explaining that the structure of the target variable
determines the choice of model. The chapter provides an overview of
different machine learning algorithms, including logistic regression, decision
trees, random forests, gradient boosting machines, support vector
machines, and neural networks. Each algorithm is discussed in terms of its
application to binary classification tasks.

4. Logistic Regression

● Concept and Application: Logistic regression is introduced as a simple yet powerful algorithm for binary classification tasks. The chapter explains the
key components of the logistic regression equation, including the logistic
function, odds ratio, and link function. The step-by-step process of training a

logistic regression model, including parameter initialization, loss computation, and gradient descent, is detailed.

● Assumptions and Interpretation: The chapter discusses the assumptions underlying logistic regression, such as the linearity of independent variables
and log odds, independence of errors, absence of multicollinearity, and the
need for a large sample size. The interpretation of logistic regression
coefficients as odds ratios is also covered.

5. Decision Trees

● Introduction and Structure: Decision trees are introduced as non-parametric models that offer flexibility in handling complex data. The
chapter explains the structure of decision trees, including nodes, branches,
and leaves, and how these elements represent decisions and outcomes.

● Comparison with Logistic Regression: The chapter compares logistic regression and decision trees, highlighting their differences in handling
issues such as multicollinearity, outliers, non-linearity, categorical variables,
missing values, and feature selection.

● Feature Selection and Tree Construction: The chapter introduces key concepts in decision tree construction, such as impurity, information gain,
entropy, and Gini impurity. The process of constructing a decision tree is
outlined, including selecting the root node, creating child nodes, recursive
splitting, and pruning.

6. Pruning and Performance Metrics

● Pruning Techniques: The chapter discusses pruning methods, such as reduced error pruning and cost complexity pruning, which help prevent
overfitting in decision trees.

● Evaluating Model Performance: The chapter introduces performance metrics for decision trees, including accuracy, precision, recall, F1 score,
ROC-AUC, and the Gini coefficient. These metrics are crucial for assessing
the effectiveness of a decision tree model.

7. Hyperparameter Tuning

● Importance of Tuning: The chapter emphasizes the importance of hyperparameter tuning in optimizing machine learning models. Various
approaches to hyperparameter tuning, including grid search and random
search, are discussed, along with the impact of tuning on model
performance.

8. Pros and Cons of Logistic Regression and Decision Trees

● Logistic Regression:

o Pros: Easy to implement, interpretable, handles both continuous and categorical variables.

o Cons: Requires large sample sizes, sensitive to multicollinearity.

● Decision Trees:

o Pros: Interpretable, handles non-linearity, robust to outliers.

o Cons: Prone to overfitting, sensitive to small changes in data.



Chapter 5 Quiz

Questions:
1. What is the primary difference between regression and classification
models in predictive modeling?

2. How does the quality of the modeling data set impact the accuracy of
predictive models?

3. What are the key steps involved in the model pipeline for data preparation?

4. Explain the difference between linear regression and logistic regression.

5. What role does the logistic function play in logistic regression?

6. How is the odds ratio interpreted in the context of logistic regression?

7. What assumptions must be met for logistic regression to be effective?

8. What is the significance of the logit link function in logistic regression?

9. How do decision trees differ from logistic regression in handling multicollinearity?

10. What is impurity, and how is it used in constructing decision trees?

11. Describe the process of creating child nodes in a decision tree.

12. What is the purpose of pruning in decision trees?

13. How is the ROC-AUC metric used to evaluate the performance of a decision
tree model?

14. What are the advantages of using decision trees over logistic regression?

15. How does feature selection occur inherently in decision trees?

16. What is the difference between reduced error pruning and cost complexity
pruning?

17. Explain the concept of hyperparameter tuning and its importance in machine learning.

18. How can overfitting be prevented in decision trees?

19. What are the key performance metrics used to evaluate a decision tree
model?

20. What are the pros and cons of using logistic regression versus decision trees
for predictive modeling?

Chapter 5 Cheat Sheet

Logistic Regression
  SAS: PROC LOGISTIC for binary classification; handles categorical variables with the CLASS statement
  Python: LogisticRegression from sklearn for binary classification; use OneHotEncoder for categorical variables
  R: glm() function with family = binomial for logistic regression; caret package for one-hot encoding

Decision Trees
  SAS: PROC HPSPLIT for decision tree modeling; supports pruning and tuning with the PRUNE statement
  Python: DecisionTreeClassifier from sklearn.tree; use GridSearchCV for hyperparameter tuning
  R: rpart package for decision tree modeling; caret package for hyperparameter tuning

Feature Selection
  SAS: PROC REG and VIF for detecting multicollinearity; use stepwise regression for feature selection
  Python: SelectKBest and RFE from sklearn.feature_selection for feature selection
  R: stepAIC from the MASS package for stepwise selection; use vif() from the car package to check multicollinearity

Hyperparameter Tuning
  SAS: use PROC HPSPLIT with MAXDEPTH and MINLEAFSIZE for tuning decision trees
  Python: GridSearchCV or RandomizedSearchCV for tuning logistic regression and decision trees
  R: trainControl and train from caret for hyperparameter tuning of decision trees

Pruning
  SAS: PROC HPSPLIT supports cost complexity pruning with the ALPHA parameter
  Python: DecisionTreeClassifier supports pruning with ccp_alpha
  R: rpart uses cp (complexity parameter) for pruning

Performance Metrics
  SAS: PROC LOGISTIC for ROC-AUC and Gini coefficient; use PROC FREQ for confusion matrix
  Python: roc_auc_score and roc_curve from sklearn.metrics for ROC-AUC; confusion_matrix for confusion matrix
  R: pROC package for ROC-AUC and Gini coefficient; confusionMatrix from caret for confusion matrix

Chapter 6: Predictive Modeling Part II – Ensemble Methods

Overview

Decision trees are fundamental building blocks in predictive modeling, laying the
foundation for powerful ensemble methods. This chapter explores the intricacies of
ensemble methods, where the synergistic combination of predictive models takes
center stage.

Decision trees, with their inherent capability to capture intricate patterns, form the
cornerstone upon which advanced techniques like random forest and gradient
boosting thrive. Our exploration commences with random forest, an ensemble of
decision trees, each contributing its unique perspective to a collective prediction.
This collaborative approach enhances accuracy and mitigates the overfitting risks
often encountered with standalone models.

Next, we delve into the domain of gradient boosting, a technique that meticulously
constructs a sequence of trees, each rectifying the errors of its predecessor. This
iterative process culminates in a model of exceptional predictive power, making
gradient boosting a trusted tool for data scientists.

While ensemble methods offer unparalleled predictive strength, their black-box nature, whose internal workings are opaque and challenging to decipher, presents a
limitation. Understanding the inner mechanisms of these methods is paramount for
responsible and transparent data science. In this chapter, we'll dissect the intricacies
of random forest and gradient boosting, demystifying the allure of black-box
models. This journey into the world of ensemble methods will unlock the secrets
behind their predictive prowess.

Random Forest
Random forests are an ensemble learning method that operates by constructing multiple decision trees at training time and outputting the class, which is the mode of the classes (classification) or mean prediction (regression) of the individual trees. They correct for decision trees’ habit of overfitting to their training set, a weakness
attributed to the “greedy algorithm” nature of decision trees. This means that
decision trees select the best variable and split point at each stage of the tree
development. Still, they never look forward or backward to determine whether
alternative splits would have produced a better overall outcome. This process leads
to a static tree that is not guaranteed to be the best possible tree-based approach.

The random forest algorithm attempts to mitigate decision trees' inherent weakness. It is an ensemble approach that constructs several different decision
trees and averages the results together. It is the algorithmic equivalent of “wisdom
of the crowds.”

Step-by-Step Construction of a Random Forest Model

The construction of a random forest model involves several steps, which are
detailed below:

1. Bootstrap Data Set Construction: Begin by building a bootstrapped data set. This involves randomly selecting predictors and observations from your
base data set. The number of predictors chosen is typically the square root
of your data set's total number of predictors (though this can be adjusted
during hyperparameter tuning). The number of observations selected
should equal the number in your original data set, and selection is done
with replacement, meaning an observation can be chosen more than once.

2. Individual Tree Construction: Once your bootstrapped data set has been
constructed, use it to build an individual decision tree. Start by randomly
selecting a subset of available predictors to evaluate for your root node.
After determining your root node, select random subsets of variables at
each step to construct the rest of your tree.

3. Repeat Tree Construction: Repeat step 2 hundreds or even thousands of times. Because each tree is constructed using a different bootstrapped data
set and different random subsets of predictors at each step, you’ll end up
with a wide variety of different trees.

4. Final Model Construction: Your final random forest model is an ensemble of all the individual trees you’ve constructed. When making predictions, each
individual tree in your “forest” gets a vote, and your final prediction is based
on majority voting for classification problems or averaging for regression
problems.

These steps make random forests a powerful tool for many machine learning tasks.
Compared to single decision trees, they provide higher accuracy and better control
over overfitting.
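
The sketch below illustrates step 4: each tree in the forest casts a vote, and the ensemble prediction reflects those votes. (scikit-learn actually averages the trees' predicted probabilities, which usually coincides with a hard majority vote.) The data set is synthetic and purely illustrative.

# Minimal sketch: comparing individual tree votes to the forest's prediction
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
rf = RandomForestClassifier(n_estimators=101, max_features="sqrt",
                            random_state=42).fit(X, y)

# Collect each tree's vote for the first five observations
votes = np.array([tree.predict(X[:5]) for tree in rf.estimators_])
majority_vote = (votes.mean(axis=0) > 0.5).astype(int)

print("Majority vote of the trees:", majority_vote)
print("Forest prediction         :", rf.predict(X[:5]))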

Comparing Decision Trees and Random Forests


Decision trees and random forests are tree-based models in machine learning.
While they share some similarities, they also have significant differences. Here’s a
comparison table that shows how decision trees and random forests handle various
modeling issues:

Table 6.1: Comparing Decision Trees and Random Forest Models

Multicollinearity
  Decision Trees: Not sensitive. Can handle multicollinearity well.
  Random Forests: Not sensitive due to randomness and averaging of results from multiple trees.

Overfitting
  Decision Trees: Prone to overfitting. Requires pruning to prevent over-complex trees.
  Random Forests: Less prone due to the ensemble method, which averages multiple decision trees to reduce variance.

Interpretability
  Decision Trees: High interpretability. Each path in the tree represents a decision rule.
  Random Forests: Lower interpretability due to complexity from multiple trees, but variable importance can be measured.

Performance
  Decision Trees: Lower performance for complex data sets with many features or high cardinality categorical variables.
  Random Forests: Higher performance due to the ability to model complex patterns using multiple trees.

Training Speed
  Decision Trees: Faster as it involves building a single tree.
  Random Forests: Slower due to building multiple trees, but can be parallelized across multiple CPUs for faster training.

Data Usage
  Decision Trees: Requires explicit data splitting into separate training and validation data sets.
  Random Forests: More efficient data usage due to the Out-of-Bag (OOB) metric, which eliminates the need for explicit data splitting, allowing more data to be used in training and validation.

These differences make random forests a powerful tool for many machine learning
tasks, providing higher accuracy and better control over overfitting than single
decision trees.

The Role of the Out-of-Bag (OOB) Metric in Random Forests

Random forests, a versatile ensemble learning technique, have significantly evolved the landscape of decision tree-based models. They address fundamental limitations
in decision trees, notably their susceptibility to overfitting and instability. The Out-
of-Bag (OOB) metric is a pivotal feature in this context.

● Eliminating the Need for Explicit Data Splitting: Unlike traditional machine
learning models that require division of data sets into training and validation
sets, random forests simplify this process. Each decision tree in the forest
creates its training data set through bootstrapping, effectively leaving out a
portion of the data points. This omitted data acts as an internal validation
set, eliminating the requirement for a predefined validation set.

● Robust Model Performance Evaluation: The OOB metric harnesses the power of these internal validation sets. During tree construction, each data point is assigned an OOB prediction by the trees not included in their training. This creates a set of OOB predictions for the entire data set.
Aggregating these predictions across all trees yields the OOB error, a robust
measure of the model’s performance.

● Utilizing More Data: One significant advantage of the OOB approach is its
efficiency in data usage. By eliminating the need for data splitting, random forests
allow you to maximize the utility of your available data. This can be especially
advantageous when data is limited, ensuring that every data point contributes to
model training and validation.
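
A brief Python sketch of this idea: setting oob_score=True in scikit-learn's RandomForestClassifier asks each tree to be evaluated on the observations left out of its bootstrap sample, so no separate validation split is required. The data here is synthetic and purely illustrative.

# Minimal sketch: the Out-of-Bag (OOB) estimate in scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=42)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42)
rf.fit(X, y)
print("OOB accuracy estimate:", round(rf.oob_score_, 4))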

Feature Selection Methods

In the context of random forests, feature selection is an inherent part of the model
building process. A random subset of features is considered for splitting at each
node in each decision tree within the forest. The feature that provides the highest
information gain (for classification) or reduction in variance (for regression) is
selected. This randomness in feature selection contributes to the robustness of the
random forest model.

Both Gini impurity and entropy can be used as criteria for feature selection in a
random forest model. The choice between them depends on the specific problem
and the data at hand. Gini impurity measures the degree or probability of a
particular variable being wrongly classified when it is randomly chosen. On the other
hand, entropy measures the purity, or randomness, of the input set. These metrics
are explained in more detail in the decision tree section of this chapter.

In terms of specific feature selection methods, apart from the inherent feature
selection that happens at each node, there are also methods like permutation
importance and Mean Decrease Impurity (MDI) that can be used. Permutation
importance is calculated by permuting the values of each feature one by one and
measuring the decrease in accuracy. MDI, on the other hand, computes the total
reduction of the criterion brought by that feature. These methods provide a global
view of feature importance across all the trees.
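
The sketch below contrasts the two global importance measures just described: the impurity-based importances (MDI) stored on a fitted forest and permutation importance computed on held-out data. Again, the data set is a synthetic stand-in.

# Minimal sketch: MDI importances vs. permutation importances
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("MDI importances        :", rf.feature_importances_.round(3))

perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
print("Permutation importances:", perm.importances_mean.round(3))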

Pruning Random Forests

Pruning is not typically used with random forests. This is because the ensemble
method of combining a multitude of decision trees tends to protect against
overfitting. Each tree in a random forest is allowed to grow with high complexity, which means the trees can adapt to complex patterns in the data and even noise.
The final prediction, which is an average of predictions from all trees, will usually
not overfit as long as there are enough trees in the forest.

Performance Metrics

Performance metrics for random forests are similar to those used for decision trees.
For classification problems, these can include accuracy, precision, recall, F1 score,
and area under the ROC curve (AUC-ROC). For regression problems, mean absolute
error (MAE), mean squared error (MSE), or R-squared might be used. These metrics
are explained in more detail in the decision tree section of this chapter.

Pros and Cons

Random forests have several advantages:

● They can handle both numerical and categorical data.


● They can handle missing values by either skipping splits on missing values or
imputing them.
● They provide feature importance scores.
● They are less likely to overfit than individual decision trees.

However, they also have some disadvantages:

● They are more complex and computationally intensive than individual decision trees.
● They are less interpretable than individual decision trees due to their
ensemble nature.

Best Practices

When using random forests:

● Use OOB error as an efficient estimate of the test error.


● Tune hyperparameters such as the number of trees in the forest and the
number of features considered at each split.
● Balance your data set if dealing with imbalanced classes, either by
undersampling, oversampling, or generating synthetic samples.

● Use feature importance scores for feature selection or to gain insights about
the data.
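
For the class-balancing recommendation above, one lightweight alternative to resampling is to reweight the classes inside the forest itself. The sketch below uses scikit-learn's class_weight="balanced" option on a synthetic, imbalanced stand-in data set.

# Minimal sketch: handling class imbalance with class weights instead of resampling
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            oob_score=True, random_state=42).fit(X, y)
print("OOB score with balanced class weights:", round(rf.oob_score_, 4))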

Random Forest Model Code

The following random forest code is a comprehensive workflow for creating a random forest model, tuning its hyperparameters, applying the model to an out-of-
time (OOT) data set, and generating final model performance metrics on the scored
OOT data set. This workflow is designed to be efficient and robust, leveraging the
power of the random forest algorithm and the flexibility of each programming
language’s modeling library.

The steps removed from the original decision tree model workflow include
removing excluded variables and balancing the data. These steps are unnecessary
for a random forest model, as it inherently handles multicollinearity and imbalance
in the data. The random forest model also considers all variables when building the
trees, so there is no need to manually exclude any variables.

Program 6.1: Python Random Forest Script


# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix,
roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Function to calculate performance metrics


def performance_metrics(y_true, y_pred):
print("Classification Report:")
print(classification_report(y_true, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print("ROC AUC Score:")
print(roc_auc_score(y_true, y_pred))

# Load the data


train_encoded = pd.read_csv('/...INPUT YOUR FILE
PATHWAY.../train_encoded.csv')

OOT_encoded = pd.read_csv('/...INPUT YOUR FILE


PATHWAY.../oot_encoded.csv')

# Define predictors excluding the specified variables and the target


variable
predictors = [col for col in train_encoded.columns if col != 'bad']

# Check if excluded variables are in OOT_encoded columns, if not


then add them
missing_cols = set(predictors) - set(OOT_encoded.columns)
for c in missing_cols:
OOT_encoded[c] = 0

# Split the data into training and validation sets (80/20 split)
using stratified sampling
X_train, X_val, y_train, y_val =
train_test_split(train_encoded[predictors], train_encoded['bad'],
test_size=0.2, random_state=42, stratify=train_encoded['bad'])

# Build the Random Forest model


rf = RandomForestClassifier(random_state=42)

# Define hyperparameters to tune


param_grid = {'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]}

# Tune hyperparameters using GridSearchCV


grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Get the best model


best_model = grid_search.best_estimator_

# Apply the model to the OOT data set


OOT_predictions = best_model.predict(OOT_encoded[predictors])

# Get the probabilities of the positive class


OOT_probabilities =
best_model.predict_proba(OOT_encoded[predictors])[:, 1]

# Define your threshold


threshold = 0.2

# Apply threshold to get predictions


OOT_predictions = (OOT_probabilities > threshold).astype(int)

# Now you can evaluate your model using these predictions


performance_metrics(OOT_encoded['bad'], OOT_predictions)

# Plot ROC curve


fpr, tpr, _ = roc_curve(OOT_encoded['bad'], OOT_predictions)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Program 6.2: R Random Forest Script


# Import necessary libraries
library(randomForest)
library(caret)
library(pROC)

# Load the data


train_encoded <- read.csv("/...INPUT YOUR FILE
PATHWAY.../train_encoded.csv")
OOT_encoded <- read.csv("/...INPUT YOUR FILE
PATHWAY.../oot_encoded.csv")

# Define predictors excluding the specified variables and the target


variable
predictors <- setdiff(names(train_encoded), "bad")

# Check if excluded variables are in OOT_encoded columns, if not


then add them
missing_cols <- setdiff(predictors, names(OOT_encoded))
OOT_encoded[missing_cols] <- 0

# Split the data into training and validation sets (80/20 split)
using stratified sampling
trainIndex <- createDataPartition(train_encoded$bad, p = .8, list =
FALSE, times = 1)
X_train <- train_encoded[trainIndex, predictors]

y_train <- train_encoded[trainIndex, "bad"]


X_val <- train_encoded[-trainIndex, predictors]
y_val <- train_encoded[-trainIndex, "bad"]

# Build the Random Forest model


rf <- randomForest(x = X_train, y = as.factor(y_train), ntree = 100)

# Define hyperparameters to tune


tuneGrid <- expand.grid(.mtry=c(2, sqrt(ncol(X_train)),
ncol(X_train)))

# Tune hyperparameters using caret's train function


control <- trainControl(method="cv", number=5)
tuned_rf <- train(x=X_train, y=as.factor(y_train), method="rf",
tuneGrid=tuneGrid, trControl=control)  # factor target so caret fits a classifier

# Get the best model


best_model <- tuned_rf$finalModel

# Apply the model to the OOT data set to get probabilities


OOT_probabilities <- predict(best_model,
newdata=OOT_encoded[predictors], type="prob")

# Define your threshold


threshold <- 0.2

# Apply threshold to get predictions


OOT_predictions <- ifelse(OOT_probabilities[,2] > threshold, 1, 0)

# Create performance metrics using caret's confusionMatrix function


print(confusionMatrix(data = as.factor(OOT_predictions), reference =
as.factor(OOT_encoded$bad)))

# Plot ROC curve


roc_obj <- roc(response=OOT_encoded$bad,
predictor=OOT_probabilities[,2])
plot(roc_obj)

Program 6.3: SAS Random Forest Script


/*LIBNAME james 'INPUT FILE PATHWAY';*/

/* Load the data */



DATA train_encoded;
SET james.train_encoded;
RUN;

DATA oot_encoded;
SET james.oot_encoded;
RUN;

/* Create global variable for predictors */


PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id bad) OUT = var (KEEP = name);
RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM var; QUIT;
%PUT &predictors; RUN;

/* Variables in train_encoded but not in oot_encoded */


PROC SQL NOPRINT;
CREATE TABLE train_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="TRAIN_ENCODED";
CREATE TABLE oot_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="OOT_ENCODED";
CREATE TABLE in_train_not_oot AS
SELECT * FROM train_vars
EXCEPT
SELECT * FROM oot_vars;
QUIT;

/* Print the differences */


PROC PRINT DATA=in_train_not_oot; RUN;

/* Create final train and OOT data sets */


DATA train_encoded_final;
SET train_encoded (KEEP=&predictors bad);
RUN;

PROC FREQ DATA=train_encoded_final; TABLE bad; RUN;

/*Create missing variables from TRAIN data set*/


DATA oot_encoded_final;
SET oot_encoded ;
home_ownership_ANY = 0;
purpose_wedding = 0;
KEEP &predictors bad home_ownership_ANY purpose_wedding;

RUN;

PROC FREQ DATA=oot_encoded_final; TABLE bad; RUN;


/* Split the modeling data set by a 80/20 ratio using a random seed */
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2 SEED=42;
RUN;

PROC FREQ DATA= class2; TABLES selected; RUN;

/* Create TRAIN and VAL data sets */


DATA train_data val_data;
SET class2 ;
IF selected = 1 THEN OUTPUT train_data; ELSE OUTPUT val_data;
RUN;

/* Assess target rate in train, val and OOT data sets */


PROC FREQ DATA = train_data; TABLES bad; RUN;
PROC FREQ DATA = val_data; TABLES bad; RUN;
PROC FREQ DATA = oot_encoded_final; TABLES bad; RUN;

/*Create hyperparameter tuning macro that will loop through a range of values for each of
the tuning parameters */

/* Create an empty summary_table data set */


PROC DELETE DATA=summary_table;

DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;

%LET target = bad;

%MACRO tune(iteration, maxtrees, maxdepth, splitsize);


%put "Iteration: &iteration";
%put "MAXTREES: &maxtrees";
%put "MAXDEPTH: &maxdepth";
%put "SPLITSIZE: &splitsize";

/* Create Random Forest with hyperparameter tuning */


PROC HPFOREST DATA=train_data
MAXTREES = &maxtrees.

MAXDEPTH = &maxdepth.
SPLITSIZE = &splitsize.;
TARGET &target. /LEVEL=binary;
INPUT &predictors. / LEVEL=interval;
SAVE FILE = '/OneDrive/Documents/DS_Project/random_forest.sas';
RUN;

PROC HP4SCORE DATA = val_data;


SCORE FILE = '/OneDrive/Documents/DS_Project/random_forest.sas'
OUT = test_score_&iteration. (keep = &target. P_bad1);
RUN;

/* Calculate AUC value on validation data set */


proc logistic data = test_score_&iteration.;
class &target.;
model &target. = P_bad1 / outroc=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
run;

data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;

/* Append the results to a summary table */


proc append base=summary_table data=outstat;
run;

%mend;

/* Call the macro with different hyperparameters */


%tune(1, 100, 5, 2);
%tune(2, 200, 5, 2);
%tune(3, 300, 5, 2);
%tune(4, 100, 10, 2);
%tune(5, 200, 10, 2);
%tune(6, 300, 10, 2);
%tune(7, 100, 15, 2);
%tune(8, 200, 15, 2);

%tune(9, 300, 15, 2);


%tune(10, 100, 5, 5);
%tune(11, 200, 5, 5);
%tune(12, 300, 5, 5);
%tune(13, 100, 10, 5);
%tune(14, 200, 10, 5);
%tune(15, 300, 10, 5);
%tune(16, 100, 15, 5);
%tune(17, 200, 15, 5);
%tune(18, 300, 15 ,5);
%tune(19 ,100 ,5 ,10);
%tune(20 ,200 ,5 ,10);
%tune(21 ,300 ,5 ,10);
%tune(22 ,100 ,10 ,10);
%tune(23 ,200 ,10 ,10);
%tune(24 ,300 ,10 ,10);
%tune(25 ,100 ,15 ,10);
%tune(26 ,200 ,15 ,10);
%tune(27 ,300 ,15 ,10);

PROC SORT DATA=summary_table; BY DESCENDING Area; RUN;

/*Apply optimal hyperparameters to the OOT data set for final evaluation*/

/* Create Random Forest with optimal hyperparameter values */


PROC HPFOREST DATA=train_data
MAXTREES = 200 /*Input optimal hyperparameter value*/
MAXDEPTH = 15 /*Input optimal hyperparameter value*/
SPLITSIZE = 5; /*Input optimal hyperparameter value*/
TARGET &target. /LEVEL=binary;
INPUT &predictors. / LEVEL=interval;
SAVE FILE = '/OneDrive/Documents/DS_Project/random_forest.sas';
RUN;

PROC HP4SCORE DATA = oot_encoded_final;


SCORE FILE = '/OneDrive/Documents/DS_Project/random_forest.sas'
OUT = oot_scored (keep = &target. P_bad1);
RUN;

/* Calculate AUC value on OOT data set */


PROC LOGISTIC DATA = oot_scored;
CLASS &target.;

MODEL &target. = P_bad1 / OUTROC=ROC;


ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;

RUN;

Here is a step-by-step description of the random forest workflow:

1. Import necessary libraries: The code begins by importing the necessary libraries. These include pandas for data manipulation and sklearn or caret for machine learning. SAS does not require loading additional libraries.

2. Define performance metrics function: A function is defined to calculate and print the model's performance metrics. These metrics include a
classification report, confusion matrix, and ROC-AUC score.

3. Load the data: The training and OOT data sets are loaded from CSV files.

4. Define predictors: The predictors for the model are defined as all columns
in the training data set except for the target variable “bad”.

5. Split the data: The training data is split into training and validation sets
using stratified sampling.

6. Build the Random Forest model: A random forest model is initialized.

7. Define hyperparameters to tune: The hyperparameters to be tuned are defined. These include the number of trees in the forest (n_estimators), the
maximum depth of the trees (max_depth), and the minimum number of
samples required to split an internal node (min_samples_split).

8. Tune hyperparameters using Grid Search: Grid search is used to tune the
hyperparameters of the random forest model. It fits the model to the
training data and finds the best hyperparameters.

9. Apply the model to the OOT data set: The best model is applied to the OOT
data set to make predictions.

10. Create performance metrics: The performance metrics function calculates


and prints the model's performance metrics on the OOT data set.

11. Plot ROC curve: Finally, an ROC curve is plotted to visualize the model's
performance.

Gradient Boosting

Gradient boosting is a powerful machine learning algorithm that builds on the


concept of decision trees, evolving it into a more robust and accurate method. It
operates by constructing a sequence of decision trees, where each subsequent tree
is built to correct the errors of its predecessor. This iterative approach allows the
model to learn from its mistakes, leading to a model that can generalize well to
unseen data. Unlike random forests, which build each tree independently, gradient
boosting builds trees in a sequential manner, with each tree learning from the
mistakes of the previous ones.

The gradient boosting algorithm is an evolution of the decision tree model. It


leverages the power of ensemble learning, where multiple weak learners (in this
case, decision trees) are combined to create a strong learner. The “gradient” in
gradient boosting refers to using gradient descent, an optimization algorithm that
minimizes the loss function. By using gradient descent, the model can iteratively
learn from its errors and improve its predictions.

Step-by-Step Construction of a Gradient Boosting Model

The construction of a gradient boosting model involves several steps, which are
detailed below:

1. Initialize the Model: Start with a base model that makes a single prediction
for all observations in your data set. This prediction could be the mean (for
regression problems) or the mode (for classification problems) of the target
variable.

2. Calculate Residuals: Calculate the residuals, which are the differences


between the observed and predicted values of the target variable.

3. Fit a Decision Tree: Fit a decision tree to the residuals from step 2. The goal
here is to predict the residuals, not the actual target variable.

4. Update Predictions: Update your predictions by adding a fraction of the


predictions made by the decision tree in step 3 to your previous predictions.

5. Repeat Steps 2–4: Repeat steps 2–4 for a specified number of iterations.
Each iteration fits a new decision tree to the residuals of the current
predictions and updates the predictions.

6. Make Final Predictions: The final gradient boosting model is an ensemble of


all the individual trees constructed during each iteration. When making
predictions, it sums up the predictions made by each individual tree.

These steps make gradient boosting a formidable tool for many machine learning
tasks. Compared to single decision trees or even random forests, it provides higher
accuracy and better control over overfitting.
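
To make these steps concrete, the following short sketch implements the boosting loop from scratch for a regression target using scikit-learn's DecisionTreeRegressor. It is a minimal illustration under stated assumptions, not the book's project code: it assumes a numeric feature matrix X and target array y are already defined, and the helper names simple_gradient_boost and boost_predict are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Step 1: initialize with a constant prediction (the mean of the target)
    base_prediction = np.mean(y)
    predictions = np.full(len(y), base_prediction)
    trees = []
    for _ in range(n_trees):
        # Step 2: residuals between observed values and current predictions
        residuals = y - predictions
        # Step 3: fit a shallow tree to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 4: update predictions by a fraction (the learning rate) of the tree output
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)
    # Step 5 is the loop itself; return the pieces needed for step 6
    return base_prediction, trees

def boost_predict(X, base_prediction, trees, learning_rate=0.1):
    # Step 6: final prediction is the base value plus the scaled sum of all trees
    pred = np.full(X.shape[0], base_prediction)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred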

Different Types of Gradient Boosting Models

Gradient boosting is a robust machine learning algorithm that’s been adapted into
several different models, each with its own strengths and unique features. Here are
a few popular types of gradient-boosting models:

● XGBoost: Short for eXtreme Gradient Boosting, XGBoost is one of the most
popular gradient boosting models due to its speed and performance. It’s
known for its scalability in all scenarios and supports parallel processing.
XGBoost also includes a feature that allows cross-validation at each iteration
of the boosting process.

● LightGBM: Developed by Microsoft, LightGBM (Light Gradient Boosting
Machine) is another fast, high-performance gradient boosting model. It’s
known for its efficiency and speed and is especially good at handling large
data sets. LightGBM uses a novel technique called Gradient-based One-Side
Sampling (GOSS) to filter out the data instances when finding a split value.

● CatBoost: Developed by Yandex, CatBoost (short for Categorical Boosting) is


a gradient boosting model designed to handle categorical features better
than other models. It converts categorical values into numbers using various
statistics on combinations of categorical features and combinations of
categorical and numerical features.

Each of these models has its own set of hyperparameters to tune and may perform
differently depending on the specifics of your data and problem. It’s always a good
idea to try different models and see which works best for your specific use case.
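
As a quick illustration, the snippet below shows roughly equivalent classifier objects from each of the three libraries. It assumes the xgboost, lightgbm, and catboost packages are installed; the hyperparameter values shown are placeholders rather than tuned settings.

# Roughly equivalent gradient boosting classifiers from each library
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

xgb_model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=5)
lgbm_model = LGBMClassifier(n_estimators=200, learning_rate=0.1, max_depth=5)
# CatBoost can also accept categorical columns directly through cat_features
cat_model = CatBoostClassifier(iterations=200, learning_rate=0.1, depth=5, verbose=0)

# Each model exposes the same fit/predict interface, for example:
# xgb_model.fit(X_train, y_train); xgb_model.predict_proba(X_val)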

Comparing Random Forests and Gradient Boosting

Gradient boosting models are often more accurate than random forests because
they build trees sequentially, with each tree learning from the mistakes of the
previous ones. This iterative approach allows the model to learn from its errors and
improve its predictions. However, this increased accuracy comes at a cost: gradient
boosting models typically require more data to train effectively and are more
computationally intensive, which means they may not be as efficient as random
forests.

Choosing between these two models often depends on your project's specific
requirements. If you have a large amount of data and computational efficiency is
not a primary concern, a gradient boosting model may be the best choice due to its
potential for higher accuracy. On the other hand, if you have less data or need a
model that can train quickly, a random forest might be more suitable.

In summary, both random forests and gradient boosting models are powerful tools
for machine learning tasks, each with their own strengths and weaknesses. The
choice between them should be based on the amount of available data, the
computational resources at your disposal, and the level of accuracy required for
your specific use case.

Table 6.2: Comparing Random Forests to Gradient Boosting Models

Multicollinearity
● Random Forests: Less sensitive due to randomness and averaging of results from multiple trees.
● Gradient Boosting Models: Less sensitive as each weak learner (tree) is built on the residuals/errors of the previous one.

Overfitting
● Random Forests: Less prone due to the ensemble method, which averages multiple decision trees to reduce variance.
● Gradient Boosting Models: Can overfit if the number of trees is too large, but this can be controlled using early stopping and other regularization techniques.

Interpretability
● Random Forests: Lower interpretability due to complexity from multiple trees, but variable importance can be measured.
● Gradient Boosting Models: Lower interpretability due to the sequential nature of boosting, but variable importance can be measured.

Performance
● Random Forests: Higher performance due to the ability to model complex patterns using multiple trees.
● Gradient Boosting Models: Typically provides even higher performance than random forests, especially for data sets where the relationship between features and response is complex.

Training Speed
● Random Forests: Slower due to building multiple trees, but training can be parallelized across multiple CPUs.
● Gradient Boosting Models: Typically slower than random forests because trees are built sequentially, so it’s harder to parallelize.

Data Usage
● Random Forests: More efficient data usage due to the Out-of-Bag (OOB) metric, which eliminates the need for explicit data splitting and allows more data to be used in both training and validation.
● Gradient Boosting Models: Requires explicit data splitting into separate training and validation data sets.

Out-of-Bag vs. Validation Data Sets

The Out-of-Bag (OOB) metric is specific to bagging-based ensemble methods like
random forests. It estimates the model’s generalization error by using the training
instances that were not included (left out or “out of bag”) in the bootstrap sample
for each tree. The OOB instances are passed down the tree and used to compute an
unbiased estimate of the model’s error rate.

Gradient boosting algorithms do not use the OOB metric because they build trees
sequentially, with each tree learning from the mistakes of the previous ones rather
than independently, like in random forests. However, gradient boosting models
often include a validation set as part of their training process and use early stopping
to prevent overfitting. The validation error can be considered a similar metric to the
OOB error in random forests.
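
The contrast can be seen directly in scikit-learn: the sketch below requests the OOB score from a random forest and uses an explicit validation split for gradient boosting. It assumes a predictor matrix X and binary target y are already prepared; the variable names are illustrative.

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Random forest: the OOB estimate comes directly from the bootstrap samples
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print("Out-of-Bag accuracy:", rf.oob_score_)

# Gradient boosting: hold out an explicit validation set instead
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
gb = GradientBoostingClassifier(n_estimators=200, random_state=42)
gb.fit(X_tr, y_tr)
print("Validation AUC:", roc_auc_score(y_val, gb.predict_proba(X_val)[:, 1]))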

Feature Selection Methods

In the context of gradient boosting models, feature selection is an inherent part of


the model building process. A feature is considered for splitting at each node in each
decision tree within the sequence. The feature that provides the highest gain (for
classification) or reduction in variance (for regression) is selected. This process
contributes to the robustness of the gradient boosting model.

Here are some key points about feature selection in gradient boosting models:

● Splitting Criterion: Both Gini impurity and entropy can be used as criteria
for feature selection in a gradient boosting model. The choice between
them depends on the specific problem and the data at hand. However,
unlike decision trees or random forests, gradient boosting typically uses a
different criterion for splitting. It uses a loss function that depends on the
task at hand (e.g., mean squared error for regression, log loss for
classification).

● Feature Importance: After the model has been trained, we can obtain a
measure of feature importance, which indicates how useful or valuable each
feature was in the construction of the boosted decision trees within the
model. Features with higher importance were more influential in creating
the model, indicating a stronger association with the response variable.

● Regularization: Gradient boosting includes additional parameters for
regularization to avoid overfitting. These include the learning rate (which
scales the contribution of each tree) and subsampling parameters (which
control the fraction of samples to be used for fitting the individual base
learners).

● Interaction Effects: Because gradient boosting builds trees one at a time,


where each new tree helps to correct errors made by previously trained
trees, it can capture interaction effects between features.

Remember, while gradient boosting models can provide high performance, they
also require careful tuning and consideration of these factors.
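
As a brief illustration of the importance measures described above, the snippet below pulls the built-in feature importances from a fitted scikit-learn gradient boosting model and compares them with permutation importance on held-out data. It assumes a fitted classifier named gb and a validation DataFrame X_val with labels y_val; these names are placeholders, not part of the book's project code.

import pandas as pd
from sklearn.inspection import permutation_importance

# Built-in importances: total loss reduction attributed to each feature
builtin = pd.Series(gb.feature_importances_, index=X_val.columns)
print(builtin.sort_values(ascending=False).head(10))

# Permutation importance: drop in validation score when a feature is shuffled
perm = permutation_importance(gb, X_val, y_val, n_repeats=10, random_state=42)
perm_scores = pd.Series(perm.importances_mean, index=X_val.columns)
print(perm_scores.sort_values(ascending=False).head(10))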

Log Loss

Log loss, also known as logistic loss or cross-entropy loss, is a fundamental metric in
machine learning used to evaluate the performance of binary classification models.
It quantifies the accuracy of a classifier by penalizing false classifications, making it
particularly useful when the prediction output is a probability that an instance
belongs to a particular class.

The equation for log loss is:

Equation 6.1: Log Loss

LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log\left(p(y_i)\right) + (1 - y_i) \cdot \log\left(1 - p(y_i)\right) \right]

where:

● N is the number of observations,
● y_i is the actual value (0 or 1), and
● p(y_i) is the predicted probability that y_i equals 1.

The goal of any machine learning model is to minimize this value. A perfect model
would have a log loss of 0. However, it’s important to note that log loss heavily
penalizes classifiers that are confident about an incorrect classification. For
example, if for a particular observation, the actual label is 1 but the model predicts
it as 0 with high confidence, then the log loss would be a large positive number. This
makes log loss an excellent metric for assessing the reliability of model predictions.
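
As a small numerical check of the formula, the sketch below computes log loss by hand for four observations and compares it with scikit-learn's log_loss function. The toy values are illustrative only.

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.8, 0.1])   # predicted probability that y = 1

# Direct application of Equation 6.1
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(manual)                       # about 0.71, dominated by the confident miss (y=1, p=0.1)
print(log_loss(y_true, p_hat))      # same value from scikit-learn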

Pruning and Regularization in Gradient Boosting Models

While gradient boosting models do not typically use pruning in the traditional sense,
they do incorporate a form of “pruning” through regularization techniques like early
stopping and shrinkage. These techniques help control the complexity of the model
and prevent overfitting, which is crucial when dealing with high-dimensional data or
complex patterns.

● Early Stopping: This technique prevents overfitting by stopping the training


process if the model’s performance on a validation set does not improve for
a number of iterations. This effectively limits the number of trees in the
model, which can be seen as a form of pruning.

● Shrinkage: Also known as learning rate, this technique slows down the
learning process by shrinking the contribution of each tree by a factor
(between 0 and 1) when it is added to the current ensemble. This helps
regularize the model and prevents overfitting.

These techniques, along with others like subsampling, help control the complexity of
gradient boosting models and prevent them from overfitting. They are part of what
makes gradient boosting a powerful and flexible machine learning method.
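
In scikit-learn's GradientBoostingClassifier, both ideas map directly onto constructor arguments, as sketched below. The snippet assumes X_train and y_train are already defined; the specific values are placeholders rather than recommended settings.

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on the number of trees
    learning_rate=0.05,       # shrinkage: each tree contributes only 5% of its prediction
    subsample=0.8,            # row subsampling for each tree (stochastic gradient boosting)
    validation_fraction=0.1,  # internal hold-out used to monitor the validation loss
    n_iter_no_change=20,      # early stopping: quit if no improvement for 20 iterations
    random_state=42,
)
gb.fit(X_train, y_train)
print("Trees actually built:", gb.n_estimators_)  # often far fewer than 1000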

Performance Metrics

Performance metrics for gradient boosting models are similar to those used for
decision trees. For classification problems, these can include accuracy, precision,
recall, F1 score, and area under the ROC curve (AUC-ROC). For regression problems,
mean absolute error (MAE), mean squared error (MSE), or R-squared might be used.
These metrics are explained in more detail in the decision tree section of Chapter 5.

Pros and Cons

Gradient boosting models have several advantages:

● They can handle both numerical and categorical data.


● They provide feature importance scores.
● They are less likely to overfit than individual decision trees.

However, they also have some disadvantages:

● They are more complex and computationally intensive than individual


decision trees.
● They are less interpretable than individual decision trees due to their
ensemble nature.

Best Practices

When using gradient boosting models:

● Tune hyperparameters such as the number of trees in the sequence and the
learning rate.
● Balance your data set if dealing with imbalanced classes, either by
undersampling, oversampling, or generating synthetic samples.
● Use feature importance scores for feature selection or to gain insights about
the data.

Gradient Boosting Model Code

The following gradient boosting code is a robust workflow for creating a gradient
boosting model, tuning its hyperparameters, applying the model to an out-of-time
(OOT) data set, and generating final model performance metrics on the scored OOT
data set. This workflow is designed to be efficient and effective, leveraging the
power of the gradient boosting algorithm and the flexibility of each programming
language’s modeling library.

Unlike random forests, gradient boosting models do not inherently handle


imbalance in the data. Therefore, depending on the level of imbalance in your data,
you might need to consider techniques such as undersampling, oversampling, or
generating synthetic samples when preparing your data for a gradient boosting
model.

Program 6.4: Python Gradient Boosting Script

# Import necessary libraries


import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix,
roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Function to calculate performance metrics


def performance_metrics(y_true, y_pred):
print("Classification Report:")
print(classification_report(y_true, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print("ROC AUC Score:")
print(roc_auc_score(y_true, y_pred))

# Load the data


train_encoded = pd.read_csv('/...INPUT YOUR FILE
PATHWAY.../train_encoded.csv')
OOT_encoded = pd.read_csv('/...INPUT YOUR FILE
PATHWAY.../oot_encoded.csv')

print(train_encoded.columns)

# Define predictors and the target variable


predictors = [col for col in train_encoded.columns if col != 'bad']
# Check if excluded variables are in OOT_encoded columns, if not then add them
missing_cols = set(predictors) - set(OOT_encoded.columns)
for c in missing_cols:
OOT_encoded[c] = 0

# Split the data into training and validation sets (80/20 split) using stratified sampling
X_train, X_val, y_train, y_val =
train_test_split(train_encoded[predictors], train_encoded['bad'],
test_size=0.2, random_state=42, stratify=train_encoded['bad'])

# Combine predictors and target variable into one DataFrame for undersampling
train_data = pd.concat([X_train, y_train], axis=1)

# Count the number of samples in the minority class


minority_class_count = train_data['bad'].value_counts().min()

# Perform undersampling on the majority class


undersampled_data = pd.concat([train_data[train_data['bad'] ==
label].sample(minority_class_count, random_state=42) for label in
train_data['bad'].unique()])

# Get the resampled predictors and target variable


X_train_resampled = undersampled_data[predictors]
y_train_resampled = undersampled_data['bad']

# Build the XGBoost model


xgb = XGBClassifier(random_state=42)

# Define hyperparameters to tune


param_grid = {'n_estimators': [100, 200, 500],
'learning_rate': [0.1, 0.01, 0.001],
'max_depth': [3, 5, 7]}

# Tune hyperparameters using GridSearchCV


grid_search = GridSearchCV(xgb, param_grid, cv=5)
grid_search.fit(X_train_resampled[predictors], y_train_resampled)

# Get the best model


best_model = grid_search.best_estimator_

# Apply the model to the OOT data set


OOT_predictions = best_model.predict(OOT_encoded[predictors])

# Create performance metrics using the function defined earlier


performance_metrics(OOT_encoded['bad'], OOT_predictions)

# Plot ROC curve


fpr, tpr, _ = roc_curve(OOT_encoded['bad'], OOT_predictions)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Program 6.5: R Gradient Boosting Script

# Import necessary libraries


library(xgboost)
library(caret)
library(pROC)

# Load the data


train_encoded <- read.csv('/...INPUT YOUR FILE
PATHWAY.../train_encoded.csv')
OOT_encoded <- read.csv('/...INPUT YOUR FILE
PATHWAY.../oot_encoded.csv')

# Define predictors and the target variable


predictors <- names(train_encoded)[!names(train_encoded) %in% 'bad']

# Check if excluded variables are in OOT_encoded columns, if not then add them
missing_cols <- setdiff(predictors, names(OOT_encoded))
OOT_encoded[missing_cols] <- 0

# Split the data into training and validation sets (80/20 split) using stratified sampling
trainIndex <- createDataPartition(train_encoded$bad, p = .8, list =
FALSE, times = 1)
X_train <- train_encoded[ trainIndex, predictors]
y_train <- train_encoded[ trainIndex, 'bad']
X_val <- train_encoded[-trainIndex, predictors]
y_val <- train_encoded[-trainIndex, 'bad']

# Build the XGBoost model


xgb <- xgboost(data = as.matrix(X_train), label =
as.numeric(as.factor(y_train))-1, nrounds=100,
objective="binary:logistic")

# Apply the model to the OOT data set


OOT_predictions <- predict(xgb, as.matrix(OOT_encoded[,
predictors]))

# Create performance metrics using the function defined earlier


roc_obj <- roc(OOT_encoded$bad, OOT_predictions)
print(roc_obj)

# Plot ROC curve


plot(roc_obj)

Program 6.6: SAS Gradient Boosting Script

/* Load the data */


DATA train_encoded;
SET james.train_encoded;
RUN;

DATA oot_encoded;
SET james.oot_encoded;
RUN;

/* Create global variable for predictors */


PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id bad) OUT = var (KEEP = name);
RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM var; QUIT;
%PUT &predictors; RUN;

/* Variables in train_encoded but not in oot_encoded */


PROC SQL NOPRINT;
CREATE TABLE train_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="TRAIN_ENCODED";
CREATE TABLE oot_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="OOT_ENCODED";
CREATE TABLE in_train_not_oot AS
SELECT * FROM train_vars
EXCEPT
SELECT * FROM oot_vars;
QUIT;

/* Print the differences */


PROC PRINT DATA=in_train_not_oot; RUN;

/* Create final train and OOT data sets */


DATA train_encoded_final;
SET train_encoded (KEEP=&predictors bad);
RUN;

PROC FREQ DATA=train_encoded_final; TABLE bad; RUN;


/*Create missing variables from TRAIN data set*/


DATA oot_encoded_final;
SET oot_encoded ;
home_ownership_ANY = 0;
purpose_wedding = 0;
KEEP &predictors bad home_ownership_ANY purpose_wedding;
RUN;

PROC FREQ DATA=oot_encoded_final; TABLE bad; RUN;

/* Split the modeling data set by a 80/20 ratio using a random seed */
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2 SEED=42;
RUN;

PROC FREQ DATA= class2; TABLES selected; RUN;

/* Create TRAIN and VAL data sets */


DATA train_data val_data;
SET class2 ;
IF selected = 1 THEN OUTPUT train_data; ELSE OUTPUT val_data;
RUN;

/* Assess target rate in train, val and OOT data sets */


PROC FREQ DATA = train_data; TABLES bad; RUN;
PROC FREQ DATA = val_data; TABLES bad; RUN;
PROC FREQ DATA = oot_encoded_final; TABLES bad; RUN;

/* Balance the data by undersampling the majority class */


PROC FREQ DATA=train_data ;
TABLES bad ;
RUN;

DATA pos neg;


SET train_data;
IF bad = 1 THEN OUTPUT pos; ELSE OUTPUT neg;
RUN;

DATA train_bal;
SET pos neg(obs=6901); /*Input the number of positive cases*/
RUN;

PROC FREQ DATA=train_bal; TABLES bad; RUN;

/*Create hyperparameter tuning macro that will loop through a range of values for each of
the tuning parameters */

/* Create an empty summary_table data set */


PROC DELETE DATA=summary_table;

DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;

%LET target = bad;

%MACRO tune(iteration, maxdepth, leafsize);


%put "Iteration: &iteration";
%put "MAXDEPTH: &maxdepth";
%put "LEAFSIZE: &leafsize";

/* Create Gradient boosting model with hyperparameter tuning */


PROC TREEBOOST DATA=train_bal NOPRINT
MAXDEPTH = &maxdepth.
LEAFSIZE = &leafsize.;
TARGET &target. /LEVEL=binary;
INPUT &predictors. / LEVEL=interval;
CODE FILE = '/OneDrive/Documents/DS_Project/gradient_boosting.sas';
RUN;

DATA test_score_&iteration.;
SET val_data;
%INCLUDE '/OneDrive/Documents/DS_Project/gradient_boosting.sas';
KEEP id &target. P_bad1;
RUN;

/* Calculate AUC value on validation data set */


proc logistic data = test_score_&iteration.;
class &target.;
model &target. = P_bad1 / outroc=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
run;

data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;

/* Append the results to a summary table */


proc append base=summary_table data=outstat;
run;

%mend;

/* Call the macro with different hyperparameters */


%tune(1, 5, 2);
%tune(2, 5, 5);
%tune(3, 10, 2);
%tune(4, 10, 5);
%tune(5, 15, 2);
%tune(6, 15, 5);

PROC SORT DATA=summary_table; BY DESCENDING Area; RUN;

************************************************************************;
/*Apply optimal hyperparameters to the OOT data set for final evaluation*/
************************************************************************;

PROC TREEBOOST DATA=train_bal


MAXDEPTH = 5 /*Input optimal hyperparameter value*/
LEAFSIZE = 5; /*Input optimal hyperparameter value*/
TARGET bad /LEVEL=binary;
INPUT &predictors. / LEVEL=interval;
CODE FILE = '/OneDrive/Documents/DS_Project/gradient_boosting.sas';
RUN;

DATA oot_score;
SET oot_encoded_final;
%INCLUDE '/OneDrive/Documents/DS_Project/gradient_boosting.sas';
KEEP id bad P_bad1;
RUN;

/* Calculate AUC value on the OOT data set */


PROC LOGISTIC DATA = oot_score;
CLASS bad;
MODEL bad = P_bad1 / OUTROC=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
RUN;

Here is a step-by-step description of the gradient boosting workflow:

1. Import necessary libraries: The code begins by importing the necessary


libraries. These include pandas for data manipulation and sklearn or caret
for machine learning. SAS does not require loading additional libraries.

2. Define performance metrics function: A function is defined to calculate and


print the model's performance metrics. These metrics include a
classification report, confusion matrix, and ROC-AUC score.

3. Load the data: The training and OOT data sets are loaded from CSV files.

4. Define predictors: The predictors for the model are defined as all columns
in the training data set except for the target variable “bad”.

5. Split the data: The training data is split into training and validation sets
using stratified sampling.

6. Balance the data: Undersample the positive class to create a balanced


training data set.

7. Build the Gradient Boosting model: A Gradient Boosting model is initialized.

8. Define hyperparameters to tune: The hyperparameters to be tuned are


defined. These include the number of trees in the forest (n_estimators), the
maximum depth of the trees (max_depth), and the learning rate
(learning_rate).

9. Tune hyperparameters using Grid Search: Grid Search is used to tune the
hyperparameters of the gradient boosting model. It fits the model to the
training data and finds the best hyperparameters.

10. Apply the model to the OOT data set: The best model is applied to the OOT
data set to make predictions.

11. Create performance metrics: The performance metrics function calculates


and prints the model's performance metrics on the OOT data set.

12. Plot ROC curve: Finally, an ROC curve is plotted to visualize the model's
performance.

Chapter 6: Predictive Modeling Part II – Ensemble Methods – Conclusion and Transition

In this chapter, we explored the power of ensemble methods, learning how


techniques like bagging, boosting, and random forests can be used to create
stronger, more reliable predictive models. By combining multiple models, you can
mitigate the weaknesses of individual algorithms and enhance the overall
performance of your predictive efforts. Ensemble methods are a critical tool in your
data science arsenal, offering a robust approach to tackling complex predictive
challenges.

With a solid understanding of ensemble methods, it’s time to push the boundaries
even further. In the next chapter, we’ll delve into advanced modeling techniques
that go beyond the traditional and ensemble methods you’ve learned so far. These
techniques, including support vector machines and neural networks, represent the
cutting edge of predictive modeling and are capable of handling highly complex data
structures and relationships.

In Chapter 7, you will discover how to implement these advanced techniques in SAS,
Python, and R. We’ll explore their underlying principles, practical applications, and
how to optimize these models for the best performance. By the end of the chapter,
you’ll have the skills to apply some of the most sophisticated tools in predictive
modeling, enabling you to tackle even the most challenging data science problems.

So, with your ensemble methods knowledge in hand, let’s advance to Chapter 7,
where we’ll explore the frontiers of predictive modeling and equip you with the
techniques needed to handle the complexities of modern data science.

Chapter 6 Summary: Predictive Modeling Part II – Ensemble Methods

1. Introduction to Ensemble Methods

● Overview: This chapter delves into the world of ensemble methods,


emphasizing their role in enhancing predictive performance by combining
multiple models. The chapter primarily focuses on two popular ensemble
techniques: Random Forest and Gradient Boosting. These methods build
upon decision trees to create more robust and accurate models by
addressing the weaknesses inherent in individual models.

2. Random Forest

● Concept and Mechanism: Random Forest is introduced as an ensemble


learning method that builds multiple decision trees during training and
outputs the mode (for classification) or mean (for regression) of individual
tree predictions. The chapter explains how Random Forest reduces
overfitting by averaging the results of many trees, each built on a different
subset of the data.

● Step-by-Step Construction: The chapter provides a detailed guide on


constructing a Random Forest model, from bootstrapping the data set to
building and combining individual trees. It emphasizes the randomness
introduced in the selection of predictors and observations, which is key to
the model's robustness.

● Out-of-Bag (OOB) Metric: The OOB metric, a key feature of Random Forest,
is discussed in detail. This metric provides an unbiased estimate of the
model's error without the need for a separate validation set, making it
efficient in data usage.

3. Gradient Boosting

● Concept and Mechanism: Gradient Boosting is presented as a sequential


ensemble method where each tree corrects the errors of its predecessor.
The chapter explains how this iterative approach allows Gradient Boosting
to achieve higher accuracy by focusing on difficult-to-predict observations.

● Step-by-Step Construction: The chapter outlines the process of building a


Gradient Boosting model, from initializing the model with a simple
prediction to iteratively adding trees that correct residual errors. It also
discusses key parameters like learning rate, which controls the contribution
of each tree to the final model.

● Different Types of Gradient Boosting: The chapter introduces popular


variations of Gradient Boosting, including XGBoost, LightGBM, and
CatBoost, each with its unique strengths and applications.

4. Comparing Random Forest and Gradient Boosting

● Strengths and Weaknesses: The chapter compares Random Forest and


Gradient Boosting, highlighting their respective strengths and weaknesses.
While Random Forest is less prone to overfitting and faster to train,
Gradient Boosting often achieves higher accuracy but at the cost of longer
training times and a higher risk of overfitting.

● Use Cases: The chapter advises on when to use each method, suggesting
that Random Forest may be preferable for quick, robust models, while
Gradient Boosting is suited for situations where the highest possible
accuracy is required, and computational resources are not a limiting factor.

5. Feature Selection in Ensemble Methods

● Random Forest: Feature selection in Random Forest is inherently performed


as each tree in the forest considers a random subset of features at each
split. The chapter explains how the model's feature importance scores can
be used to identify the most influential variables.

● Gradient Boosting: Feature selection in Gradient Boosting is tied to the


model's iterative process, where features that consistently reduce residuals
are given more importance. The chapter discusses techniques like Gini
impurity, entropy, and permutation importance for assessing feature
importance in Gradient Boosting models.

6. Pruning and Regularization

● Random Forest: The chapter explains that pruning is not typically used in
Random Forest due to its ensemble nature, which naturally mitigates
overfitting. Instead, the focus is on selecting the right number of trees and
controlling their depth.

● Gradient Boosting: Regularization techniques like early stopping and


shrinkage (learning rate) are discussed as methods to prevent overfitting in
Gradient Boosting models. These techniques control the complexity of the
model by limiting the impact of each tree.

7. Performance Metrics

● Evaluation: The chapter reviews common performance metrics for both


Random Forest and Gradient Boosting, including accuracy, precision, recall,
F1 score, and AUC-ROC. It emphasizes the importance of using these metrics
to assess and compare model performance, especially when dealing with
imbalanced data sets.

Chapter 6 Quiz

1. What is the main advantage of using ensemble methods over individual


models in predictive modeling?

2. How does Random Forest reduce the risk of overfitting compared to a single
decision tree?

3. What is the role of the Out-of-Bag (OOB) metric in Random Forest?

4. Explain the process of bootstrapping in the context of Random Forest.

5. How does Gradient Boosting differ from Random Forest in its approach to
building models?

6. What is the significance of the learning rate in Gradient Boosting models?

7. Describe the key difference between XGBoost and traditional Gradient


Boosting models.

8. When should you consider using Random Forest over Gradient Boosting?

9. How does the randomness in Random Forest contribute to its robustness?

10. What is feature importance, and how is it calculated in Random Forest?

11. How does Gradient Boosting handle difficult-to-predict observations?

12. Explain the concept of early stopping in Gradient Boosting.

13. What are the typical use cases for LightGBM compared to XGBoost?

14. How do ensemble methods like Random Forest handle multicollinearity?

15. Why is regularization important in Gradient Boosting models?

16. What are the pros and cons of using Gradient Boosting in terms of
computational efficiency?

17. How can you use permutation importance to assess feature significance in
Gradient Boosting?

18. What metrics would you use to evaluate the performance of a Random
Forest model?

19. How does Gradient Boosting handle imbalanced data sets compared to
Random Forest?

20. What are the common hyperparameters that need tuning in Gradient
Boosting models?

Chapter 6 Cheat Sheet

Random Forest
● SAS: PROC HPFOREST for building Random Forest models; OOBError for out-of-bag error estimation; use MAXDEPTH to control tree depth.
● Python: RandomForestClassifier from sklearn.ensemble for classification; oob_score=True to get the out-of-bag score; use max_depth to control tree depth.
● R: randomForest package for building Random Forest models; importance() for feature importance; use mtry to control the number of features considered at each split.

Gradient Boosting
● SAS: PROC GRADBOOST for Gradient Boosting models; control the learning rate with SHRINKAGE; use MAXTREES to set the number of trees.
● Python: GradientBoostingClassifier from sklearn.ensemble for classification; use learning_rate to control the contribution of each tree; n_estimators to set the number of boosting stages.
● R: gbm package for Gradient Boosting models; use shrinkage to control the learning rate; set n.trees to determine the number of trees.

Feature Selection
● SAS: PROC HPFOREST automatically ranks features by importance; use PROC VARREDUCE to select important features.
● Python: feature_importances_ attribute in RandomForestClassifier and GradientBoostingClassifier to rank features; use SelectFromModel for feature selection.
● R: importance() function in randomForest for feature ranking; varImp() from caret to assess feature importance in gbm models.

Pruning / Regularization
● SAS: Random Forest: no pruning required; control model complexity via tree depth and the number of trees. Gradient Boosting: control overfitting with SHRINKAGE and LEAFSIZE.
● Python: Random Forest: control tree depth and the number of estimators to avoid overfitting. Gradient Boosting: use learning_rate and max_depth for regularization; early stopping with validation_fraction.
● R: Random Forest: pruning not necessary; control with ntree and maxnodes. Gradient Boosting: use shrinkage and interaction.depth to prevent overfitting.

Performance Metrics
● SAS: Use PROC HPFOREST and PROC GRADBOOST to calculate performance metrics such as accuracy, AUC-ROC, precision, and recall.
● Python: sklearn.metrics module for calculating accuracy, precision, recall, F1 score, and AUC-ROC for both Random Forest and Gradient Boosting models.
● R: caret and pROC packages for calculating accuracy, precision, recall, F1 score, and AUC-ROC for Random Forest and Gradient Boosting models.

Best Use Cases
● SAS: Random Forest: when a quick, robust model is needed that can handle large data sets with many features. Gradient Boosting: when higher accuracy is required, particularly with complex data sets and more computation resources available.
● Python: Random Forest: ideal for data sets with many features and a need for robustness against overfitting. Gradient Boosting: best suited for tasks requiring high accuracy, even at the cost of increased computation time.
● R: Random Forest: useful for creating interpretable models that handle noisy data well. Gradient Boosting: preferred for achieving high accuracy in predictive tasks, particularly when dealing with complex data.

Chapter 7: Predictive Modeling Part III – Advanced Techniques

Overview

This chapter explores advanced predictive modeling techniques, focusing on two


powerful algorithms: Support Vector Machines (SVM) and neural networks. Unlike
ensemble methods, which combine multiple models for enhanced performance,
SVM and neural networks are distinct approaches to tackling complex problems.

Support vector machines excel in both linear and non-linear classification tasks,
effectively identifying boundaries between classes in high-dimensional data spaces.
Their ability to find optimal hyperplanes, even in noisy environments, makes SVMs a
valuable tool for a wide range of applications, including image recognition, text
classification, and anomaly detection.

Neural networks, the foundation of deep learning, represent a paradigm shift in


artificial intelligence. Inspired by the interconnected neurons of the human brain,
neural networks possess an extraordinary ability to learn and adapt from vast data
sets, making them indispensable for tasks like natural language processing, image
recognition, and machine translation. Their ability to capture complex patterns and
non-linear relationships sets them apart from traditional statistical models.

As we delve into these advanced techniques, we'll unravel their intricate workings
and illuminate their potential in solving challenging problems. Our exploration will
showcase how these algorithms are shaping the future of data science and artificial
intelligence.

Support Vector Machines

Support vector machines are a distinct class of machine learning algorithms that
provide an alternative approach to predictive modeling. Unlike tree-based models
that use a hierarchical, rule-based structure, SVMs operate on the principle of
finding an optimal hyperplane that best separates the classes in a high-dimensional


space. This makes SVMs particularly effective in high-dimensional spaces and
situations where the number of dimensions exceeds the number of samples.

The goal of SVMs is to identify a distinct boundary – a point, line, or hyperplane –


that effectively separates the classes within the data. This concept is extended
through the use of “kernels,” enabling non-linear separation. The boundary that
maximizes the space between the two classes allows us to classify new observations
with greater confidence. If a new observation falls on either side of this boundary,
known as the support vector classifier, we can confidently assign it to a class.

The key components of SVMs include support vectors, hyperplanes, and margins.
Here’s how they work in SVMs:

● Support Vectors: These are data points that are closest to the hyperplane
and influence its orientation and position. They play a pivotal role in
defining the decision boundaries of the Support Vector Classifier (SVC). Due
to their key role, other observations become almost irrelevant. The entire
model can be constructed from just these few data points.

● Hyperplanes: These are decision boundaries that separate different classes.


In a two-dimensional space, a hyperplane is a line; in a three-dimensional
space, it’s a flat plane; and in higher dimensions, it’s referred to as a
hyperplane.

● Margins: This is the distance between the hyperplane and the support
vectors. The goal of an SVM is to maximize this margin to ensure robustness
against overfitting and to allow for better generalization on unseen data.

These characteristics make SVMs one of the most efficient models available, as they
require only a handful of observations for prediction.

Figure 7.1 shows a hypothetical data distribution of observations with two classes.
The left side of the figure shows the raw data distribution. There are many ways
that we can separate the two classes. Our goal is to find the point of separation that
provides the maximum distance between a pair of data points of both classes. This
is called the maximum margin.

The right side of the figure shows the same data distribution with the optimal
boundary that provides the maximum separation between the two closest points of
each class.

Figure 7.1: Data Distribution Boundaries

The line that provides the maximum space between the two classes provides us
some comfort that new observations will be classified with more confidence. If a
new observation is on one or the other side of the support vector classifier, then we
can comfortably classify that observation.

The observations closest to the support vector classifier are called support vectors.
These observations define the position of the support vector classifier. They
influence the position and orientation of the line or hyperplane. Due to the key role
these specific observations play in creating the decision boundaries of the Support
Vector Classifier (SVC), the remaining observations are nearly unimportant. The
entire model can be derived from just a few data points. This makes the support
vector machine one of the “lightest” models available because it only requires a few
observations to make predictions.
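
A quick way to see this property is to fit a linear SVM on a small synthetic data set and count how many observations end up as support vectors. The sketch below uses scikit-learn and is purely illustrative; the data set and parameter values are arbitrary.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two roughly separable clusters of 200 points in total
X_toy, y_toy = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=42)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X_toy, y_toy)

print("Training observations:", len(X_toy))
print("Support vectors per class:", svm.n_support_)   # typically only a handful
print("First few support vectors:\n", svm.support_vectors_[:5])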

Step-by-Step Construction of an SVM Model

The construction of an SVM model involves several steps, which are detailed below:

1. Initialize the Model: Start by choosing a random hyperplane that divides


your data into classes.

2. Identify Support Vectors: Identify the data points (support vectors) closest
to the hyperplane.

3. Optimize the Hyperplane: Adjust the position and orientation of the


hyperplane to maximize the margin between the support vectors and the
hyperplane.

4. Handle Misclassifications: If your data is not linearly separable, allow some


misclassifications by introducing a penalty term “C” in your optimization
problem.

5. Transform to Higher Dimension: If your data is still not separable, transform


it into a higher dimension using a kernel function (like the Radial Basis
Function).

6. Find Separating Hyperplane: Find the separating hyperplane in this higher-


dimensional space.

7. Transform Back to Original Space: Transform this hyperplane back to its


original space. This will be your decision boundary.

These steps make SVMs a powerful tool for many machine learning tasks, providing
higher accuracy in high-dimensional spaces compared to single decision trees or
even random forests.

Comparing Ensemble Tree-Based Models to SVMs

Support vector machines and ensemble tree-based models, such as random forests
and gradient boosting machines, are powerful machine learning algorithms widely
used in predictive modeling. While they share some similarities, they also have
distinct differences in how they handle data, their sensitivity to certain issues, and
their overall performance. The following table provides a comparison of these two
types of models:

Table 7.1: Comparing Ensemble Tree-Based Models to SVMs

Multicollinearity
● Ensemble Tree-Based Models: Less sensitive due to randomness and averaging of results from multiple trees (random forests) or because each weak learner (tree) is built on the residuals/errors of the previous one (gradient boosting).
● Support Vector Machines: Not affected, as SVMs do not rely on the entire data set; they are influenced only by the support vectors.

Overfitting
● Ensemble Tree-Based Models: Can be controlled in random forests due to the ensemble method, which averages multiple decision trees to reduce variance. Gradient boosting can overfit if the number of trees is too large, but this can be controlled using early stopping and other regularization techniques.
● Support Vector Machines: Less prone due to the maximization of the margin between classes, which helps to avoid overfitting.

Interpretability
● Ensemble Tree-Based Models: Lower interpretability due to complexity from multiple trees, but variable importance can be measured.
● Support Vector Machines: Lower interpretability due to the high-dimensional space and complex decision boundaries, but support vectors provide some insight into the model.

Performance
● Ensemble Tree-Based Models: High performance due to the ability to model complex patterns using multiple trees. Gradient boosting typically provides even higher performance than random forests, especially for data sets where the relationship between features and response is complex.
● Support Vector Machines: High performance in high-dimensional spaces and when the number of dimensions exceeds the number of samples.

Training Speed
● Ensemble Tree-Based Models: Slower due to building multiple trees, but training can be parallelized across multiple CPUs (random forests). Gradient boosting is typically slower than random forests because trees are built sequentially, so it’s harder to parallelize.
● Support Vector Machines: Faster for smaller data sets, as only a subset of the data (the support vectors) drives the model, but can be slower for larger data sets or with a large number of support vectors.

These characteristics make both ensemble tree-based models and SVMs valuable
tools in the machine learning toolkit, each with its own strengths and ideal use
cases.

Understanding Kernel Functions and the C Parameter in SVMs


In support vector machines, the kernel and the C parameter are not feature
selection methods but key components of the SVM algorithm that help adjust data
and control model complexity.

Kernel:

● The kernel is a function used in SVM to transform the input data into a
higher-dimensional space where it might be easier to classify it. This is
particularly useful when the data is not linearly separable in its original
space.
● The choice of the kernel (linear, polynomial, radial basis function, sigmoid,
etc.) depends on the nature of the data and significantly impacts the
performance of the SVM.
● For example, using the Lending Club data set, a radial basis function (RBF)
kernel could be used to classify whether a loan would be paid off based on
features like loan amount, interest rate, and borrower’s credit score. The
RBF kernel can capture the complex, non-linear relationships between these
features and the target variable.

C Parameter:

● The C parameter in SVM is a regularization parameter that controls the


trade-off between achieving a low training error and a low testing error,
which is the trade-off between overfitting and underfitting.
● A small value of C creates a wider margin, which may allow more
misclassifications but may generalize better to unseen data. A large value of
C creates a narrower margin, which may fit the training data better but risks
overfitting.
● In the context of the Lending Club data set, a small C value might be better
if we have a lot of noisy data (e.g., outliers in income or erroneous entries)
as it would prevent the model from fitting these anomalies too closely.

Remember, the choice of kernel and the value of C should ideally be determined
through cross-validation or other model selection methods to ensure the best
performance on unseen data.
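
One common way to make that choice is a cross-validated grid search over kernel types and C values, sketched below with scikit-learn. It assumes standardized predictors X_train_scaled and a binary target y_train already exist; the candidate values are placeholders, not recommendations.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],   # smaller C widens the margin; larger C fits the training data more tightly
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="roc_auc")
search.fit(X_train_scaled, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)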

Standardization in Support Vector Machines


Standardization is a crucial preprocessing step in machine learning, and it holds
particular significance when working with support vector machines. Standardization,
also known as feature scaling or normalization, is the process of transforming the
features of a data set to have a common scale or range. This is done by subtracting
the mean and dividing by the standard deviation for each feature, ensuring they all
have a mean of 0 and a standard deviation of 1. The goal is to bring all features to a
uniform scale, preventing certain variables from dominating the learning algorithm
due to their larger magnitudes.

Warning: Importance of Standardization for SVM Accuracy


Standardization is crucial for SVMs because these models are sensitive
to the scale of input features. SVMs aim to find the optimal hyperplane
that maximizes the margin between different classes. If the features
are not standardized, those with larger scales can disproportionately
influence the determination of the hyperplane. In such cases, the SVM
may not generalize well to new, unseen data, as the learned decision
boundary may be skewed.

In the SVM workflow, standardization typically occurs after data preprocessing steps
like missing value imputation and outlier handling. It ensures that all features
contribute equally to the model's decision-making process. Without standardization,
the SVM may misclassify instances or produce suboptimal results. After
standardization, the data is ready for model training, parameter tuning, and
evaluation. Fortunately, most machine learning libraries provide convenient
functions for standardization, making it an easily implementable step in the SVM
modeling process.

Standardization plays a pivotal role in SVMs by ensuring that the scale of input
features does not bias the model's decision boundary. It promotes fairness among
features, allowing SVMs to identify the most effective hyperplane for classification
tasks. By integrating standardization into the SVM workflow, data scientists can
improve model performance, leading to more accurate and reliable results in
various applications.
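
A minimal sketch of this step with scikit-learn's StandardScaler is shown below. The scaler is fit on the training predictors only and then reused to transform the OOT data with the same means and standard deviations; X_train, y_train, and X_oot are assumed to exist and are illustrative names.

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn the means and standard deviations from training data
X_oot_scaled = scaler.transform(X_oot)           # reuse them on the OOT data

svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train_scaled, y_train)
oot_predictions = svm.predict(X_oot_scaled)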

Performance Metrics

Performance metrics for support vector machines are similar to those used for other
classification and regression models. For classification problems, these can include
accuracy, precision, recall, F1 score, and area under the ROC curve (AUC-ROC). For
regression problems, mean absolute error (MAE), mean squared error (MSE), or R-
squared might be used. These metrics are explained in more detail in the Decision
Tree section of Chapter 5.

Pros and Cons

Support vector machines have several advantages:

● They can handle both numerical and categorical data.


● They are effective in high-dimensional spaces.
● They are memory efficient, using a subset of training points in the decision
function (support vectors).

However, they also have some disadvantages:

● They are not suitable for large data sets due to high training time.
● They do not directly provide probability estimates.
● They are less interpretable than individual decision trees.

Best Practices

When using support vector machines:

● Tune hyperparameters such as the C parameter and the type of kernel.


● Balance your data set if dealing with imbalanced classes, either by
undersampling, oversampling, or generating synthetic samples.
● Standardize your data, as SVMs are not scale-invariant.

Data Preparation for SVM Models

When developing a support vector machine model, several data preparation steps
are typically necessary to ensure the best performance of the model. Here are some
key considerations:

● Outliers: SVMs are not very sensitive to outliers because they focus on the
points that are hardest to tell apart (the support vectors). However, in some
cases, outliers can affect the hyperplane and lead to suboptimal results. It
might be beneficial to handle outliers based on the specific data set and
problem context.

● Multicollinearity: While SVMs can handle multicollinearity to some extent,


it can make the model more complex and harder to interpret. Identifying
and addressing multicollinearity might be useful, for example, by removing
redundant features or using dimensionality reduction techniques.

● Missing Data: SVMs cannot handle missing data. Any missing values in the
data set must be handled appropriately before training the model. This
could involve imputation strategies or, in some cases, removing instances or
features with too many missing values.

● Imbalanced Data: SVMs can be sensitive to class imbalance, which could


lead to a bias toward the majority class. Techniques like resampling the
data, using different weights for different classes, or using anomaly
detection methods can help address this issue.
SAS, Python & R: A Cross Reference Guide

● Data Standardization: SVMs are not scale-invariant, i.e., they can be


sensitive to the range of the features. Therefore, standardizing the data
(mean 0, standard deviation 1) is generally a good practice before training
an SVM. This ensures that all features contribute equally to the decision
boundary.

● Data Transformation: Depending on the nature of the data and the kernel
used, it might be beneficial to transform the data (e.g., using a log
transformation for skewed data) before training the SVM.

Remember, the specific data preparation steps can depend on the nature of the
data set and the specific problem context. It’s always a good idea to explore the
data thoroughly and make informed decisions based on the insights gained from
this exploration.
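
One way to chain several of these preparation steps is a scikit-learn Pipeline, sketched below. It assumes X_train and y_train are already defined; the imputation strategy, kernel, and class_weight setting are illustrative choices under these assumptions, not prescriptions.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),       # SVMs cannot handle missing values
    ("scale", StandardScaler()),                        # SVMs are not scale-invariant
    ("model", SVC(kernel="rbf", C=1.0, class_weight="balanced")),  # reweight classes to offset imbalance
])

svm_pipeline.fit(X_train, y_train)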

Support Vector Machine Model Code


The following SVM code is a robust workflow for creating an SVM model, tuning its
hyperparameters, applying the model to an out-of-time (OOT) data set, and
generating final model performance metrics on the scored OOT data set. This
workflow is designed to be efficient and effective, leveraging the power of the SVM
algorithm and the flexibility of each programming language’s modeling library.

Unlike random forests, SVMs do not inherently handle an imbalance in the data.
Therefore, depending on the level of imbalance in your data, you might need to
consider techniques such as undersampling, oversampling, or generating synthetic
samples when preparing your data for an SVM model.

Additionally, SVMs are not scale-invariant, meaning they can be sensitive to the
range of the features. Therefore, standardizing the data before training an SVM is
generally a good practice. This ensures that all features contribute equally to the
decision boundary. The choice of kernel and the value of C should ideally be
determined through cross-validation or other model selection methods to ensure
the best performance on unseen data. These are key components of the SVM
algorithm that help in data adjustment and model complexity control. They allow
the model to capture complex, non-linear patterns in the data and control the
balance between bias and variance. Therefore, understanding these concepts is


crucial for effectively using SVMs and interpreting their results.

The provided code is a pipeline for training a support vector machine model on
preprocessed data, tuning its hyperparameters, and evaluating its performance.
Here’s a step-by-step explanation of the code:

1. Import necessary libraries: The code begins by importing the necessary


libraries. These include pandas for data manipulation, sklearn for machine
learning tasks, and matplotlib for plotting. SAS does not require additional
libraries.

2. Define a function for performance metrics: A function named performance


metrics is defined to calculate and print various performance metrics of the
model, including the classification report, confusion matrix, and ROC-AUC
score.

3. Load the data: The preprocessed training and out-of-time (OOT) data sets
are loaded from CSV files.

4. Define predictors and target variable: The predictors (features) and the
target variable (“bad”) are defined. Any missing columns in the OOT data set
are added.

5. Split the data: The training data is split into a training set and a validation
set using stratified sampling.

6. Undersample the majority class: To deal with class imbalance, the majority
class in the training data is undersampled.

7. Standardize the data: The predictors in the resampled training data and the
OOT data set are standardized using sklearn’s StandardScaler.
8. Build the SVM model: An SVM model is initialized with a random state for
reproducibility.

9. Define hyperparameters to tune: The hyperparameters to be tuned are


defined in a dictionary. These include the C parameter and the type of
kernel.

10. Tune hyperparameters using grid search: Grid search is used to find the
best hyperparameters for the SVM model.

11. Apply the model to the OOT data set: The best model found by grid search
is used to make predictions on the OOT data set.

12. Evaluate the model: The performance metrics function evaluates the
model’s performance on the OOT data set. A ROC curve is also plotted.

This pipeline provides a comprehensive approach to training an SVM model, from


handling class imbalance and standardizing the data to tuning hyperparameters and
evaluating the model’s performance. It’s important to note that the specific steps
and parameters may need to be adjusted based on the nature of your data and the
specific problem you’re trying to solve. For example, you might need to choose
different kernels or C values for the SVM or use a different method for handling
class imbalance. As always, it’s a good idea to experiment with different approaches
and evaluate their performance to find the best solution for your specific use case.

Possible Training Time Issues

Warning: Computational Challenges of SVMs with Large Data Sets


Support vector machines are a powerful and flexible class of
supervised algorithms for classification and regression. However, they
can be computationally intensive and slow to train with large data sets.
This is because the training time for SVMs scales between quadratic and cubic with the size of the data set, depending on the specific implementation and the settings of the hyperparameters. As a rough guide, doubling the number of training instances can increase training time by roughly a factor of four to eight.

In our case, we have a balanced data set with 13,836 instances (6,918 positive and
6,918 negative) and 91 predictors. This is a relatively large data set, and training an
SVM on this data set involves solving a complex optimization problem that requires
significant computational resources.

Here are a few strategies you might consider to speed up the process:

1. Feature Selection: Reduce the number of features in your data set if possible. Only keep the features that contribute to the prediction. Feature selection techniques can help identify these important features.

2. Data Sampling: Use a subset of your data to train the SVM. Once the SVM is trained, it can be tested on the rest of the data.

3. Parameter Tuning: The choice of parameters can significantly affect the training time. For example, a linear kernel is typically faster than a radial basis function (RBF) kernel.

4. Software/Hardware Optimization: Ensure that your SVM implementation is optimized and that you're effectively leveraging your hardware resources. Some libraries can use parallel computing resources to speed up SVM training.

Remember, it’s important to balance model accuracy and training time. Sometimes,
a slightly less accurate model can be an acceptable trade-off for significantly faster
training times. It’s always a good idea to start with a smaller, simpler model and
data set and then gradually scale up as needed.
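
As a rough sketch of the data-sampling and linear-kernel strategies above, assuming a pandas DataFrame named train_data and a predictor list named predictors similar to those used in the programs below:

from sklearn.svm import LinearSVC

# Train on a random subsample; the fraction here is purely illustrative
train_sample = train_data.sample(frac=0.25, random_state=42)

# LinearSVC is typically much faster than SVC with an RBF kernel on large data
fast_svm = LinearSVC(C=1.0, max_iter=5000)
fast_svm.fit(train_sample[predictors], train_sample['bad'])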

Program 7.1: Python SVM Script

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Function to calculate performance metrics
def performance_metrics(y_true, y_pred):
    print("Classification Report:")
    print(classification_report(y_true, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    print("ROC AUC Score:")
    print(roc_auc_score(y_true, y_pred))

# Load the data
train_encoded = pd.read_csv('/...INPUT YOUR FILE PATHWAY.../train_encoded.csv')
OOT_encoded = pd.read_csv('/...INPUT YOUR FILE PATHWAY.../oot_encoded.csv')

print(train_encoded.columns)

# Define predictors and the target variable
predictors = [col for col in train_encoded.columns if col != 'bad']

# Check if excluded variables are in OOT_encoded columns, if not then add them
missing_cols = set(predictors) - set(OOT_encoded.columns)
for c in missing_cols:
    OOT_encoded[c] = 0

# Split the data into training and validation sets (80/20 split) using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])

# Combine predictors and target variable into one DataFrame for undersampling
train_data = pd.concat([X_train, y_train], axis=1)

# Count the number of samples in the minority class
minority_class_count = train_data['bad'].value_counts().min()

# Perform undersampling on the majority class
undersampled_data = pd.concat([
    train_data[train_data['bad'] == label].sample(minority_class_count, random_state=42)
    for label in train_data['bad'].unique()])

# Get the resampled predictors and target variable
X_train_resampled = undersampled_data[predictors]
y_train_resampled = undersampled_data['bad']

# Standardize the data
scaler = StandardScaler()
X_train_resampled = scaler.fit_transform(X_train_resampled)
OOT_encoded[predictors] = scaler.transform(OOT_encoded[predictors])

# Build the SVM model
svm = SVC(random_state=42)

# Define hyperparameters to tune
param_grid = {'C': [0.1, 1, 10],
              'kernel': ['linear', 'rbf', 'poly']}

# Tune hyperparameters using GridSearchCV
grid_search = GridSearchCV(svm, param_grid, cv=5)
grid_search.fit(X_train_resampled, y_train_resampled)

# Get the best model
best_model = grid_search.best_estimator_

# Apply the model to the OOT data set
OOT_predictions = best_model.predict(OOT_encoded[predictors])

# Create performance metrics using the function defined earlier
performance_metrics(OOT_encoded['bad'], OOT_predictions)

# Plot ROC curve
fpr, tpr, _ = roc_curve(OOT_encoded['bad'], OOT_predictions)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Program 7.2: R SVM Script

# Import necessary libraries
library(e1071)
library(caret)
library(pROC)

# Function to calculate performance metrics
performance_metrics <- function(y_true, y_pred) {
  print("Classification Report:")
  print(confusionMatrix(as.factor(y_pred), as.factor(y_true)))
  print("ROC AUC Score:")
  roc_obj <- roc(y_true, y_pred)
  print(auc(roc_obj))
  return(roc_obj)
}

# Load the data
train_encoded <- read.csv('/...INPUT YOUR FILE PATHWAY.../train_encoded.csv')
OOT_encoded <- read.csv('/...INPUT YOUR FILE PATHWAY.../oot_encoded.csv')

print(colnames(train_encoded))

# Define predictors and the target variable
predictors <- colnames(train_encoded)[!colnames(train_encoded) %in% 'bad']

# Check if excluded variables are in OOT_encoded columns, if not then add them
missing_cols <- setdiff(predictors, colnames(OOT_encoded))
OOT_encoded[missing_cols] <- 0

# Split the data into training and validation sets (80/20 split) using stratified sampling
train_index <- createDataPartition(train_encoded$bad, p = 0.8, list = FALSE)
X_train <- train_encoded[train_index, predictors]
y_train <- train_encoded[train_index, 'bad']
X_val <- train_encoded[-train_index, predictors]
y_val <- train_encoded[-train_index, 'bad']

# Combine predictors and target variable into one data frame for undersampling
train_data <- cbind(X_train, bad = y_train)

# Count the number of samples in the minority class
minority_class_count <- min(table(train_data$bad))

# Perform undersampling on the majority class
undersampled_data <- train_data[train_data$bad == 0, ][sample(1:nrow(train_data[train_data$bad == 0, ]), minority_class_count), ]
undersampled_data <- rbind(undersampled_data,
                           train_data[train_data$bad == 1, ][sample(1:nrow(train_data[train_data$bad == 1, ]), minority_class_count), ])

# Get the resampled predictors and target variable
X_train_resampled <- undersampled_data[, predictors]
y_train_resampled <- undersampled_data$bad

# Standardize the data
preProcValues <- preProcess(X_train_resampled, method = c("center", "scale"))
X_train_resampled <- predict(preProcValues, X_train_resampled)
OOT_encoded[predictors] <- predict(preProcValues, OOT_encoded[predictors])

# Define the SVM model (the svm function from e1071 is passed to tune below)
svm_model <- svm

# Define the tuning grid (e1071's svm uses the cost argument for the C parameter)
tune_grid <- list(cost = c(0.1, 1, 10),
                  kernel = c('linear', 'radial', 'polynomial'))

# Perform hyperparameter tuning
tuned_model <- tune(svm_model, bad ~ .,
                    data = cbind(X_train_resampled, bad = y_train_resampled),
                    ranges = tune_grid)

# Get the best model
best_model <- tuned_model$best.model

# Apply the best model to the OOT data set
OOT_predictions <- predict(best_model, OOT_encoded[predictors])

# Create performance metrics using the function defined earlier
roc_obj <- performance_metrics(OOT_encoded$bad, OOT_predictions)

# Plot ROC curve
roc_obj <- roc(OOT_encoded$bad, OOT_predictions)
plot(roc_obj, print.thres = "best")

Program 7.3: SAS SVM Script

/*LIBNAME james 'INPUT FILE PATHWAY';*/

/* Load the data */


DATA train_encoded;
SET james.train_encoded;
RUN;

DATA oot_encoded;
SET james.oot_encoded;
RUN;

/* Create global variable for predictors */


PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id bad) OUT = var
(KEEP = name); RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM
var; QUIT;
%PUT &predictors; RUN;

/* Variables in train_encoded but not in oot_encoded */


PROC SQL NOPRINT;
CREATE TABLE train_vars AS SELECT name FROM dictionary.columns
WHERE libname="WORK" AND memname="TRAIN_ENCODED";
CREATE TABLE oot_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="OOT_ENCODED";
CREATE TABLE in_train_not_oot AS
SELECT * FROM train_vars
EXCEPT
SELECT * FROM oot_vars;
QUIT;

/* Print the differences */


PROC PRINT DATA=in_train_not_oot; RUN;

/* Create final train and OOT data sets */


DATA train_encoded_final;
SET train_encoded (KEEP=&predictors bad);
RUN;

PROC FREQ DATA=train_encoded_final; TABLE bad; RUN;



/*Create missing variables from TRAIN data set*/


DATA oot_encoded_final;
SET oot_encoded ;
home_ownership_ANY = 0;
purpose_wedding = 0;
KEEP &predictors bad home_ownership_ANY purpose_wedding;
RUN;

PROC FREQ DATA=oot_encoded_final; TABLE bad; RUN;

/* Split the modeling data set by a 80/20 ratio using a random seed
*/
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2
SEED=42; RUN;

PROC FREQ DATA= class2; TABLES selected; RUN;

/* Create TRAIN and VAL data sets */


DATA train_data val_data;
SET class2 ;
IF selected = 1 THEN OUTPUT train_data; ELSE OUTPUT val_data;
RUN;

/* Assess target rate in train, val and OOT data sets */


PROC FREQ DATA = train_data; TABLES bad; RUN;
PROC FREQ DATA = val_data; TABLES bad; RUN;
PROC FREQ DATA = oot_encoded_final; TABLES bad; RUN;

/* Standardize the data */


PROC STDIZE DATA=train_data OUT=train_data_std METHOD=STD;
VAR &predictors;
RUN;

PROC STDIZE DATA=val_data OUT=val_data_std METHOD=STD;


VAR &predictors;
RUN;

PROC STDIZE DATA=oot_encoded_final OUT=oot_encoded_final_std


METHOD=STD;
VAR &predictors;
RUN;

/*Create hyperparameter tuning macro that will loop through a range


of values for each of the tuning parameters */

/* Create an empty summary_table data set */


PROC DELETE DATA=summary_table;

DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;
%MACRO tune(iteration, kernel, C);
%put "Iteration: &iteration";
%put "Kernel: &kernel";
%put "C: &C";

%let target = bad;

/* Build the SVM model */


PROC HPSVM DATA=train_data_std;
INPUT &predictors. / LEVEL=INTERVAL;
TARGET &target. / ORDER=desc;
KERNEL=&kernel.;
PENALTY C=&C.;
CODE FILE='/OneDrive/Documents/DS_Project/svm.sas';
RUN;

PROC HP4SCORE DATA = val_data_std;


SCORE FILE = '/OneDrive/Documents/DS_Project/svm.sas'
OUT = test_score_&iteration. (keep = &target. P_bad1);
RUN;

/* Calculate AUC value on validation data set */


proc logistic data = test_score_&iteration.;
class &target.;
model &target. = P_bad1 / outroc=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
run;

data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;

/* Append the results to a summary table */


proc append base=summary_table data=outstat;
run;

%mend;

/* Call the macro with different hyperparameters */


%tune(1, LINEAR, 1);
%tune(2, POLYNOM, 1);
%tune(3, RBF, 1);
%tune(4, LINEAR, 10);
%tune(5, POLYNOM, 10);
%tune(6, RBF, 10);
%tune(7, LINEAR, 100);
%tune(8, POLYNOM, 100);
%tune(9, RBF, 100);

/* Build the final SVM model */


PROC HPSVM DATA=train_data_std;
INPUT &predictors. / LEVEL=INTERVAL;
TARGET &target. / ORDER=desc;
PENALTY C=1; /* Insert optimal hyperparameter value */
SELECT FOLD=3 CV=SPLIT;
KERNEL=RBF; /* Insert optimal hyperparameter value */
CODE FILE='/OneDrive/Documents/DS_Project/svm.sas';
RUN;

PROC HP4SCORE DATA = oot_encoded_final_std;


SCORE FILE = '/OneDrive/Documents/DS_Project/svm.sas'
OUT = oot_scored (keep = &target. P_bad1);
RUN;

/* Calculate AUC value on OOT data set */


PROC LOGISTIC DATA = oot_scored;
CLASS &target.;
MODEL &target. = P_bad1 / OUTROC=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
RUN;

Neural Networks

Neural networks, often in the spotlight of machine learning, are the backbone of
deep learning – an innovation that has revolutionized artificial intelligence. These
networks shine in applications like image recognition, speech-to-text conversion,
gaming, and even complex problem-solving, such as defeating human champions in
both chess and Go. At their core, neural networks are the linchpin of deep learning
endeavors.

Deep learning, a pivotal concept, extends the power of neural networks. It involves
training neural networks on extensive data sets, often with multiple interconnected
layers. The objective is straightforward: proficiently classify or predict outcomes
with accuracy. Deep learning's charm lies in its ability to embrace more data and
complexity, constantly improving predictions as it encounters new information and
layers of intricacy within the neural network.

Consider Figure 7.2, showcasing how neural networks surge in performance as they
ingest more data and introduce additional hidden nodes. Unlike many other
machine learning techniques that plateau with data abundance, neural networks
uniquely continue learning and predicting, adapting to heightened data and added
complexity through these hidden layers.

Figure 7.2: Performance Gains of Additional Data


Neural networks don't need to be overly intricate. They can be as straightforward as standard linear regressions. For instance, Figure 7.3 illustrates a basic linear regression predicting car prices based on a single predictor: horsepower.

Figure 7.3: Simple Linear Regression Example

However, the allure of neural networks lies in their capability to go beyond simple
linear models. Figure 7.4 offers an alternative representation – a neural network
format for a regression model.

Figure 7.4: Neural Network Form of a Regression Model

Each neuron embodies a linear equation expressing the relationship between the
predictor and the target variable. It might seem like an elementary version of linear
regression, but here's the twist: neural networks aren't constrained to a single linear
equation. Instead, they stack multiple neurons, weaving intricate relationships
between numerous predictors and one target variable.

Figure 7.5 introduces a neural network example with four input features and a
hidden layer of nodes.

Figure 7.5: Multilayer Perceptron Example

It begins like a linear regression, with four input features (x) predicting an output
feature (y). However, the pivotal shift occurs in the middle, where hidden nodes
emerge. These nodes aren't explicitly defined; the neural network identifies and
constructs patterns autonomously. In this case, it might amalgamate horsepower
and anti-lock brakes into a "performance" node and safety ratings with make/model
into a "luxury" node. These hidden nodes then assist in predicting the outcome
variable.

This multilayer perceptron, as this design is known, signifies forward information flow. It embodies multiple processing stages, each comprising generalized linear models predicting outcomes. In simpler terms, if this were a linear regression, it'd resemble Equation 7.1:

Equation 7.1: Multiple Linear Regression Equation

𝑌 ≈ 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝛽4 𝑋4 …

where weights (𝛽𝑝 ) correspond to the input features of the hidden nodes. Figure 7.6
illustrates how these nodes generate new weighted features for the final prediction.

Figure 7.6: Coefficient Weights Example

However, deep neural networks don't stop at a single hidden layer. They are algorithms with additional layers of hidden nodes. The more layers, the more intricate and computationally demanding they become.

Before predicting an outcome, neural networks employ a final trick. After computing
the weighted sum for each hidden unit, a non-linear function is applied. The choice
of function, such as ReLU or tanh, empowers neural networks to master intricate
functions, surpassing the capabilities of simple linear regression models.
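
To make this concrete, here is a tiny NumPy sketch of what a single hidden neuron computes: a weighted sum of its inputs plus a bias, passed through a ReLU activation. The numbers are purely illustrative.

import numpy as np

# Four standardized input features for one observation
x = np.array([0.5, -1.2, 3.0, 0.8])
# Illustrative weights and bias for a single hidden neuron
w = np.array([0.4, 0.1, -0.3, 0.9])
b = 0.05

z = np.dot(w, x) + b          # weighted sum: the same form as a linear regression
hidden_output = max(0.0, z)   # ReLU activation introduces the non-linearity
print(z, hidden_output)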

Neural networks excel in problems demanding complex pattern recognition and are
the bedrock of deep learning's transformative impact. Whether you're exploring
image data, natural language, or intricate relationships, neural networks can be your
ace in the hole.

Step-by-Step Construction of a Neural Network Model

Building a neural network model involves several key steps (a condensed code sketch follows this list):

1. Initialize the Model: Start with the architecture of the neural network, which includes the number of layers, the number of neurons in each layer, and the activation functions. Typically, a neural network consists of an input layer, one or more hidden layers, and an output layer.

2. Data Preparation: Preprocess your data by scaling it to a standard range and splitting it into training, validation, and testing sets. Neural networks require large amounts of data for training.

3. Forward Propagation: Implement the forward propagation process. Data is fed into the input layer, and computations flow through the hidden layers, ultimately producing predictions at the output layer. Each neuron in a layer is connected to every neuron in the subsequent layer.

4. Calculate Loss: Compute a loss function that quantifies the error between the predicted values and the actual target values. Common loss functions include mean squared error for regression and cross-entropy for classification.

5. Backpropagation: Backpropagation is the heart of neural network training. It calculates the gradients of the loss function with respect to the model's parameters, which guide the optimization process.

6. Update Weights: Adjust the weights and biases of the neural network using an optimization algorithm like Stochastic Gradient Descent (SGD) or Adam. This step minimizes the loss function and fine-tunes the model.

7. Repeat Steps 3–6: Iterate through the training data multiple times (epochs), continuously updating the weights to improve the model's performance.

8. Validation: Monitor the model's performance on a separate validation data set during training. This helps prevent overfitting and allows you to tune hyperparameters.

9. Testing: Once training is complete, evaluate the model's performance on a completely unseen test data set to assess its generalization capability.

10. Make Predictions: Deploy the trained neural network to make predictions on new, unseen data.
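
The condensed sketch below walks through these steps with scikit-learn's MLPClassifier on synthetic stand-in data generated with make_classification; it is a simplified outline rather than the full workflow shown in Program 7.4.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (step 2: data preparation)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Steps 1 and 3-7: architecture, forward propagation, loss, backpropagation,
# and weight updates all happen inside fit()
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu',
                    solver='adam', max_iter=300, random_state=42)
mlp.fit(X_train_std, y_train)

# Steps 8-10: evaluate on held-out data and generate predictions
test_probs = mlp.predict_proba(X_test_std)[:, 1]
print("Hold-out AUC:", roc_auc_score(y_test, test_probs))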

Types of Neural Network Models

Neural networks encompass various architectures, each designed for specific tasks.
Here are some popular types:

● Feedforward Neural Networks (FNN): The standard neural network architecture with information flowing in one direction, from input to output.

● Convolutional Neural Networks (CNN): Ideal for image-related tasks, CNNs use convolutional layers to identify patterns within images.

● Recurrent Neural Networks (RNN): Tailored for sequence data like time series or natural language. They maintain a memory of past inputs through recurrent connections.

● Long Short-Term Memory (LSTM): A type of RNN that overcomes the vanishing gradient problem, making it effective for tasks that require modeling long-term dependencies.

● Gated Recurrent Unit (GRU): Similar to LSTM but computationally more efficient.

● Autoencoders: Used for unsupervised learning and dimensionality reduction. They consist of an encoder and decoder to reconstruct input data.

Comparing Baseline Models to Neural Network Models

When comparing neural network models to baseline models, consider various factors:

1. Performance: Neural networks often outperform simpler models on complex tasks. They excel at learning intricate patterns in data.

2. Data Size: Neural networks require substantial amounts of data for training, making them suitable for big data sets with many features.

3. Computational Resources: Training deep neural networks can be computationally intensive and may necessitate access to powerful hardware like GPUs or TPUs.

4. Interpretability: Neural networks are often considered black-box models, meaning it can be challenging to interpret why they make specific predictions.

5. Regularization: Neural networks offer various regularization techniques like dropout and weight decay to prevent overfitting.

6. Hyperparameter Tuning: Tuning hyperparameters, such as the number of layers, neurons, and learning rate, is crucial for optimizing neural networks.

7. Training Speed: Deep neural networks with many layers can take a long time to train, but techniques like transfer learning can expedite the process.

Neural networks are potent tools for complex machine learning tasks, but they
require careful consideration of data size, computational resources, interpretability,
and hyperparameter tuning. Therefore, it's essential to choose the right neural
network architecture and preprocessing techniques for your specific problem.

Pros and Cons

Neural networks have several advantages:

1. Complex Pattern Recognition: Neural networks excel in recognizing intricate patterns within data. This makes them invaluable for applications such as image and speech recognition.

2. Adaptability: Neural networks can adapt and learn from new data, making them suitable for scenarios where the underlying patterns evolve over time.

3. Feature Extraction: They can automatically extract essential features from raw data, reducing the need for extensive feature engineering.

4. Parallel Processing: Neural networks can leverage parallel processing, which can significantly speed up training on modern hardware.

5. High Accuracy: In many cases, neural networks outperform traditional machine learning models, achieving state-of-the-art results in various domains.

6. Universal Approximators: Theoretically, neural networks can approximate any function, given a sufficiently large and well-structured architecture.

Neural networks have several disadvantages:

1. Complexity: Deep neural networks, in particular, can be exceedingly complex, making them challenging to understand and interpret.

2. Computationally Intensive: Training deep neural networks can be computationally expensive, requiring substantial processing power and time.

3. Data Hungry: Neural networks often require large amounts of data for effective training, which might not be available for all problems.

4. Overfitting: They are prone to overfitting, especially when dealing with small data sets. Careful regularization is necessary.

5. Black Box: The inner workings of neural networks can resemble a black box, making it difficult to interpret their decision-making process.

6. Hyperparameter Tuning: Finding the right hyperparameters for neural networks can be time-consuming, requiring extensive experimentation.

Best Practices for Neural Networks

Adhering to best practices is crucial to harness the power of neural networks effectively; a short sketch after the following list shows how several of these practices map to common modeling options.

1. Data Preparation: Start with clean, well-preprocessed data. Address issues like missing values, outliers, and data scaling.

2. Architecture Design: Carefully choose the neural network architecture that suits your problem, considering factors like depth, width, and activation functions.

3. Regularization: Apply regularization techniques such as dropout, weight decay, and early stopping to combat overfitting.

4. Hyperparameter Tuning: Invest time in hyperparameter tuning, using methods like grid search or random search to optimize model performance.

5. Training Strategies: Implement effective training strategies, including learning rate schedules and batch size adjustments.

6. Validation: Use appropriate validation techniques like cross-validation to assess model generalization.
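
Several of these practices map directly onto options of scikit-learn's MLPClassifier, which is the library used in the Python program later in this chapter. The sketch below is illustrative only; the specific values are not tuned recommendations.

from sklearn.neural_network import MLPClassifier

# alpha applies L2 weight decay; early_stopping holds out validation_fraction of
# the training data and stops once the validation score stops improving
mlp = MLPClassifier(hidden_layer_sizes=(64, 32),   # architecture design
                    activation='relu',
                    alpha=0.001,                   # regularization strength
                    learning_rate_init=0.001,      # training strategy
                    early_stopping=True,           # built-in validation
                    validation_fraction=0.1,
                    n_iter_no_change=10,
                    random_state=42)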

Issues to Address in Neural Network Development

Before diving into neural network development, addressing specific issues and
challenges that can affect model performance and reliability is essential. These
include:

1. Dummy Variable Trap: Be mindful of dummy variables when encoding categorical data to avoid multicollinearity issues (a short sketch after this list illustrates dropping one level per category).

2. Multicollinearity: Detect and mitigate multicollinearity, which can distort feature importance and affect model stability.

3. Data Imbalance: Address class imbalance in classification problems using techniques like oversampling, undersampling, or synthetic data generation.

4. Standardization: Standardize or normalize numerical features to ensure consistent scaling and improve convergence during training.

5. Feature Engineering: Consider feature engineering to enhance the neural network's ability to capture relevant patterns in the data.

6. Exploratory Data Analysis: Perform exploratory data analysis to gain insights into data distributions and identify potential issues.

By proactively addressing these issues, you can build more robust and reliable neural network models for a wide range of data science problems.
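
As a quick sketch of the first point, pandas' get_dummies can drop one level per categorical variable; the data frame below is a made-up example, and the multicollinearity check itself is handled with VIF in Program 7.4 below.

import pandas as pd

# A made-up data frame with one categorical and one numeric column
raw_df = pd.DataFrame({'grade': ['A', 'B', 'C', 'A', 'B'],
                       'income': [50, 60, 55, 70, 65]})

# drop_first=True leaves one level ('A') out of the encoding,
# which avoids the dummy variable trap
encoded = pd.get_dummies(raw_df, columns=['grade'], drop_first=True)
print(encoded)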

Program 7.4: Python Neural Network Script

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Function to calculate performance metrics
def performance_metrics(y_true, y_pred):
    print("Classification Report:")
    print(classification_report(y_true, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    print("ROC AUC Score:")
    print(roc_auc_score(y_true, y_pred))

# Load the data
train_encoded = pd.read_csv('/...INPUT YOUR FILE PATHWAY.../train_encoded.csv')
OOT_encoded = pd.read_csv('/...INPUT YOUR FILE PATHWAY.../oot_encoded.csv')

print(train_encoded.columns)

# Define the variables to exclude
excluded_variables = ['id', 'emp_length_3years', 'term_36months',
                      'grade_G', 'sub_grade_B4', 'verification_status_SourceVerifi',
                      'purpose_home_improvement', 'home_ownership_RENT',
                      'application_type_JointApp', 'bad']

# Define predictors excluding the specified variables and the target variable
predictors = [col for col in train_encoded.columns if col not in excluded_variables]

# Calculate VIF
X = sm.add_constant(train_encoded[predictors])
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

# Identify variables with VIF greater than 5 (indicative of multicollinearity)
high_vif_predictors = vif[vif["VIF Factor"] > 5]["features"].tolist()

# Remove predictors with high VIF from the list of predictors
predictors = [pred for pred in predictors if pred not in high_vif_predictors]

# Store the target variable temporarily
OOT_bad = OOT_encoded['bad']

# Check if excluded variables are in OOT_encoded columns, if not then add them
missing_cols = set(predictors) - set(OOT_encoded.columns)
for c in missing_cols:
    OOT_encoded[c] = 0

# Ensure the order of columns in the OOT set is the same order as in the train set
OOT_encoded = OOT_encoded[predictors]

# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded['bad'] = OOT_bad

# Split the data into training and validation sets (80/20 split) using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])

# Remove constant columns
X_train = X_train.loc[:, (X_train != X_train.iloc[0]).any()]

# Remove duplicate columns
X_train = X_train.loc[:, ~X_train.columns.duplicated()]

# Combine predictors and target variable into one DataFrame for undersampling
train_data = pd.concat([X_train, y_train], axis=1)

# Count the number of samples in the minority class
minority_class_count = train_data['bad'].value_counts().min()

# Perform undersampling on the majority class
undersampled_data = pd.concat([
    train_data[train_data['bad'] == label].sample(minority_class_count, random_state=42)
    for label in train_data['bad'].unique()])

# Get the resampled predictors and target variable
X_train_resampled = undersampled_data[predictors]
y_train_resampled = undersampled_data['bad']

# Perform undersampling on the majority class for the validation data set
validation_data = pd.concat([X_val, y_val], axis=1)
minority_class_count = validation_data['bad'].value_counts().min()
undersampled_validation_data = pd.concat([
    validation_data[validation_data['bad'] == label].sample(minority_class_count, random_state=42)
    for label in validation_data['bad'].unique()])
X_val_resampled = undersampled_validation_data[predictors]
y_val_resampled = undersampled_validation_data['bad']

# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_resampled)
X_val = scaler.transform(X_val_resampled)

# Build the neural network model
model = MLPClassifier(random_state=42)

# Define hyperparameters to tune
param_grid = {'hidden_layer_sizes': [(64, 32), (128, 64, 32)],
              'activation': ['relu', 'tanh'],
              'solver': ['adam', 'lbfgs'],
              'alpha': [0.0001, 0.001, 0.01],
              'max_iter': [10, 50, 100]}

# Create GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5)

# Train the model
grid_search.fit(X_train, y_train_resampled)

# Get the best model
best_model = grid_search.best_estimator_

# Separate predictors and target variable
OOT_X = OOT_encoded.drop('bad', axis=1)
OOT_y = OOT_encoded['bad']

# Standardize the OOT data
OOT_X = scaler.transform(OOT_X)

# Apply the model to the OOT data set
OOT_predictions = best_model.predict(OOT_X)

# Create performance metrics using the function defined earlier
performance_metrics(OOT_y, OOT_predictions)

# Plot ROC curve
fpr, tpr, _ = roc_curve(OOT_y, OOT_predictions)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

# Print the best hyperparameters
print("Best Hyperparameters:")
print(grid_search.best_params_)

Program 7.5: R Neural Network Script

# Load necessary libraries
library(caret)
library(ROCR)
library(nnet)
library(MLmetrics)
library(ROSE)
library(pROC)

# Load the data
train_encoded <- read.csv('/...INPUT YOUR FILE PATHWAY.../train_encoded.csv')
OOT_encoded <- read.csv('/...INPUT YOUR FILE PATHWAY.../oot_encoded.csv')

# Define the variables to exclude
excluded_variables <- c('id', 'emp_length_3years', 'term_36months',
                        'grade_G', 'sub_grade_B4',
                        'verification_status_SourceVerifi',
                        'purpose_home_improvement', 'home_ownership_RENT',
                        'application_type_JointApp')

# Define predictors excluding the specified variables and the target variable
predictors <- setdiff(colnames(train_encoded), c(excluded_variables, "bad"))

# Calculate VIF
vif_fit <- lm(bad ~ ., data = train_encoded[, c(predictors, "bad")])
if (vif_fit$rank == length(coefficients(vif_fit))) {
  vif_values <- car::vif(vif_fit)
  high_vif_predictors <- names(vif_values[vif_values > 5])  # Change this threshold as needed
} else {
  high_vif_predictors <- character(0)
}

# Remove predictors with high VIF from the list of predictors
predictors <- setdiff(predictors, high_vif_predictors)

# Perform undersampling on the majority class
minority_class_count <- min(table(train_encoded$bad))

undersampled_data <- train_encoded[train_encoded$bad == 0, ]
undersampled_data <- rbind(undersampled_data,
                           train_encoded[train_encoded$bad == 1, ][sample(minority_class_count), ])

# Split the data into training and validation sets (80/20 split) using stratified sampling
trainIndex <- createDataPartition(undersampled_data$bad, p = .8, list = FALSE)
X_train <- train_encoded[trainIndex, predictors]
y_train <- train_encoded[trainIndex, "bad"]
X_val <- train_encoded[-trainIndex, predictors]
y_val <- train_encoded[-trainIndex, "bad"]

# Check if predictors are in OOT_encoded columns, if not then add them
missing_cols <- setdiff(predictors, colnames(OOT_encoded))
for (c in missing_cols) {
  OOT_encoded[[c]] <- 0
}

# Ensure the order of columns in the OOT set is the same as in the train set
OOT_encoded <- OOT_encoded[c(predictors, "bad")]

# Standardize the data
scaler <- preProcess(X_train[, predictors], method = c("center", "scale"))
X_train[, predictors] <- predict(scaler, X_train[, predictors])
X_val[, predictors] <- predict(scaler, X_val[, predictors])
OOT_encoded[, predictors] <- predict(scaler, OOT_encoded[, predictors])

# Combine target and predictors
X_train$bad <- y_train

# Define the neural network model with reduced complexity
model <- nnet(bad ~ ., data = X_train, size = 10, linout = TRUE,
              family = binomial, decay = 0.01, maxit = 100, trace = FALSE)

# Predict on the validation data set
validation_preds <- predict(model, newdata = X_val)
validation_preds <- ifelse(validation_preds > 0.5, 1, 0)

# Ensure that validation_preds and y_val have the same length
if (length(validation_preds) != length(y_val)) {
  stop("Dimensions of validation_preds and y_val do not match.")
}

y_val <- as.factor(y_val)

# Calculate performance metrics for validation
confusion_matrix <- confusionMatrix(as.factor(validation_preds), y_val)
roc_auc <- roc.area(ROCR::prediction(validation_preds, y_val))

# Print performance metrics
print("Validation Performance Metrics:")
print(confusion_matrix)
print(paste("ROC AUC:", roc_auc))

# Hyperparameter tuning using grid search
param_grid <- expand.grid(size = c(5, 10),
                          decay = c(0.001, 0.01, 0.1),
                          maxit = c(50, 100))

set.seed(42)
tuned_model <- train(bad ~ ., data = X_train[X_train$bad %in% c(0, 1), ],
                     method = "nnet",
                     tuneGrid = param_grid,
                     trControl = trainControl(method = "cv", number = 5))

# Get the best model
best_model <- tuned_model$finalModel

# Predict on the OOT data set
OOT_preds <- predict(best_model, newdata = OOT_encoded)
OOT_preds <- ifelse(OOT_preds > 0.5, 1, 0)

# Calculate performance metrics for the OOT data set
confusion_matrix_OOT <- confusionMatrix(as.factor(OOT_preds), as.factor(OOT_encoded$bad))
roc_auc_OOT <- roc.area(ROCR::prediction(OOT_preds, OOT_encoded$bad))

# Print performance metrics for the OOT data set
print("OOT Performance Metrics:")
print(confusion_matrix_OOT)
print(paste("ROC AUC (OOT):", roc_auc_OOT))

Program 7.6: SAS Neural Network Script

/* Load the data */


DATA train_encoded;
SET james.train_encoded;
RUN;

DATA oot_encoded;
SET james.oot_encoded;
RUN;

/* Define the variables to exclude to avoid the dummy variable trap


*/
%LET excluded_variables = id bad emp_length_3years term_36months
grade_g sub_grade_b4
verification_status_sourceverifi purpose_home_improvement
home_ownership_rent application_type_jointapp;

/* Create global variable for predictors */


PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id bad
&excluded_variables.) OUT = var (KEEP = name); RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM
var; QUIT;
%PUT &predictors; RUN;

/*Fit regression model and calculate VIF values*/

/* Define the target macro variable used throughout the script */
%LET target = bad;

PROC REG DATA=train_encoded ;
MODEL &target. = &predictors. / VIF;
RUN;

%let high_corr = GRADE_B int_rate;

/* Remove correlated variables and create final global variable for


predictors */
PROC CONTENTS NOPRINT DATA = train_encoded (DROP= id &target.
&excluded_variables. &high_corr.) OUT = var (KEEP = name); RUN;
PROC SQL NOPRINT; SELECT name INTO:predictors SEPARATED BY " " FROM
var; QUIT;
%PUT &predictors; RUN;

/* Variables in train_encoded but not in oot_encoded */


PROC SQL NOPRINT;
CREATE TABLE train_vars AS SELECT name FROM dictionary.columns
WHERE libname="WORK" AND memname="TRAIN_ENCODED";
CREATE TABLE oot_vars AS SELECT name FROM dictionary.columns WHERE
libname="WORK" AND memname="OOT_ENCODED";
CREATE TABLE in_train_not_oot AS
SELECT * FROM train_vars
EXCEPT
SELECT * FROM oot_vars;
QUIT;

/* Print the differences */


PROC PRINT DATA=in_train_not_oot; RUN;

/* Create final train and OOT data sets */


DATA train_encoded_final;
SET train_encoded (KEEP=&predictors bad);
RUN;

PROC FREQ DATA=train_encoded_final; TABLE bad; RUN;

/*Create missing variables from TRAIN data set*/


DATA oot_encoded_final;
SET oot_encoded ;
home_ownership_ANY = 0;
purpose_wedding = 0;
KEEP &predictors bad home_ownership_ANY purpose_wedding;
RUN;

PROC FREQ DATA=oot_encoded_final; TABLE bad; RUN;

/* Split the modeling data set by a 80/20 ratio using a random seed
*/
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2
SEED=42; RUN;

PROC FREQ DATA= class2; TABLES selected; RUN;

/* Create TRAIN and VAL data sets */


DATA train_data val_data;
SET class2 ;
IF selected = 1 THEN OUTPUT train_data; ELSE OUTPUT val_data;

RUN;

/* Assess target rate in train, val and OOT data sets */


PROC FREQ DATA = train_data; TABLES bad; RUN;
PROC FREQ DATA = val_data; TABLES bad; RUN;
PROC FREQ DATA = oot_encoded_final; TABLES bad; RUN;

/* Standardize the data */


PROC STDIZE DATA=train_data OUT=train_data_std METHOD=STD;
VAR &predictors;
RUN;

PROC STDIZE DATA=val_data OUT=val_data_std METHOD=STD;


VAR &predictors;
RUN;

PROC STDIZE DATA=oot_encoded_final OUT=oot_encoded_final_std


METHOD=STD;
VAR &predictors;
RUN;

/*Create hyperparameter tuning macro that will loop through a range


of values for each of the tuning parameters */

/* Create an empty summary_table data set */


PROC DELETE DATA=summary_table;

DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;

%MACRO tune(iteration, hidden, nhidden);

%let target = bad;

/* Build the neural network model */


PROC HPNEURAL DATA=train_data_std;
ARCHITECTURE MLP;
INPUT &predictors. / LEVEL=INT;
TARGET &target. / LEVEL=NOM;
HIDDEN &nhidden.;
HIDDEN &hidden.;
TRAIN;

SCORE OUT=scored_NN;
CODE
FILE='/OneDrive/Documents/DS_Project/neural_network.sas';
RUN;

DATA test_score_&iteration. (keep = &target. P_bad1);


SET val_data_std;
%INCLUDE
'/OneDrive/Documents/DS_Project/neural_network.sas';
RUN;

/* Calculate AUC value on validation data set*/


proc logistic data = test_score_&iteration.;
class &target.;
model &target. = P_bad1 / outroc=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
run;

data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;

/* Append the results to a summary table */


proc append base=summary_table data=outstat;
run;

%mend;

/* Call the macro with different hyperparameters */


%tune(1, 5, 32);
%tune(2, 10, 32);
%tune(3, 15, 32);
%tune(4, 5, 64);
%tune(5, 10, 64);
%tune(6, 15, 64);
%tune(7, 5, 128);
%tune(8, 10, 128);
%tune(9, 15, 128);
PROC SORT DATA=summary_table; BY DESCENDING area; RUN;

/* Build the optimized Neural Network model */


PROC HPNEURAL DATA=train_data_std;
ARCHITECTURE MLP;
INPUT &predictors. / LEVEL=INT;
TARGET &target. / LEVEL=NOM;
HIDDEN 15;
HIDDEN 32;
TRAIN;
SCORE OUT=scored_NN;
CODE FILE='/OneDrive/Documents/DS_Project/neural_network.sas';
RUN;

DATA oot_scored (keep = &target. P_bad1);


SET oot_encoded_final_std;
%INCLUDE '/OneDrive/Documents/DS_Project/neural_network.sas';
RUN;

/* Calculate AUC value on OOT data set */


PROC LOGISTIC DATA = oot_scored;
CLASS &target.;
MODEL &target. = P_bad1 / OUTROC=ROC;
ROC;
ODS OUTPUT ROCASSOCIATION = auc_out;
run;

The preceding code is a comprehensive workflow for building and evaluating a neural network model. Let's break down the key steps to ensure that all necessary preprocessing and modeling steps are correctly implemented:

1. Import necessary libraries: The code begins by importing the necessary libraries. These include pandas for data manipulation and sklearn or caret for machine learning. SAS does not require loading additional libraries.

2. Define performance metrics function: A function is defined to calculate and print the model's performance metrics. These metrics include a classification report, confusion matrix, and ROC-AUC score.

3. Load the data: The training and OOT data sets are loaded from CSV files.

4. Calculate VIF: Remove highly correlated variables through VIF analysis.

5. Avoid the Dummy Variable Trap: To avoid the dummy variable trap, leave one dummy variable out of each categorical level.

6. Define predictors: The predictors for the model are defined as all columns in the training data set except for the target variable "bad".

7. Split the data: The training data is split into training and validation sets using stratified sampling.

8. Balance the data: Undersample the majority class to create a balanced training data set.

9. Standardization: Standardize the training, validation, and OOT data using "StandardScaler". This is important for neural networks to ensure that features have similar scales.

10. Building the Neural Network model: Create an MLPClassifier (Multilayer Perceptron) model, which is a type of neural network.

11. Hyperparameter tuning: Define a grid of hyperparameters and use Grid Search to find the best hyperparameters for the model. This is a crucial step to optimize the model's performance.

12. Model evaluation: Evaluate the model's performance on the OOT data set using classification metrics and plot an ROC curve. This is essential for understanding how well the model generalizes to new data.

13. Printing best hyperparameters: Print the best hyperparameters found during hyperparameter tuning.

Chapter 7: Predictive Modeling Part III – Advanced Techniques – Conclusion and Transition

In this chapter, we demonstrated advanced predictive modeling techniques, exploring powerful algorithms like support vector machines and neural networks.
These methods enable you to handle complex and high-dimensional data, providing
sophisticated solutions to challenging data science problems. By mastering these
techniques, you’ve expanded your toolkit with some of the most cutting-edge
methods in the field.

However, building advanced models is only part of the journey. The next crucial step
is to evaluate and monitor the performance of these models effectively. No matter
how complex or sophisticated a model is, its value is determined by how well it
performs on new, unseen data. This is where performance metrics and model
monitoring come into play.

In Chapter 8, we’ll shift our focus to the implementation of performance metrics and the ongoing monitoring of models after deployment. You’ll learn how to
measure the accuracy, robustness, and stability of your models using various
metrics like AUC, precision, recall, PSI, and VSI. We’ll also explore best practices for
setting up monitoring systems that ensure your models continue to perform well
over time, even as the underlying data evolves.

So, with your advanced modeling techniques in hand, let’s move forward to Chapter
8, where we’ll ensure that your models not only work well today but continue to
deliver reliable predictions long into the future.

Chapter 7 Summary: Predictive Modeling Part III – Advanced Techniques

1. Introduction to Advanced Predictive Modeling Techniques

● Overview: This chapter explores two advanced predictive modeling techniques: Support Vector Machines (SVM) and Neural Networks. Both of these algorithms offer unique approaches to solving complex problems that go beyond traditional methods, such as decision trees or logistic regression.

2. Support Vector Machines (SVM)

● Concept and Mechanism: SVM is introduced as a robust classifier that excels in both linear and non-linear classification tasks. The chapter explains how SVM finds an optimal hyperplane that separates classes in high-dimensional spaces. Key components include support vectors, hyperplanes, and margins, with an emphasis on maximizing the margin to improve model generalization.

● Kernel Functions: The chapter discusses how kernel functions (e.g., linear, polynomial, radial basis function) transform input data into higher-dimensional spaces, enabling SVMs to handle non-linear separations. The importance of choosing the right kernel based on data characteristics is highlighted.

● C Parameter: The C parameter is explained as a regularization tool that balances the trade-off between maximizing the margin and minimizing classification errors. The chapter provides guidance on how to tune this parameter to control overfitting.

3. Neural Networks

● Concept and Mechanism: Neural networks, the foundation of deep learning, are introduced as models inspired by the human brain's structure. The chapter explains the basic architecture of neural networks, including input layers, hidden layers, and output layers. It also covers how neurons within these layers process and transmit information.

● Deep Learning: The chapter extends the discussion to deep learning, emphasizing how multiple hidden layers and a large amount of data allow neural networks to learn complex patterns and improve predictive accuracy. Various activation functions like ReLU and tanh are also discussed, which enable neural networks to capture non-linear relationships.

● Backpropagation and Optimization: The chapter delves into the backpropagation algorithm, which is central to training neural networks. It explains how gradients are calculated and used to update the weights of the network to minimize the loss function.

4. Comparing SVMs and Neural Networks

● Strengths and Weaknesses: The chapter compares SVMs and neural networks, highlighting the scenarios in which each excels. SVMs are praised for their effectiveness in high-dimensional spaces and for being less prone to overfitting. Neural networks are celebrated for their ability to model complex patterns and their adaptability to new data.

● Use Cases: The chapter suggests that SVMs are best suited for smaller data sets with clear margins between classes, while neural networks are preferred for tasks requiring high accuracy and large data sets.

5. Data Preparation for Advanced Models

● Standardization: The importance of standardizing data before feeding it into SVMs or neural networks is emphasized. This step ensures that all features contribute equally to the model's decisions and prevents features with larger scales from dominating the learning process.

● Handling Imbalanced Data: Techniques such as oversampling, undersampling, and generating synthetic samples are discussed as methods to address class imbalance in data sets, which can significantly impact the performance of SVMs and neural networks.

6. Hyperparameter Tuning

● SVM Tuning: The chapter covers the tuning of the C parameter and the selection of kernel functions, advising on cross-validation techniques to find the optimal settings for a given data set.

● Neural Network Tuning: Hyperparameter tuning for neural networks is discussed in the context of optimizing the number of layers, neurons, learning rate, and regularization parameters. The chapter also highlights the use of grid search and random search to find the best configuration.

7. Performance Metrics

● Evaluation: The chapter reviews performance metrics such as accuracy, precision, recall, F1 score, and AUC-ROC, which are used to assess the effectiveness of SVM and neural network models. It emphasizes the need to choose appropriate metrics based on the specific problem being addressed.

Chapter 7 Quiz

Questions:
1. What is the primary goal of a Support Vector Machine (SVM) in classification
tasks?

2. How does a kernel function in SVM help in handling non-linear separations?

3. Explain the role of the C parameter in SVM.

4. What are support vectors, and why are they important in SVM?

5. Describe the architecture of a basic neural network.

6. How does the backpropagation algorithm work in neural networks?

7. What is the significance of activation functions in neural networks?

8. In what scenarios is it preferable to use SVM over neural networks?

9. What are the advantages of using neural networks for deep learning tasks?

10. How does data standardization impact the performance of SVMs and neural
networks?

11. What is the difference between overfitting and underfitting, and how does
the C parameter in SVM address these issues?

12. Why is it important to tune hyperparameters in neural networks?

13. What is the difference between a feedforward neural network and a recurrent neural network?

14. How can class imbalance be addressed when training SVMs and neural
networks?

15. Explain the concept of a margin in SVM and its importance.

16. What are some common use cases for neural networks in machine learning?

17. How do you assess the performance of an SVM model?



18. Why might a deep neural network require more computational resources
than an SVM?

19. What are the pros and cons of using a radial basis function (RBF) kernel in
SVM?

20. Describe the process of hyperparameter tuning in SVMs and neural networks.

Chapter 7 Cheat Sheet

Support Vector Machines (SVM)
- SAS: Use PROC SVM for training SVM models. Specify the kernel type with the KERNEL= option (e.g., linear, polynomial, RBF). Tune the C= parameter to control the trade-off between margin width and classification errors.
- Python: Use SVC from sklearn.svm for SVM classification. Choose the kernel with the kernel= parameter (e.g., linear, poly, rbf). Adjust C to balance margin and misclassification errors.
- R: Use the svm function from the e1071 package. Specify the kernel with the kernel argument. Control regularization with the cost parameter.

Kernel Functions
- SAS: Linear, polynomial, and radial basis function (RBF) kernels are available in PROC SVM. Use PROC KERNEL to explore different kernel functions.
- Python: Kernel options include linear, polynomial, RBF, and sigmoid kernels with SVC. Use PolynomialFeatures for non-linear feature transformations.
- R: Choose from linear, polynomial, and radial basis (RBF) kernels in svm. Use the kernlab package for advanced kernel methods.

Neural Networks
- SAS: Use PROC NEURAL for building neural networks. Define the network architecture with the ARCHITECTURE statement. Customize activation functions and learning parameters.
- Python: Use Keras or TensorFlow for constructing neural networks. Build models with the Sequential or Functional API. Customize layers, activation functions, and optimizers.
- R: Use the nnet or keras packages for neural networks. Construct models using keras_model_sequential or keras_model for flexible architectures. Control training with the optimizer, loss, and metrics arguments.

Activation Functions
- SAS: Commonly used functions in PROC NEURAL include the sigmoid and hyperbolic tangent. Choose appropriate activation functions based on the problem (e.g., classification, regression).
- Python: Use ReLU, sigmoid, tanh, and more in Keras and TensorFlow. Customize activation functions for each layer.
- R: nnet supports logistic and softmax for output layers. keras supports a wide range of functions, including ReLU, sigmoid, and tanh.

Hyperparameter Tuning
- SAS: Tune SVM hyperparameters like C=, GAMMA=, and EPSILON= in PROC SVM. Adjust network parameters like the learning rate, momentum, and regularization in PROC NEURAL.
- Python: Use GridSearchCV or RandomizedSearchCV in sklearn for SVM tuning. Optimize neural networks with Keras callbacks (e.g., EarlyStopping, ReduceLROnPlateau).
- R: Use tune.svm in e1071 for SVM hyperparameter tuning. Use the caret package's train function for tuning neural networks with nnet or keras.

Performance Metrics
- SAS: Evaluate SVMs using PROC SVM output metrics like accuracy, precision, and recall. Assess neural networks with AUC, ROC, and the misclassification rate in PROC NEURAL.
- Python: Use sklearn.metrics for SVM evaluation (e.g., accuracy_score, precision_score, roc_auc_score). Evaluate neural networks with Keras built-in metrics or custom metrics.
- R: Evaluate SVM models with caret metrics (e.g., accuracy, sensitivity, specificity). Use ROCR or pROC for ROC and AUC metrics in neural networks.

Data Preparation
- SAS: Standardize data with PROC STDIZE before feeding it into SVM or neural network models. Handle missing values with PROC MI or PROC STDIZE.
- Python: Standardize features using StandardScaler from sklearn.preprocessing. Handle missing data with SimpleImputer or IterativeImputer from sklearn.impute before modeling.
- R: Use scale() to standardize features. Handle missing values with mice or na.omit().

Chapter 8: Performance Metrics, Implementation and Model Monitoring

Overview

In the journey of model development, performance metrics serve as the final arbiter
of success. These metrics are not just numbers – they are the definitive measures of
how well your model performs in the initial stages and as it encounters new data
over time. Without them, there’s no way to objectively determine if a model is
functioning correctly or if it needs adjustment. They provide the necessary feedback
to ensure that the model meets the initial objectives and continues to perform
reliably in production environments.

The importance of performance metrics lies in their ability to offer a clear view of a
model's accuracy, efficiency, and reliability. Metrics like precision, recall, AUC, and
MSE are the tools through which we assess a model's strengths and weaknesses. By
evaluating these metrics, we gain insights into areas where the model excels and
where it may require further refinement. This step is critical in the modeling
pipeline because it transforms theoretical models into actionable insights, making
them valuable assets in decision-making processes.

Moreover, implementation and monitoring are crucial to ensure the model's longevity and adaptability. Once a model is built and evaluated, it must be packaged
and deployed into a different environment, exposing it to new data and real-world
conditions. The choices made during implementation, such as whether to use
Docker or Kubernetes, directly impact the model’s scalability and operational
efficiency. Continuous monitoring, including metrics like PSI and VSI, is essential for
detecting and addressing performance drift, ensuring the model remains effective
over time.

This chapter integrates all aspects of the modeling pipeline, highlighting the
interconnectivity between model development, evaluation, implementation, and
monitoring. By understanding how these components work together, data scientists
can ensure their models perform well initially and continue to deliver value
throughout their lifecycles. The chapter emphasizes that a robust approach to performance metrics, combined with strategic implementation and diligent monitoring, is critical to maintaining a model's effectiveness in the long term.

Introduction to Performance Metrics for Classification vs. Regression Models

In data science, performance metrics are essential tools for assessing how well
models perform their intended tasks. The type of model – classification or
regression – dictates the choice of metrics due to the fundamental differences in
the nature of the predictions. Classification models predict categorical outcomes
(often binary), where the goal is to categorize data points into distinct classes.
Metrics like accuracy, precision, recall, and F1 score are used to evaluate these
models, focusing on correctly classifying instances and understanding different
types of errors.

On the other hand, regression models predict continuous outcomes, where the
objective is to estimate a value along a continuous scale. Metrics such as Mean
Squared Error (MSE), Mean Absolute Error (MAE), and R-squared are critical
because they evaluate the residuals – the differences between actual and predicted
values. This focus on residuals allows us to quantify how closely the model's
predictions match the actual values. In regression, the residuals provide a direct
measure of prediction accuracy, but this approach is not appropriate for
classification models because classification is concerned with discrete categories,
not continuous values.

The need for different performance metrics arises from these fundamental
differences: classification models require metrics that handle categorical outcomes
and binary decisions. In contrast, regression models need metrics that handle
continuous, numerical predictions based on residuals. Understanding and applying
the appropriate metrics for each type of model ensures that the evaluation is
accurate and meaningful, enabling data scientists to make informed decisions about
model performance and potential improvements.

Table 8.1 below outlines key performance metrics for both classification and
regression models, highlighting their equations, descriptions, and interpretation to
help evaluate model performance effectively.

Table 8.1: Classification and Regression Performance Metrics

Classification Model Performance Metrics

Accuracy
  Equation: (TP + TN) / (TP + TN + FP + FN)
  Description: The ratio of correctly predicted instances to the total instances.
  Interpretation: High accuracy indicates that most predictions are correct.

Precision
  Equation: TP / (TP + FP)
  Description: The ratio of true positive predictions to the total positive predictions.
  Interpretation: High precision indicates a low false positive rate.

Recall (Sensitivity)
  Equation: TP / (TP + FN)
  Description: The ratio of true positive predictions to all actual positives.
  Interpretation: High recall indicates a low false negative rate.

F1 Score
  Equation: 2 × (Precision × Recall) / (Precision + Recall)
  Description: The harmonic mean of precision and recall.
  Interpretation: A high F1 score indicates a good balance between precision and recall.

AUC (ROC Curve)
  Equation: The AUC value is calculated numerically by integrating the ROC curve.
  Description: Area Under the ROC Curve, which plots TPR against FPR at different thresholds.
  Interpretation: AUC close to 1 indicates a model with good discriminatory ability.

Gini Coefficient
  Equation: 2 × AUC − 1
  Description: A measure of inequality among values of a frequency distribution (e.g., income distribution).
  Interpretation: Higher Gini indicates better model discrimination.

KS Statistic
  Equation: max(TPR − FPR)
  Description: Measures the maximum difference between the cumulative distributions of the true positive rate and the false positive rate.
  Interpretation: High KS indicates better model separation.

Lift
  Equation: Ratio of predicted positives to actual positives over thresholds.
  Description: Shows how much better the model is at predicting positives compared to random selection.
  Interpretation: Higher lift indicates better model performance at identifying positives.

Gain
  Equation: Cumulative gain over a range of thresholds.
  Description: Illustrates the model's advantage in identifying positives.
  Interpretation: Higher gain indicates better model performance.

Regression Model Performance Metrics

MSE (Mean Squared Error)
  Equation: (1/n) Σ (y_i − ŷ_i)²
  Description: Average of the squares of the errors between predicted and actual values.
  Interpretation: Lower MSE indicates better model accuracy.

MAE (Mean Absolute Error)
  Equation: (1/n) Σ |y_i − ŷ_i|
  Description: Measures the average absolute difference between predicted and actual values.
  Interpretation: Lower MAE indicates that predictions are closer to actual outcomes, signifying better model performance.

RMSE (Root Mean Squared Error)
  Equation: √MSE
  Description: The square root of MSE, providing error measurement in the same units as the predicted values.
  Interpretation: Lower RMSE indicates better model accuracy.

R-squared
  Equation: 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
  Description: Proportion of variance explained by the model.
  Interpretation: Higher R-squared indicates a better fit of the model to the data.

Adjusted R-squared
  Equation: R-squared adjusted for the number of predictors.
  Description: Similar to R-squared, but penalizes for adding variables that don't improve the model.
  Interpretation: Higher adjusted R-squared indicates a better model, considering the number of predictors.

AIC (Akaike Information Criterion)
  Equation: 2k − 2 ln(L̂)
  Description: Measure of model quality, balancing goodness of fit with model complexity.
  Interpretation: Lower AIC indicates a better model.

BIC (Bayesian Information Criterion)
  Equation: Similar to AIC, with a stronger penalty for complexity.
  Description: A criterion for model selection among a finite set of models.
  Interpretation: Lower BIC indicates a better model, with a stronger penalty for complexity.
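To make the formulas in Table 8.1 concrete, the short Python sketch below computes a handful of them directly with NumPy. The confusion-matrix counts and the small actual/predicted arrays are hypothetical values chosen only for illustration; they are not output from any model in this chapter.

import numpy as np

# --- Classification metrics from hypothetical confusion-matrix counts ---
TP, TN, FP, FN = 80, 190, 10, 20
accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

# --- Regression metrics from hypothetical actual vs. predicted values ---
y_true = np.array([210.0, 340.0, 150.0, 480.0, 260.0])
y_pred = np.array([205.0, 355.0, 160.0, 470.0, 250.0])
residuals = y_true - y_pred
mse  = np.mean(residuals ** 2)
mae  = np.mean(np.abs(residuals))
rmse = np.sqrt(mse)
r2   = 1 - np.sum(residuals ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
print(round(mse, 1), round(mae, 1), round(rmse, 1), round(r2, 4))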

Classification Metrics: Introduction and Example

In data science, classification models are crucial for predicting categorical outcomes,
such as fraud detection, customer churn, or disease diagnosis. Evaluating these
models' performance requires various metrics that measure how well the model is
distinguishing between classes. This section introduces key classification metrics,
demonstrated through a practical example using a synthetic data set. By
implementing a random forest model in SAS, Python, and R, we will explore metrics
such as the confusion matrix, accuracy, precision, recall, F1 Score, AUC, ROC curve,
Gini, KS, lift and gain tables, and charts.

Establishing the Example: Creating the Data Set

We'll begin by creating a synthetic data set in each programming language,


containing a binary target variable, "Fraud," and six predictors: FICO Score, Number
of Credit Cards, Annual Income, Online Purchase Indicator, Transaction Amount, and
Average Transaction Amount.

Program 8.1: Creating the Example Data Set Using Synthetic Data
R Programming:

library(caret)
library(randomForest)
library(pROC)

set.seed(12345)
n <- 1000

# make_classification() is a Python (scikit-learn) function, so comparable
# synthetic data are generated natively in R (fraud is the minority class)
FICO_Score          <- round(rnorm(n, mean = 650, sd = 100))
Num_Credit_Cards    <- sample(1:10, n, replace = TRUE)
Annual_Income       <- round(rnorm(n, mean = 75000, sd = 30000))
Online_Purchase_Ind <- rbinom(n, 1, 0.5)
Trans_Amount        <- round(runif(n, 50, 5000), 2)
Avg_Trans_Amount    <- round(Trans_Amount / Num_Credit_Cards, 2)

# Fraud is more likely for lower FICO scores and larger transactions
logit <- -0.9 - 0.004 * (FICO_Score - 650) + 0.0004 * (Trans_Amount - 2500)
Fraud <- factor(rbinom(n, 1, plogis(logit)))

fraud_data <- data.frame(FICO_Score, Num_Credit_Cards, Annual_Income,
                         Online_Purchase_Ind, Trans_Amount,
                         Avg_Trans_Amount, Fraud)

Python Programming:

from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

np.random.seed(12345)
n = 1000

# Reducing the number of informative features and adding noise
X, y = make_classification(n_samples=n, n_features=6, n_informative=3,
                           n_redundant=2, n_repeated=1, n_clusters_per_class=2,
                           weights=[0.7, 0.3], flip_y=0.1, class_sep=0.8,
                           random_state=12345)

df = pd.DataFrame(X, columns=['FICO_Score', 'Num_Credit_Cards', 'Annual_Income',
                              'Online_Purchase_Ind', 'Trans_Amount',
                              'Avg_Trans_Amount'])
df['Fraud'] = y

SAS Programming:

DATA fraud_data;
    ARRAY x[6];
    CALL STREAMINIT(12345);
    DO ID = 1 TO 1000;
        /* RAND() draws are reproducible under the STREAMINIT seed */
        w = RAND('UNIFORM');
        r = RAND('UNIFORM');
        DO i = 1 TO 4;
            x[i] = (i <= 3) * (0.7 + 0.3 * r) * RAND('NORMAL');
        END;
        x[5] = r * x[1];
        x[6] = r * x[2];
        Fraud = (w > 0.7);
        FICO_Score = 300 + ROUND(550 * x[1]);
        Num_Credit_Cards = ROUND(9 * RAND('UNIFORM') + 1);
        Annual_Income = ROUND(20000 + x[2] * 130000);
        Online_Purchase_Ind = ROUND(RAND('UNIFORM'));
        Trans_Amount = ROUND(50 + x[3] * 4950, 2);
        Avg_Trans_Amount = ROUND(Trans_Amount / Num_Credit_Cards, 2);
        OUTPUT;
    END;
RUN;

Explanation of the Data Set Creation Process

In each programming environment – SAS, Python, and R – we’ve created a synthetic


data set that simulates a scenario where we want to predict whether a transaction
is fraudulent based on several predictors. The predictors include FICO Score,
Number of Credit Cards, Annual Income, Online Purchase Indicator, Transaction
Amount, and Average Transaction Amount.

To ensure consistency across all three environments, we’ve used a random seed (set.seed(12345) in R, np.random.seed(12345) in Python, and CALL STREAMINIT(12345) in SAS). Setting a
seed ensures that the random numbers generated are reproducible, which means

that each time we run the code, we get the same data set. This reproducibility is
critical when comparing models across different programming languages, as it
directly compares performance metrics.

Introducing the Random Forest Model Development

With our data set prepared, the next step is to build a classification model. We will
use a Random Forest algorithm, a powerful ensemble method known for its
robustness and ability to handle a large number of input variables without
overfitting. In the following sections, we will build and evaluate a Random Forest
model in SAS, Python, and R. This will include generating and interpreting key
performance metrics such as the confusion matrix, accuracy, precision, recall, F1
Score, AUC, ROC curve, Gini coefficient, KS statistic, and lift and gain tables and
charts.

These metrics will help us assess how well our model predicts fraud and enable us
to compare the performance of the Random Forest implementation across the
three programming environments.

Program 8.2 below develops a Random Forest model using the data set created
earlier, followed by generating key classification performance metrics to assess the
model's effectiveness.

Program 8.2: Creating Random Forest Model and Classification Performance Metrics

R Programming:

# Splitting the data
trainIndex <- createDataPartition(fraud_data$Fraud, p = .7,
                                  list = FALSE, times = 1)
train_data <- fraud_data[trainIndex, ]
test_data  <- fraud_data[-trainIndex, ]

# Building the Random Forest model


rf_model <- randomForest(Fraud ~ ., data=train_data)

# Predictions and probabilities


pred <- predict(rf_model, test_data, type="response")
probs <- predict(rf_model, test_data, type="prob")[, 2]

# Confusion Matrix
confusionMatrix(pred, test_data$Fraud)

# AUC, Gini, KS
roc_obj <- roc(test_data$Fraud, probs)
auc(roc_obj)
gini_coeff <- 2 * auc(roc_obj) - 1
ks_stat <- max(roc_obj$sensitivities + roc_obj$specificities - 1)  # KS = max(TPR - FPR)

# Lift and Gain Charts


test_data$probs <- probs
test_data$decile <- cut(probs, breaks=quantile(probs,
probs=seq(0,1,0.1)), include.lowest=TRUE,
labels=FALSE)

# Convert the factor target to 0/1 so it can be averaged and summed
test_data$Fraud_num <- as.numeric(as.character(test_data$Fraud))

lift_table <- aggregate(Fraud_num ~ decile, data=test_data, mean)
gain_table <- aggregate(Fraud_num ~ decile, data=test_data, sum)
gain_table$Cumulative_Gain <- cumsum(gain_table$Fraud_num)

Python Programming:

from sklearn.ensemble import RandomForestClassifier


from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,
classification_report, roc_auc_score, roc_curve,
precision_score, recall_score, f1_score
import pandas as pd
import numpy as np
# Splitting the data
X_train, X_test, y_train, y_test =
train_test_split(df[['FICO_Score',
'Num_Credit_Cards', 'Annual_Income',
'Online_Purchase_Ind', 'Trans_Amount',
'Avg_Trans_Amount']], df['Fraud'], test_size=0.3,
random_state=12345)

# Building the Random Forest model



rf_model = RandomForestClassifier(n_estimators=100,
random_state=12345)
rf_model.fit(X_train, y_train)

# Predicting and Evaluating the Model


y_pred = rf_model.predict(X_test)
y_proba = rf_model.predict_proba(X_test)[:, 1]

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

# Performance Metrics
print("Accuracy:", rf_model.score(X_test, y_test))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

# ROC Curve and AUC


roc_auc = roc_auc_score(y_test, y_proba)
fpr, tpr, _ = roc_curve(y_test, y_proba)
print("AUC:", roc_auc)

# Gini Coefficient
gini_coefficient = 2 * roc_auc - 1
print("Gini Coefficient:", gini_coefficient)

# KS Statistic
ks_stat = max(tpr - fpr)
print("KS Statistic:", ks_stat)

# Lift and Gain Charts


df_lift = pd.DataFrame({'prob': y_proba, 'target':
y_test})
df_lift['decile'] = pd.qcut(df_lift['prob'], 10,
labels=False, duplicates='drop')

lift = df_lift.groupby('decile').apply(lambda x:
np.mean(x['target']))
gain =
df_lift.groupby('decile')['target'].sum().cumsum()
print("Lift:\n", lift)
print("Gain:\n", gain)

SAS Programming:

/* NOTE: PROC HPSPLIT fits a single decision tree; a true random forest in SAS
   would use PROC HPFOREST (SAS Enterprise Miner) or PROC FOREST (SAS Viya). */
PROC HPSPLIT DATA=fraud_data;


CLASS Fraud;
MODEL Fraud = FICO_Score Num_Credit_Cards
Annual_Income Online_Purchase_Ind Trans_Amount
Avg_Trans_Amount;
GROW ENTROPY;
PRUNE COSTCOMPLEXITY;
OUTPUT OUT=rf_output PRED=pred;
RUN;

/* Confusion Matrix and Performance Metrics */


PROC FREQ DATA=rf_output;
TABLES Fraud*pred / NOCOL NOPERCENT CHISQ
MEASURES;
OUTPUT OUT=conf_matrix MEASURES=ALL;
RUN;
/* AUC, Gini, KS */
PROC LOGISTIC DATA=rf_output PLOTS(ONLY)=ROC;
MODEL Fraud(EVENT='1') = pred;
ROC;
RUN;

/* Lift and Gain */


PROC RANK DATA=rf_output OUT=lift_groups GROUPS=10;
VAR pred;
RANKS decile;
RUN;

PROC MEANS DATA=lift_groups MEAN;


CLASS decile;
VAR Fraud;
RUN;

Figure 8.1 shows the output from the Python program, which demonstrates the
model's performance, particularly its ability to correctly identify fraudulent
transactions, as evidenced by the confusion matrix and various performance
metrics.

Figure 8.1: Python Program Output

[[195 6]
[ 20 79]]
Accuracy: 0.9133333333333333
Precision: 0.9294117647058824
Recall: 0.797979797979798
F1 Score: 0.8586956521739131
AUC: 0.9320820141715664
Gini Coefficient: 0.8641640283431329
KS Statistic: 0.8139605005276648
Lift:
decile
0 0.000000
1 0.068966
2 0.103448
3 0.062500
4 0.000000
5 0.034483
6 0.346154
7 0.866667
8 0.933333
dtype: float64
Gain:
decile
0 0
1 2
2 5
3 7
4 7
5 8
6 17
7 43
8 99
Name: target, dtype: int32

The output of our Python model indicates strong performance, particularly in


identifying the minority class (Fraud). With an accuracy of 91.3%, the model

demonstrates a high precision of 92.9% and a recall of 79.8%, leading to a solid F1


score of 85.9%. The AUC of 0.93 suggests excellent model discrimination between
classes, while the Gini coefficient of 0.86 and a KS statistic of 0.81 further confirm
the model's robustness. The lift and gain charts highlight the model's ability to
concentrate true positives in the top deciles, demonstrating the model's
effectiveness in prioritizing high-risk cases.

Classification Metrics Rules of Thumb

In data science, assessing the effectiveness of classification models is essential for


ensuring they deliver accurate and reliable predictions in real-world scenarios.
While each classification task may have its nuances, general benchmarks – often
referred to as "rules of thumb" – can help determine whether a model's
performance metrics are poor, good, or very good. These benchmarks provide a
quick reference for interpreting common classification metrics such as accuracy,
precision, recall, F1 score, AUC, Gini coefficient, and KS statistic. Table 8.2 below
outlines these rules of thumb and explains the rationale behind the chosen
threshold levels.

Table 8.2: Classification Performance Metrics “Rules of Thumb” Ranges

Accuracy
  Thresholds: Poor < 70%; Good 70% – 85%; Very Good > 85%
  Explanation: Accuracy below 70% suggests the model is not reliably making correct predictions. Between 70% and 85% is generally considered acceptable in many contexts, though it may not be sufficient in high-stakes applications. Above 85% is typically indicative of strong model performance, especially in balanced data sets.

Precision
  Thresholds: Poor < 50%; Good 50% – 75%; Very Good > 75%
  Explanation: Precision below 50% indicates the model is incorrectly classifying too many negatives as positives, leading to a high rate of false positives. A precision of 50% to 75% is generally acceptable, with over 75% considered very good, particularly in cases where false positives are costly.

Recall (Sensitivity)
  Thresholds: Poor < 50%; Good 50% – 75%; Very Good > 75%
  Explanation: Recall below 50% suggests the model is missing too many actual positives, which could be critical depending on the application (e.g., fraud detection). A recall between 50% and 75% is generally considered good, with over 75% being very good, indicating the model is capturing the majority of positive cases.

F1 Score
  Thresholds: Poor < 0.6; Good 0.6 – 0.75; Very Good > 0.75
  Explanation: An F1 Score below 0.6 indicates a significant imbalance between precision and recall, leading to a less effective model. Scores between 0.6 and 0.75 are acceptable, with values over 0.75 indicating a well-balanced model with good precision and recall.

AUC (Area Under the ROC Curve)
  Thresholds: Poor < 0.7; Good 0.7 – 0.85; Very Good > 0.85
  Explanation: An AUC below 0.7 suggests the model has a limited ability to distinguish between classes, potentially only slightly better than random guessing. An AUC between 0.7 and 0.85 is considered good, while an AUC over 0.85 indicates strong discriminatory power, which is particularly desirable in binary classification problems.

Gini Coefficient
  Thresholds: Poor < 0.4; Good 0.4 – 0.6; Very Good > 0.6
  Explanation: A Gini coefficient below 0.4 indicates poor model performance with little separation between classes. Values between 0.4 and 0.6 are considered moderate, while above 0.6 reflects a high level of discrimination, similar to an AUC over 0.85.

KS Statistic
  Thresholds: Poor < 0.2; Good 0.2 – 0.4; Very Good > 0.4
  Explanation: A KS statistic below 0.2 suggests poor discrimination between the positive and negative classes. Values between 0.2 and 0.4 are generally acceptable, indicating the model has some discriminatory power. A KS statistic above 0.4 indicates strong separation, often considered very good in practice.
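These bands are guidelines rather than strict cutoffs, but they are easy to apply programmatically. The short Python sketch below encodes the cut points from Table 8.2 in a small helper (the dictionary and function are illustrative, not part of any library) and grades the AUC and KS values reported for the Python model in Figure 8.1.

# Band edges (poor-vs-good, good-vs-very-good) taken from Table 8.2
BANDS = {"accuracy": (0.70, 0.85), "precision": (0.50, 0.75), "recall": (0.50, 0.75),
         "f1": (0.60, 0.75), "auc": (0.70, 0.85), "gini": (0.40, 0.60), "ks": (0.20, 0.40)}

def grade(metric, value):
    low, high = BANDS[metric]
    if value < low:
        return "poor"
    return "good" if value <= high else "very good"

print("AUC 0.93 ->", grade("auc", 0.93))   # values reported in Figure 8.1
print("KS 0.81  ->", grade("ks", 0.81))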

Understanding Classification Thresholds

In the context of classification models, the output is typically a continuous


probability score ranging between 0 and 1, which reflects the likelihood of an event
occurring (e.g., fraud or no fraud). To convert these probabilities into actionable
binary outcomes (such as classifying an event as "fraud" or "no fraud"), we must
establish a threshold. This threshold determines the point at which the model’s
output probability will be classified as a positive event (1) or a negative event (0).
The default threshold in most AI/ML algorithms is 0.5, meaning that any probability
greater than or equal to 0.5 will be classified as a positive event.

However, this default setting might not always align with the specific objectives or
risk tolerance of the business. Adjusting the threshold can help tailor the model’s
performance to meet the organization's needs better, balancing between precision
(minimizing false positives) and recall (minimizing false negatives). For instance, a
more conservative threshold might be set higher (e.g., 0.7), which would classify
fewer events as positive, reducing false positives but potentially increasing false
negatives.

The decision to set the threshold should consider statistical optimization and
business objectives. Statistically, thresholds can be optimized based on metrics like
the F1 Score, which balances precision and recall, or by maximizing the AUC (Area
Under the Curve) of the ROC curve. From a business perspective, the threshold

might be adjusted to reflect risk tolerance, cost factors, or strategic priorities. For
example, in fraud detection, where false negatives (missed fraud) are more costly
than false positives (false alarms), the threshold might be lowered to ensure more
potential fraud cases are flagged, even if it results in more false positives.
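As a minimal illustration of how the cutoff changes the resulting classifications, the Python sketch below applies three different thresholds to a small set of hypothetical predicted probabilities (the scores and labels are made up for this example, not taken from the chapter's model) and reports the precision and recall at each cutoff.

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities, for illustration only
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])
y_proba = np.array([0.05, 0.20, 0.45, 0.62, 0.38, 0.71, 0.10, 0.86, 0.55, 0.93])

for threshold in (0.39, 0.50, 0.70):
    y_pred = (y_proba >= threshold).astype(int)   # apply the cutoff
    print(f"threshold={threshold:.2f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")

Raising the threshold in this toy example trades recall for precision, which is exactly the lever discussed above.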


The histogram in Figure 8.2 visualizes the distribution of the predicted probabilities
generated by our model. The red dashed line represents the default threshold of
0.5; the green dashed line represents a threshold of 0.7 optimized for the F1 Score;
the blue dashed line indicates a threshold of 0.39 optimized for AUC, and the orange
dashed line shows a threshold of 0.97 optimized for Precision. Figure 8.2 illustrates
the distribution of the model's predicted probabilities and the impact of setting
different thresholds.

Figure 8.2: Histogram of Predicted Probabilities with Various Thresholds

Interpreting the Thresholds

● Threshold at 0.5 (Default Setting): The threshold of 0.5 is the default setting
for most classification models, where probabilities of 0.5 or higher are
classified as positive events. This threshold is typically used when a balance
between false positives (incorrectly predicting an event) and false negatives
(failing to predict an event) is desired. In many scenarios, a threshold of 0.5
is a starting point because it equally weighs the chances of identifying a true
positive against the risk of a false positive. However, this threshold may not
always align with specific business needs or risk tolerances, especially in
high-stakes scenarios where the cost of errors is significant.

● Threshold at 0.7 (F1 Score Optimized): This higher threshold is adjusted to


maximize the F1 Score, which is a balance between precision and recall. By
setting the threshold at 0.7, the model becomes more conservative,
meaning it only classifies instances as positive (e.g., fraud) when it is highly
confident. This threshold is particularly useful when the cost of false
positives is high (e.g., mistakenly flagging legitimate transactions as fraud),
as it reduces the likelihood of such errors. However, the trade-off is an
increased number of false negatives, where actual positive events may be
missed. This approach is beneficial in scenarios where false positives are
more detrimental than false negatives.

● Threshold at 0.39 (AUC Optimized): A lower threshold of 0.39 is often


chosen to optimize the sensitivity or true positive rate, which means
capturing as many true positives as possible (e.g., detecting more fraud
cases). This threshold is generally determined by analyzing the ROC curve
and selecting a point where the sensitivity is maximized relative to the false
positive rate. While this threshold increases the detection of positive
events, it also raises the number of false positives, which might be
acceptable when missing an event (false negative) is more costly than
dealing with false positives. This threshold is particularly useful in cases
where the priority is to maximize the model’s ability to identify actual
events, even at the expense of precision.

● Threshold at 0.97 (Precision Optimized): Setting the threshold at 0.97


makes the model extremely conservative, where only the most confident
predictions are classified as positive. This approach minimizes false positives
to a great extent, which is crucial in scenarios where the consequences of
false positives are extremely costly or damaging (e.g., wrongful accusations
in fraud detection). However, the downside is that many true positive cases
might be missed, leading to a high number of false negatives. This threshold
is often used in scenarios where the cost of false positives far outweighs the
cost of false negatives, and high precision is paramount.

Business Goals and Threshold Setting

Choosing the appropriate threshold is critical and should be aligned with the
business’s objectives:

● Reducing False Positives: If the primary concern is to minimize the


occurrence of false positives (e.g., avoiding the unnecessary investigation of
legitimate transactions), a higher threshold is advisable. This is particularly
relevant in industries where the cost of a false positive is significant.

● Broad Coverage: In scenarios where it is crucial to identify as many true


positives as possible (e.g., capturing all potential fraud cases), a lower
threshold may be more suitable. This approach accepts a higher rate of false
positives to ensure that few true positives are missed.

● Reducing False Negatives: When missing a positive event is particularly


costly (e.g., undetected fraud leading to financial losses), lowering the
threshold can help reduce the number of false negatives. This approach is
more aggressive in identifying positive events but at the cost of increased
false positives.

● Balanced Approach: A balanced threshold that considers both false


positives and false negatives may be the best approach for many business
scenarios. This is where metrics like the F1 Score come into play, offering a
middle ground that balances precision and recall.

Analyzing the Model Score Distribution

In Figure 8.2, the histogram visualizing the predicted probabilities reveals a binary-
like distribution rather than the typical bell curve seen in many data sets. This binary
distribution indicates that the model strongly differentiates between two classes,
often resulting in predictions clustering near 0 or 1. Such a distribution suggests that
the model is confident in its classifications, which can be advantageous for clear
decision-making. However, it may also indicate that the model could be overfitting,
particularly if it is too aggressive in pushing predictions to extremes. This
distribution pattern is important to consider when setting thresholds, as it directly
impacts how the model will perform across different probability ranges.

Understanding and adjusting classification thresholds is essential for aligning model


performance with business objectives. The choice of threshold can significantly
affect the balance between different types of errors (false positives and false
negatives) and, ultimately, the model’s effectiveness in a real-world context.
Whether the goal is to optimize based on statistical performance metrics like the F1
Score or to meet specific business needs, the threshold setting is a powerful lever in
the deployment of predictive models.
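One common statistical approach mentioned above is to choose the threshold that maximizes the F1 Score. A minimal Python sketch of that search is shown below; it assumes y_test and y_proba are the held-out labels and predicted probabilities from a fitted classifier such as the one in Program 8.2.

import numpy as np
from sklearn.metrics import precision_recall_curve

# Precision and recall are computed at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # guard against 0/0

best = np.argmax(f1[:-1])          # the last precision/recall pair has no threshold
print("F1-optimal threshold:", round(thresholds[best], 3),
      "F1 at that threshold:", round(f1[best], 3))

The same kind of scan could be run over the ROC curve instead if sensitivity, rather than the precision/recall balance, is the priority.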

Understanding the Confusion Matrix

The confusion matrix is a fundamental tool for evaluating the performance of a


classification model. It is called a "confusion" matrix because it shows how confused

the model is in making predictions by displaying the number of correct and incorrect
predictions made by the model, broken down by each class.

The confusion matrix is structured as follows:

Figure 8.3: Confusion Matrix

                  Predicted: No            Predicted: Yes
Actual: No        True Negative (TN)       False Positive (FP)
Actual: Yes       False Negative (FN)      True Positive (TP)

● True Negative (TN): The number of correct predictions where the model
predicted "No" (non-fraud) and the actual outcome was "No."

● False Positive (FP): The number of incorrect predictions where the model
predicted "Yes" (fraud) but the actual outcome was "No." This is also known
as a Type I error.

● False Negative (FN): The number of incorrect predictions where the model
predicted "No" (non-fraud) but the actual outcome was "Yes." This is known
as a Type II error.

● True Positive (TP): The number of correct predictions where the model
predicted "Yes" (fraud) and the actual outcome was "Yes."

From the confusion matrix, several key performance metrics can be derived:

1. Accuracy:

Definition:
Accuracy is the proportion of correct predictions (both true positives and
true negatives) out of the total predictions made. It is calculated as:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

Interpretation:
High accuracy, such as 91.33%, suggests that the model performs well
overall in correctly predicting both fraud and non-fraud cases. However, in
imbalanced data sets like fraud detection, accuracy can be misleading
because the model might still fail to detect the minority class effectively.
Low accuracy would indicate that the model struggles with making correct
predictions in general, which could necessitate further tuning or feature
engineering.

2. Precision:

Definition:
Precision is the proportion of true positives out of all positive predictions
made by the model. It is calculated as:

    Precision = TP / (TP + FP)

Interpretation:
A precision of 92.94% indicates that when the model predicts fraud, it is
correct about 93% of the time. High precision is crucial in reducing false
positives, which is especially important in contexts like fraud detection,
where false positives can lead to unnecessary investigations and potential
customer dissatisfaction. Conversely, low precision would imply that many
of the fraud predictions are incorrect, resulting in wasted resources.

3. Recall (Sensitivity or True Positive Rate):

Definition:
Recall measures the proportion of actual positive cases (fraud) the model
correctly identified. It is calculated as:

    Recall = TP / (TP + FN)

Interpretation:
A recall of 79.80% indicates that the model successfully identifies around
80% of actual fraud cases. High recall is important in minimizing false

negatives and detecting as many fraud cases as possible. Low recall would
suggest that the model is missing a significant number of fraud cases, which
could lead to severe consequences if fraudulent activities go undetected.

4. F1 Score:

Definition:
The F1 Score is the harmonic mean of precision and recall, balancing the
trade-offs between precision and recall. It is calculated as:

    F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Interpretation:
An F1 Score of 85.87% indicates a strong balance between precision and
recall, meaning the model is both accurate in its positive predictions and
effective in identifying most fraud cases. A high F1 Score is particularly
valuable in scenarios with imbalanced data, such as fraud detection, where
both precision and recall are critical. A low F1 Score would suggest that
either precision, recall, or both are suboptimal, indicating room for
improvement in the model's tuning.

5. False Positive Rate (FPR):

Definition:
FPR measures the proportion of negative instances (non-fraud) that are
incorrectly classified as positive (fraud). It is calculated as:

    FPR = FP / (FP + TN)

Interpretation:
A low FPR indicates that the model rarely misclassifies legitimate
transactions as fraud, which is crucial for maintaining customer satisfaction
and reducing operational costs. A high FPR would suggest that too many
non-fraudulent transactions are being flagged as fraud, leading to excessive
false alarms and potentially eroding customer trust.

6. True Negative Rate (Specificity):

Definition:
Specificity is the proportion of actual negative cases (non-fraud) the model
correctly identified. It is calculated as:

    Specificity = TN / (TN + FP)

Interpretation:
High specificity indicates that the model effectively identifies non-
fraudulent transactions, reducing the likelihood of legitimate customers
being wrongly flagged as fraudulent. Low specificity would mean that the
model frequently misclassifies non-fraudulent transactions as fraud, which
could negatively impact user experience and operational efficiency.
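As a quick check, the Python sketch below plugs the confusion matrix reported in Figure 8.1 into these formulas. sklearn prints the matrix with actual classes as rows and predicted classes as columns, so [[195, 6], [20, 79]] corresponds to TN = 195, FP = 6, FN = 20, and TP = 79; the results agree with the accuracy, precision, recall, and F1 values printed in that output.

TN, FP, FN, TP = 195, 6, 20, 79   # from the confusion matrix in Figure 8.1

accuracy    = (TP + TN) / (TP + TN + FP + FN)                 # 274 / 300  ≈ 0.913
precision   = TP / (TP + FP)                                  # 79 / 85    ≈ 0.929
recall      = TP / (TP + FN)                                  # 79 / 99    ≈ 0.798
f1          = 2 * precision * recall / (precision + recall)   # ≈ 0.859
fpr         = FP / (FP + TN)                                  # 6 / 201    ≈ 0.030
specificity = TN / (TN + FP)                                  # 195 / 201  ≈ 0.970

print(accuracy, precision, recall, f1, fpr, specificity)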

AUC, Gini, and KS Statistic

AUC, Gini, and KS are three key metrics often reported together to evaluate the
performance of classification models, particularly in how well the model separates
events from non-events. These metrics provide insights into the model's ability to
distinguish between positive and negative classes, which is crucial in applications
like fraud detection, credit scoring, and other predictive tasks. They are closely
related but offer different perspectives on model performance.

How They Are Related

● AUC (Area Under the Curve) is a measure derived from the ROC (Receiver
Operating Characteristic) curve, representing the model's ability to
differentiate between classes across various threshold values.

● Gini Coefficient is directly related to the AUC (Area Under the Curve) and is
often calculated using the formula: Gini = 2 * AUC - 1. The Gini coefficient
quantifies the inequality of model predictions, where a higher value

indicates better model performance in distinguishing between classes.


Conversely, you can derive the AUC from the Gini coefficient using the
formula: AUC = (Gini + 1) / 2. This relationship underscores the close
connection between these two metrics in evaluating the discriminatory
power of a model.

● KS Statistic (Kolmogorov-Smirnov Statistic) is related to the maximum


difference between the cumulative distributions of the positive and
negative classes and provides insight into the model's discriminatory power.

These metrics are related because they all measure the same underlying concept:
the model’s ability to distinguish between the two classes correctly. However, they
do so from slightly different angles, providing a comprehensive view of the model’s
performance.

Figure 8.4: Visualizing Model Discrimination: AUC and Gini Metrics in ROC Space

Explanation of Each Metric

AUC (Area Under the Curve)

● Calculation:
The AUC is the area under the ROC curve, which plots the True Positive Rate

(TPR) against the False Positive Rate (FPR) at various threshold levels. The
AUC value ranges from 0 to 1, with 0.5 indicating no discriminatory power
(equivalent to random guessing) and 1.0 representing perfect discrimination
between classes. The provided ROC curve visualizes the model's ability to
distinguish between positive and negative instances.

● Interpretation:
A higher AUC value indicates a better model. In this example, the AUC is
0.93, suggesting that the model has a 93% chance of correctly distinguishing
between a positive and a negative instance. This makes AUC a crucial metric
in evaluating how well a model is likely to perform across different
thresholds, providing insight into the overall effectiveness of the model
across the entire range of possible decision boundaries.

Gini Coefficient

● Calculation:
The Gini coefficient is calculated as Gini = 2 * AUC - 1. It ranges from -1 to 1,
where 0 indicates no discriminatory power, 1 represents perfect separation,
and negative values indicate a model that is worse than random guessing.

● Interpretation:
The Gini coefficient provides a measure of the inequality in model
predictions. A higher Gini coefficient indicates that the model effectively
separates the classes, with values close to 1 signifying strong performance.
For example, a Gini of 0.86 (derived from an AUC of 0.93) would be
interpreted as a strong model in terms of its discriminatory power.

KS Statistic (Kolmogorov-Smirnov Statistic)

● Calculation:
The KS Statistic is calculated as the maximum difference between the
cumulative distribution functions of the positive class and the negative
class. It identifies the point where the model most effectively distinguishes
between the two classes.

● Interpretation:
The KS Statistic is useful in identifying the threshold at which the model

most effectively separates the positive and negative classes. A high KS value,
such as 0.81, indicates the model has a strong discriminatory ability at a
particular threshold. This metric is particularly useful in determining where
to set the cutoff for decision-making in binary classification problems.
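Because all three metrics are derived from the same ROC quantities, they can be computed together in a few lines. The Python sketch below assumes y_test and y_proba are available from a fitted model, as in Program 8.2.

from sklearn.metrics import roc_auc_score, roc_curve

auc = roc_auc_score(y_test, y_proba)
gini = 2 * auc - 1                     # equivalently, auc == (gini + 1) / 2
fpr, tpr, _ = roc_curve(y_test, y_proba)
ks = (tpr - fpr).max()                 # maximum gap between the two cumulative rates

print(round(auc, 3), round(gini, 3), round(ks, 3))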

Lift Table

Construction

A lift table is an essential tool for evaluating the effectiveness of a classification


model. It breaks down the predictions into deciles, allowing you to examine the
distribution of events (e.g., fraud cases) across the population. Here’s how the table
is constructed:

● Decile: The population is divided into ten equal parts based on predicted
probabilities.

● Number of Cases: Represents the total number of observations in each


decile.

● Number of Fraud Cases: Counts the actual events (fraud) in each decile.

● Average Score: The average predicted probability score for each decile.

● Cumulative Fraud Cases: Sum of the fraud cases up to that decile.

● Cumulative Cases: Total number of cases up to that decile.

● % of Events: Percentage of the total events in each decile.

● Gain: Represents the cumulative gain in identifying fraud cases.

● Lift: The lift value compares the model's effectiveness to random guessing.

● Cumulative Non-Events: Tracks non-events (e.g., non-fraud cases)


cumulatively.

● Cumulative Non-Event Rate: The rate at which non-events accumulate.



● KS Statistic: Reflects the maximum difference between the cumulative


event and non-event rates.

Table 8.3: Lift Table

Lift Table Interpretation

The lift table helps make informed decisions about where to set the probability
threshold for classification. By analyzing the cumulative metrics, you can determine
the trade-offs between capturing a higher percentage of events and minimizing
false positives. This allows you to tailor the model to meet specific business goals,
such as maximizing fraud detection while minimizing unnecessary investigations.

For example, if the goal is to capture as many fraud cases as possible, you might
choose a threshold that aligns with a decile showing high cumulative fraud cases
and a significant lift, indicating that the model performs better than random
guessing. Conversely, if the cost of false positives is high, you may opt for a more
conservative threshold to reduce the number of non-fraud cases identified as fraud.

This granular approach provides a clear understanding of model performance across


different segments, enabling you to fine-tune the model according to business
needs.

Program 8.3 illustrates how to create a lift table and corresponding charts, providing
a visual interpretation of model performance.

Program 8.3: Creating Lift Table and Charts

R Programming:

library(dplyr)     # needed for group_by/summarise/mutate
library(ggplot2)   # needed for the lift and gain charts

# The Random Forest model from Program 8.2 is already built; reuse its scores
y_proba <- probs
y_test  <- as.numeric(as.character(test_data$Fraud))

df_lift <- data.frame(prob = y_proba, target = y_test)
df_lift$decile <- cut(df_lift$prob,
                      breaks = quantile(df_lift$prob, probs = seq(0, 1, by = 0.1)),
                      include.lowest = TRUE, labels = FALSE)

# Create Lift Table


lift_table <- df_lift %>%
group_by(decile) %>%
summarise(
Number_of_Cases = n(),
Number_of_Fraud_Cases = sum(target),
Average_Score = mean(prob)
) %>%
mutate(
Cumulative_Fraud_Cases =
cumsum(Number_of_Fraud_Cases),
    Cumulative_Cases = cumsum(Number_of_Cases),
    `%_of_Events` = Number_of_Fraud_Cases /
sum(Number_of_Fraud_Cases) * 100,
Gain = Cumulative_Fraud_Cases /
sum(Number_of_Fraud_Cases) * 100,
Lift = Gain / (Cumulative_Cases /
sum(Number_of_Cases) * 100)
)

# Calculate Cumulative Non-Events and KS Statistic


lift_table <- lift_table %>%
mutate(
Cumulative_Non_Events = Cumulative_Cases -
Cumulative_Fraud_Cases,
Cum_Non_Events_Rate = Cumulative_Non_Events /
(sum(Number_of_Cases) -
sum(Number_of_Fraud_Cases)),
KS_Statistic = abs(Gain - Cum_Non_Events_Rate *
100)
)

# Print Lift Table


print(lift_table)

# Plot Lift Chart


ggplot(lift_table, aes(x = as.numeric(decile), y =
Lift)) +
geom_line() +
geom_point() +
ggtitle('Lift Chart') +
xlab('Decile') +
ylab('Lift') +
theme_minimal()

# Plot Gain Chart


ggplot(lift_table, aes(x = as.numeric(decile), y =
Gain)) +
geom_line() +
geom_point() +
ggtitle('Gain Chart') +
xlab('Decile') +
ylab('Gain (%)') +
theme_minimal()

Python Programming:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Assuming the Random Forest model is already built and we have y_proba and y_test
df_lift = pd.DataFrame({'prob': y_proba, 'target': y_test})
df_lift['decile'] = pd.qcut(df_lift['prob'], 10, labels=False, duplicates='drop')

# Create Lift Table


lift_table = df_lift.groupby('decile').agg(
Number_of_Cases=('target', 'size'),
Number_of_Fraud_Cases=('target', 'sum'),
Average_Score=('prob', 'mean')
).reset_index()

lift_table['Cumulative_Fraud_Cases'] =
lift_table['Number_of_Fraud_Cases'].cumsum()
lift_table['Cumulative_Cases'] =
lift_table['Number_of_Cases'].cumsum()
lift_table['%_of_Events'] =
lift_table['Number_of_Fraud_Cases'] /
lift_table['Number_of_Fraud_Cases'].sum() * 100
lift_table['Gain'] =
lift_table['Cumulative_Fraud_Cases'] /
lift_table['Number_of_Fraud_Cases'].sum() * 100
lift_table['Lift'] = lift_table['Gain'] /
(lift_table['Cumulative_Cases'] /
lift_table['Number_of_Cases'].sum() * 100)

# Calculate Cumulative Non-Events and KS Statistic


lift_table['Cumulative_Non_Events'] =
lift_table['Cumulative_Cases'] -
lift_table['Cumulative_Fraud_Cases']
lift_table['Cumulative_Non_Event_Rate'] =
lift_table['Cumulative_Non_Events'] /
(lift_table['Number_of_Cases'].sum() -
lift_table['Number_of_Fraud_Cases'].sum())
lift_table['KS_Statistic'] = abs(lift_table['Gain']
- lift_table['Cumulative_Non_Event_Rate'] * 100)

# Print Lift Table


print(lift_table)

# Plot Lift Chart


plt.plot(lift_table['decile'] + 1,
lift_table['Lift'], marker='o', color='blue')
plt.title('Lift Chart')
plt.xlabel('Decile')
plt.ylabel('Lift')
plt.grid(True)
plt.show()

# Plot Gain Chart


plt.plot(lift_table['decile'] + 1,
lift_table['Gain'], marker='o', color='green')
plt.title('Gain Chart')
plt.xlabel('Decile')
plt.ylabel('Gain (%)')
plt.grid(True)

plt.show()

SAS Programming:

PROC RANK DATA=rf_model_output OUT=decile_groups


GROUPS=10;
VAR predicted_proba;
RANKS Decile;
RUN;

PROC SUMMARY DATA=decile_groups NWAY;


CLASS Decile;
VAR actual target predicted_proba;
OUTPUT OUT=LiftTable
N=Number_of_Cases
SUM(target)=Number_of_Fraud_Cases
MEAN(predicted_proba)=Average_Score;
RUN;

DATA LiftTable;
SET LiftTable;
BY Decile;
RETAIN Cumulative_Fraud_Cases Cumulative_Cases
Gain Lift Cumulative_Non_Events Cum_Non_Events_Rate
KS_Statistic;
IF _N_ = 1 THEN DO;
Cumulative_Fraud_Cases = 0;
Cumulative_Cases = 0;
END;
Cumulative_Fraud_Cases + Number_of_Fraud_Cases;
Cumulative_Cases + Number_of_Cases;
Gain = Cumulative_Fraud_Cases /
SUM(Number_of_Fraud_Cases) * 100;
Lift = Gain / (Cumulative_Cases /
SUM(Number_of_Cases) * 100);
Cumulative_Non_Events = Cumulative_Cases -
Cumulative_Fraud_Cases;
Cum_Non_Events_Rate = Cumulative_Non_Events /
(SUM(Number_of_Cases) -
SUM(Number_of_Fraud_Cases));
KS_Statistic = ABS(Gain - Cum_Non_Events_Rate *
100);
RUN;

PROC PRINT DATA=LiftTable;



TITLE "Lift Table";


RUN;

PROC SGPLOT DATA=LiftTable;


SERIES X=Decile Y=Lift / MARKERS;
TITLE "Lift Chart";
RUN;

PROC SGPLOT DATA=LiftTable;


SERIES X=Decile Y=Gain / MARKERS;
TITLE "Gain Chart";
RUN;

Lift and Gain Charts

Lift and Gain Charts are essential tools for assessing the performance of
classification models, particularly in scenarios where the correct identification of
positive cases (such as fraud detection or marketing responses) is crucial. These
charts help quantify how much better a model performs compared to random
guessing and allow us to evaluate the model's effectiveness in identifying high-risk
or high-value cases.

Lift Chart

● Construction: The Lift Chart is constructed by plotting the lift value on the Y
axis against the deciles on the X axis. Each decile represents a tenth of the
population, ordered by the predicted probabilities from highest to lowest.
The lift is calculated as the ratio of the target rate in each decile to the
target rate in the entire population.

Figure 8.5: Lift Chart

● Interpretation: The Lift Chart visualizes how much better the model is at
identifying positive cases compared to random selection. A steeper slope
indicates better performance, showing that the model effectively
concentrates positive cases in the top deciles. The flatter the curve, the less
effective the model is at differentiating between positive and negative
cases.

Gain Chart

● Construction: The Gain Chart is constructed by plotting the cumulative gain


(percentage of captured positive cases) on the Y axis against the deciles on
the X axis. It shows the proportion of positive cases identified as we move
through the population from highest to lowest predicted probability.

Figure 8.6: Gains Chart



● Interpretation: The Gain Chart provides insight into the cumulative


percentage of the target population (e.g., fraud cases) captured as you
move through the deciles. A steeper curve indicates that the model
effectively captures a large portion of positive instances early on, which is
particularly valuable in scenarios where resources are limited, and
prioritization is needed.

These charts are derived from the Lift Table, which provides detailed information on
the model’s performance at each decile. From the table, metrics such as cumulative
gain and lift are calculated and visualized in the charts. The Lift Table itself includes
columns like the number of cases, number of positive cases (fraud cases),
cumulative responses, cumulative non-events, average score, gain, and lift.

Setting the Threshold

● Threshold Setting: One practical application of the Lift Table and the
corresponding charts is in determining the optimal threshold for
classification. Instead of defaulting to a 0.5 probability threshold, you can
choose a threshold that balances precision and recall according to your
specific business needs. For example, if you aim to capture a certain
percentage of fraud cases while minimizing false positives, you could use
the Lift Table to identify the decile where the trade-off between capturing
fraud cases and the number of cases to investigate is most favorable.

By understanding these charts and how to use them effectively, you can make more
informed decisions about model deployment and resource allocation, ensuring that
your classification models provide maximum value.
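For example, if the lift table suggests concentrating investigations on the top two deciles, that decile choice can be translated back into a probability cutoff. The short Python sketch below assumes the df_lift frame built in Program 8.3's Python code, where higher decile numbers correspond to higher predicted probabilities.

# Translate a decile choice into a score cutoff (assumes df_lift from Program 8.3)
top = df_lift['decile'].max()                         # highest-score bin
cutoff = df_lift.loc[df_lift['decile'] >= top - 1, 'prob'].min()
print("Probability cutoff implied by the top two deciles:", round(cutoff, 3))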

Regression Metrics: Introduction and Example

In data science, regression models are essential for predicting continuous outcomes,
such as house prices, customer lifetime value, or sales forecasts. To evaluate these
models, various metrics measure how closely the model's predictions align with
actual values. This section introduces key regression metrics, demonstrated through

a practical example using a synthetic data set. By implementing a linear regression


model in SAS, Python, and R, we will explore metrics such as Mean Squared Error
(MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R-squared,
Adjusted R-squared, AIC, and BIC.

Establishing the Example: Creating the Data Set

We'll begin by creating a synthetic data set in each programming language,


containing a continuous target variable, "HousePrice," and five predictors: Number
of Bedrooms, Square Footage, Age of the House, Number of Bathrooms, and
Distance to City Center.

This section will walk through the creation of this data set, ensuring consistency
across SAS, Python, and R. We will also introduce the concept of regression metrics
and explain why these metrics are crucial for understanding and improving model
performance.

Program 8.4 creates a synthetic data set and evaluates the model using various
regression performance metrics across different programming environments.

Program 8.4: Regression Example with Performance Metrics

R Programming:


# Load necessary libraries
library(ggplot2)
library(stats)
library(MASS)

# Synthetic Data
set.seed(42)
n <- 100
Bedrooms <- sample(1:5, n, replace = TRUE)
Bathrooms <- sample(1:3, n, replace = TRUE)
SquareFootage <- sample(600:3500, n, replace = TRUE)
Age <- sample(0:100, n, replace = TRUE)
DistanceFromCity <- sample(1:50, n, replace = TRUE)
Price <- Bedrooms * 50000 + Bathrooms * 30000 + SquareFootage
* 100 + Age * (-2000) + DistanceFromCity * (-3000) + rnorm(n,
mean = 0, sd = 10000)

# Creating the DataFrame



df <- data.frame(Bedrooms, Bathrooms, SquareFootage, Age,


DistanceFromCity, Price)

# Splitting the data into training and testing sets


set.seed(42)
train_idx <- sample(1:n, size = 0.7 * n)
train_data <- df[train_idx, ]
test_data <- df[-train_idx, ]

# Building the Linear Regression Model


model <- lm(Price ~ Bedrooms + Bathrooms + SquareFootage +
Age + DistanceFromCity, data = train_data)

# Predictions
y_pred <- predict(model, newdata = test_data)
y_test <- test_data$Price

# Calculating Metrics
mse <- mean((y_test - y_pred)^2)
mae <- mean(abs(y_test - y_pred))
rmse <- sqrt(mse)
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared

cat("MSE:", mse, "\n")


cat("MAE:", mae, "\n")
cat("RMSE:", rmse, "\n")
cat("R-Squared:", r_squared, "\n")
cat("Adjusted R-Squared:", adj_r_squared, "\n")

# AIC and BIC


aic <- AIC(model)
bic <- BIC(model)

cat("AIC:", aic, "\n")


cat("BIC:", bic, "\n")

# Residuals
residuals <- y_test - y_pred

# Residual Plot
ggplot(data.frame(y_pred, residuals), aes(x = y_pred, y =
residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype =
"dashed") +
labs(title = "Residual Plot", x = "Predicted Values", y =
"Residuals")

# QQ Plot
qqnorm(residuals)
qqline(residuals, col = "red")
title("QQ Plot")

# Histogram of Residuals

ggplot(data.frame(residuals), aes(x = residuals)) +


geom_histogram(binwidth = 10000, fill = "blue", alpha =
0.7) +
geom_density(color = "red") +
labs(title = "Histogram of Residuals", x = "Residuals", y =
"Frequency")

Python Programming:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,
mean_absolute_error
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic Data (add your data here)


np.random.seed(42)
n = 100
Bedrooms = np.random.randint(1, 6, n)
Bathrooms = np.random.randint(1, 4, n)
SquareFootage = np.random.randint(600, 3500, n)
Age = np.random.randint(0, 100, n)
DistanceFromCity = np.random.randint(1, 50, n)
Price = Bedrooms*50000 + Bathrooms*30000 + SquareFootage*100
+ Age*(-2000) + DistanceFromCity*(-3000) +
np.random.normal(0, 10000, n)

# Creating the DataFrame


df = pd.DataFrame({
    'Bedrooms': Bedrooms,
'Bathrooms': Bathrooms,
'SquareFootage': SquareFootage,
'Age': Age,
'DistanceFromCity': DistanceFromCity,
'Price': Price
})

# Splitting the data


X = df[['Bedrooms', 'Bathrooms', 'SquareFootage', 'Age',
'DistanceFromCity']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, random_state=42)

# Building the Linear Regression Model


model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Calculating Metrics

mse = mean_squared_error(y_test, y_pred)


mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
r_squared = model.score(X_test, y_test)

# Adjusted R-squared
n = len(y_test)
p = X_test.shape[1]
adj_r_squared = 1 - (1-r_squared)*(n-1)/(n-p-1)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R-Squared: {r_squared}")
print(f"Adjusted R-Squared: {adj_r_squared}")

# Calculating AIC and BIC using statsmodels


X_train_sm = sm.add_constant(X_train) # Adding a constant
term
model_sm = sm.OLS(y_train, X_train_sm).fit()
aic = model_sm.aic
bic = model_sm.bic

print(f"AIC: {aic}")
print(f"BIC: {bic}")

# Residuals
residuals = y_test - y_pred

# Residual Plot
plt.figure(figsize=(8, 5))
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

# QQ Plot
sm.qqplot(residuals, line='s')
plt.title('QQ Plot')
plt.show()

# Histogram of Residuals
plt.figure(figsize=(8, 5))
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Histogram of Residuals')
plt.show()

SAS Programming:

/* GENERATING THE SYNTHETIC DATA */


DATA housing;
    CALL STREAMINIT(42);
    DO i = 1 TO 100;
Bedrooms = RAND("Integer", 1, 5);

Bathrooms = RAND("Integer", 1, 3);


SquareFootage = RAND("Integer", 600, 3500);
Age = RAND("Integer", 0, 100);
DistanceFromCity = RAND("Integer", 1, 50);
Price = Bedrooms * 50000 + Bathrooms * 30000 +
SquareFootage * 100 + Age * (-2000) + DistanceFromCity * (-
3000) + RAND("Normal", 0, 10000);
OUTPUT;
END;
RUN;

/* SPLITTING THE DATA INTO TRAINING AND TESTING SETS */


PROC SURVEYSELECT DATA=housing OUT=train SAMPRATE=0.7 OUTALL;
RUN;

DATA train test;


SET housing;
IF selected = 1 THEN OUTPUT train;
ELSE OUTPUT test;
RUN;

/* BUILDING THE LINEAR REGRESSION MODEL */


PROC REG DATA=train OUTEST=est;
MODEL Price = Bedrooms Bathrooms SquareFootage Age
DistanceFromCity;
OUTPUT OUT=pred P=Predicted R=Residual;
RUN;

/* GETTING THE PERFORMANCE METRICS FOR THE TEST DATA SET */


PROC REG DATA=test OUTEST=est_test;
MODEL Price = Bedrooms Bathrooms SquareFootage Age
DistanceFromCity;
OUTPUT OUT=pred_test P=Predicted R=Residual;
RUN;

PROC MEANS DATA=pred_test MEAN;


VAR Residual;
OUTPUT OUT=metrics MEAN(Residual)=MAE;
RUN;

PROC SQL NOPRINT;


SELECT MEAN(Residual*Residual) INTO :MSE FROM pred_test;
SELECT SQRT(MEAN(Residual*Residual)) INTO :RMSE FROM
pred_test;
SELECT 1 - (SUM(Residual*Residual) / SUM((Price -
MEAN(Price))**2)) INTO :R_Squared FROM pred_test;
SELECT COUNT(*) INTO :n FROM test;
SELECT COUNT(*) INTO :p FROM dictionary.columns WHERE
libname='WORK' AND memname='PRED_TEST' AND name IN
('Bedrooms', 'Bathrooms', 'SquareFootage', 'Age',
'DistanceFromCity');
QUIT;

/* Adjusted R-squared computed with macro arithmetic (PROC SQL has no built-in DUAL table) */
%LET Adj_R_Squared = %SYSEVALF(1 - ((1 - &R_Squared) * (&n - 1) / (&n - &p - 1)));

%PUT MSE = &MSE;


%PUT MAE = &MAE;
%PUT RMSE = &RMSE;
%PUT R-SQUARED = &R_Squared;
%PUT ADJUSTED R-SQUARED = &Adj_R_Squared;

/* AIC AND BIC CALCULATION */


PROC REG DATA=train;
MODEL Price = Bedrooms Bathrooms SquareFootage Age
DistanceFromCity / AIC BIC;
RUN;

/* RESIDUAL PLOT */
PROC SGPLOT DATA=pred_test;
SCATTER X=Predicted Y=Residual;
REFLINE 0 / AXIS=y LINEATTRS=(COLOR=red
PATTERN=shortdash);
TITLE "Residual Plot";
RUN;

/* QQ PLOT */
PROC UNIVARIATE DATA=pred_test;
QQPLOT Residual / NORMAL(MU=EST SIGMA=EST) SQUARE;
TITLE "QQ Plot";
RUN;

/* HISTOGRAM OF RESIDUALS */
PROC SGPLOT DATA=pred_test;
HISTOGRAM Residual;
DENSITY Residual / TYPE=normal;
TITLE "Histogram of Residuals";
RUN;

Figure 8.7 shows the output from the regression model, including MSE, MAE, RMSE,
and R-squared values, which provide insights into model accuracy.

Figure 8.7: Regression Example Output


MSE: 83100494.4554503

MAE: 6887.38035920041

RMSE: 9115.947260457922

R-Squared: 0.9964657894636241

Adjusted R-Squared: 0.9957294956018791



AIC: 1455.1364031128012

BIC: 1468.6273745650974

Residuals Analysis: A Key Tool in Regression Metrics


Before diving into the various performance metrics used to evaluate regression
models, it is important to understand residuals and their significance. The residual
for each observation is the difference between the actual value and the predicted
value:

𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 = 𝑦𝑖 − 𝑦̂𝑖

Why Residuals Matter

● Model Fit: Residuals help to assess how well the model fits the data. Ideally,
residuals should be randomly distributed around zero, indicating that the
model captures the underlying data patterns without bias.

● Outlier Detection: Large residuals may indicate outliers or areas where the
model does not perform well. Investigating these residuals can provide
insights into whether the model needs to be improved or whether certain
data points are problematic.

● Assumption Checking: Residuals are used to check the assumptions of


linear regression, such as homoscedasticity (constant variance of residuals)
and normality of residuals. Violations of these assumptions can lead to
incorrect conclusions from the model.

Visualizing Residuals

Visualizing residuals through various plots is an essential practice in regression


analysis to evaluate model performance. These visualizations provide insights into
the model's assumptions and can help diagnose potential issues. Below are analyses
for three key residual plots:

1. Residual Plot:

o Interpretation: The residual plot shows the residuals (the


differences between actual and predicted values) on the Y axis and
the predicted values on the X axis. Ideally, the residuals should be
randomly scattered around the horizontal axis (y = 0), with no
discernible pattern.

o What to Look For:

▪ Non-linearity: If the residuals display a pattern (e.g., a


curve), it suggests that the model may not have captured a
non-linear relationship in the data.

▪ Heteroscedasticity: If the spread of residuals increases or


decreases with the predicted values, this indicates
heteroscedasticity, meaning that the variance of the errors
is not constant. This violates one of the key assumptions of
linear regression.

▪ Outliers: Points that lie far from the horizontal axis or deviate significantly from the general pattern may indicate outliers, which could disproportionately influence the model.

o Possible Indications of Problems:

▪ A pattern in the residuals might suggest that a linear model


is not appropriate, and a non-linear model might be
needed.

▪ Heteroscedasticity might require a transformation of the


dependent variable or using a different model that can
handle varying error variances.

Figure 8.8: Residual Plot Example

2. QQ Plot (Quantile-Quantile Plot):

o Interpretation: The QQ plot compares the distribution of residuals


to a normal distribution. Residuals are plotted on the Y axis, and the
corresponding theoretical quantiles from a normal distribution are
plotted on the X axis. If the residuals follow a normal distribution,
the points will lie along the red diagonal line.

o What to Look For:

▪ Deviations from the Line: Significant deviations from the


diagonal line indicate that the residuals are not normally
distributed. For example, a curve in the plot suggests
skewness, while points that deviate at the ends of the plot
suggest heavy tails or outliers.

o Possible Indications of Problems:

▪ Non-normality of residuals might imply that the model's


predictions are biased or that the model is not well-
specified. This may not be a critical issue in some cases,
especially for large sample sizes, but it can affect the
reliability of hypothesis tests and confidence intervals.

▪ A transformation of the dependent variable or the use of a


robust regression method might be necessary if the
residuals deviate significantly from normality.

Figure 8.9: QQ Plot Example

3. Histogram of Residuals:

o Interpretation: The histogram visually represents the distribution of


residuals. Ideally, the histogram should resemble a normal
distribution (bell-shaped curve) centered around zero, indicating
that the residuals are symmetrically distributed around the
predicted values.

o What to Look For:

▪ Skewness: A skewed histogram indicates that the residuals


are not symmetrically distributed, which could affect the
accuracy of the model's predictions.

▪ Kurtosis: If the histogram shows heavy tails or a peaked


distribution, this suggests that there are more extreme
values (outliers) than expected under normality.

▪ Bimodality or Multimodality: If the histogram displays multiple peaks, it might indicate that the model is missing important variables or that the data could be segmented into distinct groups.

o Possible Indications of Problems:

▪ A non-normal histogram suggests that the model's error


distribution deviates from the normal distribution, which
could lead to inaccurate predictions and unreliable
statistical inferences.

▪ If skewness or kurtosis is present, consider applying a


transformation to the target variable or using a different
modeling approach that does not assume normality.

Figure 8.10: Histogram of Residuals

These residual plots are crucial diagnostic tools for assessing the validity of a
regression model. By carefully analyzing these visualizations, you can identify
potential issues with model fit, validate assumptions, and take corrective actions to
improve model performance.

Using Residuals in Conjunction with Performance Metrics

Once residuals have been analyzed and understood, performance metrics like MSE,
MAE, RMSE, and others can provide a quantitative evaluation of model
performance. These metrics are directly related to residuals, as they aggregate the
residuals in different ways to summarize the model’s accuracy and reliability.

Mean Squared Error (MSE)

● Calculation:

o The MSE is calculated as the average of the squared differences


between actual values 𝑦𝑖 and predicted values 𝑦̂𝑖 :
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

o Here, n is the number of observations. Each squared difference


penalizes larger errors more heavily, giving greater weight to
outliers.

● Interpretation:

o MSE provides an average measure of the squared differences


between actual and predicted values, reflecting the overall error
magnitude in the model's predictions. Since it squares the errors,
MSE is particularly sensitive to large deviations, making it useful for
highlighting significant discrepancies. A lower MSE indicates higher
model accuracy. For instance, a high MSE might suggest that the
model’s predictions are consistently far from the actual values,
potentially due to model issues or outliers in the data.

Mean Absolute Error (MAE)

● Calculation:

o MAE is the average of the absolute differences between actual


values and predicted values:

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

o Unlike MSE, MAE does not square the errors, so each error
contributes to the overall metric in a linear fashion.

● Interpretation:

o MAE provides a straightforward measure of error in the model’s


predictions. It is less sensitive to outliers than MSE, making it a
more robust metric in some cases. An MAE of 12,105.19 means
that, on average, the model’s predictions are off by $12,105, which
is easier to interpret because it’s in the same units as the target
variable.

Root Mean Squared Error (RMSE)

● Calculation:

o RMSE is the square root of the MSE:

𝑅𝑀𝑆𝐸 = √𝑀𝑆𝐸

o RMSE provides a measure of error with the same units as the target
variable.

● Interpretation:

o RMSE gives a more interpretable measure of error than MSE


because it is in the same units as the target variable. It’s also more
sensitive to outliers than MAE because it involves squaring the
errors. An RMSE of 15,187.73 suggests that the model’s typical
prediction error is about $15,188.
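For readers who want to reproduce these three error metrics in Python, the short sketch below uses scikit-learn and NumPy. The y_actual and y_predicted arrays are hypothetical placeholders; in practice you would supply the test-set target values and the model's predictions.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical actual and predicted house prices
y_actual = np.array([250000, 310000, 198000, 415000, 275000])
y_predicted = np.array([243000, 322000, 205000, 401000, 268000])

mse = mean_squared_error(y_actual, y_predicted)    # average squared error
mae = mean_absolute_error(y_actual, y_predicted)   # average absolute error
rmse = np.sqrt(mse)                                # error in the target's units

print(f"MSE: {mse:,.2f}  MAE: {mae:,.2f}  RMSE: {rmse:,.2f}")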

R-squared (R²)

● Calculation:

o R-squared is calculated as the proportion of the variance in the


dependent variable that is predictable from the independent
variables:
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

o Where 𝑦̅ is the mean of the actual values.

● Interpretation:

o R-squared values range from 0 to 1, with values closer to 1


indicating that a larger proportion of the variance in the dependent
variable is explained by the model. An R-squared of 0.92 suggests
that 92% of the variance in house prices is explained by the model’s
predictors, indicating a strong fit. However, R-squared can be
misleadingly high in the presence of overfitting, so it should be used
in conjunction with other metrics like Adjusted R-squared.

Adjusted R-squared

● Calculation:

o Adjusted R-squared adjusts the R-squared value for the number of


predictors in the model, accounting for the model’s complexity:
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}

o Here, p is the number of predictors, and n is the number of


observations.

● Interpretation:

o Adjusted R-squared provides a more accurate measure of model fit,


particularly when comparing models with different numbers of
predictors. It penalizes the addition of non-significant predictors,
helping to avoid overfitting. If the adjusted R-squared is significantly

lower than the R-squared, it indicates that some predictors may not
contribute meaningful information to the model.

Akaike Information Criterion (AIC)

● Calculation:

o AIC is calculated as:

𝐴𝐼𝐶 = 2𝑘 − 2 𝑙𝑛(𝐿̂)

o Where k is the number of parameters in the model, and L is the


maximum likelihood of the model.

● Interpretation:

o AIC balances model fit and complexity. A lower AIC indicates a


better model, with a good balance between fit and simplicity. It
penalizes models with more predictors, discouraging overfitting.
Comparing AIC values across models can help select the best model.

Bayesian Information Criterion (BIC)

● Calculation:

o BIC is similar to AIC but includes a stronger penalty for models with
more parameters:

𝐵𝐼𝐶 = 𝑙𝑛 (𝑛)𝑘 − 2 𝑙𝑛(𝐿̂)

o Here, n is the number of observations.

● Interpretation:

o BIC is often used to compare models, with a lower BIC indicating a


better model fit. Like AIC, it penalizes complexity, but more strictly.
BIC is particularly useful when comparing models with varying
numbers of predictors, favoring more parsimonious models.
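As an illustration, the minimal Python sketch below fits an ordinary least squares model with the statsmodels library, whose fitted results object exposes R-squared, Adjusted R-squared, AIC, and BIC directly. The small DataFrame is a hypothetical stand-in for your own predictors and target.

import pandas as pd
import statsmodels.api as sm

# Hypothetical training data
df = pd.DataFrame({
    "SquareFootage": [1500, 2200, 1800, 2600, 1200, 2000],
    "Age":           [30, 5, 15, 2, 60, 20],
    "Price":         [310000, 455000, 368000, 520000, 215000, 395000],
})

X = sm.add_constant(df[["SquareFootage", "Age"]])   # add the intercept term
results = sm.OLS(df["Price"], X).fit()

print("R-squared:", results.rsquared)
print("Adjusted R-squared:", results.rsquared_adj)
print("AIC:", results.aic)
print("BIC:", results.bic)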

Regression Metrics Rules of Thumb

In regression modeling, performance metrics help us evaluate how well the model's
predictions align with the actual values. Regression metrics are more context-
dependent than classification metrics, where fixed thresholds are often used. The
effectiveness of a regression model is closely tied to the range and variability of the
target variable. As such, what might be considered a "good" performance in one
scenario could be suboptimal in another, depending on the specific application and
data set characteristics.

For example, in predicting house prices, an MAE of $10,000 might be acceptable if


the target prices range from $200,000 to $1,000,000, as this represents a relatively
small percentage of the overall value. However, in predicting monthly utility bills,
where values typically range from $50 to $300, a $10,000 MAE would indicate an
extremely poor model. This variability highlights the importance of considering the
context when evaluating regression metrics, making it essential to tailor
performance expectations to the specific characteristics of the problem at hand.

Table 8.4 provides general rules of thumb for interpreting regression metrics,
offering guidance on what constitutes poor, good, and very good model
performance in different contexts.

Table 8.4: Regression Performance Metrics “Rules of Thumb” Ranges

Mean Squared Error (MSE)
● Poor Threshold: High value relative to the range of the target variable
● Good Threshold: Moderate value relative to the range of the target variable
● Very Good Threshold: Low value relative to the range of the target variable
● Explanation: A high MSE indicates large average squared differences between predicted and actual values, which can be due to poor model fit or significant outliers. A lower MSE suggests better model accuracy.

Mean Absolute Error (MAE)
● Poor Threshold: High value relative to the range of the target variable
● Good Threshold: Moderate value relative to the range of the target variable
● Very Good Threshold: Low value relative to the range of the target variable
● Explanation: MAE is a more interpretable measure than MSE, representing the average magnitude of errors in the same units as the target variable. Lower MAE values indicate better fit.

Root Mean Squared Error (RMSE)
● Poor Threshold: High value relative to the range of the target variable
● Good Threshold: Moderate value relative to the range of the target variable
● Very Good Threshold: Low value relative to the range of the target variable
● Explanation: RMSE provides a similar interpretation as MSE but in the same units as the target variable. It is sensitive to outliers, so lower RMSE values are preferable.

R-squared (R²)
● Poor Threshold: < 0.3
● Good Threshold: 0.3 – 0.7
● Very Good Threshold: > 0.7
● Explanation: An R² below 0.3 suggests that the model explains very little of the variance in the target variable. Values between 0.3 and 0.7 are generally acceptable, while values above 0.7 indicate a strong model fit. However, be cautious of very high R² values as they may indicate overfitting.

Adjusted R-squared
● Poor Threshold: < 0.3
● Good Threshold: 0.3 – 0.7
● Very Good Threshold: > 0.7
● Explanation: Adjusted R² adjusts R² based on the number of predictors, penalizing for unnecessary complexity. Similar to R², a higher value is better, but it should be interpreted in context, especially with many predictors.

Akaike Information Criterion (AIC)
● Poor Threshold: High value relative to other models
● Good Threshold: Lower value relative to other models
● Very Good Threshold: Lowest value among competing models
● Explanation: AIC helps in model comparison, with lower values indicating a better balance between model fit and complexity. A significantly lower AIC is desirable when comparing multiple models.

Bayesian Information Criterion (BIC)
● Poor Threshold: High value relative to other models
● Good Threshold: Lower value relative to other models
● Very Good Threshold: Lowest value among competing models
● Explanation: BIC, like AIC, balances fit and complexity but with a stricter penalty for additional parameters. Lower BIC values indicate better models when comparing different models.

Rationale for Thresholds

● MSE, MAE, RMSE: These metrics directly measure the prediction error. In
practice, the specific threshold depends on the context of the model and
the range of the target variable. Generally, lower values are better, but
context matters – a "low" error in one application could be unacceptable in
another.

● R-squared (R²) and Adjusted R-squared: R² values closer to 1 indicate a


model that explains most of the variability in the target variable. However,
very high R² values could suggest overfitting, particularly if the model is too
complex. Adjusted R² corrects for this by considering the number of
predictors.

● AIC and BIC: These are relative metrics used primarily for model comparison
rather than absolute evaluation. Lower AIC or BIC values suggest a model
that better balances fit with complexity, with BIC imposing a stricter penalty
for additional predictors.

These rules of thumb are context-dependent and should be applied considering the specific application and data set characteristics. In high-stakes environments, the "good" threshold might be insufficient, while in exploratory analysis, even "poor" thresholds might provide valuable insights.

Comparison of Classification and Regression Performance Metrics

Performance metrics are fundamentally tied to the nature of the target variable in
both classification and regression models. Classification models are designed to
predict categorical outcomes, such as determining whether a transaction is
fraudulent, while regression models predict continuous outcomes, like estimating
house prices. These core differences influence the types of metrics we use to
evaluate model performance.

● Classification Models: Since the target variable is categorical, metrics like


the confusion matrix, accuracy, precision, and recall are used to assess how
well the model distinguishes between classes. A confusion matrix makes
sense here as it breaks down the model's predictions versus actual
outcomes across the different classes.

● Regression Models: In contrast, regression models predict continuous


outcomes, making metrics like Mean Squared Error (MSE) and R-squared
more appropriate. These metrics focus on the residuals (the difference
between actual and predicted values), which are crucial for understanding
the model's accuracy in predicting numerical values. Unlike classification
models, calculating a confusion matrix for regression models is not
applicable, as there are no distinct classes to compare.

Table 8.5 provides a detailed comparison of the key performance metrics used for
classification versus regression models, highlighting how each set of metrics aligns
with the unique goals of these model types.

Table 8.5: Comparison of Classification and Regression Performance Metrics

Primary Objective
● Classification Metrics: To predict categorical outcomes (e.g., fraud/no fraud, churn/no churn)
● Regression Metrics: To predict continuous outcomes (e.g., house prices, sales forecasts)

Examples of Metrics
● Classification Metrics: Accuracy, Precision, Recall, F1 Score, AUC, Gini, KS, Confusion Matrix
● Regression Metrics: MSE, MAE, RMSE, R-squared, Adjusted R-squared, AIC, BIC

Output Type
● Classification Metrics: Typically binary (0 or 1) or multi-class predictions
● Regression Metrics: Continuous numerical predictions

Evaluation Focus
● Classification Metrics: Focuses on how well the model separates different classes
● Regression Metrics: Focuses on how close the predictions are to the actual continuous values

Error Sensitivity
● Classification Metrics: Sensitive to false positives and false negatives
● Regression Metrics: Sensitive to the magnitude of prediction errors

Handling Imbalanced Data
● Classification Metrics: Techniques like oversampling, undersampling, and SMOTE are commonly used
● Regression Metrics: Metrics like MSE and RMSE do not directly account for imbalanced data, but AIC/BIC help penalize model complexity

Interpretability
● Classification Metrics: Often includes metrics that are easier to interpret for binary outcomes (e.g., precision, recall)
● Regression Metrics: Involves interpreting error measures and variance explained by the model (e.g., R-squared)

Use of Residuals
● Classification Metrics: Residuals are not typically used directly; focus is on class prediction
● Regression Metrics: Residuals (actual - predicted) are key to most metrics (e.g., MSE, RMSE)

Model Complexity Penalty
● Classification Metrics: May include metrics that penalize model complexity indirectly (e.g., AUC might drop if overfitting occurs)
● Regression Metrics: Directly includes penalties for model complexity (e.g., AIC, BIC)

Threshold Setting
● Classification Metrics: Metrics like Precision, Recall, and F1 Score are affected by the classification threshold
● Regression Metrics: Not applicable; metrics evaluate the model’s overall performance across the data

Introduction to Model Implementation

After developing and evaluating a model, the next critical step is implementing it in
a production environment. Implementation refers to the process of taking a model
from the development stage, where it is built and tested, to an operational setting,
where it can be used to make real-time predictions or decisions. This process can
vary significantly depending on whether the model is being deployed within the
same environment where it was developed or in a different environment.
Understanding these differences, as well as the tools and strategies available for
implementation, is crucial for ensuring that your model performs effectively in the
real world.

Development vs. Production Environments

Development Environment: In the development environment, models are typically


built, tested, and iterated upon. This environment is often optimized for flexibility,
allowing data scientists to experiment with different algorithms, hyperparameters,
and data preprocessing techniques. The primary focus in this phase is on creating a
model that performs well on the training and validation data. In this book, we've
utilized three prominent development environments: SAS Studio, Python Spyder,
and RStudio. Each of these environments provides the necessary tools and flexibility
to implement, test, and refine models effectively, catering to the specific needs and
preferences of data scientists working in SAS, Python, or R. These development

environments are essential for the initial stages of model creation and refinement
before moving to production deployment.

Production Environment: In contrast, the production environment is where the


model is deployed to process live data and generate predictions. This environment
must be stable, efficient, and secure, as the model will be used in real-time or batch
processing to make decisions that could impact business operations. The production
environment often has stricter performance requirements and may involve
hardware, software, or operating systems different from those in the development
environment.

Key Differences:

● Performance Optimization: Models in production need to be optimized for


speed and efficiency, whereas development models focus on accuracy and
experimentation.

● Scalability: Production environments must handle larger volumes of data


and potentially support multiple concurrent users or processes.

● Security and Compliance: Production systems must adhere to strict security


and compliance standards, particularly when dealing with sensitive data.

● Monitoring: Continuous monitoring is essential in production to ensure the


model's performance remains consistent over time.

Model Packages

Model packaging is the process of encapsulating a trained model, along with its
associated dependencies and preprocessing steps, into a format that can be easily
deployed in a production environment. This ensures that the model can be applied
to new, live data in a consistent and reliable manner. The approach to packaging
varies depending on the programming language used:

● In SAS:

o Model packaging typically involves exporting the model as a scoring


code, a SAS data set, or a catalog entry. This package contains all
the essential components, including model coefficients, decision
rules, scoring code, and any data preprocessing steps, such as data
normalization, imputation of missing values, or variable
transformations. These preprocessing steps ensure that the live
data is processed consistently with the training data before scoring.

o SAS also offers procedures like PROC SCORE or PROC PLM, which
can be used to apply the model to new data, ensuring that the
entire workflow – from preprocessing to scoring – is accurately
replicated in the production environment.

● In Python:

o Models are often packaged using libraries like joblib or pickle, which
serialize the trained model object into a file. This file can be loaded
later to score new data. The package typically includes not just the
model itself but also any custom preprocessing pipelines created
using scikit-learn or other libraries. These pipelines may include
steps like scaling, encoding categorical variables, and handling
missing data, ensuring consistency in the model's application to new
data sets.

o Additionally, tools like Pipenv or Docker can be used to package the


entire environment, including dependencies, to ensure that the
model runs exactly as it did in the development environment.

● In R:

o Model packaging often uses the saveRDS function to serialize the


model object, which can then be reloaded with readRDS. This
package includes the trained model and any preprocessing steps,
such as data transformation or feature selection, that were applied
using tools like caret or recipes. These preprocessing steps are

essential to ensure that the new data is processed identically to the


training data, maintaining the integrity of the model’s predictions.

o R also offers the option to deploy models using R scripts within a


production environment, where the model is applied to new data
along with the preprocessing steps.

Why Preprocessing Steps Are Included: Including preprocessing steps within the
model package is crucial because the model's performance depends on how the
input data is structured and transformed. Inconsistent preprocessing between the
training and production phases can lead to incorrect predictions or a significant drop
in model performance. By encapsulating these steps within the model package, data
scientists ensure that the model will process new data in exactly the same way as
the training data, preserving the integrity and accuracy of the model's predictions.

Applying to Live Data: Once the model package is deployed in a production


environment, it is applied to live data. The model first applies the same
preprocessing steps (e.g., scaling, encoding) to this new data as it did during
development. Only after these steps are applied does the model generate
predictions. This consistent approach is essential for maintaining the reliability of
the model's output in real-world applications.
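To make the packaging idea concrete, here is a minimal Python sketch assuming a scikit-learn workflow: the preprocessing (scaling) and the model are bundled in a single Pipeline, serialized with joblib, and then reloaded to score live data. The file name and the tiny training arrays are hypothetical placeholders.

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [square footage, age]
X_train = np.array([[1500, 30], [2200, 5], [1800, 15], [2600, 2]])
y_train = np.array([310000, 455000, 368000, 520000])

# Bundle scaling and the model so live data is transformed exactly
# the way the training data was
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("model", LinearRegression())])
pipeline.fit(X_train, y_train)

# Package the fitted pipeline for deployment
joblib.dump(pipeline, "house_price_model.joblib")

# In the production environment: load the package and score live data
deployed = joblib.load("house_price_model.joblib")
live_data = np.array([[2000, 10]])
print(deployed.predict(live_data))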

Implementing in Different Environments

When implementing a model in a different environment from where it was


developed, several challenges may arise, including compatibility issues, differing
system architectures, or missing dependencies. To address these challenges, models
are often packaged with all necessary dependencies and configurations to ensure
they run smoothly in the target environment.

Key Considerations

● Environment Configuration: Ensure the target environment has the


necessary software, libraries, and configurations to support the model.

● Dependency Management: Include all required dependencies in the model


package to avoid issues with missing libraries or incompatible versions.

● System Architecture: Be aware of differences in system architecture (e.g.,


operating systems, hardware) that might affect the model's performance or
compatibility.

Third-Party Implementation Tools: Docker and Kubernetes

Docker: Docker is a platform that enables developers to package applications,


including machine learning models, into containers. These containers encapsulate
the model, its dependencies, and the environment configuration, allowing it to run
consistently across different systems.

● Advantages: Docker ensures that the model runs the same way regardless
of the environment, eliminating issues related to environment
discrepancies. It also allows for easy scaling and deployment in cloud
environments.

● Use Cases: Docker is ideal for deploying models as microservices, where


each model can be run in its own container and accessed via APIs.

Kubernetes: Kubernetes is an open-source platform designed to automate the


deployment, scaling, and management of containerized applications. It is
particularly useful for managing large-scale deployments where multiple containers
(e.g., Docker containers) need to be orchestrated.

● Advantages: Kubernetes provides advanced features for managing


containerized models, including load balancing, scaling, and automatic
rollbacks. It is well-suited for complex environments where models need to
be deployed across multiple nodes or regions.

● Use Cases: Kubernetes is ideal for enterprises that require robust, scalable
solutions for deploying machine learning models in production, particularly
in cloud or hybrid environments.

Successfully implementing a model in production requires careful consideration of


the environment, packaging, and deployment strategies. By understanding the
differences between development and production environments, leveraging model
packages, and utilizing third-party tools like Docker and Kubernetes, data scientists
can ensure that their models are not only accurate but also efficient, scalable, and
robust in real-world applications.

Model Monitoring: Ensuring Long-Term Model Performance

Model monitoring is a critical phase in the lifecycle of any predictive model. Once a
model is deployed into a production environment, continuous monitoring is
essential to ensure that the model maintains its performance and reliability over
time. This section will explore what model monitoring is, why it is crucial, and the
various components involved in effectively tracking and evaluating a model’s
performance.

What Is Model Monitoring?

Model monitoring refers to the ongoing process of tracking the performance of a


deployed model to detect any degradation in accuracy or reliability. This process
involves observing various metrics, comparing them against predefined thresholds,
and taking corrective actions when necessary. The goal is to ensure that the model
continues to make accurate predictions as it encounters new data over time.

Why Is Model Monitoring Important?

The importance of model monitoring cannot be overstated. Even the most


rigorously tested models can experience performance degradation due to changes
in the underlying data distribution, external factors, or environmental shifts.
Continuous monitoring helps identify these issues early, allowing data scientists and
business stakeholders to decide whether to retrain, recalibrate, or retire a model.

Environment and Data Being Monitored



In the context of model monitoring, the environment being monitored typically


refers to the production system where the model is deployed. This environment
includes the live data, the model processes, the infrastructure supporting the
model, and any external factors that may influence the model’s performance.
Monitoring focuses on various aspects, such as data quality, prediction accuracy,
and model performance metrics.

The data being monitored includes both the input data (features) and the output
data (predictions) that the model generates. This monitoring ensures that the input
data remains consistent with the data used during model development and that the
predictions align with expected outcomes.

Overview of Model Monitoring Approaches

Model monitoring typically involves several key components that work together to
provide a comprehensive view of model performance. These components include
dashboards, model monitoring reports, threshold evaluation, and stability indices
like PSI and VSI.

Dashboards

Dashboards play a vital role in model monitoring by offering a real-time view of key
performance indicators (KPIs) and metrics. These visual tools allow stakeholders to
easily track trends, identify anomalies, and gain insights into the model’s
performance over time. Dashboards are particularly useful for providing a high-level
overview and enabling quick decision-making when issues arise.

Model Monitoring Reports

Model monitoring reports are detailed documents that provide an in-depth analysis
of the model’s performance. These reports typically include various performance
metrics such as accuracy, precision, recall, MSE, MAE, and more, depending on the
model type. Reports also highlight any significant deviations from expected
performance and may include recommendations for corrective actions. Regularly
reviewing these reports helps ensure that models continue to perform as expected
in the production environment.

Threshold Analysis for Classification Model Performance Metrics

Threshold analysis is a critical component of model monitoring, as it helps in


understanding when a model's performance is degrading to a level where action is
needed. In this section, we will use an example of a classification model's key
performance metrics – AUC, KS, and GINI – to demonstrate how thresholds can be
set and monitored over time.

Example of Threshold Analysis Using AUC, KS, and GINI

Table 8.6 below shows how threshold ranges can be defined for the AUC, KS, and
Gini metrics. These thresholds are based on the percentage deviation from a base
metric, representing the model's performance during development.

Table 8.6: Example Threshold Ranges for Model Monitoring

AUC (Area Under the Curve)
● Description: Assesses the model's ability to discriminate between classes
● Rationale for KPI: Indicates predictive accuracy and model utility
● Thresholds: Green: deviation < 10% of base value; Yellow: deviation 10% to 15% of base value; Red: deviation ≥ 25% of base value
● Frequency: Monthly

KS (Kolmogorov-Smirnov)
● Description: Measures the degree of separation between the distributions of positive and negative classes
● Rationale for KPI: Reflects the model's discriminatory power
● Thresholds: Green: deviation < 10% of base value; Yellow: deviation 10% to 15% of base value; Red: deviation ≥ 25% of base value
● Frequency: Monthly

Gini Coefficient
● Description: Quantifies the inequality in model predictions
● Rationale for KPI: Indicates the model's ability to differentiate between classes
● Thresholds: Green: deviation < 10% of base value; Yellow: deviation 10% to 15% of base value; Red: deviation ≥ 25% of base value
● Frequency: Monthly

Interpretation of Threshold Ranges

● Green: If the metric falls within the green range, it suggests that the model
is performing within acceptable limits and no immediate action is required.
For example, if the AUC drops by less than 10% from the base value, the
model's predictive accuracy is still considered robust.

● Yellow: Metrics in the yellow range indicate that the model's performance is
beginning to degrade. This range acts as a warning signal that the model
may need to be recalibrated or that the data being used in the production
environment is shifting. For instance, if the KS statistic shows a 10% to 15%
drop, it may be time to investigate potential issues before they become
critical.

● Red: The red range signals a significant drop in performance, typically 25%
or more from the base metric. This indicates that the model is no longer
reliable and requires immediate intervention, such as retraining with new
data, adjusting the thresholds, or even redeveloping the model.

Why These Thresholds Matter

Setting appropriate thresholds allows for proactive model management. By


continuously monitoring these metrics and adhering to the established thresholds,
data scientists can ensure that the model remains relevant and effective, even as
the underlying data evolves. This approach helps maintain the accuracy and
reliability of predictions, which is crucial in business-critical applications such as
fraud detection or customer churn prediction.
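A simple way to operationalize these bands is a small helper function that converts the percentage deviation from the base metric into a Green/Yellow/Red status. The Python sketch below follows the cutoffs in Table 8.6; because the table leaves the 15% to 25% band unspecified, deviations in that range are treated as Yellow here, and the AUC values shown are hypothetical.

def monitoring_status(base_value, current_value):
    # Map the deviation from the development-time metric to a status color
    deviation_pct = abs(current_value - base_value) / base_value * 100
    if deviation_pct < 10:
        return "Green"
    elif deviation_pct < 25:
        return "Yellow"
    return "Red"

base_auc = 0.85       # AUC measured at model development
current_auc = 0.76    # AUC observed in production this month
print(monitoring_status(base_auc, current_auc))   # about 10.6% deviation -> "Yellow"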

Monitoring Frequency

In the table, we've indicated that these metrics should be monitored monthly.
However, the frequency can vary depending on the business context, the model's
importance, and the volatility of the data. More critical models may require weekly

or even daily monitoring, while others may be checked less frequently. The key is to
ensure that the monitoring frequency aligns with the business needs and the
model's usage.

PSI (Population Stability Index) and VSI (Variance Stability Index)

Monitoring the stability of a model's input data and predictions over time is crucial
for ensuring its continued accuracy and reliability. Two key metrics used for this
purpose are the Population Stability Index (PSI) and the Variance Stability Index
(VSI).

● Population Stability Index (PSI):

o Purpose: The PSI measures shifts in the distribution of input


features over time. It compares the distribution of a model's
features from the training data set (or a baseline period) to the
distribution from the current scoring data set. A significant change
in PSI indicates that the characteristics of the population on which
the model is making predictions have diverged from those on which
it was trained. This shift could lead to model performance
degradation, as the model may not generalize well to the new
population.

o Interpretation:

▪ A low PSI value suggests that the population has remained


stable, meaning that the model is likely still performing well.

▪ A high PSI value indicates that the population has shifted


significantly, which might necessitate retraining or
recalibrating the model.

● Variance Stability Index (VSI):

o Purpose: The VSI assesses the variance in model predictions over


time. It measures changes in the distribution of predicted
probabilities or scores. A significant change in VSI could indicate
that the model is encountering data that differs from what it was

trained on, potentially leading to issues like increased bias or


reduced accuracy.

o Interpretation:

▪ A low VSI suggests that the model predictions are


consistent over time, indicating stable performance.

▪ A high VSI could signal that the model is encountering new


types of data, possibly resulting in performance issues.

PSI and VSI Thresholds

To effectively monitor these metrics, it’s useful to categorize them into thresholds
that signal different levels of concern. These thresholds are typically broken down
into three categories: Green, Yellow, and Red.

Table 8.7: PSI and VSI Thresholds

PSI (Population Stability Index)
● Description: Measures shifts in the distribution of input features over time
● Green Range: < 0.10
● Yellow Range: 0.10 – 0.25
● Red Range: ≥ 0.25

VSI (Variance Stability Index)
● Description: Assesses the variance in model predictions over time
● Green Range: < 0.10
● Yellow Range: 0.10 – 0.25
● Red Range: ≥ 0.25

● Green Range: Indicates that the model's input features and predictions have
remained stable. No immediate action is needed, and the model is likely
performing as expected.

● Yellow Range: Signals that there has been some change in the population or
variance of predictions. The model may still be performing adequately, but
it should be monitored more closely, and adjustments may be necessary in
the near future.

● Red Range: Suggests significant shifts in the population or variance of


predictions. Immediate action is required, which could involve retraining the
model, recalibrating thresholds, or investigating data quality issues.

Monitoring PSI and VSI regularly allows you to detect early signs of model drift,
enabling proactive adjustments to maintain model performance and reliability over
time.
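For reference, the Python sketch below shows one common way to compute PSI by hand: bin the baseline (development) values, apply the same bins to the current (production) values, and sum the weighted log-ratios of the bin proportions. The score distributions used here are simulated placeholders.

import numpy as np

def population_stability_index(baseline, current, bins=10):
    # Bin edges come from the baseline distribution
    edges = np.histogram_bin_edges(baseline, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the baseline range
    base_counts, _ = np.histogram(baseline, bins=edges)
    curr_counts, _ = np.histogram(current, bins=edges)

    # Convert counts to proportions; the floor avoids log(0) and division by zero
    base_pct = np.clip(base_counts / base_counts.sum(), 1e-6, None)
    curr_pct = np.clip(curr_counts / curr_counts.sum(), 1e-6, None)

    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(42)
baseline_scores = rng.normal(600, 50, 5000)   # scores at development time
current_scores = rng.normal(585, 55, 5000)    # scores observed in production

psi = population_stability_index(baseline_scores, current_scores)
print(f"PSI: {psi:.3f}")   # compare against the < 0.10 / 0.10-0.25 / >= 0.25 bands

The same calculation applied to the model's predicted scores, rather than an input feature, gives a comparable VSI-style check on the stability of the predictions.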

Implementing a Robust Model Monitoring Framework


By implementing a robust model monitoring framework that includes dashboards,
detailed reports, threshold analysis, and stability indices, organizations can ensure
that their predictive models continue to deliver value over time. This framework
allows models to adapt to changes in data and external conditions while maintaining
high levels of accuracy and reliability, ensuring that the models remain effective in
supporting business decisions.

Chapter 8: Performance Metrics, Implementation, and Model Monitoring – Conclusion and Transition

In this chapter, we focused on the critical aspects of evaluating and monitoring your
predictive models. By implementing various performance metrics such as AUC,
precision, recall, PSI, and VSI, you’ve learned how to assess the effectiveness and
stability of your models. Additionally, we discussed the importance of continuous
model monitoring to ensure that your models remain reliable and accurate over
time, even as data and conditions change.

Now that you understand how to evaluate and maintain your models, the next step
is to explore how to integrate these practices across different programming
environments. In many real-world projects, you may find yourself working with
multiple languages, such as Python, R, and SAS. Each language has its own strengths,
and being able to leverage the best features of each can significantly enhance your
data science workflow.

In Chapter 9, we will delve into the concept of language fusion, where you’ll learn
how to integrate Python, R, and SAS in a single project. We’ll explore how to

seamlessly transition between these languages, share data, and combine their
unique capabilities to create a more flexible and powerful modeling environment.
By mastering language fusion, you’ll be equipped to tackle complex projects that
require the strengths of multiple programming environments.

So, with your knowledge of performance metrics and model monitoring solidified,
let’s move on to Chapter 9 and discover how to harness the power of multiple
languages to take your data science projects to the next level.

Chapter 8 Summary: Performance Metrics, Implementation, and Model Monitoring

1. Introduction to Performance Metrics

o Overview: This chapter delves into the essential tools used to assess
and ensure the accuracy, reliability, and generalizability of
predictive models. It covers both classification and regression
models, discussing how their respective performance metrics are
tailored to the nature of the target variable. Understanding these
metrics is crucial for evaluating model effectiveness, diagnosing
issues, and making informed decisions.

2. Classification Metrics

o Key Metrics: The chapter explores crucial classification metrics such


as Accuracy, Precision, Recall, F1 Score, AUC, Gini Coefficient, and
KS Statistic. These metrics are instrumental in understanding a
model's ability to distinguish between different classes, especially in
scenarios like fraud detection or customer churn prediction.

o Practical Example: A synthetic data set is used to demonstrate how


these metrics are computed and interpreted across different
programming environments, including SAS, Python, and R. The
chapter emphasizes the importance of selecting the right threshold
and understanding how metrics like AUC and Gini provide insights
into a model's discriminatory power.

3. Regression Metrics

o Key Metrics: The chapter covers essential regression metrics such as


MSE, MAE, RMSE, R-squared, Adjusted R-squared, AIC, and BIC.
These metrics are crucial for evaluating how well a model's
predictions align with continuous outcomes, such as house prices or
sales forecasts.

o Practical Example: The chapter provides an example data set to


illustrate how these metrics are calculated and interpreted,
emphasizing the importance of residual analysis and the sensitivity
of these metrics to outliers and model complexity.

4. Model Implementation

o Development vs. Production Environments: The chapter


distinguishes between the flexibility of the development
environment, where models are iterated and tested, and the
constraints of the production environment, where models are
deployed and must perform consistently.

o Model Packaging: The chapter discusses how models are packaged


for deployment in SAS, Python, and R, including the encapsulation
of preprocessing steps. It also covers third-party tools like Docker
and Kubernetes that facilitate model deployment across different
environments.

5. Model Monitoring

o Importance of Monitoring: This section underscores the necessity


of continuous model monitoring to ensure long-term performance
and reliability. It introduces key concepts such as threshold
evaluation, performance metrics monitoring, and stability indices.

o Monitoring Tools: The chapter discusses the use of dashboards,


model monitoring reports, and automated alerts to track model
performance. It also explains the significance of PSI (Population
Stability Index) and VSI (Variance Stability Index) in detecting shifts
in data distributions over time.

o Threshold Analysis: A detailed example is provided on setting and


interpreting thresholds for classification metrics like AUC, KS, and
Gini, focusing on green, yellow, and red ranges to indicate model
performance health.

6. Conclusion

o Final Thoughts: The chapter ties together the concepts of


performance metrics, model implementation, and monitoring,
emphasizing their interconnectivity and the importance of a robust
evaluation framework in ensuring model success in real-world
applications.

Chapter 8 Quiz:

1. What is the primary purpose of performance metrics in evaluating


predictive models?

2. How do classification metrics differ from regression metrics in terms of their


application and interpretation?

3. Explain the significance of the confusion matrix in classification model


evaluation.

4. What does a high Precision score indicate in a classification model?

5. How is the AUC (Area Under the Curve) calculated, and what does it
represent?

6. Describe the relationship between the Gini coefficient and AUC.

7. What is the KS Statistic, and why is it important in model evaluation?

8. How does the Mean Squared Error (MSE) differ from Mean Absolute Error
(MAE) in regression analysis?

9. Why is R-squared an important metric in regression, and what does it tell


you about the model?

10. Explain the difference between R-squared and Adjusted R-squared.

11. What are AIC and BIC, and how are they used to assess model complexity in
regression models?

12. Why is it important to monitor a model after it has been deployed into
production?

13. Describe the role of model packaging in deploying models from a


development environment to production.

14. What are Docker and Kubernetes, and how do they assist in model
deployment?

15. What is PSI (Population Stability Index), and why is it critical in model
monitoring?

16. How does VSI (Variance Stability Index) contribute to model stability
monitoring?

17. What does a yellow range in threshold analysis indicate about a model’s
performance?

18. How can the lift table help in setting thresholds for classification models?

19. Why is it necessary to periodically reassess the thresholds set for model
performance metrics?

20. Explain how dashboards and monitoring reports can be used to track and
evaluate model performance over time.

Chapter 8 Cheat Sheet

Category: Performance Metrics

MSE (Mean Squared Error)
● SAS: Available in the PROC REG regression output
● Python: mean_squared_error(y_true, y_pred) from sklearn.metrics
● R: mean((actual - predicted)^2)

MAE (Mean Absolute Error)
● SAS: Use the residuals from PROC REG to manually calculate MAE
● Python: mean_absolute_error(y_true, y_pred) from sklearn.metrics
● R: mean(abs(actual - predicted))

RMSE (Root Mean Squared Error)
● SAS: Available in the PROC REG regression output
● Python: np.sqrt(mean_squared_error(y_true, y_pred)) from sklearn.metrics
● R: sqrt(mean((actual - predicted)^2))

R-squared
● SAS: Automatically calculated in the PROC REG regression output
● Python: r2_score(y_true, y_pred) from sklearn.metrics
● R: summary(lm_model)$r.squared

Adjusted R-squared
● SAS: Available in the PROC REG output; adjusts R-squared based on the number of predictors
● Python: Manually calculated using r2_score and the model degrees of freedom
● R: summary(lm_model)$adj.r.squared

AIC (Akaike Information Criterion)
● SAS: Available in the PROC REG regression output for model comparison
● Python: The aic attribute of a fitted statsmodels results object (e.g., results.aic)
● R: AIC(lm_model)

BIC (Bayesian Information Criterion)
● SAS: Available in the PROC REG regression output for model comparison
● Python: The bic attribute of a fitted statsmodels results object (e.g., results.bic)
● R: BIC(lm_model)

Category: Implementation

Model Packaging
● SAS: Export the model as a SAS data set or catalog entry using PROC EXPORT or PROC CATALOG; preprocessing steps are included in the package
● Python: Model packaging via the joblib or pickle libraries; preprocessing steps must be included in a pipeline
● R: Use the saveRDS() function to save model objects; ensure preprocessing steps are included in the model object

Deployment in Production
● SAS: Use PROC SCORE to apply the model to new data; deployed models in production are similar to the ones developed in SAS Studio
● Python: Deployed using Docker or directly on a production server; models can be loaded using joblib/pickle
● R: Use predict() on new data after loading the model with readRDS()

Third-Party Implementation
● SAS: Integrate with Docker and Kubernetes via SAS Viya for containerized deployment
● Python: Docker images and Kubernetes pods used for deploying Python models in scalable environments
● R: Use Docker with R images, or RShiny and Plumber for deploying R models as APIs

Category: Model Monitoring

PSI (Population Stability Index)
● SAS: Calculate PSI manually using PROC SQL to bin data and compare distributions
● Python: Custom PSI calculation using pandas for binning and numpy for calculations
● R: Custom PSI calculation using dplyr for binning and base R for calculations

VSI (Variance Stability Index)
● SAS: Calculate VSI by monitoring changes in input variables with PROC UNIVARIATE
● Python: Custom VSI calculation using pandas for variable distribution comparisons
● R: Custom VSI calculation using dplyr for variable distribution comparisons

Threshold Evaluation
● SAS: Evaluate thresholds with custom rules in PROC SQL or DATA steps
● Python: Use if-else conditions in pandas or custom scripts
● R: Use ifelse() or custom scripts in dplyr to apply threshold logic

Dashboards
● SAS: Use SAS Visual Analytics for dashboard creation and monitoring
● Python: Use matplotlib or dash for custom dashboards, or integrate with third-party BI tools
● R: Use shiny or flexdashboard for interactive monitoring dashboards in R

Chapter 9: Language Fusion: Integrating Python, R, and SAS

Overview

In data science and analytics, one often finds themselves navigating through a
multitude of programming environments. It's common for data scientists to work
with the tools provided by their organizations, and these tools may not always align
with their expertise. For instance, you might be a seasoned SAS expert but find
yourself in an environment that primarily relies on RStudio, or vice versa. The
question arises: Are you confined to using only the programming languages native
to a particular environment? The answer is a resounding no.

This chapter explores the liberating concept of interoperability between


programming languages and analytical environments. It's a testament to the
adaptability and versatility of modern data science. No longer are you restricted to a
single language or environment. Instead, you have the power to leverage the
strengths of different languages within the environment you're working in.

Consider this scenario: you've honed your skills in Python and SAS, yet your new
data science position requires you to work in RStudio. Do you need to scramble to
learn R from scratch? Not necessarily. This chapter will show you how to implement
your Python or SAS code directly within RStudio. Conversely, if your company's
analytical environment is SAS-centric, and you need to incorporate some R magic,
we've got you covered there as well.

Let's embark on a journey that breaks down the barriers between programming
languages and analytical environments. Discover how to seamlessly integrate
Python, R, and SAS into different platforms. By the end of this chapter, you'll
possess a valuable skill set that enables you to harness the full potential of your
preferred programming languages, regardless of the environment you find yourself
in.

Python Spyder IDE

Welcome to the Python Spyder IDE, where versatility in data science reigns
supreme. In the world of analytics, professionals often face the challenge of working
in environments that demand expertise in multiple programming languages. In this
section, we'll explore how Python Spyder allows you to seamlessly integrate both
SAS and R code into your Python workflow. This integration empowers you to
leverage the strengths of different languages, transcending the constraints of any
single ecosystem.

Python Spyder is not confined to Python alone. It serves as a bridge that enables
data scientists to harness R's statistical prowess and the data management
capabilities of SAS, all within a Python-centric environment.

As we delve into the intricacies of incorporating SAS and R code, you'll discover the
practical advantages of multi-language integration. By the end of this section, you'll
be well-equipped to navigate the dynamic landscape of data science, where
adaptability and versatility are your greatest assets.

Running R Code in Python Spyder

In Python Spyder, you can seamlessly integrate R code into your data analysis
workflow using the “rpy2” library. This allows you to harness the power of both R
and Python within a single environment. Below are the steps to execute R code in
Python Spyder:

● Step 1: Install the rpy2 Package

Before you begin, ensure that you have the rpy2 package installed. If it's not
already installed, you can do so via pip, the Python package manager, with the
following command:

Program 9.1: rpy2 Library Installation

pip install rpy2



● Step 2: Create a Python Script

Open Python Spyder and create a new Python script (a .py file).

● Step 3: Import the rpy2 Package

At the beginning of your Python script, import the rpy2 package like this:

Program 9.2: Import rpy2 Package

import rpy2.robjects as robjects

● Step 4: Define and Execute R Code

Define your R code as a string within your Python script. For example:

Program 9.3: Create R Code Within Python

# Define your R code as a string


r_code = """
x <- 5
y <- 10
result <- x + y
result
"""

# Execute the R code


result = robjects.r(r_code)

# Print the result


print(result[0])

● Step 5: Run the Python Script



Save your Python script and execute it within Python Spyder. The embedded R code
will run seamlessly, and the result will be displayed in the Python console.

Using R Libraries
If your R code requires specific R libraries like caret, you can load them within your
R code block:

Program 9.4: Using R Libraries in Python

r_code = """

# Load required R libraries


library(caret)

# Your R code here

"""

This approach allows you to leverage R's capabilities, including its libraries,
alongside Python in Python Spyder. It's particularly useful when you need to utilize
R's extensive data analysis and statistical packages in your data science projects.

Subprocess Module

If you cannot install the rpy2 library to run R code within Python, there are
alternative ways to execute R code from Python. One popular option is to use the
subprocess module in Python to run R scripts externally. Here's a step-by-step
guide on how to do this:

● Step 1: Write Your R Script

First, create an R script (e.g., my_r_script.R) that contains the R code you want
to execute. Save this script in the same directory as your Python script or provide
the full path to it. Example R script (my_r_script.R):

Program 9.5: Create Example R Script

# Example R code
result <- sum(1:10)
cat("The sum of numbers 1 to 10 is", result, "\n")

● Step 2: Execute R Script from Python

Now, you can use Python's subprocess module to run the R script. Here's an
example Python script:

Program 9.6: Python Subprocess Module Example

import subprocess

# Path to the R script (update with your actual path)
r_script_path = 'path_to_your_script/my_r_script.R'

# Run the R script using subprocess
try:
    result = subprocess.check_output(['Rscript', r_script_path],
                                     universal_newlines=True)
    print(result)
except subprocess.CalledProcessError as e:
    print(f"Error: {e}")

Make sure to replace 'path_to_your_script/my_r_script.R' with the


actual path to your R script.

● Step 3: Run the Python Script

Execute your Python script. It will call the R script and print the output to the
console. In this example, it calculates the sum of numbers 1 to 10 using R and prints
the result.

Warning: Seamless R Integration in Python Without rpy2


This approach allows you to integrate R code seamlessly into your
Python workflow without the need for the rpy2 library. Just ensure
that you have R installed on your system, as the script relies on the
Rscript command-line utility to execute R code. Remember to adapt
the Python and R scripts to your specific needs by replacing the
example code with your desired R functionality.

Jupyter Notebook R Implementation

Using the rpy2 library or the subprocess module are two common ways to run R
code within Python. However, there is another approach called "Jupyter Notebooks"
that provides an interactive environment for combining Python and R code
seamlessly. Here's how to do it:

● Step 1: Install Jupyter Notebook

If you haven't already, install Jupyter Notebook using pip:

Program 9.7: Install Jupyter Notebook

pip install notebook

● Step 2: Create a Jupyter Notebook

1. Open your command line or terminal.

2. Navigate to the directory where you want to create your Jupyter Notebook.

3. Run the following command to create a new Jupyter Notebook:

Program 9.8: Create a New Jupyter Notebook

jupyter notebook

This will open a web browser showing the Jupyter Notebook interface.

● Step 3: Create a New Notebook

1. Click the "New" button in the top-right corner and select "Python 3" to
create a new Python notebook.

2. Rename the notebook if desired.

● Step 4: Write and Execute R Code

Inside the notebook, you can write both Python and R code in separate cells. To add
a new cell, click the "+" button in the toolbar. To specify that you're writing R code
in a cell, use the `%%R` magic command at the beginning of the cell. Example:

Program 9.9: R Magic Command Example

%%R
# This is R code
result <- sum(1:10)
cat("The sum of numbers 1 to 10 in R is", result, "\n")

● Step 5: Run the Cells

To run a cell, select it and click the "Run" button in the toolbar or press Shift+Enter.
The R code's output will appear directly in the notebook.

This approach allows you to create interactive documents seamlessly combining
Python and R. It's a popular choice for data scientists working with both languages.
You can save your Jupyter Notebook for future reference and easily share it with
others.

Jupyter Notebooks support many programming languages, so this approach can be
used not only for Python and R but also for other languages like Julia and Scala.

Running SAS Code in Python Spyder

Here are three different methods to run SAS code within Python's Spyder IDE, along
with step-by-step instructions for each method:

Method 1: Using the Subprocess Module

Running SAS code in Python's Spyder IDE can be accomplished through various
methods. One straightforward approach is to utilize the Python subprocess
module. This method enables you to run SAS code as an external process directly
from your Python script. You'll need to specify the path to your SAS executable and
can pass your SAS code as a string to be executed. It's a versatile method that offers
control over your SAS environment while seamlessly integrating it into your Python
workflow. In this section, we'll walk you through the steps to set up and execute SAS
code using the subprocess module within Python's Spyder IDE.

● Step 1: Install SAS and Configure the Environment

Ensure that SAS is installed on your system. You'll need to know the path to the SAS
executable (e.g., sas.exe on Windows).

● Step 2: Open Python Spyder

Launch the Spyder IDE.

● Step 3: Create a Python Script

Create a new Python script or open an existing one.

● Step 4: Use the Subprocess Module to Run SAS Code

Use the subprocess module to run the SAS code in your Python script. Here's an
example:

Program 9.10: Subprocess Module to Run SAS Code

import subprocess

# Replace 'sas.exe' with the path to your SAS executable
sas_path = 'path_to_sas.exe'

# Your SAS code
sas_code = """
data work.example;
input x y;
datalines;
1 2
3 4
5 6
;
run;
"""

# Save the SAS code to a temporary file
with open('temp.sas', 'w') as f:
    f.write(sas_code)

# Run the SAS code using subprocess
subprocess.run([sas_path, 'temp.sas'])

● Step 5: Run the Python Script

Execute your Python script. It will run the SAS code and display the output in the
SAS console.
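If you also want to keep the SAS log and listing output next to your Python script, one possible variation is to pass SAS batch-mode options such as -sysin, -log, and -print. This is a sketch that assumes a standard SAS installation; the paths are placeholders.

import subprocess

sas_path = "path_to_sas.exe"   # update with your actual SAS executable path

subprocess.run([sas_path, "-sysin", "temp.sas",
                "-log", "temp.log", "-print", "temp.lst"],
               check=True)

with open("temp.log") as f:    # review the log produced by the batch run
    print(f.read())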

Method 2: Using SAS Kernel for Jupyter

Another effective way to run SAS code within Python's Spyder IDE is by harnessing
the power of Jupyter Notebook with the SAS Kernel. Jupyter Notebooks are
renowned for their interactivity and code-sharing capabilities. Installing the SAS
Kernel allows you to create a dedicated SAS Jupyter Notebook and execute SAS code
cells alongside your Python code. This method is ideal if you prefer a more
interactive and document-driven approach to working with SAS within the Python
ecosystem. In the following section, we'll guide you through the process of installing
the SAS Kernel and using it within Jupyter Notebooks.

● Step 1: Install Jupyter Notebook and SAS Kernel

If you haven't already, install Jupyter Notebook using pip:

Program 9.11: Install Jupyter Notebook

pip install notebook

Next, install the SAS Kernel for Jupyter:

Program 9.12: Install SAS Kernel for Jupyter

pip install sas_kernel

● Step 2: Launch Jupyter Notebook

Launch Jupyter Notebook by running:

Program 9.13: Launch Jupyter Notebook

jupyter notebook

● Step 3: Create a New Jupyter Notebook

Create a new Jupyter Notebook by selecting "New" → "SAS".

● Step 4: Write and Run SAS Code

In your new SAS Jupyter Notebook, you can write and execute SAS code cells.

Method 3: Using SASPy


For seamless integration of SAS capabilities into your Python Spyder environment,
SASPy offers an excellent solution. SASPy is a Python module that facilitates
communication with SAS from Python. By configuring a connection to your SAS
environment and utilizing the SASPy library, you can easily submit SAS code directly
from your Python script or Jupyter Notebook. This method not only streamlines
your workflow but also allows you to combine the strengths of both SAS and Python
for your data analytics tasks. In the upcoming section, we'll explore how to set up
SASPy and effectively integrate SAS functionality into your Python Spyder
environment.

SASPy is a Python module that enables you to connect to and interact with SAS from
Python. To use SASPy, follow these steps:

● Step 1: Install SASPy

Install SASPy using pip:

Program 9.14: Install SASPy Module

pip install saspy

● Step 2: Configure SASPy

Create a configuration file for SASPy. You can do this by running:

Program 9.15: Create Configuration File for SASPy

import saspy
saspy.SAScfg.create_config()

This will generate a configuration file (sascfg_personal.py) that you can edit to
specify your SAS environment settings.

● Step 3: Import and Use SASPy



In your Python script or Jupyter Notebook, import SASPy and use it to execute SAS
code. Here's an example:

Program 9.16: Import SASPy and Execute SAS Code

import saspy

# Connect to SAS using your configured profile

sas = saspy.SASsession(cfgname='default')

# Define your SAS code

sas_code = """
data work.example;
input x y;
datalines;
1 2
3 4
5 6
;
run;
"""

# Submit the SAS code

sas.submit(sas_code)
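Building on Program 9.16, the dictionary returned by sas.submit() contains the SAS log and listing output, and the session object can pull the resulting table back into pandas. A short follow-up sketch (assuming the session above connected successfully):

run_result = sas.submit(sas_code)
print(run_result["LOG"])    # the SAS log for the submitted code

# Bring WORK.EXAMPLE back into Python as a pandas DataFrame
example_df = sas.sasdata("example", libref="work").to_df()
print(example_df.head())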

These methods provide various ways to integrate SAS code into the Python Spyder
environment, depending on your preferences and requirements.

RStudio IDE: A Hub for Multilingual Data Science

Welcome to RStudio, the versatile integrated development environment (IDE) that
empowers data scientists to seamlessly blend the power of Python, SAS, and R
within a single cohesive workspace. In the realm of data analytics, professionals
often find themselves working across diverse programming languages to tackle
multifaceted challenges. In this section, we'll embark on a journey through RStudio,
exploring how it provides an optimal environment for integrating both Python and
SAS code into your data science workflow.

RStudio goes beyond being a specialized IDE for a single programming language. It
serves as a dynamic hub that unlocks the potential of cross-language synergy.
Whether you're well-versed in Python, proficient in SAS, or a seasoned R enthusiast,
RStudio offers a welcoming space to unite these languages, enabling you to harness
their unique strengths effectively.

As we navigate the intricacies of implementing Python and SAS within RStudio,
you'll witness the real-world advantages of multilingual data science. By the end of
this section, you'll be equipped with the knowledge and skills needed to thrive in
the diverse landscape of modern data analysis, where adaptability and integration
are your keys to success.

Implementing Python in RStudio

RStudio, renowned for its prowess in the R programming world, is not just limited to
R. In this section, we explore the diverse world of data science by bringing Python
into the mix. RStudio's flexibility extends to Python, allowing data scientists to
integrate Python code into their workflow seamlessly. Whether you're a Python
enthusiast or looking to combine the strengths of both Python and R, RStudio
empowers you to do so effortlessly.

As we journey through this section, you'll discover the art of implementing Python
in RStudio. This integration opens doors to a wide array of data science libraries,
tools, and capabilities. From data analysis to machine learning and beyond, you'll
find that RStudio is your canvas for creating data-driven masterpieces. By the end of

this section, you'll be equipped to harness the full potential of Python within the
RStudio environment.

Method 1: Using Reticulate for Python Integration

One of the most straightforward ways to work with Python in RStudio is by utilizing
the Reticulate package. Reticulate allows you to create Python scripts within
RStudio, execute Python code chunks, and seamlessly switch between R and Python.
It's particularly useful when you want to combine the strengths of both languages in
a single RMarkdown document.

By installing and configuring Reticulate, you can write Python code directly in
RStudio and harness Python libraries, making it an excellent choice for projects that
require collaboration between R and Python developers.

● Step 1: Load the Reticulate Package

Program 9.17: Load the Reticulate Package

library(reticulate)

● Step 2: Create a Python Environment

Program 9.18: Create Python Environment

# Specify your Python version (e.g., "python3" or "anaconda3")
use_python("python3")

● Step 3: Run Python Code in RStudio

Program 9.19: Run Python Code String

# You can now use Python code in RStudio, for example:
py_run_string("print('Hello from Python!')")

Method 2: Running Python Scripts

For those who prefer a more traditional Python development environment, RStudio
also allows you to run Python scripts directly. By clicking on the "Run" button or
using keyboard shortcuts, you can execute Python code in RStudio's console or
terminal. This method is perfect if you want to maintain your Python code in
separate script files while leveraging RStudio's interface for code execution and
result visualization. Whether you're working on data analysis, machine learning, or
any Python-centric task, this approach provides familiarity and flexibility.

● Step 1: Create an R Markdown Document

# Go to File > New File > R Markdown...
# Choose "HTML" or another output format and click "OK"

● Step 2: Insert Python Code Chunks

# In your R Markdown document, use ```{python} ... ``` to insert Python code chunks.

● Step 3: Knit the Document

# Click the "Knit" button to run both R and Python code and
generate the report.

Method 3: Jupyter Notebooks Integration

RStudio's compatibility with Jupyter Notebooks offers yet another powerful way to
work with Python. By creating a new R Markdown document with a Reticulate
engine, you can seamlessly include Jupyter code chunks. This integration allows you
to combine narrative text, R, and Python code in a single dynamic document. It's an
excellent choice for projects that require detailed explanations and interactive code
execution. In this method, you can benefit from Jupyter's extensive ecosystem while
taking advantage of RStudio's authoring capabilities. In the following sections, we'll
delve into the steps for implementing Python using each of these methods within

the RStudio environment, ensuring you have the flexibility to choose the approach
that best suits your needs.
● Step 1: Create a Python Script

Create a .py Python script (e.g., my_script.py) with your Python code. Save it in
your working directory.
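For illustration, my_script.py could be as simple as the following (hypothetical content; any Python code works here):

# my_script.py - hypothetical example content
import statistics

values = [2, 4, 6, 8, 10]
print("Mean of values:", statistics.mean(values))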

● Step 2: Run the Python Script from RStudio

Program 9.20: Run Python Script Using System()

# Use system() or system2() to run the Python script:
system("python my_script.py")

Example:

Here's a simple example of Method 1 (Using R's reticulate Package):

Program 9.21: RStudio Reticulate Example

# Load the reticulate package
library(reticulate)

# Create a Python environment
use_python("python3")

# Run Python code
py_run_string("print('Hello from Python!')")

When you run this code in RStudio, it will execute Python code within the RStudio
environment and print "Hello from Python!". Repeat the above steps for each
method to effectively use both SAS and Python in your RStudio environment.

Implementing SAS in RStudio

RStudio, synonymous with the R programming world, has a knack for embracing
diversity in the data science arena. In this section, we delve into the realm of SAS
programming, demonstrating how RStudio extends its arms to welcome the SAS
language. This integration combines the analytical prowess of SAS with the data
manipulation and visualization capabilities of RStudio. Whether you're a seasoned
SAS user or seeking the best of both worlds, you're in the right place.

As we embark on this exploration, you'll discover three distinct methods for running
SAS within RStudio. Each method offers a unique perspective on leveraging the
strengths of SAS while harnessing the versatile environment of RStudio. By the end
of this section, you'll have a toolbox of techniques to blend SAS into your RStudio
workflow seamlessly.

Now, let's outline the three methods with step-by-step instructions and
explanations for each.

Method 1: Using the RSASSA Package

Introducing a bridge between SAS and RStudio: the RSASSA package. This method
offers an intuitive way to execute SAS code within RStudio. RSASSA facilitates
interaction between R and SAS, allowing you to send your SAS code to a remote SAS
server and retrieve the results seamlessly. If you're accustomed to the SAS
environment but want to explore the visualization and reporting capabilities of
RStudio, this method provides the best of both worlds.

Step-by-Step Instructions:

1. Install RSASSA Package: In RStudio, install the RSASSA package using
install.packages("RSASSA").

2. Load RSASSA Package: Load the package with: library(RSASSA).

3. Connect to SAS Server: Establish a connection to your SAS server using
sas_session().

4. Execute SAS Code: Write your SAS code within the sas_submit() function
and run it to send it to the SAS server.

5. Retrieve Results: Use the sas_submit() function to retrieve results, such
as data sets or logs, from the SAS server.

Method 2: Using RMarkdown with SAS Chunk

Here's an intriguing method for incorporating SAS into your RStudio environment:
RMarkdown with SAS chunks. This approach offers a dynamic way to combine the
powers of R and SAS within a single RMarkdown document. Whether you're focused
on creating dynamic reports, presentations, or documents, RMarkdown allows you
to interweave R and SAS code seamlessly.

Step-by-Step Instructions:

1. Create an RMarkdown Document: Start by creating an RMarkdown document
in RStudio.

2. Specify SAS Chunk: Within your RMarkdown document, define a SAS chunk by
opening it with three backticks followed by {r, engine='SAS'} and closing it with
three backticks.

Here's an example of how to specify a SAS chunk within an RMarkdown document:

Program 9.22: Example SAS Chunk Technique

---
title: "RMarkdown with SAS Chunk Example"
output: html_document
---

This is a simple RMarkdown document that incorporates SAS code.

```{r, engine='SAS'}
/* This is a SAS chunk */
data work.example;
    set sashelp.class;
run;

proc print data=work.example;
run;
```

The document starts with metadata, specifying the title and output format. The
chunk delimiters (three backticks with {r, engine='SAS'}) define a SAS chunk within
the document. You can write your SAS code within this chunk, and it will be
executed when you knit the RMarkdown document. In this example, we create a
new data set using the SAS DATA step and then print it using a PROC PRINT step.

When you knit this document in RStudio, it will execute the embedded SAS code
and generate an HTML document that includes both the code and the results.

3. Write SAS Code: Inside the SAS chunk, write your SAS code as you normally
would.

4. Knit Document: Knit your RMarkdown document to render the output,
which will include both R and SAS code execution.

Method 3: Using System2 to Call SAS

If you prefer to keep things straightforward and execute SAS scripts directly, RStudio
provides a method akin to calling SAS from the command line. By utilizing the
system2 function, you can invoke the SAS executable and execute your SAS code
files within RStudio. This method is ideal for those who wish to maintain their SAS
scripts but want the convenience of the RStudio environment.

Step-by-Step Instructions:

1. Write SAS Script: Prepare your SAS script in a separate .sas file, just like
you would in a typical SAS environment.

2. Use the System2 Function: In your RStudio script, utilize the system2
function to call the SAS executable and specify the path to your SAS script
file.

Here's an example of using the system2 function in RStudio to call a SAS program:

Program 9.23: Example System2() Function to Call SAS Program

# Specify the path to your SAS executable (modify as needed)
sas_path <- "C:/Program Files/SASHome/SASFoundation/9.4/sas.exe"

# Specify the location of your SAS program file
sas_program <- "C:/Path/To/Your/SAS/Program.sas"

# Run the SAS program using system2
system2(command = sas_path, args = c(sas_program))

In this example:

1. sas_path is set to the path of your SAS executable. Be sure to adjust this
path to match your SAS installation directory.

2. sas_program is set to the location of your SAS program file (.sas file).

3. The system2 function is used to run the SAS program by specifying the
command (SAS executable path) and args (arguments, in this case, the SAS
program path).

When you run this R script in RStudio, it will call SAS and execute the specified SAS
program. Ensure that you replace the paths with the actual paths to your SAS
executable and program file.

3. Run RStudio Script: Run your RStudio script containing the system2
function. This will execute your SAS script, producing the desired output.

These three methods offer flexibility in incorporating SAS into RStudio, catering to
various preferences and requirements. Whether you use the RSASSA package for
tight integration, RMarkdown for dynamic documents, or system2 for familiar script
execution, RStudio's versatility welcomes SAS into its environment with open arms.

SAS Studio: Where the Power of Python and R Meets SAS Strength

In the ever-evolving landscape of data science, adaptability and versatility are
paramount. Professionals are frequently tasked with navigating different
programming languages, each with its unique strengths. Enter SAS Studio, a
dynamic environment that seamlessly marries the robust capabilities of SAS with
the agility of Python and R. This section is your gateway to exploring the unmatched
synergy of these three powerful languages within SAS Studio. It's not just about
coexistence; it's about achieving unparalleled efficiency and effectiveness in your
data-driven endeavors.

SAS Studio stands as a testament to the modern data scientist's dream – an
environment that recognizes the value of Python's extensive libraries, R's statistical
prowess, and the legendary analytics of SAS. Here, you can harness the full
spectrum of data science tools with unparalleled ease. The possibilities are limitless,
from data wrangling with Pandas and data manipulation with dplyr to advanced
analytics with SAS. Join us on a journey demonstrating how SAS Studio, as the
conduit for Python, R, and SAS integration, provides an edge in model development,
data analysis, and reporting.

In the following pages, we'll unveil the methods to incorporate Python and R into
SAS Studio, equipping you with the tools to embark on data science projects with
unmatched flexibility. Whether you're a SAS aficionado, a Python enthusiast, an R
expert, or somewhere in between, SAS Studio's unique capability to synergize these
three languages will redefine your data science experience. So, let's dive in and
explore the symbiotic relationship between Python, R, and SAS within this
remarkable environment.

Balancing Two Worlds: SAS Studio and SAS Enterprise Guide

Throughout this book, we've explored the dynamic capabilities of SAS Studio, an
environment that bridges the realms of SAS, Python, and R, facilitating a seamless
synergy between these languages. However, we also understand the diverse
landscape of data science, where SAS Enterprise Guide remains a formidable choice
for many professionals. As we venture into this section, we recognize that readers

may employ both SAS Studio and SAS Enterprise Guide, and we're here to empower
you to integrate Python and R code in either environment.

SAS Enterprise Guide has long been a staple for those seeking the extensive
analytical capabilities and structured workflows it offers. It's a testament to the
versatility of SAS that it caters to different preferences, enabling users to harness
the power of SAS, Python, and R interchangeably. In the pages ahead, we will delve
into the nuances of SAS Enterprise Guide, demonstrating how you can achieve
similar flexibility and efficiency in implementing Python and R code, even if you
primarily work with this robust environment.

This section is your guide to transcending the boundaries of your chosen SAS
environment, whether it's SAS Studio or SAS Enterprise Guide. As we explore the
methods for seamlessly incorporating Python and R, you'll find that the power of
data science knows no constraints. Let's embark on a journey that empowers you to
excel in your data-driven endeavors, no matter which SAS environment you prefer.

Implementation Differences between SAS Studio and SAS Enterprise Guide

SAS Studio and SAS Enterprise Guide are both SAS software products, but they serve
somewhat different purposes and have distinct features. Here are the key
differences between them:

1. Purpose:

● SAS Studio: It's a web-based interface for SAS designed to be lightweight
and user-friendly. It's often used for tasks like data preparation, analysis,
and basic reporting.

● SAS Enterprise Guide: This is a more comprehensive SAS client tool that
provides a rich, point-and-click interface for data management, advanced
analytics, and reporting. It's used for more complex analytics and
reporting tasks.

2. User Interface:

● SAS Studio: Has a web-based, notebook-like interface similar to Jupyter
Notebooks. It's known for its simplicity and ease of use.

● SAS Enterprise Guide: Has a more traditional Windows-based interface
with a menu and task-oriented design.

3. Coding Options:

● SAS Studio: Supports traditional SAS programming (DATA steps,
procedures) and open-source languages like Python, R, and Lua. You can
easily mix SAS and open-source code.

● SAS Enterprise Guide: Primarily focuses on SAS programming. While it
can execute external programs like Python and R, it's more SAS-centric.

4. Deployment:

● SAS Studio: Often used in SAS Viya environments (the cloud or on-premises).
It can connect to both SAS 9 and SAS Viya.

● SAS Enterprise Guide: Primarily used with SAS 9.

5. Advanced Analytics:

● SAS Studio: Supports basic machine learning and statistics with SAS
procedures and open-source integration for more advanced analytics.

● SAS Enterprise Guide: Offers more comprehensive capabilities for
advanced analytics and reporting.

6. Customization:

● SAS Studio: Provides options for custom tasks and custom code snippets.

● SAS Enterprise Guide: Offers more extensive customization and
automation options.

Regarding cross-implementation (using Python and R in SAS):

● Both SAS Studio and SAS Enterprise Guide allow you to incorporate
Python and R code into your SAS workflows.

● The process is similar in both environments. You can use code nodes or
code windows to write and execute Python and R code.

So, in general, you can implement Python and R code in either SAS Studio or SAS
Enterprise Guide. The choice between the two would depend on your specific
needs, with SAS Studio being more lightweight and user-friendly and SAS Enterprise
Guide offering more extensive features for advanced SAS analytics and reporting.

Here are three methods to implement Python code in SAS Studio, along with
explanations and step-by-step procedures:

Method 1: Using the Python Code Node in SAS Studio

One of the most straightforward methods to run Python in SAS Studio is to utilize
the Python Code node. This node provides a dedicated environment for writing,
executing, and visualizing Python code. It's convenient if you want to integrate
Python scripts into a larger SAS workflow.

Step-by-Step Procedure:

1. Open a New Project: Launch SAS Studio and open your project or create a
new one.

2. Add a Python Code Node: Within your project, select "Tasks and Utilities" in
the navigation pane on the left. Under "Tasks," expand "Programs" and
select "Python Code."

3. Write Your Python Code: In the Python Code node, you can write, edit, or
paste your Python script.

4. Execute the Code: Click the "Run" button (a green triangle icon) to execute
your Python code.

5. View Results: Any output or plots your Python script generates will be
displayed in the results pane.

Note: SAS Enterprise Guide may not provide the Python Code node. However, you
can still achieve this functionality by using Method 2.

Method 2: Using the X Python Statement

This method enables you to embed Python code directly within a SAS program using
the X Python statement. It's a versatile approach that gives you more control over
the integration of Python and SAS.

Step-by-Step Procedure:

1. Open a SAS Program: Launch SAS Studio or SAS Enterprise Guide and open
a SAS program.

2. Use the X Python Statement: Insert the X Python statement in your SAS
program, followed by your Python code within single quotation marks. For
example:

DATA _null_;
x 'python your_python_code_here';
RUN;

Let's say you want to calculate the sum of two numbers using Python code
within a SAS program. Here's how you can do it:

DATA _null_;
x 'python
num1 = 5
num2 = 7
result = num1 + num2
print(result)
';
RUN;

In this example:

● DATA _null_; initiates a SAS DATA step without actually creating a data set.

● x 'python ...'; is where you insert your Python code. Inside the Python code
block, we've defined two variables, `num1` and `num2`, and calculated their
sum in the result variable. We then print the result.

When you run this SAS program, it will execute the embedded Python code,
and you'll see the result in the SAS log. In this case, the SAS log will display:

The SAS System

result=12
NOTE: DATA statement used (Total process time):
      real time 0.01 seconds

3. Run the SAS Program: Execute your SAS program as you normally would.

4. Review Output: The SAS log will include any output generated by your
Python code.

Method 3: Using Jupyter Notebooks in SAS

SAS provides an integration with Jupyter Notebooks, allowing you to create and run
Python code in a Jupyter environment. This method provides a rich interactive
Python experience within your SAS environment.

Using Jupyter Notebooks in SAS is typically available in SAS Studio. SAS Enterprise
Guide, on the other hand, doesn't natively support Jupyter notebooks. SAS Studio
provides a web-based interface with integrated support for Jupyter notebooks,
making it easier to incorporate Python and R code alongside SAS in a notebook
format. This feature enhances interactivity and flexibility in data analysis and model
development.

SAS Enterprise Guide supports various scripting languages and can interact with
external Jupyter environments, but it doesn't have built-in support for Jupyter
notebooks.

Step-by-Step Procedure:

1. Access Jupyter Notebooks: In SAS Studio, click "Utilities" in the navigation
pane on the left. Then, select "Jupyter Notebook."

2. Create a New Notebook: Click "New" to create a new Jupyter Notebook.
You can choose either a Python 2 or Python 3 kernel.

3. Write and Execute Python Code: In the notebook, you can write, execute,
and document your Python code just like in a standalone Jupyter
environment.

4. Save Your Work: Save your Jupyter Notebook in your SAS Studio project.

5. Access Notebook Results: Any output or visualizations produced by your
Python code will be visible within the notebook.

These methods enable you to harness the power of Python alongside SAS in your
data science projects, offering flexibility and versatility for your analytics tasks.

Chapter 9: Language Fusion – Conclusion and Transition to Your Data Science Journey

In this final chapter, we explored the powerful concept of language fusion, learning
how to integrate Python, R, and SAS to harness the unique strengths of each
environment. By combining these tools, you’ve gained the flexibility and power to
tackle complex data science challenges, leveraging the best of what each language
has to offer.

Now that you’ve completed this journey through the essentials of data science and
machine learning, you’re no longer just a student of these techniques – you’re ready
to step into the role of a data scientist. You have built a robust foundation, from
understanding the core programming environments, mastering data preparation,
and creating effective modeling pipelines, to implementing advanced predictive
models and ensuring their performance over time. You’ve also learned how to

integrate multiple programming languages, giving you the versatility to apply your
skills across any industry and solve a wide range of data problems.

As you move forward, remember that data science is as much about continuous
learning as it is about applying the knowledge you’ve gained. The tools and
techniques you’ve mastered in this book are just the beginning. Whether you’re
working in finance, healthcare, marketing, or any other field, the principles and
practices you’ve learned here will empower you to extract meaningful insights from
data and make informed, impactful decisions.

So, as you close this final chapter, know that you are equipped with the knowledge
and tools to tackle any data challenge that comes your way. Your journey as a data
scientist begins now, and with the skills you’ve developed, there’s no limit to what
you can achieve.

Chapter 9 Summary: Language Fusion

1. Introduction to Language Fusion

● Overview: This chapter delves into the concept of language fusion,
demonstrating how data scientists can integrate Python, R, and SAS within
different environments. The primary focus is on leveraging the strengths of
each language to create a versatile and adaptable workflow, overcoming the
limitations of working within a single environment.

2. Python Spyder IDE

● Integration Capabilities: Python Spyder is presented as a versatile
environment where Python can be used alongside R and SAS. The chapter
outlines the steps for incorporating R code into Python using the rpy2
library, highlighting how this integration allows for seamless execution of R
functions within Python scripts.

● Running SAS Code: The chapter also covers how to run SAS code within
Python Spyder, using methods like the subprocess module and SASPy. This
section emphasizes the practical advantages of integrating the data
management capabilities in SAS with Python's flexibility.

3. RStudio IDE

● Implementing Python in RStudio: RStudio is explored as an IDE that
supports the integration of Python through the Reticulate package, allowing
users to run Python code within R scripts or RMarkdown documents. This
section emphasizes the ability to combine Python's machine learning
libraries with R's statistical tools.

● Running SAS in RStudio: The chapter explains how to incorporate SAS code
into RStudio using methods such as the RSASSA package and RMarkdown
with SAS chunks. These techniques enable data scientists to harness the
powerful analytics of SAS within the RStudio environment.

4. SAS Studio

● Multilingual Integration: SAS Studio is introduced as an environment where
Python, R, and SAS can be used together. The chapter covers how to run
Python and R code within SAS Studio using various methods, including the
Python Code node, the X Python statement, and Jupyter Notebooks.

● Balancing SAS Studio and SAS Enterprise Guide: The chapter provides a
comparison between SAS Studio and SAS Enterprise Guide, focusing on their
capabilities for integrating Python and R. It offers guidance on choosing the
right environment based on specific project needs.

5. Practical Applications of Language Fusion

● Real-World Scenarios: The chapter concludes with examples of how
language fusion can be applied in real-world data science projects. It
discusses scenarios where combining Python's data manipulation, R's
statistical analysis, and the data management of SAS can lead to more
efficient and effective workflows.

Chapter 9 Quiz

Questions:
1. What is the primary advantage of integrating Python, R, and SAS within a
single environment?

2. How does the rpy2 library facilitate the use of R code in Python Spyder?

3. Describe how the subprocess module can be used to run SAS code in
Python.

4. What is the role of the Reticulate package in RStudio?

5. How can SAS code be integrated into an RMarkdown document in RStudio?

6. Explain how the RSASSA package allows for interaction between RStudio
and SAS.

7. What are the benefits of using Jupyter Notebooks within SAS Studio?

8. How does the X Python statement work in SAS Studio?

9. What are the key differences between SAS Studio and SAS Enterprise Guide
in terms of language integration?

10. Why might a data scientist choose to run Python code within RStudio?

11. What steps are necessary to run R code within Python Spyder?

12. How does using the SASPy module enhance Python's capabilities in handling
SAS code?

13. In what situations would it be beneficial to use SAS Studio over SAS
Enterprise Guide?

14. How can language fusion improve the efficiency of data science workflows?

15. Describe a scenario where integrating Python, R, and SAS would be
particularly advantageous.

16. What are the challenges of maintaining a multilingual data science
environment?

17. How does RStudio's integration of Python and SAS differ from that of
Python Spyder?

18. What are the potential drawbacks of using multiple programming languages
within a single project?

19. How can Jupyter Notebooks be used to create a dynamic document that
includes Python and R code?

20. What considerations should be taken into account when choosing an
environment for language fusion?

Chapter 9 Cheat Sheet

Each category below compares the three environments: SAS Studio, Python Spyder, and RStudio.

Running R Code
● SAS Studio: Use RSUBMIT to submit R code to a remote R session; execute R scripts via the SYSTEM command.
● Python Spyder: Use the rpy2 package to run R code within Python scripts; use subprocess to call external R scripts; use Jupyter Notebooks with an R kernel for R code execution.
● RStudio: Write R code directly in R scripts or RMarkdown; use system() to call R scripts; use the Reticulate package for Python integration.

Running Python Code
● SAS Studio: Use the Python Code node; execute Python scripts via the X command or SUBMITPYTHON.
● Python Spyder: Native Python environment for script execution; integrate SAS via SASPy or the subprocess module.
● RStudio: Use the Reticulate package to run Python code; execute Python code in RMarkdown or an R script with python chunks; use system() to call Python scripts.

Running SAS Code
● SAS Studio: Native environment for SAS; use Python integration with SASPy or SUBMIT statements.
● Python Spyder: Execute SAS scripts using the SASPy module; use subprocess to call external SAS scripts.
● RStudio: Use the RSASSA package to connect to SAS servers; execute SAS code within RMarkdown using SAS chunks; use system2() to call SAS scripts.

Data Integration
● SAS Studio: Import and export data between SAS and R/Python using PROC IMPORT/EXPORT and PROC SQL.
● Python Spyder: Use Pandas for data manipulation; exchange data with R using rpy2 and with SAS using SASPy.
● RStudio: Use data.table or dplyr for data manipulation; exchange data with Python via Reticulate and with SAS via RSASSA.

Installation
● SAS Studio: Typically web-based, with built-in support for Python and R.
● Python Spyder: Install rpy2 for R integration; install SASPy for SAS integration.
● RStudio: Install Reticulate for Python integration; install RSASSA for SAS integration; set up RMarkdown for multilingual documents.

Best Practices
● SAS Studio: Leverage the full power of SAS with integrated Python and R for specialized tasks; use Jupyter Notebooks for more interactive and collaborative workflows.
● Python Spyder: Use rpy2 for seamless integration of R code within Python; utilize SASPy for running SAS code in Python.
● RStudio: Use Reticulate to seamlessly integrate Python code; use RMarkdown for combining R, Python, and SAS in reports or analysis.

Appendix A: Further Learning Resources

Introduction

As you progress in your data science journey, continuous learning is essential to stay
updated with the latest tools, techniques, and industry trends. This appendix
provides a curated list of resources, including books, websites, YouTube channels,
subreddits, and podcasts, to help you deepen your knowledge and skills. Whether
you are just starting or looking to advance your expertise, these resources offer
valuable insights and practical guidance across various aspects of data science.

Books

Each entry lists the book title, author(s), and a brief description.

● End-to-End Data Science with SAS, by James Gearheart: A comprehensive guide to data science with hands-on SAS code, written by the author of this book.
● Introduction to Statistical Learning (ISLR), by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani: A beginner-friendly introduction to statistical learning methods, often recommended as a starting point in machine learning.
● The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman: A foundational text in machine learning, covering key algorithms and methods in statistical learning.
● Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, by Aurélien Géron: A practical guide to machine learning with Python, covering key libraries like Scikit-Learn and TensorFlow.
● The Data Science Handbook, by Carl Shan, Henry Wang, William Chen, and Max Song: A comprehensive overview of the data science landscape with interviews from leading data scientists.
● Python for Data Analysis, by Wes McKinney: Essential for learning pandas, a powerful Python library for data manipulation and analysis.
● R for Data Science, by Hadley Wickham and Garrett Grolemund: A complete guide to data import, cleaning, visualization, and modeling with RStudio.
● SAS Essentials: Mastering SAS for Data Analytics, by Alan C. Elliott and Wayne A. Woodward: A solid foundation in SAS, focusing on data analytics and the use of SAS Studio.
● Practical Statistics for Data Scientists, by Peter Bruce and Andrew Bruce: Covers practical statistical methods with examples in both R and Python.
● Deep Learning with Python, by François Chollet: Introduction to deep learning with Keras and TensorFlow, written by the creator of Keras.
● Data Science from Scratch, by Joel Grus: Teaches fundamental concepts of data science with Python, building up from the basics.
● Artificial Intelligence: A Guide for Thinking Humans, by Melanie Mitchell: Explores the concepts and implications of AI, making it accessible to a general audience.

Websites

Each entry lists the website name, URL, and a brief description.

● Kaggle (www.kaggle.com): A platform for data science competitions, data sets, and community-driven learning.
● Towards Data Science (towardsdatascience.com): Medium publication with articles on data science, machine learning, and programming.
● Python.org (www.python.org): Official Python website with comprehensive documentation and resources for Python programming.
● RStudio Blog (www.rstudio.com/blog): Insights, tutorials, and updates from the creators of RStudio, focusing on R and data science.
● SAS Support (support.sas.com): Official SAS support site with extensive documentation, tutorials, and user guides for SAS Studio.
● DataCamp (www.datacamp.com): Interactive courses on data science, statistics, and programming, including Python, R, and SQL.
● Coursera Data Science (www.coursera.org/browse/data-science): Online courses and specializations in data science, offered by top universities and companies.
● Analytics Vidhya (www.analyticsvidhya.com): Community-based platform offering tutorials, discussions, and resources for data science enthusiasts.
● Data Science Central (www.datasciencecentral.com): A community for data science professionals, featuring articles, webinars, and forums.
● The Pudding (www.pudding.cool): Explores complex data stories through engaging visual essays, blending data science with journalism.

YouTube Channels

Each entry lists the channel name, URL, and a brief description.

● StatQuest with Josh Starmer (www.youtube.com/user/joshstarmer): Clear and accessible explanations of statistical concepts, primarily using R and Python.
● Data School (www.youtube.com/user/dataschool): Tutorials on data science topics, especially focused on Python and R, including pandas and scikit-learn.
● SAS Users (www.youtube.com/user/SASsoftware): Official SAS channel offering tutorials, tips, and user stories focused on SAS software.
● Corey Schafer (www.youtube.com/user/schafer5): Detailed Python tutorials, covering a wide range of topics including Python programming and data science libraries.
● Tech With Tim (https://www.youtube.com/c/techwithtim): Python programming tutorials, including data science-focused content, ideal for users of Python Spyder.
● 3Blue1Brown (www.youtube.com/c/3blue1brown): Visual explanations of mathematical concepts, which are foundational for understanding algorithms in data science.
● Sentdex (www.youtube.com/user/sentdex): Python tutorials with a focus on machine learning, data analysis, and web development.
● Simplilearn (www.youtube.com/user/Simplilearn): A broad range of tutorials on data science, AI, and programming, ideal for beginners.
● Coding Train (www.youtube.com/user/shiffman): Creative coding tutorials in JavaScript and Python, focusing on visualizations and simulations.
● freeCodeCamp (www.youtube.com/c/Freecodecamp): Offers in-depth, free programming tutorials, including full courses on Python, machine learning, and data science.

Subreddits

Each entry lists the subreddit, URL, and a brief description.

● r/datascience (www.reddit.com/r/datascience): A community focused on data science discussions, sharing resources, and solving problems. Useful for all programming languages.
● r/learnpython (www.reddit.com/r/learnpython): Ideal for beginners and intermediate users of Python, providing help, resources, and tutorials.
● r/SAS (www.reddit.com/r/SAS): A subreddit dedicated to discussions about SAS software, including tips, tricks, and troubleshooting.
● r/rstats (www.reddit.com/r/rstats): Focused on R programming, sharing resources, advice, and discussions related to data science with R.
● r/programming (www.reddit.com/r/programming): A general programming subreddit covering a wide range of topics, including Python, R, and SAS.
● r/MachineLearning (www.reddit.com/r/MachineLearning): A community centered on machine learning research, news, and project discussions.
● r/AskStatistics (www.reddit.com/r/AskStatistics): A place to ask and answer questions about statistics, helpful for understanding the statistical foundations of data science.
● r/DataScienceJobs (www.reddit.com/r/DataScienceJobs): A subreddit focused on job opportunities, career advice, and discussions related to data science.
● r/ArtificialIntelligence (www.reddit.com/r/ArtificialIntelligence): Discusses AI research, tools, and developments, intersecting with many aspects of data science.
● r/learnmachinelearning (www.reddit.com/r/learnmachinelearning): Geared toward learning machine learning techniques and sharing resources for beginners and practitioners alike.

Podcasts

Each entry lists the podcast name, where it is available, and a brief description.

● Data Skeptic (Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Covers a broad range of data science topics, with episodes on programming languages, machine learning, and practical applications.
● Linear Digressions (Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Focuses on machine learning and data science, discussing real-world examples and explaining complex concepts in an accessible way.
● SuperDataScience Podcast (Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Hosted by data science experts, this podcast covers a wide range of topics, including programming in Python and R.
● Not So Standard Deviations (Apple Podcasts, Spotify, and Stitcher): Hosted by data science experts Hilary Parker and Roger D. Peng, discussing data analysis, data science tools, and R programming.
● SAS Users Podcast (SAS.com, Apple Podcasts, and Google Podcasts): A podcast from SAS covering various topics, including best practices, tips, and interviews with SAS users.
● DataFramed (DataCamp, Apple Podcasts, Spotify, and Google Podcasts): A podcast from DataCamp exploring data science, its applications, and its impact on industries.
● AI Alignment Podcast (Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Discusses the long-term impact of AI, with insights relevant to the broader data science field.
● The TWIML AI Podcast (Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Focuses on machine learning, AI, and deep learning, with interviews from industry experts.
● Becoming a Data Scientist (Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Follows the journey of learning data science, with insights and advice for beginners.
● Data Stories (Apple Podcasts, Spotify, and Google Podcasts): A podcast about data visualization, how data is presented, and the stories it tells.

Appendix B: Open-Source Data Sources

Introduction

Access to high-quality data is crucial for developing and honing your data science
skills. This appendix provides a comprehensive list of open-source data sources
where you can find data sets for a wide range of applications, from machine
learning projects to statistical analysis. These resources offer diverse data sets that
can be used to build, test, and refine your models, providing ample opportunities to
apply your knowledge in real-world scenarios.

Each entry lists the data source, URL, and a brief description.

● Kaggle Datasets (www.kaggle.com/datasets): A vast collection of data sets across various domains, ideal for machine learning and data science projects.
● UCI Machine Learning Repository (www.archive.ics.uci.edu/ml/index.php): A popular resource for machine learning data sets, widely used in academic research and education.
● Google Dataset Search (datasetsearch.research.google.com): A search engine for data sets across the web, covering a broad range of topics and formats.
● Data.gov (www.data.gov): The U.S. government's open data portal, offering data sets on a wide range of topics including economics, health, and environment.
● AWS Public Datasets (registry.opendata.aws): Data sets hosted on Amazon Web Services, covering domains such as genomics, satellite imagery, and climate data.
● Open Data Portal (www.data.world): A community-driven platform for discovering and sharing data sets, with tools for analysis and collaboration.
● Stanford Large Network Dataset Collection (snap.stanford.edu/data): A collection of large network data sets, ideal for research in network analysis and social network studies.
● Microsoft Azure Open Datasets (www.azure.microsoft.com/en-us/services/open-datasets): Curated data sets from Microsoft, designed to be used with Azure machine learning tools and services.
● FiveThirtyEight (data.fivethirtyeight.com): Data sets used in FiveThirtyEight's data journalism, covering politics, sports, and culture.
● Quandl (www.quandl.com): A platform offering financial, economic, and alternative data sets for investment professionals.
● Gapminder (www.gapminder.org/data): Data sets on global development metrics, used to promote fact-based worldviews.
● World Bank Open Data (data.worldbank.org): A vast collection of global development data, including economic, social, and environmental indicators.
● UN Data (data.un.org): A gateway to global statistics provided by the United Nations, covering a wide range of topics.
● Google Cloud Public Datasets (console.cloud.google.com/marketplace/browse): Public data sets available through Google Cloud, suitable for big data analysis and machine learning.
● Yelp Dataset (www.yelp.com/dataset): A data set containing over five million reviews, useful for NLP and sentiment analysis projects.
● City of Chicago Data Portal (data.cityofchicago.org): Open data from the City of Chicago, including crime reports, public health data, and transportation information.
● European Union Open Data Portal (www.data.europa.eu/euodp/en/data): A collection of data sets from various European Union institutions and agencies, covering diverse topics.
● Humanitarian Data Exchange (data.humdata.org): A platform for sharing data on humanitarian crises, offering data sets on population, conflict, and disaster response.
● Earthdata (www.earthdata.nasa.gov): NASA's portal for accessing satellite and other earth science data, ideal for environmental and climate research.
● OpenStreetMap (www.openstreetmap.org): A collaborative project to create a free, editable map of the world, providing geographic data for a variety of uses.

Appendix C: GitHub Repositories

Introduction

GitHub is a treasure trove of resources for data scientists and AI/ML practitioners.
This appendix features a carefully selected list of GitHub repositories that offer
valuable code, tools, and libraries to enhance your data science and machine
learning projects. These repositories include everything from hands-on tutorials and
reference implementations to advanced algorithms and libraries for data
processing, modeling, and visualization. The repositories span multiple
programming languages, including Python, R, and SAS, and are suitable for learners
at all levels, from beginners to seasoned professionals. Use this appendix as a guide
to explore, learn, and contribute to the vibrant open-source data science
community.

Each entry lists the repository name, author, programming language(s), web address, and a brief description.

● SAS, Python, and R: A Cross-Reference Guide (Gearhj) - SAS, Python, R
  https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide
  Repository for this book, featuring code examples in SAS, Python, and R.

● End-to-End Data Science with SAS (Gearhj) - SAS
  https://github.com/Gearhj/End-to-End-Data-Science
  Repository for the book End-to-End Data Science with SAS, featuring hands-on SAS code examples.

● Awesome Data Science (academic) - Various
  https://github.com/academic/awesome-datascience
  A curated list of awesome data science resources, including books, courses, and tools.

● 100 Days of ML Code (Avik-Jain) - Python
  https://github.com/Avik-Jain/100-Days-Of-ML-Code
  A comprehensive guide to learning machine learning through 100 days of code, covering various ML topics.

● Tidyverse (tidyverse) - R
  https://github.com/tidyverse/tidyverse
  A collection of R packages designed for data science, making data wrangling and visualization easier.

● Caret (topepo) - R
  https://github.com/topepo/caret
  Classification and Regression Training package in R, offering tools to streamline the model training process.

● fastai (fastai) - Python
  https://github.com/fastai/fastai
  Fastai simplifies training fast and accurate neural nets using modern best practices.

● TensorFlow Models (tensorflow) - Python
  https://github.com/tensorflow/models
  Models and examples built with TensorFlow, showcasing various machine learning techniques.

● Scikit-learn (scikit-learn) - Python
  https://github.com/scikit-learn/scikit-learn
  A Python module integrating classic machine learning algorithms with the scientific Python stack.

● PyTorch Tutorials (pytorch) - Python
  https://github.com/pytorch/tutorials
  PyTorch tutorials covering basics, advanced concepts, and research topics.

● ggplot2 (tidyverse) - R
  https://github.com/tidyverse/ggplot2
  An R package for creating elegant data visualizations using the grammar of graphics.

● XGBoost (dmlc) - Python, R, Julia, Scala
  https://github.com/dmlc/xgboost
  Scalable, portable, and distributed gradient boosting library in Python, R, and other languages.

● R-Bloggers (R-Bloggers) - R
  https://github.com/r-bloggers/R-Bloggers
  A collection of R scripts and projects from the R-Bloggers community, focused on various data science topics.

● SASPy (sassoftware) - Python, SAS
  https://github.com/sassoftware/saspy
  A Python library to connect to and run SAS from Python.

● Advanced SAS (advanced-sas) - SAS
  https://github.com/advanced-sas
  A repository with advanced SAS programming techniques and examples for experienced SAS users.

● Machine Learning for Beginners (microsoft) - Python
  https://github.com/microsoft/ML-For-Beginners
  A 12-week, 26-lesson curriculum teaching basic machine learning concepts.

● Deep Learning Models (rasbt) - Python
  https://github.com/rasbt/deeplearning-models
  A collection of deep learning models and algorithms implemented in Python using TensorFlow and Keras.

● Awesome Machine Learning (josephmisiti) - Various
  https://github.com/josephmisiti/awesome-machine-learning
  A curated list of machine learning frameworks, libraries, and software.

● Hands-On Machine Learning with Scikit-Learn and TensorFlow (ageron) - Python
  https://github.com/ageron/handson-ml2
  Code for the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.

● Pandas (pandas-dev) - Python
  https://github.com/pandas-dev/pandas
  The official repository for the pandas library, essential for data manipulation and analysis in Python.

● The Algorithms (TheAlgorithms) - Python
  https://github.com/TheAlgorithms/Python
  A collection of algorithms implemented in Python for data science and machine learning.

● Data Science Notebooks (jakevdp) - Python
  https://github.com/jakevdp/PythonDataScienceHandbook
  Jupyter notebooks for the Python Data Science Handbook, covering essential data science tools.

● Deep Learning Papers Reading Roadmap (floodsung) - Various
  https://github.com/floodsung/Deep-Learning-Papers-Reading-Roadmap
  A roadmap to help understand deep learning by reading the top research papers in the field.

● NLP Projects (dair-ai) - Python
  https://github.com/dair-ai/nlp_paper_summaries
  A repository of natural language processing (NLP) projects and paper summaries.

● Machine Learning Cheatsheets (afshinea) - Various
  https://github.com/afshinea/stanford-cs-229-machine-learning
  Cheat sheets for the Stanford CS 229 Machine Learning course, covering key concepts and algorithms.

Appendix D: Chapter Quiz and Answers

Chapter 1 Quiz Answers

1. What are the primary advantages of using SAS in industries such as


financial services and biotechnology?

● SAS offers robust data handling capabilities, advanced analytics, and


strong regulatory compliance features, making it ideal for industries like
financial services and biotechnology.

2. How does Python’s open-source nature benefit data scientists in terms of


library availability and community support?

● Python's open-source nature ensures a vast repository of libraries,


regular updates, and extensive community support, enabling data
scientists to access a wide range of tools and solutions for various data
science tasks.

3. In what ways does R excel in statistical analysis and visualization


compared to other programming languages?

● R excels in statistical analysis and visualization due to its extensive


packages designed specifically for complex statistical modeling and
high-quality graphics.

4. Explain how SAS Studio’s point-and-click interface can benefit data


scientists who may not have extensive programming experience.

● SAS Studio’s point-and-click interface allows users to perform complex


data analysis tasks without needing to write code, benefiting those with
limited programming experience.

5. Discuss the role of Integrated Development Environments (IDEs) in


enhancing the productivity of data scientists.

● IDEs enhance productivity by providing tools such as debugging, code


completion, and version control, which streamline the development
process for data scientists.

6. How does Python Spyder facilitate the integration of data manipulation


and machine learning libraries such as NumPy, pandas, and scikit-learn?

● Python Spyder facilitates integration with libraries like NumPy, pandas,


and scikit-learn by offering a comprehensive development environment
tailored for scientific computing and data analysis.

7. What are the key features of RStudio that make it suitable for
reproducible research and collaboration in data science?

● RStudio is suitable for reproducible research and collaboration due to its


support for version control, literate programming with RMarkdown, and
seamless package management.

8. Describe how SAS Studio handles large data sets and why this is important
for advanced analytics.

● SAS Studio handles large data sets efficiently with its ability to process
data in chunks, perform parallel processing, and leverage in-memory
analytics, which is crucial for advanced analytics.

9. What advantages does Python Spyder offer for exploratory data analysis
in scientific computing?

● Python Spyder offers a robust environment for exploratory data analysis


with features like real-time code execution, variable exploration, and
integration with scientific libraries.

10. How does RStudio’s package management system contribute to its


flexibility in data science projects?

● RStudio’s package management system allows easy installation,


updating, and management of packages, contributing to its flexibility in
handling various data science projects.

11. Explain the significance of cross-referencing programming languages like


SAS, Python, and R in a data science project.

● Cross-referencing programming languages like SAS, Python, and R


enhances a data scientist’s ability to leverage the strengths of each
language, improving problem-solving capabilities and project outcomes.

12. How can knowledge of multiple IDEs and programming languages enhance
a data scientist’s ability to tackle complex data problems?

● Knowledge of multiple IDEs and programming languages allows a data


scientist to select the most appropriate tools for different tasks,
increasing versatility and effectiveness in solving complex data
problems.

13. Describe how SAS Studio’s log output assists in debugging and optimizing
code.

● SAS Studio’s log output provides detailed information about code


execution, including errors, warnings, and performance statistics, which
assists in debugging and optimizing code.

14. What role does the interactive console in Python Spyder play in iterative
code development?

● The interactive console in Python Spyder allows for real-time code


testing and debugging, facilitating an iterative approach to code
development and improving productivity.

15. How does RStudio support the creation of interactive data visualizations
and what are some use cases?

● RStudio supports the creation of interactive data visualizations through


packages like shiny and ggplot2, which are used in dashboards, reports,
and exploratory data analysis.

16. Why is it important for data scientists to be familiar with the different
programming environments discussed in this chapter?

● Familiarity with different programming environments allows data


scientists to choose the best tools for specific tasks, enhancing their
ability to efficiently tackle a wide range of data challenges.

17. Discuss the impact of the long-standing presence of SAS in the industry on
the development of analytical products and automated processes.

● The long-standing presence of SAS has led to the development of


mature, reliable analytical products and processes that are trusted in
critical applications across industries like finance and healthcare.

18. How do Python’s extensive libraries and frameworks support the


implementation of machine learning models?

● Python’s extensive libraries, such as scikit-learn, TensorFlow, and


PyTorch, provide comprehensive tools for implementing, training, and
deploying machine learning models across various domains.

19. What are the benefits of using RStudio for statistical modeling in academic
and research settings?

● RStudio offers powerful tools for statistical modeling, such as extensive


statistical packages, integration with RMarkdown for reproducible
research, and support for data visualization, making it ideal for
academic and research settings.

20. How does understanding the similarities and differences between SAS,
Python, and R improve a data scientist’s versatility and problem-solving
capabilities?

● Understanding the similarities and differences between SAS, Python,


and R allows data scientists to select the best tools for specific tasks,
enhancing their versatility and ability to solve a broader range of data
science problems.

Chapter 2 Quiz and Answers

1. What is the significance of the "Garbage In, Garbage Out" (GIGO) principle
in data science?

● The GIGO principle emphasizes that the quality of input data directly
affects the quality of the output. Poor-quality data will lead to
unreliable and inaccurate models, regardless of the sophistication of the
algorithms used.

2. Why is high-quality data essential for building reliable and robust models?

● High-quality data is essential because it ensures that the models are


trained on accurate and relevant information, leading to reliable
predictions and insights.

3. Describe the structure of the Lending Club data set used in this project.

● The Lending Club data set typically includes features such as loan
amount, interest rate, grade, subgrade, loan status, and borrower
attributes, which are used to assess credit risk.

4. Explain the process of creating a target variable for the Lending Club risk
model.

● The target variable is usually created by categorizing loan outcomes into


binary classes (e.g., default or non-default) based on loan status, helping
to predict credit risk.

5. How does the PROC IMPORT procedure in SAS facilitate data import and
variable selection?

● PROC IMPORT in SAS automates the process of importing data from


various file formats and allows users to specify the desired variables,
streamlining data preparation.

6. What is the role of the pandas library in Python for importing data?

● The pandas library in Python provides functions like read_csv() and


read_excel() for easy data import and manipulation, making it a go-to
tool for data preparation.
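
As a brief illustration of the answer above, here is a minimal pandas sketch; the file name and the column list are placeholders rather than exact references to the book's project files:

import pandas as pd

# Read only the columns needed for modeling directly at import time
cols = ["loan_amnt", "int_rate", "grade", "loan_status"]   # placeholder column names
df = pd.read_csv("lending_club.csv", usecols=cols)          # placeholder file name

print(df.shape)
print(df.head())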

7. How does R's fread function from the data.table package assist in data
import and selection?

● The fread function in R’s data.table package is efficient for importing


large data sets and allows users to quickly load and select variables from
a file.

8. Why is it important to download a consistent copy of the data set before


importing it into an analytical environment?

● Downloading a consistent copy ensures that all analyses are based on


the same data set, avoiding discrepancies and ensuring reproducibility.

9. What is simple random sampling, and why is it used in this project?

● Simple random sampling is a technique where each member of the


population has an equal chance of being selected. It’s used to create a
representative sample of the data set for analysis.

10. How does stratified random sampling ensure representation of subgroups


in a data set?

● Stratified random sampling involves dividing the population into


subgroups (strata) and sampling each subgroup proportionately,
ensuring that all subgroups are adequately represented.

11. Describe the process of cluster sampling and its applications.

● Cluster sampling involves dividing the population into clusters,


randomly selecting some clusters, and then sampling all members
within those clusters. It’s often used in geographically dispersed
populations.

12. What is systematic sampling, and when is it most useful?



● Systematic sampling involves selecting every nth item from a list after a
random start. It’s useful when a population is ordered and a simple
method of sampling is desired.

13. How does the use of a seed value in sampling ensure reproducibility?

● A seed value ensures that the random number generation process can
be replicated, making the sampling process reproducible.
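
A minimal sketch of the idea, using a toy DataFrame in place of the real data; the random_state argument serves as the seed:

import pandas as pd

# Toy data standing in for the full data set
df = pd.DataFrame({"loan_amnt": range(1000)})

# Re-running this line always returns the identical 10% sample because the seed is fixed
sample_df = df.sample(frac=0.10, random_state=42)
print(len(sample_df))   # 100 rows on every run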

14. Compare the data import and sampling techniques in SAS, Python, and R.

● SAS uses procedures like PROC IMPORT for data import and PROC
SURVEYSELECT for sampling. Python uses pandas for import and numpy
or scikit-learn for sampling. R uses fread for import and functions like
sample() for sampling.

15. What factors should be considered when choosing a data import method?

● Factors include file size, format, the need for variable selection, and the
efficiency of the method in the given programming environment.

16. Why might a data scientist choose to sample a data set before analysis?

● Sampling reduces the data set size, making analysis faster and more
manageable, especially with large data sets.

17. How does understanding different sampling techniques benefit a data


scientist?

● It allows the data scientist to choose the most appropriate method for
ensuring that the sample is representative and the analysis is robust.

18. What are the benefits of reducing the size of a data set through sampling?

● Benefits include reduced computational cost, faster processing, and the


ability to focus on a manageable subset of data without losing critical
information.

19. How does cross-referencing programming languages enhance a data


scientist’s ability to work with different tools?

● It enables the data scientist to leverage the strengths of different tools


and choose the best one for each task, improving overall efficiency and
effectiveness.

20. What are the next steps after importing and sampling data in a data
science project?

● The next steps typically include data cleaning, preprocessing,


exploratory data analysis, feature engineering, and then model building.

Chapter 3 Quiz and Answers

1. What are the key differences between academic data sets and real-world
data sets?

● Academic data sets are often cleaned, well-structured, and contain


fewer missing values, whereas real-world data sets are messy, contain
missing or incorrect data, and require extensive preprocessing before
analysis.

2. Explain the importance of the target variable in a data science project.

● The target variable represents the outcome or the dependent variable


that the model is trying to predict. It is crucial as it directly influences
how the model is trained and evaluated.

3. How does the "art" of data science influence the decision-making process
during data preparation?

● The "art" of data science involves making subjective decisions based on


intuition, experience, and domain knowledge, especially when dealing
with ambiguous or incomplete data. This can include choosing how to
handle missing data, selecting relevant features, and determining


appropriate data transformations.

4. What is the business problem defined for the Lending Club data set in this
chapter?

● The business problem involves predicting whether a borrower will


default on a loan, which is crucial for risk management and decision-
making in lending.

5. How is the "loan_status" field in the Lending Club data set used to create
the target variable "bad"?

● The "loan_status" field is recoded into a binary target variable "bad,"


where certain statuses indicating delinquency or default are labeled as
"bad" loans, while others are considered "good."

6. Why is it important to examine the stability of the target variable over


time?

● Examining stability ensures that the target variable’s distribution


remains consistent over time, which is important for building a model
that performs reliably on future data.

7. What is "runway time," and how does it affect the analysis of the target
variable?

● "Runway time" refers to the period between when a loan is issued and
when its performance can be evaluated. It affects the analysis by
determining the point at which the outcome of the loan (e.g., default or
no default) can be reliably assessed.

8. Describe the process of limiting the data set to a specific date range and its
impact on the modeling process.

● Limiting the data set to a specific date range ensures that the data used
for modeling is relevant and consistent, which helps in building a model
that reflects current trends and behaviors.

9. What is Exploratory Data Analysis (EDA), and why is it important in data


science?

● EDA involves analyzing the data through summary statistics and


visualizations to uncover patterns, anomalies, and relationships. It is
important because it provides insights that guide the data preparation
and modeling processes.

10. How can summary statistics and visualizations like histograms help in
understanding the data set?

● Summary statistics provide a numerical overview of the data, while


histograms visualize the distribution of variables, helping to identify
central tendencies, spread, skewness, and the presence of outliers.

11. What are outliers, and why is it important to address them in data
science?

● Outliers are data points that deviate significantly from the rest of the
data. Addressing them is important because they can skew the results
and negatively impact the performance of models.

12. Explain the Winsorizing technique for handling outliers.

● Winsorizing involves limiting extreme values in the data to reduce the


effect of possibly spurious outliers, replacing the extreme values with
the nearest value that is not considered an outlier.

13. What is the Interquartile Range (IQR) method, and how is it used to
identify outliers?

● The IQR method identifies outliers by calculating the range between the
first and third quartiles (Q1 and Q3) and defining outliers as points that
fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
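
To make the two preceding answers concrete, here is a short pandas sketch that computes the IQR fences and then caps and floors values at those fences (a Winsorizing-style treatment); the series values are made up for illustration:

import pandas as pd

s = pd.Series([30, 35, 40, 42, 45, 50, 55, 60, 400])   # one extreme value

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]          # points outside the IQR fences
winsorized = s.clip(lower=lower, upper=upper)    # cap/floor the extremes
print(outliers.tolist(), winsorized.max())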

14. Why is feature selection important in machine learning, and what are its
benefits?

● Feature selection is important because it reduces the dimensionality of


the data, which can improve model performance, reduce overfitting,
and decrease computational cost.

15. How can correlation analysis be used as a feature selection method?

● Correlation analysis helps in identifying features that are highly


correlated with the target variable and with each other, allowing the
selection of the most relevant features while avoiding multicollinearity.

16. What are wrapper methods in feature selection, and how do they differ
from filtering methods?

● Wrapper methods involve selecting features based on the model's


performance with different subsets of features, whereas filtering
methods select features based on statistical criteria, independent of the
model.

17. Explain the concept of embedded methods in feature selection and their
implementation.

● Embedded methods perform feature selection during the model


training process, integrating feature selection with the modeling
process itself, such as regularization techniques in regression models.

18. What is feature engineering, and how can it improve model performance?

● Feature engineering involves creating new features or transforming


existing ones to better represent the underlying patterns in the data,
thus improving model accuracy and interpretability.

19. Describe the process of feature scaling and its importance in machine
learning.

● Feature scaling standardizes the range of features, ensuring that all


features contribute equally to the model. It is particularly important for
algorithms sensitive to feature magnitudes, like SVMs and gradient
descent.

20. How can encoding categorical variables impact the performance of a


machine learning model?

● Encoding categorical variables allows the model to process non-numeric


data by converting categories into numeric codes, which can improve
the model's performance and its ability to learn from the data.

Chapter 4 Quiz and Answers

1. What is a model pipeline, and why is it important in machine learning


projects?

● A model pipeline is a sequence of data processing steps that are applied


to data before it is passed through a machine learning model. It is
important because it ensures that all steps, from data preprocessing to
model training and evaluation, are conducted systematically and
consistently, reducing errors, and improving reproducibility.

2. Explain the role of outlier treatment in data preparation.

● Outlier treatment involves identifying and handling extreme values that


deviate significantly from the rest of the data. Proper treatment of
outliers can prevent them from disproportionately influencing the
model and leading to biased or inaccurate predictions.

3. What is feature engineering, and how can it enhance model performance?

● Feature engineering is the process of creating new features or


modifying existing ones to better represent the underlying patterns in
the data. It can enhance model performance by improving the model’s
ability to learn from the data, leading to more accurate predictions.

4. Describe different techniques for handling missing values in a data set.

● Techniques for handling missing values include:



▪ Imputation: Filling in missing values with the mean,


median, or mode.
▪ Deletion: Removing rows or columns with missing data.
▪ Using algorithms that can handle missing values, such as
decision trees.
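
A minimal pandas sketch of the imputation and deletion options listed above, using a small made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50000, np.nan, 62000, 75000],
                   "grade":  ["A", "B", np.nan, "C"]})

# Imputation: numeric column with the median, categorical column with the mode
df["income"] = df["income"].fillna(df["income"].median())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])

# Deletion: drop any rows that still contain missing values
df = df.dropna()
print(df)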

5. What is categorical variable encoding, and why is it necessary for


predictive modeling?

● Categorical variable encoding involves converting categorical data into a


numerical format that can be used by machine learning models. This is
necessary because most models require numerical input, and encoding
allows the inclusion of categorical variables in the model.

6. How does data balancing address class imbalance in binary classification


problems?

● Data balancing involves adjusting the distribution of classes in the


training data to ensure that the model is not biased toward the majority
class. Techniques include oversampling the minority class,
undersampling the majority class, or using synthetic data generation
methods like SMOTE.

7. What is the recommended sequence for data preparation steps in a model


pipeline?

● The recommended sequence typically includes:

o Data Cleaning (handling missing values, outliers)


o Feature Engineering (creating new features)
o Categorical Variable Encoding
o Data Balancing
o Data Splitting (train-validation-test)

8. Why is data segmentation critical in assessing model performance and


generalization?

● Data segmentation (splitting data into training, validation, and test sets)
is critical because it allows for an unbiased evaluation of the model’s
performance on unseen data, ensuring that the model generalizes well
beyond the data it was trained on.

9. Describe the train-validation-test split and its purpose in predictive


modeling.

● The train-validation-test split involves dividing the data set into three
parts:

o Training set: Used to train the model.


o Validation set: Used to tune hyperparameters and evaluate the
model during training.
o Test set: Used for the final evaluation of the model’s
performance on unseen data.

10. What is out-of-time validation, and when is it most useful?

● Out-of-time validation involves testing a model on data from a time


period that was not included in the training set. It is useful for time-
series data or when there is a temporal component, ensuring that the
model performs well on future data.

11. Explain the concept of cross-validation and its benefits in model


evaluation.

● Cross-validation is a technique where the data set is divided into


multiple subsets (folds), and the model is trained and validated on
different folds in a systematic way. This provides a more robust
estimate of model performance by reducing variance due to the specific
data split.

12. How does k-fold cross-validation work, and why is it beneficial?

● K-fold cross-validation divides the data set into k equally sized folds. The
model is trained on k-1 folds and validated on the remaining fold,
repeating this process k times. It is beneficial because it uses all data
points for both training and validation, leading to a more accurate and
stable estimate of model performance.
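
A short scikit-learn sketch of 5-fold cross-validation, using synthetic data purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves once as the validation set
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())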

13. What are the key steps in a model pipeline?

● Key steps include data cleaning, feature engineering, encoding


categorical variables, data balancing, model training, hyperparameter
tuning, validation, and final model evaluation.

14. Why is it important to preserve the integrity of out-of-time data during


model evaluation?

● Preserving the integrity of out-of-time data ensures that the model’s


performance is evaluated on truly unseen data, which is critical for
assessing how well the model will perform in a real-world, future
setting.

15. What metrics are commonly used to evaluate model performance?

● Common metrics include accuracy, precision, recall, F1 score, AUC-ROC,


mean squared error (MSE), mean absolute error (MAE), R-squared, and
adjusted R-squared.

16. How does the AUC metric help in assessing the performance of a binary
classifier?

● AUC (Area Under the Curve) measures the model’s ability to distinguish
between classes across all threshold levels. A higher AUC indicates a
better performing model that can effectively differentiate between
positive and negative cases.

17. Why is it important to compare model performance using visualizations


like bar charts and confusion matrices?

● Visualizations help to quickly identify patterns, strengths, and


weaknesses in model performance, making it easier to communicate
results and understand the model’s behavior.

18. Describe the process of hyperparameter tuning in a model pipeline.

● Hyperparameter tuning involves selecting the best set of


hyperparameters for a model, typically done through methods like grid
search or random search, using cross-validation to assess performance.

19. How does one-hot encoding facilitate the inclusion of categorical variables
in machine learning models?

● One-hot encoding converts categorical variables into a binary format


where each category is represented by a separate column, allowing the
model to interpret categorical data as numerical inputs.
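
A minimal pandas sketch of one-hot encoding; the grade column and its values are placeholders:

import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "C", "A"]})

# Each category becomes its own 0/1 column; drop_first avoids the dummy variable trap
encoded = pd.get_dummies(df, columns=["grade"], drop_first=True)
print(encoded)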

20. What are the benefits of using a systematic model pipeline in data science
projects?

● A systematic model pipeline ensures consistency, reproducibility, and


efficiency in model development, allowing data scientists to build,
evaluate, and deploy models more effectively.

Chapter 5 Quiz and Answers

1. What is the primary difference between regression and classification


models in predictive modeling?

● Regression models predict continuous outcomes, while classification


models predict categorical outcomes.

2. How does the quality of the modeling data set impact the accuracy of
predictive models?

● The quality of the data set directly impacts the accuracy of predictive
models, as clean, well-prepared data ensures the model can learn
meaningful patterns rather than noise.

3. What are the key steps involved in the model pipeline for data
preparation?

● Key steps include data cleaning, handling missing values, feature


engineering, encoding categorical variables, data balancing, and splitting
the data into training, validation, and test sets.

4. Explain the difference between linear regression and logistic regression.

● Linear regression predicts continuous outcomes using a linear


relationship between input variables and the target, while logistic
regression predicts binary outcomes using the logistic function to model
probabilities.

5. What role does the logistic function play in logistic regression?

● The logistic function maps predicted values to probabilities between 0


and 1, which are then used to classify the data into binary outcomes.

6. How is the odds ratio interpreted in the context of logistic regression?

● The odds ratio represents the change in odds of the outcome occurring
for a one-unit increase in the predictor variable, holding other variables
constant.
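
To illustrate the two answers above, here is a short Python sketch of the logistic function and of converting a hypothetical fitted coefficient into an odds ratio:

import numpy as np

def logistic(z):
    # Map any real-valued score z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

coef = 0.8                 # hypothetical coefficient for one predictor
print(logistic(0.0))       # 0.5 -- the midpoint of the probability scale
print(np.exp(coef))        # the odds ratio: odds multiply by roughly 2.2 per one-unit increase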

7. What assumptions must be met for logistic regression to be effective?

● Assumptions include linearity between the log odds and independent


variables, no multicollinearity, independence of errors, and a sufficiently
large sample size.

8. What is the significance of the logit link function in logistic regression?

● The logit link function transforms the probability of the outcome into a
linear combination of the predictor variables, facilitating the use of
linear regression techniques for binary classification.

9. How do decision trees differ from logistic regression in handling


multicollinearity?

● Decision trees are not sensitive to multicollinearity, as they do not rely


on coefficients but rather on splitting data based on the most
informative features.

10. What is impurity, and how is it used in constructing decision trees?

● Impurity measures the homogeneity of a node; decision trees use


impurity (e.g., Gini impurity, entropy) to decide the best feature to split
the data at each node.

11. Describe the process of creating child nodes in a decision tree.

● Child nodes are created by splitting a parent node based on the feature
that results in the highest reduction in impurity, dividing the data into
more homogeneous groups.

12. What is the purpose of pruning in decision trees?

● Pruning removes branches that add little predictive power to prevent


overfitting and improve the model’s generalizability to unseen data.

13. How is the ROC-AUC metric used to evaluate the performance of a


decision tree model?

● ROC-AUC measures the model's ability to discriminate between positive


and negative classes across all threshold values, with a higher AUC
indicating better performance.

14. What are the advantages of using decision trees over logistic regression?

● Decision trees can handle non-linear relationships, are robust to outliers


and multicollinearity, and inherently perform feature selection.

15. How does feature selection occur inherently in decision trees?

● Feature selection occurs as the tree algorithm chooses the most


informative features at each split, automatically reducing the
dimensionality of the data.

16. What is the difference between reduced error pruning and cost complexity
pruning?

● Reduced error pruning removes nodes if it improves validation set


accuracy, while cost complexity pruning removes nodes based on a
complexity parameter that balances tree size and accuracy.

17. Explain the concept of hyperparameter tuning and its importance in


machine learning.

● Hyperparameter tuning involves selecting the best set of


hyperparameters for a model, which can significantly affect the model's
performance and its ability to generalize to new data.

18. How can overfitting be prevented in decision trees?

● Overfitting can be prevented by pruning, limiting tree depth, or using


techniques like cross-validation to ensure the model generalizes well.

19. What are the key performance metrics used to evaluate a decision tree
model?

● Key metrics include accuracy, precision, recall, F1 score, ROC-AUC, and


Gini coefficient.

20. What are the pros and cons of using logistic regression versus decision
trees for predictive modeling?

● Logistic Regression:

o Pros: Easy to implement, interpretable, works well with linearly


separable data.
o Cons: Sensitive to outliers, requires linearity assumption,
multicollinearity can be an issue.

● Decision Trees:

o Pros: Handles non-linear relationships, robust to outliers and


multicollinearity, easy to interpret.
o Cons: Prone to overfitting without pruning, sensitive to small
changes in data.

Chapter 6 Quiz and Answers

1. What is the main advantage of using ensemble methods over individual


models in predictive modeling?

● Ensemble methods improve model performance by combining multiple


models to reduce overfitting and increase generalizability, leading to
more robust predictions.

2. How does Random Forest reduce the risk of overfitting compared to a


single decision tree?

● Random Forest reduces overfitting by averaging multiple decision trees,


each trained on a random subset of the data, which decreases the
likelihood that any single tree dominates the model's predictions.

3. What is the role of the Out-of-Bag (OOB) metric in Random Forest?

● The OOB metric provides an unbiased estimate of the model’s accuracy


by using data not included in the bootstrap sample for each tree to
evaluate the model's performance.

4. Explain the process of bootstrapping in the context of Random Forest.

● Bootstrapping involves creating multiple training data sets by randomly


sampling with replacement from the original data set. Each tree in the
Random Forest is trained on one of these bootstrap samples, ensuring
diversity among the trees.
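
A scikit-learn sketch showing bootstrapping and the out-of-bag estimate in practice; the synthetic data and settings are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Each tree is fit on a bootstrap sample; the rows left out of that sample
# (the "out-of-bag" rows) provide an accuracy estimate without a separate holdout
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print(rf.oob_score_)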

5. How does Gradient Boosting differ from Random Forest in its approach to
building models?

● Gradient Boosting builds models sequentially, where each new tree


attempts to correct the errors made by the previous trees, whereas
Random Forest builds all trees independently and combines their
predictions.

6. What is the significance of the learning rate in Gradient Boosting models?

● The learning rate controls the contribution of each tree to the final
model. A lower learning rate reduces the impact of each individual tree,
which can lead to better generalization but requires more trees.

7. Describe the key difference between XGBoost and traditional Gradient


Boosting models.

● XGBoost introduces optimizations such as regularization, parallelization,


and tree pruning to enhance the performance and efficiency of Gradient
Boosting models.

8. When should you consider using Random Forest over Gradient Boosting?

● Random Forest is preferable when you need a quick, robust model with
less risk of overfitting, especially in data sets with many features and
less need for fine-tuned performance.

9. How does the randomness in Random Forest contribute to its robustness?

● Randomness in both feature selection and data sampling helps Random


Forest to avoid overfitting and makes the model more robust by
reducing variance.

10. What is feature importance, and how is it calculated in Random Forest?

● Feature importance measures how much each feature contributes to


the prediction accuracy. In Random Forest, it is calculated based on the
average decrease in impurity (Gini or entropy) across all trees where the
feature is used for splitting.

11. How does Gradient Boosting handle difficult-to-predict observations?



● Gradient Boosting iteratively focuses on difficult-to-predict observations


by giving them more weight in subsequent models, thereby improving
their prediction accuracy over time.

12. Explain the concept of early stopping in Gradient Boosting.

● Early stopping involves halting the training process once the model’s
performance on a validation set stops improving, which helps prevent
overfitting.

13. What are the typical use cases for LightGBM compared to XGBoost?

● LightGBM is typically used for large data sets with many features, where
its gradient-based one-sided sampling and leaf-wise growth strategy
offer faster training times and lower memory usage compared to
XGBoost.

14. How do ensemble methods like Random Forest handle multicollinearity?

● Random Forest is relatively unaffected by multicollinearity because it


randomly selects subsets of features for each tree, preventing any one
feature from dominating the model.

15. Why is regularization important in Gradient Boosting models?

● Regularization in Gradient Boosting models helps prevent overfitting by


penalizing complex models, thereby improving the model’s ability to
generalize to new data.

16. What are the pros and cons of using Gradient Boosting in terms of
computational efficiency?

● Pros: Gradient Boosting often results in more accurate models. Cons: It


can be computationally expensive and time-consuming, particularly
with large data sets or when using a low learning rate.

17. How can you use permutation importance to assess feature significance in
Gradient Boosting?

● Permutation importance is assessed by randomly shuffling a feature's


values and observing the decrease in model accuracy, which indicates
the feature's importance in making accurate predictions.

18. What metrics would you use to evaluate the performance of a Random
Forest model?

● Common metrics include accuracy, precision, recall, F1 score, ROC-AUC,


and feature importance.

19. How does Gradient Boosting handle imbalanced data sets compared to
Random Forest?

● Gradient Boosting can be more sensitive to imbalanced data sets, but it


can handle them by adjusting class weights or using specialized
techniques like SMOTE. Random Forest also handles imbalanced data
well by adjusting the decision criteria within individual trees.

20. What are the common hyperparameters that need tuning in Gradient
Boosting models?

● Common hyperparameters include learning rate, number of trees


(n_estimators), maximum depth of trees, subsample size, and
regularization parameters (like L1 and L2 penalties).
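
A hedged sketch of how these hyperparameters appear in the XGBoost API; the specific values are placeholders, not recommendations:

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=300,    # number of boosted trees
    learning_rate=0.05,  # contribution of each tree
    max_depth=4,         # maximum depth of each tree
    subsample=0.8,       # fraction of rows sampled per tree
    reg_alpha=0.1,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
)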

Chapter 7 Quiz and Answers

1. What is the primary goal of a Support Vector Machine (SVM) in


classification tasks?

● The primary goal of an SVM in classification tasks is to find the optimal


hyperplane that maximizes the margin between the different classes in
the data set.

2. How does a kernel function in SVM help in handling non-linear


separations?

● A kernel function in SVM maps the input features into higher-


dimensional space, allowing the model to find a linear separation in this
transformed space, thus handling non-linear relationships in the original
data.

3. Explain the role of the C parameter in SVM.

● The C parameter in SVM controls the trade-off between maximizing the


margin and minimizing classification errors. A small C value allows for a
wider margin, even if it means more classification errors, while a large C
value focuses on classifying all training points correctly at the cost of a
smaller margin.
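
A short scikit-learn sketch of tuning C with a small grid search on synthetic data; the grid values are illustrative only:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=42)

# Small C -> wider margin (more tolerance for errors); large C -> fewer training errors
grid = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)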

4. What are support vectors, and why are they important in SVM?

● Support vectors are the data points closest to the hyperplane, and they
define the margin of separation between classes. These vectors are
crucial because the position and orientation of the hyperplane depend
directly on them.

5. Describe the architecture of a basic neural network.

● A basic neural network consists of an input layer, one or more hidden


layers, and an output layer. Each layer is made up of neurons that are
connected to neurons in the subsequent layer, with each connection
having an associated weight.

6. How does the backpropagation algorithm work in neural networks?

● Backpropagation is an algorithm used to minimize the error by adjusting


the weights in the network. It works by propagating the error from the
output layer back through the network and updating the weights based
on the gradient of the loss function with respect to each weight.

7. What is the significance of activation functions in neural networks?



● Activation functions introduce non-linearity into the network, enabling


it to learn and model complex relationships in the data. Without
activation functions, the network would only be able to model linear
relationships.

8. In what scenarios is it preferable to use SVM over neural networks?

● SVM is preferable when working with smaller data sets, when the
number of features is large relative to the number of observations, and
when the decision boundary is clear but non-linear.

9. What are the advantages of using neural networks for deep learning
tasks?

● Neural networks, particularly deep ones, excel at capturing complex


patterns and representations in data, making them highly effective for
tasks such as image and speech recognition, where the data is large and
unstructured.

10. How does data standardization impact the performance of SVMs and
neural networks?

● Data standardization scales the features to have a mean of 0 and a


standard deviation of 1, which is crucial for SVMs and neural networks
as it ensures that all features contribute equally to the model and
prevents any single feature from dominating the learning process.

11. What is the difference between overfitting and underfitting, and how does
the C parameter in SVM address these issues?

● Overfitting occurs when the model is too complex and captures noise in
the data, while underfitting happens when the model is too simple to
capture the underlying trend. The C parameter in SVM helps balance
this by controlling the trade-off between achieving a wider margin
(which may lead to underfitting) and minimizing classification errors
(which may lead to overfitting).

12. Why is it important to tune hyperparameters in neural networks?



● Tuning hyperparameters, such as learning rate, number of layers, and


number of neurons, is important to ensure that the network learns
effectively, generalizes well to new data, and avoids overfitting or
underfitting.

13. What is the difference between a feedforward neural network and a


recurrent neural network?

● A feedforward neural network has connections that flow in one


direction from input to output, while a recurrent neural network has
connections that form cycles, allowing it to maintain a memory of
previous inputs and making it suitable for sequence prediction tasks.

14. How can class imbalance be addressed when training SVMs and neural
networks?

● Class imbalance can be addressed by techniques such as resampling the


data (oversampling the minority class or undersampling the majority
class), using class weights to penalize the model more for misclassifying
the minority class, or generating synthetic samples (e.g., using SMOTE).

15. Explain the concept of a margin in SVM and its importance.

● The margin in SVM is the distance between the hyperplane and the
nearest data points (support vectors) from each class. A larger margin is
preferred as it implies a better generalization ability of the model.

16. What are some common use cases for neural networks in machine
learning?

● Common use cases for neural networks include image recognition,


speech recognition, natural language processing, and any tasks that
involve complex, high-dimensional data.

17. How do you assess the performance of an SVM model?



● The performance of an SVM model is typically assessed using metrics


such as accuracy, precision, recall, F1 score, ROC-AUC, and sometimes
the margin width.

18. Why might a deep neural network require more computational resources
than an SVM?

● A deep neural network typically requires more computational resources


because it involves a large number of parameters (weights) to optimize,
multiple layers, and potentially a large amount of data, all of which
demand significant processing power and memory.

19. What are the pros and cons of using a radial basis function (RBF) kernel in
SVM?

● Pros: The RBF kernel can handle non-linear relationships well and can
map the data into higher dimensions to find a separating hyperplane.
Cons: It may lead to overfitting if not properly tuned and can be
computationally expensive for large data sets.

20. Describe the process of hyperparameter tuning in SVMs and neural


networks.

● Hyperparameter tuning involves systematically testing different


combinations of hyperparameters (such as the C and gamma in SVM or
the learning rate and number of layers in neural networks) to find the
set that provides the best model performance on a validation set. This
can be done using grid search, random search, or more advanced
methods like Bayesian optimization.

Chapter 8 Quiz and Answers

1. What is the primary purpose of performance metrics in evaluating


predictive models?

● The primary purpose of performance metrics is to quantify the accuracy


and effectiveness of predictive models, enabling data scientists to
assess how well the model is performing in terms of making correct
predictions.

2. How do classification metrics differ from regression metrics in terms of


their application and interpretation?

● Classification metrics evaluate the performance of models predicting


categorical outcomes (e.g., accuracy, precision, recall), while regression
metrics assess the performance of models predicting continuous
outcomes (e.g., MSE, R-squared).

3. Explain the significance of the confusion matrix in classification model


evaluation.

● The confusion matrix provides a detailed breakdown of true positives,


false positives, true negatives, and false negatives, offering insight into
the model's predictive performance, including precision and recall.

4. What does a high Precision score indicate in a classification model?

● A high Precision score indicates that a large proportion of the positive


predictions made by the model are correct, meaning the model is good
at minimizing false positives.

5. How is the AUC (Area Under the Curve) calculated, and what does it
represent?

● The AUC is calculated as the area under the ROC curve, which plots the
True Positive Rate (Sensitivity) against the False Positive Rate (1 -
Specificity). It represents the model’s ability to distinguish between the
positive and negative classes across different thresholds.

6. Describe the relationship between the Gini coefficient and AUC.



● The Gini coefficient is directly related to AUC and is calculated as Gini =


2 * AUC - 1. It quantifies the inequality in model predictions, with higher
values indicating better model performance.
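
A minimal scikit-learn sketch of the AUC-to-Gini relationship; the labels and predicted probabilities are made up for illustration:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.35, 0.60, 0.80, 0.65, 0.70]

auc = roc_auc_score(y_true, y_prob)
gini = 2 * auc - 1
print(auc, gini)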

7. What is the KS Statistic, and why is it important in model evaluation?

● The KS Statistic (Kolmogorov-Smirnov Statistic) measures the maximum


difference between the cumulative distributions of the positive and
negative classes. It is important for evaluating the discriminatory power
of a model.

8. How does the Mean Squared Error (MSE) differ from Mean Absolute Error
(MAE) in regression analysis?

● MSE measures the average of the squared differences between actual


and predicted values, making it more sensitive to outliers, while MAE
measures the average of the absolute differences, providing a more
direct interpretation of error in the same units as the target variable.

9. Why is R-squared an important metric in regression, and what does it tell


you about the model?

● R-squared indicates the proportion of the variance in the dependent


variable that is predictable from the independent variables, showing
how well the model explains the variation in the data.

10. Explain the difference between R-squared and Adjusted R-squared.

● Adjusted R-squared adjusts the R-squared value for the number of


predictors in the model, penalizing for adding unnecessary predictors
that do not improve the model’s explanatory power.

11. What are AIC and BIC, and how are they used to assess model complexity
in regression models?

● AIC (Akaike Information Criterion) and BIC (Bayesian Information


Criterion) are metrics used to assess model complexity by penalizing
models for having too many parameters, helping to select models that
balance fit and complexity.

12. Why is it important to monitor a model after it has been deployed into
production?

● Monitoring a model after deployment is crucial to ensure that it


continues to perform well as the data or environment changes,
preventing performance degradation over time.

13. Describe the role of model packaging in deploying models from a


development environment to production.

● Model packaging involves bundling the model with its dependencies


and preprocessing steps, ensuring that it can be reliably deployed and
executed in a production environment.

14. What are Docker and Kubernetes, and how do they assist in model
deployment?

● Docker is a platform for creating, deploying, and running applications in


containers, while Kubernetes is an orchestration system for automating
deployment, scaling, and management of containerized applications,
both of which help manage and scale models in production.

15. What is PSI (Population Stability Index), and why is it critical in model
monitoring?

● PSI measures shifts in the distribution of input features over time. It is


critical for detecting changes in the population that the model is applied
to, which could impact model performance.
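
One common formulation of PSI, shown as a NumPy sketch; the binning scheme and the synthetic score distributions are illustrative assumptions, not production code from the book:

import numpy as np

def psi(expected, actual, bins=10):
    # Bin edges come from the expected (model-build-time) distribution
    cuts = np.unique(np.percentile(expected, np.linspace(0, 100, bins + 1)))
    # Clip both samples into the edge range so every value lands in a bin
    e_counts, _ = np.histogram(np.clip(expected, cuts[0], cuts[-1]), bins=cuts)
    a_counts, _ = np.histogram(np.clip(actual, cuts[0], cuts[-1]), bins=cuts)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 5000)   # scores at model build time
current = rng.normal(0.2, 1.0, 5000)    # scores observed in production
print(psi(baseline, current))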

16. How does VSI (Variance Stability Index) contribute to model stability
monitoring?

● VSI measures changes in the variance of model predictions over time,


helping to identify when the model is encountering different data than
it was trained on, which could affect its accuracy.

17. What does a yellow range in threshold analysis indicate about a model’s
performance?

● A yellow range in threshold analysis indicates that the model’s


performance is degrading and may require attention or intervention,
but it is not yet critically underperforming.

18. How can the lift table help in setting thresholds for classification models?

● The lift table helps identify the deciles where the model is most
effective at capturing true positives, guiding the setting of thresholds to
maximize business objectives like fraud detection or customer
retention.

19. Why is it necessary to periodically reassess the thresholds set for model
performance metrics?

● Periodically reassessing thresholds is necessary because the data or


business environment may change over time, which could impact the
model’s performance and the appropriateness of the original
thresholds.

20. Explain how dashboards and monitoring reports can be used to track and
evaluate model performance over time.

● Dashboards and monitoring reports provide real-time and periodic


insights into model performance, tracking key metrics, detecting
anomalies, and ensuring that the model remains effective in production.

Chapter 9 Quiz and Answers

1. What is the primary advantage of integrating Python, R, and SAS within a


single environment?

● The primary advantage is the ability to leverage the strengths and


specialized capabilities of each language, leading to more efficient
workflows and robust data analysis. Integration allows seamless use of


different tools and libraries best suited for specific tasks.

2. How does the rpy2 library facilitate the use of R code in Python Spyder?

● The rpy2 library allows R code to be embedded and executed within


Python, enabling data scientists to access R's statistical functions and
libraries directly from Python, thereby enhancing Python’s analytical
capabilities.
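
A minimal sketch of the idea, assuming rpy2 and a local R installation are available; the R expression itself is arbitrary:

import rpy2.robjects as ro

# Execute a line of R from Python and bring the result back as a Python list
r_result = ro.r("summary(rnorm(100, mean = 0, sd = 1))")
print(list(r_result))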

3. Describe how the subprocess module can be used to run SAS code in
Python.

● The subprocess module in Python can execute external programs,


including SAS scripts, by running command-line instructions from within
Python code. This enables Python to control and integrate SAS
processes as part of a larger workflow.

4. What is the role of the Reticulate package in RStudio?

● The Reticulate package in RStudio allows R to interface with Python,


enabling Python code to be executed within R scripts. It supports data
sharing between R and Python, facilitating a seamless integration of the
two languages.

5. How can SAS code be integrated into an RMarkdown document in


RStudio?

● SAS code can be integrated into an RMarkdown document using the sas
engine in code chunks. This allows for the execution of SAS code directly
within the RMarkdown file, producing dynamic reports that combine
SAS output with R analysis.

6. Explain how the RSASSA package allows for interaction between RStudio
and SAS.

● The RSASSA package enables RStudio to connect to a SAS server and run
SAS code from within R. It facilitates the transfer of data between R and
SAS, allowing for a cohesive workflow that leverages the strengths of


both environments.

7. What are the benefits of using Jupyter Notebooks within SAS Studio?

● Jupyter Notebooks within SAS Studio provide an interactive


environment where data scientists can combine code, visualizations,
and narrative text. This allows for dynamic and reproducible data
analyses that can incorporate multiple languages and tools.

8. How does the X Python statement work in SAS Studio?

● The X Python statement in SAS Studio allows users to execute Python


code directly within a SAS session. This integration enables the
combination of SAS data manipulation capabilities with Python’s
analytical and visualization libraries.

9. What are the key differences between SAS Studio and SAS Enterprise
Guide in terms of language integration?

● SAS Studio is more flexible in integrating languages like Python and R,


providing built-in support for executing Python code and integrating
Jupyter Notebooks. SAS Enterprise Guide is more focused on point-and-
click workflows with less emphasis on multi-language support.

10. Why might a data scientist choose to run Python code within RStudio?

● A data scientist might choose to run Python code within RStudio to


leverage Python’s machine learning libraries while maintaining the
ability to perform statistical analysis and reporting in R, all within a
single environment.

11. What steps are necessary to run R code within Python Spyder?

● To run R code within Python Spyder, one would typically use the rpy2
library, which allows R functions and objects to be accessed directly
from Python. Installation of rpy2 and setting up the appropriate
environment are necessary steps.

12. How does using the SASPy module enhance Python's capabilities in
handling SAS code?

● The SASPy module allows Python to connect to SAS, execute SAS code,
and retrieve results. This integration enhances Python’s capabilities by
enabling it to harness the powerful data manipulation and statistical
analysis tools in SAS.
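
A minimal sketch of the idea, assuming SASPy is installed and a SAS connection profile has been configured; the submitted PROC step is arbitrary:

import saspy

sas = saspy.SASsession()                                  # connect using the configured profile
result = sas.submit("proc means data=sashelp.class; run;")
print(result["LOG"][:500])                                # SAS log text returned to Python
sas.endsas()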

13. In what situations would it be beneficial to use SAS Studio over SAS
Enterprise Guide?

● SAS Studio is beneficial when there is a need for flexibility in integrating


multiple programming languages (e.g., Python, R), or when working in a
web-based environment that allows for easier collaboration and access.

14. How can language fusion improve the efficiency of data science
workflows?

● Language fusion allows data scientists to choose the most appropriate


tool for each task, improving efficiency by using specialized libraries and
functions from different languages, all within a single workflow.

15. Describe a scenario where integrating Python, R, and SAS would be


particularly advantageous.

● Integrating Python, R, and SAS is advantageous in a scenario where


Python’s machine learning capabilities, R’s statistical analysis tools, and
the data processing strengths of SAS are all required to complete a
comprehensive data science project.

16. What are the challenges of maintaining a multilingual data science


environment?

● Challenges include managing dependencies and environments for each


language, ensuring compatibility and communication between
languages, and the complexity of maintaining documentation and
codebases that span multiple programming languages.

17. How does RStudio's integration of Python and SAS differ from that of
Python Spyder?

● RStudio integrates Python and SAS primarily through the Reticulate and
RSASSA packages, respectively, allowing code from these languages to
be executed within R scripts. Python Spyder, on the other hand, uses
libraries like rpy2 and subprocess to run R and SAS code within Python
scripts.

18. What are the potential drawbacks of using multiple programming


languages within a single project?

● Potential drawbacks include increased complexity in code maintenance,


difficulties in debugging across languages, potential for incompatibility
issues, and the need for data scientists to be proficient in multiple
languages.

19. How can Jupyter Notebooks be used to create a dynamic document that
includes Python and R code?

● Jupyter Notebooks support multi-language cells, allowing both Python


and R code to be executed within the same document. By using magic
commands like %R, users can seamlessly switch between languages in
different cells, creating dynamic, interactive documents.

20. What considerations should be taken into account when choosing an


environment for language fusion?

● Considerations include the ease of integration between languages, the


availability of libraries and tools, the complexity of setting up and
maintaining the environment, and the specific needs of the project (e.g.,
data processing, machine learning, or statistical analysis).

Appendix E: Glossary of Data Science and AI/ML Terms

A/B Testing
A method of comparing two versions of a web page, model, or other product to
determine which one performs better. Commonly used in marketing and product
development.

Accuracy
A metric that measures the percentage of correctly predicted instances out of the
total instances, often used to evaluate classification models.

Activation Function
A function in a neural network that determines the output of a node given an input
or set of inputs. Common activation functions include ReLU, sigmoid, and tanh.

Akaike Information Criterion (AIC)


A metric used to compare the goodness of fit of different statistical models,
penalizing models with more parameters to avoid overfitting.

Algorithm
A set of rules or instructions given to an AI/ML model to help it learn from the data
and make predictions or decisions.

Area Under the Curve (AUC)


A performance metric for classification models, specifically measuring the area
under the Receiver Operating Characteristic (ROC) curve, which plots true positive
rate versus false positive rate.

Artificial Intelligence (AI)


The simulation of human intelligence in machines that are programmed to think,
learn, and problem-solve like humans.

Backpropagation
An algorithm used in training neural networks, where the error is calculated and
propagated backward through the network to update the weights.

Bagging (Bootstrap Aggregating)


An ensemble method that trains multiple models on different subsets of the data
(created via bootstrapping) and combines their predictions to improve accuracy and
reduce variance.

Bayesian Inference
A statistical method that updates the probability estimate for a hypothesis as more
evidence or information becomes available.

Bayesian Information Criterion (BIC)


Similar to AIC, but with a stronger penalty for models with more parameters,
making it more conservative in model selection.

Bias
A model’s systematic error that leads to consistent deviations from the true value,
typically resulting in underfitting.

Bias-Variance Trade-off
A balance between bias (error from erroneous assumptions) and variance (error
from sensitivity to small fluctuations in the training data). The trade-off affects
model performance on unseen data.


Classification Model
A type of predictive model that assigns a label or category to an input based on its
features. Examples include logistic regression, decision trees, and neural networks.

Clustering
A type of unsupervised learning that groups similar data points together based on
their features. Common algorithms include K-means and hierarchical clustering.

Confusion Matrix
A table used to describe the performance of a classification model, showing the true
positives, false positives, true negatives, and false negatives.
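
As a minimal illustration of this layout, assuming scikit-learn is installed
(the small label vectors are illustrative):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))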

Correlation
A statistical measure that describes the strength and direction of a relationship
between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive
relationship, -1 indicates a perfect negative relationship, and 0 indicates no
relationship.

Cross-Validation
A technique used to assess the performance of a model by dividing the data into
multiple subsets (folds) and training/testing the model on different combinations of
these subsets.
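
A minimal sketch of k-fold cross-validation, assuming scikit-learn is
installed; the synthetic data and the choice of model are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# cv=5 trains and evaluates the model on five different folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())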

Curse of Dimensionality
A phenomenon where the performance of a model degrades as the number of
features (dimensions) increases, due to the sparsity of data in high-dimensional
space.

Data Augmentation
A technique used to increase the diversity of training data by applying random
transformations such as rotation, flipping, or scaling, commonly used in image
processing.

Data Imputation
The process of replacing missing data with substituted values, commonly using
mean, median, or mode, or more complex methods like k-nearest neighbors or
regression.
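
A minimal sketch of mean imputation, assuming scikit-learn is installed; the
small array containing np.nan values is illustrative.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))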

Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a suitable
format for analysis and model building.

Decision Trees
A type of model that splits data into branches based on feature values, creating a
tree-like structure to make decisions or predictions.

Deep Learning
A subset of machine learning involving neural networks with many layers (deep
networks) that can learn from large amounts of data to model complex patterns.

Dependent Variable (Target Variable)
The variable that a model aims to predict, also known as the outcome or target
variable.

Dimensionality Reduction
Techniques used to reduce the number of features in a data set while retaining as
much information as possible, often through methods like PCA (Principal
Component Analysis) or t-SNE.

Distributions
A statistical function that describes the likelihood of different outcomes in a data
set. Common distributions include normal, binomial, and Poisson distributions.

Dummy Variable Trap
A situation in regression analysis where the inclusion of all dummy variables (binary
variables) for a categorical feature leads to multicollinearity. This issue is typically
avoided by omitting one dummy variable.
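
A minimal sketch, assuming pandas is installed; the toy column is
illustrative. Setting drop_first=True omits one dummy level per categorical
feature, which is the usual way to avoid the trap in regression models.

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Creates color_green and color_red; the blue level is the omitted baseline
print(pd.get_dummies(df, columns=["color"], drop_first=True))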

Ensemble Methods
Techniques that combine multiple models to improve overall performance.
Common methods include bagging, boosting, and stacking.

Epoch
A single pass through the entire training data set during the training process of a
machine learning model, particularly in neural networks.

Explainability (Interpretability)
The extent to which a human can understand the decisions or predictions made by a
machine learning model. Explainability is crucial in fields like healthcare and finance,
where understanding model decisions is essential.

F1 Score
A performance metric that combines precision and recall, calculated as the
harmonic mean of precision and recall. It provides a single score that balances the
two metrics.
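
A minimal worked example of the harmonic mean behind the F1 score; the
precision and recall values are illustrative.

precision = 0.80
recall = 0.60

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.686 -- pulled toward the lower of the two metrics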

False Negative
A situation where a model incorrectly predicts the negative class when the actual
class is positive.

False Positive
A situation where a model incorrectly predicts the positive class when the actual
class is negative.

Feature Engineering
The process of selecting, transforming, and creating new features from raw data to
improve model performance.

Feature Importance
A metric that measures the contribution of each feature in predicting the target
variable, often used in tree-based models.

Feature Scaling
A method used to normalize or standardize the range of independent variables
(features) to ensure that no single feature dominates the learning process.
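
A minimal sketch contrasting standardization with min-max normalization,
assuming scikit-learn is installed; the toy feature values are illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

print(StandardScaler().fit_transform(X).ravel())  # mean 0, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # rescaled to [0, 1]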

Gradient Descent
An optimization algorithm used to minimize the loss function by iteratively adjusting
the model parameters in the direction of the steepest decrease in the loss.
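
A minimal sketch of gradient descent on the one-dimensional loss
f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the starting point and
learning rate are illustrative.

w = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient  # step in the direction of steepest descent

print(round(w, 4))  # approaches the minimum at w = 3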

Gradient Boosting Machines (GBM)
An ensemble technique that builds models sequentially, where each new model
attempts to correct the errors of the previous ones, leading to highly accurate
predictions.

Hyperparameter Tuning
The process of optimizing the hyperparameters of a model (settings that are not
learned from data) to improve its performance.

Hypothesis Testing
A statistical method used to test an assumption (hypothesis) about a population
parameter, often using p-values to determine significance.

Imbalanced Data
A situation where one class is significantly more frequent than the other(s) in a
classification problem, potentially leading to biased models.

Independent Variable (Predictor)
A variable used to predict the outcome (dependent variable) in a model. Also known
as a feature.

Inference
The process of making predictions or drawing conclusions from a trained machine
learning model on new, unseen data.

K-Nearest Neighbors (KNN)
A simple, instance-based learning algorithm that classifies a data point based on the
majority class of its k-nearest neighbors.

K-Means Clustering
An unsupervised learning algorithm that partitions data into k clusters, where each
data point belongs to the cluster with the nearest mean.
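
A minimal sketch, assuming scikit-learn is installed; the two small groups of
points and the choice of k = 2 are illustrative.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two cluster means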

Lasso Regression
A type of linear regression that includes L1 regularization, which penalizes the
absolute value of coefficients, leading to sparse models where some coefficients are
exactly zero.

Learning Rate
A hyperparameter that controls the step size during gradient descent, determining
how quickly or slowly a model learns.

Linear Regression
A type of regression model that predicts the value of a continuous dependent
variable based on one or more independent variables using a linear equation.

Logistic Regression
A classification model that predicts the probability of a binary outcome using a
logistic function.

Loss Function
A function that measures how well a model's predictions match the true values,
guiding the optimization process during model training.

Machine Learning (ML)
A branch of AI focused on building systems that learn from data to make predictions
or decisions without being explicitly programmed.

Model Evaluation
The process of assessing the performance of a machine learning model using various
metrics like accuracy, precision, recall, F1 score, and AUC.

Multicollinearity
A situation in regression analysis where independent variables are highly correlated,
leading to unreliable coefficient estimates.

Neural Networks
A class of machine learning models inspired by the human brain, consisting of layers
of interconnected nodes (neurons) that process data and learn complex patterns.

Normalization
The process of scaling data to a standard range, often between 0 and 1, to improve
model performance and convergence.

Optimization
The process of finding the best set of parameters for a model that minimizes or
maximizes a given objective function, often using algorithms like gradient descent.

Overfitting
A modeling error that occurs when a model is too complex and captures noise in the
training data, leading to poor generalization on new data.

P-Value
A statistical measure that helps determine the significance of the results. In
hypothesis testing, a low p-value (< 0.05) indicates strong evidence against the null
hypothesis.

Precision
A metric for classification models that measures the accuracy of positive predictions
(True Positives / (True Positives + False Positives)).

Predictor
See Independent Variable.

Principal Component Analysis (PCA)
A dimensionality reduction technique that transforms high-dimensional data into a
lower-dimensional form by identifying the most important features.
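
A minimal sketch, assuming scikit-learn is installed; the synthetic
10-feature data set and the choice of two components are illustrative.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=10, random_state=42)

# Standardize first so that no single feature dominates the components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component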

Random Forest
An ensemble learning method that builds multiple decision trees and aggregates
their predictions to improve accuracy and reduce overfitting.

Recall (Sensitivity)
A metric for classification models that measures the ability to identify all positive
instances (True Positives / (True Positives + False Negatives)).

Reinforcement Learning
A type of machine learning where an agent learns to make decisions by interacting
with an environment and receiving feedback through rewards or penalties.

Regression Model
A type of model that predicts a continuous outcome based on one or more
independent variables. Examples include linear regression and polynomial
regression.

Regularization
A technique used to prevent overfitting by adding a penalty to the loss function,
encouraging simpler models. Common methods include L1 (Lasso) and L2 (Ridge)
regularization.

Ridge Regression
A type of linear regression that includes L2 regularization, which penalizes the
square of coefficients, helping to reduce model complexity and multicollinearity.

Sampling
The process of selecting a subset of data from a larger data set for analysis, often
used in the context of training machine learning models.

Stochastic Gradient Descent (SGD)
An optimization algorithm used to minimize the loss function by updating the
model's parameters for each training example, rather than for the entire data set. It
is often faster than standard gradient descent and is used in large-scale machine
learning models.

Supervised Learning
A type of machine learning where the model is trained on labeled data, meaning
that the input data comes with corresponding output labels. The model learns to
map inputs to outputs and is evaluated based on how well it can predict the labels
for new data.

Support Vector Machines (SVM)
A classification algorithm that finds the optimal hyperplane that best separates data
points of different classes with the maximum margin. It can also be used for
regression tasks.

Synthetic Data
Artificially generated data that mimics the properties of real data. It is often used to
augment data sets, perform privacy-preserving data analysis, or test models in
controlled environments.

Target Variable (Dependent Variable)
The variable that a model aims to predict, also known as the output or response
variable. In supervised learning, the model learns to predict this variable based on
input features.

Test Set
A subset of data used to evaluate the performance of a trained model. The test set
is separate from the training set and is used to assess how well the model
generalizes to new, unseen data.

Time Series Analysis
A type of analysis used to model and predict data points that are ordered in time. It
is commonly used in finance, economics, and weather forecasting. Techniques
include ARIMA, exponential smoothing, and seasonal decomposition.

Train/Test Split
The process of dividing a data set into two parts: one for training the model (training
set) and one for testing its performance (test set). This split allows for an unbiased
evaluation of the model's generalization to new data.
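
A minimal sketch, assuming scikit-learn is installed; the synthetic data and
the 70/30 split are illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Hold out 30% of the rows for an unbiased evaluation of the trained model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # (700, 5) (300, 5)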

Transfer Learning
A machine learning technique where a model developed for a particular task is
reused as the starting point for a model on a second task. It is commonly used in
deep learning, where pre-trained models are fine-tuned on new tasks.

Tuning
The process of optimizing hyperparameters to improve the performance of a
machine learning model. Tuning is typically done using techniques like grid search,
random search, or Bayesian optimization.

Underfitting
A modeling error that occurs when a model is too simple to capture the underlying
patterns in the data, resulting in poor performance on both the training and test
sets. Underfitting often occurs when the model is not complex enough or when
there is not enough training data.

Unsupervised Learning
A type of machine learning where the model is trained on unlabeled data, meaning
the input data does not come with corresponding output labels. The model learns to
identify patterns, clusters, or structures in the data.

Validation Set
A subset of data used during model training to tune hyperparameters and prevent
overfitting. The validation set provides an additional check on model performance,
helping to select the best model before evaluating on the test set.

Variance
A measure of a model's sensitivity to fluctuations in the training data. High variance
can lead to overfitting, where the model performs well on the training data but
poorly on new, unseen data.

Vectorization
The process of converting data (e.g., text, images) into numerical vectors that can be
used in machine learning models. Vectorization is commonly used in natural
language processing (NLP) and computer vision.

Weights
Parameters in a machine learning model that are learned from the data, influencing
the model's predictions. In neural networks, weights are adjusted during training to
minimize the loss function.

Word Embedding
A type of word representation that maps words into continuous vector space where
semantically similar words are located closer together. Common techniques include
Word2Vec, GloVe, and FastText.

Wrapper Method
A feature selection method that evaluates subsets of features based on their
contribution to model performance. The method wraps around a model to select
the best-performing subset of features.

XGBoost
An optimized gradient boosting algorithm that is widely used in machine learning
competitions and real-world applications for its efficiency, speed, and performance.
XGBoost can be used for both classification and regression tasks.

Y Variable
Another term for the target or dependent variable, representing the output that a
model is trained to predict.

Zero-Inflated Model
A statistical model used for count data that has an excess of zero counts. These
models assume that the data-generating process has two parts: one that generates
the excess zeros and another that generates the counts.

Z-Score
A statistical measure that describes a data point's relationship to the mean of a
group of values. Z-scores are expressed in terms of standard deviations from the
mean, with a positive Z-score indicating a value above the mean and a negative Z-
score indicating a value below the mean.
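
A minimal worked example; the sample values are illustrative.

import numpy as np

values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
z_scores = (values - values.mean()) / values.std()

# Each score is the number of standard deviations from the mean
print(z_scores)  # approximately [-1.41, -0.71, 0.00, 0.71, 1.41]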

Index of Terms

activation 12, 297, 300, 304, 316, 318, 321, 467, 478

AIC 99, 104, 186–187, 196, 327, 356–357, 359–361, 369, 371–374, 387, 389, 391,
472, 478–479

alpha 195, 198, 209, 219, 232, 304, 357

anaconda 28, 46

API 193, 302, 321, 358, 378, 392

append 102, 108, 161–162, 223, 246, 262, 291, 312

architecture 72, 122, 297–300, 311–312, 316, 318, 321, 377–378, 467

ARIMA 488

array 15–16, 28, 32, 65, 92, 96, 328, 406

astype 159–160, 241

AUC 6, 12–13, 157, 159–165, 170–171, 174, 193, 215–216, 223–225, 233, 239–241,
246–247, 255–256, 262–263, 268, 272, 280, 284–285, 287, 291–292, 302, 307–
308, 311–314, 317, 321–323, 325, 327, 330–334, 336–339, 345–347, 373–374,
381–382, 385–387, 389, 458, 461, 470–471, 478, 485

autoencoder 122, 298

backpropagation 297, 316, 318, 467, 479

bagging 226, 252, 264, 479, 481



baseline 6, 116, 298, 383

batch 181, 301, 375

Bayesian 71, 166, 214, 327, 369, 372, 391, 470, 472, 479, 488

bias 16, 86, 128–129, 131–132, 144–146, 150, 155, 180–181, 185, 191, 206, 214–
215, 279, 281–282, 297, 362, 364, 384, 455–456, 479, 483

BIC 99, 187, 196, 327, 356–357, 359–361, 369, 372–374, 387, 389, 391, 472, 479

binary 49, 69–72, 75–76, 80, 109, 113–117, 124, 131, 137, 141, 144, 164, 168, 171,
176–184, 191, 202, 207, 215, 227, 232, 245, 247, 253, 259, 262–263, 324, 327,
336–337, 340, 347, 373–374, 449, 452, 456, 458–460, 481, 484

binning 84, 392

binomial 11, 198, 232, 307, 481

black-box 234, 299

boosting 5, 12, 70, 72, 110, 116, 132, 147, 159, 163–166, 174, 177, 205, 215, 226–
227, 234, 249–256, 258–259, 261–264, 266–272, 276–277, 442, 463–466, 481,
483, 490

bootstrapping 235, 237, 252, 266, 269, 463, 479

boxplot 11, 141

capped 2, 66, 87–89, 144–145, 168, 173

cardinality 116, 118–119, 236

caret 10, 13, 36, 95, 99–101, 106–107, 114, 141, 173–174, 196, 218–219, 232–233,
242–243, 248, 258, 263, 271–272, 286, 305, 313, 321, 328, 377, 397, 441

CatBoost 250, 267

categorical 2, 68–69, 110, 113, 115–118, 123–124, 138, 140, 143–147, 155–156,
158, 160–162, 168–169, 171–173, 175–177, 191–192, 203, 206, 214, 228–229,
232, 236, 239, 250, 255, 280, 301, 313, 324, 327, 372–373, 376, 455–460, 470,
481

champion 154, 156, 159–163, 165

character 10, 76–77, 80–81, 118, 220

CHISQ 332

classification 2, 6–7, 49, 65, 68–73, 85, 107–110, 112, 131–132, 144, 146, 164, 168,
171, 175–177, 179, 182, 190, 193, 202, 207–210, 215–216, 224, 227, 230, 232,
234–235, 238–240, 248–249, 252–256, 263, 266, 271, 273–275, 278–279, 282–
285, 287, 293, 297, 301–302, 313–315, 318, 320–321, 324–325, 327–331, 334–
338, 340–341, 345, 347, 349, 353, 355, 370, 372–374, 381, 386–387, 389–390,
441, 456, 458–460, 466–468, 470, 473, 478–480, 483–484, 486–487, 490

clean 38, 56, 65, 135–136, 144, 300, 452, 460

cloud 21, 378–379, 416, 438

cluster 21, 55, 59, 61, 65, 328, 340, 450, 480, 484, 489

CNN 298

coefficients 180, 183–184, 186, 188–192, 196–198, 228, 306, 376, 461, 484, 487

combine 28, 34, 49, 65–67, 70–71, 111, 119–120, 132–133, 135–136, 173, 186, 192,
195, 205, 212, 217, 225–226, 238, 249, 257, 264, 266, 273, 285, 287, 304, 307,
323, 386, 399–400, 404, 406–408, 410–411, 420–422, 426, 463, 474–475, 479,
481–482

compile 12

complexity 41, 86, 94, 118, 121–122, 166, 186–187, 189–190, 209–210, 214–215,
219, 228, 230, 232, 236, 238, 251, 254–255, 268, 272, 277–278, 282, 293, 300,
307, 327, 368–369, 371–372, 374, 387, 389, 461–462, 472, 476–477, 487

concatenate 117, 134, 158, 160–161, 195, 217, 257, 285, 304

conditional 10, 182

configuration 21, 36, 213, 317, 378, 404

confusion matrix 6, 12, 18, 23, 157, 160, 170–171, 174, 193, 199, 202, 210, 215–
216, 219, 224, 233, 240, 243, 248, 256, 263, 282, 284–285, 287, 302, 307–308,
313, 327, 330–333, 341–342, 373, 389, 458, 470, 480

console 1, 19, 28–30, 32–34, 37, 40–41, 44–45, 396, 398, 402, 408, 438, 447

constraints 150, 152, 209, 294, 387, 395, 415

container 378, 392, 472

continuous 68–70, 87, 113–114, 123–124, 146–147, 176, 179, 182, 192, 229, 323–
324, 337, 355–356, 372–373, 375, 379, 385, 387, 420, 427, 459–460, 470, 484,
486, 490

convergence 180–181, 301, 485

convolutional 298

correlation 2, 95–98, 103, 116, 123–125, 137, 139, 141, 185–186, 191–192, 200,
203, 309, 313, 454, 480, 485

COSTCOMPLEXITY 222–223, 332

covariance 124

CPUs 236, 252, 277

cross-entropy 180–181, 253, 297



crosstabulation 3, 76–77, 99–100, 149–153, 169, 171, 210, 215, 250, 278, 282, 301,
317, 457–459, 462, 480

crowd-sourced 85

csv 10, 23, 25, 30, 53–54, 58, 63, 157, 193, 196, 216, 218, 240, 242, 248, 256–258,
263, 282, 285, 287, 302, 306, 313, 449

cutoff 95, 127, 347

cvfit 198

dashboard 21–22, 89, 380, 385, 387, 390, 393, 447, 473

dataframe 31–32, 54, 108, 158, 160–161, 194–195, 217, 225, 257, 285, 287, 303–
304, 328, 332, 351, 356, 358, 434

datalines 402, 405

date 66, 76–80, 136–137, 139, 157–158, 162, 453

DBMS 10, 53–54, 63

debugging 19, 24, 27, 29, 34, 41, 43, 45–46, 445, 447, 477

DecisionTreeClassifier 11, 157, 159, 215, 217, 219, 232

decoder 122, 298

decomposition 125, 488

deploy 33, 136, 148, 153, 156, 163, 166, 297, 314, 323, 341, 355, 374–380, 387, 389,
392, 416, 448, 459, 472

depth 159, 209, 215, 217, 241, 248, 257, 264, 267, 271–272, 300, 462, 466

deviation 86, 112, 125, 146, 212, 279, 281, 363–366, 380–382, 434, 453, 455, 468,
479, 491

differentiate 184, 340, 345, 354, 382, 458

dimensionality 2, 65, 116, 118–119, 121–125, 166, 281, 298, 454, 461, 480–481,
486

discrete 69–70, 113, 123, 324

distinct 69, 97, 118, 147, 149, 151, 273–274, 276, 324, 365, 373, 410, 415

distribution 6, 12, 15, 23, 27, 46, 55, 59, 74–77, 80, 82, 84–85, 87, 112, 114, 117,
129–135, 137, 145, 177–179, 183, 185, 191, 211–212, 274–275, 301, 325, 338,
340, 345, 347, 362–365, 379, 381, 383–384, 387, 392–393, 442, 453, 456, 471–
472, 481

Docker 7, 323, 376, 378–379, 387, 389, 392, 472

docstring 31

download 18, 28, 33, 36–37, 46, 49–50, 52, 61, 63, 68, 449–450

dplyr 90, 125, 133, 392–393, 414, 425

drift 323, 385

dropna 76, 102, 108

dtree 217, 219

dtype 91, 102, 108, 110, 114, 333–334

dummy 2, 4, 111, 115–118, 124, 144, 146, 162, 173, 176, 191, 200, 203, 206, 220,
224, 227, 301, 308, 313, 481

duplicate 194, 203, 217, 225, 303, 332, 351



EDA 2, 81–82, 137, 139, 141, 453

editor 1, 19, 28–30, 32, 34, 36–37, 41

eigenvalues 126–128

elbow 128

encoding 2, 110–111, 115–117, 138, 140, 143–147, 155–156, 158, 160, 162, 168–
169, 171–173, 191, 232, 301, 376–377, 455–460

ensemble 5, 11–12, 70–72, 108, 132, 157, 174, 177, 205, 215, 226, 234–240, 249–
252, 254–255, 264–267, 269, 271, 273, 276–277, 329, 331, 463, 465, 479, 481,
483, 486

entropy 207–208, 228, 238, 253, 267, 332, 461, 464

epoch 181, 297, 481

epsilon 321

estimate 68, 86–87, 93, 104–105, 132, 146–147, 150, 152–153, 159–160, 169, 182,
185–186, 191–192, 195–196, 198, 217, 239, 241, 248, 252, 257–258, 264, 266,
271–272, 280, 286, 305, 324, 331, 372, 457–458, 463, 466, 479, 485

excel 16, 43, 177, 206, 273, 296, 298–299, 415, 445, 449, 468

explainability 482

exponential 119, 182, 488

feature 2, 4–5, 11, 19, 21–22, 27–29, 31, 33–34, 36–38, 40, 43, 45–49, 57, 65, 68,
70–71, 73, 81–82, 93–95, 97–103, 105–116, 118–125, 135, 137–141, 143–145,
153–156, 158, 160–163, 168–169, 171, 173, 176, 190, 194, 204–208, 214, 227–
228, 230, 232, 236–239, 250, 252–253, 255–256, 266–267, 269, 271–272, 277–
279, 281–282, 284, 295, 298–299, 301, 303, 313, 316, 320, 322, 328, 342, 377–
378, 380, 383–385, 415, 417, 419, 440, 445–446, 449, 451–452, 454–456, 458,
460–461, 464–468, 472, 479–483, 486, 488, 490

feedforward 298, 318, 468

filter 2, 49, 80, 90–91, 95, 97–98, 101–102, 104–105, 107–110, 133, 139, 250, 454

fine-tune 136, 148, 151, 156, 225, 297, 349, 464, 488

fit 4, 10–12, 94, 103, 108, 114, 120, 126, 158–160, 186–188, 195, 197, 200, 209–
210, 213, 217, 241, 249, 258, 278, 286, 304–306, 309, 326–327, 331, 358–359,
362, 366, 368–372, 472, 478

fixed-form 175

floor 88

FNN 298

fold 149, 151, 292, 457–458, 480

forecasts 355, 373, 387, 488

forest 4–5, 11, 65, 70–72, 94, 107–110, 113, 115–116, 132, 147, 159, 163, 177, 205,
215, 226–227, 234–243, 245, 247–253, 256, 264, 266–272, 276–277, 282, 327,
329–331, 349, 351, 463–466, 486

format 10, 24, 26, 48, 53–54, 57, 77–78, 81, 144, 146, 162, 168, 222, 245, 261, 291,
294, 311, 376, 408, 412, 419, 436, 449, 451, 456, 459, 480

FPR 13, 159–161, 196, 210, 218, 242, 258, 286, 305, 325, 332, 344, 346

framework 7, 19, 28, 44, 66, 116, 153–154, 175, 385, 388, 448

fraud 69, 130, 327–335, 337, 339–345, 347–353, 355, 373, 382, 386, 473

freeware 16–17, 22

frequency 12, 23, 74–77, 80, 117–118, 133–135, 174, 201–202, 211, 219, 221–222,
233, 244–245, 260–261, 289–290, 310, 325, 332, 357, 383

gain 6, 29, 82, 94, 119, 123, 152, 156, 207–208, 215, 228, 238–239, 252, 256, 301,
323, 326–327, 330–334, 348, 350–355, 380

gamma 321, 470

GBM 12, 205, 271, 483

generalize 86, 94, 129, 148, 152–153, 189, 213, 249, 278–279, 383, 386, 461–463,
465

GETNAMES 10, 53, 63

ggplot 10, 350, 357

ggtitle 350

gini 6, 12, 207–208, 211–212, 228, 233, 238, 253, 267, 325, 327, 330, 332–334, 336,
345–347, 373, 381–382, 386–387, 389, 461–462, 464, 471

github 8, 50, 68, 157, 169, 176, 440–444

GLMSELECT 11, 99, 101–104, 106, 118, 141, 174, 183, 232

goodness-of-fit 186–187, 192

GOSS 250

GPUs 298

gradient 5, 12, 70, 72, 110, 112, 116, 132, 147, 157, 159, 163–166, 174, 177, 180–
181, 205, 215, 227, 234, 249–256, 258–259, 261–264, 266–272, 276–277, 297–
298, 316, 442, 455, 463–467, 483–485, 487, 490

graph 18, 22, 30, 36, 45, 85, 103, 127, 164, 202, 442, 445

GridSearchCV 159, 193, 195, 215, 217, 232, 240–241, 256–257, 284, 286, 302, 304,
321

groupby 332, 351

GRU 298

hardware 284, 298–299, 375, 378

harmonic 211, 325, 343, 482

heteroscedasticity 362–363

hierarchical 204–205, 273, 480

high-dimensional 71, 122–123, 177, 254, 273–274, 276–278, 280, 314–316, 466,
469, 480, 486

high-performance 18, 40, 250

histogram 11, 25, 84–85, 137, 139, 141, 338, 340, 357, 359, 361, 364–365, 453

homoscedasticity 179, 207, 362, 461

Hosmer-Lemeshow 186–187

HPFOREST 11, 109, 174, 245, 247, 271–272

HPGENSELECT 106, 202



HPNEURAL 311–312

HPSPLIT 11, 222–223, 232, 332

HPSVM 291–292

hyperparameter 4, 143, 148, 150, 154–156, 159, 163, 166, 171, 195, 203, 213–215,
217, 222–223, 225, 229–230, 232, 235, 239–243, 245–248, 250, 255–257, 261–
264, 270, 280–283, 286, 288, 290, 292, 297, 299–301, 304–305, 307, 311–312,
314, 317–319, 321, 375, 457–459, 462, 466, 468–470, 483–484, 488–489

hyperplane 71, 273–276, 279, 281, 315, 466–467, 469, 487

IDE 1, 7, 17, 19, 22–23, 28–30, 33–37, 39–41, 43, 45, 395, 400–402, 406, 421, 445,
447

if-else 75, 141, 173, 198–199, 243, 307–308, 393

iloc 96, 194, 217, 303

imbalance 128–133, 144–145, 168, 171, 173, 190–191, 206, 239–240, 255–256, 268,
270, 280–283, 301, 316, 318, 336, 342–343, 374, 456, 466, 468–469, 483

import 1, 10–13, 23, 32, 38, 47, 49, 52–54, 58–59, 61, 63, 75, 84, 91, 95, 102, 108,
114, 120–121, 125, 133–134, 157, 159–160, 193, 202, 215, 218, 224, 240, 242,
248, 256, 258, 263, 282, 284, 286, 302, 313, 328, 331, 350, 357–358, 396, 398,
401, 404–405, 425–426, 428, 449–451

importance 10, 39, 43–44, 47, 49, 56, 58–61, 68, 71–72, 76, 82, 85, 88, 94, 100, 103,
105–106, 108–110, 115, 118–120, 122–125, 128–132, 136–137, 139–140, 143,
146–147, 149, 151, 163, 168–169, 171, 176, 181, 183, 185, 188, 190, 210, 213,
215, 225, 227, 229–230, 236, 238–239, 251, 253–256, 267–269, 271, 277, 279,
283–284, 301, 313, 315–316, 318, 323, 340, 342–343, 361, 365, 370, 379, 383,
385–389, 446–447, 449, 452–455, 458, 462, 464–469, 471–472, 482, 486

impurity 207–208, 228, 230, 238, 253, 267, 461, 464



imputation 11, 55, 59, 80, 93, 143–147, 154–158, 160, 162, 168–169, 173, 176, 206,
227, 239, 279, 281, 322, 376, 456, 480

independent 17, 49, 119, 148, 154, 182, 184–186, 192, 228, 368, 454, 460, 471,
482–486

index 7, 46, 82, 102, 108, 110, 158, 160–162, 287, 351, 383–384, 387, 389, 392–393,
436, 472

indicator 48, 74–75, 114–115, 117, 124, 327, 329, 380, 438

inference 2, 89, 93, 365, 479, 483

integral 175, 210

interaction 2, 19, 28–29, 32–34, 40–41, 44–45, 111, 119–121, 131, 138, 142, 144–
145, 158, 160–162, 168, 173, 182, 188–189, 253, 272, 393, 399–400, 402, 408,
410, 419, 423, 426, 429, 447, 474–475, 477

intercept 105, 180, 184

interquartile 2, 87, 89–92, 137, 139, 454

interval 109, 222–223, 245, 247, 262–263, 291–292, 364

IQR 2, 11, 87–92, 137, 139, 141, 454

kaggle 16, 39, 49, 429, 436

KDE 359

keras 12, 320–322, 427–428, 443

kernel 5, 7, 166, 274, 276, 278, 280–284, 286, 288, 291–292, 315, 317–318, 320,
402–403, 419, 425, 466, 469

k-fold 149–151, 169, 171, 458

knit 408, 412

KNN 110, 112–113, 480, 484

Kolmogorov-Smirnov 345, 347, 381, 471

KPI 380–381

Kubernetes 7, 323, 378–379, 387, 389, 392, 472

kurtosis 365

lambda 198, 332

lasso 106, 189, 193, 195, 198, 202–203, 484, 487

layer 71, 209, 293, 295–299, 304, 316–317, 321, 467–470, 481, 485

lbfgs 304

leaf 159, 176, 204–205, 208–209, 215, 217, 261, 263, 272, 465

leakage 153, 155–156, 169

libraries 7, 11–13, 15, 19–21, 28–29, 36–37, 39–41, 43–46, 52–53, 58, 61, 74, 90, 95,
99–101, 106–107, 114, 125, 133, 193, 196, 199–200, 202, 215, 218, 220, 224,
240, 242–244, 248, 256, 258, 260, 263, 279, 282, 284, 286–287, 289, 302, 305,
309, 313, 328, 356, 360, 376, 378, 392, 395, 397–399, 404, 406–407, 409–410,
414, 421, 423, 427–428, 430, 440, 442–443, 445–446, 448–449, 474–477

lift 6, 13, 326–327, 330–334, 347–355, 390, 473

LightGBM 70, 132, 250, 267, 269, 465



likelihood 49, 73, 181–182, 185–187, 192, 337, 339, 344, 369, 463, 481

linear 3, 11, 70–71, 89, 94, 102, 106, 112, 119, 122, 157, 167, 175, 177–181, 183–
186, 190–193, 205, 225, 227–228, 230, 273, 276, 278, 284, 286, 288, 292, 294–
296, 315, 320, 355, 357–358, 360, 362–363, 367, 460, 462, 466–467, 484, 486–
487

logarithm 119, 180, 184

logistic 3–4, 11–13, 70, 94, 102–103, 106, 112, 115, 132, 157, 159, 163–167, 174–
193, 196, 199, 202, 204–206, 223–225, 227–233, 246–247, 253, 259, 262–263,
291–292, 311–312, 315, 321, 332, 460–462, 479, 484

logit 183–184, 202, 230, 460

log-likelihood 186–187

log-odds 205

loop 10, 163, 222, 245, 261, 290, 311

LSTM 298

Lua 416

macro 222–223, 245–246, 261–262, 290–292, 311–312

MAE 13, 239, 255, 280, 324, 326, 356–358, 360–361, 366–367, 370–373, 380, 387,
389, 391, 458, 471

manifold 122

map 122, 176, 439, 460, 466, 469, 487, 490

markdown 408

matplotlib 10, 13, 19, 28–29, 84, 141, 157, 193, 215, 224, 240, 256, 282, 284, 302,
350, 358, 393

matrix 6, 12, 95–96, 98, 117, 124, 157, 160, 170–171, 173–174, 186, 193, 198–199,
202, 210, 215–216, 224, 233, 240, 248, 256, 259, 263, 282, 284–285, 302, 307–
308, 313, 327, 330–333, 341–342, 373, 389, 458, 470, 480

maxdepth 222–223, 232, 245, 247, 261, 263, 271

MDI 238

mean 11, 13, 82–83, 85–86, 104, 112, 114, 126, 145–146, 168, 173, 180, 183, 185,
209, 211, 234, 238–239, 249, 253, 255, 266, 279–281, 297, 324–326, 331–333,
343–344, 349, 351–352, 356–358, 360, 366–368, 370–371, 373, 389, 391, 456,
458, 468, 471, 480, 482, 484, 491

median 82–83, 93, 145, 158, 160, 162, 168, 173, 456, 480

memname 200, 220, 244, 260, 289, 309, 360

memory 24, 280, 298, 465, 468–469

mend 223, 246, 262, 292, 312

metadata 23, 32, 80–81, 412

metrics 4–7, 12–13, 68, 73, 75, 99–101, 107, 129, 154, 156–157, 160, 164–165, 170–
171, 174, 186–187, 192–193, 195, 202–203, 209–212, 215, 218–219, 224–225,
228, 230, 233, 236–241, 243, 248, 252–256, 258–259, 263–264, 266, 268–270,
272, 279–288, 302, 305, 307–308, 313–314, 317, 321–327, 329–335, 337, 340–
342, 345–347, 349, 355–358, 360–361, 366–368, 370, 372–374, 379–391, 437,
458, 461–463, 465–466, 469–473, 478, 482, 485–486

mini-batch 181

Min-Max 112, 141

misclassification 129–130, 210, 276, 278–279, 320, 322, 344, 469



missing 11, 24, 49, 55, 57, 66, 74–75, 80, 82, 93, 97, 101–102, 107–109, 135–136,
143–147, 154–155, 158, 160, 162, 168–169, 171, 173, 176, 194, 197, 201, 203, 206,
216, 218, 221, 224, 227–228, 239, 241–242, 244, 257–258, 260, 279, 281–282, 285,
287, 289, 300, 303, 307, 310, 322, 335, 339–340, 343, 365, 376–378, 452, 456, 460,
480

MLE 185

MLP 157, 159, 302, 304, 311–312, 314

MLR 99–100

monitoring 7, 163, 212, 297, 314, 323, 375, 379–390, 392–393, 472–473

mse 13, 104, 239, 255, 280, 323–324, 326, 356–358, 360–361, 366–367, 370–374,
380, 387, 389, 391, 458, 470–471

mtry 243, 271

multi-class 70–71, 131, 373

multicollinearity 111, 116, 123, 185–186, 191–192, 194, 206, 228–230, 232, 236,
240, 251, 269, 276, 281, 301, 303, 454, 460–462, 465, 481, 485, 487

multilayer 295, 314

multi-output 71–72

multi-target 2, 71

naive 71

neural 5–6, 12, 71–72, 112, 122, 157, 159, 163–166, 177, 190, 227, 265, 273, 293–
302, 304–305, 307–308, 311–322, 441, 467–470, 478–479, 481, 485, 490

nlp 438, 444, 489



nnet 12, 305, 307–308, 320–321

node 8, 71–72, 176, 204–205, 207–209, 228, 230, 235, 238, 248, 252, 293, 295–296,
378, 416–417, 422, 425, 461–462, 478, 485

non-parametric 175, 204–205, 228

normalization 110, 112, 114, 279, 376, 485

numpy 11, 19, 28, 40, 43, 45, 75, 91, 102, 108, 157, 193, 302, 328, 331, 350, 357,
392, 446, 450

odds 3, 180, 182–185, 188–189, 192, 205, 227–228, 230, 460

OneHotEncoder 110–111, 145–146, 157–158, 160, 162, 168, 172–173, 191, 232,
459

OOB 4, 236–237, 239, 252, 266, 269, 271, 463

OOT 149, 154, 157–158, 160–163, 165, 169, 192–203, 215–221, 223–225, 240–245,
247–248, 256–261, 263–264, 281–283, 285–290, 292, 302–303, 305–310, 312–
314

opensource 8, 20, 28, 43, 45, 378, 416, 436, 440, 445

optimization 18–19, 29, 40, 43, 45–46, 63, 99, 106, 110, 112, 143, 145, 148, 156,
166, 214, 229, 249, 265, 276, 284, 297, 299, 301, 312, 314, 316–317, 321, 337–
339, 341, 375, 447, 464, 469–470, 483–485, 487–488, 490

order 3, 18, 76, 113, 115–116, 123, 126, 145–146, 152–153, 169, 194, 197, 216, 219,
291–292, 303, 307

ordinal 113, 115, 123–124

orthogonal 122

outlier 2, 11, 55, 65–66, 82–93, 103, 112, 131, 136–137, 139, 141, 143–147, 154–
156, 158, 160, 162, 168–169, 171, 173, 175, 193, 206, 214, 227–229, 278–281,
300, 302, 362–367, 370–371, 387, 453–456, 461–462, 471

out-of-bag 4–5, 236–237, 252, 266, 269, 271, 463

out-of-time 148–149, 151–156, 162, 169, 171, 192, 202–203, 215, 240, 256, 281–
282, 457–458

overfitting 86, 93–94, 112, 118, 122, 137, 148, 186–187, 189–190, 205, 209, 213–
215, 228–230, 234, 236–239, 250–255, 266–269, 272, 274, 277–278, 297, 299–
300, 315–316, 318, 329, 340, 368–369, 371–372, 374, 454, 461–465, 468–469,
478, 485–487, 489

oversampling 55, 130, 145, 168, 173–174, 191, 206, 239, 255–256, 280, 282, 301,
316, 374, 456, 469

pandas 10, 19, 28, 40, 43, 45, 53, 58, 61, 63, 84, 95, 102, 108, 114, 121, 125, 133,
141, 157, 193, 215, 224, 240, 248, 256, 263, 282, 284, 302, 313, 328, 331, 350,
357, 392–393, 414, 425, 428, 430, 443, 446, 449–450

panel 1, 29–31

parallelization 236, 250, 252, 277, 284, 299, 446, 464

parameter 5, 63, 86, 97, 104–105, 143, 148, 150–151, 163, 180–182, 185, 196, 198,
209–210, 213, 219, 222, 227, 232, 245, 253, 261, 267, 278–280, 283–284, 290,
297, 311, 315, 317–318, 320–321, 369, 372, 462, 466, 468–469, 472, 478–479,
483, 485, 487, 489

parametric 204–205

pca 2, 122–126, 128, 481, 486

pearson 96–97

penalize 106, 186–187, 189, 210, 253–254, 276, 291–292, 327, 366, 368–369, 371–
372, 374, 465–466, 469, 471–472, 478–479, 484, 486–487

perceptron 295, 314

permutation 238, 267, 269, 465

php 15, 436

pickle 376, 392

pip 395, 399, 403–404

pipeline 3, 116, 135–136, 143, 145, 153–154, 156–157, 162–164, 166–172, 175–
176, 202–203, 227, 230, 282–283, 323, 376, 392, 420, 455–456, 458–460

plot 10, 12–13, 17, 29–31, 34, 78, 82, 84, 95, 126–128, 157, 159, 193, 196, 198–199,
202, 210, 215, 218, 220, 240–243, 248, 256, 258–259, 264, 282–284, 286, 288,
302, 305, 314, 325, 332, 346, 350–354, 357–364, 366, 417, 471, 478

Poisson 481

polynomial 111, 119–120, 144–145, 168, 173, 206, 278, 288, 292, 315, 320, 486

precision 12, 129–130, 157, 160–162, 164, 170, 174, 211–212, 228, 239, 255, 268,
272, 280, 314, 317, 321, 323–325, 327, 330–331, 333–340, 342–343, 355, 373–
374, 380, 385–386, 389, 458, 462, 466, 469–470, 482, 485–486

predict 12–13, 48–49, 51, 58, 68–73, 80, 86, 93, 95, 103, 114, 130, 146–147, 159–
161, 175, 180–189, 191, 193–198, 200–203, 216–225, 235, 240–245, 247–249,
257–260, 262–264, 266, 282–294, 302–313, 324, 327, 329–331, 338, 356–358,
368–369, 371–373, 391–392, 449, 452, 459, 471, 481, 483, 487–488, 490

preprocess 11, 111, 114, 119–121, 128, 143, 157, 161–162, 175–176, 202, 225, 227,
279, 282, 284, 288, 297, 299, 302, 307, 313, 322, 375–377, 387, 392, 451–452,
455, 472, 480

principal 2, 122–128, 481, 486

production 7, 323, 374–377, 379–382, 387, 389, 392, 472–473

prune 4–5, 209–210, 215, 222–223, 228, 230, 232, 236, 238, 254, 267, 272, 332,
461–462, 464

PSI 7, 314, 323, 380, 383–385, 387, 389, 392, 472

p-value 187, 483, 486

pyplot 10, 13, 84, 157, 193, 215, 240, 256, 284, 302, 350, 358

pytorch 442, 448

quadrant 17

quadratic 283

qualitative 177

quantile 11, 90, 92, 141, 173, 331, 349

quartile 87–91, 454

random 4–5, 11, 54–56, 59, 61, 63–64, 70–72, 94, 107–110, 113, 115–116, 118, 132,
134, 147, 159, 163–164, 166, 177, 180, 185, 194–195, 201, 205, 214–215, 217,
221, 225–227, 229, 234–245, 247–253, 256–257, 260, 264, 266–272, 275–277,
282–283, 285–286, 290, 301, 303–304, 310, 317, 326–331, 336, 346–349, 351,
353–354, 358, 450, 459, 463–466, 470, 480, 486, 488

RandomForestClassifier 11, 108, 157, 159, 174, 240–241, 271, 331

RandomizedSearchCV 215, 232, 321



rank 13, 197, 271, 306, 333, 352

RBF 278, 284, 286, 292, 318, 320, 469

real-time 45, 374–375, 380, 446–447, 473

recall 12, 129–130, 157, 160–162, 164, 170, 174, 211–212, 228, 239, 255, 268, 272,
280, 314, 317, 321, 323–325, 327, 330–331, 333–337, 339–340, 343, 355, 373–
374, 380, 385–386, 458, 462, 466, 469–470, 482, 485–486

regression 2–4, 6, 11, 21, 65, 68–71, 89, 94, 106, 111–112, 115–116, 131–132, 146–
147, 159, 163–167, 174–193, 195–196, 198–200, 202–206, 209–210, 219, 225,
227–232, 234–235, 238–239, 249, 252–253, 255, 266, 279–280, 283, 294–297,
309, 315, 321, 324–326, 355–358, 360–364, 366, 370, 372–373, 386–387, 389,
391, 441, 454, 459–462, 470–472, 479–481, 484–487, 490

regularization 4–5, 105–106, 112, 189–190, 202, 251, 253–254, 267–269, 272, 277–
278, 299–300, 315, 317, 320–321, 454, 464–466, 484, 487

relu 12, 296, 304, 316, 321, 478

replace 10, 56, 89–92, 133–134, 146, 356, 398, 401, 413

report 51, 89, 193, 202, 215–216, 224, 240, 248, 256, 263, 282, 284–285, 287, 302,
313, 331, 408

residual 6, 179, 210, 249–251, 267, 276, 324, 357, 359–366, 373–374, 387, 391

reticulate 7, 407–409, 421, 423, 425–426, 474, 476

RFE 100–103, 141, 232

rget 13

ridge 106, 189, 487

RMarkdown 8, 407, 411–413, 421, 423, 425–426, 446, 448, 474



RMSE 13, 326, 356–358, 360–361, 366–367, 371–374, 387, 391

RNN 298

ROC 12–13, 157, 159–162, 174, 193, 196, 198–199, 202, 210, 212, 215–216, 218,
220, 223–225, 228, 230, 233, 239–243, 246–248, 255–256, 258–259, 262–264,
272, 280, 282–288, 291–292, 302, 305, 307–308, 311–314, 321–322, 325, 327,
330–332, 336–337, 339, 345–346, 461–462, 466, 469, 471, 478

RSASSA 8, 410, 413, 421, 423, 425–426, 474, 476

RShiny 392

R-squared 104, 239, 255, 280, 324, 326–327, 356–358, 360–361, 368, 371–373, 387,
389, 391, 458, 470–471

RSS 179, 181

rstats 432

runtime 22, 166

sampling 1, 54–55, 59, 61–64, 133, 145, 168, 174, 194, 197, 203, 217, 224, 241–242,
248, 250, 257, 259, 264, 282, 284–285, 287, 303, 306, 313, 450–451, 463–465,
487

saspy 7, 21, 404–405, 421, 423, 425–426, 442, 475–476

SBC 104–105, 202

Scala 16, 400, 442

scaling 2, 11, 34, 110–115, 123–125, 138, 140–141, 145, 153, 250, 253, 279–284,
288, 297, 300–301, 307, 313, 316, 322–324, 375–379, 392, 442, 455, 468, 472,
480, 482, 485

scatter 359–360

scikit-learn 19, 28, 40, 43, 45, 99–100, 106, 376, 427, 430, 441, 443, 446, 448, 451

scipy 12, 95

score 6, 12–13, 156–157, 160–164, 170, 174, 184, 187, 189, 193, 196, 202, 204–205,
211–212, 215–216, 222–224, 228, 233, 239–240, 245–248, 255–256, 262–263,
268, 271–272, 278, 280, 282, 284–285, 287, 291–292, 302, 311–313, 317, 321,
324–325, 327–334, 336–341, 343, 348–349, 351–352, 355, 358, 373–374, 376,
386, 389, 391–392, 458, 462, 466, 469–470, 482, 485

scree 126–128

S-curved 179, 181

seaborn 29, 141, 358

seed 50, 55–56, 59, 61, 63–64, 201, 221, 244, 260, 290, 310, 329, 450

segmentation 3, 78, 147–149, 151–153, 155, 169, 171, 457

sensitivity 2, 10, 132, 191, 199, 211, 276, 321, 325, 335, 339, 343, 373, 387, 471,
479, 486, 489

setsearch 436

SGD 297, 487

shiny 37, 393, 447

shrinkage 254, 268, 271–272

sigma 361

sigmoid 278, 320–321, 478



skewness 82, 84–85, 131–132, 364–365, 453

sklearn 10–13, 95, 102, 108, 114, 120, 125, 134, 141, 157, 159–160, 173–174, 193,
215, 224, 232–233, 240, 248, 256, 263, 271–272, 282, 284, 302, 313, 320–322,
328, 331, 357, 391

SMOTE 130, 173, 206, 374, 456, 466, 469

sns 358–359

splitting 147, 149, 169, 176, 192, 206–208, 215, 228, 236–238, 252–253, 297, 330–
331, 356, 358–359, 457, 460–461, 464

SQL 16, 97, 103, 109, 134, 200, 220, 244, 259, 289, 309, 360, 392–393, 426, 429

sqrt 13, 243, 357–358, 360, 391

standardize 11, 114, 123, 125–126, 141, 157, 280, 283–284, 286, 288, 290, 301–302,
304–305, 307, 310, 313, 322, 482

statsmodels 193, 302, 358–359, 391

stepwise 118, 206, 232

Stochastic 122, 297, 487

stopping 180–181, 190, 208–209, 251–252, 254, 268–269, 272, 277, 300, 464–465

stratify 194, 217, 241, 257, 285, 303

subset 79, 99–102, 122–123, 208, 235, 238, 266–267, 277, 280, 284, 451, 463, 481,
487–490

SURVEYSELECT 56, 63, 134, 173, 201, 221, 244, 260, 290, 310, 359, 450

SVM 5, 12, 70–71, 110, 112–113, 132, 157, 159, 166, 273–275, 278–284, 286, 288–
289, 291–292, 311, 315, 317–318, 320–322, 466–470, 487

synthetic 145, 168, 174, 239, 255–256, 280, 282, 301, 316, 327–329, 355–356, 358–
359, 386, 456, 469, 487

tanh 296, 304, 316, 321, 478

target 2, 11–13, 48–52, 58, 61, 65, 68–76, 80, 101–104, 107–110, 119, 131, 133–
135, 137, 139, 141, 144–147, 159–160, 176–178, 182, 185, 193–197, 200–201,
205, 207, 215–217, 219, 221–225, 227, 240, 242, 245–249, 257–258, 260–264,
278, 282, 285, 287–288, 290–292, 294, 297, 302–307, 309–313, 327, 332, 334,
349, 351–353, 355–356, 365, 367, 370–373, 378, 386, 449, 452–454, 460, 471,
481–482, 488, 490

t-distributed 122

temporal 76, 152–153, 457

tensorflow 320–321, 427–428, 441, 443, 448

terminal 34, 399, 408

third-party 7, 378–379, 387, 392

three-dimensional 274

threshold 6–7, 73, 87–89, 95–97, 129, 158, 160, 162, 195, 197, 199, 208, 210–211,
241, 243, 306, 325–326, 334–335, 337–341, 345–347, 349, 355, 370, 372, 374,
379–382, 384–387, 389–390, 393, 458, 461, 471, 473

tidyverse 10, 441–442

time-series 185, 457

tpr 13, 159–161, 196, 210, 218, 242, 258, 286, 305, 325, 332, 346

trade-off 67, 209–210, 278, 284, 315, 339, 343, 349, 355, 466, 468, 479

transform 11, 17–18, 47, 54, 57, 59, 65, 73, 88–89, 93, 110, 114–115, 119–120, 123–
124, 135–136, 144–145, 158, 160–161, 175, 183–185, 227, 276, 278, 281, 286,
304–305, 315, 320, 363–365, 376–377, 452, 480

tree 4, 11, 157, 159, 163, 166, 175, 177, 204, 206–210, 212, 214–215, 217–220,
222–225, 228, 230, 232, 235–240, 249–255, 266–269, 271–272, 276, 280, 461–
465

t-SNE 122, 481

tune 159, 195, 203, 215, 217, 219, 222, 225, 239, 241–243, 245–248, 250, 255, 257,
261, 264, 280, 283, 286, 288, 291–292, 297, 304, 308, 311, 315, 318, 320–321,
457, 468, 489

underfitting 213–214, 278, 318, 468, 479, 489

undersampling 130, 133, 145, 168, 173, 191, 195, 197, 201, 203, 206, 217, 219, 221,
225, 239, 255–257, 261, 280, 282, 285, 287, 301, 304, 306, 316, 374, 456, 469

univariate 11, 84, 122, 141, 173, 360, 393

unsupervised 122, 298, 480, 484, 489

validation 5, 147–156, 158–159, 162–163, 165–166, 169, 171, 190, 192, 194, 197,
203, 210, 215, 217, 223–224, 236–237, 241–242, 246, 248, 252, 254, 257, 259,
262–264, 266, 272, 282, 285, 287, 291, 297, 301, 303–304, 306–307, 311, 313,
375, 457–458, 460, 462, 465, 470, 489

variance 7, 104, 112, 114, 122–123, 126–128, 150, 179–180, 186, 192–194, 206,
215, 236, 238, 251–252, 277, 282, 302–303, 326, 362–363, 368, 371, 374, 383–
385, 387, 389, 393, 457, 464, 471–473, 479, 489

vectors 5, 12, 70–71, 106, 110, 112–113, 115, 132, 159, 163–166, 176–177, 227,
265, 273–283, 314–315, 318, 320, 466–467, 469, 487, 489–490

VIF 192, 194, 197, 200, 203, 206, 232, 303, 306, 309, 313

Viya 27, 45, 392, 416

VSI 7, 314, 323, 380, 383–385, 387, 389, 393, 472–473

weights 88, 131–132, 180–181, 281, 295–297, 299–300, 316, 328, 366, 464, 466–
467, 469, 479, 489–490

whitespace 10, 53

winsorizing 2, 86–87, 89, 137, 139, 141, 144, 146, 158, 160, 162, 168, 173, 453–454

WOE 88

workspace 202, 222–224, 245, 247, 262–263, 291–292, 311–312, 406

wrapper 2, 95, 98–101, 104, 137, 139, 454, 490

xgboost 70, 132, 174, 250, 256–259, 267, 269, 442, 464–465, 490

Z-score 112, 491
