SAS Python R Full Book
Overview 15
Programming Environments 18
SAS OnDemand for Academics 21
SAS Studio IDE 22
Log 24
Results and Output 25
Python Spyder 27
Python Spyder IDE 28
Editor 29
Console 29
Help Panel 30
Variable Explorer 32
RStudio 33
RStudio IDE 34
Global Environment 35
Packages 36
Chapter 1: Programming Environments – Conclusion and Transition 38
Chapter 1 Summary: Programming Environments 39
Chapter 1 Quiz 43
Chapter 1 Cheat Sheet 45
Overview 48
Project Overview 49
Import Data 50
Data Dictionary 51
Sampling 55
Chapter 2: Gathering Data – Conclusion and Transition 58
Chapter 2 Summary: Gathering Data 59
Chapter 2 Quiz 62
Chapter 2 Cheat Sheet 64
Overview 66
Acknowledgments
This book would not have been possible without the invaluable contributions and
unwavering support of numerous individuals and teams. I would like to take this
opportunity to extend my heartfelt gratitude to everyone who played a role in its
creation.
To the SAS Team: I am deeply grateful to Carrie Vetter, Suzanne Morgen, and
Catherine Connolly, whose expertise and guidance were instrumental in shaping the
technical content of this book. Your dedication to the advancement of data science
has been a constant source of inspiration.
To the Wells Fargo Team: A special thank you to Paul Davis, Cem Isin, Chris Challis,
Abdulaziz Gebril, Jie Chen, Michael Luo, Nijan Balan, Debjyoti Sadhu, Jordan
Eustaquio, Ferda Ozcakir Yilmaz, Vinothdumar Venkataraman, Todd Anderson,
Swapnesh Sanghavi, Weishun Chen, Tom Zhu, Boobalaganesh E., Dip Chatterjee,
Ibro Mujacic, Justin Gaiski, Soumyaa Mukherjee, Prabhat Vashishth, Kumar
Nityanand, Saurabh Chauhan, Anna Koltsova, Andrew Wolschlag, Yusuf Qaddura,
Tinhang Hong, Sara Slovensky, Satish Komirishetti, Sidharth Thakur, Insiya Hanif,
Jody Zhang, Nageswara Reddy, Anupam Chatterjee, Anoop Kamath, Nicole Chhabra,
Deeksha Tembhare, Sruthy V., Ashish Agrawal, Shyamasis Guchhait, Patrick Hook,
Brian Abrams, Mirela Mulaj, Christina Fang, Travis Alexander, Kathy Cunningham,
and Howard Kim. Your insights, feedback, and collaborative spirit have significantly
enriched the quality of this work. It has been an honor to work alongside such a
talented and dedicated group of professionals.
To the ELVTR Team: I am also incredibly thankful to the team at ELVTR, the online
data science platform, for their support throughout this journey. Anna Ansulene Van
Zyl, Yana Dovgopolova, Olga Viun, Saul Mora, and Olia Tsvek, your commitment to
education and innovation in the field of data science has been truly inspiring. Thank
you for providing a platform that fosters learning and growth for so many.
To My Family: Everything that I have ever done or ever will do is credited to the love
and support of my wife and son, Tanya and Jake. I don’t know if we’re the three
musketeers, a three-man jam band or a skeleton pirate crew, but we’re in it ‘till the
fuckin wheels come off.
Before you dive into the code cheat sheet, it's important to understand a key
difference between the programming languages featured in this book: whitespace
character sensitivity.
● SAS: SAS is not sensitive to whitespace characters. Statements are delimited by semicolons, so indentation and line breaks do not affect how the code executes.
● Python and R: Python and R, on the other hand, are sensitive to whitespace characters, which means that indentation and spacing are crucial for the correct execution of code. In Python, for instance, the indentation level indicates code blocks, and incorrect indentation can lead to syntax errors or incorrect program logic. R is slightly less strict but still relies on proper formatting, especially in conditional statements and loops.
Due to publishing limitations, you may notice that some Python and R code snippets
in this cheat sheet are wrapped or formatted to fit within the page. When writing or
copying this code into your development environment, ensure that you adhere to
the correct indentation and spacing rules specific to each language. This will prevent
errors and ensure that the code executes as intended.
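For example, here is a minimal Python sketch (illustrative only, not taken from the book's project code) showing how indentation defines a code block; shifting the indented lines would change the program's logic or raise an IndentationError:

# Indentation defines the body of the if/else blocks in Python.
score = 0.75
if score >= 0.5:
    label = "default"      # these two lines belong to the if block
    print("High risk")
else:
    label = "non-default"  # this line belongs to the else block
print(label)               # unindented: runs regardless of the branch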
Random Forest
  SAS:    PROC HPFOREST DATA=mydata; TARGET target; INPUT var1 var2; RUN;
  Python: from sklearn.ensemble import RandomForestClassifier
          model = RandomForestClassifier()
          model.fit(X, y)
  R:      library(randomForest)
          model <- randomForest(target ~ var1 + var2, data=mydata)

Gradient Boosting Machines
  SAS:    PROC GRADBOOST DATA=mydata; TARGET target; INPUT var1 var2; RUN;
  Python: from sklearn.ensemble import GradientBoostingClassifier
          model = GradientBoostingClassifier()
          model.fit(X, y)
  R:      library(gbm)
          model <- gbm(target ~ var1 + var2, data=mydata, distribution="bernoulli")

Support Vector Machines
  SAS:    PROC SVMOD DATA=mydata; TARGET target; INPUT var1 var2; RUN;
  Python: from sklearn.svm import SVC
          model = SVC()
          model.fit(X, y)
  R:      library(e1071)
          model <- svm(target ~ var1 + var2, data=mydata)

Neural Networks
  SAS:    PROC NEURAL DATA=mydata; INPUT var1 var2; TARGET target; RUN;
  Python: from keras.models import Sequential
          from keras.layers import Dense
          model = Sequential()
          model.add(Dense(10, input_dim=10, activation='relu'))
          model.compile()
  R:      library(nnet)
          model <- nnet(target ~ var1 + var2, data=mydata, size=10)

AUC
  SAS:    PROC LOGISTIC DATA=mydata PLOTS=ROC; MODEL target(event='1') = var1 var2; ROC; RUN;
  Python: from sklearn.metrics import roc_auc_score
          auc = roc_auc_score(y_true, y_pred)
  R:      library(pROC)
          auc <- roc(response=mydata$target, predictor=model$fitted.values)$auc

Gini
  SAS:    PROC LOGISTIC DATA=mydata; MODEL target(event='1') = var1 var2; OUTPUT OUT=gini PREDPROBS=I; RUN;
  Python: gini = 2 * roc_auc_score(y_true, y_pred) - 1
  R:      gini <- 2 * auc - 1

KS Statistic
  SAS:    PROC NPAR1WAY DATA=mydata KS; CLASS target; VAR score; RUN;
  Python: from scipy.stats import ks_2samp
          ks_stat, p_value = ks_2samp(y_true, y_pred)
  R:      ks_stat <- ks.test(mydata$target, model$fitted.values)$statistic
Overview
You may not be aware of it, but data science touches nearly every aspect of your
daily life. If you are reading a book about computer programming, you are probably
already aware of the prominent and most common applications of data science.
These include how Netflix uses data science to recommend movies based on your
demographics and history, how Amazon upsells products, how dating sites
recommend matches, and how TikTok populates your video stream.
We generally know that online systems such as Netflix, Amazon, and TikTok gather
large volumes of data and subject it to machine learning algorithms to customize
content and sell products. However, data science has escaped the confines of the
tech giants and now drives nearly every aspect of our lives. Examples include the
formulation of medications, police patrol routes, fashion trends, distribution of aid
in times of crisis, housing development locations, wildfire prevention and control,
military actions, pet food ingredients, and much, much more.
Among the top programming languages in data science, SAS, Python, and R stand
out. SAS has a long-standing presence in the industry, offering a comprehensive
suite of tools for data analysis and modeling. Python has gained significant
popularity due to its versatility, ease of use, and extensive libraries for data analysis and machine learning. R, for its part, has deep roots in statistics and academic research, offering a rich ecosystem of packages for modeling and visualization.
Although SAS, Python, and R are among the top programming languages in data
science,1 there are other programming languages and analytical environments
commonly used for data science, such as:
● Julia
● C++
● Scala
● MATLAB
● Octave
● SQL
● Java
● Alteryx
● Tableau
● and even Microsoft Excel
Each of these programming languages is powerful and flexible and can perform a
wide array of data science functions. However, this book aims to provide you with a
practical, hands-on guide to the top three programming languages used in the
business world.
Truth be told, in the 2024 Kaggle Machine Learning and Data Science Survey, the top
programming languages included both #1 Python and #3 R (#2 was SQL), but SAS
came in at #12 on the list. I believe that this is a result of selection bias. The survey
results show that nearly half of the respondents have less than five years of
programming experience and that almost 40% of the respondents were students.
SAS is generally not taught in schools because the full SAS product line is a license-
based product. Universities and students prefer freeware for obvious reasons.
1. Yes, I realize that Python is listed as a programming language for software engineering, web
development and data science. It is a very versatile language with a lot of applications and a huge
support community. This is one of the main reasons that it is used extensively in academia. You should
definitely at least learn the basics of working with this programming language. However, many
corporations do not support Python due to a legacy codebase developed in other languages or they
feel that other analytical product offerings support their business needs in a more integrated manner.
However, SAS does offer a free student version of its product, which contains all the capabilities of a fully licensed product, with a file size limit of 5 GB. For most students,
this SAS freeware version will meet 90% of their needs.
So, including Python and R as part of this book is obvious, but why include SAS when
it is listed as #12 on the data science popularity list? The reason is that in the real
world, SAS is still a dominant analytical environment where senior data scientists
have developed analytical products, automated processes, and bleeding-edge
artificial intelligence codebases that contain over five decades of knowledge. SAS is
still the leading programming language throughout the financial services, insurance,
and biotech industries and all of government services. If you want to work in any of
these fields, you will need to transform your freeware knowledge into SAS.
The Gartner Magic Quadrant reviews and analyzes all the major data science and
machine learning (DS/ML) software products. This independent evaluation has
placed SAS as the top DS/ML software provider for its entire review history, lasting
over 15 years. Figure 1.1 shows the Gartner Magic Quadrant for DS/ML software
products. This quadrant analysis plots DS/ML software companies where the X axis
represents the company’s “completeness of vision,” and the Y axis represents the
company’s “ability to execute.”
Although Python and R are not specifically identified in the quadrant analysis, they are the programming languages that drive the data science and machine learning capabilities of many of the software products that are evaluated.
This chapter will provide you with an overview of the Integrated Development
Environment (IDE) for Python Spyder, RStudio, and SAS Studio. Although the
underlying constructs of the environments are quite different, the user experience
and overall look of the environments are surprisingly similar.
Programming Environments
Every analytical programming environment has its own look and feel that can be
customized in a multitude of ways. The arrangement of the coding window, the
output window, the data set view, and the graphics display, along with the look and
feel of the font and background colors – all of this can be customized. Some
programmers prefer “dark mode,” while others prefer a white background. In order
to cut through all the confusion that can be brought on by customization, I will use
the default settings in the programming environments in this book.
The following sections will provide you with the links to either access or download
each of the respective software programs. These sections will not include step-by-
step installation procedures because these procedures vary depending on which
operating system you are using and which version of the software you choose to
install. Fortunately, each of these software programs has highly detailed support
documentation and large user communities that will quickly guide you through the
installation process.
SAS Studio:
● Performance: SAS Studio is known for its efficient and optimized processing,
making it suitable for handling large-scale data and complex analyses. It
leverages the SAS platform's powerful engine to deliver high-performance
computing capabilities.
Python Spyder:
● Performance: Spyder benefits from Python's strong performance across data processing tasks and from the optimized open-source ecosystem of numerical and machine learning libraries that surrounds it.
RStudio:
● Performance: RStudio is strong for statistical modeling, although base R's largely single-threaded nature can create performance challenges with very large data sets.
Comparison:
● Python Spyder, with its versatile libraries and extensive ecosystem, provides
flexibility for various data analysis and machine learning tasks. It benefits
from Python's efficiency and performance.
● RStudio shines in its rich statistical capabilities and extensive library support.
It is widely adopted in the academic and research community, offering
specialized packages for specific statistical domains.
● SAS Studio and RStudio have point-and-click interfaces, making them more
accessible to users without programming expertise. Python Spyder caters to
both coding and point-and-click approaches.
● Python Spyder and RStudio have larger communities and a broader range of
open-source libraries available, while SAS Studio offers the advantage of
being a comprehensive, integrated environment.
SAS OnDemand for Academics
With SAS OnDemand for Academics, users can perform complex data
manipulations, explore large data sets, and conduct advanced statistical analyses.
The environment provides an intuitive interface that enables users to write and
execute SAS code seamlessly, with features such as syntax highlighting,
autocompletion, and error checking to enhance the programming experience.
One key advantage of SAS OnDemand for Academics is its extensive library of SAS
procedures and functions, which covers a wide range of statistical techniques and
data analysis methods. Users can easily access and use these procedures to perform
tasks such as data cleaning, regression analysis, clustering, and more. Additionally,
SAS offers comprehensive documentation and resources to support users in learning
and applying statistical methods effectively.
The process to get started in SODA is easy. Once you access the link provided above,
you will register as a user and receive an email. Just follow the guided process, and
you will be able to access SODA within 5 minutes. I will not bore you with
screenshots of how to fill out a registration form or accept a license agreement.
We’ve all done these many times before.
Once you register and sign in, you will be presented with the SODA dashboard. For
our current purposes, we will select the first item, SAS Studio. In later chapters, we
will explore SASPy access to SAS-hosted servers along with several other ways to
integrate SAS, Python, and R.
Once we select the SAS Studio option, we are connected to the SAS Studio IDE
(Integrated Development Environment). This is the primary programming
environment for SAS users. Figure 1.3 demonstrates the SAS Studio IDE.
The IDE consists of two main sections. On the left is the navigation pane. This
section contains a series of drop-down menus that enable you to select point-and-
click procedures to perform a variety of tasks. These tasks can range from simply
importing data to creating graphs or even developing advanced machine learning
models.
The navigation pane also contains code snippets that provide you with the building
blocks of SAS code to create various data manipulation procedures and graphical
output. The combination of the point-and-click and code snippet features provides
data scientists with a powerful arsenal of prepared data wrangling and analytical
techniques that will get you started in SAS Studio very quickly. These are the types
of features that you cannot find in most freeware products.
The second section of the SAS Studio IDE is the work area. This is the section where
you will perform most of your work. SAS code is developed in the code window.
Once the code has been submitted, the log window will provide you with a variety of information, including the number of observations processed in each section of the submitted code, along with runtime information; it will also contain an error log. SAS error logs are generally informative and provide you with plain-language information that helps you debug your code (unlike some programming languages' error logs, which only add to your confusion about what went wrong).
The work area also contains a results window. This window displays output
requested as part of the code, such as frequency distributions, database contents,
metadata, and graphical output.
Log
In the example above, I created a simple import statement that accesses a .csv file
that I stored locally and imported into the SAS Studio environment. Once the
program has completed running, SAS Studio automatically generates a log that
provides you with information about the completed process. Figure 1.4
demonstrates that the log will contain three sections in the “Log Tabs” area.
The first section will identify any errors that occurred during the processing. These
are often hard stops that will prevent the rest of the program from running. SAS
error logs are the most useful among the popular analytical software packages. They will show you exactly which part of the submitted code has
a problem and provide you with suggestions on how to resolve the problem. For
example, the error log will specify if you are missing a semicolon to close a
statement or if there is a misalignment due to variable formatting, or many other
issues that arise when debugging code. The combination of specifying exactly where
the problem is occurring along with common language descriptions of what the
problem is and suggestions on how to resolve the problem can save you hours of
debugging time.
The second section is the warning section. This will provide information about
possible issues that need to be addressed. Warnings will not terminate a program,
but they will provide you with information about the program that you might want
to address. These issues can relate to long processing times, truncated variable names, calculation issues, or a variety of other items that you may need to be aware of.
The final section is the notes section. This item will provide general information
about the completed process. It will commonly include the number of observations
and variables in a data set, the total processing time, how much memory was used,
along with several other items depending on the nature of the program that was
run.
Results and Output
Once the program has successfully run, the results can be found in the results and
output data sections. These sections provide you with information concerning the
completed process and the final output of that process. In the example shown in
Figure 1.5, the program imported data from a local .csv file and loaded it into the
SAS Studio environment. The results window contains the output of a PROC
CONTENTS statement (we will cover exactly what this is in a later section). The
information displayed in the results window will vary depending on the nature of
the program that was run. For example, if you ran a program designed to create a
histogram, then the visual output of the histogram would be contained in the results
window.
The final piece of information provided with a successful program run is the output
data. This tab provides you with a spreadsheet-like view of the resulting data set
created from a program run. Figure 1.6 shows you the two main pieces of
information in the output data tab. First, the output contains a list of all the
variables contained in the data table, along with the properties of each of the
variables. These properties include the variable label, name, length, type, format,
and informat.
The second piece of information is a view of the data table. This view is incredibly
useful in understanding the contents of the data set and its completeness. The data
table shows all the variables and observations in the data set in a spreadsheet-like
view.
SAS Studio provides the data scientist with lots of features that save development
time and reduce debugging headaches. The prepared code snippets provide you
with valuable instruction on a variety of topics such as data preparation, data
manipulation, analytics, machine learning, and graphical output. The log information
provides you with specific information that reduces debugging time, and the file
management system allows you to organize your project workflow.
SAS Studio is also the interface for the latest SAS product, SAS Viya. Although this
product is not the focus of this book, it is a fantastic data science and machine
learning environment that provides cutting-edge machine learning algorithms along
with highly efficient distributed database storage and processing.
Python Spyder
Python is a broad programming language that can be used for software engineering,
data science, web development and many other things. Because of the language’s
versatility, the Python Spyder IDE is not set up specifically for data science and
machine learning. When compared to the analytical environment of SAS Studio,
Python Spyder can look a bit basic on its surface. However, this could not be farther
from the truth.
The power of the Python Spyder environment is not in its “bells and whistles” but its
inherent flexibility. Since Python is an open-source product, data scientists have a
wide array of libraries that they can download and use for free. These libraries can
range from mathematical formulas to machine learning algorithms to graphical
displays to web development to just about anything that you can think of.
Spyder provides a user-friendly interface that enables data scientists to write and
execute Python code with ease. The IDE includes features such as syntax
highlighting, code completion, and error detection, which enhance the coding
experience and improve productivity. Spyder also integrates with popular Python
libraries and frameworks commonly used in data science, such as NumPy, pandas,
matplotlib, and scikit-learn, allowing seamless integration of these tools into the
workflow.
One of the key strengths of Spyder is its interactive console, which provides a
convenient environment for exploring and manipulating data. Data scientists can
execute Python commands and view the results instantly, making it easy to test and
iterate code snippets. The IDE also supports advanced debugging features, profiling
tools, and variable exploration, which are valuable for troubleshooting and
optimizing code performance.
Spyder offers extensive support for data visualization, with built-in plotting
capabilities and integration with popular data visualization libraries like Matplotlib
and Seaborn. This enables data scientists to generate insightful visualizations to gain
a deeper understanding of the data. Spyder also provides a comprehensive data
editor that allows users to view and manipulate data sets directly within the IDE.
With its rich set of features and focus on scientific computing, Spyder is a valuable
tool for data scientists and researchers working with Python. Its user-friendly
interface, extensive library integration, and interactive console make it an excellent
choice for exploratory data analysis, model development, and scientific computing
tasks. Spyder's versatility and flexibility make it suitable for both beginners and
experienced Python programmers in the field of data science.
There are three main components to the Spyder IDE. Figure 1.6 shows the three
sections as the editor, the console, and the help panel.
Editor
The editor is where you will create most of your code. This section allows you to
type directly into the editor panel to create your code. You can then run the code by
either highlighting the section and selecting Ctrl + Enter, or you can highlight the
section that you want to run and then select the green arrow button above the
editor panel. Finally, you can also right-click within the editor panel and select the
run function from the list of features that pop up. The editor panel allows you to
open existing programs or create new ones within the editor panel.
Console
The console panel allows you to type commands directly into the console, much as you would create and run programs in your computer's command prompt. This panel enables you to work interactively or to run code from the editor. It also provides some high-level information, such as which version of
Python you are using.
Figure 1.7 shows a small program developed in the editor that takes a base input
and raises it to a specified power. In this case, eight is raised to the power of two,
and the program prints the result of the calculation. The output of the code is
provided in the console panel.
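A sketch of that kind of program might look like the following (the variable names are illustrative, not taken from the figure):

# Raise a base input to a specified power and print the result.
base = 8
power = 2
result = base ** power
print(result)  # the console displays 64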
Help Panel
The help panel has four sections. These sections include files, plots, help, and the
variable explorer. Figure 1.7 shows the file section. This section allows you to
explore and structure your project files. Figure 1.8 shows the plot area of the help
section. This figure demonstrates some simple code that reads a .csv file and plots
the data points on a line graph. The graphical results are displayed in the plots area.
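A minimal sketch of that kind of plotting code, assuming a hypothetical file name and column names, could look like this:

# Read a .csv file and plot its data points as a line graph;
# the resulting figure appears in Spyder's Plots pane.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("example.csv")        # placeholder file name
plt.plot(data["month"], data["value"])   # placeholder column names
plt.xlabel("month")
plt.ylabel("value")
plt.show()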
The help section provides documentation for any object with a docstring. A
docstring functions like an embedded comment that contains information about
modules, classes, functions, and methods. If you have a question about any feature
of Python, you can simply place your cursor after the item in question and press Ctrl
+ I. This will bring up the documentation within the help panel.
Figure 1.9 provides an example. In this example, I want to know more information
about DataFrames. I placed the cursor just after the DataFrame statement and
pressed Ctrl + I. The documentation for DataFrames appears in the help section. You
can also manually enter an object’s name directly into the “Object” area just above
the help window.
Finally, you can enable automatic help by changing your settings under Preferences – Help – Automatic Connections. Once this setting is selected, you can turn it on and off by using the lock icon directly to the right of the Object area above the help window. With automatic help enabled, you simply type an open left parenthesis after a function or method name, and the associated information will appear in the help window. For example, if I wanted more information about arrays, I would type np.array( into the editor, and the documentation for arrays would appear in the help window.
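As a small illustration of what a docstring looks like (this function is a hypothetical example, not part of the book's project):

def squared_error(actual, predicted):
    """Return the squared difference between an actual and a predicted value."""
    return (actual - predicted) ** 2

# Placing the cursor after squared_error in the editor and pressing Ctrl + I
# displays this docstring in the help panel; help(squared_error) prints it
# in the console.
help(squared_error)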
Variable Explorer
The final area in Python Spyder is the variable explorer area. This section provides
metadata information for all data sets in each project. Figure 1.10 shows the results
from an import statement. The work area has the code to pull a local data set into
Spyder. The console section provides information concerning the data pull, and the
variable explorer section provides metadata about the imported data set. This
metadata includes information about what type of data set it is, the variables
contained in the data set, the data types, the data size, and each variable’s values.
The variable explorer section is interactive. You can click the DataFrame’s name to
view the whole data set. You can even click on the individual variable’s name to see
an individual column of that variable’s values. The variable explorer section will also
let you right-click on a variable and provide you with options to delete or modify its
values.
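The same metadata can also be pulled programmatically from the console; here is a minimal sketch with a placeholder file name:

# Import a local .csv file and inspect the metadata that the Variable
# Explorer summarizes: dimensions, column names, data types, and memory use.
import pandas as pd

df = pd.read_csv("local_data.csv")  # placeholder file name
print(df.shape)    # (rows, columns)
print(df.dtypes)   # data type of each variable
df.info()          # variable names, non-null counts, and memory usage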
The Python Spyder IDE is a powerful and flexible tool that provides data scientists
with several features to organize, develop and deploy various projects. The IDE
provides detailed information about coding features, graphical displays, object
documentation, data set and variable information, and interactive console and data
set features.
RStudio
RStudio is a dedicated integrated development environment (IDE) for the R programming language, which is widely used for statistical computing and data science. It provides a powerful and user-friendly environment for data
exploration, visualization, and modeling. RStudio combines a code editor, a console
for interactive R programming, and various tools to enhance the data science
workflow.
With RStudio, data scientists and statisticians can leverage R's extensive capabilities
to perform complex analyses and build sophisticated models. The IDE offers an
intuitive interface with features such as syntax highlighting, code completion, and
error checking, facilitating efficient coding and debugging. RStudio also provides
seamless integration with R's vast ecosystem of packages, which cover a wide range
of statistical techniques, machine learning algorithms, and visualization tools.
RStudio IDE
Figure 1.11 shows the RStudio IDE. Notice how similarly all three interfaces have
been constructed. The RStudio interface is not much different than the SAS Studio
and Python Spyder interfaces. There are always some changes in the location of
certain sections and features, but the four main sections are present. RStudio has a
code editor in the top left section, a console and terminal section in the bottom left
area, a global environment section in the top right, and a tab selection section of
files, plots, R packages, and views in the bottom right area. The location of each of
these sections is changeable, so you can configure them in a manner that best suits
you.
Global Environment
All the data sets you have created are available for inspection in the global
environment area. On the far right side of the section containing the data set name,
you will find a spreadsheet-like icon that you can select. This will bring up a
spreadsheet view of your data set. Figure 1.12 demonstrates this view.
The spreadsheet view provides a great amount of insight into your data set. You can
easily see the number of variables, the data types, value ranges and many other
valuable pieces of information.
Packages
One of the main features that separates the RStudio interface from the other IDEs is
the R packages tab located in the bottom right area of my configuration. Figure 1.13
shows the RStudio Package Editor. This is an interface that allows the programmer
to easily search, install, or delete any of the available R packages.
We will review libraries and packages in the next section of the book, but for now,
understand that these packages contain logic that allows a programmer to develop
data science models, web pages, documentation, graphics, and many other
possibilities. They are essentially expansion packs for your R language.
To find a new R package, you simply select the “Install” button just above the
packages area. An interface will come up, and you can type the name of the package
that you want to install. Figure 1.13 shows the view with the package installer
interface. In this example, I typed the package name “caret” into the package
installer. Once the “Install” button is selected, RStudio automatically downloads the
package and all its dependencies and installs them in the area that you have
selected.
Once the package is installed, you will find all the installation notes in your console
section and the newly downloaded and installed package will be at the top of your
packages list. This is by far the easiest interface to search, download, install and
update libraries/packages across all the IDEs that we have reviewed.
The RStudio IDE is a flexible and powerful interface that extends far beyond the
original design of calculating statistics and performing data analysis. With the
extension of packages such as “Shiny,” RStudio is a fully integrated website and web
app development tool. Although these features are beyond the scope of this book,
they are powerful and awesome tools that are definitely worth exploring.
Chapter 1: Programming Environments – Conclusion and Transition
In this chapter, we explored the three primary programming environments that you
will use throughout this book: SAS Studio, Python Spyder, and RStudio. We
compared their interfaces, functionalities, and unique features, setting the stage for
our journey through data science and machine learning across these platforms.
Now that you're familiar with the environments in which we'll be working, the next
step is to delve into the data itself. After all, data is the foundation of any data
science project, and how we gather, clean, and prepare it will significantly influence
the outcomes of our analyses.
In the upcoming chapter, we'll shift our focus to the process of gathering data. You'll
learn how to import data from various sources into SAS, Python, and R, and we'll
discuss best practices for ensuring the quality and integrity of the data you work
with. Understanding how to gather and prepare your data efficiently is crucial as it
sets the stage for the subsequent steps in your data science workflow.
So, with your programming environment ready, let's move on to exploring the
diverse world of data collection and preparation. This is where the real work begins,
and mastering these techniques will empower you to build robust, reliable models
in the chapters to come.
Chapter 1 Summary: Programming Environments
● Context: The chapter sets the stage for the book’s central theme – cross-
referencing SAS, Python, and R – by highlighting their significance in the
data science community, even though their popularity may vary across
different user bases, as evidenced by the 2024 Kaggle Machine Learning and
Data Science Survey.
● Overview of IDEs: This chapter introduces the three primary IDEs associated
with SAS, Python, and R: SAS Studio, Python Spyder, and RStudio. Each IDE is introduced along with its layout, main panels, and key features.
● Comparison of IDEs:
● SAS Studio:
● Python Spyder:
● RStudio:
6. Looking Ahead
● Transition to Analytical Concepts: The chapter prepares the reader for the
transition from understanding the programming environments to applying
analytical concepts across SAS, Python, and R in subsequent chapters.
Chapter 1 Quiz
Questions:
1. What are the primary advantages of using SAS in industries such as financial
services and biotechnology?
4. Explain how SAS Studio’s point-and-click interface can benefit data scientists
who may not have extensive programming experience.
6. How does Python Spyder facilitate the integration of data manipulation and
machine learning libraries such as NumPy, pandas, and scikit-learn?
7. What are the key features of RStudio that make it suitable for reproducible
research and collaboration in data science?
8. Describe how SAS Studio handles large data sets and why this is important
for advanced analytics.
9. What advantages does Python Spyder offer for exploratory data analysis in
scientific computing?
10. How does RStudio’s package management system contribute to its flexibility
in data science projects?
12. How can knowledge of multiple IDEs and programming languages enhance a
data scientist’s ability to tackle complex data problems?
13. Describe how SAS Studio’s log output assists in debugging and optimizing
code.
14. What role does the interactive console in Python Spyder play in iterative
code development?
15. How does RStudio support the creation of interactive data visualizations and
what are some use cases?
16. Why is it important for data scientists to be familiar with the different
programming environments discussed in this chapter?
17. Discuss the impact of the long-standing presence of SAS in the industry on
the development of analytical products and automated processes.
19. What are the benefits of using RStudio for statistical modeling in academic
and research settings?
20. How does understanding the similarities and differences between SAS,
Python, and R improve a data scientist’s versatility and problem-solving
capabilities?
Chapter 1 Cheat Sheet
Key Features
  SAS Studio:    Detailed log outputs for debugging; integrated with SAS Viya for advanced analytics
  Python Spyder: Integrated help and documentation; variable explorer for easy data inspection
  RStudio:       Support for version control and projects; interactive data visualization tools

Ease of Use
  SAS Studio:    Accessible to both novice and experienced users due to the combination of a point-and-click interface and coding capabilities
  Python Spyder: User-friendly with features like code completion, syntax highlighting, and debugging tools
  RStudio:       Intuitive interface with support for both basic and advanced users, especially in statistical modeling

Performance
  SAS Studio:    Highly efficient in processing large-scale data, particularly in regulated environments requiring rigorous data handling
  Python Spyder: Efficient with Python's strong performance across data processing tasks; benefits from an optimized open-source ecosystem
  RStudio:       Strong in statistical modeling but may face performance challenges with very large data sets due to its single-threaded nature

Installation
  SAS Studio:    Access via SAS OnDemand for Academics (web-based) or licensed installations
  Python Spyder: Available via direct download or through the Anaconda distribution
  RStudio:       Available as a free download, with extensive documentation and community support

Best Practices
  SAS Studio:    Utilize the built-in code snippets for common tasks
  Python Spyder: Regularly update Python libraries to access the latest features
  RStudio:       Organize projects using RStudio's project management features
Overview
Garbage in, garbage out (GIGO). This is perhaps the most important concept of data
science. Here’s why:
GIGO refers to the quality of the data you use to build your data science projects. If
you have poor-quality data as your input, then you will undoubtedly have a poor-
quality output. It doesn’t matter how powerful your machine learning algorithm is
or how many terabytes of data you have; if you put garbage in, you will get garbage
out. Imagine trying to build a house on sand with a foundation made of balsa wood.
It doesn’t matter how sophisticated the design of the house is; I would not live in it.
This book aims to provide you with a tool that will allow you to utilize your existing
knowledge of a programming language and expand it to other programming
languages. If you already know SAS but have a new project that requires you to
develop it in R, this book will provide you with a cross-reference guide where you
can look up a SAS procedure that you already know and convert it into R code.
As useful as this cross-reference guide can be, the real goal is to learn each of these
programming languages to a point where we do not need to look up an existing
procedure we already know. Instead, given time and practice, you will learn two
new programming languages and expand the skills of your existing programming
knowledge.
The structure of this book will be based on a single project. We will import data,
analyze and transform the data, perform feature engineering, create several
machine learning models, and evaluate the effectiveness of these models. For
consistency and to build our skills throughout each project phase, we will focus on a
single data set with a single objective. You should be able to transfer each model
development step to nearly any data science project that you create. So, even
though we will focus on a single project in a specific industry, you can apply these
skills to various projects in any industry.
We will not discuss the underlying foundations of data science or the different modeling algorithms in great detail. For a more in-depth study
of these concepts, check out my previous book, End-to-End Data Science with SAS.
Project Overview
The project focuses on building a risk model using the Lending Club data set to
predict loan default based on application information. The Lending Club data set is a
comprehensive collection of loan data from the Lending Club platform, a peer-to-
peer lending marketplace. It provides valuable insights into borrower characteristics
and loan details, making it an ideal data set for risk assessment.
The structure of the Lending Club data set follows a common format for supervised
machine learning tasks:
● However, one crucial aspect is that the data set does not include a pre-
existing target variable indicating loan default. Therefore, we need to derive
this target variable ourselves based on the available information in the data
set.
Creating a target variable is an essential step in this project. We will define loan
default based on specific criteria such as past due payments, charged-off status, or
other relevant indicators available in the data set. Creating the target variable from
the available information is crucial in accurately training our risk model.
● Target Variable: The target variable, also known as the dependent variable,
is the variable that we aim to predict. In this case, the target variable is loan
default, which represents whether a borrower will default on their loan. It is
typically binary, taking on values like "default" or "non-default," "1" or "0,"
or "yes" or "no." Creating an accurate prediction for the target variable is
the primary objective of our risk model.
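As a rough, hypothetical sketch of what deriving this binary target could look like in Python, assuming the data have already been imported into a pandas DataFrame named loan (as we do later in this chapter); the exact loan_status values counted as defaults are an assumption to be refined when we formally define the target:

# Flag loans whose status indicates default-like behavior (assumed list of statuses).
bad_statuses = ["Charged Off", "Default", "Late (31-120 days)"]

# 1 = default, 0 = non-default
loan["target"] = loan["loan_status"].isin(bad_statuses).astype(int)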
Import Data
The data set we will work with is the Lending Club loan database. This data set
represents loan accounts sourced from the Lending Club website. It provides
anonymized account-level information with data attributes representing the
borrower, loan type, loan duration, loan performance and loan status. The raw data
can be found at:
https://www.kaggle.com/wordsforthewise/lending-club
Users are required to register on Kaggle before they can download the data set.
However, the remainder of this chapter focuses on getting the data and performing
some filtering and minor data manipulation. The resulting data set is a sample of the
overall data set with fewer features. To ensure we are all working from the same
data set, I have included the reduced data set in the GitHub repository for this book.
The data set is labeled “Loan_Samp” and can be found at:
https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide
Since we will be creating risk models based on historical customer data, we will only
need to focus on the data set that contains accepted applications that result in
active accounts. We certainly cannot use rejected data because you cannot default
on a loan for which you were never approved. Therefore, we will only need to
download the “accepted_2007_to_2018Q4” file. This file should contain all the
available variables at the point of application and the critical information necessary
to create our target variable.
*Note: GitHub has a size limitation on items that you are allowed to keep in your
repository. Unfortunately, this data set exceeds the max size restriction. However, if
you apply the code shown in the next section, you should be able to download the
data. By using the “seed” option, you should also be able to generate the same
sample as I will be using throughout this book. Also, the final reduced data set can
be found in the GitHub repository for this book.
Data Dictionary
One of the nice things about the Lending Club data set is that it comes with a data
dictionary. A data dictionary is a document that contains definitions for each field in
the data set. The level of detail can vary greatly for these types of documents. This
particular one only lists the variables in the data set and their common language
definition. The full data dictionary can be found in my GitHub repository:
https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide
However, due to the size of the data and the number of attributes, we will not use
all the available data. The raw data set has 2.2 million observations and 151
variables. We will limit the data to the following attributes and limit the number of observations in the data. Table 2.1 shows the variable type, name, and description
for all variables that we will use as part of this project. Notice that “loan_status” will
be used to construct our target variable, and the remaining data attributes will be
our predictors.
Our first step is to download the raw data to our laptop (or put it on a server if you
have that kind of setup). Once the data is downloaded, we will import it into our
analytical tool.
Although we could connect directly to the data set and read it into our analytical
program, this can lead to unforeseen data consistency issues. Websites routinely
drop data sets, update existing data sets with new data, or corrupt those data sets.
There are plenty of ways that the data can change from the original data file, so it is
a good practice to download a copy of the data set and refer to that consistent
copy.
The raw data set contains 151 variables, including the “loan_status” variable, which
we will use to develop our target variable. Many of these variables are not relevant
or useful to our project and take up a lot of space. So, we will only import a limited
set of variables that we have selected using our prior business knowledge. These
variables will focus on items that we know are related to loan default risk.
Program 2.1 shows how to import a limited set of variables with all observations into your analytical environment. You will only need to change the sections that specify the file pathway of where you placed the raw data file. The Python version of the program begins by loading the pandas library:

import pandas as pd

(A minimal sketch of the rest of the Python import step is shown after the notes below.)
● Both Python and R are sensitive to whitespace characters. This means that
even though the formatting of the programming code above might look like
the filename is spread out over a couple of lines (due to publishing
standards), these languages will register this as an error. Remember to place
all pathway specifications on the same line when using Python or R.
● Python and R require you to import specific libraries to handle the data
properly, while SAS does not require you to import any additional libraries
to manipulate data.
● The raw data is in a .csv format. Python uses the “read_csv” function and R uses the “fread” function to import delimited files, while SAS relies on the “DBMS” option within the PROC IMPORT procedure to process the delimited file.
● Each programming language has its own way of limiting the selection of
variables during the data import step.
o In SAS, the variable names are placed into a global macro variable, and those variables are retained using a “keep” statement.
● I use the “pd.DataFrame()” function to transform the Python data set into a pandas DataFrame. This format provides several benefits that we will
examine in the following chapters.
● All three programming languages create a final data set named “loan” in
their respective environments.
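Pulling these notes together, here is a minimal Python sketch of the import step; the file path is a placeholder, and only a few of the selected variables are shown (the full list is given in Table 2.1):

import pandas as pd

# Variables selected using prior business knowledge (abbreviated, assumed list).
keep_vars = ["loan_status", "loan_amnt", "int_rate", "annual_inc", "dti"]

# Keep the full file path on a single line in your own code.
loan = pd.read_csv("C:/path/to/accepted_2007_to_2018Q4.csv",
                   usecols=keep_vars, low_memory=False)

# Mirrors the book's use of pd.DataFrame(); read_csv already returns a DataFrame.
loan = pd.DataFrame(loan)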
Sampling
The Lending Club data set is not particularly large compared to data science projects
that use petabytes of data to make predictions. However, to make data processing
more efficient and ensure that all of our computers can easily perform all of the
data manipulation and machine learning techniques that we will perform in the
following chapters, we will create a random sample of the data set.
There are four main sampling techniques that we could perform on a data set.
These include:
For this project, we will use simple random sampling to reduce the data set's size
from 2.2 million observations to 100K observations. In later chapters, we will decide
whether to oversample or undersample the population, but for now, we will simply
create a simple random sample of the full data set to reduce its overall size while
maintaining the original data distributions of each variable.
One of the main reasons that we are not going to create a balanced data set at this
stage is that we will have to impute some missing values, adjust for outliers, and
perform a few other data manipulation techniques to prepare the data for
modeling. To perform these adjustments properly, we must base any adjustments
on the original data distributions.
Program 2.2 shows how to perform simple random sampling with a seed value. The
seed value ensures that if we were to rerun the sampling program, we would
achieve the same results (because in computer science and life, nothing is truly
random). I have chosen the seed value of 42. It doesn’t matter what value you
choose for the seed value; however, since Douglas Adams taught us that the answer
to life, the universe and everything is 42, that value is good enough for us.
● All three programming languages explicitly state the seed value of 42.
● All three programming languages create a final data set named “loan_samp”
in their respective environments.
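For reference, here is a minimal Python sketch of the sampling step described in Program 2.2, assuming the imported DataFrame is named loan as above:

# Draw a simple random sample of 100,000 observations with a fixed seed
# so that the sample can be reproduced exactly.
loan_samp = loan.sample(n=100000, random_state=42)

print(loan_samp.shape)  # (100000, number of selected variables)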
Chapter 2: Gathering Data – Conclusion and Transition
In this chapter, we explored the essential steps involved in gathering data, from
importing data from various sources into SAS, Python, and R, to ensuring that the
data is clean and ready for analysis. We discussed best practices for handling
different types of data and emphasized the importance of data quality in laying a
strong foundation for your data science projects.
With your data successfully gathered and prepared, the next step in your journey is
to transform this raw data into a format suitable for modeling. This is where the
concept of a modeling data set comes into play. In the next chapter, we will dive
into the critical process of creating a modeling data set, which involves selecting the
right features, handling missing data, and preparing the data for the specific
requirements of your predictive models.
Understanding how to create a modeling data set is crucial because it bridges the
gap between raw data and the machine learning algorithms that will analyze it. The
decisions you make in this phase will directly impact the performance and accuracy
of your models.
So, with your data ready to go, let’s move on to Chapter 3, where you’ll learn how
to craft a data set that will maximize the potential of your models and set the stage
for successful predictive analysis.
Chapter 2 Summary: Gathering Data
● GIGO Principle: The chapter stresses that the volume of data does not
compensate for poor data quality. High-quality data is essential for building
models that are accurate, reliable, and generalizable.
● Project Overview: The Lending Club data set is introduced as the primary
data set for the book’s ongoing project – a risk model to predict loan
default. The chapter describes how the data set is structured, the types of
variables it includes, and the need to create a target variable for the model.
4. Sampling Techniques
● Types of Sampling:
5. Comparison and Choosing the Right Data Import and Sampling Techniques
6. Looking Ahead
● Transition to Data Transformation: The chapter prepares the reader for the
next steps in the data science process, which involve transforming the
sampled data set into a modeling data set by performing data
transformations and imputations.
Chapter 2 Quiz
Questions:
1. What is the significance of the "Garbage In, Garbage Out" (GIGO) principle
in data science?
2. Why is high-quality data essential for building reliable and robust models?
3. Describe the structure of the Lending Club data set used in this project.
4. Explain the process of creating a target variable for the Lending Club risk
model.
5. How does the PROC IMPORT procedure in SAS facilitate data import and
variable selection?
6. What is the role of the pandas library in Python for importing data?
7. How does R's fread function from the data.table package assist in data
import and selection?
13. How does the use of a seed value in sampling ensure reproducibility?
14. Compare the data import and sampling techniques in SAS, Python, and R.
15. What factors should be considered when choosing a data import method?
16. Why might a data scientist choose to sample a data set before analysis?
18. What are the benefits of reducing the size of a data set through sampling?
20. What are the next steps after importing and sampling data in a data science
project?
Chapter 2 Cheat Sheet
Best Practices
  SAS:    Ensure consistency by downloading a static version of the data set before import; always set a seed value for reproducibility
  Python: Use pandas.read_csv() with selective column import for efficiency; set random_state for consistent sampling
  R:      Use fread() for efficient data import and column selection; set set.seed() for reproducibility in sampling
Overview
In data science, we often encounter a stark contrast between the pristine academic
data sets used for education and the messy nature of real-world data used for
decision making. While academic examples like the well-known “Iris” data set for
clustering, the “MNIST” data set for classification, or the “Real Estate Price
Prediction” data set for regression modeling offer clean and structured data, real-
world data sets present unique challenges and complexities. These data sets are not
meticulously constructed for machine learning; rather, they serve as repositories of
events and observations from various domains.
Consider the diverse scenarios that real-world data encompasses, such as bank
transactions, car accidents, disease conditions, forest fires, lost luggage, or even the
historical pricing of Van Halen tickets. These data sets reflect the intricacies and
nuances of our world, carrying invaluable insights waiting to be unlocked through
the lens of data science.
By combining the art and science of data science, we transform messy real-world
data into a refined and structured modeling data set. We equip ourselves with the
ability to extract hidden patterns, build accurate models, and make informed
decisions.
Data science is a field that combines both art and science to extract insights and
value from data. On the one hand, there is the science of data science, which
involves using statistical methods, machine learning algorithms, and computer
science techniques to analyze and model data. This requires a solid foundation in
mathematics, programming, and data manipulation skills. On the other hand, the art
of data science involves creativity, intuition, and domain expertise to identify
meaningful patterns and relationships in the data that may not be immediately
obvious.
Every data science project will have inflection points where a decision that will
impact the entire project must be made. These decision points could be as
foundational as defining your business problem or deciding what data set you will
use. Other decision points will concern the actual data or your modeling approach.
These decision points are part of the “art” of data science. Although we have been
trained in the mathematical and programming skills essential to understand
machine learning algorithms and code them from scratch, if necessary, these skills
cannot tell you how to frame your business question or if you should cap your
outliers. The holistic view, strategy, framework, and approach are all part of the Art
of Data Science.
Throughout this book, I will point out the critical decision points that data scientists must make in our sample project. Special symbols in the text will indicate these decision points and other critical pieces of information.
To be successful in data science, one must possess both the technical skills and the
ability to think critically and creatively. It is not enough to simply apply algorithms
and models to data; one must also have a deep understanding of the problem
domain and be able to ask the right questions to extract the insights that matter.
Furthermore, data science is a constantly evolving field, and practitioners must be
willing to adapt and learn new skills to keep up with the latest developments.
Ultimately, the art and science of data science come together to create a powerful
approach to solving complex problems and driving innovation. By combining
analytical rigor with creative thinking, data scientists can uncover new opportunities
and insights to help organizations make better decisions and improve their
operations.
Project Overview
In the previous chapter, we identified the Lending Club data set as the data set we
will use to answer our business questions. Now, we need a more formal definition
of our business problem to ensure we correctly frame the issue.
Default risk is a significant issue for lending platforms. How can a potential lender
decide who they should lend money to? The highest interest rates also come with
the highest probability of default. That is essentially what loan interest is. It is the
trade-off between risk and income. This is why people with long and stable credit
histories pay low interest rates, because they have proved that they are low-risk
borrowers, and the lender has a high probability of getting their money back with
interest.
The goal of this project is to design a predictive model that calculates the probability
of a borrower defaulting on a loan. This information can be used in two ways. First,
we can use the probability of default metric at the application stage and deny
applicants with a high probability of default.
Secondly, this information may help identify existing customers who are about to
default on their debts. The lender could help these clients by identifying them and
providing assistance in the form of postponed payments, loan restructuring, or
lowered interest rates. These steps may reduce losses from loan defaults and
provide proactive customer support to borrowers who are having trouble with their
loan payments.
In the previous chapter, we accessed the Lending Club data set, downloaded it to
our local PC, and fed it into our statistical programs. We must create a target
variable and explore the relationship between the target and the predictors.
Remember to use the data set found in the GitHub repository for this book:
https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide
Target Variable
There are several reasons why the target variable is so important in modeling:
● Defines the problem: The target variable defines the problem the model
attempts to solve. By identifying the target variable, we can clearly
articulate the research questions and the objective of the analysis.
● Determines the model type: The model used for analysis depends on the
type of target variable. If the target variable is categorical, we would use
classification models. If the target variable is continuous, we would use
regression models.
● Guides data preparation and cleaning: The choice of the target variable
influences the quality and structure of the data required for the analysis.
Data preparation and cleaning techniques should be chosen with the target
variable in mind.
Classification and regression are two types of statistical models commonly used to
analyze data with distinct target variables. Table 3.1 provides some general
descriptions of regression and classification models.
Table 3.1: General Descriptions of Regression and Classification Models

General Description
● Regression: predicts a continuous output value using training data.
● Classification: predicts the group or class to which an observation belongs.

Example
● Regression: predicting the value ($ amount) of an insurance claim using training data.
● Classification: predicting the type of an insurance claim (fraud vs. non-fraud) using training data.

Target Variable
● Regression: if the target is a real number (continuous), it is a regression problem.
● Classification: if the target is a discrete/categorical variable, it is a classification problem.
2. Decision Trees: Tree-based models that recursively split the data based on
input features to make predictions.
On the other hand, classification models focus on predicting discrete class labels or
categories. Some popular machine learning techniques for classification include:
2. Decision Trees: Similar to regression, decision trees are also commonly used
for classification tasks. This technique uses recursive partitioning of the data
to categorize a prediction.
Multi-Target Models
A single machine learning model can have two or more separate target variables. This scenario is known as multi-output or multi-target regression/classification. Instead of predicting a single target variable, the model aims to predict multiple target variables simultaneously.
The appropriate multi-output modeling approach depends on the problem and the relationships between the target variables, and a number of different techniques can be used for multi-output modeling.
The choice of technique depends on the nature of the problem, the relationships
between the target variables, and the available data. When designing the model, it
is important to consider the dependencies or correlations between the target
variables, as they can impact the model's performance and interpretability.
Defining the target variable is a crucial step in creating a modeling data set.
Sometimes, a data set already contains a variable that can serve as the target,
particularly when working with historical data. For example, in marketing
campaigns, we might know who responded positively or negatively to previous
campaigns, and in financial domains, there may be records of account defaults or
loan applications. These pre-existing target variables are typically binary (e.g., 1/0 or
Yes/No), providing a clear objective to predict or classify.
Using these pre-existing variables enables you to:
● Enhance Model Accuracy: Develop predictive models that are more likely to
yield accurate results and actionable insights.
However, not all data sets come with a pre-existing target variable. In such cases,
you’ll need to create one based on the available data. This process involves selecting
and transforming relevant variables and applying your domain knowledge to define
the desired outcome.
Example:
Consider a scenario where you're working with a data set that tracks customer
behavior, but it lacks a predefined target variable for predicting customer churn. In
this case, you would examine variables such as customer activity, purchase history,
and engagement metrics. By setting specific criteria or thresholds based on data
patterns, you can define a target variable that indicates the likelihood of customer
churn.
However, this definition alone does not fully explain the nature of the
variable or how we can use it to construct a meaningful target
variable for modeling. Understanding and defining the target variable
is essential to ensuring the success of your predictive models.
When constructing your own target variable, you need a deep understanding of both the data and the problem domain.
By taking charge of crafting your target variable, you empower yourself to tackle
prediction and classification tasks where no predefined target variable exists,
unlocking the data's predictive potential and building models that provide valuable
insights.
A key decision in data science is identifying or creating the target variable. The
Lending Club data set, for instance, doesn't include a predefined target variable.
However, it does have a variable called “loan_status,” which describes the current
state of the loan.
The loan status field includes categories such as paid, current, late, charged off, or in
default. By running a frequency distribution on this field, you can see the
distribution of accounts across these categories. This step is essential in
understanding the data and deciding how to use it to construct a meaningful target
variable.
Table 3.2 shows the SAS output of the frequency distribution. We can see that the
“loan_status” variable consists of eight categories and has only one missing value.
If we are trying to create a target variable that indicates if someone is a credit risk,
then we would want to identify the “Charged Off” and the “Does not meet the
credit policy. Status: Charged Off” categories as positive indicators for credit risk
and all other categories as a negative indicator for credit risk. An account with a
loan status of “Late” does not meet our definition of risk; therefore, those
categories will not be included in our risk target variable.
Program 3.2 shows the development of the target variable in each of our
programming languages. We are simply creating a binary target variable defined as:
if the loan status is either “Charged Off” or “Does not meet the credit policy.
Status:Charged Off,” then the binary indicator will be 1. For all other categories, the
binary indicator will be 0. The name of the target variable will be “bad.”
Python Programming

import numpy as np

loan_samp['bad'] = np.where((loan_samp['loan_status'] == 'Charged Off') |
                            (loan_samp['loan_status'] == 'Does not meet the credit policy. Status:Charged Off'),
                            '1', '0')
A final frequency distribution of the newly created target variable shows that the
event rate is 11.91%. However, this metric represents the overall event rate across
the entire data set. Sometimes, the event rate will vary widely across different time
periods, and this can tell us a lot about the data.
bad    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0      88089        88.09      88089                   88.09
1      11911        11.91      100000                  100.00
Table 3.3 below shows the difference between the output generated by the three
programming languages and environments. The Python and R output looks
identical, with the “issue_d” variable being interpreted as a character variable. The
SAS environment read the “issue_d” variable as a date variable and placed the
output in the correct logical order for a date field.
For Python and R, we would need to convert the character formatted “issue_d”
variable to a date format. Program 3.4 shows the simple conversion code for each
program language.
Python Programming

import pandas as pd

loan_samp['issue_d'] = pd.to_datetime(loan_samp['issue_d'])
Please note that both the R and Python programming languages required us to convert the “issue_d” and the “earliest_cr_line” variables into a date field. However, SAS identified the proper format for these fields as date fields. The code contained in Program 3.4 shows how to convert the R and Python “issue_d” fields into date formats.
This code would be included in a DATA step if necessary.
Once we have correctly formatted the “issue_d” field, we can create a chart that
plots the account volume as bars and the event rate as a line, using the newly
formatted “issue_d” field as the X axis.
Figure 3.1 shows the account volumes and default rates by month. This chart shows us two critical pieces of information. First, the account volume starts very low, resulting in wild swings in the event rate. Second, the event rate begins to trail off over roughly the last 18 months of the data following the stable period. This tells us that borrowers generally need about a year and a half of credit availability before they default.
Figure 3.1 shows us that the Lending Club data set has a significant ramp-up period
that starts in July 2007 and goes up to about January 2014. Notice how much the
event rate fluctuates during this time period. These fluctuations are due to low
monthly account volume. Around January 2014, the monthly account volume was
large enough that the event rate stabilized. This leads us to believe we should not
use data before January 2014.
Around January 2017, the event rate began to decrease steadily. This behavior
seems strange because the account volume is still high, and there had been stability
for years prior to this point. Since there had not been any external factors, such as
federal interest rate cuts or consumer bailout programs introduced, that would
have affected this population, we can conclude that the steady reduction in default
rates is a result of what we can call “runway time.”
In this context, “runway time” means the general amount of time that must pass for
an event to occur. In our example, people do not generally default on their loans
within the first couple of months of acquiring them. The general “runway time” for
loan default can be anywhere between six months to five years, depending on the
type of loan.
By examining the event rate over time, we can determine that the runway time for this event is about 18 months. The decline in the event rate began around January 2017 and bottomed out at the end of the observed data. This leads us to believe we should not use data after January 2017.
Program 3.5 shows the code to limit the data set to the January 2014 to January
2017 time period.
Python Programming

loan_data = loan_samp[(loan_samp['issue_d'] >= '2014-01-01') &
                      (loan_samp['issue_d'] <= '2017-12-31')]
Table 3.4 shows that filtering the data set to the specified date range has decreased
the overall size of the data set by about a third while significantly increasing the
overall default rate.
bad    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0      58065        85.44      58065                   85.44
1      9897         14.56      67962                   100.00
We needed to make these up-front decisions about date-range filtering because, when we analyze our predictor variables, they should be much more stable over time due to the increased account volume. Also, if we need to impute any missing values, we will have a stable data set on which to base those imputations.
Now that we have limited the data set to a specified date range, it consists of 67,962
observations and 24 variables.
Predictive Variables
Predictive variables are points of information that you are using to predict the state
of the target variable. In the previous chapter, we limited the number of predictive
variables to a group that made sense from a business perspective. Table 3.5 shows
the SAS output that displays the metadata content for the data set. We can easily
see that there are 24 variables in the data set. These variables are a combination of
character and numeric variables. They contain a unique identifier (“id”), two date
variables (“issue_d” and “earliest_cr_line”), and a binary target variable that we
developed labeled “bad.”
Table 3.5: SAS PROC Contents Output – Metadata Content for the Data Set
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
1 Id Num 8 BEST9. BEST9.
2 loan_amnt Num 8 BEST5. BEST5.
3 Term Char 9 $CHAR9. $CHAR9.
4 int_rate Num 8 BEST5. BEST5.
5 Grade Char 1 $CHAR1. $CHAR1.
6 sub_grade Char 2 $CHAR2. $CHAR2.
7 emp_length Char 9 $CHAR9. $CHAR9.
8 home_ownership Char 8 $CHAR8. $CHAR8.
9 annual_inc Num 8 BEST9. BEST9.
10 verification_status Char 15 $CHAR15. $CHAR15.
11 issue_d Num 8 MMDDYY10. MMDDYY10.
12 loan_status Char 51 $CHAR51. $CHAR51.
13 purpose Char 18 $CHAR18. $CHAR18.
14 dti Num 8 BEST6. BEST6.
15 earliest_cr_line Num 8 MMDDYY10. MMDDYY10.
16 open_acc Num 8 BEST2. BEST2.
17 pub_rec Num 8 BEST2. BEST2.
18 revol_bal Num 8 BEST7. BEST7.
19 revol_util Num 8 BEST5. BEST5.
20 total_acc Num 8 BEST3. BEST3.
21 application_type Char 10 $CHAR10. $CHAR10.
22 mort_acc Num 8 BEST2. BEST2.
23 pub_rec_bankruptcies Num 8 BEST1. BEST1.
24 bad Num 8
Exploratory Data Analysis (EDA) is a critical phase in the data science journey,
allowing us to unravel the secrets hidden within the data. It involves
comprehensively examining the data set and uncovering patterns, trends, and
relationships that provide valuable insights. EDA provides us with a deep
understanding of the data's characteristics, guiding subsequent steps in data
preparation, feature engineering, and modeling.
The importance of EDA cannot be overstated. It allows us to gain familiarity with the
data, identifying potential issues such as missing values, outliers, or inconsistencies.
Visualizing the data through plots, charts, and statistical summaries lets us grasp its
distribution, central tendencies, and dispersion. Moreover, EDA helps us explore
relationships between variables, highlighting correlations or dependencies that
inform feature selection and modeling decisions.
EDA serves as the foundation upon which data-driven decisions are made. It enables
us to ask the right questions, validate assumptions, and generate hypotheses.
Through visualizations and statistical techniques, we gain insights that go beyond
mere numbers, fostering a deeper understanding of the underlying mechanisms
driving the data. Armed with these insights, we can make informed decisions,
identify pitfalls, and unlock the data's true potential.
The first step of exploratory data analysis is to create a statistical summary of the
numeric variables of the data set. Program 3.6 shows how to create a summary
overview of the numeric values in your data set.
Python Programming

pd.set_option('display.max_rows', len(loan_data.index))
pd.set_option('display.max_columns', len(loan_data.columns))
loan_data.describe()
● The SAS statement appears verbose, but I wanted additional statistics in the
output. I could have run the PROC MEANS statement without specifying
exactly which variables I wanted to display, and it would have output the
standard default statistics of N, MEAN, MIN, STD, and MAX. However, the
next section will discuss outliers, and I wanted to show the information
necessary for identifying outliers.
Table 3.6: Summary Statistics for the Numeric Variables

Variable              N       N Miss   Min    P1     P25    P75     P99      Max       Mean       Median   Skewness
loan_amnt             67962   0        1000   1600   8000   20000   36400    40000     14,968.40  12975     0.74
int_rate              67962   0        5.32   5.32   9.49   15.61   27.31    30.99     13.09      12.69     0.83
annual_inc            67962   0        0      18000  47000  94000   261813   8900000   78,349.63  65000    50.66
dti                   67937   25       0      2.03   12.27  24.77   39.58    999       18.92      18.23    23.45
revol_bal             67962   0        0      226    6131   20493   103632   805550    16,949.58  11458     8.53
revol_util            67911   51       0      1.6    33.3   70.1    98.6     154.3     51.53      51.4     -0.02
pub_rec_bankruptcies  67962   0        0      0      0      0       1        6         0.14       0         3.19
This simple table of information tells us a lot about the data set. Starting from the
top, the “loan_amnt” variable shows us that the minimum loan available through
Lending Club is $1000, and the maximum amount is $40K. The mean value is slightly
higher than the median value. That tells us that although there are some outliers on
the high end, the average loan amount is about $15K across all observations. The
skewness is slightly positive, confirming our suspicion of high-end outliers.
Program 3.7 shows the programming code to develop histograms for each language.
You can investigate each of the variables to have a much better understanding of
the data. This is often much more beneficial than complicated statistical techniques.
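As a minimal Python sketch of the idea (the variable choice, bin count, and matplotlib calls here are illustrative assumptions rather than the exact code in Program 3.7):

import matplotlib.pyplot as plt

# Histogram of annual income, assuming the loan_data DataFrame created in the earlier programs
loan_data['annual_inc'].plot(kind='hist', bins=50)
plt.xlabel('annual_inc')
plt.ylabel('Frequency')
plt.title('Distribution of Annual Income')
plt.show()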
Understanding the range, mean, and skewness of the data is essential. For example,
let’s look at the “annual_inc” variable. When using financial data, there are often
outliers on the high and low ends. The table of information above shows us that the
“annual_inc” variable has a minimum value of 0 and a maximum value of $8.9MM.
Both of these values appear suspect.
Generally, loans are not distributed to individuals with no income, and if you have
an annual income of $8.9MM, then you are not usually using a crowd-sourced loan
platform such as Lending Club. I have a feeling that the high-end outlier values were
erroneously entered. Their annual income is probably $89K, but they included .00
to represent cents, and the intake field does not recognize the period. Although we
cannot verify it, that is probably true for values over $250K.
Figure 3.3 shows a histogram for the “annual_inc” variable. This graphic confirms
that only a few high outliers are skewing the data set.
Outliers are a critical issue in data science and general data analysis. It is important
to know how to identify outliers and what to do with them (if anything) once you
do.
Outliers in a data set can significantly impact our analysis and modeling, warranting careful attention and treatment. Left untreated, outliers can skew summary statistics such as the mean and standard deviation, distort the relationships estimated by our models, and cloud our understanding of the underlying patterns within the data.
Winsorizing
Winsorizing is a technique for managing outliers that replaces extreme values with
less extreme values. Instead of removing or transforming outliers completely,
winsorizing truncates the extreme values at a specified percentile and replaces
them with the nearest values within that percentile. This approach helps reduce the
impact of outliers without completely eliminating their influence on the data.
In winsorizing, the lower and upper tails of the data distribution are trimmed to a
certain percentile, often chosen as a small percentage like 1% or 5%. Any values
below the lower percentile are replaced with the value at that percentile, and any
values above the upper percentile are replaced with those at that percentile. This
ensures that extreme outliers are "trimmed" or "capped" to a more moderate
range.
By winsorizing the data, we retain the overall shape and characteristics of the distribution while mitigating the influence of extreme values. This approach is particularly useful when you do not want to discard the affected observations outright but suspect that their extreme values reflect measurement errors or rare circumstances.
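As a minimal Python sketch of this idea (the 1st and 99th percentile limits and the new column name are illustrative assumptions):

# Winsorize annual income at the 1st and 99th percentiles
lower, upper = loan_data['annual_inc'].quantile([0.01, 0.99])
loan_data['annual_inc_wins'] = loan_data['annual_inc'].clip(lower=lower, upper=upper)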
For example, to find the outlier threshold for the annual income variable
(“annual_inc”) in our data set, we can refer to Table 3.6 above and see that the 25th
percentile value for annual income is $47,000 while the 75th percentile value is
$94,000. The interquartile range would be ($94,000 - $47,000 = $47,000). We would
then multiply $47,000 by 1.5 to get $70,500. Finally, we would add this new value to
the third quartile to get the upper threshold to identify outliers ($94,000 + $70,500
= $164,500). We can interpret this value as any observation with an annual income
above $164,500 is considered an outlier.
Table 3.7 below shows each piece of the IQR methodology and the calculations for
the annual income and revolving balance variables. It is important to note that
although the low-end adjusted values are below zero, we often set a floor for those
variables at zero. It does not make sense for the data to contain values below zero
for these variables.
The table above shows that for the “Annual Income” variable, the high-end adjusted
value would be capped at $164,500. Any values over this amount would be replaced
with $164,500. However, the low-end adjusted value is negative. Therefore, we
would simply set the floor to zero. Any values less than zero are set to zero. This
makes sense in this business problem because it is not appropriate to have negative
values for annual income (especially if you are applying for a loan).
However, if you are going to create a dashboard report that shows each variable’s
average value or if you are going to create a linear regression model for prediction,
then these processes will be significantly impacted by outliers. This would require us
to transform these outliers.
This simple logic to adjust for outliers using the IQR technique can result in some verbose code because there are several steps in the process:
1.) Select the numeric variables to evaluate.
2.) Calculate the 25th and 75th percentile values for each variable.
3.) Calculate the interquartile range (IQR).
4.) Calculate the upper and lower bounds for the outliers.
5.) Replace the outlier values with the 1.5 IQR values.
Python Programming

numeric_data = filter_numeric_columns(loan_data)
outliers = replace_outliers(numeric_data)
outliers.describe()

R Programming

library(dplyr)

SAS Programming

DATA outliers;
    /*Select numeric variables*/
    SET loan_data (KEEP=_NUMERIC_);
    ARRAY vars(*) loan_amnt -- pub_rec_bankruptcies;
    DO i = 1 TO dim(vars);
Each program accesses the current data set LOAN_DATA and applies the five steps
to adjust for outliers using the 1.5 IQR rule. The R and Python programs use a
function to make the adjustments in the examples provided. For SAS, I decided to
use a single DATA step to make the adjustments to demonstrate that all of the steps
can be performed in a single DATA step.
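For readers who want a compact end-to-end version of the five steps, here is a minimal Python sketch that caps every numeric column at its 1.5 IQR bounds. The function name and the decision to clip at the raw lower bound are illustrative assumptions, not a reproduction of the listing above; in practice you may floor variables such as annual income at zero, as discussed earlier.

import numpy as np

def cap_outliers_iqr(df):
    """Cap each numeric column at Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    capped = df.copy()
    numeric_cols = capped.select_dtypes(include=[np.number]).columns   # step 1: numeric variables
    for col in numeric_cols:
        q1 = capped[col].quantile(0.25)                                # step 2: 25th percentile
        q3 = capped[col].quantile(0.75)                                # step 2: 75th percentile
        iqr = q3 - q1                                                  # step 3: interquartile range
        lower = q1 - 1.5 * iqr                                         # step 4: lower bound
        upper = q3 + 1.5 * iqr                                         # step 4: upper bound
        capped[col] = capped[col].clip(lower=lower, upper=upper)       # step 5: replace outliers
    return capped

outliers = cap_outliers_iqr(loan_data)
outliers.describe()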
Data Inference
Data inference is a statistical technique that uses patterns observed in a sample to
make inferences about a larger population. One way to use data inference to deal
with outliers in a modeling data set is to use imputation methods. Imputation
involves replacing values with estimated values based on the patterns observed in
the data.
To use imputation for handling outliers, you can first identify the outliers in the data set using visualization or statistical methods. Then, you can set those outlier values to missing and impute them using a method robust to outliers, such as the median or mode.
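A minimal Python sketch of this approach for the annual income variable might look like the following; the $164,500 cutoff is the IQR-based threshold calculated earlier, and working on a copy with median imputation is an illustrative choice rather than the book's prescribed code.

import numpy as np

# Work on a copy so the original column is preserved
income = loan_data['annual_inc'].copy()

# Treat IQR-flagged outliers as missing, then impute them with the median
income[income > 164500] = np.nan
income = income.fillna(income.median())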
We will review the programming code to develop these predictive models in the
predictive modeling chapter of this book.
Data Rejection
Finally, another option is to remove observations where outliers occur. This is
obviously not a data transformation like our other two options. We are always
hesitant to throw out data. It is generally a better approach to use one of the
adjustment methods described above. However, there are some circumstances
where it doesn’t make sense to keep outliers because they could represent bad-
quality data, making the whole observation useless. Then, feel free to drop these
observations.
Feature selection is a crucial step in data science that involves selecting relevant
variables to include in a model. However, the need for feature selection depends on
the specific machine learning problem and the algorithms being used.
For example, decision trees and random forests can handle many features and have
built-in feature selection methods that can automatically select the most
informative features. In contrast, linear models such as linear regression and logistic
regression can benefit from feature selection techniques to reduce the impact of
irrelevant or redundant features.
It is important to note that feature selection should always be done cautiously and
with a good understanding of the problem domain. Blindly removing features can
sometimes result in the loss of important information, and it is essential to evaluate
the impact of feature selection on model performance before making a final
decision.
Overall, feature selection is an important step in machine learning that can improve
the models' performance, efficiency, and interpretability.
There are several techniques for feature selection, including filtering, wrapper, and
embedded methods.
In the example provided in Program 3.9, the data was limited to numeric values
only, and the correlation threshold was set to 0.8. Any correlated features with a
correlation value greater than 0.8 are eliminated from the output data set, which is
labeled “corr_limit.”
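For comparison, a minimal Python sketch of the same correlation filter is shown below. The 0.8 threshold matches Program 3.9, but the variable names (corr_matrix, upper, to_drop) are illustrative, and numeric_data is assumed to hold the numeric columns of loan_data.

import numpy as np

# Pairwise absolute Pearson correlations among the numeric predictors
corr_matrix = numeric_data.corr().abs()

# Keep only the upper triangle so each pair of variables is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Drop one member of any pair correlated above the 0.8 threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
corr_limit = numeric_data.drop(columns=to_drop)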
Figure 3.4 below shows the R output for correlation visualization. This figure shows
us that there are a few variables that are significantly correlated with one another.
The “total_acc” and “open_acc” variables are correlated. This makes sense because
the “open_acc” variable is a natural subcategory of the “total_acc” variable.
However, if we look at the correlation table, we can see that the Pearson correlation statistic between these two variables is 0.7145. Because this value falls below our 0.8 threshold, neither variable will be removed by the correlation filter.
The reason that the R output contains “?” values for the variables “dti” and
“revol_util” is that they both have a small number of missing values. The “dti”
variable has 25 missing values, while the “revol_util” variable has 51 missing values.
If we want R to ignore the missing values and provide the correlation charts and tables without "?" or "NA" values, we simply need to add a use argument to our correlation code. For example: cor(numeric_data, use = "complete.obs").
All three programming languages, SAS, Python, and R, share commonalities and
differences regarding wrapper methods in feature selection.
Commonalities:
Differences:
● Available packages: R offers packages such as caret, mlr, and glmnet that provide specialized functions for feature selection and wrapper methods.
● Ecosystem and community support: Each language has its own ecosystem
of packages, libraries, and community support. Python boasts a rich
ecosystem with a wide range of machine learning and data science libraries,
making it highly versatile and widely adopted. R strongly focuses on
statistical analysis and provides a comprehensive collection of packages for
various modeling tasks. As proprietary software, SAS has its own ecosystem and tools tailored for data analysis and modeling.
While the core principles of wrapper methods remain the same across the three
languages, the differences lie in the specific syntax, implementation options,
available packages, and the overall ecosystem supporting feature selection and
wrapper methods.
One of the wrapper methods commonly implemented and available across all three
programming languages is Recursive Feature Elimination (RFE). RFE is a popular
wrapper method for feature selection in machine learning.
RFE works by recursively eliminating features and evaluating their impact on model
performance. It starts with the complete set of features and iteratively eliminates
less important features until a specified number or a desired subset of features
remains. During each iteration, the model is trained and evaluated, typically using
cross-validation or a performance metric, and the feature with the lowest
importance is removed. This process continues until the desired number of features
is reached.
RFE is widely used because it can be applied to various machine learning algorithms
and is agnostic to the specific modeling technique being used. It focuses on the
model's predictive performance and provides a ranking or selection of features
based on their importance.
The implementation of RFE may vary across programming languages and libraries, but the underlying concept remains the same. SAS, Python (through libraries like scikit-learn), and R (through packages like caret) all provide ways to implement RFE.
Python Programming

import pandas as pd
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Filter numeric columns including the target variable
numeric_columns = loan_data.select_dtypes(include=[np.number]).columns
numeric_columns = numeric_columns.append(pd.Index(['bad']))  # Include the target variable
loan_data_numeric = loan_data[numeric_columns].copy()
Table 3.9 below shows the SAS output for the GLMSELECT procedure. This output
shows that the method incorporated the backward selection methodology, which
removed five variables from the list of predictors. The algorithm selected eight
variables with a significant relationship with the target variable.
It’s important to note that these variables have not been adjusted from the
correlation or outlier analysis introduced previously. This raw data set results in a
poor-quality model; however, this technique was not employed to create a finalized
model. Instead, we are using this technique to identify variables significantly related
to the target variable and filter out those not significantly related to the target
variable.
The SAS output contains separate sections for the effects removed, an ANOVA table,
performance estimates, and parameter estimates. You can get these same pieces of
information from both Python and R; however, generating the additional output will
require more coding.
Table 3.9: Filtering with Wrapper Method – SAS Output – PROC GLMSELECT
Analysis of Variance

Source            DF      Sum of Squares   Mean Square   F Value
Model                 8       533.50893      66.68862     571.96
Error             67877      7914.24501       0.1166
Corrected Total   67885      8447.75394

Parameter Estimates

Parameter    DF   Estimate        Standard Error   t Value
Intercept     1    1.807715          0.068935        26.22
dti           1    0.000833          0.00012          6.96
int_rate      1    0.015374          0.000284        54.22
issue_d       1   -0.000092186       0.000003345    -27.56
loan_amnt     1    0.000000604       0.000000161      3.76
mort_acc      1   -0.008414          0.000721       -11.67
open_acc      1    0.001791          0.000247         7.25
pub_rec       1    0.010701          0.002024         5.29
revol_bal     1   -0.000000384       6.39E-08         -6.01
All three programming languages, SAS, Python, and R, share commonalities and
differences when it comes to filtering with embedded methods.
Commonalities:
Differences:
Embedded feature selection methods integrate feature selection directly into the
model training process. While the core principles are similar across SAS, Python, and
R, the implementation specifics differ based on each language's available libraries,
procedures, and packages. Data scientists can leverage the respective functionalities
provided by each language to implement embedded feature selection techniques in
their machine learning models.
Program 3.11 below shows the filtering with embedded methods implemented in all
three programming languages.
Python Programming

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Filter numeric columns including the target variable
numeric_columns = loan_data.select_dtypes(include=[np.number]).columns
numeric_columns = numeric_columns.append(pd.Index(['bad']))  # Include the target variable

SAS Programming

DATA work.filtered_data;
    SET work.loan_data;
RUN;
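A minimal sketch of how the embedded selection itself might be completed in Python is shown below. The random forest settings, the median importance threshold, and dropping rows with missing values are illustrative assumptions, not the book's exact code.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Build the modeling frame and drop rows with missing values for simplicity
model_df = loan_data[numeric_columns].dropna()
X = model_df.drop(columns=['bad'])
y = model_df['bad']

# Fit a random forest and keep features whose importance exceeds the median importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42),
                           threshold='median')
selector.fit(X, y)

selected_features = X.columns[selector.get_support()]
print(selected_features)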
Table 3.10 shows the Python output. This output shows the list of features selected
from the random forest classifier algorithm. Additional output options are available
for both the Python and R models; however, these options require additional code
to display specific output.
Table 3.10: Filtering with Embedded Method – Python Output – Random Forest
Classifier
print(selected_features)
Index(['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'revol_bal', 'revol_util',
'total_acc'], dtype='object')
In this section, we will delve into the art of feature engineering and explore its vital
role in building robust and accurate models. We will examine several fundamental
techniques that allow us to engineer features effectively, providing our models with
the best possible representation of the underlying data.
● Feature Scaling:
- Scaling numerical features to a consistent range (e.g., normalization
or standardization) ensures that no single feature dominates the
learning process. This is particularly important for algorithms
sensitive to feature scales, such as K-nearest neighbors (KNN) and
support vector machines (SVM).
● Feature Interaction:
- Creating new features by combining existing ones or by forming interaction terms between them can capture complex relationships that the individual features may fail to represent adequately.
Feature Scaling
Feature scaling is a preprocessing technique that aims to bring all data set features
onto a similar scale. It involves transforming the values of different features to a
consistent range, which helps improve the performance of many machine learning
algorithms. By scaling the features, we ensure that no single feature dominates the
learning process due to differences in their magnitudes.
● Distance-Based Algorithms:
- For algorithms that rely on distance calculations, scaling the features helps ensure that all features contribute equally to the distance calculations.
● Regularization Techniques:
- Regularization techniques, such as L1 and L2 regularization, prevent
overfitting in regression and classification models. Feature scaling
ensures that the regularization terms have a similar effect on all
features, preventing any particular feature from dominating the
regularization process.
● Numerical Features:
- Continuous numerical features usually require scaling. Examples
include age, income, and temperature.
● Categorical Features:
- Categorical features already encoded as numeric values, such as
binary variables (0/1), do not typically require scaling.
● Ordinal Features:
- Ordinal features have a natural order or ranking. Scaling might not
be necessary if the order is already preserved.
- However, if the range of values in ordinal features is large, scaling
can help improve model performance.
● Tree-Based Models:
- Tree-based models, such as decision trees and random forests, are
not sensitive to feature scaling. Scaling may not be required for
these algorithms.
● Distance-Based Models:
- Models that rely on distance measures, such as K-nearest neighbors
(KNN) and support vector machines (SVM), often require feature
scaling for optimal performance.
Ultimately, the decision to scale features depends on the specific data set, the characteristics of the features, and the algorithms being employed. Experimenting with and without scaling and comparing model performance is often the most reliable way to decide.
Program 3.12 provides the code examples for feature scaling in SAS, Python, and R.
Python Programming

import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric_cols = loan_data.select_dtypes(include=['float64', 'int64']).columns

scaler = StandardScaler()
loan_data_scaled = loan_data.copy()
loan_data_scaled[numeric_cols] = scaler.fit_transform(loan_data[numeric_cols])
However, it is important to note that the decision to exclude binary variables from
feature scaling depends on the specific context and the machine learning algorithm
being used. Some algorithms may be sensitive to the scaling of binary variables,
while others may not be affected.
In general, if you use an algorithm that is not sensitive to feature scaling, such as
tree-based models (e.g., decision trees, random forests), you can include binary
variables in the feature scaling process without significant impact. On the other
hand, if you're using algorithms sensitive to scaling, such as logistic regression or
support vector machines, it is advisable to exclude binary variables from feature
scaling to prevent unnecessary transformations.
Always consider the requirements and assumptions of the specific algorithm being
used and adjust the feature scaling approach accordingly.
Categorical variables are commonly encountered in many real-world data sets, and
they require appropriate encoding to be used effectively in machine learning
models. One popular technique for encoding categorical variables is the use of
dummy variables. Dummy variables create binary features for each unique category
within a categorical variable. Here are some key points to consider about dummy
variables:
● When to Use Dummy Variables: Dummy variables are typically used when
dealing with nominal or unordered categorical variables. These are variables
with categories that have no inherent order or ranking. Examples include
variables like color (red, blue, green), city (New York, London, Paris), or
product type (A, B, C). By creating dummy variables, each category becomes
a separate binary feature, preserving the distinctiveness of the categories
without imposing any numerical order.
● Dealing with High Cardinality: When dealing with categorical variables with
many unique categories (high cardinality), creating individual dummy
variables for each category can lead to an explosion in the feature space. In
such cases, it may be necessary to apply additional techniques like feature
hashing or entity embeddings to reduce dimensionality while still capturing
useful information from the categorical variable.
home_ownership   Frequency   Percent   Cumulative Frequency   Cumulative Percent
ANY                     19      0.03                     19                 0.03
MORTGAGE             33484     49.27                  33503                49.30
OWN                   7695     11.32                  41198                60.62
RENT                 26764     39.38                  67962               100.00
We want to create a binary numeric indicator for each level of this categorical
variable. We could perform this manually with explicit code as demonstrated in
Program 3.13:
DATA dummy;
SET loan_data;
IF home_ownership = 'ANY' THEN ANY_IND = 1; ELSE ANY_IND = 0;
IF home_ownership = 'MORTGAGE' THEN MORT_IND = 1; ELSE MORT_IND = 0;
IF home_ownership = 'OWN' THEN OWN_IND = 1; ELSE OWN_IND = 0;
IF home_ownership = 'RENT' THEN RENT_IND = 1; ELSE RENT_IND = 0;
RUN;
Although this approach works perfectly fine, it can become burdensome when a
categorical variable has several levels. A more efficient approach is to use the
methods and procedures available in each programming language. Program 3.14
shows how to create dummy variables efficiently in each programming language.
Python Programming

dummy_vars = pd.get_dummies(loan_data['home_ownership'], prefix='home_ownership')
loan_data = pd.concat([loan_data, dummy_vars], axis=1)
loan_data.drop('home_ownership', axis=1, inplace=True)
Remember that it is important to know how many levels there are for each of your
categorical variables. If you have a variable encoded as a character variable that has
hundreds of levels, such as “job category” or “ZIP code,” then this will result in a
large number of dummy variables, which can lead to issues with high cardinality.
● Sparse data: When dealing with high cardinality variables, some categories
may have very few observations or occur infrequently in the data set. This
leads to sparse data, where many dummy variables have a low frequency of
occurrence. Sparse data can pose challenges for certain modeling
techniques that assume a sufficient number of observations in each
category.
● Overfitting: When the model has a separate dummy variable for every category, rather than generalizing well to unseen data, the model may learn patterns specific to each category, including noise or random fluctuations. Overfitting can lead to poor model performance and a lack of generalization ability.
Feature Interaction
Feature interaction refers to the phenomenon where the combined effect of two or
more features in a machine learning model significantly impacts the target variable.
It recognizes that the relationship between features and the target variable may not
be linear or independent but can be influenced by their interactions. By capturing
these interactions, we can improve the model's predictive power and gain deeper
insights into the data.
There are four major topics related to feature interaction in machine learning; one of the most common techniques is the creation of polynomial features. Program 3.15 provides the code examples for developing polynomial features for the numeric feature “loan_amnt” in each of the programming languages:
SAS Programming

DATA poly;
    SET loan_data;
    loan_amnt_poly = loan_amnt**2;
RUN;
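A corresponding minimal Python sketch is shown below; it mirrors the SAS DATA step with a direct squared term (scikit-learn's PolynomialFeatures is an alternative when you want many polynomial and interaction terms generated at once).

# Add a squared loan amount term to capture a simple nonlinear effect
loan_data['loan_amnt_poly'] = loan_data['loan_amnt'] ** 2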
When considering which variables to combine for feature interaction, it's important
to consider the underlying relationships of the data and your own domain
knowledge. Here are a couple of potential feature interactions that you could
explore based on the numeric variables in the data set:
● Number of Open Accounts and Loan Amount: The interaction between the
number of open accounts and the loan amount can capture the relationship
between an individual's credit activity and the size of the loan they are
applying for. It may reveal whether individuals with more open accounts
tend to apply for larger loans.
These are just a few examples, and the choice of feature interactions ultimately
depends on the specific context and goals of your analysis. It's recommended to
explore different combinations and evaluate their impact on the model's
performance and interpretability. Additionally, consulting with domain experts or
conducting thorough exploratory data analysis can help guide the selection of
meaningful feature interactions.
Let’s take the “income and debt-to-income ratio” feature interaction example and
look at the code in all three programming languages.
Python Programming

import pandas as pd

loan_data['DTI_INC_INTERACTION'] = loan_data['dti'] * loan_data['annual_inc']

SAS Programming

DATA loan_data;
    SET loan_data;
    DTI_INC_INTERACTION = dti * annual_inc;
RUN;
Reducing the dimensionality of a data set is a crucial step in data preprocessing and
feature engineering. As data sets grow in size and complexity, they often contain a
large number of features, making it challenging to extract meaningful insights or
build efficient models. Dimensionality reduction techniques aim to overcome this
problem by transforming the data into a lower-dimensional space while preserving
essential information. By reducing the number of features, we can improve
computational efficiency, alleviate the curse of dimensionality, enhance model
interpretability, and mitigate issues such as overfitting. This section will explore
various dimensionality reduction techniques and their practical applications.
Key Concepts:
1. Principal Component Analysis (PCA): PCA is a widely used linear
dimensionality reduction technique. It identifies the orthogonal axes, known
as principal components, which capture the maximum variance in the data.
By projecting the data onto a subset of these components, we can represent
the data set in a lower-dimensional space while retaining the most valuable
information.
When selecting variables for Principal Component Analysis (PCA), the primary
consideration is to choose numerical variables that are relevant and informative for
capturing the underlying variance in the data. Here are some guidelines to help
determine which variables to use for PCA analysis:
4. Correlation: Look for variables that are correlated with each other. High
correlation indicates a potential redundancy in the information captured by
those variables. Including highly correlated variables can result in a
multicollinearity problem, affecting the principal components'
interpretability.
Including binary indicators in PCA can result in their domination over other variables
and may produce misleading outcomes. Therefore, excluding binary indicators or
categorical variables from the PCA analysis is advisable. However, if there are
ordinal variables with multiple levels, they can be included in PCA after appropriate
scaling or transformation. It is essential to carefully consider the nature of the
variables and their suitability for PCA to ensure accurate dimensionality reduction
and meaningful results.
In our PCA analysis, we are including the variables "total accounts" (total_acc) and
"open accounts" (open_acc), as well as "public records" (pub_rec) and "public
bankruptcies" (pub_rec_bankruptcies). These variables have a high correlation
above 70% and exhibit a wide standard deviation. By including these variables, we aim to capture the underlying patterns and relationships in the data represented by these correlated, high-variance features.
Now, let's provide coding examples in each of the three programming languages for
performing PCA analysis on these variables:
R Programming

# Perform PCA
pca_result <- prcomp(data_std)

Python Programming

# Perform PCA
pca = PCA()
pca.fit(data_std)
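Filling in the standardization step that precedes those calls, a minimal Python sketch for this analysis might look like the following; the variable list matches the four fields named above, while the scaler choice and the dropna() call are illustrative assumptions.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the four correlated variables before PCA
pca_vars = ['total_acc', 'open_acc', 'pub_rec', 'pub_rec_bankruptcies']
data_std = StandardScaler().fit_transform(loan_data[pca_vars].dropna())

# Perform PCA and inspect the variance explained by each component
pca = PCA()
pca.fit(data_std)
print(pca.explained_variance_)        # eigenvalues for the scree plot
print(pca.explained_variance_ratio_)  # proportion of variance explained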
The output of the SAS PCA analysis automatically generates a scree plot and a
variance explained plot. The Python and R algorithms can also generate these plots
with additional code. The scree plot and variance explained plot are commonly used
to interpret the results of a PCA analysis. Here's a brief explanation of each plot:
1. Scree Plot:
The scree plot displays the eigenvalues of each principal component in
descending order. The eigenvalues represent the amount of variance
explained by each principal component. The scree plot helps determine the
number of principal components to retain in the analysis. The plot usually
shows a steep drop in eigenvalues at the beginning, followed by a leveling
off. The point at which the drop levels off is considered a cutoff point for
retaining principal components. Components before this point are typically
retained as they explain a significant amount of variance, while those after
the point contribute less to the overall variance.
By examining these plots, you can determine the optimal number of principal
components to retain based on the steep drop in eigenvalues in the scree plot and
the cumulative proportion of variance explained in the variance explained plot.
Figure 3.5 shows the scree and variance explained plots generated from the SAS
code.
The eigenvalues in the PCA analysis scree plot represent the variance explained by
each principal component. In our example, the scree plot shows that the
eigenvalues for the first two principal components are relatively high (1.73 and
1.65), indicating that they capture significant variance in the data.
The drop in eigenvalues for the third and fourth principal components (0.35 and
0.26) suggests that these components explain less variance compared to the first
two components. Typically, a significant drop in eigenvalues indicates that the
corresponding principal components may not contribute much to the overall
variability in the data.
To determine the number of principal components to retain, you can consider the
"elbow" rule, which suggests selecting the components before the significant drop
in eigenvalues. In our example, you may choose to retain the first two principal
components as they have relatively high eigenvalues and capture a substantial
portion of the variance in the data.
Data Balancing
Balancing a data set is a crucial step in data preprocessing when working with
imbalanced data sets. In many real-world scenarios, data sets often exhibit class
imbalance, where one class significantly outweighs the other(s). This can pose
challenges in machine learning models, as they tend to be biased toward the
majority class, leading to poor performance in predicting the minority class.
Balancing the data set helps address this issue and allows the model to learn
effectively from all classes present in the data. In this section, we will explore the
importance of balancing data sets, discuss situations where it is necessary, and
outline practical strategies to achieve balanced data sets.
2. Enhanced model performance: Balancing the data set can improve model
performance, particularly in scenarios where accurate predictions for the
minority class are critical. Models trained on imbalanced data tend to
prioritize the majority class, resulting in lower recall and precision for the
minority class. Balancing the data set can help address this bias and provide
more balanced performance across all classes.
While there is no hard-and-fast rule regarding the specific class imbalance threshold
that requires balancing the data set, a general guideline is to consider balancing
when the class distribution is significantly skewed. The 80/20 ratio (80% for one
class, 20% for the other) can be considered as a potential threshold, but it's not a
definitive rule.
The decision to balance the data set depends on several key factors. In all cases, it is important to strike a balance between addressing class imbalance and avoiding overcorrection that might introduce biases or distort the underlying data characteristics.
Modeling Scenarios
Data set balancing is not limited to classification models alone. While class
imbalance is often a concern in classification tasks, balancing data can also be
relevant in other modeling scenarios. Here are a few examples:
While class imbalance is commonly associated with classification models, data set
balancing can be relevant in various modeling scenarios to address data skewness,
improve model performance, and ensure fairness and accuracy in predictions.
● Decision Trees: Decision trees are less sensitive to class imbalance than
some other algorithms. However, when combined with ensemble methods
like Random Forest or Gradient Boosting, imbalanced data can still impact
their performance. In ensemble models, the individual decision trees can be
influenced by the class distribution, and balancing the data can help prevent
bias toward the majority class.
While various algorithms can be affected by class imbalance, the sensitivity may
vary. Logistic regression and SVMs are typically more sensitive, while decision trees
and gradient boosting models are relatively more robust. However, it is important
to note that the impact of imbalance can still vary based on the severity of class
imbalance, data set characteristics, and other factors. Balancing the data can help
improve the model's performance, especially when the class imbalance is
substantial or when the minority class is of particular interest.
Python Programming

import pandas as pd
from sklearn.utils import resample

# Separate positive and negative cases
positive_cases = loan_data[loan_data['bad'] == 1]
negative_cases = loan_data[loan_data['bad'] == 0]
Table 3.12 below shows the SAS output of frequency distribution for the balanced
data set.
bad    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0      9897         50         9897                    50
1      9897         50         19794                   100
In this chapter, you’ve learned how to transform raw data into a structured
modeling data set, a critical step that involves selecting and engineering features,
handling missing data, and preparing the data for analysis. These tasks ensure that
the data you input into your models is clean, relevant, and ready to produce reliable
results.
Now that you have a well-prepared modeling data set, the next step is to
understand how to use this data set within a model pipeline. A model pipeline is a
sequence of steps that takes your data through various transformations, modeling
processes, and evaluations. This pipeline is essential for ensuring that your models
are built, tested, and deployed in a systematic and efficient manner.
So, with your modeling data set ready, let’s proceed to Chapter 4 and delve into the
intricacies of constructing an effective model pipeline that will serve as the
backbone of your data science projects.
● Context: The chapter sets the stage for the creation of a modeling data
set, emphasizing the importance of data preparation, manipulation, and
transformation in the data science workflow.
● Art and Science: The chapter explores the dual nature of data science as
both an art and a science. It discusses how data scientists must combine
technical skills with creativity and domain expertise to make informed
decisions throughout the data science process.
● Lending Club Data Set: The Lending Club data set is reintroduced as the
primary data set for the ongoing project. The chapter formally defines
the business problem – designing a predictive model to calculate the
probability of a borrower defaulting on a loan.
● Limiting the Data Set: The chapter describes how to limit the data set to
a specific date range to ensure stability and reliability in the target
variable, resulting in a more manageable and accurate data set.
● Looking Ahead: The chapter prepares readers for the next steps in the
data science process, which involve building and evaluating predictive
models using the prepared data set.
Chapter 3 Quiz
Questions:
1. What are the key differences between academic data sets and real-world
data sets?
3. How does the "art" of data science influence the decision-making process
during data preparation?
4. What is the business problem defined for the Lending Club data set in this
chapter?
5. How is the "loan_status" field in the Lending Club data set used to create
the target variable "bad"?
6. Why is it important to examine the stability of the target variable over time?
7. What is "runway time," and how does it affect the analysis of the target
variable?
8. Describe the process of limiting the data set to a specific date range and its
impact on the modeling process.
10. How can summary statistics and visualizations like histograms help in
understanding the data set?
11. What are outliers, and why is it important to address them in data science?
13. What is the Interquartile Range (IQR) method, and how is it used to identify
outliers?
14. Why is feature selection important in machine learning, and what are its
benefits?
16. What are wrapper methods in feature selection, and how do they differ
from filtering methods?
17. Explain the concept of embedded methods in feature selection and their
implementation.
18. What is feature engineering, and how can it improve model performance?
19. Describe the process of feature scaling and its importance in machine
learning.
Feature Engineering

● SAS: Use PROC STANDARDIZE for feature scaling; create interaction terms with DATA steps.
● Python: Use StandardScaler and MinMaxScaler from sklearn for feature scaling; use PolynomialFeatures for creating interaction terms.
● R: Use scale() for feature scaling; create interaction terms with the poly() function.
Overview
In data science and machine learning, a model pipeline is a critical concept that
ensures the systematic and repeatable processing of data, leading to the
development of robust models.
● Data preprocessing prepares the raw data for analysis and modeling. It
includes steps such as outlier treatment, missing value imputation, and
categorical variable encoding.
● Model development involves selecting an appropriate algorithm and
training it on the preprocessed data.
● Hyperparameter tuning optimizes the parameters of the model to improve
its performance.
● Performance evaluation assesses how well the model performs on unseen
data.
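As a concrete, simplified illustration of the first two components, here is a minimal Python sketch of a scikit-learn pipeline; the specific steps and estimator are illustrative assumptions rather than the pipeline built for this project.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# A simple pipeline: impute missing values, scale features, then fit a model
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

# Usage: pipeline.fit(X_train, y_train) then pipeline.predict_proba(X_valid)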
Data Preparation
In the previous chapter, we explored the process of creating a modeling data set,
focusing on techniques such as outlier treatment, feature engineering, missing value
imputation, and data balancing. These steps are essential to preparing our data for
predictive modeling and ensuring we have a clean and reliable data set. In this
section, we will outline the specific techniques we applied to the Lending Club data
set, setting the foundation for the predictive models that we will develop in the
following sections.
Order of Operations
While the specific order in which you should perform these adjustments may vary
depending on the data set and the particular problem at hand, there is a generally
recommended order for performing these data treatments to optimize the data set:
1. Outliers: Extreme values deviating from most of the data can heavily
influence regression models. In regression analysis, outliers can distort the
estimated relationships between predictors and the target variable, leading
to biased coefficient estimates and compromised model performance.
Addressing outliers through techniques like Winsorization, truncation, or
robust regression helps mitigate their impact on regression models, allowing
for more reliable estimates.
2. Missing Value Imputation: Once outliers have been addressed, the next
step is to handle missing values. Missing data can be imputed using various
techniques such as mean imputation, median imputation, or more advanced
methods like multiple imputation or predictive modeling-based imputation.
Imputing missing values allows for a more complete data set and avoids
biases caused by omitting incomplete observations. (A short code sketch of
this order follows the list.)
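The sketch below illustrates this order on a single numeric column, assuming a pandas DataFrame df with an illustrative column annual_inc; the IQR fences and median imputation are only one possible choice of techniques.

import pandas as pd

def treat_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Winsorize values that fall outside the IQR fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# 1) Treat outliers first ...
df['annual_inc'] = treat_outliers_iqr(df['annual_inc'])
# 2) ... then impute the remaining missing values (median imputation shown here)
df['annual_inc'] = df['annual_inc'].fillna(df['annual_inc'].median())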
Data Segmentation
The primary goal of data segmentation is to create subsets that are representative
of the underlying population and maintain the integrity of the evaluation process.
Typically, a common approach is to divide the data into three main segments: the
training set, the validation set, and the testing set.
By segmenting the data, we can detect potential issues such as overfitting, where
the model performs exceedingly well on the training data but fails to generalize to
new data. It allows us to iteratively refine the model, validate its performance on
unseen data, and make necessary adjustments to ensure its reliability.
In predictive modeling, the most common and widely used approach for data
segmentation involves dividing the data into three main subsets: the training set,
the validation set, and the testing set. This approach is commonly referred to as the
train-validation-test split. The training set is used to build and train the model, the
validation set is used to fine-tune the model and optimize its parameters, and the
testing set is used for the final evaluation of the model's performance.
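A minimal sketch of this train-validation-test split using two calls to scikit-learn's train_test_split; X and y are assumed to hold the modeling features and target, and the 60/20/20 proportions are illustrative.

from sklearn.model_selection import train_test_split

# First carve out the training set, then split the remainder in half
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)              # 60% training
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)  # 20% / 20%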
On the other hand, the term "out-of-time" validation set is used specifically in time
series data analysis. It refers to a validation set that contains data from a time
period that is later than the training period, simulating real-world scenarios where
the model needs to make predictions on future data. This allows us to evaluate the
model's ability to generalize and make accurate predictions on unseen future data.
To reconcile the terminology, we can say that the train-validation-test split is the
standard approach for general predictive modeling tasks. In the context of time
series data analysis, however, the final hold-out segment is typically an out-of-time
(OOT) validation set, which is used to assess the model's performance on future data.
It's important to note that different sources or individuals may use slightly different
terms or variations, but the key idea remains the same: dividing the data into
subsets for training, validation, and evaluation purposes to ensure robust and
reliable model performance.
Cross-Validation
The most common form of cross-validation is k-fold cross-validation, where the data
is divided into k equally sized folds. The model is trained on k-1 folds and validated
on the remaining fold. This process is repeated k times, with each fold serving as the
validation set exactly once. The model's performance is then averaged across the k
iterations for an overall assessment.
Cross-validation is especially beneficial in situations where the data set is small and
more reliable model evaluation is needed. It allows for better estimation of the
model's performance on unseen data and provides insights into its stability across
different training-validation splits. Cross-validation also helps in hyperparameter
tuning, allowing us to assess how well the model generalizes with different
parameter settings.
Benefits:
- Simple and straightforward to implement.
● Cross-Validation:
- Cross-validation involves partitioning the data set into multiple
subsets (folds) and iteratively using different subsets for training
and testing.
Benefits:
- Uses the entire data set for both training and testing, which can
result in more robust model evaluation.
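A minimal sketch of k-fold cross-validation with scikit-learn follows; X_dev and y_dev are assumed to be the development features and target, and logistic regression is used only as a placeholder estimator.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_dev, y_dev,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())   # average AUC and its variability across folds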
Given that the Lending Club data set is a time series with a clear temporal order, it is
generally more appropriate to use a development, validation, and out-of-time split
rather than cross-validation. Time series data often exhibit temporal dependencies,
meaning past observations can influence future observations. Therefore, it is
essential to maintain the temporal order in the data segmentation to reflect the
real-world scenario.
Therefore, for time series data like the Lending Club data set, it is recommended to
use a development, validation, and out-of-time split, ensuring that the data subsets
are ordered chronologically. This approach allows for a more realistic evaluation of
the model's performance and helps capture the temporal dynamics in the data.
Modeling Pipeline
The primary goal of a modeling pipeline is to simplify and automate repetitive tasks,
reducing the risk of human error and saving valuable time. It ensures that data
manipulation and model building processes are consistent across different iterations
of model development and across different projects. As a result, the pipeline
enhances the scalability of the modeling process, allowing data scientists to apply
the same methodology to different data sets and projects efficiently.
A modeling pipeline is a powerful tool in the data science toolkit that empowers
data scientists to efficiently create robust and reliable predictive models. Providing a
clear and organized framework for the model-building process enables data
scientists to focus on extracting insights and driving value from data, ultimately
leading to more informed and impactful decision-making.
On the other hand, the out-of-time (OOT) data will remain untouched until the
modeling process is complete. We will set it aside, preserving its integrity to serve as
an independent data set for final model evaluation. After conducting model training,
hyperparameter tuning, and model selection using the development and validation
data, we will focus on the final model evaluation phase.
Once we have identified the champion model, we will apply the same imputation
pipeline used on the development data to the out-of-time data. After scoring the
out-of-time data with the final model, we will carefully evaluate its performance
using various metrics to ensure robustness and accuracy.
Figure 4.3 below shows a graphical representation of the overall machine learning
workflow.
This systematic approach allows us to maintain the integrity of the out-of-time data,
ensuring that our final evaluation is free from data leakage or bias. By adhering to a
clear sequence of steps, we can confidently build predictive models and assess their
real-world performance with the utmost accuracy and reliability.
1. Data Split: Split the entire data set into two parts: the model-building data
set and the out-of-time data set (usually representing future data).
4. Model Training and Selection: Build and train different models on the
development data set. Evaluate their performance on the validation data
set and select the best-performing model as the champion model.
6. Data Preparation on Out-of-Time Data: Now, take the out-of-time data set
(which has not been touched yet) and apply the same data preparation
steps that were performed on the model-building data set. This includes
imputations, outlier treatment, feature engineering, and categorical
variable encoding.
7. Model Scoring and Evaluation: Once the out-of-time data set is prepped,
use the champion model to score the data and generate predictions.
Evaluate the model's performance on the out-of-time data set using the
desired performance metrics.
Following this process ensures that the model is trained and evaluated on separate
data sets, preventing data leakage and providing a more accurate representation of
how the model will perform on unseen data. The out-of-time data set serves as a
true test set, simulating how the model would perform in a real-world scenario
where future data becomes available after model deployment.
This section will demonstrate the implementation of each step in the data science
modeling pipeline using SAS, Python, and R programming languages. Each language
has its own strengths and unique features that make it well-suited for certain tasks
in the modeling pipeline. By providing code examples in all three languages, we can
compare and contrast their approaches and gain a deeper understanding of the
concepts and techniques discussed in the modeling pipeline workflow.
https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide
Although the code for all three programming languages could not be included
directly in the text due to space limitations, Program 4.1 shows the model pipeline
code for the Python programming language. This program provides an example of a
complete model pipeline.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Load the data and split it into development and OOT data sets
loan_data = pd.read_csv('loan_data.csv')

oot_start_date = '2017-04-01'
oot_end_date = '2017-06-30'

mask = (loan_data['issue_d'] >= oot_start_date) & (loan_data['issue_d'] <= oot_end_date)

X_oot = loan_data[mask].drop('bad', axis=1)
y_oot = loan_data[mask]['bad']
X_dev = loan_data[~mask].drop('bad', axis=1)
y_dev = loan_data[~mask]['bad']

# One-hot encode the OOT categorical features (the fitted encoder and the
# numeric_features/categorical_features lists are defined earlier in the full
# program on GitHub)
X_oot_encoded = pd.concat(
    [X_oot[numeric_features],
     pd.DataFrame(encoder.transform(X_oot[categorical_features]).toarray(),
                  index=X_oot.index)],
    axis=1)

# Create an interaction term between debt-to-income ratio and annual income
X_oot_encoded['DTI_INC_interaction'] = X_oot['dti'] * X_oot['annual_inc']
Prepare Data
● Data Ingestion: The code reads in the loan data and formats the date
variable. It also identifies numeric and categorical features.
● Data Partitioning: The code partitions the data into out-of-time (OOT) and
development data sets based on the date of issue. It further splits the
development data set into training and validation data sets.
● Model Building: The code builds various machine learning models, including
logistic regression, decision tree, random forest, support vector machine,
gradient boosting machine, and neural network.
● Model Evaluations: The code calculates the AUC value for each candidate
model and compares them in a bar chart. A champion model is selected.
● Model Performance: The code applies the final model with optimal
hyperparameters to score the OOT data set.
This pipeline covers all major steps in a typical machine learning project, from data
preparation to feature engineering, model building, selection, evaluation,
deployment, and monitoring.
The modeling pipeline code results in two pieces of output: a graphic chart and a
table. The graphic chart shows a comparison of the AUC of each of the machine
learning models. AUC, or Area Under the Curve, is a measure of the performance of
a binary classifier. An AUC of 1 indicates a perfect classifier, while an AUC of 0.5
indicates a random classifier. The graphic shows that the Gradient Boosting Machine
model has the highest AUC, indicating that it is the best-performing model among
those tested.
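The AUC values behind this chart can be reproduced with a few lines of code. In this hedged sketch, model, X_val, and y_val are assumptions carried over from the full pipeline program.

from sklearn.metrics import roc_curve, auc

y_prob = model.predict_proba(X_val)[:, 1]   # predicted probability of class 1 ("bad")
fpr, tpr, _ = roc_curve(y_val, y_prob)
print(auc(fpr, tpr))                        # 0.5 = random classifier, 1.0 = perfect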
The table displays several performance metrics for each machine learning model,
including AUC, accuracy, precision, recall, and F1 score. These metrics provide a
more detailed view of the performance of each model. Some key points to note
from the table are:
● The Gradient Boosting Machine model has the highest AUC, indicating that
it is the best-performing model among those tested.
● The Neural Network model has the highest recall, indicating that it can
correctly identify a high proportion of positive instances.
● The Logistic Regression and Support Vector Machine models have a
precision of 0, indicating that they cannot correctly identify any positive
instances.
The champion model is the Gradient Boosting Machine model, as it has the highest
AUC on the validation data set. The final data set, which contains the predictions of
the champion model on the OOT data set, is stored in the variable y_oot_pred. This
variable contains the predicted class labels for each observation in the OOT data set.
Processing Time
It’s not uncommon for machine learning models to take a long time to train,
especially when working with large data sets or complex models. Several factors
could contribute to long runtimes in the modeling section of your code. Some
possible reasons include:
● Large data set: If the data set you are working with is very large, the models
may take a long time to process all of the data.
● Complex models: Some of the models you are using, such as Support Vector
Machines, Gradient Boosting Machines, and Neural Networks, can be
computationally expensive and may take longer to train than simpler
models like Logistic Regression or Decision Trees.
● Hyperparameter tuning: If you perform hyperparameter tuning using grid
search, this can significantly increase the runtime as the model needs to be
trained multiple times for each combination of hyperparameters.
To speed up the modeling process, you could try some of the following approaches:
● Reduce the data set size: You could try using a smaller sample of the data
for model training and validation. This can help reduce the runtime while
still providing a representative sample of the data for model development.
● Simplify the models: You could try using simpler models that are less
computationally expensive to train. For example, you could try using Logistic
Regression or Decision Tree instead of Support Vector Machine or Neural
Network.
● Use faster hyperparameter tuning methods: Instead of using grid search for
hyperparameter tuning, you could use faster methods such as random
search or Bayesian optimization.
Among the machine learning algorithms that you use in your code, the one that
typically takes the longest to train is the Support Vector Machine (SVM). SVMs are
known to have high computational complexity, especially when working with large
data sets or when using certain kernel functions. However, the actual training time
can vary depending on factors such as the size and dimensionality of the data set,
the choice of kernel function, and the values of the hyperparameters. Other
algorithms, such as Gradient Boosting Machines and Neural Networks, can also take
a long time to train, but typically not as long as SVM.
With a solid model pipeline in place, it’s time to dive into the core of predictive
modeling. The next chapters will introduce you to the foundational techniques used
to build predictive models, starting with some of the most widely used algorithms
like linear regression and logistic regression. Understanding these foundational
models is crucial because they form the basis upon which more complex models are
built.
So, with your model pipeline ready to go, let’s move on to Chapter 5 and start
building your first predictive models, setting the stage for the advanced techniques
to come.
● Context: The chapter sets the stage for understanding how model pipelines
streamline the machine learning process by automating repetitive tasks and
maintaining the integrity of results.
2. Data Preparation
4. Data Segmentation
● Steps in a Model Pipeline: The chapter outlines the key steps in a model
pipeline, from data splitting, data preparation, and model training and
selection, to model scoring and evaluation. It highlights the importance of
following these steps in sequence to prevent data leakage.
● Model Pipeline Code: While the chapter discusses the conceptual steps in
building a model pipeline, it references a GitHub repository where readers
can access the complete code examples in SAS, Python, and R.
● Evaluation and Output: The chapter covers how the final model’s
performance is evaluated using metrics such as AUC, accuracy, precision,
recall, and F1 score. It also discusses the visualization of model performance
using bar charts and confusion matrices to compare different models.
Chapter 4 Quiz
Questions:
1. What is a model pipeline, and why is it important in machine learning
projects?
11. Explain the concept of cross-validation and its benefits in model evaluation.
16. How does the AUC metric help in assessing the performance of a binary
classifier?
19. How does one-hot encoding facilitate the inclusion of categorical variables
in machine learning models?
20. What are the benefits of using a systematic model pipeline in data science
projects?
Data Balancing:
● SAS: Use PROC SURVEYSELECT for oversampling/undersampling; create synthetic samples via PROC GENMOD.
● Python: RandomOverSampler and SMOTE from imblearn for oversampling and synthetic sampling.
● R: ROSE package for oversampling, undersampling, and synthetic sampling.
Model Training:
● SAS: PROC LOGISTIC and PROC TREESPLIT for training models; use PROC HPFOREST for ensemble models.
● Python: LogisticRegression and RandomForestClassifier from sklearn; XGBoost for gradient boosting.
● R: glm() for logistic regression; randomForest() for ensemble models.
Model Evaluation:
● SAS: PROC ROC for AUC calculation; PROC FREQ and PROC MEANS for confusion matrix and performance metrics.
● Python: roc_curve and auc from sklearn.metrics for AUC; confusion_matrix, precision_score, and recall_score for metrics.
● R: pROC package for AUC and ROC; confusionMatrix() from caret for metrics.
Overview
At the heart of our exploration are linear and logistic regression, fixed-form models
that are crucial in predicting numerical and categorical outcomes, respectively.
These models provide a structured framework, allowing us to understand the
relationships within our data. On the other hand, we encounter the decision tree, a
non-parametric model that operates without predefined assumptions. The beauty
of non-parametric models lies in their flexibility, enabling them to capture complex
patterns in the data without strict adherence to a predetermined structure.
Modeling Data
The quality of the modeling data set directly impacts the predictive models’ accuracy
and reliability, making its preparation an integral part of the machine learning process.
The culmination of this comprehensive pipeline process is the final modeling data
set. This data set, which encapsulates all preprocessing steps, is structured to feed
directly into our machine learning models. You can access this data set in the GitHub
repository for this book at:
https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide
As we explore each machine learning model in subsequent chapters, we will use this
data set to develop these models and provide an in-depth explanation of their
workings.
Given that our modeling data set is based on a binary target variable, we will focus
on building a series of classification models. However, it’s important to note that
each algorithm we discuss can be used for either regression or classification tasks
with only minor modifications to the programming code.
Machine learning algorithms are inherently flexible. For instance, despite its name,
logistic regression is used for binary classification tasks. By changing the function
used to map input features to output predictions, it can be adapted for regression
tasks. Similarly, decision trees can be used for both classification and regression by
altering the criteria for splitting nodes and determining leaf node values. Support
Vector Machines (SVMs), primarily used for classification, can also be used for
regression (Support Vector Regression) by introducing an alternative loss function.
This versatility extends to most machine learning algorithms, making them
adaptable tools in a data scientist’s arsenal.
This chapter will focus on creating the classification model versions of the following
algorithms:
● Logistic Regression
● Decision Tree
● Random Forest
● Gradient Boosting Machine
● Support Vector Machines
● Neural Networks
We will start with Logistic Regression, a simple yet powerful algorithm for binary
classification tasks. Then, we will explore Decision Trees and their ensemble
versions: Random Forests and Gradient Boosting Machines. We will then move on
to Support Vector Machines, a robust classifier capable of handling high-
dimensional data. Finally, we will delve into Neural Networks, the backbone of most
modern artificial intelligence applications.
By the end of this chapter, you will have a solid understanding of these algorithms
and be able to implement them in SAS, Python, and R.
The distinction between linear and logistic regression primarily lies in the nature of
the dependent variable. Linear regression is predicated on a quantitative numeric
variable, whereas logistic regression is predicated on a qualitative categorical target
variable. Figure 5.1 illustrates a comparison of value distributions for these
modeling types.
Attempting to model a binary outcome with a linear regression model could yield
perplexing results, as demonstrated in Figure 5.2.
The linear regression model seeks to minimize the data's Residual Sum of Squares
(RSS), assuming the dependent variable (Y) is a continuous value. This assumption
leads to some issues when applied to classification problems:
1. Probabilities greater than one or less than zero: For extreme values of X,
the predicted probability of a purchase either becomes negative or exceeds
one, both of which fall outside the [0,1] range.
2. Homoscedasticity: Linear regression assumes that the variance of Y is
constant across values of X. For a binary target, however, the variance is the
product of the proportions of positive values (P) and negative values (Q),
i.e., PQ, which changes with the predicted probability and therefore varies
across values of X.
3. Normal distribution of residuals: Linear regression assumes that residuals
are normally distributed. This assumption is difficult to justify, given that
classification problems have binary targets.
For these reasons, it’s advisable not to use linear regression for classification
problems.
Logistic Regression
A more suitable approach for binary outcome variables is to model them as a
probability of the event occurring. This probability is depicted as an S-curved
regression line in Figure 5.2. While the formula for this line differs from linear
regression, some shared elements exist.
\[ Y \approx \beta_0 + \beta_1 X_1 \]
The logistic regression equation (Equation 5.2) comprises four main components:
1. p(X) denotes the probability of a positive value (1), which is also the
proportion of 1s. This is equivalent to the mean value of Y in linear
regression.
2. The base value of the natural logarithm (e) is present in the model, as
expected in a logistic model.
3. 𝛽0 represents P when X is zero, similar to the intercept value in linear
regression.
4. 𝛽1 adjusts the rate at which the probability changes with a single unit
change in X, akin to the 𝛽1 value in linear regression.
The logistic regression equation is known as the logistic function, and it’s used
because it produces outputs between 0 and 1 for all values of X and allows the
variance of Y to vary across values of X.
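For reference, a standard way to write the logistic function described by these components is:

\[ p(X) = \frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}} \]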
The process of training a logistic regression model involves several iterative steps. It
begins with initializing the model parameters, which include the weights and biases
for your model. These parameters are then used to calculate the predicted outcome
for each observation in your data set. The discrepancy between these predictions
and the actual outcomes is quantified using a loss function, often binary cross-
entropy loss, in the case of logistic regression. This loss is then used to update the
model parameters using a learning algorithm such as gradient descent. The process
repeats until the model converges or a stopping criterion is met.
Let’s delve deeper into each step of developing a logistic regression model:
2. Calculate the Predicted Outcome: Once the parameters are initialized, you
can calculate the predicted outcome for each observation in your data set.
Logistic regression involves calculating the log odds of the outcome using
the logistic function. The logistic function transforms the linear combination
of your predictors (i.e., the dot product of your predictors and weights plus
the bias) into a probability between 0 and 1.
3. Compute the Loss: After calculating the predicted outcomes, you need to
compute the loss, which measures how well your model’s predictions match
the actual outcomes. In logistic regression, we often use binary cross-
entropy loss (also known as log loss), which measures the error between
our predictions and the actual classes. It considers the predicted
probabilities for the actual class and the inverse probabilities for the other
class.
4. Update the Parameters: Once you’ve computed the loss, you can update
your parameters using a learning algorithm such as gradient descent. This
involves computing the derivative (gradient) of the loss concerning each
parameter, which indicates how much changing that parameter would
change the loss. You then adjust each parameter by a small step in the
direction that reduces the loss (i.e., in the opposite direction of the
gradient). The size of this step is determined by your learning rate.
5. Repeat Steps 2–4: You continue iterating through steps 2-4 until your model
performs well on your training data or until a stopping criterion is met (such
as reaching a maximum number of iterations).
It’s important to note that these steps represent one cycle of training a logistic
regression model. In practice, you would typically divide your data into batches and
repeat these steps for each batch, updating your parameters after each batch rather
than after each epoch. This approach, known as mini-batch gradient descent, can
lead to faster and more stable convergence.
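The training loop just described can be sketched in a few lines of NumPy. This is a minimal illustration of full-batch gradient descent with binary cross-entropy loss, not the implementation used elsewhere in this book; X (an n-by-p feature matrix) and y (a 0/1 target vector) are assumed to exist.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    w = np.zeros(X.shape[1])                                  # step 1: initialize weights
    b = 0.0                                                   #         and bias
    for _ in range(epochs):
        p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)     # step 2: predicted probabilities
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # step 3: cross-entropy loss
        grad_w = X.T @ (p - y) / len(y)                       # step 4: gradient of the loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w                                      # move opposite the gradient
        b -= lr * grad_b
    return w, b, loss                                         # step 5: repeat until done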
Loss Function
Therefore, maximum likelihood implies selecting values for 𝛽0 and 𝛽1 that result in a
probability close to one for all ones and a probability close to zero for all zero
values. This approach is termed maximum likelihood because it seeks to maximize
the likelihood (conditional probability) of matching the model estimate with the
sample data.
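For reference, the quantity minimized in practice is the binary cross-entropy loss, which is the negative of the log likelihood being maximized:

\[ \ell(\beta_0, \beta_1) = -\frac{1}{N}\sum_{i=1}^{N} \Big[\, y_i \log p(x_i) + (1 - y_i)\log\big(1 - p(x_i)\big) \Big] \]

Minimizing this loss over the training data is equivalent to the maximum likelihood criterion described above.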
Odds Ratio
The odds ratio is a fundamental concept in logistic regression. It represents the ratio
of the odds of an event occurring in one group to the odds of it occurring in another
group. In the context of logistic regression, the odds ratio can be calculated as e^β1.
This means that for a unit increase in X, we can expect the odds of Y=1 (event
happening) to change by a factor of e^β1. This is particularly useful when interpreting
the results of a logistic regression, as it measures how each independent variable
impacts the odds of the dependent variable.
The odds ratio quantifies how a change in an independent variable affects the
dependent variable while holding other variables constant. While the odds ratio can
give us valuable insights, it’s not always easy to interpret, especially for continuous
variables or when there are interactions between variables.
The coefficient (β) associated with annual_income in the logistic regression model
represents the change in log odds of the outcome target variable (“bad”) for a unit
increase in annual_income. To calculate the odds ratio, we take the exponent of this
coefficient, i.e., e^β1.
For example, if β for annual_income is -0.03, the odds ratio would be e^(-0.03) ≈ 0.97.
This means that for each additional unit increase in annual_income, we can expect
the odds of a loan default (Y=1) to be multiplied by 0.97, assuming all other
variables are held constant. In other words, higher annual income is associated with
lower odds of loan default.
It’s important to note that this is a multiplicative effect on the odds, not the
probabilities, and it assumes all other variables in the model are held constant. Also,
this interpretation is valid for a unit change in annual_income, so depending on
how annual_income is measured (e.g., dollars, thousands of dollars), this would
affect the interpretation.
Remember that while odds ratios can be very informative, they require careful
interpretation and understanding of your data and model.
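As a quick illustration, the exponentiation can be applied directly to a fitted model's coefficients. This sketch assumes a fitted scikit-learn LogisticRegression object named model and a predictors list matching its input columns; both are assumptions carried over from the surrounding examples.

import numpy as np
import pandas as pd

# model.coef_[0] holds the fitted log-odds coefficients, one per predictor
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=predictors)
print(odds_ratios.sort_values())
# e.g., np.exp(-0.03) is roughly 0.97: each additional unit of annual_inc
# multiplies the odds of default by about 0.97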
Link Function
Logistic regression is a generalized linear model (GLM) that uses a logit link function.
The link function provides the relationship between the linear predictor and the
mean of the distribution function. In logistic regression, this is the natural log of the
odds (i.e., logit function). The link function transforms the probability into a linear
combination of the predictors. This transformation allows us to model a binary
response using methods similar to those used in linear regression.
The choice of link function is critical as it determines how we model the relationship
between our predictors and our response variable. The logit link function ensures
that our predicted probabilities stay between 0 and 1 and allows us to interpret our
regression coefficients as changes in log odds.
\[ \eta = \beta_0 + \beta_1 \cdot \text{annual\_income} + \beta_2 \cdot \text{credit\_score} \]
The link function transforms this linear predictor into a probability using the inverse
of the logit function, which is the logistic function:

\[ p = \frac{1}{1 + e^{-\eta}} = \frac{e^{\eta}}{1 + e^{\eta}} \]
The choice of the logit link function ensures that our predicted probabilities stay
between 0 and 1, no matter the values of our predictors. It also allows us to
interpret our regression coefficients as changes in log odds. For example, 𝛽1
represents the change in log odds of default for each additional dollar of annual
income.
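As a small worked example, the sketch below plugs hypothetical coefficient values (not estimates from the Lending Club data) into the linear predictor and applies the inverse logit:

import numpy as np

b0, b1, b2 = -1.5, -0.00001, -0.004                 # hypothetical coefficient values
annual_income, credit_score = 55_000, 700

eta = b0 + b1 * annual_income + b2 * credit_score   # linear predictor (log odds)
prob = 1.0 / (1.0 + np.exp(-eta))                   # inverse logit (logistic function)
print(prob)                                         # predicted probability of default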
Assumptions
Logistic regression does not require a linear relationship between the dependent
and independent variables, which differentiates it from linear regression. However,
it does require that the independent variables are linearly related to the log odds.
This assumption allows us to use a linear combination of predictors to model a
binary outcome.
Logistic regression also assumes that errors are independently and identically
distributed and that there is no perfect multicollinearity among independent
variables. When building a logistic regression model, it’s important to check these
assumptions, as violations can lead to biased or inefficient estimates.
Let’s delve deeper into the assumptions of logistic regression and how they apply to
the Lending Club data set.
One key assumption is linearity of the log odds: each predictor is assumed to be
linearly related to the log odds of the outcome. This means that a unit change in the
predictor variable will result in a constant change in the log odds of the response
variable, assuming all other variables are held constant.
For example, consider a predictor variable from the Lending Club data set,
such as annual_income. The assumption here is that for each additional
dollar of annual income, the log odds of loan default (assuming the target
variable “bad” is coded as 1) changes by a constant amount, given that all
other variables in the model are held constant. This assumption can be
checked by looking at the relationship between each predictor and the log
odds of the response variable. If this assumption is violated, transformations
of the predictors or reparameterization of the model might be necessary.
Another assumption is that the errors are independent. In the context of the
Lending Club data set, this would mean that the error term for one loan application
does not predict the error term for another
loan application. Violations of this assumption might occur if there is serial
correlation between observations. For instance, if loans are funded based
on time-series data (e.g., loans funded earlier influence those funded later),
this assumption might be violated. If this is the case, techniques such as
time-series or random effects models might be more appropriate.
5. Large sample size: Logistic regression requires a large sample size because
maximum likelihood estimates are less powerful at small sample sizes than
ordinary least squares (used in linear regression). A common rule of thumb
is that you need at least 10 cases with the least frequent outcome for each
predictor in your model. For example, if you have 5 predictors and expect
about 10% positive responses, you would need at least 500 observations for
logistic regression.
This assumption is likely to be met in the Lending Club data set, which has
thousands of observations across multiple predictors. However, it is always
good practice to check whether your data set is large enough relative to
your model complexity.
Model Fit
Evaluating the fit of a logistic regression model is a crucial step in the modeling
process. Here’s an expanded explanation of the goodness-of-fit measures, along
with how they apply to the Lending Club data set:
The -2 log likelihood (-2LL) decreases whenever predictors are added, regardless of
whether they are meaningful. Therefore, it’s often used in combination with
other measures like AIC and BIC that penalize model complexity.
In the Lending Club data set context, suppose you have two models: one
that includes annual_income and credit_score as predictors and another
that adds home_ownership as an additional predictor. If the -2LL decreases
significantly when home_ownership is added, this suggests
that home_ownership improves the model fit.
For example, using the Lending Club data set, if adding a predictor like
home_ownership to a model decreases the AIC, this suggests that despite
increasing model complexity, home_ownership improves the model’s ability
to predict loan default.
In relation to the Lending Club data set, if you divide your data into deciles
based on predicted default probabilities and run the Hosmer-Lemeshow
test, a non-significant p-value would suggest that your model’s predicted
probabilities align well with actual default rates in these groups.
Interpretation
The exponentiated coefficients of a logistic regression model are odds ratios.
Because of this, we interpret each exponentiated coefficient as the multiplicative
change in the odds for a one-unit increase in the corresponding predictor variable,
holding all other variables constant. This interpretation is more intuitive than
reasoning directly in log odds, and it is one of the reasons why logistic regression is
popular in fields like medicine, where understanding odds ratios can be crucial.
However, interpreting these coefficients can be challenging, especially when dealing
with multiple predictors or interaction terms.
For example, consider a predictor variable from the Lending Club data set,
such as annual_income. If the coefficient for annual_income is -0.02, then
the odds ratio is e^(-0.02) ≈ 0.98. This means that for each additional dollar of
annual income, we can expect the odds of a loan default (assuming loan
default is coded as 1) to be multiplied by 0.98, assuming all other variables
are held constant.
Regularization
Regularization is a critical concept in the field of machine learning. It's the practice
of adding a penalty term to the loss function during the training of a model. In
machine learning, we often encounter the problem of overfitting. This occurs when
a model becomes too complex, capturing not only the underlying patterns in the
training data but also the noise. As a result, the model performs exceptionally well
on the training data but fails to generalize to unseen data. This is where
regularization steps in. Regularization techniques help prevent overfitting by
discouraging the model from fitting the noise in the data. Instead, it encourages the
model to find simpler patterns that generalize better. In essence, regularization is a
form of "complexity control" for machine learning models.
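A minimal sketch of how L1 and L2 penalties can be applied to logistic regression in scikit-learn; X_train and y_train are assumptions from the earlier split, and the C values shown are illustrative.

from sklearn.linear_model import LogisticRegression

# Smaller C means a stronger penalty (more "complexity control")
ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)   # L1 can shrink some coefficients exactly to zero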
The issues of imbalance in logistic regression stem from the way the algorithm
learns. It aims to find the optimal decision boundary that minimizes the logistic loss
function. In imbalanced data sets, the model tends to favor the majority class, as the
loss function is dominated by the abundant class. This bias results in poor sensitivity,
specificity, and overall predictive power, particularly for the minority class.
Balancing the data set by either oversampling the minority class or undersampling
the majority class helps mitigate these issues. By presenting the model with a more
equal distribution of classes, logistic regression can learn better from both
categories. As a result, it can help make more informed decisions and provide
improved predictions. Creating a balanced data set for logistic regression is crucial
because it rectifies the algorithm's susceptibility to imbalanced data, ultimately
leading to more accurate and equitable model outcomes.
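A hedged sketch of the oversampling approach, using the imbalanced-learn package listed in the Chapter 4 cheat sheet; X_train and y_train (a pandas Series target) are assumptions from the earlier split.

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_bal, y_bal = ros.fit_resample(X_train, y_train)
print(y_bal.value_counts())   # minority-class rows are duplicated until the classes match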
Pros:
Cons:
● Requires large sample sizes because maximum likelihood estimates are less
powerful at small sample sizes than ordinary least squares.
● Doesn’t handle multicollinearity well. If variables are highly correlated, it
can lead to unstable estimates of coefficients.
Best Practices
In this section, we will dig into the practical aspect of logistic regression by
developing a model using Python, R, and SAS. We will follow a step-by-step
approach to ensure that each part of the process is clearly understood. The steps
include importing the data, splitting it into training and validation data sets,
performing Variance Inflation Factor (VIF) analysis to check for multicollinearity,
selecting variables based on the VIF analysis, building and tuning the logistic
regression model, applying the model to an out-of-time (OOT) data set, and finally
evaluating the model’s performance using appropriate metrics.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split

print(train_encoded.columns)

# Calculate VIF
X = sm.add_constant(train_encoded[predictors])
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i)
                     for i in range(X.shape[1])]
vif["features"] = X.columns

# Ensure the order of columns in the OOT set is the same order as in the train set
OOT_encoded = OOT_encoded[predictors]

# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded['bad'] = OOT_bad

# Split the data into training and validation sets (80/20 split) using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])
colnames(train_encoded)

# (tail of the predictor vector defined in the full R program)
                'purpose_home_improvement', 'home_ownership_RENT',
                'application_type_JointApp')
# Calculate VIF
vif_fit <- lm(bad ~ ., data = train_encoded[, c(predictors, "bad")])
if (vif_fit$rank == length(coefficients(vif_fit))) {
  vif_values <- car::vif(vif_fit)
  high_vif_predictors <- names(vif_values[vif_values > 5])  # Change this threshold as needed
} else {
  high_vif_predictors <- character(0)
}

# Split the data into training and validation sets (80/20 split) using stratified sampling
trainIndex <- createDataPartition(train_encoded$bad, p = .8, list = FALSE)
X_train <- train_encoded[trainIndex, predictors]
y_train <- train_encoded[trainIndex, "bad"]
X_val <- train_encoded[-trainIndex, predictors]
y_val <- train_encoded[-trainIndex, "bad"]

# Ensure the order of columns in the OOT set is the same order as in the train set
OOT_encoded <- OOT_encoded[c(predictors, "bad")]
DATA oot_encoded;
    SET james.oot_encoded;
RUN;

/* Split the modeling data set by a 80/20 ratio using a random seed */
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2 SEED=42;
RUN;

DATA train_bal;
    SET pos neg(obs=6901); /* Input the number of positive cases */
RUN;
1. Import necessary libraries: Import the necessary Python and R libraries. SAS
does not require additional libraries.
3. Load data: Load the preprocessed training and out-of-time (OOT) data sets.
4. Exclude one dummy variable from each group: To avoid the “dummy
variable trap,” remove one dummy variable created from each categorical
variable (see the sketch after this list).
7. Check missing columns: Check if any columns are missing in the OOT data
set compared to the training data set and add them if necessary.
8. Split data: Use stratified sampling to split the training data into training and
validation sets.
11. Build lasso regression model: Build a Lasso regression model and tune its
hyperparameters using grid search.
12. Apply model to OOT data set: Apply the best model obtained from the grid
search to the OOT data set to make predictions.
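As referenced in step 4 above, one-hot encoding with one level dropped per variable can be done directly in pandas. The DataFrame df and the column names shown are illustrative assumptions.

import pandas as pd

# drop_first=True removes one dummy per categorical variable, so the remaining
# dummies are not perfectly collinear with the intercept (the dummy variable trap)
encoded = pd.get_dummies(df, columns=['home_ownership', 'purpose'], drop_first=True)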
Decision Trees
Decision trees are a unique type of non-parametric model that offers a high degree
of flexibility, making them a powerful tool in data science. Unlike parametric
models, which assume a specific form for the underlying relationships between
variables, non-parametric models like decision trees allow the data itself to dictate
the model's structure. This flexibility enables decision trees to effectively handle a
wide variety of modeling issues, especially when relationships between variables are
complex or non-linear.
Structurally, decision trees are fundamentally different from models like logistic
regression. They employ a hierarchical and recursive partitioning of data, resulting
in a tree-like structure. This structure comprises nodes and leaves. Nodes represent
decisions based on specific conditions on input features, and leaves signify output
classes or final predictions.
To illustrate, consider a simple decision tree for predicting whether a loan will be
approved or not.
In this example, the root node splits the data based on whether income is greater
than 50K. If yes, it leads to a child node that further splits the data based on credit
score. If no, it leads directly to a leaf node predicting that the loan will be denied.
This tree-like structure makes decision trees highly interpretable. Each path from
the root to a leaf represents a decision rule. For instance, in the above example, one
rule could be “If income is greater than 50K and credit score is above 700, then the
loan is approved.”
Moreover, decision trees form the basis for more complex ensemble methods like
Random Forests and Gradient Boosting Machines (GBMs), often providing superior
predictive performance by combining predictions from multiple decision trees.
Parametric models such as logistic regression assume a specific functional form for
the model. Non-parametric models like decision trees, on the other hand, do not
make strong assumptions about the functional forms of relationships between
variables. Instead, they allow the data to dictate the model’s structure. This makes
decision trees more flexible, enabling them to handle a variety of modeling issues
effectively, but it can also make them prone to overfitting without proper tuning.
To further illustrate these differences, let’s look at a comparison table that shows
how logistic regression and decision trees handle various modeling issues:
Categorical Variables:
● Logistic Regression: Requires dummy variables for categorical features, which can lead to the dummy variable trap if not handled correctly.
● Decision Trees: Can handle categorical variables directly without the need for dummy variables.
Missing Values:
● Logistic Regression: Cannot handle missing values; requires imputation or removal of missing values.
● Decision Trees: Some implementations can handle missing values, but others may require imputation or removal of missing values.
Feature Selection:
● Logistic Regression: Does not inherently perform feature selection; may require methods like stepwise regression for feature selection.
● Decision Trees: Inherently perform feature selection by choosing the most informative features for splitting.
Imbalanced Data:
● Logistic Regression: May require techniques like oversampling, undersampling, or SMOTE to handle imbalanced data effectively.
● Decision Trees: Can handle imbalanced data but might be biased toward the majority class; balancing techniques may still be beneficial.
Decision trees excel when you need a model that is easy to interpret and explain.
Their tree-like structure provides clear insights into their decision-making process.
Depending on the implementation, they can also handle numerical and categorical
data, as well as missing values.
In the context of decision trees, impurity and information gain are two fundamental
concepts that guide the construction of the tree.
Impurity measures how mixed the classes are within a node. If a node contains
observations from only one class (a pure node), it has an impurity of zero. On the
other hand, if a node is equally split between multiple classes, it has a high impurity.
The goal of a decision tree algorithm is to partition the data in a way that minimizes
impurity and thus increases the purity of the data subsets. Two commonly used
measures of impurity are entropy and Gini impurity.
These concepts play a pivotal role in building decision trees. They help determine
the conditions for splitting the data at each node, aiming to maximize information
gain (reduce impurity) and create decision trees that can accurately classify new
instances.
● Entropy quantifies the impurity or disorder within a data set. For a node with
class proportions p_j, it’s computed as:

\[ I_H = -\sum_{j=1}^{c} p_j \log_2(p_j) \]

● Gini impurity measures the probability of misclassifying a randomly chosen
observation if it were labeled according to the class distribution in the node.
It’s computed as:

\[ I_G = 1 - \sum_{j=1}^{c} p_j^2 \]
These measures play a pivotal role in building decision trees. They help determine
the conditions for splitting the data at each node, aiming to maximize information
gain (in the case of entropy) or minimize impurity (in the case of Gini impurity). By
doing so, they help create decision trees that can accurately classify new instances.
Entropy and Gini impurity are fundamental to understanding how decision trees
work. They provide a quantitative basis for deciding how to split the data at each
node, guiding the construction of an effective and accurate decision tree model.
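To make the two measures concrete, here is a small Python sketch that evaluates them for a single node's class proportions (the helper functions are illustrative, not part of any library):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # maximally impure node: 1.0 and 0.5
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node: 0.0 and 0.0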
The process to construct a classification decision tree consists of six main steps:
1. Selecting the Root Node: The process begins with all training samples at the
root node. The feature and threshold that best divide the data are selected
as the root node. This is done by either maximizing information gain or
minimizing impurity, such as Gini impurity or entropy. Information gain is
calculated as the difference in impurity before and after a split. The feature
with the highest information gain is chosen for the split.
2. Creating Child Nodes: Once a feature is selected for the root node, the data
set is split into two subsets based on the chosen feature’s threshold. This
process is then recursively applied to each subset, creating child nodes by
selecting the feature and threshold that minimize impurity within each
subset.
By following these steps, you can construct a decision tree model that can make
accurate predictions while also being interpretable and easy to understand.
One of the biggest concerns with decision trees is their tendency to overfit the
model to the training data. This occurs because the algorithm will continue building
the tree and constructing very fine leaf nodes if the model is unrestrained. These
are labeled as “deep trees” because several layers of internal nodes lead to the leaf
node. We need a method of finding the optimal tree within the vast space of all
possible trees. One way of limiting the algorithm and preventing deep trees is to
place a constraint on the model. This constraint should reduce the size of the
decision tree without reducing the predictive accuracy.
1. Reduced Error Pruning: Starting from the bottom of the tree, each subtree is
replaced with a leaf node that represents the prevalent class of the
observations it covers. If the prediction accuracy does not decrease, the
replacement is kept; otherwise, the original subtree is restored.
2. Cost Complexity Pruning: This technique makes trade-offs between the size
of the tree and its fit to the training data, helping prevent overfitting. A
complexity parameter labeled alpha (α) is introduced and represents the
cost of each new leaf. The tuning parameter α controls a trade-off between
the subtree’s complexity and its fit to the training data.
The cost complexity metric penalizes the development of additional leaves that do
not significantly reduce the error rate. The error rate is defined differently
depending on the type of model. For regression models, the residual sum of squares
is used as the error rate. For classification models, the misclassification rate is used
as the error rate.
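In scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter. The sketch below is a hedged illustration; X_train and y_train are assumptions from the earlier split.

from sklearn.tree import DecisionTreeClassifier

# The unpruned tree itself supplies the candidate alpha values
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# Fit one pruned tree per alpha; in practice, keep the alpha with the best
# validation (or cross-validated) performance
trees = [DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]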
Performance Metrics
Evaluating the performance of a decision tree model is a crucial step in the machine
learning process. It allows us to understand how well our model is doing and where
it might fall short. We can use several metrics to evaluate a decision tree model,
each providing a different perspective on the model’s performance.
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
For instance, in the Lending Club data set, this would be the percentage of
loans correctly classified as approved or denied.
\[ \text{Precision} = \frac{TP}{TP + FP} \]
For example, it helps evaluate how many of the approved loans were truly
creditworthy.
\[ \text{Recall} = \frac{TP}{TP + FN} \]
In the Lending Club context, it tells us how many of the creditworthy loans
were correctly identified.
\[ F_1\ \text{Score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right) \]
For example, these metrics can be computed for a decision tree model built on the
Lending Club data set, as shown in the sketch below.
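In this hedged sketch, tree_model is assumed to be a fitted decision tree classifier, and X_val and y_val are the validation split from earlier.

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_pred = tree_model.predict(X_val)
print(confusion_matrix(y_val, y_pred))     # [[TN, FP], [FN, TP]]
print(accuracy_score(y_val, y_pred), precision_score(y_val, y_pred),
      recall_score(y_val, y_pred), f1_score(y_val, y_pred))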
Hyperparameter Tuning
Hyperparameters are configuration settings that control how a machine
learning algorithm learns from data and makes predictions. Selecting appropriate
values for hyperparameters is essential, as it can significantly impact a model's
ability to generalize and perform well on unseen data.
The goal of hyperparameter tuning is to find the sweet spot that maximizes a
model's predictive power while ensuring it doesn't become overly complex or
specialized to the training data.
The choice of tuning approach often depends on the complexity of the model and
the size of the hyperparameter search space. Hyperparameter tuning is an iterative
and essential part of building accurate and reliable machine learning models.
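A minimal grid search sketch for a decision tree follows; the parameter grid shown is illustrative rather than the book's tuned grid, and X_train and y_train are assumed from the earlier split.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [3, 5, 7, None],
              "min_samples_leaf": [1, 25, 100],
              "ccp_alpha": [0.0, 0.001, 0.01]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)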
Pros:
Cons:
Best Practices
By unifying these concepts, data scientists gain the proficiency to employ decision
trees effectively, from their construction to pruning. This enables them to make
informed choices when faced with modeling decisions and ultimately enhances the
quality of their predictive models.
print(train_encoded.columns)

# Ensure the order of columns in the OOT set is the same order as in the train set
OOT_encoded = OOT_encoded[predictors]

# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded['bad'] = OOT_bad

# Split the data into training and validation sets (80/20 split) using stratified sampling
  OOT_encoded[[c]] <- 0
}

# Ensure the order of columns in the OOT set is in the same order as in the train set
OOT_encoded <- OOT_encoded[predictors]

# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded$bad <- OOT_bad

print(confusionMatrix(as.factor(OOT_predictions), as.factor(OOT_encoded$bad)))
DATA oot_encoded;
    SET james.oot_encoded;
RUN;

    EXCEPT
    SELECT * FROM oot_vars;
QUIT;

/* Split the modeling data set by a 80/20 ratio using a random seed */
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2 SEED=42;
RUN;

    TABLES bad;
RUN;

DATA train_bal;
    SET pos neg(obs=6901); /* Input the number of positive cases */
RUN;
data test_score_&iteration.;
set val_data;
%include '/OneDrive/Documents/DS_Project/decision_tree.sas';
data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;
%mend;
3. Load the data: The data sets train_encoded and OOT_encoded are loaded
from the specified file paths.
5. Define predictors: The predictors are all columns in train_encoded that are
not in the excluded variables list.
7. Split the data into training and validation sets: The train_encoded data set
is split into training and validation sets using an 80/20 split with stratified
sampling.
8. Remove constant and duplicate columns: Any constant or duplicate
columns in the training set are removed.
11. Build the Decision Tree model: A Decision Tree model is initialized with a
random state of 42.
13. Tune hyperparameters using Grid Search: Grid Search finds the best
hyperparameters for the Decision Tree model.
14. Get the best model: The best model from Grid Search is stored
in best_model.
15. Apply the model to the OOT data set: The best model is used to predict
the OOT_encoded data set.
16. Create final performance metrics: Develop an AUC and ROC chart.
With a strong grasp of these foundational models, you are now ready to explore
more advanced techniques that can significantly enhance model performance. In
the next chapter, we’ll dive into ensemble methods – a powerful approach that
combines multiple models to create a stronger, more robust predictive model.
Ensemble methods like bagging, boosting, and random forests can help you
overcome the limitations of individual models by leveraging the strengths of
multiple algorithms.
In Chapter 6, you will learn how to implement these ensemble methods across SAS,
Python, and R. We’ll discuss how they work, why they’re effective, and how they
can be tuned to achieve optimal performance. By the end of the chapter, you’ll have
a comprehensive understanding of how to use ensemble techniques to take your
predictive modeling to the next level.
So, with your foundation firmly in place, let’s move on to Chapter 6 and explore the
world of ensemble methods, where the power of combining models can lead to
more accurate and reliable predictions.
2. Modeling Data
4. Logistic Regression
5. Decision Trees
7. Hyperparameter Tuning
● Logistic Regression:
● Decision Trees:
Chapter 5 Quiz
Questions:
1. What is the primary difference between regression and classification
models in predictive modeling?
2. How does the quality of the modeling data set impact the accuracy of
predictive models?
3. What are the key steps involved in the model pipeline for data preparation?
13. How is the ROC-AUC metric used to evaluate the performance of a decision
tree model?
14. What are the advantages of using decision trees over logistic regression?
16. What is the difference between reduced error pruning and cost complexity
pruning?
19. What are the key performance metrics used to evaluate a decision tree
model?
20. What are the pros and cons of using logistic regression versus decision trees
for predictive modeling?
Overview
Decision trees are fundamental building blocks in predictive modeling, laying the
foundation for powerful ensemble methods. This chapter explores the intricacies of
ensemble methods, where the synergistic combination of predictive models takes
center stage.
Decision trees, with their inherent capability to capture intricate patterns, form the
cornerstone upon which advanced techniques like random forest and gradient
boosting thrive. Our exploration commences with random forest, an ensemble of
decision trees, each contributing its unique perspective to a collective prediction.
This collaborative approach enhances accuracy and mitigates the overfitting risks
often encountered with standalone models.
Next, we delve into the domain of gradient boosting, a technique that meticulously
constructs a sequence of trees, each rectifying the errors of its predecessor. This
iterative process culminates in a model of exceptional predictive power, making
gradient boosting a trusted tool for data scientists.
Random Forest
Random forests are an ensemble learning method that operates by constructing
multiple decision trees at training time and outputting the class that is the mode of
the individual trees' predictions (for classification) or the average of their predictions
(for regression).
The construction of a random forest model involves several steps, which are
detailed below:
2. Individual Tree Construction: Once your bootstrapped data set has been
constructed, use it to build an individual decision tree. Start by randomly
selecting a subset of available predictors to evaluate for your root node.
After determining your root node, select random subsets of variables at
each step to construct the rest of your tree.
These steps make random forests a powerful tool for many machine learning tasks.
Compared to single decision trees, they provide higher accuracy and better control
over overfitting.
… high-cardinality categorical variables.
These differences make random forests a powerful tool for many machine learning
tasks, providing higher accuracy and better control over overfitting than single
decision trees.
● Eliminating the Need for Explicit Data Splitting: Unlike traditional machine
learning models that require division of data sets into training and validation
sets, random forests simplify this process. Each decision tree in the forest
creates its training data set through bootstrapping, effectively leaving out a
portion of the data points. This omitted data acts as an internal validation
set, eliminating the requirement for a predefined validation set.
● Out-of-Bag (OOB) Error Estimation: Each tree is evaluated on the observations
that were left out of its bootstrap sample during training. This creates a set of
OOB predictions for the entire data set. Aggregating these predictions across all
trees yields the OOB error, a robust measure of the model’s performance (a
minimal sketch follows this list).
● Utilizing More Data: One significant advantage of the OOB approach is its
efficiency in data usage. By eliminating the need for data splitting, random forests
allow you to maximize the utility of your available data. This can be especially
advantageous when data is limited, ensuring that every data point contributes to
model training and validation.
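To make the OOB idea concrete, here is a minimal scikit-learn sketch. The data set, feature count, and parameter values are hypothetical placeholders rather than the book's project configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical binary-classification data standing in for the project data set
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# oob_score=True scores each tree on the rows left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)

Because the OOB estimate comes for free during training, no separate validation split is needed for this quick performance check.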
In the context of random forests, feature selection is an inherent part of the model
building process. A random subset of features is considered for splitting at each
node in each decision tree within the forest. The feature that provides the highest
information gain (for classification) or reduction in variance (for regression) is
selected. This randomness in feature selection contributes to the robustness of the
random forest model.
Both Gini impurity and entropy can be used as criteria for feature selection in a
random forest model. The choice between them depends on the specific problem
and the data at hand. Gini impurity measures the degree or probability of a
particular variable being wrongly classified when it is randomly chosen. On the other
hand, entropy measures the purity, or randomness, of the input set. These metrics
are explained in more detail in the decision tree section of this chapter.
In terms of specific feature selection methods, apart from the inherent feature
selection that happens at each node, there are also methods like permutation
importance and Mean Decrease Impurity (MDI) that can be used. Permutation
importance is calculated by permuting the values of each feature one by one and
measuring the decrease in accuracy. MDI, on the other hand, computes the total
reduction of the criterion brought by that feature. These methods provide a global
view of feature importance across all the trees.
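As a rough illustration of these two importance measures, the following scikit-learn sketch contrasts the built-in MDI scores with permutation importance. The synthetic data and parameter values are illustrative assumptions, not the book's project settings.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data; replace with your own predictors and target
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Mean Decrease Impurity: total impurity reduction attributed to each feature
print("MDI importances:", rf.feature_importances_)

# Permutation importance: drop in test accuracy when a feature's values are shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
print("Permutation importances:", perm.importances_mean)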
Pruning is not typically used with random forests. This is because the ensemble
method of combining a multitude of decision trees tends to protect against
overfitting. Each tree in a random forest is allowed to grow with high complexity,
which means the trees can adapt to complex patterns in the data and even noise.
The final prediction, which is an average of predictions from all trees, will usually
not overfit as long as there are enough trees in the forest.
Performance Metrics
Performance metrics for random forests are similar to those used for decision trees.
For classification problems, these can include accuracy, precision, recall, F1 score,
and area under the ROC curve (AUC-ROC). For regression problems, mean absolute
error (MAE), mean squared error (MSE), or R-squared might be used. These metrics
are explained in more detail in the decision tree section of this chapter.
Best Practices
● Use feature importance scores for feature selection or to gain insights about
the data.
The steps removed from the original decision tree model workflow include
removing excluded variables and balancing the data. These steps are unnecessary
for a random forest model, as it inherently handles multicollinearity and imbalance
in the data. The random forest model also considers all variables when building the
trees, so there is no need to manually exclude any variables.
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
trainIndex <- createDataPartition(train_encoded$bad, p = .8, list = FALSE, times = 1)
X_train <- train_encoded[trainIndex, predictors]
DATA train_encoded;
SET james.train_encoded;
RUN;
DATA oot_encoded;
SET james.oot_encoded;
RUN;
RUN;
/*Create hyperparameter tuning macro that will loop through a range of values for each of
the tuning parameters */
DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;
MAXDEPTH = &maxdepth.
SPLITSIZE = &splitsize.;
TARGET &target. /LEVEL=binary;
INPUT &predictors. / LEVEL=interval;
SAVE FILE = '/OneDrive/Documents/DS_Project/random_forest.sas';
RUN;
data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;
%mend;
/*Apply optimal hyperparameters to the OOT data set for final evaluation*/
RUN;
3. Load the data: The training and OOT data sets are loaded from CSV files.
4. Define predictors: The predictors for the model are defined as all columns
in the training data set except for the target variable “bad”.
5. Split the data: The training data is split into training and validation sets
using stratified sampling.
8. Tune hyperparameters using Grid Search: Grid search is used to tune the
hyperparameters of the random forest model. It fits the model to the
training data and finds the best hyperparameters.
9. Apply the model to the OOT data set: The best model is applied to the OOT
data set to make predictions.
11. Plot ROC curve: Finally, an ROC curve is plotted to visualize the model's
performance.
Gradient Boosting
The construction of a gradient boosting model involves several steps, which are
detailed below:
1. Initialize the Model: Start with a base model that makes a single prediction
for all observations in your data set. This prediction could be the mean (for
regression problems) or the mode (for classification problems) of the target
variable.
3. Fit a Decision Tree: Fit a decision tree to the residuals from step 2. The goal
here is to predict the residuals, not the actual target variable.
5. Repeat Steps 2–4: Repeat steps 2–4 for a specified number of iterations.
Each iteration fits a new decision tree to the residuals of the current
predictions and updates the predictions.
These steps make gradient boosting a formidable tool for many machine learning
tasks. Compared to single decision trees or even random forests, it provides higher
accuracy and better control over overfitting.
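The residual-fitting loop described above can be sketched in a few lines of Python. This toy version assumes a regression target with squared-error loss and shallow regression trees as base learners; it mirrors the steps rather than replacing a production library, and every name and value in it is hypothetical.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data; any numeric target would work the same way
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())   # Step 1: start from a single base prediction
trees = []

for _ in range(100):
    residuals = y - prediction                      # Step 2: errors of the current model
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)                          # Step 3: fit a tree to the residuals
    prediction += learning_rate * tree.predict(X)   # Step 4: nudge the predictions
    trees.append(tree)                              # Step 5: repeat for each iteration

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))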
Gradient boosting is a robust machine learning algorithm that’s been adapted into
several different models, each with its own strengths and unique features. Here are
a few popular types of gradient-boosting models:
● XGBoost: Short for eXtreme Gradient Boosting, XGBoost is one of the most
popular gradient boosting models due to its speed and performance. It’s
known for its scalability in all scenarios and supports parallel processing.
XGBoost also includes a feature that allows cross-validation at each iteration
of the boosting process.
Each of these models has its own set of hyperparameters to tune and may perform
differently depending on the specifics of your data and problem. It’s always a good
idea to try different models and see which works best for your specific use case.
Gradient boosting models are often more accurate than random forests because
they build trees sequentially, with each tree learning from the mistakes of the
previous ones. This iterative approach allows the model to learn from its errors and
improve its predictions. However, this increased accuracy comes at a cost: gradient
boosting models typically require more data to train effectively and are more
computationally intensive, which means they may not be as efficient as random
forests.
Choosing between these two models often depends on your project's specific
requirements. If you have a large amount of data and computational efficiency is
not a primary concern, a gradient boosting model may be the best choice due to its
potential for higher accuracy. On the other hand, if you have less data or need a
model that can train quickly, a random forest might be more suitable.
In summary, both random forests and gradient boosting models are powerful tools
for machine learning tasks, each with their own strengths and weaknesses. The
choice between them should be based on the amount of available data, the
computational resources at your disposal, and the level of accuracy required for
your specific use case.
In random forests, the OOB error is computed from the instances that were not included (left out or “out of bag”) in the bootstrap sample
for each tree. The OOB instances are passed down the tree and used to compute an
unbiased estimate of the model’s error rate.
Gradient boosting algorithms do not use the OOB metric because they build trees
sequentially, with each tree learning from the mistakes of the previous ones rather
than independently, like in random forests. However, gradient boosting models
often include a validation set as part of their training process and use early stopping
to prevent overfitting. The validation error can be considered a similar metric to the
OOB error in random forests.
Here are some key points about feature selection in gradient boosting models:
● Splitting Criterion: Both Gini impurity and entropy can be used as criteria
for feature selection in a gradient boosting model. The choice between
them depends on the specific problem and the data at hand. However,
unlike decision trees or random forests, gradient boosting typically uses a
different criterion for splitting. It uses a loss function that depends on the
task at hand (e.g., mean squared error for regression, log loss for
classification).
● Feature Importance: After the model has been trained, we can obtain a
measure of feature importance, which indicates how useful or valuable each
feature was in the construction of the boosted decision trees within the
model. Features with higher importance were more influential in creating
the model, indicating a stronger association with the response variable.
● Subsampling: Additional hyperparameters (e.g., a subsample setting) control the fraction of samples to be used for fitting the individual base learners.
Remember, while gradient boosting models can provide high performance, they
also require careful tuning and consideration of these factors.
Log Loss
Log loss, also known as logistic loss or cross-entropy loss, is a fundamental metric in
machine learning used to evaluate the performance of binary classification models.
It quantifies the accuracy of a classifier by penalizing false classifications, making it
particularly useful when the prediction output is a probability that an instance
belongs to a particular class.
Log Loss = −(1/N) ∑ᵢ [yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ)]
where:
● N is the number of observations
● yᵢ is the actual label (0 or 1) for observation i
● pᵢ is the predicted probability that observation i belongs to class 1
The goal of any machine learning model is to minimize this value. A perfect model
would have a log loss of 0. However, it’s important to note that log loss heavily
penalizes classifiers that are confident about an incorrect classification. For
example, if for a particular observation, the actual label is 1 but the model predicts
it as 0 with high confidence, then the log loss would be a large positive number. This
makes log loss an excellent metric for assessing the reliability of model predictions.
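A quick way to see this penalty in action is scikit-learn's log_loss function; the labels and probabilities below are made up purely for illustration.

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]

# The first prediction is confidently wrong (actual 1, predicted probability 0.01)
confident_wrong = [0.01, 0.2, 0.9, 0.8, 0.1]
well_calibrated = [0.70, 0.2, 0.9, 0.8, 0.1]

print("Log loss with a confident mistake:", log_loss(y_true, confident_wrong))
print("Log loss when calibrated:", log_loss(y_true, well_calibrated))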
While gradient boosting models do not typically use pruning in the traditional sense,
they do incorporate a form of “pruning” through regularization techniques like early
stopping and shrinkage. These techniques help control the complexity of the model
and prevent overfitting, which is crucial when dealing with high-dimensional data or
complex patterns.
● Shrinkage: Also known as learning rate, this technique slows down the
learning process by shrinking the contribution of each tree by a factor
(between 0 and 1) when it is added to the current ensemble. This helps
regularize the model and prevents overfitting.
These techniques, along with others like subsampling, help control the complexity of
gradient boosting models and prevent them from overfitting. They are part of what
makes gradient boosting a powerful and flexible machine learning method.
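As one possible illustration of shrinkage and early stopping, here is a scikit-learn sketch using GradientBoostingClassifier. The data set is synthetic and the parameter values are arbitrary examples, not recommendations.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Shrinkage via learning_rate; early stopping via an internal validation fraction
gbm = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; early stopping usually halts far sooner
    learning_rate=0.05,       # shrink each tree's contribution
    subsample=0.8,            # row subsampling adds extra regularization
    validation_fraction=0.1,  # held-out portion used to monitor improvement
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gbm.fit(X, y)
print("Trees actually fit:", gbm.n_estimators_)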
Performance Metrics
Performance metrics for gradient boosting models are similar to those used for
decision trees. For classification problems, these can include accuracy, precision,
recall, F1 score, and area under the ROC curve (AUC-ROC). For regression problems,
mean absolute error (MAE), mean squared error (MSE), or R-squared might be used.
These metrics are explained in more detail in the decision tree section of Chapter 5.
Best Practices
● Tune hyperparameters such as the number of trees in the sequence and the
learning rate.
● Balance your data set if dealing with imbalanced classes, either by
undersampling, oversampling, or generating synthetic samples.
● Use feature importance scores for feature selection or to gain insights about
the data.
The following gradient boosting code is a robust workflow for creating a gradient
boosting model, tuning its hyperparameters, applying the model to an out-of-time
(OOT) data set, and generating final model performance metrics on the scored OOT
data set. This workflow is designed to be efficient and effective, leveraging the
power of the gradient boosting algorithm and the flexibility of each programming
language’s modeling library.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
print(train_encoded.columns)
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
trainIndex <- createDataPartition(train_encoded$bad, p = .8, list = FALSE, times = 1)
X_train <- train_encoded[ trainIndex, predictors]
y_train <- train_encoded[ trainIndex, 'bad']
X_val <- train_encoded[-trainIndex, predictors]
y_val <- train_encoded[-trainIndex, 'bad']
plot(roc_obj)
DATA oot_encoded;
SET james.oot_encoded;
RUN;
/* Split the modeling data set by an 80/20 ratio using a random seed */
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2 SEED=42;
RUN;
DATA train_bal;
SET pos neg(obs=6901); /*Input the number of positive cases*/
RUN;
/*Create hyperparameter tuning macro that will loop through a range of values for each of
the tuning parameters */
DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;
DATA test_score_&iteration.;
SET val_data;
%INCLUDE '/OneDrive/Documents/DS_Project/gradient_boosting.sas';
KEEP id &target. P_bad1;
RUN;
data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;
%mend;
************************************************************************;
/*Apply optimal hyperparameters to the OOT data set for final evaluation*/
************************************************************************;
DATA oot_score;
SET oot_encoded_final;
%INCLUDE '/OneDrive/Documents/DS_Project/gradient_boosting.sas';
KEEP id bad P_bad1;
RUN;
3. Load the data: The training and OOT data sets are loaded from CSV files.
4. Define predictors: The predictors for the model are defined as all columns
in the training data set except for the target variable “bad”.
5. Split the data: The training data is split into training and validation sets
using stratified sampling.
9. Tune hyperparameters using Grid Search: Grid Search is used to tune the
hyperparameters of the gradient boosting model. It fits the model to the
training data and finds the best hyperparameters.
10. Apply the model to the OOT data set: The best model is applied to the OOT
data set to make predictions.
12. Plot ROC curve: Finally, an ROC curve is plotted to visualize the model's
performance.
With a solid understanding of ensemble methods, it’s time to push the boundaries
even further. In the next chapter, we’ll delve into advanced modeling techniques
that go beyond the traditional and ensemble methods you’ve learned so far. These
techniques, including support vector machines and neural networks, represent the
cutting edge of predictive modeling and are capable of handling highly complex data
structures and relationships.
In Chapter 7, you will discover how to implement these advanced techniques in SAS,
Python, and R. We’ll explore their underlying principles, practical applications, and
how to optimize these models for the best performance. By the end of the chapter,
you’ll have the skills to apply some of the most sophisticated tools in predictive
modeling, enabling you to tackle even the most challenging data science problems.
So, with your ensemble methods knowledge in hand, let’s advance to Chapter 7,
where we’ll explore the frontiers of predictive modeling and equip you with the
techniques needed to handle the complexities of modern data science.
2. Random Forest
● Out-of-Bag (OOB) Metric: The OOB metric, a key feature of Random Forest,
is discussed in detail. This metric provides an unbiased estimate of the
model's error without the need for a separate validation set, making it
efficient in data usage.
3. Gradient Boosting
The chapter explains how this iterative approach allows Gradient Boosting
to achieve higher accuracy by focusing on difficult-to-predict observations.
● Use Cases: The chapter advises on when to use each method, suggesting
that Random Forest may be preferable for quick, robust models, while
Gradient Boosting is suited for situations where the highest possible
accuracy is required, and computational resources are not a limiting factor.
● Random Forest: The chapter explains that pruning is not typically used in
Random Forest due to its ensemble nature, which naturally mitigates
overfitting. Instead, the focus is on selecting the right number of trees and
controlling their depth.
7. Performance Metrics
Chapter 6 Quiz
2. How does Random Forest reduce the risk of overfitting compared to a single
decision tree?
5. How does Gradient Boosting differ from Random Forest in its approach to
building models?
8. When should you consider using Random Forest over Gradient Boosting?
13. What are the typical use cases for LightGBM compared to XGBoost?
16. What are the pros and cons of using Gradient Boosting in terms of
computational efficiency?
17. How can you use permutation importance to assess feature significance in
Gradient Boosting?
18. What metrics would you use to evaluate the performance of a Random
Forest model?
19. How does Gradient Boosting handle imbalanced data sets compared to
Random Forest?
20. What are the common hyperparameters that need tuning in Gradient
Boosting models?
Feature Selection / Importance:
- SAS: Use PROC VARREDUCE to select important features
- Python: Use SelectFromModel for feature selection
- R: varImp() from caret to assess feature importance in gbm models

Pruning / Regularization:
- SAS: Random Forest: No pruning required; control model complexity via tree depth and the number of trees. Gradient Boosting: Control overfitting with SHRINKAGE and LEAFSIZE
- Python: Random Forest: Control tree depth and the number of estimators to avoid overfitting. Gradient Boosting: Use learning_rate and max_depth for regularization; early stopping with validation_fraction
- R: Random Forest: Pruning not necessary; control with ntree and maxnodes. Gradient Boosting: Use shrinkage and interaction.depth to prevent overfitting

Performance Metrics:
- SAS: Use PROC HPFOREST and PROC GRADBOOST to calculate performance metrics such as accuracy, AUC-ROC, precision, recall
- Python: sklearn.metrics module for calculating accuracy, precision, recall, F1 score, AUC-ROC for both Random Forest and Gradient Boosting models
- R: caret and pROC packages for calculating accuracy, precision, recall, F1 score, AUC-ROC for Random Forest and Gradient Boosting models

Best Use Cases:
- SAS: Random Forest: When a quick, robust model is needed that can handle large data sets with many features. Gradient Boosting: When …
- Python: Random Forest: Ideal for data sets with many features and a need for robustness against overfitting. Gradient Boosting: Best …
- R: Random Forest: Useful for creating interpretable models that handle noisy data well. Gradient Boosting: …
Overview
Support vector machines excel in both linear and non-linear classification tasks,
effectively identifying boundaries between classes in high-dimensional data spaces.
Their ability to find optimal hyperplanes, even in noisy environments, makes SVMs a
valuable tool for a wide range of applications, including image recognition, text
classification, and anomaly detection.
As we delve into these advanced techniques, we'll unravel their intricate workings
and illuminate their potential in solving challenging problems. Our exploration will
showcase how these algorithms are shaping the future of data science and artificial
intelligence.
Support vector machines are a distinct class of machine learning algorithms that
provide an alternative approach to predictive modeling. Unlike tree-based models
that use a hierarchical, rule-based structure, SVMs operate on the principle of finding
an optimal separating hyperplane that maximizes the margin between classes.
The key components of SVMs include support vectors, hyperplanes, and margins.
Here’s how they work in SVMs:
● Support Vectors: These are data points that are closest to the hyperplane
and influence its orientation and position. They play a pivotal role in
defining the decision boundaries of the Support Vector Classifier (SVC). Due
to their key role, other observations become almost irrelevant. The entire
model can be constructed from just these few data points.
● Margins: This is the distance between the hyperplane and the support
vectors. The goal of an SVM is to maximize this margin to ensure robustness
against overfitting and to allow for better generalization on unseen data.
These characteristics make SVMs one of the most efficient models available, as they
require only a handful of observations for prediction.
Figure 7.1 shows a hypothetical data distribution of observations with two classes.
The left side of the figure shows the raw data distribution. There are many ways
that we can separate the two classes. Our goal is to find the point of separation that
provides the maximum distance between a pair of data points of both classes. This
is called the maximum margin.
The right side of the figure shows the same data distribution with the optimal
boundary that provides the maximum separation between the two closest points of
each class.
The line that provides the maximum space between the two classes provides us
some comfort that new observations will be classified with more confidence. If a
new observation is on one or the other side of the support vector classifier, then we
can comfortably classify that observation.
The observations closest to the support vector classifier are called support vectors.
These observations define the position of the support vector classifier. They
influence the position and orientation of the line or hyperplane. Due to the key role
these specific observations play in creating the decision boundaries of the Support
Vector Classifier (SVC), the remaining observations are nearly unimportant. The
entire model can be derived from just a few data points. This makes the support
vector machine one of the “lightest” models available because it only requires a few
observations to make predictions.
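The following scikit-learn sketch makes this concrete by fitting a linear SVM to a toy two-class data set and inspecting the handful of support vectors that define the boundary; the data are synthetic and purely illustrative.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic, roughly separable classes stand in for the example in Figure 7.1
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Only these few observations define the position of the decision boundary
print("Support vectors per class:", svm.n_support_)
print("Support vector coordinates:")
print(svm.support_vectors_)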
The construction of an SVM model involves several steps, which are detailed below:
2. Identify Support Vectors: Identify the data points (support vectors) closest
to the hyperplane.
These steps make SVMs a powerful tool for many machine learning tasks, providing
higher accuracy in high-dimensional spaces compared to single decision trees or
even random forests.
Support vector machines and ensemble tree-based models, such as random forests
and gradient boosting machines, are powerful machine learning algorithms widely
used in predictive modeling. While they share some similarities, they also have
distinct differences in how they handle data, their sensitivity to certain issues, and
their overall performance. The following table provides a comparison of these two
types of models:
These characteristics make both ensemble tree-based models and SVMs valuable
tools in the machine learning toolkit, each with its own strengths and ideal use
cases.
Kernel:
● The kernel is a function used in SVM to transform the input data into a
higher-dimensional space where it might be easier to classify it. This is
particularly useful when the data is not linearly separable in its original
space.
● The choice of the kernel (linear, polynomial, radial basis function, sigmoid,
etc.) depends on the nature of the data and significantly impacts the
performance of the SVM.
● For example, using the Lending Club data set, a radial basis function (RBF)
kernel could be used to classify whether a loan would be paid off based on
features like loan amount, interest rate, and borrower’s credit score. The
RBF kernel can capture the complex, non-linear relationships between these
features and the target variable.
C Parameter:
Remember, the choice of kernel and the value of C should ideally be determined
through cross-validation or other model selection methods to ensure the best
performance on unseen data.
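A hedged sketch of that cross-validation step, using scikit-learn's GridSearchCV to search the kernel and C together, is shown below; the grid values and data set are placeholders.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Search the kernel and C jointly with 5-fold cross-validation
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

print("Best settings:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)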
In the SVM workflow, standardization typically occurs after data preprocessing steps
like missing value imputation and outlier handling. It ensures that all features
contribute equally to the model's decision-making process. Without standardization,
the SVM may misclassify instances or produce suboptimal results. After
standardization, the data is ready for model training, parameter tuning, and
evaluation. Fortunately, most machine learning libraries provide convenient
functions for standardization, making it an easily implementable step in the SVM
modeling process.
Standardization plays a pivotal role in SVMs by ensuring that the scale of input
features does not bias the model's decision boundary. It promotes fairness among
features, allowing SVMs to identify the most effective hyperplane for classification
tasks. By integrating standardization into the SVM workflow, data scientists can
improve model performance, leading to more accurate and reliable results in
various applications.
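One convenient way to bake standardization into the workflow is a scikit-learn Pipeline, so the scaler is refit on each training fold rather than on the full data set. The data and settings here are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaling inside the pipeline is refit on each training fold, avoiding leakage
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(svm_pipeline, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())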
Performance Metrics
Performance metrics for support vector machines are similar to those used for other
classification and regression models. For classification problems, these can include
accuracy, precision, recall, F1 score, and area under the ROC curve (AUC-ROC). For
regression problems, mean absolute error (MAE), mean squared error (MSE), or R-
squared might be used. These metrics are explained in more detail in the Decision
Tree section of Chapter 5.
● They are not suitable for large data sets due to high training time.
● They do not directly provide probability estimates.
Best Practices
When developing a support vector machine model, several data preparation steps
are typically necessary to ensure the best performance of the model. Here are some
key considerations:
● Outliers: SVMs are not very sensitive to outliers because they focus on the
points that are hardest to tell apart (the support vectors). However, in some
cases, outliers can affect the hyperplane and lead to suboptimal results. It
might be beneficial to handle outliers based on the specific data set and
problem context.
● Missing Data: SVMs cannot handle missing data. Any missing values in the
data set must be handled appropriately before training the model. This
could involve imputation strategies or, in some cases, removing instances or
features with too many missing values.
● Data Transformation: Depending on the nature of the data and the kernel
used, it might be beneficial to transform the data (e.g., using a log
transformation for skewed data) before training the SVM.
Remember, the specific data preparation steps can depend on the nature of the
data set and the specific problem context. It’s always a good idea to explore the
data thoroughly and make informed decisions based on the insights gained from
this exploration.
Unlike random forests, SVMs do not inherently handle an imbalance in the data.
Therefore, depending on the level of imbalance in your data, you might need to
consider techniques such as undersampling, oversampling, or generating synthetic
samples when preparing your data for an SVM model.
Additionally, SVMs are not scale-invariant, meaning they can be sensitive to the
range of the features. Therefore, standardizing the data before training an SVM is
generally a good practice. This ensures that all features contribute equally to the
decision boundary. The choice of kernel and the value of C should ideally be
determined through cross-validation or other model selection methods to ensure
the best performance on unseen data. These are key components of the SVM
algorithm that help in data adjustment and model complexity control. They allow
the model to capture complex, non-linear patterns in the data and control the
complexity of the resulting decision boundary.
The provided code is a pipeline for training a support vector machine model on
preprocessed data, tuning its hyperparameters, and evaluating its performance.
Here’s a step-by-step explanation of the code:
3. Load the data: The preprocessed training and out-of-time (OOT) data sets
are loaded from CSV files.
4. Define predictors and target variable: The predictors (features) and the
target variable (“bad”) are defined. Any missing columns in the OOT data set
are added.
5. Split the data: The training data is split into a training set and a validation
set using stratified sampling.
6. Undersample the majority class: To deal with class imbalance, the majority
class in the training data is undersampled.
7. Standardize the data: The predictors in the resampled training data and the
OOT data set are standardized using sklearn’s StandardScaler.
8. Build the SVM model: An SVM model is initialized with a random state for
reproducibility.
10. Tune hyperparameters using grid search: Grid search is used to find the
best hyperparameters for the SVM model.
11. Apply the model to the OOT data set: The best model found by grid search
is used to make predictions on the OOT data set.
12. Evaluate the model: The performance metrics function evaluates the
model’s performance on the OOT data set. A ROC curve is also plotted.
In our case, we have a balanced data set with 13,836 instances (6,918 positive and
6,918 negative) and 91 predictors. This is a relatively large data set, and training an
SVM on this data set involves solving a complex optimization problem that requires
significant computational resources.
Here are a few strategies you might consider to speed up the process:
2. Data Sampling: Use a subset of your data to train the SVM. Once the SVM is
trained, it can be tested on the rest of the data.
Remember, it’s important to balance model accuracy and training time. Sometimes,
a slightly less accurate model can be an acceptable trade-off for significantly faster
training times. It’s always a good idea to start with a smaller, simpler model and
data set and then gradually scale up as needed.
print(train_encoded.columns)
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])
print(colnames(train_encoded))
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
train_index <- createDataPartition(train_encoded$bad, p = 0.8, list = FALSE)
X_train <- train_encoded[train_index, predictors]
y_train <- train_encoded[train_index, 'bad']
X_val <- train_encoded[-train_index, predictors]
y_val <- train_encoded[-train_index, 'bad']
1, ][sample(1:nrow(train_data[train_data$bad == 1, ]),
minority_class_count), ])
DATA oot_encoded;
SET james.oot_encoded;
RUN;
/* Split the modeling data set by an 80/20 ratio using a random seed */
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2
SEED=42; RUN;
DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;
%MACRO tune(iteration, kernel, C);
%put "Iteration: &iteration";
%put "Kernel: &kernel";
%put "C: &C";
data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;
%mend;
Neural Networks
Neural networks, often in the spotlight of machine learning, are the backbone of
deep learning – an innovation that has revolutionized artificial intelligence. These
networks shine in applications like image recognition, speech-to-text conversion,
gaming, and even complex problem-solving, such as defeating human champions in
both chess and Go. At their core, neural networks are the linchpin of deep learning
endeavors.
Deep learning, a pivotal concept, extends the power of neural networks. It involves
training neural networks on extensive data sets, often with multiple interconnected
layers. The objective is straightforward: proficiently classify or predict outcomes
with accuracy. Deep learning's charm lies in its ability to embrace more data and
complexity, constantly improving predictions as it encounters new information and
layers of intricacy within the neural network.
Consider Figure 7.2, showcasing how neural networks surge in performance as they
ingest more data and introduce additional hidden nodes. Unlike many other
machine learning techniques that plateau with data abundance, neural networks
uniquely continue learning and predicting, adapting to heightened data and added
complexity through these hidden layers.
However, the allure of neural networks lies in their capability to go beyond simple
linear models. Figure 7.4 offers an alternative representation – a neural network
format for a regression model.
Each neuron embodies a linear equation expressing the relationship between the
predictor and the target variable. It might seem like an elementary version of linear
regression, but here's the twist: neural networks aren't constrained to a single linear
equation. Instead, they stack multiple neurons, weaving intricate relationships
between numerous predictors and one target variable.
Figure 7.5 introduces a neural network example with four input features and a
hidden layer of nodes.
It begins like a linear regression, with four input features (x) predicting an output
feature (y). However, the pivotal shift occurs in the middle, where hidden nodes
emerge. These nodes aren't explicitly defined; the neural network identifies and
constructs patterns autonomously. In this case, it might amalgamate horsepower
and anti-lock brakes into a "performance" node and safety ratings with make/model
into a "luxury" node. These hidden nodes then assist in predicting the outcome
variable.
Y ≈ β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + …
where the weights (βₚ) correspond to the input features of the hidden nodes. Figure 7.6
illustrates how these nodes generate new weighted features for the final prediction.
Deep neural networks, however, are not limited to a single hidden layer; they stack
additional layers of hidden nodes. The more layers they contain, the more intricate
and computationally demanding the models become.
Before predicting an outcome, neural networks employ a final trick. After computing
the weighted sum for each hidden unit, a non-linear function is applied. The choice
of function, such as ReLU or tanh, empowers neural networks to master intricate
functions, surpassing the capabilities of simple linear regression models.
Neural networks excel in problems demanding complex pattern recognition and are
the bedrock of deep learning's transformative impact. Whether you're exploring
image data, natural language, or intricate relationships, neural networks can be your
ace in the hole.
1. Initialize the Model: Start with the architecture of the neural network,
which includes the number of layers, the number of neurons in each layer,
and the activation functions. Typically, a neural network consists of an
input layer, one or more hidden layers, and an output layer.
4. Calculate Loss: Compute a loss function that quantifies the error between
the predicted values and the actual target values. Common loss functions
include mean squared error for regression and cross-entropy for
classification.
6. Update Weights: Adjust the weights and biases of the neural network
using an optimization algorithm like Stochastic Gradient Descent (SGD) or
Adam. This step minimizes the loss function and fine-tunes the model.
7. Repeat Steps 3–6: Iterate through the training data multiple times
(epochs), continuously updating the weights to improve the model's
performance.
10. Make Predictions: Deploy the trained neural network to make predictions
on new, unseen data.
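A minimal Keras sketch of the steps above, assuming TensorFlow is installed; the architecture, data, and training settings are placeholders rather than the book's project configuration.

import numpy as np
from tensorflow import keras

# Hypothetical tabular data: 1,000 rows, 10 features, binary target
X = np.random.rand(1000, 10)
y = (X.sum(axis=1) > 5).astype(int)

# Step 1: define the architecture (input, one hidden layer, output)
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),    # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # output layer for a binary target
])

# Steps 4 and 6: the loss quantifies the error; the optimizer updates the weights
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Step 7: iterate over the training data for several epochs
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, verbose=0)

# Step 10: predict on new observations
predictions = model.predict(X[:5])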
Neural networks encompass various architectures, each designed for specific tasks.
Here are some popular types:
● Recurrent Neural Networks (RNN): Tailored for sequence data like time
series or natural language. They maintain a memory of past inputs through
recurrent connections.
2. Data Size: Neural networks require substantial amounts of data for training,
making them suitable for big data sets with many features.
7. Training Speed: Deep neural networks with many layers can take a long
time to train, but techniques like transfer learning can expedite the process.
Neural networks are potent tools for complex machine learning tasks, but they
require careful consideration of data size, computational resources, interpretability,
and hyperparameter tuning. Therefore, it's essential to choose the right neural
network architecture and preprocessing techniques for your specific problem.
2. Adaptability: Neural networks can adapt and learn from new data, making
them suitable for scenarios where the underlying patterns evolve over time.
3. Data Hungry: Neural networks often require large amounts of data for
effective training, which might not be available for all problems.
5. Black Box: The inner workings of neural networks can resemble a black box,
making it difficult to interpret their decision-making process.
Before diving into neural network development, addressing specific issues and
challenges that can affect model performance and reliability is essential. These
include:
print(train_encoded.columns)
# Calculate VIF
X = sm.add_constant(train_encoded[predictors])
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in
range(X.shape[1])]
vif["features"] = X.columns
# Ensure the order of column in the OOT set is the same order as in
train set
OOT_encoded = OOT_encoded[predictors]
# Add back the target variable 'bad' to the OOT_encoded data set
OOT_encoded['bad'] = OOT_bad
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
X_train, X_val, y_train, y_val = train_test_split(
    train_encoded[predictors], train_encoded['bad'],
    test_size=0.2, random_state=42, stratify=train_encoded['bad'])
# Create GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5)
print(grid_search.best_params_)
# Calculate VIF
vif_fit <- lm(bad ~ ., data = train_encoded[, c(predictors, "bad")])
if (vif_fit$rank == length(coefficients(vif_fit))) {
vif_values <- car::vif(vif_fit)
high_vif_predictors <- names(vif_values[vif_values > 5]) # Change
this threshold as needed
} else {
high_vif_predictors <- character(0)
}
# Split the data into training and validation sets (80/20 split)
# using stratified sampling
trainIndex <- createDataPartition(undersampled_data$bad, p = .8, list = FALSE)
X_train <- train_encoded[trainIndex, predictors]
y_train <- train_encoded[trainIndex, "bad"]
X_val <- train_encoded[-trainIndex, predictors]
y_val <- train_encoded[-trainIndex, "bad"]
# Ensure the order of columns in the OOT set is the same as in the
train set
OOT_encoded <- OOT_encoded[c(predictors, "bad")]
y_val <- as.factor(y_val)
set.seed(42)
tuned_model <- train(bad ~ .,
                     data = X_train[X_train$bad %in% c(0, 1), ],
                     method = "nnet",
                     tuneGrid = param_grid,
                     trControl = trainControl(method = "cv", number = 5))
print(confusion_matrix_OOT)
print(paste("ROC AUC (OOT):", roc_auc_OOT))
DATA oot_encoded;
SET james.oot_encoded;
RUN;
/* Split the modeling data set by an 80/20 ratio using a random seed */
PROC SURVEYSELECT DATA=train_encoded_final RATE=.8 OUTALL OUT=class2
SEED=42; RUN;
RUN;
DATA summary_table;
LENGTH Model 8;
FORMAT Area 8.5;
RUN;
SCORE OUT=scored_NN;
CODE
FILE='/OneDrive/Documents/DS_Project/neural_network.sas';
RUN;
data outstat;
set auc_out;
Model = &iteration.;
where ROCModel = 'Model';
keep Model Area;
run;
%mend;
3. Load the data: The training and OOT data sets are loaded from CSV files.
5. Avoid the Dummy Variable Trap: To avoid the dummy variable trap, drop one
dummy variable for each categorical variable (i.e., leave one level out).
6. Define predictors: The predictors for the model are defined as all columns
in the training data set except for the target variable “bad”.
7. Split the data: The training data is split into training and validation sets
using stratified sampling.
12. Model evaluation: Evaluate the model's performance on the OOT data set
using classification metrics and plot an ROC curve. This is essential for
understanding how well the model generalizes to new data.
However, building advanced models is only part of the journey. The next crucial step
is to evaluate and monitor the performance of these models effectively. No matter
how complex or sophisticated a model is, its value is determined by how well it
performs on new, unseen data. This is where performance metrics and model
monitoring come into play.
So, with your advanced modeling techniques in hand, let’s move forward to Chapter
8, where we’ll ensure that your models not only work well today but continue to
deliver reliable predictions long into the future.
● Kernel Functions: The chapter discusses how kernel functions (e.g., linear,
polynomial, radial basis function) transform input data into higher-
dimensional spaces, enabling SVMs to handle non-linear separations. The
importance of choosing the right kernel based on data characteristics is
highlighted.
3. Neural Networks
input layers, hidden layers, and output layers. It also covers how neurons
within these layers process and transmit information.
● Use Cases: The chapter suggests that SVMs are best suited for smaller data
sets with clear margins between classes, while neural networks are
preferred for tasks requiring high accuracy and large data sets.
● SVM Tuning: The chapter covers the tuning of the C parameter and the
selection of kernel functions, advising on cross-validation techniques to find
the optimal settings for a given data set.
7. Performance Metrics
Chapter 7 Quiz
Questions:
1. What is the primary goal of a Support Vector Machine (SVM) in classification
tasks?
4. What are support vectors, and why are they important in SVM?
9. What are the advantages of using neural networks for deep learning tasks?
10. How does data standardization impact the performance of SVMs and neural
networks?
11. What is the difference between overfitting and underfitting, and how does
the C parameter in SVM address these issues?
14. How can class imbalance be addressed when training SVMs and neural
networks?
16. What are some common use cases for neural networks in machine learning?
18. Why might a deep neural network require more computational resources
than an SVM?
19. What are the pros and cons of using a radial basis function (RBF) kernel in
SVM?
Model Architecture:
- SAS: Define network architecture with the ARCHITECTURE statement; customize activation functions and learning parameters
- Python: Build models with the Sequential or Functional API; customize layers, activation functions, and optimizers
- R: Construct models using keras_model_sequential or keras_model for flexible architectures; control training with the optimizer, loss, and metrics arguments

Activation Functions:
- SAS: Commonly used functions in PROC NEURAL include sigmoid and hyperbolic tangent; choose appropriate activation functions based on the problem (e.g., classification, regression)
- Python: Use ReLU, sigmoid, tanh, and more in Keras and TensorFlow; customize activation functions for each layer
- R: nnet supports logistic and softmax for output layers; keras supports a wide range of functions including ReLU, sigmoid, tanh

Hyperparameter Tuning:
- SAS: Tune SVM hyperparameters like C=, GAMMA=, and EPSILON= in PROC SVM; adjust network parameters like learning rate, momentum, and regularization in PROC NEURAL
- Python: Use GridSearchCV or RandomizedSearchCV in sklearn for SVM tuning; optimize neural networks with Keras callbacks (e.g., EarlyStopping, ReduceLROnPlateau)
- R: Use tune.svm in e1071 for SVM hyperparameter tuning; use the caret package's train function for tuning neural networks with nnet or keras

Performance Metrics:
- SAS: Evaluate SVMs using PROC SVM output metrics like accuracy, precision, and recall; assess neural networks with AUC, ROC, and misclassification rate in PROC NEURAL
- Python: Use sklearn.metrics for SVM evaluation (e.g., accuracy_score, precision_score, roc_auc_score); evaluate neural networks with Keras built-in metrics or custom metrics
- R: Evaluate SVM models with caret metrics (e.g., accuracy, sensitivity, specificity); use ROCR or pROC for ROC and AUC metrics in neural networks

Data Preparation:
- SAS: Standardize data in PROC STANDARDIZE before feeding into SVM or neural network models; handle missing values with PROC MI or PROC STDIZE
- Python: Standardize features using StandardScaler from sklearn.preprocessing; handle missing data with SimpleImputer or IterativeImputer from sklearn.impute before modeling
- R: Use scale() to standardize features in R; handle missing values with mice or na.omit()
Overview
In the journey of model development, performance metrics serve as the final arbiter
of success. These metrics are not just numbers – they are the definitive measures of
how well your model performs in the initial stages and as it encounters new data
over time. Without them, there’s no way to objectively determine if a model is
functioning correctly or if it needs adjustment. They provide the necessary feedback
to ensure that the model meets the initial objectives and continues to perform
reliably in production environments.
The importance of performance metrics lies in their ability to offer a clear view of a
model's accuracy, efficiency, and reliability. Metrics like precision, recall, AUC, and
MSE are the tools through which we assess a model's strengths and weaknesses. By
evaluating these metrics, we gain insights into areas where the model excels and
where it may require further refinement. This step is critical in the modeling
pipeline because it transforms theoretical models into actionable insights, making
them valuable assets in decision-making processes.
This chapter integrates all aspects of the modeling pipeline, highlighting the
interconnectivity between model development, evaluation, implementation, and
monitoring. By understanding how these components work together, data scientists
can ensure their models perform well initially and continue to deliver value
throughout their lifecycles. The chapter emphasizes that a robust approach to
performance measurement and ongoing monitoring underpins every stage of that lifecycle.
In data science, performance metrics are essential tools for assessing how well
models perform their intended tasks. The type of model – classification or
regression – dictates the choice of metrics due to the fundamental differences in
the nature of the predictions. Classification models predict categorical outcomes
(often binary), where the goal is to categorize data points into distinct classes.
Metrics like accuracy, precision, recall, and F1 score are used to evaluate these
models, focusing on correctly classifying instances and understanding different
types of errors.
On the other hand, regression models predict continuous outcomes, where the
objective is to estimate a value along a continuous scale. Metrics such as Mean
Squared Error (MSE), Mean Absolute Error (MAE), and R-squared are critical
because they evaluate the residuals – the differences between actual and predicted
values. This focus on residuals allows us to quantify how closely the model's
predictions match the actual values. In regression, the residuals provide a direct
measure of prediction accuracy, but this approach is not appropriate for
classification models because classification is concerned with discrete categories,
not continuous values.
The need for different performance metrics arises from these fundamental
differences: classification models require metrics that handle categorical outcomes
and binary decisions. In contrast, regression models need metrics that handle
continuous, numerical predictions based on residuals. Understanding and applying
the appropriate metrics for each type of model ensures that the evaluation is
accurate and meaningful, enabling data scientists to make informed decisions about
model performance and potential improvements.
Table 8.1 below outlines key performance metrics for both classification and
regression models, highlighting their equations, descriptions, and interpretation to
help evaluate model performance effectively.
Classification Model Performance Metrics

● Accuracy
  Equation: (TP + TN) / (TP + TN + FP + FN)
  Description: The ratio of correctly predicted instances to the total instances.
  Interpretation: High accuracy indicates that most predictions are correct.

● Precision
  Equation: TP / (TP + FP)
  Description: The ratio of true positive predictions to the total positive predictions.
  Interpretation: High precision indicates a low false positive rate.

● KS Statistic
  Equation: max(TPR − FPR)
  Description: Measures the maximum difference between the cumulative distributions of the true positive rate and the false positive rate.
  Interpretation: High KS indicates better model separation.

● Lift
  Equation: Ratio of predicted positives to actual positives over thresholds
  Description: Shows how much better the model is at predicting positives compared to random selection.
  Interpretation: Higher lift indicates better model performance at identifying positives.

● Gain
  Equation: Cumulative gain over a range of thresholds
  Description: Illustrates the model's advantage in identifying positives.
  Interpretation: Higher gain indicates better model performance.

Regression Model Performance Metrics

● MSE (Mean Squared Error)
  Equation: (1/n) ∑(yᵢ − ŷᵢ)²
  Description: Average of the squares of the differences between actual and predicted values.
  Interpretation: Lower MSE indicates better model fit.

● R-squared
  Equation: 1 − ∑(yᵢ − ŷᵢ)² / ∑(yᵢ − ȳ)²
  Description: Proportion of variance explained by the model.
  Interpretation: Higher R-squared indicates a better fit of the model to the data.

● Adjusted R-squared
  Equation: R-squared adjusted for the number of predictors.
  Description: Similar to R-squared, but penalizes for adding variables that don’t improve the model.
  Interpretation: Higher adjusted R-squared indicates a better model, considering the number of predictors.

● AIC (Akaike Information Criterion)
  Equation: 2k − 2ln(L̂)
  Description: Measure of model quality, balancing goodness of fit with model complexity.
  Interpretation: Lower AIC indicates a better model.

● BIC (Bayesian Information Criterion)
  Equation: Similar to AIC, with a stronger penalty for complexity.
  Description: A criterion for model selection among a finite set of models.
  Interpretation: Lower BIC indicates a better model, with a stronger penalty for complexity.
In data science, classification models are crucial for predicting categorical outcomes,
such as fraud detection, customer churn, or disease diagnosis. Evaluating these
models' performance requires various metrics that measure how well the model is
distinguishing between classes. This section introduces key classification metrics,
demonstrated through a practical example using a synthetic data set. By
implementing a random forest model in SAS, Python, and R, we will explore metrics
such as the confusion matrix, accuracy, precision, recall, F1 Score, AUC, ROC curve,
Gini, KS, lift and gain tables, and charts.
Program 8.1: Creating the Example Data Set Using Synthetic Data
R Programming:
library(caret)
library(randomForest)
library(pROC)
set.seed(12345)
n <- 1000
Python Programming:
np.random.seed(12345)
n = 1000
# Reducing the number of informative features and adding noise
X, y = make_classification(n_samples=n, n_features=6,
n_informative=3, n_redundant=2, n_repeated=1,
n_clusters_per_class=2, weights=[0.7,
0.3], flip_y=0.1, class_sep=0.8, random_state=12345)
'Online_Purchase_Ind',
'Trans_Amount', 'Avg_Trans_Amount'])
df['Fraud'] = y
SAS Programming:
DATA fraud_data;
ARRAY x[6];
CALL STREAMINIT(12345);
DO ID = 1 TO 1000;
CALL RANUNI(12345, w);
CALL RANUNI(12345, r);
DO i = 1 TO 4;
x[i] = (i <= 3) * (0.7 + 0.3 * r) *
RANNOR(12345);
END;
x[5] = r * x[1];
x[6] = r * x[2];
Fraud = (w > 0.7);
FICO_Score = 300 + ROUND(550 * x[1]);
Num_Credit_Cards = ROUND(9 * RANUNI(12345) +
1);
Annual_Income = ROUND(20000 + x[2] * 130000);
Online_Purchase_Ind = ROUND(RANUNI(12345));
Trans_Amount = ROUND(50 + x[3] * 4950, 2);
Avg_Trans_Amount = ROUND(Trans_Amount /
Num_Credit_Cards, 2);
OUTPUT;
END;
RUN;
To ensure consistency across all three environments, we’ve used a random seed
(set.seed(12345) in R, np.random.seed(12345) in Python, and CALL STREAMINIT(12345) in
SAS). Setting a seed ensures that the random numbers generated are reproducible, which means
that each time we run the code, we get the same data set. This reproducibility is
critical when comparing models across different programming languages, as it
directly compares performance metrics.
With our data set prepared, the next step is to build a classification model. We will
use a Random Forest algorithm, a powerful ensemble method known for its
robustness and ability to handle a large number of input variables without
overfitting. In the following sections, we will build and evaluate a Random Forest
model in SAS, Python, and R. This will include generating and interpreting key
performance metrics such as the confusion matrix, accuracy, precision, recall, F1
Score, AUC, ROC curve, Gini coefficient, KS statistic, and lift and gain tables and
charts.
These metrics will help us assess how well our model predicts fraud and enable us
to compare the performance of the Random Forest implementation across the
three programming environments.
Program 8.2 below develops a Random Forest model using the data set created
earlier, followed by generating key classification performance metrics to assess the
model's effectiveness.
Program 8.2: Creating Random Forest Model and Classification Performance Metrics
# Confusion Matrix
confusionMatrix(pred, test_data$Fraud)
# AUC, Gini, KS
roc_obj <- roc(test_data$Fraud, probs)
auc(roc_obj)
gini_coeff <- 2 * auc(roc_obj) - 1
ks_stat <- max(roc_obj$sensitivities + roc_obj$specificities - 1)
Python Programming

rf_model = RandomForestClassifier(n_estimators=100, random_state=12345)
rf_model.fit(X_train, y_train)

# Predicted classes and probabilities for the test set
y_pred = rf_model.predict(X_test)
y_prob = rf_model.predict_proba(X_test)[:, 1]

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

# Performance Metrics
print("Accuracy:", rf_model.score(X_test, y_test))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

# ROC curve and AUC (roc_curve and auc are imported from sklearn.metrics)
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
print("AUC:", roc_auc)

# Gini Coefficient
gini_coefficient = 2 * roc_auc - 1
print("Gini Coefficient:", gini_coefficient)

# KS Statistic
ks_stat = max(tpr - fpr)
print("KS Statistic:", ks_stat)

# Lift and gain by decile (df_lift holds the test-set deciles and targets,
# built earlier in the full program)
lift = df_lift.groupby('decile').apply(lambda x: np.mean(x['target']))
gain = df_lift.groupby('decile')['target'].sum().cumsum()
print("Lift:\n", lift)
print("Gain:\n", gain)
Figure 8.1 shows the output from the Python program, which demonstrates the
model's performance, particularly its ability to correctly identify fraudulent
transactions, as evidenced by the confusion matrix and various performance
metrics.
[[195 6]
[ 20 79]]
Accuracy: 0.9133333333333333
Precision: 0.9294117647058824
Recall: 0.797979797979798
F1 Score: 0.8586956521739131
AUC: 0.9320820141715664
Gini Coefficient: 0.8641640283431329
KS Statistic: 0.8139605005276648
Lift:
decile
0 0.000000
1 0.068966
2 0.103448
3 0.062500
4 0.000000
5 0.034483
6 0.346154
7 0.866667
8 0.933333
dtype: float64
Gain:
decile
0 0
1 2
2 5
3 7
4 7
5 8
6 17
7 43
8 99
Name: target, dtype: int32
especially in balanced data sets.

Precision
Poor: < 50% | Good: 50% – 75% | Very Good: > 75%
Precision below 50% indicates the model is incorrectly classifying too many negatives as positives, leading to a high rate of false positives. A precision of 50% to 75% is generally acceptable, with over 75% considered very good, particularly in cases where false positives are costly.

Recall (Sensitivity)
Poor: < 50% | Good: 50% – 75% | Very Good: > 75%
Recall below 50% suggests the model is missing too many actual positives, which could be critical depending on the application (e.g., fraud detection). A recall between 50% and 75% is generally considered good, with over 75% being very good, indicating the model is capturing the majority of positive cases.

F1 Score
Poor: < 0.6 | Good: 0.6 – 0.75 | Very Good: > 0.75
An F1 Score below 0.6 indicates a significant imbalance between precision and recall, leading to a less effective model. Scores between 0.6 and 0.75

discrimination between the positive and negative classes. Values between 0.2 and 0.4 are generally acceptable, indicating the model has some discriminatory power. A KS statistic above 0.4 indicates strong separation, often considered very good in practice.
However, this default setting might not always align with the specific objectives or
risk tolerance of the business. Adjusting the threshold can help tailor the model's performance to better meet the organization's needs, balancing precision (minimizing false positives) against recall (minimizing false negatives). For instance, a
more conservative threshold might be set higher (e.g., 0.7), which would classify
fewer events as positive, reducing false positives but potentially increasing false
negatives.
The decision to set the threshold should consider statistical optimization and
business objectives. Statistically, thresholds can be optimized based on metrics like
the F1 Score, which balances precision and recall, or by maximizing the AUC (Area
Under the Curve) of the ROC curve. From a business perspective, the threshold
might be adjusted to reflect risk tolerance, cost factors, or strategic priorities. For
example, in fraud detection, where false negatives (missed fraud) are more costly
than false positives (false alarms), the threshold might be lowered to ensure more
potential fraud cases are flagged, even if it results in more false positives.
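The effect of moving the threshold can be seen in a few lines of code. The following is a minimal Python sketch using toy labels and probabilities rather than the fraud model above (in practice, y_prob would come from rf_model.predict_proba(X_test)[:, 1]); it shows how raising the threshold increases precision but lowers recall.

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy predicted probabilities and true labels (illustrative values only)
y_prob = np.array([0.05, 0.20, 0.55, 0.60, 0.72, 0.90])
y_true = np.array([0, 0, 1, 0, 1, 1])

for threshold in [0.5, 0.7]:
    # Classify as positive only when the probability meets the threshold
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")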
The histogram in Figure 8.2 visualizes the distribution of the predicted probabilities
generated by our model. The red dashed line represents the default threshold of
0.5; the green dashed line represents a threshold of 0.7 optimized for the F1 Score;
the blue dashed line indicates a threshold of 0.39 optimized for AUC, and the orange
dashed line shows a threshold of 0.97 optimized for Precision. Figure 8.2 illustrates
the distribution of the model's predicted probabilities and the impact of setting
different thresholds.
● Threshold at 0.5 (Default Setting): The threshold of 0.5 is the default setting
for most classification models, where probabilities of 0.5 or higher are
classified as positive events. This threshold is typically used when a balance
between false positives (incorrectly predicting an event) and false negatives
(failing to predict an event) is desired. In many scenarios, a threshold of 0.5
is a starting point because it equally weighs the chances of identifying a true
positive against the risk of a false positive. However, this threshold may not
always align with specific business needs or risk tolerances, especially in
high-stakes scenarios where the cost of errors is significant.
Choosing the appropriate threshold is critical and should be aligned with the business's objectives.
In Figure 8.2, the histogram visualizing the predicted probabilities reveals a binary-
like distribution rather than the typical bell curve seen in many data sets. This binary
distribution indicates that the model strongly differentiates between two classes,
often resulting in predictions clustering near 0 or 1. Such a distribution suggests that
the model is confident in its classifications, which can be advantageous for clear
decision-making. However, it may also indicate that the model could be overfitting,
particularly if it is too aggressive in pushing predictions to extremes. This
distribution pattern is important to consider when setting thresholds, as it directly
impacts how the model will perform across different probability ranges.
The confusion matrix shows how accurate the model is in making predictions by displaying the number of correct and incorrect predictions made by the model, broken down by each class.
● True Negative (TN): The number of correct predictions where the model
predicted "No" (non-fraud) and the actual outcome was "No."
● False Positive (FP): The number of incorrect predictions where the model
predicted "Yes" (fraud) but the actual outcome was "No." This is also known
as a Type I error.
● False Negative (FN): The number of incorrect predictions where the model
predicted "No" (non-fraud) but the actual outcome was "Yes." This is known
as a Type II error.
● True Positive (TP): The number of correct predictions where the model
predicted "Yes" (fraud) and the actual outcome was "Yes."
From the confusion matrix, several key performance metrics can be derived:
1. Accuracy:
Definition:
Accuracy is the proportion of correct predictions (both true positives and
true negatives) out of the total predictions made. It is calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Interpretation:
High accuracy, such as 91.33%, suggests that the model performs well
overall in correctly predicting both fraud and non-fraud cases. However, in
imbalanced data sets like fraud detection, accuracy can be misleading
because the model might still fail to detect the minority class effectively.
Low accuracy would indicate that the model struggles with making correct
predictions in general, which could necessitate further tuning or feature
engineering.
2. Precision:
Definition:
Precision is the proportion of true positives out of all positive predictions
made by the model. It is calculated as:
Precision = TP / (TP + FP)
Interpretation:
A precision of 92.94% indicates that when the model predicts fraud, it is
correct about 93% of the time. High precision is crucial in reducing false
positives, which is especially important in contexts like fraud detection,
where false positives can lead to unnecessary investigations and potential
customer dissatisfaction. Conversely, low precision would imply that many
of the fraud predictions are incorrect, resulting in wasted resources.
3. Recall (Sensitivity):

Definition:
Recall measures the proportion of actual positive cases (fraud) the model
correctly identified. It is calculated as:
Recall = TP / (TP + FN)
Interpretation:
A recall of 79.80% indicates that the model successfully identifies around
80% of actual fraud cases. High recall is important in minimizing false
negatives and detecting as many fraud cases as possible. Low recall would
suggest that the model is missing a significant number of fraud cases, which
could lead to severe consequences if fraudulent activities go undetected.
4. F1 Score:
Definition:
The F1 Score is the harmonic mean of precision and recall, balancing the
trade-offs between precision and recall. It is calculated as:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation:
An F1 Score of 85.87% indicates a strong balance between precision and
recall, meaning the model is both accurate in its positive predictions and
effective in identifying most fraud cases. A high F1 Score is particularly
valuable in scenarios with imbalanced data, such as fraud detection, where
both precision and recall are critical. A low F1 Score would suggest that
either precision, recall, or both are suboptimal, indicating room for
improvement in the model's tuning.
5. False Positive Rate (FPR):

Definition:
FPR measures the proportion of negative instances (non-fraud) that are
incorrectly classified as positive (fraud). It is calculated as:
FPR = FP / (FP + TN)
Interpretation:
A low FPR indicates that the model rarely misclassifies legitimate
transactions as fraud, which is crucial for maintaining customer satisfaction
and reducing operational costs. A high FPR would suggest that too many
non-fraudulent transactions are being flagged as fraud, leading to excessive
false alarms and potentially eroding customer trust.
6. Specificity:

Definition:
Specificity is the proportion of actual negative cases (non-fraud) the model
correctly identified. It is calculated as:
Specificity = TN / (TN + FP)
Interpretation:
High specificity indicates that the model effectively identifies non-
fraudulent transactions, reducing the likelihood of legitimate customers
being wrongly flagged as fraudulent. Low specificity would mean that the
model frequently misclassifies non-fraudulent transactions as fraud, which
could negatively impact user experience and operational efficiency.
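To connect these formulas to the output in Figure 8.1, the short Python sketch below recomputes each metric directly from the confusion matrix counts reported there (TN = 195, FP = 6, FN = 20, TP = 79); the results match the printed values.

# Confusion matrix counts from Figure 8.1: [[195 6] [20 79]]
TN, FP, FN, TP = 195, 6, 20, 79

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 0.9133
precision = TP / (TP + FP)                            # 0.9294
recall = TP / (TP + FN)                               # 0.7980
f1 = 2 * precision * recall / (precision + recall)    # 0.8587
fpr = FP / (FP + TN)                                  # false positive rate
specificity = TN / (TN + FP)

print(accuracy, precision, recall, f1, fpr, specificity)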
AUC, Gini, and KS are three key metrics often reported together to evaluate the
performance of classification models, particularly in how well the model separates
events from non-events. These metrics provide insights into the model's ability to
distinguish between positive and negative classes, which is crucial in applications
like fraud detection, credit scoring, and other predictive tasks. They are closely
related but offer different perspectives on model performance.
● AUC (Area Under the Curve) is a measure derived from the ROC (Receiver
Operating Characteristic) curve, representing the model's ability to
differentiate between classes across various threshold values.
● Gini Coefficient is directly related to the AUC (Area Under the Curve) and is often calculated using the formula: Gini = 2 * AUC - 1. The Gini coefficient quantifies the inequality of model predictions, where a higher value indicates stronger separation between the classes.
● KS (Kolmogorov-Smirnov) Statistic measures the maximum difference between the cumulative distributions of the positive and negative classes, identifying the threshold at which the model best separates them.
These metrics are related because they all measure the same underlying concept:
the model’s ability to distinguish between the two classes correctly. However, they
do so from slightly different angles, providing a comprehensive view of the model’s
performance.
Figure 8.4: Visualizing Model Discrimination: AUC and Gini Metrics in ROC Space
● Calculation:
The AUC is the area under the ROC curve, which plots the True Positive Rate
(TPR) against the False Positive Rate (FPR) at various threshold levels. The
AUC value ranges from 0 to 1, with 0.5 indicating no discriminatory power
(equivalent to random guessing) and 1.0 representing perfect discrimination
between classes. The provided ROC curve visualizes the model's ability to
distinguish between positive and negative instances.
● Interpretation:
A higher AUC value indicates a better model. In this example, the AUC is
0.93, suggesting that the model has a 93% chance of correctly distinguishing
between a positive and a negative instance. This makes AUC a crucial metric
in evaluating how well a model is likely to perform across different
thresholds, providing insight into the overall effectiveness of the model
across the entire range of possible decision boundaries.
Gini Coefficient
● Calculation:
The Gini coefficient is calculated as Gini = 2 * AUC - 1. It ranges from -1 to 1,
where 0 indicates no discriminatory power, 1 represents perfect separation,
and negative values indicate a model that is worse than random guessing.
● Interpretation:
The Gini coefficient provides a measure of the inequality in model
predictions. A higher Gini coefficient indicates that the model effectively
separates the classes, with values close to 1 signifying strong performance.
For example, a Gini of 0.86 (derived from an AUC of 0.93) would be
interpreted as a strong model in terms of its discriminatory power.
KS Statistic
● Calculation:
The KS Statistic is calculated as the maximum difference between the
cumulative distribution functions of the positive class and the negative
class. It identifies the point where the model most effectively distinguishes
between the two classes.
● Interpretation:
The KS Statistic is useful in identifying the threshold at which the model
most effectively separates the positive and negative classes. A high KS value,
such as 0.81, indicates the model has a strong discriminatory ability at a
particular threshold. This metric is particularly useful in determining where
to set the cutoff for decision-making in binary classification problems.
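The relationship between these three metrics is easy to see in code. The following is a minimal Python sketch on toy labels and scores (not the book's fraud data), using scikit-learn's roc_curve; one common way to compute KS is the maximum gap between the TPR and FPR curves.

import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy actual labels and model scores (illustrative values only)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.95])

fpr, tpr, _ = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)      # area under the ROC curve
gini = 2 * roc_auc - 1       # Gini = 2 * AUC - 1
ks = np.max(tpr - fpr)       # maximum separation between the two curves

print(f"AUC={roc_auc:.3f}, Gini={gini:.3f}, KS={ks:.3f}")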
Lift Table
Construction
● Decile: The population is divided into ten equal parts based on predicted
probabilities.
● Number of Fraud Cases: Counts the actual events (fraud) in each decile.
● Average Score: The average predicted probability score for each decile.
● Lift: The lift value compares the model's effectiveness to random guessing.
The lift table helps make informed decisions about where to set the probability
threshold for classification. By analyzing the cumulative metrics, you can determine
the trade-offs between capturing a higher percentage of events and minimizing
false positives. This allows you to tailor the model to meet specific business goals,
such as maximizing fraud detection while minimizing unnecessary investigations.
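To make the mechanics concrete, here is a minimal Python sketch of how deciles and the lift and gain columns can be built; the use of pd.qcut, the simulated data, and the column names are illustrative assumptions, not necessarily how Program 8.3 constructs its lift table.

import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
score = rng.random(1000)                           # model's predicted probability
target = (rng.random(1000) < score).astype(int)    # higher score -> more likely an event
df = pd.DataFrame({"target": target, "score": score})

# Decile 1 holds the highest predicted probabilities
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                       10, labels=False) + 1

lift_table = df.groupby("decile").agg(
    Number_of_Cases=("target", "size"),
    Number_of_Fraud_Cases=("target", "sum"),
    Average_Score=("score", "mean"),
).reset_index()

overall_rate = df["target"].mean()
lift_table["Lift"] = (lift_table["Number_of_Fraud_Cases"]
                      / lift_table["Number_of_Cases"]) / overall_rate
lift_table["Gain"] = (lift_table["Number_of_Fraud_Cases"].cumsum()
                      / lift_table["Number_of_Fraud_Cases"].sum() * 100)
print(lift_table)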
For example, if the goal is to capture as many fraud cases as possible, you might
choose a threshold that aligns with a decile showing high cumulative fraud cases
and a significant lift, indicating that the model performs better than random
guessing. Conversely, if the cost of false positives is high, you may opt for a more
conservative threshold to reduce the number of non-fraud cases identified as fraud.
Program 8.3 illustrates how to create a lift table and corresponding charts, providing
a visual interpretation of model performance.
Python Programming

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# lift_table is built earlier in the full program, with one row per decile
# and the columns 'Number_of_Fraud_Cases' and 'Number_of_Cases'
lift_table['Cumulative_Fraud_Cases'] = lift_table['Number_of_Fraud_Cases'].cumsum()
lift_table['Cumulative_Cases'] = lift_table['Number_of_Cases'].cumsum()
lift_table['%_of_Events'] = lift_table['Number_of_Fraud_Cases'] / lift_table['Number_of_Fraud_Cases'].sum() * 100
lift_table['Gain'] = lift_table['Cumulative_Fraud_Cases'] / lift_table['Number_of_Fraud_Cases'].sum() * 100
lift_table['Lift'] = lift_table['Gain'] / (lift_table['Cumulative_Cases'] / lift_table['Number_of_Cases'].sum() * 100)

plt.show()
SAS Programming

DATA LiftTable;
    SET LiftTable;
    BY Decile;
    RETAIN Cumulative_Fraud_Cases Cumulative_Cases Gain Lift
           Cumulative_Non_Events Cum_Non_Events_Rate KS_Statistic;
    IF _N_ = 1 THEN DO;
        Cumulative_Fraud_Cases = 0;
        Cumulative_Cases = 0;
    END;
    Cumulative_Fraud_Cases + Number_of_Fraud_Cases;
    Cumulative_Cases + Number_of_Cases;
    Gain = Cumulative_Fraud_Cases / SUM(Number_of_Fraud_Cases) * 100;
    Lift = Gain / (Cumulative_Cases / SUM(Number_of_Cases) * 100);
    Cumulative_Non_Events = Cumulative_Cases - Cumulative_Fraud_Cases;
    Cum_Non_Events_Rate = Cumulative_Non_Events /
        (SUM(Number_of_Cases) - SUM(Number_of_Fraud_Cases));
    KS_Statistic = ABS(Gain - Cum_Non_Events_Rate * 100);
RUN;
Lift and Gain Charts are essential tools for assessing the performance of
classification models, particularly in scenarios where the correct identification of
positive cases (such as fraud detection or marketing responses) is crucial. These
charts help quantify how much better a model performs compared to random
guessing and allow us to evaluate the model's effectiveness in identifying high-risk
or high-value cases.
Lift Chart
● Construction: The Lift Chart is constructed by plotting the lift value on the Y
axis against the deciles on the X axis. Each decile represents a tenth of the
population, ordered by the predicted probabilities from highest to lowest.
The lift is calculated as the ratio of the target rate in each decile to the
target rate in the entire population.
● Interpretation: The Lift Chart visualizes how much better the model is at
identifying positive cases compared to random selection. A steeper slope
indicates better performance, showing that the model effectively
concentrates positive cases in the top deciles. The flatter the curve, the less
effective the model is at differentiating between positive and negative
cases.
Gain Chart
● Construction: The Gain Chart plots the cumulative percentage of positive cases captured (the gain) on the Y axis against the cumulative percentage of the population, ordered from the highest to the lowest predicted probability, on the X axis.
● Interpretation: The closer the gain curve rises toward the top-left corner, the more effectively the model concentrates positive cases in the earliest deciles; a straight diagonal line represents random selection.
These charts are derived from the Lift Table, which provides detailed information on
the model’s performance at each decile. From the table, metrics such as cumulative
gain and lift are calculated and visualized in the charts. The Lift Table itself includes
columns like the number of cases, number of positive cases (fraud cases),
cumulative responses, cumulative non-events, average score, gain, and lift.
● Threshold Setting: One practical application of the Lift Table and the
corresponding charts is in determining the optimal threshold for
classification. Instead of defaulting to a 0.5 probability threshold, you can
choose a threshold that balances precision and recall according to your
specific business needs. For example, if you aim to capture a certain
percentage of fraud cases while minimizing false positives, you could use
the Lift Table to identify the decile where the trade-off between capturing
fraud cases and the number of cases to investigate is most favorable.
By understanding these charts and how to use them effectively, you can make more
informed decisions about model deployment and resource allocation, ensuring that
your classification models provide maximum value.
In data science, regression models are essential for predicting continuous outcomes,
such as house prices, customer lifetime value, or sales forecasts. To evaluate these
models, various metrics measure how closely the model's predictions align with
actual values. This section introduces key regression metrics, demonstrated through a practical example using a synthetic housing data set.
This section will walk through the creation of this data set, ensuring consistency
across SAS, Python, and R. We will also introduce the concept of regression metrics
and explain why these metrics are crucial for understanding and improving model
performance.
Program 8.4 creates a synthetic data set and evaluates the model using various
regression performance metrics across different programming environments.
R Programming

# Synthetic Data
set.seed(42)
n <- 100
Bedrooms <- sample(1:5, n, replace = TRUE)
Bathrooms <- sample(1:3, n, replace = TRUE)
SquareFootage <- sample(600:3500, n, replace = TRUE)
Age <- sample(0:100, n, replace = TRUE)
DistanceFromCity <- sample(1:50, n, replace = TRUE)
Price <- Bedrooms * 50000 + Bathrooms * 30000 + SquareFootage
* 100 + Age * (-2000) + DistanceFromCity * (-3000) + rnorm(n,
mean = 0, sd = 10000)
# Predictions
y_pred <- predict(model, newdata = test_data)
y_test <- test_data$Price
# Calculating Metrics
mse <- mean((y_test - y_pred)^2)
mae <- mean(abs(y_test - y_pred))
rmse <- sqrt(mse)
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared
# Residuals
residuals <- y_test - y_pred
# Residual Plot
ggplot(data.frame(y_pred, residuals), aes(x = y_pred, y =
residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype =
"dashed") +
labs(title = "Residual Plot", x = "Predicted Values", y =
"Residuals")
# QQ Plot
qqnorm(residuals)
qqline(residuals, col = "red")
title("QQ Plot")
# Histogram of Residuals
Python Programming
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,
mean_absolute_error
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# Predictions
y_pred = model.predict(X_test)
# Calculating Metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
r_squared = model.score(X_test, y_test)

# aic and bic are taken from a statsmodels OLS fit of the same data
# (imported above as sm), not shown on this page

# Adjusted R-squared
n = len(y_test)
p = X_test.shape[1]
adj_r_squared = 1 - (1-r_squared)*(n-1)/(n-p-1)
print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R-Squared: {r_squared}")
print(f"Adjusted R-Squared: {adj_r_squared}")
print(f"AIC: {aic}")
print(f"BIC: {bic}")
# Residuals
residuals = y_test - y_pred
# Residual Plot
plt.figure(figsize=(8, 5))
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
# QQ Plot
sm.qqplot(residuals, line='s')
plt.title('QQ Plot')
plt.show()
# Histogram of Residuals
plt.figure(figsize=(8, 5))
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Histogram of Residuals')
plt.show()
SAS Programming

/* RESIDUAL PLOT */
PROC SGPLOT DATA=pred_test;
SCATTER X=Predicted Y=Residual;
REFLINE 0 / AXIS=y LINEATTRS=(COLOR=red
PATTERN=shortdash);
TITLE "Residual Plot";
RUN;
/* QQ PLOT */
PROC UNIVARIATE DATA=pred_test;
QQPLOT Residual / NORMAL(MU=EST SIGMA=EST) SQUARE;
TITLE "QQ Plot";
RUN;
/* HISTOGRAM OF RESIDUALS */
PROC SGPLOT DATA=pred_test;
HISTOGRAM Residual;
DENSITY Residual / TYPE=normal;
TITLE "Histogram of Residuals";
RUN;
Figure 8.7 shows the output from the regression model, including MSE, MAE, RMSE,
and R-squared values, which provide insights into model accuracy.
MAE: 6887.38035920041
RMSE: 9115.947260457922
R-Squared: 0.9964657894636241
AIC: 1455.1364031128012
BIC: 1468.6273745650974
A residual is the difference between an actual value and the value predicted by the model:

Residual = yᵢ − ŷᵢ
● Model Fit: Residuals help to assess how well the model fits the data. Ideally,
residuals should be randomly distributed around zero, indicating that the
model captures the underlying data patterns without bias.
● Outlier Detection: Large residuals may indicate outliers or areas where the
model does not perform well. Investigating these residuals can provide
insights into whether the model needs to be improved or whether certain
data points are problematic.
Visualizing Residuals
1. Residual Plot: Plots residuals against predicted values; a random scatter around zero suggests a good fit, while visible patterns indicate bias or non-linearity.
2. QQ Plot: Compares the distribution of residuals to a theoretical normal distribution; points falling along the reference line indicate approximately normal residuals.
3. Histogram of Residuals: Shows the overall distribution of residuals; a roughly symmetric, bell-shaped histogram centered at zero supports the normality assumption.
These residual plots are crucial diagnostic tools for assessing the validity of a
regression model. By carefully analyzing these visualizations, you can identify
potential issues with model fit, validate assumptions, and take corrective actions to
improve model performance.
Once residuals have been analyzed and understood, performance metrics like MSE,
MAE, RMSE, and others can provide a quantitative evaluation of model
performance. These metrics are directly related to residuals, as they aggregate the
residuals in different ways to summarize the model’s accuracy and reliability.
Mean Squared Error (MSE)

● Calculation:

MSE = (1/n) × Σ (yᵢ − ŷᵢ)²

● Interpretation:
Mean Absolute Error (MAE)

● Calculation:

MAE = (1/n) × Σ |yᵢ − ŷᵢ|
o Unlike MSE, MAE does not square the errors, so each error
contributes to the overall metric in a linear fashion.
● Interpretation:
Root Mean Squared Error (RMSE)

● Calculation:

RMSE = √MSE
o RMSE provides a measure of error with the same units as the target
variable.
● Interpretation:
R-squared (R²)
● Calculation:

R² = 1 − (sum of squared residuals) / (total sum of squares)
● Interpretation:
Adjusted R-squared
● Calculation:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1), where n is the number of observations and p is the number of predictors.

● Interpretation: If the adjusted R-squared is substantially lower than the R-squared, it indicates that some predictors may not contribute meaningful information to the model.
AIC (Akaike Information Criterion)

● Calculation:

AIC = 2k − 2 ln(L̂), where k is the number of model parameters and L̂ is the maximized likelihood.
● Interpretation:
BIC (Bayesian Information Criterion)

● Calculation:

o BIC is similar to AIC but includes a stronger penalty for models with more parameters:

BIC = k ln(n) − 2 ln(L̂), where n is the number of observations.
● Interpretation:
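As a concrete illustration of the formulas above, the following minimal Python sketch computes each metric from a small set of toy actuals and predictions. The values, the assumption of p = 3 predictors, and the Gaussian-likelihood form used for AIC and BIC are illustrative; parameter-counting conventions differ slightly between packages.

import numpy as np

y_true = np.array([210.0, 340.0, 155.0, 280.0, 305.0, 190.0, 260.0, 330.0])
y_pred = np.array([200.0, 350.0, 160.0, 270.0, 300.0, 200.0, 255.0, 320.0])
n, p = len(y_true), 3          # p = assumed number of predictors

residuals = y_true - y_pred
sse = np.sum(residuals ** 2)   # sum of squared errors
mse = sse / n
mae = np.mean(np.abs(residuals))
rmse = np.sqrt(mse)

sst = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# AIC/BIC under a Gaussian-error likelihood; k counts the p coefficients plus the intercept
k = p + 1
log_lik = -n / 2 * (np.log(2 * np.pi) + np.log(sse / n) + 1)
aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik

print(mse, mae, rmse, r2, adj_r2, aic, bic)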
In regression modeling, performance metrics help us evaluate how well the model's
predictions align with the actual values. Regression metrics are more context-
dependent than classification metrics, where fixed thresholds are often used. The
effectiveness of a regression model is closely tied to the range and variability of the
target variable. As such, what might be considered a "good" performance in one
scenario could be suboptimal in another, depending on the specific application and
data set characteristics.
Table 8.4 provides general rules of thumb for interpreting regression metrics,
offering guidance on what constitutes poor, good, and very good model
performance in different contexts.
Performance Metric | Poor Threshold | Good Threshold | Very Good Threshold | Explanation

Mean Squared Error (MSE)
Poor: High value relative to the range of the target variable | Good: Moderate value relative to the range of the target variable | Very Good: Low value relative to the range of the target variable
A high MSE indicates large average squared differences between predicted and actual values, which can be due to poor model fit or significant outliers. A lower MSE suggests better model accuracy.

Mean Absolute Error (MAE)
Poor: High value relative to the range of the target variable | Good: Moderate value relative to the range of the target variable | Very Good: Low value relative to the range of the target variable
MAE is a more interpretable measure than MSE, representing
● MSE, MAE, RMSE: These metrics directly measure the prediction error. In
practice, the specific threshold depends on the context of the model and
the range of the target variable. Generally, lower values are better, but
context matters – a "low" error in one application could be unacceptable in
another.
● AIC and BIC: These are relative metrics used primarily for model comparison
rather than absolute evaluation. Lower AIC or BIC values suggest a model
that better balances fit with complexity, with BIC imposing a stricter penalty
for additional predictors.
These rules of thumb are context-dependent and should be applied considering the specific application and data set characteristics. In high-stakes environments, the tolerance for error is typically much lower, and stricter thresholds should be applied.
Performance metrics are fundamentally tied to the nature of the target variable in
both classification and regression models. Classification models are designed to
predict categorical outcomes, such as determining whether a transaction is
fraudulent, while regression models predict continuous outcomes, like estimating
house prices. These core differences influence the types of metrics we use to
evaluate model performance.
Table 8.5 provides a detailed comparison of the key performance metrics used for
classification versus regression models, highlighting how each set of metrics aligns
with the unique goals of these model types.
After developing and evaluating a model, the next critical step is implementing it in
a production environment. Implementation refers to the process of taking a model
from the development stage, where it is built and tested, to an operational setting,
where it can be used to make real-time predictions or decisions. This process can
vary significantly depending on whether the model is being deployed within the
same environment where it was developed or in a different environment.
Understanding these differences, as well as the tools and strategies available for
implementation, is crucial for ensuring that your model performs effectively in the
real world.
environments are essential for the initial stages of model creation and refinement
before moving to production deployment.
Key Differences:
Model Packages
Model packaging is the process of encapsulating a trained model, along with its
associated dependencies and preprocessing steps, into a format that can be easily
deployed in a production environment. This ensures that the model can be applied
to new, live data in a consistent and reliable manner. The approach to packaging
varies depending on the programming language used:
● In SAS:
o SAS also offers procedures like PROC SCORE or PROC PLM, which
can be used to apply the model to new data, ensuring that the
entire workflow – from preprocessing to scoring – is accurately
replicated in the production environment.
● In Python:
o Models are often packaged using libraries like joblib or pickle, which
serialize the trained model object into a file. This file can be loaded
later to score new data. The package typically includes not just the
model itself but also any custom preprocessing pipelines created
using scikit-learn or other libraries. These pipelines may include
steps like scaling, encoding categorical variables, and handling
missing data, ensuring consistency in the model's application to new
data sets.
● In R:
o Models are commonly saved with saveRDS() (or save()), producing an .rds file that can be loaded with readRDS() in the production environment. The saved object should be accompanied by any preprocessing objects (for example, a caret preProcess recipe), so that new data is transformed in the same way as the training data before scoring.
Why Preprocessing Steps Are Included: Including preprocessing steps within the
model package is crucial because the model's performance depends on how the
input data is structured and transformed. Inconsistent preprocessing between the
training and production phases can lead to incorrect predictions or a significant drop
in model performance. By encapsulating these steps within the model package, data
scientists ensure that the model will process new data in exactly the same way as
the training data, preserving the integrity and accuracy of the model's predictions.
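For the Python case described above, the packaging idea can be sketched in a few lines: bundle the preprocessing and the model into a single scikit-learn Pipeline and serialize it with joblib. The data, file name, and pipeline steps here are illustrative, not the book's production code.

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=6, random_state=12345)

# Preprocessing and model travel together, so production scoring applies
# exactly the same transformations as training
model_package = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=12345)),
])
model_package.fit(X, y)

joblib.dump(model_package, "fraud_model.joblib")   # the deployable package

# Later, in the production environment:
loaded = joblib.load("fraud_model.joblib")
print(loaded.predict_proba(X[:5])[:, 1])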
Key Considerations
● Advantages: Docker ensures that the model runs the same way regardless
of the environment, eliminating issues related to environment
discrepancies. It also allows for easy scaling and deployment in cloud
environments.
● Use Cases: Kubernetes is ideal for enterprises that require robust, scalable
solutions for deploying machine learning models in production, particularly
in cloud or hybrid environments.
Model monitoring is a critical phase in the lifecycle of any predictive model. Once a
model is deployed into a production environment, continuous monitoring is
essential to ensure that the model maintains its performance and reliability over
time. This section will explore what model monitoring is, why it is crucial, and the
various components involved in effectively tracking and evaluating a model’s
performance.
The data being monitored includes both the input data (features) and the output
data (predictions) that the model generates. This monitoring ensures that the input
data remains consistent with the data used during model development and that the
predictions align with expected outcomes.
Model monitoring typically involves several key components that work together to
provide a comprehensive view of model performance. These components include
dashboards, model monitoring reports, threshold evaluation, and stability indices
like PSI and VSI.
Dashboards
Dashboards play a vital role in model monitoring by offering a real-time view of key
performance indicators (KPIs) and metrics. These visual tools allow stakeholders to
easily track trends, identify anomalies, and gain insights into the model’s
performance over time. Dashboards are particularly useful for providing a high-level
overview and enabling quick decision-making when issues arise.
Model monitoring reports are detailed documents that provide an in-depth analysis
of the model’s performance. These reports typically include various performance
metrics such as accuracy, precision, recall, MSE, MAE, and more, depending on the
model type. Reports also highlight any significant deviations from expected
performance and may include recommendations for corrective actions. Regularly
reviewing these reports helps ensure that models continue to perform as expected
in the production environment.
Table 8.6 below shows how threshold ranges can be defined for the AUC, KS, and
Gini metrics. These thresholds are based on the percentage deviation from a base
metric, representing the model's performance during development.
KS (Kolmogorov-Smirnov)
Measures the degree of separation between the distributions of the positive and negative classes; reflects the model's discriminatory power.
Thresholds: Green: deviation < 10% of base value | Yellow: deviation 10% to 15% of base value | Red: deviation ≥ 25% of base value
Monitoring frequency: Monthly
● Green: If the metric falls within the green range, it suggests that the model
is performing within acceptable limits and no immediate action is required.
For example, if the AUC drops by less than 10% from the base value, the
model's predictive accuracy is still considered robust.
● Yellow: Metrics in the yellow range indicate that the model's performance is
beginning to degrade. This range acts as a warning signal that the model
may need to be recalibrated or that the data being used in the production
environment is shifting. For instance, if the KS statistic shows a 10% to 15%
drop, it may be time to investigate potential issues before they become
critical.
● Red: The red range signals a significant drop in performance, typically 25%
or more from the base metric. This indicates that the model is no longer
reliable and requires immediate intervention, such as retraining with new
data, adjusting the thresholds, or even redeveloping the model.
Monitoring Frequency
In the table, we've indicated that these metrics should be monitored monthly.
However, the frequency can vary depending on the business context, the model's
importance, and the volatility of the data. More critical models may require weekly
or even daily monitoring, while others may be checked less frequently. The key is to
ensure that the monitoring frequency aligns with the business needs and the
model's usage.
Monitoring the stability of a model's input data and predictions over time is crucial
for ensuring its continued accuracy and reliability. Two key metrics used for this
purpose are the Population Stability Index (PSI) and the Variance Stability Index
(VSI).
● PSI (Population Stability Index):
o Interpretation: Measures how much the distribution of the model's inputs or scores observed in production has shifted away from the distribution seen during development; larger values indicate greater population drift.
● VSI (Variance Stability Index):
o Interpretation: Measures how much the variance of the model's predictions (or of individual input variables) has changed over time relative to the development baseline.
To effectively monitor these metrics, it’s useful to categorize them into thresholds
that signal different levels of concern. These thresholds are typically broken down
into three categories: Green, Yellow, and Red.
● Green Range: Indicates that the model's input features and predictions have
remained stable. No immediate action is needed, and the model is likely
performing as expected.
● Yellow Range: Signals that there has been some change in the population or
variance of predictions. The model may still be performing adequately, but
it should be monitored more closely, and adjustments may be necessary in
the near future.
● Red Range: Indicates a significant shift in the population or in the variance of predictions. The model may no longer be reliable, and immediate attention, such as recalibration or retraining on more recent data, is warranted.
Monitoring PSI and VSI regularly allows you to detect early signs of model drift,
enabling proactive adjustments to maintain model performance and reliability over
time.
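One common formulation of PSI sums (actual% − expected%) × ln(actual% / expected%) over score bins defined on the development sample. The sketch below is a minimal Python illustration on simulated score distributions; the binning scheme, the small flooring constant, and the simulated data are assumptions, not the book's implementation.

import numpy as np

def psi(expected, actual, n_bins=10):
    # Bin edges come from the development (expected) distribution
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    # Stretch the outer edges so production values outside the development range still land in a bin
    edges[0] = min(edges[0], actual.min()) - 1e-9
    edges[-1] = max(edges[-1], actual.max()) + 1e-9
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the percentages to avoid division by zero and log(0)
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(12345)
dev_scores = rng.beta(2, 5, size=5000)    # development-time score distribution
prod_scores = rng.beta(2, 4, size=5000)   # slightly shifted production distribution
print(f"PSI = {psi(dev_scores, prod_scores):.3f}")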
In this chapter, we focused on the critical aspects of evaluating and monitoring your
predictive models. By implementing various performance metrics such as AUC,
precision, recall, PSI, and VSI, you’ve learned how to assess the effectiveness and
stability of your models. Additionally, we discussed the importance of continuous
model monitoring to ensure that your models remain reliable and accurate over
time, even as data and conditions change.
Now that you understand how to evaluate and maintain your models, the next step
is to explore how to integrate these practices across different programming
environments. In many real-world projects, you may find yourself working with
multiple languages, such as Python, R, and SAS. Each language has its own strengths,
and being able to leverage the best features of each can significantly enhance your
data science workflow.
In Chapter 9, we will delve into the concept of language fusion, where you’ll learn
how to integrate Python, R, and SAS in a single project. We’ll explore how to
seamlessly transition between these languages, share data, and combine their
unique capabilities to create a more flexible and powerful modeling environment.
By mastering language fusion, you’ll be equipped to tackle complex projects that
require the strengths of multiple programming environments.
So, with your knowledge of performance metrics and model monitoring solidified,
let’s move on to Chapter 9 and discover how to harness the power of multiple
languages to take your data science projects to the next level.
o Overview: This chapter delves into the essential tools used to assess
and ensure the accuracy, reliability, and generalizability of
predictive models. It covers both classification and regression
models, discussing how their respective performance metrics are
tailored to the nature of the target variable. Understanding these
metrics is crucial for evaluating model effectiveness, diagnosing
issues, and making informed decisions.
2. Classification Metrics
3. Regression Metrics
4. Model Implementation
5. Model Monitoring
6. Conclusion
Chapter 8 Quiz:
5. How is the AUC (Area Under the Curve) calculated, and what does it
represent?
8. How does the Mean Squared Error (MSE) differ from Mean Absolute Error
(MAE) in regression analysis?
11. What are AIC and BIC, and how are they used to assess model complexity in
regression models?
12. Why is it important to monitor a model after it has been deployed into
production?
14. What are Docker and Kubernetes, and how do they assist in model
deployment?
15. What is PSI (Population Stability Index), and why is it critical in model
monitoring?
16. How does VSI (Variance Stability Index) contribute to model stability
monitoring?
17. What does a yellow range in threshold analysis indicate about a model’s
performance?
18. How can the lift table help in setting thresholds for classification models?
19. Why is it necessary to periodically reassess the thresholds set for model
performance metrics?
20. Explain how dashboards and monitoring reports can be used to track and
evaluate model performance over time.
BIC (Bayesian Information Criterion) | SAS: available in PROC REG regression output for model comparison | Python: model.bic in the statsmodels package | R: BIC(lm_model)
Category: Implementation
Overview
In data science and analytics, practitioners often find themselves navigating a multitude of programming environments. It's common for data scientists to work
with the tools provided by their organizations, and these tools may not always align
with their expertise. For instance, you might be a seasoned SAS expert but find
yourself in an environment that primarily relies on RStudio, or vice versa. The
question arises: Are you confined to using only the programming languages native
to a particular environment? The answer is a resounding no.
Consider this scenario: you've honed your skills in Python and SAS, yet your new
data science position requires you to work in RStudio. Do you need to scramble to
learn R from scratch? Not necessarily. This chapter will show you how to implement
your Python or SAS code directly within RStudio. Conversely, if your company's
analytical environment is SAS-centric, and you need to incorporate some R magic,
we've got you covered there as well.
Let's embark on a journey that breaks down the barriers between programming
languages and analytical environments. Discover how to seamlessly integrate
Python, R, and SAS into different platforms. By the end of this chapter, you'll
possess a valuable skill set that enables you to harness the full potential of your
preferred programming languages, regardless of the environment you find yourself
in.
Welcome to the Python Spyder IDE, where versatility in data science reigns
supreme. In the world of analytics, professionals often face the challenge of working
in environments that demand expertise in multiple programming languages. In this
section, we'll explore how Python Spyder allows you to seamlessly integrate both
SAS and R code into your Python workflow. This integration empowers you to
leverage the strengths of different languages, transcending the constraints of any
single ecosystem.
Python Spyder is not confined to Python alone. It serves as a bridge that enables
data scientists to harness R's statistical prowess and the data management
capabilities of SAS, all within a Python-centric environment.
As we delve into the intricacies of incorporating SAS and R code, you'll discover the
practical advantages of multi-language integration. By the end of this section, you'll
be well-equipped to navigate the dynamic landscape of data science, where
adaptability and versatility are your greatest assets.
In Python Spyder, you can seamlessly integrate R code into your data analysis
workflow using the “rpy2” library. This allows you to harness the power of both R
and Python within a single environment. Below are the steps to execute R code in
Python Spyder:
Before you begin, ensure that you have the rpy2 package installed. If it's not already installed, you can do so via pip, the Python package manager, with the following command:

pip install rpy2
Open Python Spyder and create a new Python script (a .py file).
At the beginning of your Python script, import the rpy2 package like this:

import rpy2.robjects as robjects
Define your R code as a string within your Python script and run it with robjects.r(). For example:

r_code = """
result <- sum(1:10)
cat("The sum of numbers 1 to 10 is", result, "\n")
"""
robjects.r(r_code)
Save your Python script and execute it within Python Spyder. The embedded R code
will run seamlessly, and the result will be displayed in the Python console.
Using R Libraries
If your R code requires specific R libraries like caret, you can load them within your
R code block:
r_code = """
library(caret)
# ... R code that uses the caret package goes here ...
"""
robjects.r(r_code)
This approach allows you to leverage R's capabilities, including its libraries,
alongside Python in Python Spyder. It's particularly useful when you need to utilize
R's extensive data analysis and statistical packages in your data science projects.
Subprocess Module
If you cannot install the rpy2 library to run R code within Python, there are
alternative ways to execute R code from Python. One popular option is to use the
subprocess module in Python to run R scripts externally. Here's a step-by-step
guide on how to do this:
First, create an R script (e.g., my_r_script.R) that contains the R code you want
to execute. Save this script in the same directory as your Python script or provide
the full path to it. Example R script (my_r_script.R):
# Example R code
result <- sum(1:10)
cat("The sum of numbers 1 to 10 is", result, "\n")
Now, you can use Python's subprocess module to run the R script. Here's an
example Python script:
import subprocess
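The rest of the example script is not shown on this page. A minimal sketch, assuming Rscript is available on your PATH and my_r_script.R sits in the working directory, could look like this:

import subprocess

# Run the R script with Rscript and capture its console output
result = subprocess.run(
    ["Rscript", "my_r_script.R"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)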
Execute your Python script. It will call the R script and print the output to the
console. In this example, it calculates the sum of numbers 1 to 10 using R and prints
the result.
Using the rpy2 library or the subprocess module are two common ways to run R
code within Python. However, there is another approach called "Jupyter Notebooks"
that provides an interactive environment for combining Python and R code
seamlessly. Here's how to do it:
2. Navigate to the directory where you want to create your Jupyter Notebook.
jupyter notebook
This will open a web browser showing the Jupyter Notebook interface.
1. Click the "New" button in the top-right corner and select "Python 3" to
create a new Python notebook.
Inside the notebook, you can write both Python and R code in separate cells. To add
a new cell, click the "+" button in the toolbar. To specify that you're writing R code in a cell, first load the rpy2 extension by running %load_ext rpy2.ipython in one cell, and then use the `%%R` magic command at the beginning of any cell that contains R code. Example:
%%R
# This is R code
result <- sum(1:10)
cat("The sum of numbers 1 to 10 in R is", result, "\n")
To run a cell, select it and click the "Run" button in the toolbar or press Shift+Enter.
The R code's output will appear directly in the notebook.
Here are three different methods to run SAS code within Python's Spyder IDE, along
with step-by-step instructions for each method:
Running SAS code in Python's Spyder IDE can be accomplished through various
methods. One straightforward approach is to utilize the Python subprocess
module. This method enables you to run SAS code as an external process directly
from your Python script. You'll need to specify the path to your SAS executable and
can pass your SAS code as a string to be executed. It's a versatile method that offers
control over your SAS environment while seamlessly integrating it into your Python
workflow. In this section, we'll walk you through the steps to set up and execute SAS
code using the subprocess module within Python's Spyder IDE.
Ensure that SAS is installed on your system. You'll need to know the path to the SAS
executable (e.g., sas.exe on Windows).
Use the subprocess module to run the SAS code in your Python script. Here's an
example:
import subprocess
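The example code itself is not reproduced on this page. A minimal sketch, assuming a Windows installation where sas.exe lives at the path shown (adjust it for your system) and a SAS program saved as my_sas_program.sas, might look like this:

import subprocess

sas_exe = r"C:\Program Files\SASHome\SASFoundation\9.4\sas.exe"   # adjust to your SAS installation
sas_program = "my_sas_program.sas"                                # your SAS program file

# Run SAS in batch mode; the log and listing files are written alongside the program
subprocess.run([sas_exe, "-SYSIN", sas_program], check=True)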
Execute your Python script. It will run the SAS code and display the output in the
SAS console.
Another effective way to run SAS code within Python's Spyder IDE is by harnessing
the power of Jupyter Notebook with the SAS Kernel. Jupyter Notebooks are
renowned for their interactivity and code-sharing capabilities. Installing the SAS
Kernel allows you to create a dedicated SAS Jupyter Notebook and execute SAS code
cells alongside your Python code. This method is ideal if you prefer a more
interactive and document-driven approach to working with SAS within the Python
ecosystem. In the following section, we'll guide you through the process of installing
the SAS Kernel and using it within Jupyter Notebooks.
pip install sas_kernel

jupyter notebook
In your new SAS Jupyter Notebook, you can write and execute SAS code cells.
SASPy is a Python module that enables you to connect to and interact with SAS from
Python. To use SASPy, follow these steps:
This will generate a configuration file (sascfg_personal.py) that you can edit to
specify your SAS environment settings.
In your Python script or Jupyter Notebook, import SASPy and use it to execute SAS
code. Here's an example:
import saspy

sas = saspy.SASsession(cfgname='default')
sas_code = """
data work.example;
input x y;
datalines;
1 2
3 4
5 6
;
run;
"""
sas.submit(sas_code)
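As a follow-up, you can also pull SAS results back into Python. A minimal sketch, assuming the saspy session (sas) and the WORK.EXAMPLE table created above:

# Read the SAS table created above into a pandas DataFrame
example_df = sas.sasdata("example", libref="work").to_df()
print(example_df.head())

# Close the SAS session when you are done
sas.endsas()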
These methods provide various ways to integrate SAS code into the Python Spyder
environment, depending on your preferences and requirements.
RStudio goes beyond being a specialized IDE for a single programming language. It
serves as a dynamic hub that unlocks the potential of cross-language synergy.
Whether you're well-versed in Python, proficient in SAS, or a seasoned R enthusiast,
RStudio offers a welcoming space to unite these languages, enabling you to harness
their unique strengths effectively.
RStudio, renowned for its prowess in the R programming world, is not just limited to
R. In this section, we explore the diverse world of data science by bringing Python
into the mix. RStudio's flexibility extends to Python, allowing data scientists to
integrate Python code into their workflow seamlessly. Whether you're a Python
enthusiast or looking to combine the strengths of both Python and R, RStudio
empowers you to do so effortlessly.
As we journey through this section, you'll discover the art of implementing Python
in RStudio. This integration opens doors to a wide array of data science libraries,
tools, and capabilities. From data analysis to machine learning and beyond, you'll
find that RStudio is your canvas for creating data-driven masterpieces. By the end of
this section, you'll be equipped to harness the full potential of Python within the
RStudio environment.
One of the most straightforward ways to work with Python in RStudio is by utilizing
the Reticulate package. Reticulate allows you to create Python scripts within
RStudio, execute Python code chunks, and seamlessly switch between R and Python.
It's particularly useful when you want to combine the strengths of both languages in
a single RMarkdown document.
By installing and configuring Reticulate, you can write Python code directly in
RStudio and harness Python libraries, making it an excellent choice for projects that
require collaboration between R and Python developers.
R Programming
library(reticulate)
For those who prefer a more traditional Python development environment, RStudio
also allows you to run Python scripts directly. By clicking on the "Run" button or
using keyboard shortcuts, you can execute Python code in RStudio's console or
terminal. This method is perfect if you want to maintain your Python code in
separate script files while leveraging RStudio's interface for code execution and
result visualization. Whether you're working on data analysis, machine learning, or
any Python-centric task, this approach provides familiarity and flexibility.
# Click the "Knit" button to run both R and Python code and
generate the report.
RStudio's compatibility with Jupyter Notebooks offers yet another powerful way to
work with Python. By creating a new R Markdown document with a Reticulate
engine, you can seamlessly include Jupyter code chunks. This integration allows you
to combine narrative text, R, and Python code in a single dynamic document. It's an
excellent choice for projects that require detailed explanations and interactive code
execution. In this method, you can benefit from Jupyter's extensive ecosystem while
taking advantage of RStudio's authoring capabilities. In the following sections, we'll
delve into the steps for implementing Python using each of these methods within
the RStudio environment, ensuring you have the flexibility to choose the approach
that best suits your needs.
● Step 1: Create a Python Script
Create a .py Python script (e.g., my_script.py) with your Python code. Save it in
your working directory.
Example (my_script.py):

print("Hello from Python!")
When you run this code in RStudio, it will execute Python code within the RStudio
environment and print "Hello from Python!". Repeat the above steps for each
method to effectively use both SAS and Python in your RStudio environment.
RStudio, synonymous with the R programming world, has a knack for embracing
diversity in the data science arena. In this section, we delve into the realm of SAS
programming, demonstrating how RStudio extends its arms to welcome the SAS
language. This integration combines the analytical prowess of SAS with the data
manipulation and visualization capabilities of RStudio. Whether you're a seasoned
SAS user or seeking the best of both worlds, you're in the right place.
As we embark on this exploration, you'll discover three distinct methods for running
SAS within RStudio. Each method offers a unique perspective on leveraging the
strengths of SAS while harnessing the versatile environment of RStudio. By the end
of this section, you'll have a toolbox of techniques to blend SAS into your RStudio
workflow seamlessly.
Now, let's outline the three methods with step-by-step instructions and
explanations for each.
Introducing a bridge between SAS and RStudio: the RSASSA package. This method
offers an intuitive way to execute SAS code within RStudio. RSASSA facilitates
interaction between R and SAS, allowing you to send your SAS code to a remote SAS
server and retrieve the results seamlessly. If you're accustomed to the SAS
environment but want to explore the visualization and reporting capabilities of
RStudio, this method provides the best of both worlds.
Step-by-Step Instructions:
4. Execute SAS Code: Write your SAS code within the sas_submit() function
and run it to send it to the SAS server.
Here's an intriguing method for incorporating SAS into your RStudio environment:
RMarkdown with SAS chunks. This approach offers a dynamic way to combine the
powers of R and SAS within a single RMarkdown document. Whether you're focused
on creating dynamic reports, presentations, or documents, RMarkdown allows you
to interweave R and SAS code seamlessly.
Step-by-Step Instructions:
2. Specify SAS Chunk: Within your RMarkdown document, define a SAS chunk
by using triple backticks (```{r, engine='SAS'} ... ```).
```{r, engine='SAS'}
/* This is a SAS chunk */
data work.example;
set sashelp.class;
run;
proc print data=work.example;
run;
```
The document starts with metadata, specifying the title and output format. The
triple backticks (```{r, engine='SAS'}) define a SAS chunk within
the document. You can write your SAS code within this chunk, and it will be
executed when you knit the RMarkdown document. In this example, we create a
new data set using the SAS DATA step and then print it using a PROC PRINT step.
When you knit this document in RStudio, it will execute the embedded SAS code
and generate an HTML document that includes both the code and the results.
3. Write SAS Code: Inside the SAS chunk, write your SAS code as you normally
would.
If you prefer to keep things straightforward and execute SAS scripts directly, RStudio
provides a method akin to calling SAS from the command line. By utilizing the
system2 function, you can invoke the SAS executable and execute your SAS code
files within RStudio. This method is ideal for those who wish to maintain their SAS
scripts but want the convenience of the RStudio environment.
Step-by-Step Instructions:
1. Write SAS Script: Prepare your SAS script in a separate .sas file, just like
you would in a typical SAS environment.
2. Use the System2 Function: In your RStudio script, utilize the system2
function to call the SAS executable and specify the path to your SAS script
file.
Here's an example of using the system2 function in RStudio to call a SAS program:
sas_path <- "C:/Program Files/SASHome/SASFoundation/9.4/sas.exe"   # placeholder; adjust to your SAS installation
sas_program <- "C:/projects/my_sas_program.sas"                    # placeholder; path to your .sas file

system2(command = sas_path, args = sas_program)
In this example:
1. sas_path is set to the path of your SAS executable. Be sure to adjust this
path to match your SAS installation directory.
2. sas_program is set to the location of your SAS program file (.sas file).
3. The system2 function is used to run the SAS program by specifying the
command (SAS executable path) and args (arguments, in this case, the SAS
program path).
When you run this R script in RStudio, it will call SAS and execute the specified SAS
program. Ensure that you replace the paths with the actual paths to your SAS
executable and program file.
3. Run RStudio Script: Run your RStudio script containing the system2
function. This will execute your SAS script, producing the desired output.
These three methods offer flexibility in incorporating SAS into RStudio, catering to
various preferences and requirements. Whether you use the RSASSA package for
tight integration, RMarkdown for dynamic documents, or system2 for familiar script
execution, RStudio's versatility welcomes SAS into its environment with open arms.
In the following pages, we'll unveil the methods to incorporate Python and R into
SAS Studio, equipping you with the tools to embark on data science projects with
unmatched flexibility. Whether you're a SAS aficionado, a Python enthusiast, an R
expert, or somewhere in between, SAS Studio's unique capability to synergize these
three languages will redefine your data science experience. So, let's dive in and
explore the symbiotic relationship between Python, R, and SAS within this
remarkable environment.
Throughout this book, we've explored the dynamic capabilities of SAS Studio, an
environment that bridges the realms of SAS, Python, and R, facilitating a seamless
synergy between these languages. However, we also understand the diverse
landscape of data science, where SAS Enterprise Guide remains a formidable choice
for many professionals. As we venture into this section, we recognize that readers
may employ both SAS Studio and SAS Enterprise Guide, and we're here to empower
you to integrate Python and R code in either environment.
SAS Enterprise Guide has long been a staple for those seeking the extensive
analytical capabilities and structured workflows it offers. It's a testament to the
versatility of SAS that it caters to different preferences, enabling users to harness
the power of SAS, Python, and R interchangeably. In the pages ahead, we will delve
into the nuances of SAS Enterprise Guide, demonstrating how you can achieve
similar flexibility and efficiency in implementing Python and R code, even if you
primarily work with this robust environment.
This section is your guide to transcending the boundaries of your chosen SAS
environment, whether it's SAS Studio or SAS Enterprise Guide. As we explore the
methods for seamlessly incorporating Python and R, you'll find that the power of
data science knows no constraints. Let's embark on a journey that empowers you to
excel in your data-driven endeavors, no matter which SAS environment you prefer.
SAS Studio and SAS Enterprise Guide are both SAS software products, but they serve
somewhat different purposes and have distinct features. Here are the key
differences between them:
1. Purpose:
● SAS Enterprise Guide: This is a more comprehensive SAS client tool that
provides a rich, point-and-click interface for data management, advanced
analytics, and reporting. It's used for more complex analytics and
reporting tasks.
2. User Interface:
3. Coding Options:
4. Deployment:
● SAS Studio: Often used in SAS Viya environments (the cloud or on-
premises). It can connect to both SAS 9 and SAS Viya.
5. Advanced Analytics:
● SAS Studio: Supports basic machine learning and statistics with SAS
procedures and open-source integration for more advanced analytics.
6. Customization:
● SAS Studio: Provides options for custom tasks and custom code snippets.
● Both SAS Studio and SAS Enterprise Guide allow you to incorporate
Python and R code into your SAS workflows.
● The process is similar in both environments. You can use code nodes or
code windows to write and execute Python and R code.
So, in general, you can implement Python and R code in either SAS Studio or SAS
Enterprise Guide. The choice between the two would depend on your specific
needs, with SAS Studio being more lightweight and user-friendly and SAS Enterprise
Guide offering more extensive features for advanced SAS analytics and reporting.
Here are three methods to implement Python code in SAS Studio, along with
explanations and step-by-step procedures:
One of the most straightforward methods to run Python in SAS Studio is to utilize
the Python Code node. This node provides a dedicated environment for writing,
executing, and visualizing Python code. It's convenient if you want to integrate
Python scripts into a larger SAS workflow.
Step-by-Step Procedure:
1. Open a New Project: Launch SAS Studio and open your project or create a
new one.
2. Add a Python Code Node: Within your project, select "Tasks and Utilities" in
the navigation pane on the left. Under "Tasks," expand "Programs" and
select "Python Code."
3. Write Your Python Code: In the Python Code node, you can write, edit, or
paste your Python script.
4. Execute the Code: Click the "Run" button (a green triangle icon) to execute
your Python code.
5. View Results: Any output or plots your Python script generates will be
displayed in the results pane.
Note: SAS Enterprise Guide may not provide the Python Code node. However, you
can still achieve this functionality by using Method 2.
This method lets you run Python code from within a SAS program by using the X statement to call the Python interpreter at the operating-system level (this requires OS command execution, i.e., the XCMD system option, to be allowed in your SAS session). It's a versatile approach that gives you more control over how Python and SAS interact.
Step-by-Step Procedure:
1. Open a SAS Program: Launch SAS Studio or SAS Enterprise Guide and open
a SAS program.
2. Use the X Statement: Insert an X statement in your SAS program that calls the python command, with your Python code (or the path to a Python script) inside the quoted string. For example:
DATA _null_;
   x 'python -c "your_python_code_here"';
RUN;
Let's say you want to calculate the sum of two numbers using Python code within a SAS program. Here's how you can do it:
DATA _null_;
   /* The X statement hands the quoted command to the operating system,
      so the Python code is supplied inline with python -c */
   x 'python -c "num1 = 5; num2 = 7; result = num1 + num2; print(result)"';
RUN;
In this example, the X statement calls the Python interpreter, which adds the two numbers and prints the result. When you run this SAS program, it executes the embedded Python code, and you'll see the result (12) in the SAS log.
3. Run the SAS Program: Execute your SAS program as you normally would.
4. Review Output: The SAS log will include any output generated by your
Python code.
SAS provides an integration with Jupyter Notebooks, allowing you to create and run
Python code in a Jupyter environment. This method provides a rich interactive
Python experience within your SAS environment.
Using Jupyter Notebooks in SAS is typically available in SAS Studio. SAS Enterprise
Guide, on the other hand, doesn't natively support Jupyter notebooks. SAS Studio
provides a web-based interface with integrated support for Jupyter notebooks,
making it easier to incorporate Python and R code alongside SAS in a notebook
format. This feature enhances interactivity and flexibility in data analysis and model
development.
SAS Enterprise Guide supports various scripting languages and can interact with
external Jupyter environments, but it doesn't have built-in support for Jupyter
notebooks.
Step-by-Step Procedure:
3. Write and Execute Python Code: In the notebook, you can write, execute,
and document your Python code just like in a standalone Jupyter
environment.
4. Save Your Work: Save your Jupyter Notebook in your SAS Studio project.
These methods enable you to harness the power of Python alongside SAS in your
data science projects, offering flexibility and versatility for your analytics tasks.
In this final chapter, we explored the powerful concept of language fusion, learning
how to integrate Python, R, and SAS to harness the unique strengths of each
environment. By combining these tools, you’ve gained the flexibility and power to
tackle complex data science challenges, leveraging the best of what each language
has to offer.
Now that you’ve completed this journey through the essentials of data science and
machine learning, you’re no longer just a student of these techniques – you’re ready
to step into the role of a data scientist. You have built a robust foundation, from
understanding the core programming environments, mastering data preparation,
and creating effective modeling pipelines, to implementing advanced predictive
models and ensuring their performance over time. You’ve also learned how to
integrate multiple programming languages, giving you the versatility to apply your
skills across any industry and solve a wide range of data problems.
As you move forward, remember that data science is as much about continuous
learning as it is about applying the knowledge you’ve gained. The tools and
techniques you’ve mastered in this book are just the beginning. Whether you’re
working in finance, healthcare, marketing, or any other field, the principles and
practices you’ve learned here will empower you to extract meaningful insights from
data and make informed, impactful decisions.
So, as you close this final chapter, know that you are equipped with the knowledge
and tools to tackle any data challenge that comes your way. Your journey as a data
scientist begins now, and with the skills you’ve developed, there’s no limit to what
you can achieve.
● Running SAS Code: The chapter also covers how to run SAS code within
Python Spyder, using methods like the subprocess module and SASPy. This
section emphasizes the practical advantages of integrating the data
management capabilities in SAS with Python's flexibility.
3. RStudio IDE
● Running SAS in RStudio: The chapter explains how to incorporate SAS code
into RStudio using methods such as the RSASSA package and RMarkdown
with SAS chunks. These techniques enable data scientists to harness the
powerful analytics of SAS within the RStudio environment.
4. SAS Studio
● Balancing SAS Studio and SAS Enterprise Guide: The chapter provides a
comparison between SAS Studio and SAS Enterprise Guide, focusing on their
capabilities for integrating Python and R. It offers guidance on choosing the
right environment based on specific project needs.
Chapter 9 Quiz
Questions:
1. What is the primary advantage of integrating Python, R, and SAS within a
single environment?
2. How does the rpy2 library facilitate the use of R code in Python Spyder?
3. Describe how the subprocess module can be used to run SAS code in
Python.
6. Explain how the RSASSA package allows for interaction between RStudio
and SAS.
7. What are the benefits of using Jupyter Notebooks within SAS Studio?
9. What are the key differences between SAS Studio and SAS Enterprise Guide
in terms of language integration?
10. Why might a data scientist choose to run Python code within RStudio?
11. What steps are necessary to run R code within Python Spyder?
12. How does using the SASPy module enhance Python's capabilities in handling
SAS code?
13. In what situations would it be beneficial to use SAS Studio over SAS
Enterprise Guide?
14. How can language fusion improve the efficiency of data science workflows?
17. How does RStudio's integration of Python and SAS differ from that of
Python Spyder?
18. What are the potential drawbacks of using multiple programming languages
within a single project?
19. How can Jupyter Notebooks be used to create a dynamic document that
includes Python and R code?
Data Integration
● SAS: Import and export data between SAS and R/Python using PROC IMPORT/EXPORT and PROC SQL.
● Python: Use Pandas for data manipulation; exchange data with R using rpy2 and with SAS using SASPy.
● R: Use data.table or dplyr for data manipulation; exchange data with Python via Reticulate and with SAS via RSASSA.
Installation
● SAS: SAS Studio is typically web-based, with built-in support for Python and R.
● Python: Install rpy2 for R integration; install SASPy for SAS integration.
● R: Install Reticulate for Python integration; install RSASSA for SAS integration; set up RMarkdown for multilingual documents.
Best Practices
● SAS: Leverage the full power of SAS with integrated Python and R for specialized tasks; use Jupyter Notebooks for more interactive and collaborative workflows.
● Python: Use rpy2 for seamless integration of R code within Python; utilize SASPy for running SAS code in Python.
● R: Use Reticulate to seamlessly integrate Python code; use RMarkdown for combining R, Python, and SAS in reports or analysis.
Introduction
As you progress in your data science journey, continuous learning is essential to stay
updated with the latest tools, techniques, and industry trends. This appendix
provides a curated list of resources, including books, websites, YouTube channels,
subreddits, and podcasts, to help you deepen your knowledge and skills. Whether
you are just starting or looking to advance your expertise, these resources offer
valuable insights and practical guidance across various aspects of data science.
Books
● The Data Science Handbook (Carl Shan, Henry Wang, William Chen, Max Song): A comprehensive overview of the data science landscape with interviews from leading data scientists.
● Python for Data Analysis (Wes McKinney): Essential for learning pandas, a powerful Python library for data manipulation and analysis.
● R for Data Science (Hadley Wickham, Garrett Grolemund): A complete guide to data import, cleaning, visualization, and modeling with RStudio.
● SAS Essentials: Mastering SAS for Data Analytics (Alan C. Elliott, Wayne A. Woodward): A solid foundation in SAS, focusing on data analytics and the use of SAS Studio.
● Practical Statistics for Data Scientists (Peter Bruce, Andrew Bruce): Covers practical statistical methods with examples in both R and Python.
● Deep Learning with Python (François Chollet): Introduction to deep learning with Keras and TensorFlow, written by the creator of Keras.
● Data Science from Scratch (Joel Grus): Teaches fundamental concepts of data science with Python, building up from the basics.
● Artificial Intelligence: A Guide for Thinking Humans (Melanie Mitchell): Explores the concepts and implications of AI, making it accessible to a general audience.
Websites
● Analytics Vidhya (www.analyticsvidhya.com): Community-based platform offering tutorials, discussions, and resources for data science enthusiasts.
● Data Science Central (www.datasciencecentral.com): A community for data science professionals, featuring articles, webinars, and forums.
● The Pudding (www.pudding.cool): Explores complex data stories through engaging visual essays, blending data science with journalism.
YouTube Channels
Subreddits
● r/DataScienceJobs (www.reddit.com/r/DataScienceJobs): A subreddit focused on job opportunities, career advice, and discussions related to data science.
● r/ArtificialIntelligence (www.reddit.com/r/ArtificialIntelligence): Discusses AI research, tools, and developments, intersecting with many aspects of data science.
● r/learnmachinelearning (www.reddit.com/r/learnmachinelearning): Geared toward learning machine learning techniques and sharing resources for beginners and practitioners alike.
Podcasts
● Not So Standard Deviations (Available on Apple Podcasts, Spotify, and Stitcher): Hosted by data science experts Hilary Parker and Roger D. Peng, discussing data analysis, data science tools, and R programming.
● SAS Users Podcast (Available on SAS.com, Apple Podcasts, and Google Podcasts): A podcast from SAS covering various topics, including best practices, tips, and interviews with SAS users.
● DataFramed (Available on DataCamp, Apple Podcasts, Spotify, and Google Podcasts): A podcast from DataCamp exploring data science, its applications, and its impact on industries.
● AI Alignment Podcast (Available on Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Discusses the long-term impact of AI, with insights relevant to the broader data science field.
● The TWIML AI Podcast (Available on Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Focuses on machine learning, AI, and deep learning, with interviews from industry experts.
● Becoming a Data Scientist (Available on Apple Podcasts, Spotify, Stitcher, and Google Podcasts): Follows the journey of learning data science, with insights and advice for beginners.
● Data Stories (Available on Apple Podcasts, Spotify, and Google Podcasts): A podcast about data visualization, how data is presented, and the stories it tells.
Introduction
Access to high-quality data is crucial for developing and honing your data science
skills. This appendix provides a comprehensive list of open-source data sources
where you can find data sets for a wide range of applications, from machine
learning projects to statistical analysis. These resources offer diverse data sets that
can be used to build, test, and refine your models, providing ample opportunities to
apply your knowledge in real-world scenarios.
● World Bank Open Data (data.worldbank.org): A vast collection of global development data, including economic, social, and environmental indicators.
● UN Data (data.un.org): A gateway to global statistics provided by the United Nations, covering a wide range of topics.
● Google Cloud Public Data Sets (console.cloud.google.com/marketplace/browse): Public data sets available through Google Cloud, suitable for big data analysis and machine learning.
● Yelp Data Set (www.yelp.com/dataset): A data set containing over five million reviews, useful for NLP and sentiment analysis projects.
● City of Chicago Data Portal (data.cityofchicago.org): Open data from the City of Chicago, including crime reports, public health data, and transportation information.
● European Union Open Data Portal (www.data.europa.eu/euodp/en/data): A collection of data sets from various European Union institutions and agencies, covering diverse topics.
Introduction
GitHub is a treasure trove of resources for data scientists and AI/ML practitioners.
This appendix features a carefully selected list of GitHub repositories that offer
valuable code, tools, and libraries to enhance your data science and machine
learning projects. These repositories include everything from hands-on tutorials and
reference implementations to advanced algorithms and libraries for data
processing, modeling, and visualization. The repositories span multiple
programming languages, including Python, R, and SAS, and are suitable for learners
at all levels, from beginners to seasoned professionals. Use this appendix as a guide
to explore, learn, and contribute to the vibrant open-source data science
community.
Each entry lists the repository name, author, primary programming language(s), a short description, and the web address.
● SAS, Python, and R: A Cross-Reference Guide (Gearhj; SAS, Python, R): Repository for this book, featuring code examples in SAS, Python, and R. https://github.com/Gearhj/SAS-Python-and-R-A-Cross-Reference-Guide
● End-to-End Data Science with SAS (Gearhj; SAS): Repository for the book End-to-End Data Science with SAS, featuring hands-on SAS code examples. https://github.com/Gearhj/End-to-End-Data-Science
● Awesome Data Science (academic; Various): A curated list of awesome data science resources, including books, courses, and tools. https://github.com/academic/awesome-datascience
● 100 Days of ML Code (Avik-Jain; Python): A comprehensive guide to learning machine learning through 100 days of code, covering various ML topics. https://github.com/Avik-Jain/100-Days-Of-ML-Code
● Tidyverse (tidyverse; R): A collection of R packages designed for data science, making data wrangling and visualization easier. https://github.com/tidyverse/tidyverse
● Caret (topepo; R): Classification and Regression Training package in R, offering tools to streamline the model training process. https://github.com/topepo/caret
● fastai (fastai; Python): Fastai simplifies training fast and accurate neural nets using modern best practices. https://github.com/fastai/fastai
● TensorFlow Models (tensorflow; Python): Models and examples built with TensorFlow, showcasing various machine learning techniques. https://github.com/tensorflow/models
● Scikit-learn (scikit-learn; Python): A Python module integrating classic machine learning algorithms with the scientific Python stack. https://github.com/scikit-learn/scikit-learn
● PyTorch Tutorials (pytorch; Python): PyTorch tutorials covering basics, advanced concepts, and research topics. https://github.com/pytorch/tutorials
● ggplot2 (tidyverse; R): An R package for creating elegant data visualizations using the grammar of graphics. https://github.com/tidyverse/ggplot2
● XGBoost (dmlc; Python, R, Julia, Scala): Scalable, portable, and distributed gradient boosting library in Python, R, and other languages. https://github.com/dmlc/xgboost
● R-Bloggers (R-Bloggers; R): A collection of R scripts and projects from the R-Bloggers community, focused on various data science topics. https://github.com/r-bloggers/R-Bloggers
● SASPy (sassoftware; Python, SAS): A Python library to connect to and run SAS from Python. https://github.com/sassoftware/saspy
● Advanced SAS (advanced-sas; SAS): A repository with advanced SAS programming techniques and examples for experienced SAS users. https://github.com/advanced-sas
● Machine Learning for Beginners (microsoft; Python): A 12-week, 26-lesson curriculum teaching basic machine learning concepts. https://github.com/microsoft/ML-For-Beginners
● Deep Learning Models (rasbt; Python): A collection of deep learning models and algorithms implemented in Python using TensorFlow and Keras. https://github.com/rasbt/deeplearning-models
● Awesome Machine Learning (josephmisiti; Various): A curated list of machine learning frameworks, libraries, and software. https://github.com/josephmisiti/awesome-machine-learning
● Hands-On Machine Learning with Scikit-Learn and TensorFlow (ageron; Python): Code for the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. https://github.com/ageron/handson-ml2
● Pandas (pandas-dev; Python): The official repository for the pandas library, essential for data manipulation and analysis in Python. https://github.com/pandas-dev/pandas
● The Algorithms (TheAlgorithms; Python): A collection of algorithms implemented in Python for data science and machine learning. https://github.com/TheAlgorithms/Python
● Data Science Notebooks (jakevdp; Python): Jupyter notebooks for the Python Data Science Handbook, covering essential data science tools. https://github.com/jakevdp/PythonDataScienceHandbook
● Deep Learning Papers Reading Roadmap (floodsung; Various): A roadmap to help understand deep learning by reading the top research papers in the field. https://github.com/floodsung/Deep-Learning-Papers-Reading-Roadmap
● NLP Projects (dair-ai; Python): A repository of natural language processing (NLP) projects and paper summaries. https://github.com/dair-ai/nlp_paper_summaries
● Machine Learning Cheatsheets (afshinea; Various): Cheat sheets for the Stanford CS 229 Machine Learning course, covering key concepts and algorithms. https://github.com/afshinea/stanford-cs-229-machine-learning
7. What are the key features of RStudio that make it suitable for
reproducible research and collaboration in data science?
8. Describe how SAS Studio handles large data sets and why this is important
for advanced analytics.
● SAS Studio handles large data sets efficiently with its ability to process
data in chunks, perform parallel processing, and leverage in-memory
analytics, which is crucial for advanced analytics.
9. What advantages does Python Spyder offer for exploratory data analysis
in scientific computing?
12. How can knowledge of multiple IDEs and programming languages enhance
a data scientist’s ability to tackle complex data problems?
13. Describe how SAS Studio’s log output assists in debugging and optimizing
code.
14. What role does the interactive console in Python Spyder play in iterative
code development?
15. How does RStudio support the creation of interactive data visualizations
and what are some use cases?
16. Why is it important for data scientists to be familiar with the different
programming environments discussed in this chapter?
17. Discuss the impact of the long-standing presence of SAS in the industry on
the development of analytical products and automated processes.
19. What are the benefits of using RStudio for statistical modeling in academic
and research settings?
20. How does understanding the similarities and differences between SAS,
Python, and R improve a data scientist’s versatility and problem-solving
capabilities?
1. What is the significance of the "Garbage In, Garbage Out" (GIGO) principle
in data science?
● The GIGO principle emphasizes that the quality of input data directly
affects the quality of the output. Poor-quality data will lead to
unreliable and inaccurate models, regardless of the sophistication of the
algorithms used.
2. Why is high-quality data essential for building reliable and robust models?
3. Describe the structure of the Lending Club data set used in this project.
● The Lending Club data set typically includes features such as loan
amount, interest rate, grade, subgrade, loan status, and borrower
attributes, which are used to assess credit risk.
4. Explain the process of creating a target variable for the Lending Club risk
model.
5. How does the PROC IMPORT procedure in SAS facilitate data import and
variable selection?
6. What is the role of the pandas library in Python for importing data?
7. How does R's fread function from the data.table package assist in data
import and selection?
● Systematic sampling involves selecting every nth item from a list after a
random start. It’s useful when a population is ordered and a simple
method of sampling is desired.
13. How does the use of a seed value in sampling ensure reproducibility?
● A seed value ensures that the random number generation process can
be replicated, making the sampling process reproducible.
14. Compare the data import and sampling techniques in SAS, Python, and R.
● SAS uses procedures like PROC IMPORT for data import and PROC
SURVEYSELECT for sampling. Python uses pandas for import and numpy
or scikit-learn for sampling. R uses fread for import and functions like
sample() for sampling.
15. What factors should be considered when choosing a data import method?
● Factors include file size, format, the need for variable selection, and the
efficiency of the method in the given programming environment.
16. Why might a data scientist choose to sample a data set before analysis?
● Sampling reduces the data set size, making analysis faster and more
manageable, especially with large data sets.
● It allows the data scientist to choose the most appropriate method for
ensuring that the sample is representative and the analysis is robust.
18. What are the benefits of reducing the size of a data set through sampling?
20. What are the next steps after importing and sampling data in a data
science project?
1. What are the key differences between academic data sets and real-world
data sets?
3. How does the "art" of data science influence the decision-making process
during data preparation?
4. What is the business problem defined for the Lending Club data set in this
chapter?
5. How is the "loan_status" field in the Lending Club data set used to create
the target variable "bad"?
7. What is "runway time," and how does it affect the analysis of the target
variable?
● "Runway time" refers to the period between when a loan is issued and
when its performance can be evaluated. It affects the analysis by
determining the point at which the outcome of the loan (e.g., default or
no default) can be reliably assessed.
8. Describe the process of limiting the data set to a specific date range and its
impact on the modeling process.
● Limiting the data set to a specific date range ensures that the data used
for modeling is relevant and consistent, which helps in building a model
that reflects current trends and behaviors.
10. How can summary statistics and visualizations like histograms help in
understanding the data set?
11. What are outliers, and why is it important to address them in data
science?
● Outliers are data points that deviate significantly from the rest of the
data. Addressing them is important because they can skew the results
and negatively impact the performance of models.
13. What is the Interquartile Range (IQR) method, and how is it used to
identify outliers?
● The IQR method identifies outliers by calculating the range between the first and third quartiles (Q1 and Q3) and defining outliers as points that fall below Q1 - (1.5 × IQR) or above Q3 + (1.5 × IQR).
14. Why is feature selection important in machine learning, and what are its
benefits?
16. What are wrapper methods in feature selection, and how do they differ
from filtering methods?
17. Explain the concept of embedded methods in feature selection and their
implementation.
18. What is feature engineering, and how can it improve model performance?
19. Describe the process of feature scaling and its importance in machine
learning.
● Data segmentation (splitting data into training, validation, and test sets)
is critical because it allows for an unbiased evaluation of the model’s
performance on unseen data, ensuring that the model generalizes well
beyond the data it was trained on.
● The train-validation-test split involves dividing the data set into three parts: a training set used to fit the model, a validation set used to tune hyperparameters and compare candidate models, and a test set used for the final, unbiased evaluation of performance.
● K-fold cross-validation divides the data set into k equally sized folds. The
model is trained on k-1 folds and validated on the remaining fold,
repeating this process k times. It is beneficial because it uses all data
points for both training and validation, leading to a more accurate and
stable estimate of model performance.
16. How does the AUC metric help in assessing the performance of a binary
classifier?
● AUC (Area Under the Curve) measures the model’s ability to distinguish
between classes across all threshold levels. A higher AUC indicates a
better performing model that can effectively differentiate between
positive and negative cases.
19. How does one-hot encoding facilitate the inclusion of categorical variables
in machine learning models?
20. What are the benefits of using a systematic model pipeline in data science
projects?
2. How does the quality of the modeling data set impact the accuracy of
predictive models?
● The quality of the data set directly impacts the accuracy of predictive
models, as clean, well-prepared data ensures the model can learn
meaningful patterns rather than noise.
3. What are the key steps involved in the model pipeline for data
preparation?
● The odds ratio represents the change in odds of the outcome occurring
for a one-unit increase in the predictor variable, holding other variables
constant.
● The logit link function transforms the probability of the outcome into a
linear combination of the predictor variables, facilitating the use of
linear regression techniques for binary classification.
● Child nodes are created by splitting a parent node based on the feature
that results in the highest reduction in impurity, dividing the data into
more homogeneous groups.
14. What are the advantages of using decision trees over logistic regression?
16. What is the difference between reduced error pruning and cost complexity
pruning?
19. What are the key performance metrics used to evaluate a decision tree
model?
20. What are the pros and cons of using logistic regression versus decision
trees for predictive modeling?
● Logistic Regression:
● Decision Trees:
5. How does Gradient Boosting differ from Random Forest in its approach to
building models?
● The learning rate controls the contribution of each tree to the final
model. A lower learning rate reduces the impact of each individual tree,
which can lead to better generalization but requires more trees.
8. When should you consider using Random Forest over Gradient Boosting?
● Random Forest is preferable when you need a quick, robust model with
less risk of overfitting, especially in data sets with many features and
less need for fine-tuned performance.
● Early stopping involves halting the training process once the model’s
performance on a validation set stops improving, which helps prevent
overfitting.
13. What are the typical use cases for LightGBM compared to XGBoost?
● LightGBM is typically used for large data sets with many features, where
its gradient-based one-sided sampling and leaf-wise growth strategy
offer faster training times and lower memory usage compared to
XGBoost.
16. What are the pros and cons of using Gradient Boosting in terms of
computational efficiency?
17. How can you use permutation importance to assess feature significance in
Gradient Boosting?
18. What metrics would you use to evaluate the performance of a Random
Forest model?
19. How does Gradient Boosting handle imbalanced data sets compared to
Random Forest?
20. What are the common hyperparameters that need tuning in Gradient
Boosting models?
4. What are support vectors, and why are they important in SVM?
● Support vectors are the data points closest to the hyperplane, and they
define the margin of separation between classes. These vectors are
crucial because the position and orientation of the hyperplane depend
directly on them.
● SVM is preferable when working with smaller data sets, when the
number of features is large relative to the number of observations, and
when the decision boundary is clear but non-linear.
9. What are the advantages of using neural networks for deep learning
tasks?
10. How does data standardization impact the performance of SVMs and
neural networks?
11. What is the difference between overfitting and underfitting, and how does
the C parameter in SVM address these issues?
● Overfitting occurs when the model is too complex and captures noise in
the data, while underfitting happens when the model is too simple to
capture the underlying trend. The C parameter in SVM helps balance
this by controlling the trade-off between achieving a wider margin
(which may lead to underfitting) and minimizing classification errors
(which may lead to overfitting).
14. How can class imbalance be addressed when training SVMs and neural
networks?
● The margin in SVM is the distance between the hyperplane and the
nearest data points (support vectors) from each class. A larger margin is
preferred as it implies a better generalization ability of the model.
16. What are some common use cases for neural networks in machine
learning?
18. Why might a deep neural network require more computational resources
than an SVM?
19. What are the pros and cons of using a radial basis function (RBF) kernel in
SVM?
● Pros: The RBF kernel can handle non-linear relationships well and can
map the data into higher dimensions to find a separating hyperplane.
Cons: It may lead to overfitting if not properly tuned and can be
computationally expensive for large data sets.
5. How is the AUC (Area Under the Curve) calculated, and what does it
represent?
● The AUC is calculated as the area under the ROC curve, which plots the
True Positive Rate (Sensitivity) against the False Positive Rate (1 -
Specificity). It represents the model’s ability to distinguish between the
positive and negative classes across different thresholds.
8. How does the Mean Squared Error (MSE) differ from Mean Absolute Error
(MAE) in regression analysis?
11. What are AIC and BIC, and how are they used to assess model complexity in regression models?
● AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are information criteria that penalize models for having too many parameters, helping to select models that balance fit and complexity.
12. Why is it important to monitor a model after it has been deployed into
production?
14. What are Docker and Kubernetes, and how do they assist in model
deployment?
15. What is PSI (Population Stability Index), and why is it critical in model
monitoring?
16. How does VSI (Variance Stability Index) contribute to model stability
monitoring?
17. What does a yellow range in threshold analysis indicate about a model’s
performance?
18. How can the lift table help in setting thresholds for classification models?
● The lift table helps identify the deciles where the model is most
effective at capturing true positives, guiding the setting of thresholds to
maximize business objectives like fraud detection or customer
retention.
19. Why is it necessary to periodically reassess the thresholds set for model
performance metrics?
20. Explain how dashboards and monitoring reports can be used to track and
evaluate model performance over time.
2. How does the rpy2 library facilitate the use of R code in Python Spyder?
3. Describe how the subprocess module can be used to run SAS code in
Python.
● SAS code can be integrated into an RMarkdown document using the sas
engine in code chunks. This allows for the execution of SAS code directly
within the RMarkdown file, producing dynamic reports that combine
SAS output with R analysis.
6. Explain how the RSASSA package allows for interaction between RStudio
and SAS.
● The RSASSA package enables RStudio to connect to a SAS server and run
SAS code from within R. It facilitates the transfer of data between R and SAS.
7. What are the benefits of using Jupyter Notebooks within SAS Studio?
9. What are the key differences between SAS Studio and SAS Enterprise
Guide in terms of language integration?
10. Why might a data scientist choose to run Python code within RStudio?
11. What steps are necessary to run R code within Python Spyder?
● To run R code within Python Spyder, one would typically use the rpy2
library, which allows R functions and objects to be accessed directly
from Python. Installation of rpy2 and setting up the appropriate
environment are necessary steps.
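As a minimal sketch of that setup (assuming R and the rpy2 package are installed; the R expression is purely illustrative):

import rpy2.robjects as ro

# Evaluate a short piece of R code from Python and read the result back
result = ro.r('mean(c(1, 2, 3, 4, 5))')
print(result[0])  # prints 3.0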
12. How does using the SASPy module enhance Python's capabilities in
handling SAS code?
● The SASPy module allows Python to connect to SAS, execute SAS code,
and retrieve results. This integration enhances Python’s capabilities by
enabling it to harness the powerful data manipulation and statistical
analysis tools in SAS.
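A brief sketch of that workflow is shown below; it assumes SASPy is installed and that a configuration name such as 'oda' is defined in your sascfg_personal.py (the configuration name and data set are illustrative):

import saspy

# Start a SAS session using a named configuration (assumed to exist)
sas = saspy.SASsession(cfgname='oda')

# Submit SAS code and review the log returned by SAS
result = sas.submit('proc means data=sashelp.class; run;')
print(result['LOG'])

# Pull a SAS data set into Python as a pandas DataFrame
df = sas.sasdata('class', libref='sashelp').to_df()
print(df.head())

sas.endsas()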
13. In what situations would it be beneficial to use SAS Studio over SAS
Enterprise Guide?
14. How can language fusion improve the efficiency of data science
workflows?
17. How does RStudio's integration of Python and SAS differ from that of
Python Spyder?
● RStudio integrates Python and SAS primarily through the Reticulate and
RSASSA packages, respectively, allowing code from these languages to
be executed within R scripts. Python Spyder, on the other hand, uses
libraries like rpy2 and subprocess to run R and SAS code within Python
scripts.
19. How can Jupyter Notebooks be used to create a dynamic document that
includes Python and R code?
Glossary
A/B Testing
A method of comparing two versions of a web page, model, or other product to
determine which one performs better. Commonly used in marketing and product
development.
Accuracy
A metric that measures the percentage of correctly predicted instances out of the
total instances, often used to evaluate classification models.
Activation Function
A function in a neural network that determines the output of a node given an input
or set of inputs. Common activation functions include ReLU, sigmoid, and tanh.
Algorithm
A set of rules or instructions given to an AI/ML model to help it learn from the data
and make predictions or decisions.
Backpropagation
An algorithm used in training neural networks, where the error is calculated and
propagated backward through the network to update the weights.
Bayesian Inference
A statistical method that updates the probability estimate for a hypothesis as more
evidence or information becomes available.
Bias
A model’s systematic error that leads to consistent deviations from the true value,
typically resulting in underfitting.
Bias-Variance Trade-off
A balance between bias (error from erroneous assumptions) and variance (error
from sensitivity to small fluctuations in the training data). The trade-off affects
model performance on unseen data.
C
Classification Model
A type of predictive model that assigns a label or category to an input based on its
features. Examples include logistic regression, decision trees, and neural networks.
Clustering
A type of unsupervised learning that groups similar data points together based on
their features. Common algorithms include K-means and hierarchical clustering.
Confusion Matrix
A table used to describe the performance of a classification model, showing the true
positives, false positives, true negatives, and false negatives.
Correlation
A statistical measure that describes the strength and direction of a relationship
between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive
relationship, -1 indicates a perfect negative relationship, and 0 indicates no
relationship.
Cross-Validation
A technique used to assess the performance of a model by dividing the data into
multiple subsets (folds) and training/testing the model on different combinations of
these subsets.
Curse of Dimensionality
A phenomenon where the performance of a model degrades as the number of
features (dimensions) increases, due to the sparsity of data in high-dimensional
space.
Data Augmentation
A technique used to increase the diversity of training data by applying random
transformations such as rotation, flipping, or scaling, commonly used in image
processing.
Data Imputation
The process of replacing missing data with substituted values, commonly using
mean, median, or mode, or more complex methods like k-nearest neighbors or
regression.
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a suitable
format for analysis and model building.
Decision Trees
A type of model that splits data into branches based on feature values, creating a
tree-like structure to make decisions or predictions.
Deep Learning
A subset of machine learning involving neural networks with many layers (deep
networks) that can learn from large amounts of data to model complex patterns.
Dimensionality Reduction
Techniques used to reduce the number of features in a data set while retaining as
much information as possible, often through methods like PCA (Principal
Component Analysis) or t-SNE.
Distributions
A statistical function that describes the likelihood of different outcomes in a data
set. Common distributions include normal, binomial, and Poisson distributions.
Ensemble Methods
Techniques that combine multiple models to improve overall performance.
Common methods include bagging, boosting, and stacking.
Epoch
A single pass through the entire training data set during the training process of a
machine learning model, particularly in neural networks.
Explainability (Interpretability)
The extent to which a human can understand the decisions or predictions made by a
machine learning model. Explainability is crucial in fields like healthcare and finance,
where understanding model decisions is essential.
F1 Score
A performance metric that combines precision and recall, calculated as the
harmonic mean of precision and recall. It provides a single score that balances the
two metrics.
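In formula form, F1 = 2 × (Precision × Recall) / (Precision + Recall).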
False Negative
A situation where a model incorrectly predicts the negative class when the actual
class is positive.
False Positive
A situation where a model incorrectly predicts the positive class when the actual
class is negative.
Feature Engineering
The process of selecting, transforming, and creating new features from raw data to
improve model performance.
Feature Importance
A metric that measures the contribution of each feature in predicting the target
variable, often used in tree-based models.
Feature Scaling
A method used to normalize or standardize the range of independent variables
(features) to ensure that no single feature dominates the learning process.
Gradient Descent
An optimization algorithm used to minimize the loss function by iteratively adjusting
the model parameters in the direction of the steepest decrease in the loss.
Hyperparameter Tuning
The process of optimizing the hyperparameters of a model (settings that are not
learned from data) to improve its performance.
Hypothesis Testing
A statistical method used to test an assumption (hypothesis) about a population
parameter, often using p-values to determine significance.
Imbalanced Data
A situation where one class is significantly more frequent than the other(s) in a
classification problem, potentially leading to biased models.
Inference
The process of making predictions or drawing conclusions from a trained machine
learning model on new, unseen data.
K-Means Clustering
An unsupervised learning algorithm that partitions data into k clusters, where each
data point belongs to the cluster with the nearest mean.
Lasso Regression
A type of linear regression that includes L1 regularization, which penalizes the
absolute value of coefficients, leading to sparse models where some coefficients are
exactly zero.
Learning Rate
A hyperparameter that controls the step size during gradient descent, determining
how quickly or slowly a model learns.
Linear Regression
A type of regression model that predicts the value of a continuous dependent
variable based on one or more independent variables using a linear equation.
Logistic Regression
A classification model that predicts the probability of a binary outcome using a
logistic function.
Loss Function
A function that measures how well a model's predictions match the true values,
guiding the optimization process during model training.
Model Evaluation
The process of assessing the performance of a machine learning model using various
metrics like accuracy, precision, recall, F1 score, and AUC.
Multicollinearity
A situation in regression analysis where independent variables are highly correlated,
leading to unreliable coefficient estimates.
Neural Networks
A class of machine learning models inspired by the human brain, consisting of layers
of interconnected nodes (neurons) that process data and learn complex patterns.
Normalization
The process of scaling data to a standard range, often between 0 and 1, to improve
model performance and convergence.
Optimization
The process of finding the best set of parameters for a model that minimizes or
maximizes a given objective function, often using algorithms like gradient descent.
Overfitting
A modeling error that occurs when a model is too complex and captures noise in the
training data, leading to poor generalization on new data.
P-Value
A statistical measure that helps determine the significance of the results. In
hypothesis testing, a low p-value (< 0.05) indicates strong evidence against the null
hypothesis.
Precision
A metric for classification models that measures the accuracy of positive predictions
(True Positives / (True Positives + False Positives)).
Predictor
See Independent Variable.
Random Forest
An ensemble learning method that builds multiple decision trees and aggregates
their predictions to improve accuracy and reduce overfitting.
Recall (Sensitivity)
A metric for classification models that measures the ability to identify all positive
instances (True Positives / (True Positives + False Negatives)).
Reinforcement Learning
A type of machine learning where an agent learns to make decisions by interacting
with an environment and receiving feedback through rewards or penalties.
Regression Model
A type of model that predicts a continuous outcome based on one or more
independent variables. Examples include linear regression and polynomial
regression.
Regularization
A technique used to prevent overfitting by adding a penalty to the loss function,
encouraging simpler models. Common methods include L1 (Lasso) and L2 (Ridge)
regularization.
Ridge Regression
A type of linear regression that includes L2 regularization, which penalizes the
square of coefficients, helping to reduce model complexity and multicollinearity.
Sampling
The process of selecting a subset of data from a larger data set for analysis, often
used in the context of training machine learning models.
Supervised Learning
A type of machine learning where the model is trained on labeled data, meaning
that the input data comes with corresponding output labels. The model learns to
map inputs to outputs and is evaluated based on how well it can predict the labels
for new data.
Synthetic Data
Artificially generated data that mimics the properties of real data. It is often used to
augment data sets, perform privacy-preserving data analysis, or test models in
controlled environments.
Test Set
A subset of data used to evaluate the performance of a trained model. The test set
is separate from the training set and is used to assess how well the model
generalizes to new, unseen data.
Train/Test Split
The process of dividing a data set into two parts: one for training the model (training
set) and one for testing its performance (test set). This split allows for an unbiased
evaluation of the model's generalization to new data.
Transfer Learning
A machine learning technique where a model developed for a particular task is
reused as the starting point for a model on a second task. It is commonly used in
deep learning, where pre-trained models are fine-tuned on new tasks.
Tuning
The process of optimizing hyperparameters to improve the performance of a
machine learning model. Tuning is typically done using techniques like grid search,
random search, or Bayesian optimization.
Underfitting
A modeling error that occurs when a model is too simple to capture the underlying
patterns in the data, resulting in poor performance on both the training and test
sets. Underfitting often occurs when the model is not complex enough or when
there is not enough training data.
Unsupervised Learning
A type of machine learning where the model is trained on unlabeled data, meaning
the input data does not come with corresponding output labels. The model learns to
identify patterns, clusters, or structures in the data.
Validation Set
A subset of data used during model training to tune hyperparameters and prevent
overfitting. The validation set provides an additional check on model performance,
helping to select the best model before evaluating on the test set.
Variance
A measure of a model's sensitivity to fluctuations in the training data. High variance
can lead to overfitting, where the model performs well on the training data but
poorly on new, unseen data.
Vectorization
The process of converting data (e.g., text, images) into numerical vectors that can be
used in machine learning models. Vectorization is commonly used in natural
language processing (NLP) and computer vision.
Weights
Parameters in a machine learning model that are learned from the data, influencing
the model's predictions. In neural networks, weights are adjusted during training to
minimize the loss function.
Word Embedding
A type of word representation that maps words into continuous vector space where
semantically similar words are located closer together. Common techniques include
Word2Vec, GloVe, and FastText.
Wrapper Method
A feature selection method that evaluates subsets of features based on their
contribution to model performance. The method wraps around a model to select
the best-performing subset of features.
XGBoost
An optimized gradient boosting algorithm that is widely used in machine learning
competitions and real-world applications for its efficiency, speed, and performance.
XGBoost can be used for both classification and regression tasks.
Y Variable
Another term for the target or dependent variable, representing the output that a
model is trained to predict.
Zero-Inflated Model
A statistical model used for count data that has an excess of zero counts. These
models assume that the data-generating process has two parts: one that generates
the excess zeros and another that generates the counts.
Z-Score
A statistical measure that describes a data point's relationship to the mean of a
group of values. Z-scores are expressed in terms of standard deviations from the
mean, with a positive Z-score indicating a value above the mean and a negative Z-
score indicating a value below the mean.
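Expressed as a formula, z = (x - μ) / σ, where x is the observed value, μ is the mean, and σ is the standard deviation of the group.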
Index of Terms
activation 12, 297, 300, 304, 316, 318, 321, 467, 478
AIC 99, 104, 186–187, 196, 327, 356–357, 359–361, 369, 371–374, 387, 389, 391,
472, 478–479
anaconda 28, 46
architecture 72, 122, 297–300, 311–312, 316, 318, 321, 377–378, 467
ARIMA 488
AUC 6, 12–13, 157, 159–165, 170–171, 174, 193, 215–216, 223–225, 233, 239–241,
246–247, 255–256, 262–263, 268, 272, 280, 284–285, 287, 291–292, 302, 307–
308, 311–314, 317, 321–323, 325, 327, 330–334, 336–339, 345–347, 373–374,
381–382, 385–387, 389, 458, 461, 470–471, 478, 485
Bayesian 71, 166, 214, 327, 369, 372, 391, 470, 472, 479, 488
bias 16, 86, 128–129, 131–132, 144–146, 150, 155, 180–181, 185, 191, 206, 214–
215, 279, 281–282, 297, 362, 364, 384, 455–456, 479, 483
BIC 99, 187, 196, 327, 356–357, 359–361, 369, 372–374, 387, 389, 391, 472, 479
binary 49, 69–72, 75–76, 80, 109, 113–117, 124, 131, 137, 141, 144, 164, 168, 171,
176–184, 191, 202, 207, 215, 227, 232, 245, 247, 253, 259, 262–263, 324, 327,
336–337, 340, 347, 373–374, 449, 452, 456, 458–460, 481, 484
boosting 5, 12, 70, 72, 110, 116, 132, 147, 159, 163–166, 174, 177, 205, 215, 226–
227, 234, 249–256, 258–259, 261–264, 266–272, 276–277, 442, 463–466, 481,
483, 490
caret 10, 13, 36, 95, 99–101, 106–107, 114, 141, 173–174, 196, 218–219, 232–233,
242–243, 248, 258, 263, 271–272, 286, 305, 313, 321, 328, 377, 397, 441
categorical 2, 68–69, 110, 113, 115–118, 123–124, 138, 140, 143–147, 155–156,
158, 160–162, 168–169, 171–173, 175–177, 191–192, 203, 206, 214, 228–229,
232, 236, 239, 250, 255, 280, 301, 313, 324, 327, 372–373, 376, 455–460, 470,
481
CHISQ 332
classification 2, 6–7, 49, 65, 68–73, 85, 107–110, 112, 131–132, 144, 146, 164, 168,
171, 175–177, 179, 182, 190, 193, 202, 207–210, 215–216, 224, 227, 230, 232,
234–235, 238–240, 248–249, 252–256, 263, 266, 271, 273–275, 278–279, 282–
285, 287, 293, 297, 301–302, 313–315, 318, 320–321, 324–325, 327–331, 334–
338, 340–341, 345, 347, 349, 353, 355, 370, 372–374, 381, 386–387, 389–390,
441, 456, 458–460, 466–468, 470, 473, 478–480, 483–484, 486–487, 490
cluster 21, 55, 59, 61, 65, 328, 340, 450, 480, 484, 489
CNN 298
coefficients 180, 183–184, 186, 188–192, 196–198, 228, 306, 376, 461, 484, 487
combine 28, 34, 49, 65–67, 70–71, 111, 119–120, 132–133, 135–136, 173, 186, 192,
195, 205, 212, 217, 225–226, 238, 249, 257, 264, 266, 273, 285, 287, 304, 307,
323, 386, 399–400, 404, 406–408, 410–411, 420–422, 426, 463, 474–475, 479,
481–482
compile 12
complexity 41, 86, 94, 118, 121–122, 166, 186–187, 189–190, 209–210, 214–215,
219, 228, 230, 232, 236, 238, 251, 254–255, 268, 272, 277–278, 282, 293, 300,
307, 327, 368–369, 371–372, 374, 387, 389, 461–462, 472, 476–477, 487
concatenate 117, 134, 158, 160–161, 195, 217, 257, 285, 304
confusion matrix 6, 12, 18, 23, 157, 160, 170–171, 174, 193, 199, 202, 210, 215–
216, 219, 224, 233, 240, 243, 248, 256, 263, 282, 284–285, 287, 302, 307–308,
313, 327, 330–333, 341–342, 373, 389, 458, 470, 480
console 1, 19, 28–30, 32–34, 37, 40–41, 44–45, 396, 398, 402, 408, 438, 447
continuous 68–70, 87, 113–114, 123–124, 146–147, 176, 179, 182, 192, 229, 323–
324, 337, 355–356, 372–373, 375, 379, 385, 387, 420, 427, 459–460, 470, 484,
486, 490
convolutional 298
correlation 2, 95–98, 103, 116, 123–125, 137, 139, 141, 185–186, 191–192, 200,
203, 309, 313, 454, 480, 485
covariance 124
crosstabulation 3, 76–77, 99–100, 149–153, 169, 171, 210, 215, 250, 278, 282, 301,
317, 457–459, 462, 480
crowd-sourced 85
csv 10, 23, 25, 30, 53–54, 58, 63, 157, 193, 196, 216, 218, 240, 242, 248, 256–258,
263, 282, 285, 287, 302, 306, 313, 449
cvfit 198
dashboard 21–22, 89, 380, 385, 387, 390, 393, 447, 473
dataframe 31–32, 54, 108, 158, 160–161, 194–195, 217, 225, 257, 285, 287, 303–
304, 328, 332, 351, 356, 358, 434
debugging 19, 24, 27, 29, 34, 41, 43, 45–46, 445, 447, 477
deploy 33, 136, 148, 153, 156, 163, 166, 297, 314, 323, 341, 355, 374–380, 387, 389,
392, 416, 448, 459, 472
depth 159, 209, 215, 217, 241, 248, 257, 264, 267, 271–272, 300, 462, 466
deviation 86, 112, 125, 146, 212, 279, 281, 363–366, 380–382, 434, 453, 455, 468,
479, 491
dimensionality 2, 65, 116, 118–119, 121–125, 166, 281, 298, 454, 461, 480–481,
486
distinct 69, 97, 118, 147, 149, 151, 273–274, 276, 324, 365, 373, 410, 415
distribution 6, 12, 15, 23, 27, 46, 55, 59, 74–77, 80, 82, 84–85, 87, 112, 114, 117,
129–135, 137, 145, 177–179, 183, 185, 191, 211–212, 274–275, 301, 325, 338,
340, 345, 347, 362–365, 379, 381, 383–384, 387, 392–393, 442, 453, 456, 471–
472, 481
docstring 31
download 18, 28, 33, 36–37, 46, 49–50, 52, 61, 63, 68, 449–450
dummy 2, 4, 111, 115–118, 124, 144, 146, 162, 173, 176, 191, 200, 203, 206, 220,
224, 227, 301, 308, 313, 481
eigenvalues 126–128
elbow 128
encoding 2, 110–111, 115–117, 138, 140, 143–147, 155–156, 158, 160, 162, 168–
169, 171–173, 191, 232, 301, 376–377, 455–460
ensemble 5, 11–12, 70–72, 108, 132, 157, 174, 177, 205, 215, 226, 234–240, 249–
252, 254–255, 264–267, 269, 271, 273, 276–277, 329, 331, 463, 465, 479, 481,
483, 486
epsilon 321
estimate 68, 86–87, 93, 104–105, 132, 146–147, 150, 152–153, 159–160, 169, 182,
185–186, 191–192, 195–196, 198, 217, 239, 241, 248, 252, 257–258, 264, 266,
271–272, 280, 286, 305, 324, 331, 372, 457–458, 463, 466, 479, 485
Excel 16, 43, 177, 206, 273, 296, 298–299, 415, 445, 449, 468
explainability 482
feature 2, 4–5, 11, 19, 21–22, 27–29, 31, 33–34, 36–38, 40, 43, 45–49, 57, 65, 68,
70–71, 73, 81–82, 93–95, 97–103, 105–116, 118–125, 135, 137–141, 143–145,
153–156, 158, 160–163, 168–169, 171, 173, 176, 190, 194, 204–208, 214, 227–
228, 230, 232, 236–239, 250, 252–253, 255–256, 266–267, 269, 271–272, 277–
279, 281–282, 284, 295, 298–299, 301, 303, 313, 316, 320, 322, 328, 342, 377–
378, 380, 383–385, 415, 417, 419, 440, 445–446, 449, 451–452, 454–456, 458,
460–461, 464–468, 472, 479–483, 486, 488, 490
filter 2, 49, 80, 90–91, 95, 97–98, 101–102, 104–105, 107–110, 133, 139, 250, 454
fine-tune 136, 148, 151, 156, 225, 297, 349, 464, 488
fit 4, 10–12, 94, 103, 108, 114, 120, 126, 158–160, 186–188, 195, 197, 200, 209–
210, 213, 217, 241, 249, 258, 278, 286, 304–306, 309, 326–327, 331, 358–359,
362, 366, 368–372, 472, 478
fixed-form 175
floor 88
FNN 298
forest 4–5, 11, 65, 70–72, 94, 107–110, 113, 115–116, 132, 147, 159, 163, 177, 205,
215, 226–227, 234–243, 245, 247–253, 256, 264, 266–272, 276–277, 282, 327,
329–331, 349, 351, 463–466, 486
format 10, 24, 26, 48, 53–54, 57, 77–78, 81, 144, 146, 162, 168, 222, 245, 261, 291,
294, 311, 376, 408, 412, 419, 436, 449, 451, 456, 459, 480
FPR 13, 159–161, 196, 210, 218, 242, 258, 286, 305, 325, 332, 344, 346
framework 7, 19, 28, 44, 66, 116, 153–154, 175, 385, 388, 448
fraud 69, 130, 327–335, 337, 339–345, 347–353, 355, 373, 382, 386, 473
freeware 16–17, 22
frequency 12, 23, 74–77, 80, 117–118, 133–135, 174, 201–202, 211, 219, 221–222,
233, 244–245, 260–261, 289–290, 310, 325, 332, 357, 383
gain 6, 29, 82, 94, 119, 123, 152, 156, 207–208, 215, 228, 238–239, 252, 256, 301,
323, 326–327, 330–334, 348, 350–355, 380
generalize 86, 94, 129, 148, 152–153, 189, 213, 249, 278–279, 383, 386, 461–463,
465
ggtitle 350
Gini 6, 12, 207–208, 211–212, 228, 233, 238, 253, 267, 325, 327, 330, 332–334, 336,
345–347, 373, 381–382, 386–387, 389, 461–462, 464, 471
GLMSELECT 11, 99, 101–104, 106, 118, 141, 174, 183, 232
GOSS 250
GPUs 298
gradient 5, 12, 70, 72, 110, 112, 116, 132, 147, 157, 159, 163–166, 174, 177, 180–
181, 205, 215, 227, 234, 249–256, 258–259, 261–264, 266–272, 276–277, 297–
298, 316, 442, 455, 463–467, 483–485, 487, 490
graph 18, 22, 30, 36, 45, 85, 103, 127, 164, 202, 442, 445
GridSearchCV 159, 193, 195, 215, 217, 232, 240–241, 256–257, 284, 286, 302, 304,
321
GRU 298
heteroscedasticity 362–363
high-dimensional 71, 122–123, 177, 254, 273–274, 276–278, 280, 314–316, 466,
469, 480, 486
histogram 11, 25, 84–85, 137, 139, 141, 338, 340, 357, 359, 361, 364–365, 453
Hosmer-Lemeshow 186–187
HPNEURAL 311–312
HPSVM 291–292
hyperparameter 4, 143, 148, 150, 154–156, 159, 163, 166, 171, 195, 203, 213–215,
217, 222–223, 225, 229–230, 232, 235, 239–243, 245–248, 250, 255–257, 261–
264, 270, 280–283, 286, 288, 290, 292, 297, 299–301, 304–305, 307, 311–312,
314, 317–319, 321, 375, 457–459, 462, 466, 468–470, 483–484, 488–489
IDE 1, 7, 17, 19, 22–23, 28–30, 33–37, 39–41, 43, 45, 395, 400–402, 406, 421, 445,
447
imbalance 128–133, 144–145, 168, 171, 173, 190–191, 206, 239–240, 255–256, 268,
270, 280–283, 301, 316, 318, 336, 342–343, 374, 456, 466, 468–469, 483
import 1, 10–13, 23, 32, 38, 47, 49, 52–54, 58–59, 61, 63, 75, 84, 91, 95, 102, 108,
114, 120–121, 125, 133–134, 157, 159–160, 193, 202, 215, 218, 224, 240, 242,
248, 256, 258, 263, 282, 284, 286, 302, 313, 328, 331, 350, 357–358, 396, 398,
401, 404–405, 425–426, 428, 449–451
importance 10, 39, 43–44, 47, 49, 56, 58–61, 68, 71–72, 76, 82, 85, 88, 94, 100, 103,
105–106, 108–110, 115, 118–120, 122–125, 128–132, 136–137, 139–140, 143,
146–147, 149, 151, 163, 168–169, 171, 176, 181, 183, 185, 188, 190, 210, 213,
215, 225, 227, 229–230, 236, 238–239, 251, 253–256, 267–269, 271, 277, 279,
283–284, 301, 313, 315–316, 318, 323, 340, 342–343, 361, 365, 370, 379, 383,
385–389, 446–447, 449, 452–455, 458, 462, 464–469, 471–472, 482, 486
imputation 11, 55, 59, 80, 93, 143–147, 154–158, 160, 162, 168–169, 173, 176, 206,
227, 239, 279, 281, 322, 376, 456, 480
independent 17, 49, 119, 148, 154, 182, 184–186, 192, 228, 368, 454, 460, 471,
482–486
index 7, 46, 82, 102, 108, 110, 158, 160–162, 287, 351, 383–384, 387, 389, 392–393,
436, 472
indicator 48, 74–75, 114–115, 117, 124, 327, 329, 380, 438
interaction 2, 19, 28–29, 32–34, 40–41, 44–45, 111, 119–121, 131, 138, 142, 144–
145, 158, 160–162, 168, 173, 182, 188–189, 253, 272, 393, 399–400, 402, 408,
410, 419, 423, 426, 429, 447, 474–475, 477
KDE 359
kernel 5, 7, 166, 274, 276, 278, 280–284, 286, 288, 291–292, 315, 317–318, 320,
402–403, 419, 425, 466, 469
KPI 380–381
kurtosis 365
layer 71, 209, 293, 295–299, 304, 316–317, 321, 467–470, 481, 485
lbfgs 304
leaf 159, 176, 204–205, 208–209, 215, 217, 261, 263, 272, 465
libraries 7, 11–13, 15, 19–21, 28–29, 36–37, 39–41, 43–46, 52–53, 58, 61, 74, 90, 95,
99–101, 106–107, 114, 125, 133, 193, 196, 199–200, 202, 215, 218, 220, 224,
240, 242–244, 248, 256, 258, 260, 263, 279, 282, 284, 286–287, 289, 302, 305,
309, 313, 328, 356, 360, 376, 378, 392, 395, 397–399, 404, 406–407, 409–410,
414, 421, 423, 427–428, 430, 440, 442–443, 445–446, 448–449, 474–477
likelihood 49, 73, 181–182, 185–187, 192, 337, 339, 344, 369, 463, 481
linear 3, 11, 70–71, 89, 94, 102, 106, 112, 119, 122, 157, 167, 175, 177–181, 183–
186, 190–193, 205, 225, 227–228, 230, 273, 276, 278, 284, 286, 288, 292, 294–
296, 315, 320, 355, 357–358, 360, 362–363, 367, 460, 462, 466–467, 484, 486–
487
logistic 3–4, 11–13, 70, 94, 102–103, 106, 112, 115, 132, 157, 159, 163–167, 174–
193, 196, 199, 202, 204–206, 223–225, 227–233, 246–247, 253, 259, 262–263,
291–292, 311–312, 315, 321, 332, 460–462, 479, 484
log-likelihood 186–187
log-odds 205
LSTM 298
Lua 416
MAE 13, 239, 255, 280, 324, 326, 356–358, 360–361, 366–367, 370–373, 380, 387,
389, 391, 458, 471
manifold 122
markdown 408
matplotlib 10, 13, 19, 28–29, 84, 141, 157, 193, 215, 224, 240, 256, 282, 284, 302,
350, 358, 393
matrix 6, 12, 95–96, 98, 117, 124, 157, 160, 170–171, 173–174, 186, 193, 198–199,
202, 210, 215–216, 224, 233, 240, 248, 256, 259, 263, 282, 284–285, 302, 307–
308, 313, 327, 330–333, 341–342, 373, 389, 458, 470, 480
MDI 238
mean 11, 13, 82–83, 85–86, 104, 112, 114, 126, 145–146, 168, 173, 180, 183, 185,
209, 211, 234, 238–239, 249, 253, 255, 266, 279–281, 297, 324–326, 331–333,
343–344, 349, 351–352, 356–358, 360, 366–368, 370–371, 373, 389, 391, 456,
458, 468, 471, 480, 482, 484, 491
median 82–83, 93, 145, 158, 160, 162, 168, 173, 456, 480
metrics 4–7, 12–13, 68, 73, 75, 99–101, 107, 129, 154, 156–157, 160, 164–165, 170–
171, 174, 186–187, 192–193, 195, 202–203, 209–212, 215, 218–219, 224–225,
228, 230, 233, 236–241, 243, 248, 252–256, 258–259, 263–264, 266, 268–270,
272, 279–288, 302, 305, 307–308, 313–314, 317, 321–327, 329–335, 337, 340–
342, 345–347, 349, 355–358, 360–361, 366–368, 370, 372–374, 379–391, 437,
458, 461–463, 465–466, 469–473, 478, 482, 485–486
mini-batch 181
missing 11, 24, 49, 55, 57, 66, 74–75, 80, 82, 93, 97, 101–102, 107–109, 135–136,
143–147, 154–155, 158, 160, 162, 168–169, 171, 173, 176, 194, 197, 201, 203, 206,
216, 218, 221, 224, 227–228, 239, 241–242, 244, 257–258, 260, 279, 281–282, 285,
287, 289, 300, 303, 307, 310, 322, 335, 339–340, 343, 365, 376–378, 452, 456, 460,
480
MLE 185
MLR 99–100
monitoring 7, 163, 212, 297, 314, 323, 375, 379–390, 392–393, 472–473
MSE 13, 104, 239, 255, 280, 323–324, 326, 356–358, 360–361, 366–367, 370–374,
380, 387, 389, 391, 458, 470–471
multicollinearity 111, 116, 123, 185–186, 191–192, 194, 206, 228–230, 232, 236,
240, 251, 269, 276, 281, 301, 303, 454, 460–462, 465, 481, 485, 487
multi-output 71–72
multi-target 2, 71
naive 71
neural 5–6, 12, 71–72, 112, 122, 157, 159, 163–166, 177, 190, 227, 265, 273, 293–
302, 304–305, 307–308, 311–322, 441, 467–470, 478–479, 481, 485, 490
node 8, 71–72, 176, 204–205, 207–209, 228, 230, 235, 238, 248, 252, 293, 295–296,
378, 416–417, 422, 425, 461–462, 478, 485
numpy 11, 19, 28, 40, 43, 45, 75, 91, 102, 108, 157, 193, 302, 328, 331, 350, 357,
392, 446, 450
OneHotEncoder 110–111, 145–146, 157–158, 160, 162, 168, 172–173, 191, 232,
459
OOT 149, 154, 157–158, 160–163, 165, 169, 192–203, 215–221, 223–225, 240–245,
247–248, 256–261, 263–264, 281–283, 285–290, 292, 302–303, 305–310, 312–
314
open-source 8, 20, 28, 43, 45, 378, 416, 436, 440, 445
optimization 18–19, 29, 40, 43, 45–46, 63, 99, 106, 110, 112, 143, 145, 148, 156,
166, 214, 229, 249, 265, 276, 284, 297, 299, 301, 312, 314, 316–317, 321, 337–
339, 341, 375, 447, 464, 469–470, 483–485, 487–488, 490
order 3, 18, 76, 113, 115–116, 123, 126, 145–146, 152–153, 169, 194, 197, 216, 219,
291–292, 303, 307
orthogonal 122
outlier 2, 11, 55, 65–66, 82–93, 103, 112, 131, 136–137, 139, 141, 143–147, 154–
156, 158, 160, 162, 168–169, 171, 173, 175, 193, 206, 214, 227–229, 278–281,
300, 302, 362–367, 370–371, 387, 453–456, 461–462, 471
out-of-time 148–149, 151–156, 162, 169, 171, 192, 202–203, 215, 240, 256, 281–
282, 457–458
overfitting 86, 93–94, 112, 118, 122, 137, 148, 186–187, 189–190, 205, 209, 213–
215, 228–230, 234, 236–239, 250–255, 266–269, 272, 274, 277–278, 297, 299–
300, 315–316, 318, 329, 340, 368–369, 371–372, 374, 454, 461–465, 468–469,
478, 485–487, 489
oversampling 55, 130, 145, 168, 173–174, 191, 206, 239, 255–256, 280, 282, 301,
316, 374, 456, 469
pandas 10, 19, 28, 40, 43, 45, 53, 58, 61, 63, 84, 95, 102, 108, 114, 121, 125, 133,
141, 157, 193, 215, 224, 240, 248, 256, 263, 282, 284, 302, 313, 328, 331, 350,
357, 392–393, 414, 425, 428, 430, 443, 446, 449–450
panel 1, 29–31
parameter 5, 63, 86, 97, 104–105, 143, 148, 150–151, 163, 180–182, 185, 196, 198,
209–210, 213, 219, 222, 227, 232, 245, 253, 261, 267, 278–280, 283–284, 290,
297, 311, 315, 317–318, 320–321, 369, 372, 462, 466, 468–469, 472, 478–479,
483, 485, 487, 489
parametric 204–205
Pearson 96–97
penalize 106, 186–187, 189, 210, 253–254, 276, 291–292, 327, 366, 368–369, 371–
372, 374, 465–466, 469, 471–472, 478–479, 484, 486–487
pipeline 3, 116, 135–136, 143, 145, 153–154, 156–157, 162–164, 166–172, 175–
176, 202–203, 227, 230, 282–283, 323, 376, 392, 420, 455–456, 458–460
plot 10, 12–13, 17, 29–31, 34, 78, 82, 84, 95, 126–128, 157, 159, 193, 196, 198–199,
202, 210, 215, 218, 220, 240–243, 248, 256, 258–259, 264, 282–284, 286, 288,
302, 305, 314, 325, 332, 346, 350–354, 357–364, 366, 417, 471, 478
Poisson 481
polynomial 111, 119–120, 144–145, 168, 173, 206, 278, 288, 292, 315, 320, 486
precision 12, 129–130, 157, 160–162, 164, 170, 174, 211–212, 228, 239, 255, 268,
272, 280, 314, 317, 321, 323–325, 327, 330–331, 333–340, 342–343, 355, 373–
374, 380, 385–386, 389, 458, 462, 466, 469–470, 482, 485–486
predict 12–13, 48–49, 51, 58, 68–73, 80, 86, 93, 95, 103, 114, 130, 146–147, 159–
161, 175, 180–189, 191, 193–198, 200–203, 216–225, 235, 240–245, 247–249,
257–260, 262–264, 266, 282–294, 302–313, 324, 327, 329–331, 338, 356–358,
368–369, 371–373, 391–392, 449, 452, 459, 471, 481, 483, 487–488, 490
preprocess 11, 111, 114, 119–121, 128, 143, 157, 161–162, 175–176, 202, 225, 227,
279, 282, 284, 288, 297, 299, 302, 307, 313, 322, 375–377, 387, 392, 451–452,
455, 472, 480
prune 4–5, 209–210, 215, 222–223, 228, 230, 232, 236, 238, 254, 267, 272, 332,
461–462, 464
pyplot 10, 13, 84, 157, 193, 215, 240, 256, 284, 302, 350, 358
quadrant 17
quadratic 283
qualitative 177
random 4–5, 11, 54–56, 59, 61, 63–64, 70–72, 94, 107–110, 113, 115–116, 118, 132,
134, 147, 159, 163–164, 166, 177, 180, 185, 194–195, 201, 205, 214–215, 217,
221, 225–227, 229, 234–245, 247–253, 256–257, 260, 264, 266–272, 275–277,
282–283, 285–286, 290, 301, 303–304, 310, 317, 326–331, 336, 346–349, 351,
353–354, 358, 450, 459, 463–466, 470, 480, 486, 488
recall 12, 129–130, 157, 160–162, 164, 170, 174, 211–212, 228, 239, 255, 268, 272,
280, 314, 317, 321, 323–325, 327, 330–331, 333–337, 339–340, 343, 355, 373–
374, 380, 385–386, 458, 462, 466, 469–470, 482, 485–486
regression 2–4, 6, 11, 21, 65, 68–71, 89, 94, 106, 111–112, 115–116, 131–132, 146–
147, 159, 163–167, 174–193, 195–196, 198–200, 202–206, 209–210, 219, 225,
227–232, 234–235, 238–239, 249, 252–253, 255, 266, 279–280, 283, 294–297,
309, 315, 321, 324–326, 355–358, 360–364, 366, 370, 372–373, 386–387, 389,
391, 441, 454, 459–462, 470–472, 479–481, 484–487, 490
regularization 4–5, 105–106, 112, 189–190, 202, 251, 253–254, 267–269, 272, 277–
278, 299–300, 315, 317, 320–321, 454, 464–466, 484, 487
replace 10, 56, 89–92, 133–134, 146, 356, 398, 401, 413
report 51, 89, 193, 202, 215–216, 224, 240, 248, 256, 263, 282, 284–285, 287, 302,
313, 331, 408
residual 6, 179, 210, 249–251, 267, 276, 324, 357, 359–366, 373–374, 387, 391
rget 13
RNN 298
ROC 12–13, 157, 159–162, 174, 193, 196, 198–199, 202, 210, 212, 215–216, 218,
220, 223–225, 228, 230, 233, 239–243, 246–248, 255–256, 258–259, 262–264,
272, 280, 282–288, 291–292, 302, 305, 307–308, 311–314, 321–322, 325, 327,
330–332, 336–337, 339, 345–346, 461–462, 466, 469, 471, 478
RShiny 392
R-squared 104, 239, 255, 280, 324, 326–327, 356–358, 360–361, 368, 371–373, 387,
389, 391, 458, 470–471
rstats 432
sampling 1, 54–55, 59, 61–64, 133, 145, 168, 174, 194, 197, 203, 217, 224, 241–242,
248, 250, 257, 259, 264, 282, 284–285, 287, 303, 306, 313, 450–451, 463–465,
487
scaling 2, 11, 34, 110–115, 123–125, 138, 140–141, 145, 153, 250, 253, 279–284,
288, 297, 300–301, 307, 313, 316, 322–324, 375–379, 392, 442, 455, 468, 472,
480, 482, 485
scatter 359–360
scikit-learn 19, 28, 40, 43, 45, 99–100, 106, 376, 427, 430, 441, 443, 446, 448, 451
scipy 12, 95
score 6, 12–13, 156–157, 160–164, 170, 174, 184, 187, 189, 193, 196, 202, 204–205,
211–212, 215–216, 222–224, 228, 233, 239–240, 245–248, 255–256, 262–263,
268, 271–272, 278, 280, 282, 284–285, 287, 291–292, 302, 311–313, 317, 321,
324–325, 327–334, 336–341, 343, 348–349, 351–352, 355, 358, 373–374, 376,
386, 389, 391–392, 458, 462, 466, 469–470, 482, 485
scree 126–128
seed 50, 55–56, 59, 61, 63–64, 201, 221, 244, 260, 290, 310, 329, 450
sensitivity 2, 10, 132, 191, 199, 211, 276, 321, 325, 335, 339, 343, 373, 387, 471,
479, 486, 489
setsearch 436
sigma 361
sklearn 10–13, 95, 102, 108, 114, 120, 125, 134, 141, 157, 159–160, 173–174, 193,
215, 224, 232–233, 240, 248, 256, 263, 271–272, 282, 284, 302, 313, 320–322,
328, 331, 357, 391
sns 358–359
splitting 147, 149, 169, 176, 192, 206–208, 215, 228, 236–238, 252–253, 297, 330–
331, 356, 358–359, 457, 460–461, 464
SQL 16, 97, 103, 109, 134, 200, 220, 244, 259, 289, 309, 360, 392–393, 426, 429
standardize 11, 114, 123, 125–126, 141, 157, 280, 283–284, 286, 288, 290, 301–302,
304–305, 307, 310, 313, 322, 482
stopping 180–181, 190, 208–209, 251–252, 254, 268–269, 272, 277, 300, 464–465
subset 79, 99–102, 122–123, 208, 235, 238, 266–267, 277, 280, 284, 451, 463, 481,
487–490
SURVEYSELECT 56, 63, 134, 173, 201, 221, 244, 260, 290, 310, 359, 450
SVM 5, 12, 70–71, 110, 112–113, 132, 157, 159, 166, 273–275, 278–284, 286, 288–
289, 291–292, 311, 315, 317–318, 320–322, 466–470, 487
synthetic 145, 168, 174, 239, 255–256, 280, 282, 301, 316, 327–329, 355–356, 358–
359, 386, 456, 469, 487
target 2, 11–13, 48–52, 58, 61, 65, 68–76, 80, 101–104, 107–110, 119, 131, 133–
135, 137, 139, 141, 144–147, 159–160, 176–178, 182, 185, 193–197, 200–201,
205, 207, 215–217, 219, 221–225, 227, 240, 242, 245–249, 257–258, 260–264,
278, 282, 285, 287–288, 290–292, 294, 297, 302–307, 309–313, 327, 332, 334,
349, 351–353, 355–356, 365, 367, 370–373, 378, 386, 449, 452–454, 460, 471,
481–482, 488, 490
t-distributed 122
three-dimensional 274
threshold 6–7, 73, 87–89, 95–97, 129, 158, 160, 162, 195, 197, 199, 208, 210–211,
241, 243, 306, 325–326, 334–335, 337–341, 345–347, 349, 355, 370, 372, 374,
379–382, 384–387, 389–390, 393, 458, 461, 471, 473
TPR 13, 159–161, 196, 210, 218, 242, 258, 286, 305, 325, 332, 346
trade-off 67, 209–210, 278, 284, 315, 339, 343, 349, 355, 466, 468, 479
transform 11, 17–18, 47, 54, 57, 59, 65, 73, 88–89, 93, 110, 114–115, 119–120, 123–
124, 135–136, 144–145, 158, 160–161, 175, 183–185, 227, 276, 278, 281, 286,
304–305, 315, 320, 363–365, 376–377, 452, 480
tree 4, 11, 157, 159, 163, 166, 175, 177, 204, 206–210, 212, 214–215, 217–220,
222–225, 228, 230, 232, 235–240, 249–255, 266–269, 271–272, 276, 280, 461–
465
tune 159, 195, 203, 215, 217, 219, 222, 225, 239, 241–243, 245–248, 250, 255, 257,
261, 264, 280, 283, 286, 288, 291–292, 297, 304, 308, 311, 315, 318, 320–321,
457, 468, 489
undersampling 130, 133, 145, 168, 173, 191, 195, 197, 201, 203, 206, 217, 219, 221,
225, 239, 255–257, 261, 280, 282, 285, 287, 301, 304, 306, 316, 374, 456, 469
validation 5, 147–156, 158–159, 162–163, 165–166, 169, 171, 190, 192, 194, 197,
203, 210, 215, 217, 223–224, 236–237, 241–242, 246, 248, 252, 254, 257, 259,
262–264, 266, 272, 282, 285, 287, 291, 297, 301, 303–304, 306–307, 311, 313,
375, 457–458, 460, 462, 465, 470, 489
variance 7, 104, 112, 114, 122–123, 126–128, 150, 179–180, 186, 192–194, 206,
215, 236, 238, 251–252, 277, 282, 302–303, 326, 362–363, 368, 371, 374, 383–
385, 387, 389, 393, 457, 464, 471–473, 479, 489
vectors 5, 12, 70–71, 106, 110, 112–113, 115, 132, 159, 163–166, 176–177, 227,
265, 273–283, 314–315, 318, 320, 466–467, 469, 487, 489–490
VIF 192, 194, 197, 200, 203, 206, 232, 303, 306, 309, 313
weights 88, 131–132, 180–181, 281, 295–297, 299–300, 316, 328, 366, 464, 466–
467, 469, 479, 489–490
whitespace 10, 53
winsorizing 2, 86–87, 89, 137, 139, 141, 144, 146, 158, 160, 162, 168, 173, 453–454
WOE 88
xgboost 70, 132, 174, 250, 256–259, 267, 269, 442, 464–465, 490