
Version 2020 For Use With Excel 2013-2019

Analytic Solver
Data Mining User Guide
Copyright
Software copyright 1991-2020 by Frontline Systems, Inc.
User Guide copyright 2020 by Frontline Systems, Inc.
GRG/LSGRG Solver: Portions copyright 1989 by Optimal Methods, Inc. SOCP Barrier Solver: Portions
copyright 2002 by Masakazu Muramatsu. LP/QP Solver: Portions copyright 2000-2010 by International
Business Machines Corp. and others. Neither the Software nor this User Guide may be copied, photocopied,
reproduced, translated, or reduced to any electronic medium or machine-readable form without the express
written consent of Frontline Systems, Inc., except as permitted by the Software License agreement below.

Trademarks
Frontline Solvers®, XLMiner®, Analytic Solver®, Risk Solver®, Premium Solver®, Solver SDK®, and RASON®
are trademarks of Frontline Systems, Inc. Windows and Excel are trademarks of Microsoft Corp. Gurobi is a
trademark of Gurobi Optimization, Inc. Knitro is a trademark of Artelys. MOSEK is a trademark of MOSEK
ApS. OptQuest is a trademark of OptTek Systems, Inc. XpressMP is a trademark of FICO, Inc.

Acknowledgements
Thanks to Dan Fylstra and the Frontline Systems development team for a 25-year cumulative effort to build the
best possible optimization and simulation software for Microsoft Excel. Thanks to Frontline’s customers who
have built many thousands of successful applications, and have given us many suggestions for improvements.
Risk Solver Pro and Risk Solver Platform have benefited from reviews, critiques, and suggestions from several
risk analysis experts:
• Sam Savage (Stanford Univ. and AnalyCorp Inc.) for Probability Management concepts including SIPs,
SLURPs, DISTs, and Certified Distributions.
• Sam Sugiyama (EC Risk USA & Europe LLC) for evaluation of advanced distributions, correlations, and
alternate parameters for continuous distributions.
• Savvakis C. Savvides for global bounds, censor bounds, base case values, the Normal Skewed distribution
and new risk measures.

How to Order
Contact Frontline Systems, Inc., P.O. Box 4288, Incline Village, NV 89450.
Tel (775) 831-0300 Fax (775) 831-0314 Email info@solver.com Web http://www.solver.com
Table of Contents

Start Here: Data Mining Essentials in V2020
  Getting the Most from This User Guide
    Desktop and Cloud Versions
    Installing the Software
    Understanding License and Upgrade Options
    Getting Help Quickly
    Finding the Examples
    Using Existing Models
    Getting and Interpreting Results

Installation and Add-Ins
  What You Need
  Installing the Software
    Installing Analytic Solver Cloud
    Installing Analytic Solver Desktop
  Logging in the First Time
  Uninstalling the Software
  Activating and Deactivating the Software
    Excel 2019, 2016 and 2013

Analytic Solver Data Mining Overview
  Overview
    V2016 Release
    V2016-R2 Release
    V2016-R3 Release
    V2017 Release
    V2017-R2 Release
    V2018 Release
    V2019 Release
    V2020 Release
  What’s New in Analytic Solver V2020.5
    Easily Deploy Your Model as a Cloud Service
    More About RASON Decision Services
    How You Can Use RASON
    New Time Series Simulation Functions
    New Optimization Result Functions
  Analytic Solver Product Line
    Desktop and Cloud Versions
    Analytic Solver Basic
    Analytic Solver Upgrade
    Analytic Solver Optimization
    Analytic Solver Simulation
    Analytic Solver Data Mining
    Analytic Solver Comprehensive
  Data Mining Ribbon Overview
  Model
  Get Data
  Data Analysis
  Time Series Analysis
  Data Mining
  Tools and Help

Using Help, Licensing and Product Subsets
  Introduction
  Working with Licenses in V2020
    Frontline License Manager
  Managing Your Licenses
  Product Selection Wizard
  Getting Help
    Accessing Resources
    User Guides
    Example Models
    Knowledge Base
    Operating Mode
    Support Mode
    Submit a Support Ticket
    Solver Academy
    Video Tutorials/Live Webinars
    Learn More!
    Help Menu

Creating a Workflow
  Introduction
  Creating a Workflow
    Workflow Tab Options
  Deploying Your Workflow
    Posting Workflow to RASON Cloud Services
    Deploying Your Model
  Manually Creating a Workflow
    Making/Breaking a Connection
    Running a Workflow with a New Dataset
    Multiple Workflows
    Changing Options Settings
    Workflow Groups

Bringing Big Data into Excel Using Apache Spark
  Introduction
  Sampling and Summarizing Big Data
    Connecting to an Apache Spark Cluster
    Storage Sources and Data Formats
    Sampling from a Large Dataset
    Summarizing a Large Dataset

Fitting a Model Using Feature Selection
  What is Feature Selection?
  Fitting a Model

Text Mining
  Introduction
  Text Mining Example
    Importing from a File Folder
    Using Text Miner
    Output Results
    Classification with Concept Document Matrix

Exploring a Time Series Dataset
  Introduction
    Autocorrelation (ACF)
    Partial Autocorrelation Function (PACF)
    ARIMA
    Partitioning
  Examples for Time Series Analysis

Classifying the Iris Dataset
  Introduction
  Creating the Classification Model

Predicting Housing Prices Using Multiple Linear Regression
  Introduction
  Multiple Linear Regression Example

Scoring New Data
  Introduction
  Scoring to a Database
  Scoring to a Worksheet
  Scoring Test Data
    Scoring Test Data Example
  Using Data Mining Psi Functions in Excel
    Scoring Data Using Psi Functions
    PsiPredict()
    PsiPosteriors()
    PsiTransform()
    Time Series Forecasting
    PsiForecast()
    Time Series Simulation


Start Here: Data Mining Essentials in V2020

Getting the Most from This User Guide


Desktop and Cloud Versions
Analytic Solver V2020 comes in two versions: Analytic Solver Desktop – a
traditional “COM add-in” that works only in Microsoft Excel for Windows PCs
(desktops and laptops), and Analytic Solver Cloud – a modern “Office add-in”
that works in Excel for Windows and Excel for Macintosh (desktops and
laptops), and also in Excel for the Web using Web browsers such as Chrome,
Firefox and Safari. Your license gives you access to both versions, and your
Excel workbooks and optimization, simulation and data mining models work in
both versions, no matter where you save them (though OneDrive is most
convenient).

Installing the Software


Read the chapter “Installation and Add-Ins” for complete information on
installing Analytic Solver Cloud and (if you wish) Analytic Solver Desktop.
This chapter also explains how the Cloud and Desktop versions interact when
both are installed, and how to install and uninstall both versions.
In brief, to add Analytic Solver Cloud version to your copy of Excel, you use
the Excel Ribbon option Insert – Get Add-ins – no Setup program download or
installation is required. To install Analytic Solver Desktop on Windows PCs,
visit www.solver.com, login using the same email and password you used to
register on Solver.com, download and run the SolverSetup program.

Understanding License and Upgrade Options


Frontline Solvers V2020 feature a revised, simpler product line that gives you
access to all features, all the time for small models, and a new licensing system,
tied to you and usable on more than one computer. Read about this in the
chapter “Using Help, Licensing and Product Subsets.”

Getting Help Quickly


Choose Help on the Ribbon. You’ll see several options, starting with Help –
Help Center. Support Live Chat, Example Models, and User Guides are also
available here. In Analytic Solver Desktop (only) you can also get quick online
Help by clicking any underlined caption or message in the Task Pane.



Finding the Examples
Use Help – Examples on the Analytic Solver or Data Mining Ribbon to open a
list of example optimization and simulation models, and example data sets for
data mining, that you can open by clicking hyperlinks. See the chapter “Using
Help, Licensing and Product Subsets” for details. Some of these examples are
used and described in the Examples chapters.

Using Existing Models


Models created using XLMiner 4.0 and earlier can be used in Analytic Solver
Data Mining V2020 or later without any required changes.

Getting and Interpreting Results


Learn how to interpret Analytic Solver Data Mining’s result messages, error
messages, reports and charts using the Help file embedded within the software.
Simply go to Help – User Guides on the Data Mining ribbon.



Installation and Add-Ins

What You Need


You can use Analytic Solver Cloud in Excel for the Web (formerly Excel
Online) through a Web browser (such as Edge, Chrome, Firefox or Safari),
without installing anything else. This is the simplest and most flexible option,
but it requires a constant Internet connection.
To use Analytic Solver Cloud in Excel Desktop on a PC or Mac, you must have
a current version of Windows or macOS installed, and you will need the latest
Excel version installed via your Office 365 subscription – older non-
subscription versions, even Excel 2019, do not have all the features and APIs
needed for modern JavaScript add-ins like Analytic Solver Cloud.
To use Analytic Solver Desktop (Windows PCs only), you must have first
installed Microsoft Excel 2013, 2016, 2019, or the latest Office 365 version on
Windows 10, Windows 8, Windows 7, or Windows Server 2019, 2016 or
2012. (Windows Vista or Windows Server 2008 may work but are no longer
supported.) It’s not essential to have the standard Excel Solver installed.

Installing the Software


Installing Analytic Solver Cloud
Analytic Solver V2020 includes our next-generation offering, Analytic Solver
Cloud – usable in the latest versions of desktop Excel for Windows and
Macintosh, and in Excel for the Web. Analytic Solver Cloud is divided into two
add-ins that work closely together (since a JavaScript add-in currently can have
only one Ribbon tab): the Analytic Solver add-in builds optimization,
simulation and decision table models, and the Data Mining add-in builds data
mining or forecasting models.
Both the Analytic Solver and Data Mining add-ins support existing models
created in previous versions of Analytic Solver. Your license for Analytic
Solver will allow you to use Analytic Solver Desktop in desktop Excel or
Analytic Solver Cloud in either desktop Excel (latest version) or Excel for the
Web.
To use the Analytic Solver and Data Mining add-ins, you must first “insert”
them for use in your licensed copy of desktop Excel or Excel for the Web, while
you are logged into your Office 365 account. Once you do this, the Analytic
Solver and Data Mining tabs will appear on the Ribbon in each new workbook
you use.
To insert the add-ins for the first time, open desktop Excel (latest version) or
Excel for the Web, click the Insert tab on the Ribbon, then click the button
Office Add-ins or (if you see it) the smaller button Get Add-ins.



In the dialog box that appears, click the Store tab and type “Analytic Solver”
into the Search box. Once you find the Analytic Solver add-in, click Add.
After a moment, you should see the Analytic Solver tab appear on the Ribbon,
with a note about how to “Get started with the Solver add-in!”, as shown below.

Repeat these steps to search for, locate and Add the Analytic Solver Data
Mining add-in. After a moment, you should see the Data Mining tab appear on
the Ribbon, with a similar “Get started” note.
After you perform these steps (one time) to insert the Analytic Solver and
Analytic Solver Data Mining add-ins, they will appear under "My Add-ins". If
you ever need to remove the add-ins, click the “…” symbol to the right of the
add-in name, then click the Remove choice on the dropdown menu that appears.

Note: Data Mining Cloud will defer to Analytic Solver Desktop. If you find that
Data Mining Cloud does not appear after following these steps, you'll first need
to remove Analytic Solver Desktop before installing the Cloud app. To do so
simply open Excel and click File – Options – Add-ins – COM Add-ins – Go,
select Analytic Solver Add-in, then click Remove. Once Analytic Solver
Desktop is removed, then go back to Office Add-ins – My Add-ins and add in
Data Mining Cloud.

Single Sign On Functionality


Analytic Solver Cloud includes Single Sign On functionality which
automatically logs in users to their Analytic Solver Cloud account using their
Microsoft 365 credentials. This means that if you've signed in to your Microsoft
365 account using the same email address you used to register on
www.solver.com, then you will not be asked to login to Analytic Solver Cloud.
Once you insert Analytic Solver Cloud, you will have immediate access to all the
features and functionality of Analytic Solver. As long as you remain signed in
to your Microsoft 365 account, you'll never have to login to Analytic Solver
Cloud again!
If you log out of Analytic Solver or your Office 365 credentials do not match
your Solver credentials, you'll see the following welcome screen in the Solver
task pane. Just click the Get Started button or click License – Login/Logout on
the ribbon to login.



Installing Analytic Solver Desktop
To install Analytic Solver Desktop to work with any of the supported versions
of Microsoft Excel (see above), simply run the program SolverSetup.exe, which
installs all of the Solver program, Help, User Guide, and example files.
SolverSetup.exe checks your system, detects what version of Office you are
running (32-bit or 64-bit) and then downloads and runs the appropriate Setup
program version.
Note that your copy of the Setup program will usually have a filename such as
SolverSetup_12345.exe; the ‘12345’ is your user account number on
Solver.com.
When you run the Setup program, depending on your antivirus program or
Windows security settings, you might be prompted with a security warning (for
example, that the program is new or not commonly downloaded) asking whether
you are sure you want to run this software. You may safely click Run in
response to this message.
Next, you’ll briefly see the standard Windows Installer dialog. Then a dialog
box like the one shown below should appear:



Read this, so you know the difference between Analytic Solver Basic and the
upgrades to handle larger models and datasets. Then click Next to proceed –
you’ll see a dialog like the one below.

Next, the Setup program will ask if you accept Frontline’s software license
agreement. You must click “I accept” and Next in order to be able to proceed.

The Setup program then displays a dialog box like the one shown below, where
you can select or confirm the folder to which files will be copied (normally
C:\Program Files\Frontline Systems\Analytic Solver Platform, or if you’re
installing Analytic Solver for 32-bit Excel on 64-bit Windows, C:\Program Files
(x86)\Frontline Systems\Analytic Solver Platform). Click Next to proceed.



You’ll see a dialog confirming that the preliminary steps are complete, and the
installation is ready to begin.

After you click Install, the Analytic Solver files will be installed, and the
program file RSPAddin.xll will be registered as a COM add-in (which may take
some time). A progress dialog may appear; be patient, since this process could
take longer than it has in previous Solver Platform releases.



When the installation is complete, you’ll see a dialog box like the one below.
Click Finish to exit the installation wizard.

The full Analytic Solver product family is now installed. With your trial and
paid license, you can access every feature of the software, including forecasting
and data mining, simulation and risk analysis, and conventional and stochastic
optimization. Simply click “Finish” and Microsoft Excel will launch with a
Welcome workbook containing information to help you get started quickly.



Logging in the First Time
In Analytic Solver V2020, your license is associated with you, and may be used
on more than one PC. For example, you can run SolverSetup to install the
desktop software on your office PC, your company laptop, and your PC at home.
But only you can use Analytic Solver, and only on one of these computers at a
time. It is unlawful to “share” your license with another human user.
The first time you run Analytic Solver (Desktop or Cloud) after installing the
software on a new computer, when you next start Excel and visit the Analytic
Solver tab on the Ribbon, you will be prompted to login. Enter the email
address and password that you used to register on Solver.com. Once you’ve
done this in Analytic Solver Desktop, your identity will be “remembered,” so
you won’t have to login every time you start Excel and go to one of the Analytic
Solver tabs. In Analytic Solver Cloud, you may be asked to login more
frequently.
You can login and logout at any time, by visiting the Solver Home tab (see
“Cloud Version and Solver Home Tab” below) and clicking the Log In or Log
Out button in Analytic Solver Desktop or by clicking Help – Login/Logout in
Analytic Solver Cloud. If you share use of a single physical computer with
other Analytic Solver users, be careful to login with your own email and
password, and log out when you’re done – if you don’t, other users could
access private files in your cloud account, or use up your allotted CPU time or
storage.
When you move from one computer to another, you should log out on one and
log in on the other. As a convenience, if you log in to Analytic Solver on a new
computer when you haven’t logged out on the old computer, Analytic Solver
will let you know, and offer to automatically log you out on the other computer.

Uninstalling the Software


To uninstall Analytic Solver Desktop, just run the SolverSetup program as
outlined above. You’ll be asked to confirm that you want to remove the
software.
You can also uninstall by choosing Control Panel from the Start menu, and
double-clicking the Programs and Features or Add/Remove Programs applet.



In the list box below “Currently installed programs,” scroll down if necessary
until you reach the line, “Frontline Excel Solvers 2020,” and click the
Uninstall/Change or Add/Remove… button. Click OK in the confirming dialog
box to uninstall the software.

Activating and Deactivating the Software


Analytic Solver Desktop’s main program file RSPAddin.xll is a COM add-in, an
XLL add-in, and a COM server. A reference to the add-in Solver.xla is needed
if you wish to use the “traditional” VBA functions to control Analytic Solver,
instead of its new VBA Object-Oriented API.

Excel 2019, 2016 and 2013


In Excel 2019, 2016 and 2013, you can manage all types of add-ins from one
dialog, reached by clicking File – Options – Add-ins.

You can manage add-ins by selecting the type of add-in from the dropdown list
at the bottom of this dialog. For example, if you select COM Add-ins from the
dropdown list and click the Go button, the dialog shown below appears.

If you uncheck the box next to “Analytic Solver Addin” and click OK, you will
deactivate the Analytic Solver COM add-in, which will remove the Analytic

Solver tab from the Ribbon in desktop Excel, and also remove the PSI functions
for optimization from the Excel Function Wizard.



Analytic Solver Data Mining Overview

Overview
This Guide shows you how to use Analytic Solver Data Mining (Desktop or
Cloud), which combines the capabilities of data analysis, time series analysis,
and classification and prediction techniques. Analytic Solver Data
Mining is included in Analytic Solver Comprehensive, or can be purchased as a
stand-alone license.
Analytic Solver Data Mining was originally developed and marketed by others
(Cytel Corp and Statistics.com), and was in popular use for over a decade,
primarily in teaching, when Frontline Systems acquired all rights to the product
in 2011. In 2012-2013, Frontline marketed and supported the existing product,
while rewriting the underlying software from the ground up.
• In our V2014 release, we introduced a fundamental new capability in
Analytic Solver Platform, Risk Solver Platform and Premium Solver
Platform for building optimization and simulation models in Excel:
Dimensional Modeling. It introduces new concepts such as dimensions and
cubes, and provides all the tools you need to build and solve larger scale,
better structured, more maintainable models using these concepts.
• In our V2014-R2 release, Analytic Solver Platform included a completely
re-engineered, far more powerful data mining and forecasting capability
named XLMiner Platform. New data mining algorithms are up to 100 times
faster, constantly exploit multiple processor cores, and offer greater
accuracy and numeric stability.
• Our V2015 release introduced a wide range of new features, including
powerful text mining and ensemble methods for classification and
prediction in XLMiner Platform; feature selection, partitioning “on-the-fly,”
ROC/RROC curves, enhanced linear and logistic regression, and more in
XLMiner Pro and Platform; extensive chart enhancements, distribution
fitting, and new Six Sigma functions in Risk Solver Pro and Platform; and
support for “publishing” optimization and simulation models to Excel for
the Web (formerly Excel Online) and Google Sheets. This feature was
further extended to handle large-scale models in V2015-R2.
• Our V2015-R2 release made it easy to share results in the cloud: You can
transfer optimization, simulation, or data mining results into your Microsoft
Power BI online account, visualize those results with just a few clicks, and
share them with others. Similarly, you can export optimization, simulation
or data mining results into Tableau Data Extract (*.tde) files that can be
opened in Tableau, the popular interactive data visualization
software. V2015-R2 also links your Excel workbook with “Big Data”:
You can easily obtain sampled and summarized results drawn from the
largest datasets, stored across many hard disks, in compute clusters running
Apache Spark.



V2016 Release
Our V2016 release dramatically speeds the process of moving from an
“analytic Excel model” to a “deployed application, available to others.”
With the new Create App feature, which translates your Excel optimization
or simulation model into Frontline’s new RASON modeling language, you
can create an application that can run in a web browser, or a mobile app
for phones or tablets – with just two mouse clicks! Your app solves
problems via our RASON server, running 24x7 on Microsoft Azure, using
its REST API.
Other enhancements in V2016 include a Task Pane navigator for data
mining in XLMiner Pro and Platform, and a greatly enhanced
Evolutionary Solver in Premium Pro and Platform, and Risk Solver Pro
and Platform. A completely new local search algorithm called SQP-GS
(Sequential Quadratic Programming with Gradient Sampling), and new
Feasibility Pump methods for both continuous and integer variables help
the Evolutionary Solver find better solutions, faster than ever.

V2016-R2 Release
In our V2016-R2 release, simulation models can use compound
distributions, with either a constant or a discrete distribution as the
‘frequency’ element, and correlation using copulas (Gaussian, Student and
Archimedean forms), as well as rank-order correlation, to generate samples
for multiple uncertain variables.
In optimization, the Evolutionary Solver is further enhanced with new ‘GA
methods’ for integer variables, often yielding much better integer solutions
than previous releases in a given amount of time. And users of
Dimensional Modeling will see major improvements in speed and memory
use, thanks to new support for ‘sparse cubes’. Export of analytic model
results to Tableau data visualization software is more convenient than ever
in V2016-R2 with support for the Tableau Web Data Connector
introduced in Tableau 9.1.

V2016-R3 Release
With our V2016-R3 release for Excel, we’re introducing
AnalyticSolver.com, a new cloud-based platform for both predictive and
prescriptive analytics models that you can use via a web browser –
including all the optimization, simulation, and data mining power found in
our desktop products. Your models are solved using our Azure-based
RASON cloud servers.
The AnalyticSolver.com user interface works just like our Excel user
interface, with a Ribbon, Analytic Solver Platform and XLMiner Platform
tabs with the same icons and dropdown menus, and a Task Pane with
Model, Platform, Engine and Output tabs. Both V2016-R3 in Excel and
AnalyticSolver.com include a new “Solver Home” tab on the Ribbon that
makes it easy to move Excel workbooks and other files between desktop
and cloud. And access to AnalyticSolver.com is included with your
V2016-R3 license for desktop Excel.



V2017 Release
In V2017, we are introducing commercial users to AnalyticSolver.com, a new
cloud-based platform for both predictive and prescriptive analytics models that
you can use via a web browser – including all the optimization, simulation, and
data mining power found in the desktop version. The AnalyticSolver.com user
interface works just like our Excel user interface, with a Ribbon and Task Pane.
Both V2017 in Excel and AnalyticSolver.com include a new “Solver Home” tab
on the Ribbon that makes it easy to move Excel workbooks and other files
between desktop and cloud. And access to AnalyticSolver.com is included with
your V2017 license for desktop Excel.
V2017 also uses a new licensing system that offers you more flexible ways to
use the software, both desktop and cloud. Your license is associated with you,
and may be used on more than one PC. For example, you can install the
software on your office PC, your company laptop, and your PC at home. But
only you can use Analytic Solver, and only on one of these computers at a time.
V2017 introduces Analytic Solver Basic, as described above, to give you access
to all Analytic Solver features, all the time, for learning purposes using small
models. It also includes a new License/Subscription Manager and a Product
Selection Wizard that makes it much easier to upgrade or change your license
subscription on a self-service basis, and a new Test Run/Summary feature that
lets you see exactly how your model will run with an Analytic Solver upgrade,
even a plug-in large-scale Solver Engine, before you purchase the upgrade – and
do this any time, not just during a 15-day free trial.
V2017 includes major enhancements to data mining: Automatic support for
categorical variables in many classification and prediction algorithms that
‘normally’ require continuous variables; ensembles that combine nearly any
type of algorithm as a ‘weak learner’, not just trees, for example; general-
purpose Rescaling as a new Data Transformation method that can also be
applied ‘on-the-fly’ when training a model; greatly enhanced multilayer neural
networks; ability to export models in PMML; and many report and chart
enhancements.
The V2017 Evolutionary Solver includes another set of major enhancements in
its handling of non-smooth models with integer variables – enough so that most
such models will solve significantly faster. And there’s support for the Tableau
Web Data Connector 2.0, and a new SolverSetup program that automatically
installs the correct 32-bit or 64-bit version of the software.

V2017-R2 Release
V2017-R2 includes major enhancements to Monte Carlo simulation/risk analysis
and optimization. It’s now possible to fit copula parameters to historical data –
a complement to distribution fitting that is sometimes called “correlation
fitting.” You can use a new family of probability distributions, called the
Metalog distributions, even more general than the Pearson distributions –
members of the family can be chosen based directly on historical data (even just
a few observations), without a distribution fitting process.
The V2017-R2 PSI Interpreter includes major speed enhancements for large
linear and nonlinear optimization models – users with large models are likely to
see a dramatic speedup in “Setting Up Problem…” Also part of this release are
new, higher performance versions of the Gurobi Solver Engine (based on Gurobi
7.5), the Xpress Solver Engine (based on Xpress 30.1), and the Knitro Solver
Engine (based on Knitro 10.3).

Creating Power BI Custom Visuals


The most exciting new feature of V2017-R2 is the ability to turn your Excel-
based optimization or simulation model into a Microsoft Power BI Custom
Visual, with just a few mouse clicks! Where others must learn JavaScript (or
TypeScript) programming and a whole set of Web development tools to even
begin to create a Custom Visual, you’ll be able to create one right away.
You simply select rows or columns of data to serve as changeable parameters,
then choose Create App – Power BI, and save the file created by V2017-R2.
You click the Load Custom Visual icon in Power BI, and select the file you just
saved. What you get isn’t just a chart – it’s your full optimization or simulation
model, ready to accept Power BI data, run on demand on the web, and display
visual results in Power BI! You simply need to drag and drop appropriate
Power BI datasets into the “well” of inputs to match your model parameters.
How does that work? The secret is that V2017-R2 translates your Excel model
into RASON® (RESTful Analytic Solver Object Notation, embedded in JSON),
then “wraps” a JavaScript-based Custom Visual around the RASON model. See
the chapter “Creating Your Own Application” for full details!

V2018 Release
V2018 extends Analytic Solver’s forecasting and data mining features with a
new capability called data mining workflows that can save a lot of time and
eliminate repetitive steps. You can combine nearly any of Analytic Solver’s
data retrieval, data transformation, forecasting and data mining methods into a
single, all-inclusive workflow, or pipeline.
Using the new Workflow tab in the Task Pane, you can either “drag and drop”
icons onto a “canvas” to create a workflow diagram, or you can simply turn on a
workflow recorder, carry out the steps as you’ve always done by choosing
menu options and dialog selections, and the workflow diagram will be created
automatically. Once the diagram or pipeline is created, you can “run” it in one
step – each data mining method in the workflow will be executed in sequence.
In previous releases, you could use the trained model from a single data mining
method (such as a Classification Tree or Neural Network) to “score” new data,
by mapping features (columns) between the training set and new data set. In
V2018, you can apply an entire workflow – including data transformations,
partitioning, model training, and more – to a new dataset, by mapping features
(columns) between the dataset used to create the workflow, and a new dataset.

Creating Tableau Dashboard Extensions


Another exciting new feature of V2018 is the ability to turn your Excel-based
optimization or simulation model into a Tableau Dashboard Extension, with
just a few mouse clicks! This is quite similar to the ability to create Power BI
Custom Visuals introduced in Analytic Solver V2017-R2. It works (only) with
Tableau version 2018.2 or later.
You simply select rows or columns of data to serve as changeable parameters,
then choose Create App – Tableau, and save the file created by V2018. In
Tableau, drag the Extensions object onto your dashboard, and choose the file
you just saved. You’ll be prompted to match the parameters your model needs
with data in Tableau. What you get isn’t just a chart – it’s your full optimization
or simulation model, ready to accept Tableau data, run on demand (using our
RASON server), and display visual results in Tableau!

V2019 Release
In Frontline Solvers V2019, we are introducing Analytic Solver Cloud – a
next-generation product that’s the result of five years of development using new
cloud technologies, that we can now bring to you – since Microsoft has released
a complete set of JavaScript APIs for new Excel features, such as functions like
PsiNormal() and PsiMean() used in simulation and risk analysis.
While we have new and enhanced features in development, in the V2019 release
we have focused on a consistent user experience between Analytic Solver
Desktop and Analytic Solver Cloud. In a few cases, this has involved
modifications to Analytic Solver Desktop. For example, older versions of
Analytic Solver Desktop used “cascading submenus” to select probability
distributions, and results in Monte Carlo simulations. Since a JavaScript add-in
cannot define or use cascading submenus, we have modified Analytic Solver
Desktop and Analytic Solver Cloud so that the Distributions button on the
Ribbon displays a dialog, rather than a cascading submenu, where you can select
an appropriate probability distribution. Yet the order and layout of probability
distributions remains the same as in previous Analytic Solver versions.
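
To make this concrete, here is a minimal illustration (the cell references are
hypothetical; see the Frontline Solvers Reference Guide for full details of each
function):

  =PsiNormal(100, 10)   entered in cell B2, defines an uncertain input with
                        mean 100 and standard deviation 10
  =PsiMean(B2)          entered in any other cell, returns the mean of B2
                        across all Monte Carlo trials after a simulation run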

V2020 Release
With each new release, Analytic Solver gives you more! Our latest enhance-
ments apply to both Analytic Solver Desktop and Analytic Solver Cloud.
In V2020, the LP/Quadratic Solver – probably the most-used Solver Engine in
Analytic Solver – features significantly improved performance on linear mixed-
integer models. Prior versions of this Solver would use only one processor core
at a time, but V2020 will use multiple processor cores to speed your solution.
The plug-in large-scale Solver Engines in Analytic Solver V2020 also feature
significantly improved performance (they continue to utilize multiple processor
cores). These include the Gurobi Solver V9.0, with a new ability to solve non-
convex quadratic models; the Xpress Solver V35, with a new Solution Refiner,
and the Artelys Knitro Solver V12.1, with SOCP and MIP speedups.
Analytic Solver V2020 includes 12 new probability distribution functions,
enhanced property functions for the PSI Distribution functions, new property
functions for the PSI Statistics functions, and new “theoretical” functions that
return analytic moments of distributions. Full details are in the Frontline
Solvers Reference Guide, but here’s a partial list of new/enhanced functions:
PsiBurr PsiLevyAlt PsiTheoMin
PsiDagum PsiHypSecantAlt PsiTheoMax
PsiDblTriang PsiCumulD PsiTheoVariance
PsiFatigue PsiLaplace PsiTheoStdDev
PsiFdist PsiCauchy PsiTheoSkewness
PsiFrechet PsiTruncate PsiTheoKurtosis
PsiHypSecant PsiCensor PsiTheoRange
PsiJohnsonSB PsiLock PsiTheoPercentile
PsiJohnsonUB PsiOutput PsiTheoPercentileD
PsiKumaraswamy PsiTheoMean PsiTheoTarget
PsiReciprocal PsiTheoMedian PsiTheoTargetD
PsiLevy PsiTheoMode PsiCategory
These functions make it even easier to adapt risk analysis models developed
with other popular Excel add-ins to work with Analytic Solver. A simple Find
and Replace of the function name prefix with ‘Psi’ is often all you need. And
unlike those other Excel add-ins, with Analytic Solver you can easily run your
model in the cloud with Excel for the Web, translate your model to RASON,
and use it in Power BI, Tableau, or your own web or mobile application!
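
For example (an illustrative formula, not drawn from any specific model): a
cell containing =RiskNormal(100,10) from another add-in becomes
=PsiNormal(100,10) after a Find and Replace of the “Risk” prefix with “Psi” –
the arguments are unchanged. Functions whose names differ by more than the
prefix may need the conversion table in the Frontline Solvers Reference Guide.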

What’s New in Analytic Solver V2020.5


Analytic Solver V2020.5 in mid-2020 is our latest and most powerful release for
Microsoft Excel. Again, all enhancements apply to both Analytic Solver
Desktop and Cloud versions. Significant enhancements to both Monte Carlo
simulation and optimization are included – but the most exciting new feature is a
greatly expanded Create App facility that makes it easy to deploy your Excel
analytic model as a cloud service (thanks to RASON), usable from nearly any
corporate, web or mobile application. What’s more, you can manage, monitor
and update your own cloud services, without ever leaving Excel!

Easily Deploy Your Model as a Cloud Service


We’ve realized for many years that developing and testing your analytic model
in Excel is often just the first step: To gain the real business value from the
decisions that can be made with your model, it’s often necessary to get the
model into the hands of other people in the business – in a form where they
can easily ensure that it has up-to-date data, re-run the model’s optimization,
simulation, or data mining process, and either view the results, or plug them
into another software application or process.



In our 2017 and 2018 Analytic Solver releases, we took the steps that were
possible at that time, enabling users to get their models into the hands of Power
BI and Tableau users. And we built a facility to translate simpler Excel models
into the RASON modeling language, enabling them to be solved in our cloud
platform (RASON is an acronym for RESTful Analytic Solver Object Notation).
But up to this point, a typical Excel user would still need help from a web
developer, or would need to learn JavaScript and other web development skills, to
make truly effective use of this facility.
Now in our V2020.5 release of both Analytic Solver and RASON, we’ve gone
much further to simplify the process of deploying an Excel model as a cloud
service, and connecting it to databases and cloud data sources. The RASON
cloud service will now accept and run Excel workbook models “on a par” with
models written in the RASON modeling language. With the Create App menu
option, you can turn your Excel model into a cloud service in seconds.

As an Analytic Solver user, you can now create and test models, deploy them
“to the cloud” – point and click – as full-fledged RESTful decision services,
and even get reports of recent runs of your decision services, all without leaving
Excel. Using our web portal at https://rason.com, you can go further – even
embed your Excel workbook in a multi-stage “decision flow” that can combine
SQL, RASON, Excel, and DMN models, passing results from stage to stage.
We start you out with a RASON “basic” license, so you can try out these new
capabilities without purchasing anything else! (Of course, you may need an
upgraded RASON license to deploy your model to many users, and re-solve it
hundreds or thousands of times on our cloud servers.)



More About RASON Decision Services
RASON is an Azure-hosted cloud service that enables your company to easily
embed 'intelligent decisions' in a custom application, manual or automated
business process, applying the full range of analytics methods – from simple
calculations and business rules to data mining and machine learning, simulation
and risk analysis, and conventional and risk-based optimization.
RASON Decision Services can be used from nearly any application, via a
series of simple REST API requests to https://rason.net. To express the full
range of analytic models, RASON includes a high-level, declarative modeling
language, syntactically embedded in JSON (JavaScript Object Notation), the
popular structured format almost universally used in web and mobile apps.
RASON results appear in JSON, or as more structured OData JSON endpoints.
RASON Decision Services also includes comprehensive data access support for
Excel, SQL Server on Azure, Power BI, Power Apps, Power Automate (aka
Microsoft Flow) and Dynamics 365. And it includes powerful model
management tools, such as tracking model versions including “champions and
challengers”, monitoring model results, and automated scheduling of runs for
both models and multi-stage decision flows.
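
To give a flavor of the notation, here is a small, purely illustrative sketch of
an optimization model embedded in JSON – the property names and schema here are
an approximation, and the exact syntax should be checked against the RASON
documentation at https://rason.com:

  {
    "variables":   { "x": { "lower": 0 }, "y": { "lower": 0 } },
    "constraints": { "budget": { "formula": "6*x + 8*y <= 48" } },
    "objective":   { "profit": { "type": "maximize", "formula": "5*x + 4*y" } }
  }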

How You Can Use RASON


You can use RASON to quickly and easily create and solve optimization,
simulation/risk analysis, data mining, decision table, and decision flow models –
instantly deployed as cloud services. You can learn RASON, create models,
supply data and solve them, and even manage model versions and cloud data
connections, “point and click” using https://rason.com, our “web portal” to the
underlying REST API service.



If you’ve used another modeling language to build an analytic model, you’ll
find the RASON language to be simple but powerful and expressive – and
integrating RASON models into a larger application, especially a web or mobile
app, is much easier than with other modeling languages. Excel users will find
that RASON includes virtually the entire Excel formula language as a subset.
If you’ve used tools based on the DMN (Decision Model and Notation)
standard, you’ll find that RASON – and Analytic Solver, as shown in the
chapter “Building Decision Tables” – fully support DMN and FEEL Level 2.
Unlike existing “heavyweight” Business Rule Management Systems, with year-
long implementation schedules, six-figure budgets and limited analytics power,
RASON Decision Services enables you to get results in just weeks to months,
from building and testing models, to deploying them across an organization.
With RASON, you can build successful POCs (Proofs of Concept) without any
IT or professional developer support – yet RASON is very “IT and developer
friendly” when you’re ready to deploy your POC across your company.

New Time Series Simulation Functions


Analytic Solver V2020.5 includes another new set of PSI Distribution functions
and related PSI property functions, focused around time series simulation.
Earlier Analytic Solver versions supported time series simulation using
functions such as PsiForecast() and PsiPredict(), and models fitted via Analytic
Solver Data Mining – but V2020.5 goes further, to support time series functions
found in other popular Excel add-ins, such as Palisade’s @RISK. Full details
are in the Frontline Solvers Reference Guide, but here’s a partial list of
new/enhanced functions:
PsiAR1 PsiBMMR PsiAPARCH11
PsiAR2 PsiGBMJD PsiTSTransform
PsiMA1 PsiARCH1 PsiTSIntegrate
PsiMA2 PsiGARCH11 PsiTSSeasonality
PsiARMA11 PsiEGARCH11 PsiTSSync
With these functions, virtually any risk analysis model developed with other
popular Excel add-ins, such as Palisade’s @RISK, can be easily made to work
with Analytic Solver. An appendix in the Frontline Solvers Reference Guide,
“@Risk to Analytic Solver Psi Function Conversion Table”, explains the details.
And with Analytic Solver, you can easily deploy your risk analysis model as a
cloud service – usable from Tableau, Power BI, Power Apps, Power Automate,
or virtually any corporate, web or mobile application!
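
As background, an AR1 (first-order autoregressive) process – the kind of series
a function like PsiAR1 is designed to simulate – has the standard form
x(t) = mu + phi*(x(t-1) - mu) + eps(t), where mu is the long-run mean, phi
controls how strongly each value reverts toward that mean, and eps(t) is a
random shock drawn on each Monte Carlo trial. The exact parameterization and
argument order used by PsiAR1 and the related functions are given in the
Frontline Solvers Reference Guide.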

New Optimization Result Functions


In every version of Analytic Solver (and its predecessors, such as Premium
Solver and the Excel Solver), you could obtain all the properties of an optimal
solution – such as initial and final values, dual values, and ranges for decision
variables and constraints – via the Answer Report and Sensitivity Report, which
are inserted into your Excel workbook as new worksheets. But what if you want
only select subsets of these values – and you’d like to have them on the same
worksheet as your model? That’s now possible in Analytic Solver V2020.5.
Just type these new functions into cells, or use the Function Wizard in Excel.



PsiInitialValue PsiDualValue PsiCalcValue
PsiFinalValue PsiDualLower PsiOptStatus
PsiSlackValue PsiDualUpper PsiModelDesc
These new functions have another purpose in V2020.5, when you use Analytic
Solver’s enhanced Create App facility to deploy your model as a cloud service:
You can use them to determine a select subset of values from the solution that
you want to return from your cloud service to a calling application.
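
For example (a sketch that assumes a model with a decision variable in cell C4
and a constraint in cell D6; check the Reference Guide for each function’s
exact arguments):

  =PsiFinalValue(C4)   returns the value of the decision variable in C4 at the
                       solution just found
  =PsiDualValue(D6)    returns the dual value (shadow price) associated with
                       the constraint in D6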

Analytic Solver Product Line


This Guide shows you how to create and evaluate forecasting and data mining
models using Analytic Solver.
In Frontline Solvers V2020, every license starts with Analytic Solver Basic,
which allows you to use every feature described in this User Guide and the
companion Reference Guide, for learning purposes with small models. Upgrade
versions enable you to ‘scale up’ and solve commercial-size models for
optimization, simulation or data mining, paying for only what you need – but
you keep access to all the features of Analytic Solver Basic.
Analytic Solver combines and integrates the features of Frontline’s products for
conventional optimization (formerly called Premium Solver Pro and Premium
Solver Platform), Monte Carlo simulation and stochastic optimization
(formerly Risk Solver Pro and Risk Solver Platform), and forecasting and data
mining (formerly XLMiner Pro and XLMiner Platform), in a common user
interface that’s available both in Excel (desktop) and in your browser (cloud).
Analytic Solver’s optimization features are fully compatible upgrades for the
Solver bundled within Microsoft Excel, which was developed by Frontline
Systems for Microsoft. Your Excel Solver models and macros will work
without changes; you can use either the classical Solver Parameters dialog, or
a newer Task Pane user interface to define optimization models. For more
information on the optimization and simulation features included in Analytic
Solver Basic, see the Analytic Solver User and Reference guides.

Cloud Products

Desktop and Cloud versions


The release of Analytic Solver V2020 brings our latest offerings, Analytic
Solver Cloud and Data Mining Cloud, accessible through Excel for the Web or
desktop Excel. Use Analytic Solver Cloud to solve optimization and simulation
models, and use the Data Mining Cloud app to perform data mining or
forecasting. Both cloud apps support existing models created in previous
versions of Analytic Solver. Your license
for Analytic Solver will allow you to use Analytic Solver Desktop in desktop
Excel, or Analytic Solver Cloud in either desktop Excel (Microsoft Excel
V2020 or later is required) or Excel for the Web.
A license for a desktop product will grant you a license for the corresponding
product in a cloud app. For example, a license for Analytic Solver Optimization
in Analytic Solver Desktop will also grant you a license for Analytic Solver
Optimization in the Analytic Solver Cloud app. The overwhelming majority of
features in Analytic Solver Desktop are also included in Analytic Solver Cloud
and Data Mining Cloud. However, there could be slight differences in the way
users execute a function in the cloud apps.
Analytic Solver for desktop or cloud may be purchased in several different ways,
starting with the most basic version, Analytic Solver Basic, up to our most
complete version, Analytic Solver Comprehensive. Continue reading to see
which product will best meet your needs.

Analytic Solver Basic


As described above, Analytic Solver Basic allows you to use every feature of
Frontline Solvers V2020, for learning purposes with small models. Its model
size limits for optimization are identical to those of the Solver bundled with
Microsoft Excel (200 decision variables and 100 constraints); it doesn’t support
plug-in large-scale Solver Engines. Its size limits for Monte Carlo simulation,
data mining and text mining are sufficient to run all the examples that we install
with the software.

Analytic Solver Upgrade


Analytic Solver Upgrade (formerly Premium Solver Pro) is Frontline’s basic
upgrade for the Excel Solver – enabling you to solve linear models 10 times
larger (up to 2,000 variables), and nonlinear models 2.5 times larger (up to 500
variables). It includes faster versions of the LP/Quadratic, GRG Nonlinear, and
Evolutionary Solvers, but it doesn't support plug-in large-scale Solver Engines.
This product includes all the features of Analytic Solver Basic.

Analytic Solver Optimization


Analytic Solver Optimization (formerly Premium Solver Platform) is
Frontline’s most powerful product for conventional optimization. It includes the
PSI Interpreter, five built-in Solvers (LP/Quadratic, SOCP Barrier, GRG
Nonlinear, Interval Global, and Evolutionary), solves linear models up to 8,000
variables and nonlinear models up to 1,000 variables, and it supports plug-in
large-scale Solver Engines to handle much larger models. When used with
Analytic Solver Simulation, you can also solve models with uncertainty using
simulation optimization, stochastic linear programming, and robust
optimization. In addition, this product includes all the features of Analytic
Solver Basic.

Analytic Solver Simulation


Analytic Solver Simulation (expanded from Risk Solver Pro) is Frontline’s
full-function product for Monte Carlo simulation and simulation optimization.
It includes decision tree capabilities and the PSI Interpreter – which gives you
the fastest Monte Carlo simulations available in any Excel-based product,
unique interactive simulation capabilities, multiple parameterized simulations,
and simulation optimization using the Evolutionary Solver. When used with
Analytic Solver Optimization, you can also solve models with uncertainty using
stochastic linear programming and robust optimization. In addition, this product
includes all the features of Analytic Solver Basic.
Together, Analytic Solver Optimization and Analytic Solver Simulation provide
all the capabilities of Frontline’s former product Risk Solver Platform.

Analytic Solver Data Mining


Analytic Solver Data Mining (formerly XLMiner Platform) is Frontline’s most
powerful product for data mining, text mining, forecasting and predictive
analytics. It includes data access and sampling, data exploration and
visualization, text mining, data transformation, and feature selection capabilities;
time series forecasting with ARIMA and exponential smoothing; and a wide
range of data mining methods for classification, prediction and affinity analysis,
from multiple regression to neural networks. This product includes all the
features of Analytic Solver Basic.

Analytic Solver Comprehensive


Analytic Solver Comprehensive (formerly Analytic Solver Platform) combines
the optimization capabilities of Analytic Solver Optimization, the simulation
capabilities of Analytic Solver Simulation, and the data mining capabilities of
Analytic Solver Data Mining. It includes the PSI Interpreter, five built-in
Solvers (LP/Quadratic, SOCP Barrier, GRG Nonlinear, Interval Global, and
Evolutionary) and it accepts a full range of plug-in large-scale Solver Engines.
It supports optimization, Monte Carlo simulation, simulation optimization,
stochastic programming and robust optimization, and large-scale data mining
and forecasting capabilities.
See the chart below to compare the data handling limits between Analytic Solver
Data Mining and Analytic Solver Basic. (Analytic Solver Basic is available for
classroom use or by purchasing a textbook that contains the software. For a
complete list of textbooks that include Analytic Solver for Education or Analytic
Solver Basic, see our Website at www.solver.com.) "Unlimited" indicates that
no limit is imposed; however, other limits may apply based on your computer
resources and Excel version.

Feature                                             Data Mining/Comprehensive   Basic

Partitioning
# of Records (original data)                        Unlimited                   65,000
# of Records (training partition)                   Unlimited                   10,000
# of Variables (output)                             Unlimited                   50

Sampling
# of Records (original data)                        Unlimited                   65,000
# of Variables (output)                             Unlimited                   50
# of Strata (Stratified Sampling)                   Unlimited                   30

Database
# of Records (table)                                Unlimited                   1,000,000
# of Records (output)                               Unlimited                   65,000
# of Variables (table)                              Unlimited                   Unlimited
# of Variables (output)                             Unlimited                   50
# of Strata (Stratified Sampling)                   Unlimited                   30

File System
# of Files                                          Unlimited                   100

Text Mining
# of Documents                                      Unlimited                   100
# of Characters (per document)                      Unlimited                   5,000
# of Terms in final vocabulary                      Unlimited                   50
# of Text columns                                   Unlimited                   1

Transformation – Common
# of Records                                        Unlimited                   10,000
# of Variables                                      Unlimited                   50

Transformation – Missing Data Handling
# of Records                                        Unlimited                   65,000
# of Variables                                      Unlimited                   50

Transformation – Binning Continuous Data
# of Records                                        Unlimited                   65,000

Transformation – Transforming Categorical Data
# of Records                                        Unlimited                   65,000
# of Variables (data range)                         Unlimited                   50
# of Distinct values                                Unlimited                   30

Time Series Analysis
# of Records                                        Unlimited                   1,000

Classification and Prediction
# of Records (total)                                Unlimited                   65,000
# of Records (training partition)                   Unlimited                   10,000
# of Records (new data for scoring)                 Unlimited                   65,000
# of Variables (output)                             Unlimited                   50
# of Distinct classes (output variable)             Unlimited                   30
# of Distinct values (categorical input variables)  Unlimited                   15

k-Nearest Neighbors
# of Nearest neighbors                              50                          10

Regression/Classification Trees
# of Splits                                         Unlimited                   100
# of Nodes                                          Unlimited                   100
# of Levels                                         Unlimited                   100
# of Levels in Tree Drawing                         Unlimited                   7

Ensemble Methods
# of Weak learners                                  Unlimited                   10

Feature Selection
# of Records                                        Unlimited                   10,000
# of Variables                                      Unlimited                   50
# of Distinct classes (output variable)             Unlimited                   30
# of Distinct values (input variables)              Unlimited                   100

Association Rules
# of Transactions                                   Unlimited                   65,000
# of Distinct items                                 5,000                       100

Clustering – K-Means
# of Records                                        Unlimited                   10,000
# of Variables                                      Unlimited                   50
# of Clusters                                       Unlimited                   10
# of Iterations                                     Unlimited                   50

Clustering – Hierarchical
# of Records                                        Unlimited                   10,000
# of Variables                                      Unlimited                   50
# of Clusters in a Dendrogram                       Unlimited                   10
Size of distance matrix                             Unlimited                   1,000 x 1,000

Charts
# of Records                                        Unlimited                   65,000
# of Variables (original data)                      Unlimited                   100

General
Model pane                                          Included                    Included
Big Data sampling/summarization                     Included                    Included
Model storage and scoring                           Included                    Included
Data Mining Ribbon Overview


Analytic Solver Data Mining software offers over 30 different methods for
analyzing a dataset in order to forecast future events. The Data Mining ribbon is
broken up into five different segments as shown in the screenshot below.

Desktop Analytic Solver Data Mining

Data Mining Cloud



• Click the Model button to display the Solver Task Pane. This new feature
(added in V2016) allows you to quickly navigate through datasets and
worksheets containing Analytic Solver Data Mining results.
• Click the Get Data button to draw a random sample of data, or summarize
data from (i) an Excel worksheet, (ii) the PowerPivot “spreadsheet data
model” which can hold 10 to 100 million rows of data in Excel, (iii) an
external SQL database such as Oracle, DB2 or SQL Server, or (iv) a dataset
with up to billions of rows, stored across many hard disks in an external Big
Data compute cluster running Apache Spark (https://spark.apache.org/).
• You can use the Data Analysis group of buttons to explore your data, both
visually and through methods like cluster analysis, transform your data with
methods like Principal Components, Missing Value imputation, Binning
continuous data, and Transforming categorical data, or use the Text Mining
feature to extract information from text documents.
• Use the Time Series group of buttons for time series forecasting, using both
Exponential Smoothing (including Holt-Winters) and ARIMA (Auto-
Regressive Integrated Moving Average) models, the two most popular time
series forecasting methods from classical statistics. These methods forecast
a single data series forward in time.
• The Data Mining group of buttons gives you access to a broad range of
methods for prediction, classification and affinity analysis, from both
classical statistics and data mining. These methods use multiple input
variables to predict an outcome variable or classify the outcome into one of
several categories. Since V2015, Analytic Solver Data Mining, and now the
Data Mining Cloud app, have offered Ensemble Methods for use with
Classification Trees, Regression Trees, and Neural Networks.
• Use the Predict button to build prediction models using Multiple Linear
Regression (with variable subset selection and diagnostics), k-Nearest
Neighbors, Regression Trees, and Neural Networks. Use Ensemble
Methods with Regression Trees and Neural Networks to create more
accurate prediction models.
• Use the Classify button to build classification models with Discriminant
Analysis, Logistic Regression, k-Nearest Neighbors, Classification Trees,
Naïve Bayes, and Neural Networks. Use Ensemble Methods with
Classification Trees and Neural Networks to create more accurate
classification models.
• Use the Associate button to perform affinity analysis (“what goes with
what” or market basket analysis) using Association Rules.
If forecasting and data mining are new to you, don’t worry – you can learn a lot
about them by consulting our extensive in-product Help. Click Help – Help
Text on the Data Mining tab, or click Help – Help Text – Forecasting/Data
Mining on the Analytic Solver tab (these open the same Help file).
If you’d like to learn more and get started as a ‘data scientist,’ consult the
excellent book Data Mining for Business Intelligence, which was written by the
original Data Mining (formerly known as XLMiner) designers and early
academic users. You’ll be able to run all the Data Mining examples and
exercises in Analytic Solver.
Analytic Solver Data Mining, along with the Data Mining Cloud app, can be
purchased as a stand-alone product. A stand-alone license for Analytic Solver
Data Mining includes all of the data analysis, time series data capabilities,
classification and prediction features available in Analytic Solver
Comprehensive, but does not support optimization or simulation. See the data
specifications chart earlier in this chapter for each product's limits.



Model
Click the Model button in Analytic Solver Desktop to display the Solver Task
Pane within Analytic Solver Data Mining. From the Model tab, you can easily
navigate between worksheets containing data and results. Note that in the Data
Mining Cloud app, only the Workflow tab is present on the task pane.

All fields contained in the dataset are listed under the name of each data-
containing worksheet (for example, Data!$A$1:$O$507), while results obtained
from Analytic Solver Data Mining are listed under Reports or Transformations
by type and run number.
See the next chapter, Creating Workflows, for information on how to create a
workflow in Analytic Solver Data Mining.

Get Data
Analytic Solver Data Mining includes several different methods for importing
your data, including sampling from either a worksheet or a database, or
importing from a file folder.

Click the Get Data icon to take a representative sample from a dataset included
in either an Excel workbook or an Oracle, SQL Server, MS-Access, or Power
Pivot database. Users can choose to sample with or without replacement using
simple or stratified random sampling. Click Get Data – File Folder to import
and/or sample from a collection of text documents for use with Text Mining.
Click Get Data – Big Data to sample from or summarize a dataset with
up to billions of rows, stored across many hard disks in an external compute
cluster running Apache Spark. Results may be requested immediately upon
completion or at a later time using Get Results.



Data Analysis
Analytic Solver Data Mining includes several different methods for data
analysis: sampling from either a worksheet or a database (not supported in the
Cloud app); charting, with 8 different types of available charts; transformation
techniques that handle missing data, bin continuous data, create dummy
variables and transform categorical data; Principal Components Analysis, to
reduce and eliminate superfluous or redundant variables; and two different
types of clustering techniques, k-Means and Hierarchical.

Click the Explore icon to use Feature Selection to help decide which variables
should be included in your classification or prediction models or use the Chart
Wizard to create one or more charts of your data. The new Feature Selection
tool can help give insight into which variables are the most important or relevant
for inclusion in your classification or prediction model using various types of
statistics and data analysis measures. Analytic Solver Data Mining includes 8
different chart types: bar charts, line charts, scatterplots, boxplots, histograms,
parallel coordinates charts, scatterplot matrix charts and variable charts. Click
this icon to edit or view previously created
charts as well.
Click the Transformation icon when data manipulation is required. In most large
databases or datasets, some variables are bound to be missing data.
Analytic Solver Data Mining includes routines for dealing with these missing
values by allowing a user to either delete the full record or apply a value of
her/his choice. Analytic Solver Data Mining also includes a routine for binning
continuous data for use with prediction and classification methods which do not
support continuous data. Continuous variables can be binned using several
different user specified options. Non-numeric data can be transformed using
dummy variables with up to 30 distinct values. If more than 30 categories exist
for a single variable, use the Reduce Categories routine to decrease the number
of categories to 30. Finally, use Principal Components Analysis to remove
highly correlated or superfluous variables from large databases.
Click the Cluster icon to gain access to two different types of clustering
techniques: k-Means clustering and hierarchical clustering. Both methods
allow insight into a database or dataset by performing a cluster analysis. This
type of analysis can be used to obtain the degree of similarity (or dissimilarity)
between the individual objects being clustered.
Click the Text icon to use the Text Miner tool to analyze a collection of text
documents for patterns and trends. (In the Cloud app, this tool is included in the
Text section of the Ribbon.) These algorithms can categorize documents,
provide links between documents that were not otherwise noted and create
visual maps of the documents. Analytic Solver Data Mining takes an integrated
approach to text mining by combining text processing and analysis in a single
package. While Analytic Solver Data Mining is effective for mining “pure text”
such as a set of documents, it is especially useful for “integrated text and data
mining” applications such as maintenance reports, evaluation forms, or any
situation where a combination of structured data and free-form text data is
available.

Time Series Analysis


Analytic Solver Data Mining also supports the analysis and forecasting of
datasets that contain observations generated sequentially (such as next year's
sales figures or monthly airline bookings) through partitioning, autocorrelations
and ARIMA models, and through smoothing techniques.
A time series model is first used to obtain an understanding of the underlying
forces and structure that produced the data, and then to fit a model that will
predict future behavior. In the first step, the analysis of the data, a model is
created to uncover seasonal patterns or trends in the data, for example bathing
suit sales in June. In the second step, forecasting, the model is used to predict the
value of the data in the future, for example, next year's bathing suit sales.
Separate modeling methods are required to create each type of model.

Typically, when using a time series dataset, the data is first partitioned into
training and validation sets. Click the Partition icon within the Time Series
ribbon segment to utilize the Time Series Data Partitioning routine. Analytic
Solver Data Mining features two techniques for exploring trends in a dataset,
ACF (Autocorrelation function) and PACF (Partial autocorrelation function).
These techniques help the user to explore various patterns in the data which can
be used in the creation of the model. After the data is analyzed, a model can be
fit to the data using the ARIMA method. All three of these methods can be
found by clicking the ARIMA icon on the Data Mining ribbon.
Data collected over time is likely to show some form of random variation.
"Smoothing techniques" can be used to reduce or cancel the effect of these
variations. These techniques, when properly applied, will “smooth” out the
random variation in the time series data to reveal any underlying trends that may
exist.
Click the Smoothing icon to gain access to Analytic Solver Data Mining’s four
different smoothing techniques: Exponential, Moving Average, Double
Exponential, and Holt Winters. The first two techniques, Exponential and
Moving Average, are relatively simple smoothing techniques and should not be
performed on datasets involving trends or seasonality. The third technique,
Double Exponential, should be used when a trend is present in the dataset, but
not seasonality. The last technique, Holt Winters, is a more advanced technique
and should be selected when working with datasets involving seasonality.

Data Mining
The Data Mining section of the Data Mining ribbon contains four icons:
Partition, Classify, Predict, and Associate. Note that in the Cloud app, Partition
has been moved into its own "Partition" section. Click the Partition icon
to partition your data into training, validation, and if desired, test sets. Click the
Classify icon to select one of six different classification methods. Click the
Predict icon to select one of four different prediction methods. Click the
Associate icon to recognize associations or correlations among variables in the
dataset.
Analytic Solver Data Mining supports six different methods for predicting the
class of an outcome variable (classification) along with three ensemble methods
which use these six methods as weak learners, and four different methods, along
with three ensemble methods, for predicting the actual value (prediction) of an
outcome variable. Classification can be described as categorizing a set of
observations into predefined classes in order to determine the class of an
observation based on a set of variables. A prediction method can be described as
a technique performed on a database either to predict the response variable value
based on a predictor variable or to study the relationship between the response
variable and the predictor variables; for example, determining the relationship
between the crime rate of a city or neighborhood and demographic factors such
as population, education, male to female ratio, etc.
One very important issue when fitting a model is how well the newly created
model will behave when applied to new data. To address this issue, the dataset
can be divided into multiple partitions before a classification or prediction
algorithm is applied: a training partition used to create the model, a validation
partition to test the performance of the model and, if desired, a third test
partition. Partitioning is performed randomly, to protect against a biased
partition, according to proportions specified by the user or according to rules
concerning the dataset type. For example, when creating a time series forecast,
data is partitioned by chronological order.
The six different classification methods are:
Discriminant Analysis - Constructs a set of linear functions of the
predictor variables and uses these functions to predict the class of a
new observation with an unknown class. Common uses of this method
include: classifying loan, credit card or insurance applicants into low
or high risk categories, classifying student applications for college
entrance, classifying cancer patients into clinical studies, etc.
Logistic Regression – A variant of ordinary regression which is used
to predict the response variable, or the output variable, when the
response variable is a dichotomous variable (a variable that takes only
two values such as yes/no, success/failure, survive/die, etc.).
k-Nearest Neighbors – This classification method divides a training
dataset into groups of k observations using a Euclidean Distance
measure to determine similarity between “neighbors”. These
classification groups are used to assign categories to each member of
the validation set.
Classification Tree – Also known as Decision Trees, this classification
method is a good choice when the goal is to generate easily understood
and explained “rules” that can be translated into SQL or a query language.
Naive Bayes – This classification method first scans the training
dataset and finds all records where the predictor values are equal. Then
the most prevalent class of the group is determined and assigned to the
entire collection of observations. If a new observation’s predictor
variable equals the predictor variable of this group, the new observation
will be assigned to this class. Due to the simplicity of this method, a
large number of records is required to obtain accuracy.



Neural Network – Artificial neural networks are based on the
operation and structure of the human brain. These networks process
one record at a time and “learn” by comparing their classification of the
record (which at the beginning is largely arbitrary) with the known
actual classification of the record. Errors from the initial classification
of the first records are fed back into the network and used to modify the
network's algorithm the second time around. This continues for many,
many iterations.
The four different predictive methods are:
Multiple Linear Regression – This method is performed on a dataset
to predict the response variable based on a predictor variable or used to
study the relationship between a response and predictor variable, for
example, student test scores compared to demographic information
such as income, education of parents, etc.
k-Nearest Neighbors – Like the classification method with the same
name above, this prediction method divides a training dataset into
groups of k observations using a Euclidean Distance measure to
determine similarity between “neighbors”. These groups are used to
predict the value of the response for each member of the validation set.
Regression Trees - A Regression tree may be considered a variant of a
decision tree, designed to approximate real-valued functions instead of
being used for classification methods. As with all regression
techniques, Analytic Solver Data Mining assumes the existence of a
single output (response) variable and one or more input (predictor)
variables. The output variable is numerical. The general regression tree
building methodology allows input variables to be a mixture of
continuous and categorical variables. A decision tree is generated when
each decision node in the tree contains a test on some input variable's
value. The terminal nodes of the tree contain the predicted output
variable values.
Neural Network – Artificial neural networks are based on the
operation and structure of the human brain. These networks process
one record at a time and “learn” by comparing their prediction of the
record (which at the beginning is largely arbitrary) with the known
actual value of the response variable. Errors from the initial prediction
of the first records are fed back into the network and used to modify the
network's algorithm the second time around. This continues for many,
many iterations.
Three ensemble methods, bagging, boosting, and random trees, are also
available for both classification and prediction. Each of these methods uses a
classification or prediction method as a weak learner. Each of these can be
accessed from either the Classify or Predict menus.
The goal of association rule mining is to recognize associations and/or
correlations among large sets of data items. A typical and widely-used example
of association rule mining is the Market Basket Analysis. Most 'market basket'
databases consist of a large number of transaction records where each record
lists all items purchased by a customer during a trip through the check-out line.
Data is easily and accurately collected through bar-code scanners.
Supermarket managers are interested in determining which foods customers
purchase together, such as bread and milk, bacon and eggs, or wine and
cheese. This information is useful in planning store layouts (placing items
optimally with respect to each other), cross-selling promotions, coupon offers,
etc.

Tools and Help


Click the Score icon to score new data in a database or worksheet with any of
the Classification or Prediction algorithms. This facility matches the input
variables to the database (or worksheet) fields and then performs the scoring on
the database (or worksheet).
Analytic Solver Data Mining also supports the scoring of Test Data. When
Analytic Solver Data Mining calculates prediction or classification results,
internal values and coefficients are generated and used in the computations.
Analytic Solver Data Mining saves these values to an additional output sheet,
termed Stored Model Sheet, which uses the output sheet name, XX_Stored_N
where XX are the initials of the classification or prediction method and N is the
number of generated stored sheets. This sheet is used when scoring the test data.
Note: In previous versions of XLMiner, this utility was a separate add-on
application named XLMCalc. Starting in XLMiner V12.5, this utility is
included free of charge. Starting in V2014-R2, the PsiClassify(), PsiPredict()
and PsiForecast() functions are available for instantaneous interactive scoring
on the worksheet without the need to click the Score icon.
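As a hypothetical sketch (the exact argument lists are documented in the
Frontline Solvers Reference Guide), a scoring formula of this kind would
reference the Stored Model Sheet and the input record to be scored:
=PsiPredict(LR_Stored!$A:$A, Data!B2:O2)
Here the first argument is assumed to point at the stored model and the second
at the record's input values; whenever the inputs change, the predicted value
recalculates immediately, with no need to click the Score icon.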
Click the License icon to open the Licensing Center where you can manage your
existing licenses or account, and log in and log out of the product.
Click Help to open an example dataset (over 25 example datasets are
provided and most are used in the examples throughout this guide), open the
online help, open this guide, or check for updates. See the next chapter for more
information on this menu.



Using Help, Licensing and
Product Subsets

Introduction
Analytic Solver Comprehensive (previously referred to as XLMiner™) is a
comprehensive data mining software package for use on the Web or as an add-in
to Excel. Data mining is a discovery-driven data analysis technology used for
identifying patterns and relationships in data sets. With overwhelming amounts
of data now available from transaction systems and external data sources,
organizations are presented with increasing opportunities to understand their
data and gain insights into it. Data mining is still an emerging field, and is a
convergence of fields like statistics, machine learning, and artificial intelligence.
Often, there may be more than one approach to a problem. Analytic Solver Data
Mining is a tool belt to help you get started quickly by offering a variety of
methods to analyze your data. It has extensive coverage of statistical and
machine learning techniques for classification, prediction, affinity analysis and
data exploration and reduction.
This chapter describes how Analytic Solver V2020 handles overall operation,
including registration, licensing, use of product subsets, and use of the Startup
Screen, online Help and examples.

Working with Licenses in V2020


A license is a grant of rights, from Frontline Systems to you, to use our software
in specified ways. Information about a license – for example, its temporary vs.
permanent status and its expiration date – is encoded in a license code. The
same binary files are used for all Analytic Solver products. The product
features you see depend on the license code you have.

Frontline License Manager:


In Analytic Solver V2020, the Frontline License Manager has replaced the
Reprise License Manager – its basic purpose is license control (allowing the
software to run, or not). But unlike Reprise, the new Frontline License Manager
ties a license to a human user ID / email, not to a hardware ID or “lock code”.
Starting with a user ID stored locally, it will ask for license rights for a user via a
web (REST API) request to Frontline's License Manager, and store this license
information locally. When our software is first installed, it will have an
embedded user ID and license code for a trial or evaluation license, so the
software can run for the trial period even without Internet access.

Upgrading from an Early Version of Desktop ASP


Analytic Solver Desktop 2020 can store license codes locally, but typically your
license will be stored on Frontline's license server. If you’re upgrading from
V2017/2018, a new V2020 license code will not be required. Simply download
and install V2020 for Desktop Excel and your existing V2017/V2018 license
will activate the newer version. Old license codes for V2016 and earlier have no
negative effect in V2020. If they exist in the obsolete Solver.lic file (located at
C:\ProgramData\Frontline Systems), they will be ignored. A license code will
be issued at the time of purchase.

Managing Your Licenses


Click the License button to open the License Manager where you can manage
your current licenses and accounts, open our Product Selection Wizard, connect
to Live Chat or peruse through a list of FAQs.
The "MyLicenses" tab displays your current license and license type, along with
the expiration date. You can request a quote to renew your current license or, if
your license has expired or is within 30 days of expiring, you can purchase a
new license through our online store.

Click About Analytic Solver to open the following dialog containing information
on this release.

Click the Account tab to view your account on www.solver.com. Click Edit
Profile to edit the information. Click Live Chat to open a Live Chat window or
Log Out to log out of the product.



Click the Product Guide tab to view a list of products and pricing information.
Click Product Selection Wizard to open the Product Selection Wizard. See the
next section for information on this feature.

Click the Questions tab to review a list of FAQs, submit a support ticket or start
a live chat.

Use the License menu to gain shortcuts to your account and to login or logout of
Analytic Solver.



Product Selection Wizard
Select Product Selection Wizard from the Product Guide tab in the Licensing
Center to open a series of dialogs that will help you determine which product
will best meet your needs based on your recent pattern of use.

Select the Product that you'd like to purchase and then click Next. Click the
Optimization Choices link to learn more about Analytic Solver products that can
solve optimization models and to find more information on speed, memory, and
the use of plug-in Solver Engines.

On this screen, the Product Selection Wizard will recommend a product or
products based on your answers on the previous screens. Click Upgrade to
purchase the recommended product. Click the Optimization Choices link to
learn more about Analytic Solver products that can solve optimization models.
If at any time you'd like to chat with a member of our Technical Support staff,
click Live Chat. Or if you'd like to amend your answers on a previous dialog,
click Back.



When you run a simulation or optimization model that contains too many
decision variables/uncertain variables or constraints/uncertain functions for the
selected engine, the Product Wizard will automatically appear and recommend a
product that can solve your model.

When you click “Test Run”, the Product Wizard will immediately run the
optimization or simulation model using the recommended product. (Only
summary information will be available.) At this point, you can purchase the
recommended product(s), or close the dialog.
This same behavior will also occur when solving smaller models, if you select a
specific external engine, from the Engine drop down menu on the Engine tab of
the Solver Task Pane, for which you do not have a license. The Product Wizard
will recommend the selected engine, and allow you to solve your model using
this engine. Once Solver has finished solving, you will have the option to
purchase the product.



Getting Help
Should you run into any problems downloading or installing any of our
products, we’re happy to help. Call us at 775-831-0300 or email us at
support@solver.com.
Click Help – Help Center to open the Help Center. Click Support Live Chat, in
the bottom right hand corner, to open a Live Chat window. If you run into any
issues when using the software, the best way to get help is to start a Live Chat
with our support specialists. This will start a Live Chat during our business
hours (or send us a message at other hours), just as if you were to start a Live
Chat on www.solver.com – but it saves you and our tech support rep a lot of
time – because the software reports your latest error message, model diagnosis,
license issue or other problem, without you having to type anything or explain
verbally what’s happened. You’ll see a dialog like this:

Since the software automatically sends diagnostic information to Tech Support,
we can usually identify and resolve the problem faster. (Note: No contents from
your actual spreadsheet model are sent, only information such as the number of
variables and constraints, last error message, and Excel and Windows version.)
Note: If Support Live Chat is disabled, click the down arrow beneath Help and
select Support Mode – Active Support.

Accessing Resources
The Help Center gives you easy access to video demos, User Guides, online Help,
example models, and Website support pages to learn how to use our software tools
and build an effective model.



User Guides
Click the User Guides menu choice to open PDF files of the Analytic Solver
Optimization and Simulation User and Reference Guides, the Analytic Solver
Data Mining User and Reference Guides, or our Quick Start Guides.

Example Models
Clicking this menu item will open the Frontline Solvers Example Models
Overview dialog with nearly 120 self-guided example models covering a range
of model types and business situations.

Knowledge Base
Click Knowledge Base to peruse a multitude of online articles related to support
and installation issues or to locate articles that will help you to quickly build
accurate, efficient optimization, simulation, and data mining models.

Operating Mode
Click Operating Mode to switch between three different levels of help. The
Excel formulas and functions you use in your model have a huge impact on how
fast it runs and how well it solves. If you learn more about this, you can get
better results, but if you don't, your results will be limited. Guided Mode can
help you learn.
• Guided Mode prompts you step-by-step when solving, with dialogs.
• Auto-Help Mode shows dialogs or Help only when there’s a problem or
error condition.
• Expert Mode provides only messages in the Task Pane Output tab.
(This mode is not supported when using a trial license.)

Support Mode
Click Support Mode to switch between three different levels of support. No
information (cell contents etc.) from your Excel model is ever reported
automatically to Frontline Systems, in any of these Support Modes. Only events
in Frontline's software, such as menu selections, Solver Result messages, or
error messages are reported.
• Active Support automatically reports events, errors and problems to
Frontline Support, receives and displays messages to you from Support,
and allows you to start a Live Chat with Support while working in
Excel (Recommended).
• Standard Support automatically reports events, errors and problems
anonymously (not associated with you) to Frontline Support, but does
not provide a means to receive messages or start a Live Chat with
Support.
• Basic Support provides no automatic connection to Frontline Support.
You will have to contact Frontline Support manually via email, website
or phone if you need help.

Submit a Support Ticket


If you're having installation, technical, or modeling issues, submit a Support
Ticket to open an online support request form. Submit your email address and a
short, concise description of the issue that you are experiencing. You'll receive
a reply from one of Frontline's highly trained Support Specialists within 24
hours, and generally much sooner.
Our technical support service is designed to supplement your own efforts: Getting you
over stumbling blocks, pointing out relevant sections of our User Guides or example
models, helping you fix a modeling error, or -- in rare cases -- working around an issue
with our software (always at our expense).

Solver Academy
Solver Academy is Frontline Systems' own learning platform. It's the place
where business analysts can gain expertise in advanced analytics: forecasting,
data mining, text mining, mathematical optimization, simulation and risk
analysis, and stochastic optimization.



Video Tutorials/Live Webinars
Click Video Tutorials to be directed to Frontline's YouTube channel. Browse
videos on how to create an optimization or simulation model or construct a data
mining or prediction model using Analytic Solver.
Click Live Webinars to be redirected to www.solver.com to join a live or pre-
recorded webinar. Topics include Using Analytic Solver Data Mining to Gain
Insights from your Excel Data, Overview of Monte Carlo Simulation
Applications, Applications of Optimization in Excel, etc.

Learn more!
Click any of the three Learn More buttons to learn more about how you can
solve large-scale optimization, simulation, and data mining models, reduce
costs, quantify and mitigate risk, and create forecasting, data mining and text
mining models using Analytic Solver.

Help Menu
Use the Help Menu to gain shortcuts to live chat, example models,
documentation, set your operating and support mode preferences, and also to
open the Welcome Screen and check for software updates.

Use the Welcome Screen to get help with an existing model, open our example
models or watch a quick video on how to get running quickly with Analytic
Solver.



Creating a Workflow

Introduction
The Workflow tab, released in version 2018, allows the combination of all
available data mining techniques into an all-inclusive workflow, or workflows.
Once the workflow is created, either manually or simply by recording your
actions, you can initiate the start of the pipeline, or pipelines, by clicking the
Play button. Note: This is the only tab supported in the Data Mining Cloud app.
Analytic Solver Comprehensive V2020 now allows you to export your existing
workflow to Frontline's Rason Cloud Services where it can be deployed to a
website without the need to write code! Continue reading to find out how.

Creating a Workflow
Workflows can be created in two ways: 1. By recording your actions
performed on the Data Mining ribbon after the Record button is pressed on the
Workflow tab or 2. By manually dragging the Data Mining nodes from the left
of the Workflow tab onto the Workflow window. The next two sections,
Recording a Workflow and Manually Creating a Workflow, contain examples to
illustrate each method.

Recording a Workflow
At the top of the Workflow tab, you'll find six icons. Each one performs a
specific function:

Record
Play
Delete
Format (not present in the Data Mining Cloud App)
Dock to top or side of Task Pane



Press the Record button to record a new workflow. On the Data Mining ribbon,
click Help – Example Models – Forecasting/Data Mining Examples, then click
the Boston_Housing dataset hyperlink to open the Boston Housing example
dataset. Click a cell on the Data worksheet, then click Record to start the
Workflow recorder. Click Data Mining – Partition – Standard Partition to open
the Standard Data Partition dialog. While holding down the SHIFT key, select
all variables under Variables In Input Data, then click the arrow button to move
them to Selected Variables. Leave all options at their defaults, then click OK to
run the partition. You'll see several icons added to the Workflow tab.

Click Data Mining – Classify – Classification Tree to open the Classification
Tree dialog. Under Data Source, click the down arrow next to Worksheet and
select STDPartition. While holding down the SHIFT key, select CRIM, ZN,
and INDUS under Variables In Input Data, then click the arrow button to move
these three variables to Selected Variables. Next, select the CAT.MEDV
variable under Variables In Input Data, and click the arrow button to choose
this variable as the Output Variable.



Leave all remaining options at their defaults, then click Finish to run the
Classification Tree algorithm.
Finally, click the Score button on the Data Mining ribbon to open the Scoring
dialog. Under Data to be Scored, click the down arrow next to Worksheet and
select New Data. The Variables In New Data field will populate with new
variables. Afterwards click Match By Name to match the variables listed under
Variables in New Data with the variables listed under Model Variables.

Click OK to score the new data, then click the Stop Recording button to stop the
Workflow recorder.



Stop Recording

The completed workflow is displayed in the window. Use Excel’s File – Save
to save the workflow to the workbook.

There is no limit to the number of workflows that may appear in the Workflow
tab. Each new workflow will be added to the right of the existing workflow(s).

Workflow Tab Options


Press the Play button to execute the workflow or workflows. If no nodes in the
workflow are selected, all workflows present in the Workflow window will be
executed. If a node in the flow is selected, and the Execute Workflow button is
pressed, the workflow will be run up to and including the selected node.

Press the Delete button to delete the contents of the Workflow window. If no
nodes in the workflow are selected, all nodes in the Workflow window will be
deleted. If a node is selected in the Workflow window, pressing the Delete
Workflow icon will delete only the selected node.



Press the Format button to format the Workflow into a more readable form. If the Task
Pane is docked to the top of the Excel workbook, the Workflow will be
formatted horizontally. If the Task Pane is docked to the side of the Excel
workbook, the Workflow will be formatted vertically.
Note: It is possible to move a node “behind” the task pane as shown in the
screenshot below.

To regain access to the node, click the Format Workflow button.

Press the dock buttons to dock the Solver Task Pane to the right or to the top of
the workbook.

Deploying Your Workflow


Starting with Analytic Solver Comprehensive V2020, when you click Create
App – Cloud Service, your model (or a copy of it) is saved to the Azure cloud.
(Note: Analytic Solver Data Mining does not support this new feature. Trial
licenses of Analytic Solver Comprehensive do support this new feature.)
If you use File – Save to store your workbook in OneDrive or OneDrive for
Business, the cloud service uses that copy, which is by far the best option.
Afterwards, any web or mobile application can easily run your model via
simple, standard REST API calls, often made from JavaScript on a web page,
or from C# or Java code on a Web server.
Click Help – Example Models – Forecasting / Data Mining Examples, then click
the Airpass Workflow link. The Airpass Workflow Example will open.

Click the Analytic Solver tab to move to the Analytic Solver ribbon.



Posting Workflow to RASON Cloud Services
Click Create App – Cloud Service – RASON Model.

In the COM add-in, Analytic Solver checks whether the model is an optimization,
simulation, or flow model, and displays the proper warning or error message.
Note: For a multi-stage data mining flow, Cloud Service – RASON Model is the
only available Create App option in either the Analytic Solver COM add-in or
the Analytic Solver Cloud app. Users of the Analytic Solver COM add-in will
receive an error if they attempt to use Power BI, Tableau, etc.; users of the
Analytic Solver Cloud app will not.
On the next screen, click Save to accept the default Model Name. (You can also
provide a more meaningful name here if desired. If your workbook resides on
your OneDrive account, click the checkbox.)

Note that data mining models that include a weight variable or a partition
variable may not be translated to Rason.
Immediately, a browser opens to the Editor tab on www.Rason.com. From here
you can view your workflow translated into Frontline's Rason modeling
language. Notice that the Airpass_Workflow is listed under Models on the right
and the workflow appears in graphical form in the grid.



To see the model translated into Rason simply click the Show RASON model
editor icon in the top left of the grid.

The Rason model editor displays the Airpass Workflow example translated into
the Rason modeling language.

From here you can edit your model directly in Rason, or, if you'd rather not
interact with Rason code at all, simply make the change to the Excel model and
then re-deploy the edited workflow. There's no need to learn a new language!
To solve this model on the Editor tab, simply click the down arrow next to the
"play" button and click Solve. (Only the Solve endpoint supports the solving of
workflows.)



After the workflow is executed, you can find the results listed under Results.

Deploying Your Model


Click Create App – Web Page on the Editor tab ribbon to deploy the Airpass
Workflow to a Web page application.

The file, RasonScript.html, will be downloaded locally. Double click the file to
open the Web application. Notice that the translated workflow appears in the
Model window.

To solve the workflow:



1. Attach the Excel file, Airpass Workflow.xlsx. Click "Choose Files" at the
top of the Web app, then browse to C:\Program Files\Frontline
Systems\Analytic Solver Platform\Datasets and select Airpass
Workflow.xlsx.
2. Since we are solving a workflow, we must first change the query parameter
from ?response-format=STANDALONE to ?response-format=WORKFLOW.
3. Now click the Solve button to solve the workflow through the Web
application.
Note: Quick Solve does not support external files, so it is unable to solve this
translated workflow.

Click Choose Files to attach the Excel workbook, Airpass Workflow.xlsx.

Change Query Parameters to ?response-format=WORKFLOW

Output Results

Whether you decide to keep your model in Excel or translate it to RASON, the
important thing is that your workflow (or model) can be executed whenever or
wherever it's needed – on the factory floor, on a salesperson's laptop or
smartphone, or in a call-center custom application. And Rason can get the updated
data your model needs directly from operational business systems. This is
amazingly easy if you are using Power BI, Power Apps, Power Automate or
Dynamics 365.
For more information on this new feature, including information on the
Deployment Wizard and how to Manage your deployed models within Excel,
see the Analytic Solver User Guide chapter, Deploying Your Model.

Manually Creating a Workflow


It's also possible to manually create a workflow or workflows. Go back to
Excel, open the 'Boston Housing.xlsx' workbook and navigate to the Workflow tab.
Click the + in front of Get Data, then drag “New” into the workflow window.
The New Data dialog will open. Under Data Source, click the down arrow next
to Worksheet and select Data.



Click OK to close the dialog. The Source Data icon will appear in the
Workflow window. It’s important to note that a workflow does not need to start
with a Source Data node. In fact, any node dragged to the workflow window can
stand alone and will be executed when the Execute Workflow button is pressed.

Afterwards, expand Partitioning (by clicking the +), then drag Standard Partition
to the Workflow window. When the Standard Partition dialog opens, select all
variables under Variables In Input Data, then click the arrow button to move all variables to
Selected Variables. Leave all options at their defaults and click OK.
Immediately, the workflow updates in the Workflow window displaying the
connection.
Note: Notice that when creating a workflow manually, Map Features is not
required. However, if you’d like to provide a new data source for an existing
flow, you must use Map Features to specify the input and output variables to be
used in the flow. See below for more information.



Expand Classify, then drag Classification Tree to the window.
• Under Data Source, click the down arrow next to Worksheet and select
STDPartition.
• Select the first three variables under Variables In Input Data (CRIM,
ZN and INDUS) and click the arrow button to move them to Selected Variables.
• Select CAT.MEDV as the Output Variable.

Then click Finish. For more information on the Classification Tree feature in
Analytic Solver Data Mining, see the Data Mining Reference Guide.
The workflow will update immediately by connecting the Standard Partition
icon to the Classification Tree icon.



Note: Automatic connections are made based on the current effective state of
the workflow and sheets in the workbook. For example, if you were to run this
workflow, a new STDPartition1 would be created. As a result, connecting a
new classification or regression to the STDPartition sheet would not be allowed.
On the Workflow tab, drag Score into the Workflow window. The Scoring
dialog will open. Click the down arrow next to Worksheet under Data to be
Scored and select New Data. Then click the down arrow for Worksheet under
Stored Model and select CT_Stored. Click Match By Name to match the
variables in the New Data worksheet with the selected Model Variables, then
click OK.
Note: When a node is dragged onto the workspace, the node’s dialog opens.
When Finish or OK is pressed on the open dialog, the node is executed. If this
dialog is reopened and either minor or no changes are made, the node will not be
executed again when Finish or OK is clicked to close the dialog. However, if
major changes are made, and children (additional nodes) are attached to this
node, the node will be executed again.
Immediately, the workflow window is updated.

Once all nodes are connected, click the Play button to run the workflow.



Saving a Workflow
A workflow is automatically saved to the workbook when one of three events occurs:
1. When an Analytic Solver Data Mining dialog is closed.
2. After the workflow is formatted by pressing the Format Workflow button on
the Workflow tab.
3. After the Stop Recording button is pressed on the Workflow tab.
Otherwise, use Excel’s File – Save menu item to save the workflow to the
workbook.

Note on Sampling/Summarizing Big Data


Currently, workflow methods that perform asynchronous activities, such as
Sampling and Summarizing Big Data, will first be converted to synchronous
methods when included in a workflow. Running big data synchronously (i.e.
clicking the RUN button on the Sampling or Summarizing Big Data dialog) is
supported without any required internal conversion.

Making/Breaking a Connection
To make a connection between two existing nodes, simply connect the green
square on the first node with the blue circle on the second node. To break a
connection, simply click the arrow connection at the blue circle on the second
node and move it back to the green square on the first node.

Running a Workflow with a New Dataset


When there is an existing workflow, you may drag and drop a new dataset (a
new Get Data node) onto the Workflow window and connect the new Source
Data icon to the existing “Map Features” node. The result is two connections,
one from each Source Data icon, to the Map Features icon. Connecting to, or
double-clicking the “Map Features” node will pop up the existing Score button
dialog, enabling column names to be matched. When the workflow is executed,
the newer data source will be used.



Multiple Workflows
As discussed earlier, there is no limit to the number of workflows appearing in
the workflow window. To run several different classification or regression
algorithms at the same time, simply drag the desired method into the workflow
and connect the method to the flow.
In the example workflow below, three different classification techniques,
Logistic Regression, Bagging and Naïve Bayes, and three different regression
methods, Linear Regression, Regression Tree and Boosting, use the same
standard partitions in the workflow. When this flow is executed, all six models
will be created and used to score new data. Note: Although each classification
and regression technique uses the same standard partition and data source, each
model may contain different selected variables.

Two workflows are present below. Once the Execute Workflow button is
pressed, all workflows will be executed at the same time.



Changing Options Settings
To change the option settings for any of the nodes in the workflow, simply
double click the desired node to bring up the task dialog, make the desired
option changes, then click OK or Finish.
Important Note: If you make changes to the Selected Variables or the Input
Sheet for any existing node, a message with three options will appear asking if
you would like to 1. Cancel, 2. Remove Children or 3. Disconnect Children.
Click Cancel to return to the Workflow tab without any changes being made.
Click Remove Children to delete all nodes beneath the selected node or
Disconnect Children to disconnect the selected node from the rest of the
workflow.

Workflow Groups
Analytic Solver Data Mining allows you to partition, rescale and score a dataset
“on-the-fly”. This means that instead of performing separate steps for
partitioning, rescaling or scoring, users can perform all these actions on the same
set of dialogs used to create the classification or regression model. When
partitioning, rescaling or scoring “on the fly” during the recording of a
workflow, a “group” will be created. This group is treated as one node and is
denoted with a dotted line surrounding all nodes in the group. Clicking any of
the nodes in the group will bring up the classification or regression method
dialogs.



To create a group when creating a workflow manually, simply select the
partitioning, rescaling or scoring options once the classification or regression
method dialog opens.
To remove or change a group, double click any of the nodes in the group to open
the classification or regression method dialog and disable the desired feature or
features.



Bringing Big Data into Excel
Using Apache Spark

Introduction
Large amounts of data are being generated and collected continuously from a
multitude of sources every minute of every day. From your toothbrush to your
vehicle GPS to Twitter/Facebook/Google/Yahoo, data is everywhere. Being
able to make decisions based on this information requires the ability to extract
trends and patterns that can be buried deeply within the numbers.
Generally these large datasets contain millions of records (rows) requiring
multiple gigabytes or terabytes of storage space across multiple hard drives in an
external compute cluster. Analytic Solver Data Mining enables users to ‘pull’
sampled and summarized data into Excel from compute clusters running Apache
Spark, the open-source software widely embraced by Big Data vendors and
users.

Sampling and Summarizing Big Data


This example illustrates how to use the Big Data Sample/Summarization feature
using data stored across an Apache Spark compute cluster where the Frontline
Systems access server is installed. By drawing a representative sample of Big
Data from all the nodes in the cluster, Excel users can easily train data mining
and text mining models directly on their desktops.
In this example, we will use the Airline dataset. The data used in this example
consists of flight arrival and departure information for all commercial flights
within the USA dating from October 1987 to April 2008. This data was
obtained from 29 commercial airlines and 3,376 airports and consists of 3.2
million cancelled flights and 25 million flights at least 15 minutes late. This is a
large dataset with nearly 120 million records requiring 1.6 GB of storage space
when compressed and 12 GB of storage space when uncompressed. Data was
obtained from the Research and Innovative Technology Administration (RITA)
which coordinates the U.S. Department of Transportation research programs.
Note: Southwest (WN), American Airlines (AA), United Airlines (UA), US
Airways (US), Continental Airlines (CO), Delta Airlines (DL), Northwest
Airlines (NW) and Alaska Airlines (AS) are the only airlines where data is
available for all 20 years. Recall the annual revenue from the domestic airline
industry is $157 billion. This public dataset was obtained from here. Navigate
to this webpage to explore details about this dataset. For supplemental data
including the location of each airport, plane type and meteorological data
pertaining to each flight, click here.
The information contained in this large dataset could allow us to answer or
understand the following questions or issues:
• What are the airports most prone to departure delays? What airports
tend to have the most arrival delays?



• What are the times of day and days of week that are most susceptible to
departure/arrival delay?
• How can we understand flight patterns as they respond to well-known
events? (i.e., examining the data before and after September 2001)
• How many miles per year does each plane by carrier fly?
• When is the best time of day/day of week/time of year to fly to
minimize delays?
• How does the number of people flying between different locations
change over time?
• How well does weather predict plane delays?
• Can you detect cascading failures as delays in one airport create delays
in others? Are there critical links in the system?
• Understanding flight patterns between the pair of cities that you fly
between most often, or all flights to and from a major airport like
Chicago (ORD)
• Average arrival delay in minutes by flight or by year?
• How many flights were cancelled, at least 15 minutes late, etc?
• How many flights were less than 50 miles?

Connecting to an Apache Spark Cluster


The Analytic Solver Data Mining software communicates over the network with
a Frontline-supplied, server-side software package that runs on one of the
computers in the Spark cluster. The first step in connecting Analytic Solver
Data Mining to your organization's own Apache Spark cluster is to contact
Frontline Systems Sales and Technical Support at 775-831-0300. After the
server-side software package is installed, the proper entries for the cluster
options can be entered as shown in the example below.
For university instructors teaching courses in business analytics to MBA and undergraduate business students, using methods such as data mining, optimization and simulation, Frontline Systems operates an Apache Spark cluster "in the cloud" on Amazon Web Services. The cluster is pre-loaded with a set of interesting, publicly available Big Data datasets (such as the Airline dataset illustrated in this chapter), plus sample exercises and case studies using the datasets, and can be made available at a nominal cost for student use. It gives students hands-on experience with the use of Big Data in decision-making, without a need for programming expertise or other "data science" preparation. For further information about this option, please contact Frontline Systems Academic Sales and Support at 775-831-0300 or academic@solver.com.

Storage Sources and Data Formats


Analytic Solver Data Mining can process data from the Hadoop Distributed File System (HDFS), local file systems that are visible to the Spark cluster, and Amazon S3. Performance is best with HDFS, so it is recommended that you load data from a local file system or Amazon S3 into HDFS. If the local file system is used, the data must be accessible at the same path on all Spark workers, either via a network path, or because it was copied to the same location on all workers.



At present, Analytic Solver Data Mining can process data in Apache Parquet
and CSV (delimited text) formats. Performance is far better with Parquet, which
stores data in a compressed, columnar representation; it is highly recommended
that you convert CSV data to Parquet before you seek to sample or summarize
the data.
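If you administer the Spark cluster yourself, this conversion is a one-time job. The following is a minimal PySpark sketch of it, not an Analytic Solver feature, and the HDFS paths are assumptions to be replaced with your own locations:

    # Minimal sketch: one-time conversion of a CSV dataset to Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CsvToParquet").getOrCreate()

    # Read the delimited text data (paths are placeholders).
    csv_df = spark.read.csv("hdfs:///data/airline.csv",
                            header=True, inferSchema=True)

    # Parquet stores the data compressed and column-by-column, so later
    # sampling and summarization jobs read far less data from disk.
    csv_df.write.mode("overwrite").parquet("hdfs:///data/airline.parquet")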

Sampling from a Large Dataset


If using desktop Analytic Solver Data Mining, click Get Data – Big Data –
Sample to open the Sample Big Data dialog. Enter the location of the file for
File Location and the URL for the Spark Server for Spark REST server URL.
This example uses the Airline dataset (described above) installed on a Frontline
operated Apache Spark cluster. If your dataset is located on Amazon S3, click
Credentials to enter your Access and Secret Keys.
If using AnalyticSolver.com, click the down arrow to open the File Location
drop down menu and select Airline.

Keep All variables selected to include all variables in the dataset in the sample.
To only include specific variables, you would choose Select variables, then click
Infer Schema. All variables contained in the dataset will be reported under Variables.
Use the >/< and >>/<< buttons to select variables to be included in the sample.

Click the Options tab.

Select Approximate Sampling. When this option is selected, the size of the resultant
sample will be determined by the value entered for Desired Sample Fraction.
Approximate sampling is much faster than Exact Sampling. Usually, the resultant fraction is very close to the Desired Sample Fraction, so this option should be preferred over exact sampling whenever possible. Even if the resultant sample slightly deviates from the desired size, this is easy to correct in Excel.

Enter 0.00001 for Desired Sample Fraction. This option controls the fraction of
the total number of records in the full dataset that is expected to be included in
the generated sample. Since our dataset contains about 120 million records, our
sample will contain approximately 1,200 records. If Sampling with Replacement
is selected, the value for Desired Sample Fraction is the expected number of
times each record can be chosen and must be greater than 0. If Sampling
without replacement (i.e. Sampling with Replacement is not selected), the
Desired Sample Fraction becomes the probability that each element is chosen
and, as a result, Desired Sample Fraction must be between 0 and 1.

Keep Random Seed at the default of 12345. This value initializes the random
number generator. This option allows you to generate reproducible samples.
Track record IDs and Sample with replacement should remain unchecked.
Please see the Analytic Solver Data Mining Reference guide for a complete
description of each option included on this dialog.
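For intuition, the Desired Sample Fraction semantics described above mirror Apache Spark's own DataFrame sampling API. The sketch below is illustrative only (Analytic Solver submits and manages the actual job for you), and the Parquet path is an assumption:

    # Illustrative sketch of fraction-based sampling in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("AirlineSample").getOrCreate()
    flights = spark.read.parquet("hdfs:///data/airline.parquet")  # assumed path

    # Without replacement, fraction is the per-record inclusion probability,
    # so with ~120 million records a fraction of 0.00001 yields roughly
    # 1,200 rows; the exact count varies slightly from run to run.
    sample = flights.sample(withReplacement=False, fraction=0.00001, seed=12345)
    print(sample.count())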

Clicking Submit sends a request for sampling to the compute cluster but does not
wait for completion. The result is output containing the Job ID and basic
information about the submitted job so multiple submissions may be identified.
This information can be used at any time later for querying the status of the job
and generating reports based on the results of the completed job.
Clicking Run sends a request for sampling to the compute cluster and waits for the results. Once the job is completed and results are returned to the Analytic Solver client, a report containing the sampling results is inserted into the Data Mining task pane under Results – Sampling.
Click Submit. Results will be inserted into the Data Mining Task Pane under
Results – Sampling – Run 1. Open BD_Sampling.

This report displays the details about the chosen dataset, selected options for
sampling and the job identifier required for identifying the submission on the
cluster.
Click Get Data – Big Data – Get Results on the Data Mining ribbon to open
the Big Data: Get Results dialog. Click the down arrow to the right of Job
identifier and select the previously submitted job. Click Get Info to obtain the
status of the Job from the cluster.



Application is the type of the submitted job. This submission corresponds to the
Approximate Sampling job that was submitted earlier.
Start Time displays the date and time when the job was submitted. Start Time
will always be displayed in the user's Local Time.
Duration shows the elapsed time since job submission if the job is still
RUNNING and the cluster total compute time if the job is FINISHED. This job
has not yet finished. Once the job is finished, the report will be inserted into the
current workbook.
Status is the current state of the job: FINISHED, FAILED, ERRORED or
RUNNING. FINISHED indicates that the job has been completed and results
are available for retrieval. FAILED or ERRORED indicates that the job has not
completed due to an internal cluster failure. When this occurs, Details will
contain a message indicating the reason.
If Status is FINISHED, you may click the Get Results button to obtain the
results from the cluster and populate the report as shown below. Note: It is not
required to click Get Info before Get Results. If Get Results is clicked, the status
of the job will be checked and if the status is FINISHED, the results will be
pulled from the cluster and the report will be created. Otherwise, Status will be
updated with the appropriate message to reflect the status: FAILED,
ERRORED, or RUNNING.
Click Get Results. BD_Sampling1 will be inserted into the Task Pane under
Results – Sampling – Run 2. Double click to open.



The Inputs section displays the information about the dataset, cluster
configuration, details on the running time, the options chosen during setup, and
a summary of the data including the size and dimensionality of the full and
sampled datasets. Since Approximate Sample was selected, we can expect the
resulting fraction to be slightly different from the desired fraction. In our
example, the resulting fraction, approximately 0.00001014 (1,184/116,701,402), is very close to the requested fraction (0.00001). (Recall that Approximate Sampling was selected and the Desired Sample Fraction was entered on the Options tab during setup.)
Scroll down to see the full and sampled data schemas. Since we chose to
include all variables in the sample, the set of columns in the full and sampled
datasets is the same.
Further down we see the Sampled Data, which includes 1,184 records, as
indicated by the Number of records – sample field under Sampled Data
Summary.
Now that the representative data sample has been drawn and is available in the
output, all of the methods and features included in Analytic Solver Data Mining
are literally at your fingertips. We could choose to explore our sampled data by
creating visualizations using the Chart Wizard, transform the data using
Analytic Solver Data Mining's data transformation utilities, build
classification/prediction models to forecast arrival and departure times, predict
airport delays, estimate total flight times and perform any other analytic tasks
that can address numerous challenges that Big Data, and the Airline dataset in
particular, present to data scientists and analysts.

Summarizing a Large Dataset


The Big Data Summarization feature in Analytic Solver Data Mining brings the lightning-fast cluster computing capabilities of the state-of-the-art Big Data engine, Apache Spark, to a simple, easy-to-use, "point & click" interface within Excel. This very powerful yet intuitive tool is useful for rapid extraction of key metrics contained in data, which can be immediately used by data analysts and decision makers. The Summarization feature in Analytic Solver Data Mining provides functionality similar to standard SQL engines, but for data whose volume and complexity extend far beyond your desktop or laptop computer. This tool is a great assistant for composing reports, constructing informative visualizations, and building prescriptive and predictive models that can drive the direction of subsequent analysis.



Now we will illustrate how to utilize this easy-to-use yet powerful tool by summarizing the Airline dataset and using the information obtained to answer
the following three questions:
• What carrier has the most domestic flights by year?
• Who are the most reliable airlines?
• Who are the least reliable airlines?
Click Get Data – Big Data – Summarize on the Data Mining Ribbon. This
time we will select a subset of variables for summarization along with grouping
variables, so (after entering the File location, Spark REST server URL and file
Credentials, if needed) choose Select variables and click Infer Schema.
Afterwards, transfer ArrTime and Cancelled to the Selected Variables grid and
Year and UniqueCarrier to the Group Variables grid. Group Variables are
variables from the dataset that are treated as key variables for aggregation. In
this example, the variables will be grouped so that all records with the same
Year and UniqueCarrier are included in the same group, and then all aggregate
functions for each group will be calculated.
Note: If All variables is selected, the result is a simple aggregation of all
variables across the entire dataset which can be used to quickly obtain overall
statistics.

Click the Options tab and select Average for Aggregation Type and Compute
Group Counts.

Aggregation Type provides 5 statistics that can be inferred from the dataset:
sum, average, standard deviation, minimum and maximum.

The option Compute group counts is enabled when 1 or more Grouping Variables is selected. When this option is selected, the number of records belonging to each group is computed and reported.
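For readers familiar with Spark, the job this dialog configures is conceptually a grouped aggregation. The hypothetical sketch below uses column names from the Airline dataset; the Parquet path is an assumption:

    # Illustrative sketch of the grouped aggregation in PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("AirlineSummary").getOrCreate()
    flights = spark.read.parquet("hdfs:///data/airline.parquet")  # assumed path

    summary = (flights
               .groupBy("Year", "UniqueCarrier")            # grouping variables
               .agg(F.avg("ArrTime").alias("ArrTime_AVG"),  # Average aggregation
                    F.avg("Cancelled").alias("Cancelled_AVG"),
                    F.count("*").alias("Count")))           # group counts

    summary.orderBy(F.desc("Count")).show(10)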



Click Run to send a request for a summarization job to the cluster and wait for the results. Once the job is completed, BD_Summarization will be inserted into the current workbook and the Data Mining task pane under Results – Summarization.
Once again the Inputs section recaps the dataset details, the cluster
configuration, the time taken to complete the job, the options selected during
setup and the number of columns and records in both the full dataset and the
summarized data. Full Data Schema displays all variables in the dataset while
Summarized Data Schema displays only the variables that were selected during
setup.
Scroll down to Group Counts to examine the number of records belonging to
each Year and UniqueCarrier. In this example, there were 405,598 US Airways
flights in 2003 and 684,961 Southwest flights in 1995.
Click the down arrow next to Count and select Sort Largest to Smallest to
answer our first question, "What carrier has the most domestic flights by year?"

Once the table is sorted on the Count column, we see that Southwest (WN)
holds the largest market share in domestic flights for years 2005 - 2008.



Scroll down to Summary Data to find some evidence (that can be further
verified) of the most and least "reliable" airlines. Click the down arrow next to
Cancelled_AVG and sort from largest to smallest. The airline with the largest
average percentage of cancelled flights is Eastern Airlines (EA) with a little over
10% of their flights cancelled on average in 1989. ExpressJet Airlines (EV) and
America West (HP) round out the top three spots with 4.5% and 4.3%
respectively.

Click the down arrow next to Year and sort from largest to smallest. Now we see that Mesa Airlines held the largest average cancellation percentage in 2008 (0.036, or 3.6%). The second least reliable airline in 2008 was SkyWest Airlines (OO), with an average cancellation percentage of 2.2%, and the third least reliable airline in 2008 is "awarded" to ExpressJet Airlines (EV), with an average cancellation percentage of 1.8%.



Click the down arrow next to Cancelled_AVG again, but this time sort from Smallest to Largest. Then sort by Year from Largest to Smallest. The table is updated to display the airlines with the smallest average flight cancellation percentage in 2008: 1st place – Hawaiian Airlines, 2nd place – Continental Airlines and 3rd place – ExpressJet Airlines.

Using the same steps illustrated here, we could find answers to many other
questions, for example:
• What is the yearly flight volume per carrier?
• Which times of day and days of week are most susceptible to
departure/arrival delays?
• How many miles per year does each plane by carrier fly?

Concluding Remarks
The ability to sample and summarize large datasets will only become more important as technology progresses and more and more data is captured. Analytic Solver Data Mining's Big Data feature allows users to import these large datasets into Excel, giving business analysts and data scientists the power to build predictive and prescriptive analytic models in their spreadsheets, without the need for complex programming skills. Using Analytic Solver Data Mining's Big Data feature, we could easily answer the questions posed in the introduction and more.



Fitting a model using Feature
Selection

What is Feature Selection?


Analytic Solver Data Mining’s Feature Selection tool gives users the ability to
rank and select the most relevant variables for inclusion in a classification or
prediction model. In many cases the most accurate models, or the models with
the lowest misclassification or residual errors, have benefited from better feature
selection, using a combination of human insights and automated methods.
Analytic Solver Data Mining provides a facility to compute all of the following
metrics, described in the literature, to give users information on what features
should be included in, or excluded from, their models.

• Correlation-based
o Pearson product-moment correlation
o Spearman rank correlation
o Kendall concordance
• Statistical/probabilistic independence metrics
o Welch’s statistic
o F statistic
o Chi-square statistic
• Information-theoretic metrics
o Mutual Information (Information Gain)
o Gain Ratio
• Other
o Cramer’s V
o Fisher score
o Gini index

Only some of these metrics can be used in any given application, depending on
the characteristics of the input variables (features) and the type of problem. In a
supervised setting, if we classify data mining problems as follows:

• R-R: real-valued features, prediction (regression) problem
• R-{0,1}: real-valued features, binary classification problem
• R-{1..C}: real-valued features, multi-class classification problem
• {1..C}-R: nominal categorical features, prediction (regression) problem
• {1..C}-{0,1}: nominal categorical features, binary classification problem
• {1..C}-{1..C}: nominal categorical features, multi-class classification problem



then we can describe the applicability of the Feature Selection metrics by the
following table:

              R-R    R-{0,1}   R-{1..C}   {1..C}-R   {1..C}-{0,1}   {1..C}-{1..C}
Pearson        N
Spearman       N
Kendall        N
Welch's        D        N
F-Test         D        N         N
Chi-squared    D        D         D          D             N              N
Mutual Info    D        D         D          D             N              N
Gain Ratio     D        D         D          D             N              N
Fisher         D        N         N
Gini           D        N         N

"N" means that metrics can be applied naturally, and “D” means that features
and/or the outcome variable must be discretized before applying the particular
filter.
As a result, depending on the variables (features) selected and the type of
problem chosen in the first dialog, various metrics will be available or disabled
in the second dialog.
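For readers who like a programmatic crib sheet, the applicability table can be encoded as a small lookup structure, as in the Python sketch below. This is a reader convenience, not part of Analytic Solver; each tuple follows the six problem-type columns of the table, with None meaning the metric does not apply:

    # Applicability of each Feature Selection metric, keyed by metric name.
    # Columns: (R-R, R-{0,1}, R-{1..C}, {1..C}-R, {1..C}-{0,1}, {1..C}-{1..C})
    # "N" = applies naturally; "D" = requires discretization first.
    APPLICABILITY = {
        "Pearson":     ("N", None, None, None, None, None),
        "Spearman":    ("N", None, None, None, None, None),
        "Kendall":     ("N", None, None, None, None, None),
        "Welch's":     ("D", "N", None, None, None, None),
        "F-Test":      ("D", "N", "N", None, None, None),
        "Chi-squared": ("D", "D", "D", "D", "N", "N"),
        "Mutual Info": ("D", "D", "D", "D", "N", "N"),
        "Gain Ratio":  ("D", "D", "D", "D", "N", "N"),
        "Fisher":      ("D", "N", "N", None, None, None),
        "Gini":        ("D", "N", "N", None, None, None),
    }

    # Example: metrics that apply naturally to real-valued features with a
    # binary outcome (the R-{0,1} column).
    print([m for m, row in APPLICABILITY.items() if row[1] == "N"])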

Fitting a Model
The goal of this example is three-fold: 1. To use Feature Selection as a tool for exploring relationships between features and the outcome variable; 2. To reduce dimensionality based on the Feature Selection results; and 3. To evaluate the performance of a supervised learning algorithm (a classification algorithm) for different feature subsets.
This example uses the Boston_Housing.xlsx example dataset, which contains 14 variables describing the census tracts within the city of Boston. A description of each variable is given in the table below. In addition to these variables, the dataset also contains an additional variable, CAT.MEDV, which has been created by categorizing the median value (MEDV) into two categories: high (MEDV > 30) and low (MEDV ≤ 30).

CRIM Per capita crime rate by town


ZN Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS Proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX Nitric oxides concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Proportion of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property-tax rate per $10,000
PTRATIO Pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of African-Americans
by town
LSTAT % Lower status of the population
MEDV Median value of owner-occupied homes in $1000's
If using desktop Analytic Solver Data Mining, click Help – Examples on the Data Mining ribbon to open the Boston_Housing.xlsx example file. Select a cell within the data (say A2), then click Explore – Feature Selection to bring up the first dialog.
Select all variables except MEDV and CAT.MEDV as Continuous Variables
and CAT.MEDV as the Output Variable. Leave the default setting of
Categorical selected. This setting denotes that the Output Variable is a
categorical variable. If the number of unique values in the Output variable is
greater than 10, then Continuous will be selected by default. However, at any
time the User may override the default choice based on his or her own
knowledge of the variable. Note: You can also perform this analysis with
variables CHAS (nominal) and RAD (ordinal) selected as Categorical Variables
– for this particular example, the most/least relevant or important variables
found by Feature Selection would be similar.

Click the Measures tab or click Next.


Since we have continuous variables, Discretize predictors is enabled. When this
option is selected, Analytic Solver Data Mining will transform continuous
variables into discrete, categorical data in order to be able to calculate statistics
as shown in the table in the Introduction. In this example, all of our variables (or features) are continuous or real-valued, corresponding to the "R" feature columns in the applicability table. As a result, if we are interested in evaluating the relevance of features according to the Chi-Squared Test or the measures available in the Information Theory group (Mutual Information and Gain Ratio), we must discretize these variables. If we do not select Discretize predictors, then we only have the option to compute Welch's test or the F-test statistics (F-Statistic or Fisher score).
Let's select Discretize predictors, then click Advanced. Leave the defaults of
10 for Maximum # bins and Equal Interval for Bins to be made with. Analytic Solver Data Mining will create 10 bins and will assign each record to a bin based on whether the variable's value falls within the bin's interval. This will be performed for each of the Continuous Variables.
Note: Discretize output variable is disabled because our output variable,
CAT.MEDV, is already a categorical nominal variable. If we had no
Continuous Variables and all Categorical Variables, Discretize predictors would
be disabled.
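To see what equal-interval binning does to a variable, here is a minimal pandas sketch; the file path and the column names are assumptions based on the example workbook:

    # Minimal sketch of equal-interval discretization with pandas.
    import pandas as pd

    df = pd.read_excel("Boston_Housing.xlsx")            # path is an assumption

    # Drop the two outcome columns (names assumed to match the worksheet),
    # then split each continuous predictor into 10 equal-width intervals;
    # each record is assigned the bin whose interval contains its value.
    predictors = df.drop(columns=["MEDV", "CAT.MEDV"])
    binned = predictors.apply(lambda col: pd.cut(col, bins=10, labels=False))

    print(binned["LSTAT"].value_counts().sort_index())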

Select Chi-squared and Cramer's V under Chi-Squared Test. The Chi-squared test statistic is used to assess whether there is a significant relationship between two categorical variables. When applied to Feature Selection, it is used as a test of independence, to determine whether the assigned class is independent of a particular variable. The higher the value of chi-squared, the larger the evidence that there is some association between the tested variables. Conversely, the smaller the chi-squared value, the more independent the variable. The minimum value for this statistic is 0. The independence test is usually carried out using the corresponding p-values rather than the statistic values, but the statistic values provide an indication of the (in)dependence as well.

Cramer's V is a variation of the Chi-Squared statistic that also measures the association between two discrete nominal variables. This statistic ranges from 0 to 1, with 0 indicating no association between the two variables and 1 indicating complete association (the two variables are equal).
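As a concrete illustration of these two measures, the sketch below computes both from a contingency table with scipy; the arrays are synthetic stand-ins for a discretized feature and its class labels:

    # Chi-squared statistic and Cramer's V for one discretized feature.
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(12345)
    x_binned = rng.integers(0, 10, 500)       # stand-in discretized feature
    y = rng.integers(0, 2, 500)               # stand-in binary class labels

    table = pd.crosstab(x_binned, y)          # contingency table
    chi2, p_value, dof, _ = chi2_contingency(table)

    # Cramer's V rescales chi-squared into the [0, 1] range.
    n = table.to_numpy().sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    print(chi2, p_value, v)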
Select Mutual information and Gain ratio within the Information Theory frame. Mutual information is the degree of two variables' mutual dependence, or the amount of uncertainty in variable 1 that can be reduced by incorporating knowledge about variable 2. Mutual Information is non-negative and is equal to zero if the two variables are statistically independent. Also, it is always less than the entropy (amount of information contained) in each individual variable.



The Gain Ratio, ranging from 0 to 1, is defined as the mutual information (or information gain) normalized by the feature entropy. This normalization helps address the problem of overemphasizing features with many values, but it results in an overestimate of the relevance of features with low entropy. It is good practice to consider both mutual information and gain ratio when deciding on feature rankings. The larger the gain ratio, the larger the evidence for the feature to be relevant in a classification model.
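The sketch below illustrates both quantities for a single discretized feature, computing the gain ratio as mutual information divided by the feature's entropy; the arrays are synthetic stand-ins:

    # Mutual information and gain ratio for one discretized feature.
    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    rng = np.random.default_rng(12345)
    x = rng.integers(0, 10, 500)              # stand-in discretized feature
    y = rng.integers(0, 2, 500)               # stand-in class labels

    mi = mutual_info_score(x, y)              # mutual information (in nats)

    # Gain ratio: mutual information normalized by the feature's entropy.
    _, counts = np.unique(x, return_counts=True)
    gain_ratio = mi / entropy(counts / counts.sum())
    print(mi, gain_ratio)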

Click the Output Options tab or click Next. Table of all produced measures is
selected by default. When this option is selected, Analytic Solver Data Mining
will produce a report containing all measures selected on the Measures tab.
Top Features table is selected by default. This option produces a report
containing only the top variables, as indicated by the Number of features edit box.
Select Feature importance plot. This is a graphical representation of variable
importance based on the measure selected in the Rank By drop down menu.
Enter 5 for Number of features. Analytic Solver Data Mining will display the
top 5 most important or most relevant features (variables) as ranked by the
statistic displayed in the Rank By drop down menu.
Keep Chi squared statistic selected for the Rank By option. Analytic Solver
Data Mining will display all measures and rank them by the statistic chosen in
this drop down menu.



Click Finish. Results are added to the end of the workbook. Select the
FS_Output tab and click the Top Features Info link to open the Top Features
Importance Plot.

• If using the Data Mining Cloud app, click the Charts icon on the
Ribbon to open the Charts dialog, then select FS_Top_Features for
Worksheet and Feature Importance Chart for Chart.
This chart ranks the variables by most important or relevant according to the
selected measure. In this example, we see that the RM (Average number of
rooms per dwelling), LSTAT (% lower status of the population), PTRatio
(pupil-teacher ratio by town), ZN (proportion of residential land zoned for lots
over 25,000 sq. ft.), and INDUS (proportion of non-retail business acres per
town) variables are the top five most important or relevant variables according
to the Chi-Squared statistic. It’s beneficial to examine the Feature Selection
Importance Plot in order to quickly identify the largest drops or “elbows” in
feature relevancy (importance) and select the optimal number of variables for a
given classification or prediction model.
Note: We could have limited the number of variables displayed on the plot to a
specified number of variables (or features) by selecting Number of features and
then specifying the number of desired variables. This is useful when the number
of input variables is large or we are particularly interested in a specified number
of highly-ranked features.



Click Feature Selection: Statistics on the Output Navigator to move to this
report.
The Detailed Feature Selection Report displays each computed metric selected
on the Measures tab: Chi-squared statistic, Chi-squared P-Value, Cramer’s V,
Mutual Information, and Gain Ratio. If using desktop Analytic Solver Data
Mining, click the down arrow next to each statistic to sort the table. For
example, if we click the down arrow next to Chi-squared Statistic and select Sort
Largest to Smallest from the menu,

the table will be sorted on the Chi-squared test statistic from largest to
smallest.

Click back to the Top Features Info link on the Output Navigator to view the
Top Features Info. The RM, LSTAT, PTRATIO, ZN, and INDUS variables are
the 5 most important or relevant variables as ranked by the Chi-square test.
According to the Chi-squared test, RM is the most relevant variable for discriminating the price of the house. This variable is highly dependent on the outcome variable, CAT.MEDV.



Keep in mind that when determining what features to include in our
classification model, it is advantageous to examine at least several metrics to see
which ones agree/disagree on the level of each variable’s importance.
Let’s re-run Feature Selection but this time we’ll ask for different statistics. On
the first dialog, we will again select all variables except CAT.MEDV and
MEDV as Selected Variables and select CAT.MEDV as the output
variable. Then click the Measures tab. Select Welch’s test, F-Statistic, Fisher
score, and Gini index.
Welch's Test is a two-sample test (i.e. applicable for binary classification problems) that is used to check the hypothesis that two populations with possibly unequal variances have equal means. When used with the Feature Selection tool, a large t-statistic value (in conjunction with a small p-value) provides evidence that the distributions of values for the two classes are distinct, and that the variable may have enough discriminative power to be included in the classification model.
F-Test tests the hypothesis of at least one sample mean being different from
other sample means assuming equal variances among all samples. If the
variance between the two samples is large with respect to the variance within the
sample, the F statistic will be large. Specifically for Feature Selection purposes,
it is used to test if a particular feature is able to separate the records from
different target classes by examining between-class and within-class variances.
Fisher Score is a variation of the F-Statistic. It assigns higher values to variables that take similar values for samples from the same class and different values for samples from different classes. The larger the Fisher Score value, the more relevant or important the variable (or feature).
The Gini index measures a variable’s ability to distinguish between classes. The
maximum value of the index for binary classification is 0.5. The smaller the
Gini index, the more relevant the variable.
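To make these definitions concrete, the sketch below computes Welch's t-statistic, the one-way F-statistic, and a Fisher score for a single candidate feature. The data is synthetic stand-in data, and the Fisher score follows the standard class-mean/within-class-variance formula:

    # Welch's test, F-test and Fisher score for one feature, binary classes.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(12345)
    x = np.r_[rng.normal(10, 3, 400), rng.normal(18, 5, 100)]  # stand-in feature
    y = np.r_[np.zeros(400), np.ones(100)]                     # 0/1 class labels
    x0, x1 = x[y == 0], x[y == 1]

    # Welch's test: two-sample t-test with unequal variances.
    t_stat, t_p = stats.ttest_ind(x0, x1, equal_var=False)

    # F-test (one-way ANOVA): between-class vs. within-class variance.
    f_stat, f_p = stats.f_oneway(x0, x1)

    # Fisher score: class-mean separation weighted by within-class variance.
    n0, n1 = len(x0), len(x1)
    fisher = (n0 * (x0.mean() - x.mean()) ** 2 +
              n1 * (x1.mean() - x.mean()) ** 2) / (n0 * x0.var() + n1 * x1.var())

    print(t_stat, t_p, f_stat, f_p, fisher)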



Click Next and select Top Features table and increase the Number of features
to 5 under Feature Selection, then click Finish to accept the remaining defaults
on the Output Options tab and run Feature Selection.

According to the Welch Test, the top five most relevant variables are LSTAT,
RM, INDUS, PTRatio, and Tax.

Click the Feature Selection: Statistics link on the Output Navigator. Observe that the p-values corresponding to the computed Welch's Test statistic for the above 5 variables are extremely small (e.g. LSTAT – 4.73E-70, RM – 2.83E-31, INDUS – 5.111E-19, PTRATIO – 1.95E-17, etc.), which means that we have very strong evidence (the threshold for the p-value is 0.05) for rejecting the hypothesis that the two samples contained in these variables – in our case stratified by the binary CAT.MEDV – have equal means. This provides evidence that these variables
would not be redundant and would have some nontrivial discriminative ability
for house prices.

If using desktop Analytic Solver Data Mining, we can sort by each statistic and see that the top four variables for each statistic include LSTAT, INDUS, PTRATIO, and RM (taking into account the Welch statistic magnitude). As you can see in the screenshot below, the F-Test statistic ranks RM as the top variable (with strong evidence, i.e. a very low p-value) and LSTAT as the second most important variable,

while Welch's test ranks the LSTAT variable as the most important variable and the RM variable (in magnitude) as the 2nd most important variable. Interestingly, the Gini Index, which is not a statistical hypothesis test, also agrees with the above rankings. The fact that this index and our hypothesis tests agree provides even stronger evidence of the aforementioned variables' relevancy. As mentioned above, the Gini index is a widely used measure for quantifying a variable's ability to distinguish between classes, and it is related to how the hierarchy of splits in the Classification Tree and Regression Tree algorithms is found.



At this point we already have a lot of useful information about our variables and
their relationship to the categorical CAT.MEDV variable. Now it’s a good time
to come back to the data description and try to understand the feature selection
results logically. For example, the LSTAT variable is the % lower status of the
population of the census tract, the RM variable is the average number of rooms
per house in the census tract, INDUS corresponds to the amount of non-retail
business in the census tract, and PTRatio is the pupil to teacher ratio in schools
in the census tract. The Boston Housing dataset is a small but very well-known
and widely used dataset. Feature selection confirms the intuitive observations
that these features are dependent on the output variable’s value. However, in
most cases, these relationships are typically hard to detect due to the large scale
of most datasets which involve complex interrelationships between variables.
Armed with the knowledge that we have obtained through the Features Selection
tool, let’s now quickly create a classification model using the Logistic
Regression algorithm and compare the performance of a classification model
created using only the two “best” variables, out of the original 13, to a
classification model created using all 13 variables. (For more information on
this classification method, please see the Logistic Regression chapter later in this guide.)
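Analytic Solver performs this comparison through its own dialogs; as a rough analogue in code, here is a hypothetical scikit-learn sketch. The file path and column names are assumptions, and the choice of RM and LSTAT as the two "best" variables follows the rankings above:

    # Hypothetical comparison: 2 top-ranked features vs. all 13 features.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_excel("Boston_Housing.xlsx")     # path/columns are assumptions
    X = df.drop(columns=["MEDV", "CAT.MEDV"])
    y = df["CAT.MEDV"]

    for cols in (["RM", "LSTAT"], list(X.columns)):
        model = LogisticRegression(max_iter=1000)
        acc = cross_val_score(model, X[cols], y, cv=5).mean()
        print(f"{len(cols)} features: mean CV accuracy = {acc:.3f}")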
Let’s re-run Feature Selection but this time, we will use Feature Selection for
evaluating a variable’s importance or relevance for predicting the median house
prices instead of classifying them into two categories, low or high, as done in the
analysis above. Again, select all variables except CAT.MEDV and MEDV and
move them to Selected Variables. Then select MEDV as the Output
Variable. Continuous is selected by default.

Click the Measures tab or Next. Leave Discretize predictors and Discretize
output variable unchecked. If Discretize predictors is selected, no statistics will
be enabled. If Discretize output variable is selected, F-Statistic, Fisher score,
and Gini index are enabled.



Select Pearson correlation, Spearman rank correlation, and Kendall
concordance.
The Pearson product-moment correlation coefficient is a widely used statistic
that measures the closeness of the linear relationship between two variables,
with a value between +1 and −1 inclusive, where 1 indicates complete positive
correlation, 0 indicates no correlation, and −1 indicates complete negative
correlation.
The Spearman rank correlation coefficient is a nonparametric measure that assesses the relationship between two variables. This measure calculates the correlation coefficient between the ranked values of the two variables. If there are no repeated data values, the Spearman rank correlation coefficient will be +1 or -1 when each of the variables is a perfect monotone function of the other.
Kendall concordance, also known as Kendall’s tau coefficient, is also used to
measure the level of association between two variables. A tau value of +1
signifies perfect agreement and a -1 indicates complete disagreement. If a
variable and the outcome variable are independent, then one could expect the
Kendall tau to be approximately zero.
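All three coefficients, and the p-values used below, can be illustrated with scipy on synthetic stand-in data:

    # Pearson, Spearman and Kendall coefficients for one feature/outcome pair.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr, kendalltau

    rng = np.random.default_rng(12345)
    x = rng.normal(size=200)                        # stand-in feature
    y = 0.7 * x + rng.normal(scale=0.5, size=200)   # stand-in outcome

    for name, test in (("Pearson", pearsonr),
                       ("Spearman", spearmanr),
                       ("Kendall", kendalltau)):
        stat, p = test(x, y)
        print(f"{name}: {stat:.3f} (p = {p:.2e})")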

Click the Output Options tab (or click Next) and select Top features table,
entering a value of 5 for Number of features. Leave all remaining options at
their defaults.



Click Finish to run Feature Selection for prediction. Click the Top Features
Info on the FS_Output2 worksheet to display the Feature Importance Plot.

This plot ranks the variables in order of importance according to the Pearson correlation coefficient. Click the Feature Selection: Statistics link in the Output
Navigator to view the Detailed Feature Selection Report. If using desktop
Analytic Solver Data Mining, click the down arrows next to each measure to
investigate how different ranking methods arrange the input features based on
their importance or relevance.



Click the Top Features Info link in the Output Navigator to display the top 5 Selected Predictors, ranked by the Pearson Correlation.

If we sort by each statistic (Pearson Correlation: Rho and P-Value; Spearman Correlation: Rho and P-Value; and Kendall Correlation: Tau and P-Value), we can see that the top four variables common amongst all statistics are again LSTAT, INDUS, RM and TAX. The low p-values indicate, with extremely strong statistical evidence, that the observed correlation is real, i.e. not due to random sampling. Although the observed correlations are not extremely large, they still show significant relationships and, more importantly, they show the relative importance/relevance of the variables (or features). Note that correlation is a signed measure, and correlations with large magnitude are considered to be important/relevant for regression models.

If we order the variables according to the Spearman Correlation coefficient from largest to smallest by magnitude, we see that the variables with the smallest magnitudes are the CHAS and B variables, with magnitudes of 0.1857 and 0.1406, respectively. Since these values are close to zero (and other measures have also agreed with this ranking), we can conclude that these variables will not be of much relevance in our prediction model. The top 5 variables with the largest magnitude of Spearman correlation are LSTAT, RM, INDUS, NOX and TAX. If we rank the variables according to the magnitude of the Pearson Correlation, we see that our top five variables are (in order) LSTAT (-0.7377), RM (0.6954), PTRATIO (-0.5078), INDUS (-0.4837) and TAX (-0.4685), while the variable with the least amount of relevance/importance is the CHAS variable (census tract proximity to the Charles River). If we rank the variables by the Kendall Correlation, the 5 most relevant/important variables are again (in order) LSTAT, RM, INDUS, TAX, and CRIM (per capita crime rate). Variables that might be worth a second look include the CRIM and NOX variables, as all three statistics rank these variables in the middle. It's worth noting that the top two variables ranked by all 3 correlations (LSTAT and RM) are at polar extremes with respect to their signs – meaning that the median house price tends to increase with increased RM (number of rooms), and tends to decrease with increased % lower status of the population (LSTAT).



The Feature Selection tool has allowed us to quickly explore and learn about our
data. We now have a pretty good idea of which variables are the most relevant
or most important to our classification or prediction model, how our variables
relate to each other and to housing prices, and which data attributes would be
worth extra time and money in future data collection. Interestingly, for this
example, most of our ranking statistics have agreed (mostly) on the most
important or relevant features with strong evidence. We computed and examined various metrics and statistics, and for some (where p-values can be computed) we've seen statistical evidence that the test of interest succeeded with a definitive conclusion. In this example, we've observed that several variables (or features) were consistently ranked in the top 5 most important variables by most of the measures produced by Analytic Solver Data Mining's Feature Selection tool. However, this will not always be the case. On some datasets you will find that the ranking statistics and metrics disagree on the rankings. In cases such as these, further analysis may be required.



Text Mining

Introduction
Text mining is the practice of automated analysis of one document or a collection of documents (a corpus), and extraction of non-trivial information from it. Text Mining usually involves the process of transforming unstructured textual data into a structured representation by analyzing the patterns derived from the text. The results can be analyzed to discover interesting knowledge, some of which would only be found by a human carefully reading and analyzing the text. Typical widely-used tasks of Text Mining include, but are not limited to, Automatic Text Classification/Categorization, Topic Extraction, Concept Extraction, Document/Term Clustering, Sentiment Analysis, and Frequency-based Analysis. Some of these tasks could not practically be completed by a human, which makes Text Mining a particularly useful and applicable tool in modern Data Science.

Analytic Solver Data Mining's Text Miner takes an integrated approach to text mining: it does not totally separate analysis of unstructured data from the traditional data mining techniques applicable to structured information. While Analytic Solver Data Mining is a very powerful tool for analyzing text only, it also offers automated treatment of mixed data, i.e. a combination of multiple unstructured and structured fields. This is a particularly useful feature that has many real-world applications, such as analyzing maintenance reports, evaluation forms, insurance claims, etc.

Text Miner uses the "bag of words" model – a simplified representation of text in which the precise grammatical structure and exact word order are disregarded. Instead, syntactic, frequency-based information is preserved and used for text representation. Although such assumptions might be harmful for some specific applications of Natural Language Processing (NLP), they have been proven to work very well for applications such as Text Categorization and Concept Extraction, which are the particular areas addressed by Text Miner's capabilities. It has been shown in many theoretical and empirical studies that syntactic similarity often implies semantic similarity. One way to capture syntactic relationships is to represent text in terms of the Generalized Vector Space Model (GVSM). The advantage of such a representation is a meaningful mapping of text to a numeric space; the disadvantage is that some semantic elements, e.g. the order of words, are lost (recall the bag-of-words assumption).
Input to Text Miner (the Text Mining tool within Analytic Solver Data Mining) can be of two main types – a few relatively large documents (e.g. several books), or a relatively large number of smaller documents (e.g. a collection of emails, news articles, product reviews, comments, tweets, Facebook posts, etc.). While Text Miner is capable of analyzing large text documents, it is particularly effective for large corpuses of relatively small documents. This functionality has a nearly limitless number of applications – for instance, email spam detection, topic extraction in articles, automatic rerouting of correspondence, sentiment analysis of product reviews, and many more.
The input for text mining is a dataset on a worksheet, with at least one column
that contains free-form text (or file paths to documents in a file system
containing free-form text), and, optionally, other columns that contain traditional
structured data. In the first tab of the Text Mining dialog, the user selects the
text variable(s), and the other variable(s) to be processed.



The output of text mining is a set of reports that contain general explorative information about the collection of documents, plus structured representations of the text (free-form text columns are expanded to a set of new columns with a numeric representation). The new columns will each correspond to either (i) a single term
(word) found in the “corpus” of documents, or, if requested, (ii) a concept
extracted from the corpus through Latent Semantic Indexing (LSI, also called
LSA or Latent Semantic Analysis). Each concept represents an automatically
derived complex combination of terms/words that have been identified to be
related to a particular topic in the corpus of documents. The structural
representation of text can serve as an input to any traditional Data Mining
techniques available in Text Miner – unsupervised/supervised, affinity,
visualization techniques, etc. In addition, Text Miner also presents a visual
representation of Text Mining results to allow the user to interactively explore
the information, which otherwise would be extremely hard to analyze manually.
Typical visualizations that aid in understanding of Text Mining outputs and that
are produced by Text Miner are:
• Zipf plot – for visual/interactive exploration of frequency-based information
extracted by Text Miner
• Scree Plot, Term-Concept and Document-Concept 2D scatter plots – for
visual/interactive exploration of Concept Extraction results
If you are interested in visualizing specific parts of Text Mining analysis
outputs, Text Miner provides rich capabilities for charting – the functionality
that can be used to explore Text Mining results and supplement standard charts
discussed above.
In the example below, you will learn how to use Text Miner in Analytic Solver
Data Mining to process/analyze approximately 1000 text files and use the results
for automatic topic categorization. This will be achieved by using structured
representation of text presented to Logistic Regression for building the model
for classification.
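For orientation, the overall approach of this chapter (a bag-of-words/TF-IDF representation, concept extraction via LSA, then Logistic Regression for topic categorization) can be sketched with scikit-learn as a stand-in for Text Miner's internals. The toy texts and labels below are invented for illustration only; Text Miner's actual processing is configured through the dialogs shown in the example:

    # Stand-in sketch: bag-of-words -> LSA concepts -> Logistic Regression.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["my car needs a new alternator", "this circuit drops the voltage",
             "engine stalls at idle", "op-amp gain is too high"] * 50
    labels = ["autos", "electronics", "autos", "electronics"] * 50

    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),  # tokenize, drop stopwords
        TruncatedSVD(n_components=10),          # extract latent "concepts" (LSA)
        LogisticRegression(max_iter=1000))      # classify documents by topic
    model.fit(texts, labels)

    print(model.predict(["the engine stalls at idle speed"]))  # -> 'autos'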

Text Mining Example


This example uses the text files within the Text Mining Example Documents.zip
archive file to illustrate how to use Analytic Solver Data Mining’s Text Mining
tool. These documents were selected from the well-known text dataset
(downloadable from http://www.cs.cmu.edu/afs/cs/project/theo-
20/www/data/news20.html) which consists of 20,000 messages, collected from
20 different internet newsgroups. We selected about 1,200 of these messages
that were posted to two interest groups, Autos and Electronics (about 500
documents from each).
Note: Analytic Solver Cloud does not currently support importing from a file
folder.

Importing from a File Folder


The Text Mining Example Documents.zip archive file is located at C:\Program
Files\Frontline Systems\Analytic Solver Platform\Datasets. Unzip the contents
of this file to a location of your choice. Four folders will be created beneath
Text Mining Example Documents: Autos, Electronics, Additional Autos and
Additional Electronics. One thousand two hundred short text files will be extracted to the chosen location. As described in the introduction, these messages were selected from the 20-newsgroups text dataset and were posted to two interest groups, Autos and Electronics (about 50% in each).
Select Get Data – File Folder to open the Import From File System dialog.
At the top of the dialog, click Browse… to navigate to the Autos subfolder
(C:\Program Files\Frontline Systems\Analytic Solver Platform\Datasets\Text
Mining Example Documents\Autos). Set the File Type to All Files (*.*), then
select all files in the folder and click the Open button. The files will appear in
the left listbox under Files. Click the >> button to move the files from the Files
listbox to the Selected Files listbox. Now repeat these steps for the Electronics
subfolder. When these steps are completed, 985 files will appear under Selected
Files.
Select Sample from selected files to enable the Sampling Options. Text Miner
will perform sampling from the files in the Selected Files field. Enter 300 for
Desired sample size while leaving the default settings for Simple random
sampling and Set Seed.
Note: If you are using the educational version of Analytic Solver Data Mining, enter "100" for Desired Sample Size. This is the upper limit for the number of files supported when sampling from a file system using the educational version. For a complete list of the capabilities of Analytic Solver Data Mining and Analytic Solver Data Mining for Education, see Frontline's website, www.solver.com.
Text Miner will select 300 files using Simple random sampling with a seed
value of 12345. Under Output, leave the default setting of Write file paths.
Rather than writing out the file contents into the report, Text Miner will include
the file paths.
Note: Currently, Analytic Solver Data Mining only supports the import of
delimited text files. A delimited text file is one in which data values are
separated by a character such as quotation marks, commas or tabs. These
characters define the beginning and end of a string of text.

Click OK. The output XLM_SampleFiles will be inserted into the Data Mining
task pane, with contents similar to those shown below.



The Data portion of the report displays the selections we made on the Import
From File System dialog. Here we see the path of the directories, the number of
files written, our choice to write the paths or contents (File Paths), the sampling
method, the desired sample size, the actual size of the sample, and the seed
value (12345).
Underneath the Data portion are paths to the 300 text files in random order that
were sampled by Analytic Solver Data Mining. If Write file contents had been
selected, rather than Write file paths, the report would contain the RowID, File
Path, and the first 32,767 characters present in the document.

Here is an example of a document that appeared in the Electronics newsgroup.


Note the appearance of email addresses, “From” and “Subject” lines. All three
appear in each document.

The selected file paths are now in random order, but we will need to categorize
the “Autos” and “Electronics” files in order to be able to identify them later. To
do this, we’ll use Excel to sort the rows by the file path: Select columns B
through D and rows 18 through 317, then choose Sort from the Data tab. In the
Sort dialog, select Column D, where the file paths are located, and click OK.



The file paths should now be sorted between Electronics and Autos files.

Using Text Miner


Click the Text icon to bring up the Text Miner dialog.

Data Source Tab


Confirm that XLM_SampleFiles is selected for Worksheet. Select TextVar in
the Variables listbox, and click the upper > button to move it to the Selected
Text Variables listbox. By doing so, we are selecting the text in the documents
as input to the Text Miner model. Ensure that “Text variables contain file paths”
is checked.

Click the Next button, or click the Pre-Processing tab at the top.



Pre-Processing Tab
Leave the default setting for Analyze all terms selected under Mode. When this
option is selected, Text Miner will examine all terms in the document. A “term”
is defined as an individual entity in the text, which may or may not be an
English word. A term can be a word, number, email address, URL, etc. Terms are
separated by all possible delimiting characters (i.e. \, ?, ', `, ~, |, \r, \n, \t, :, !, @,
#, $, %, ^, &, *, (, ), [, ], {, }, <>, _, ;, =, -, +, \), with some exceptions related to
stopwords, synonyms, exclusion terms and boilerplate normalization (URLs,
emails, monetary amounts, etc.), where Text Miner will not tokenize on these
delimiters.
Note: These exceptions concern not how terms are separated, but whether
they are split on a delimiter at all. For example, URLs contain many
characters such as "/", ";", etc. Text Miner will not tokenize on these characters
within a URL but will consider the URL as a whole, and will remove the URL if it
is selected for removal. (See below for more information.)
If Analyze specified terms only is selected, the Edit Terms button will be
enabled. If you click this button, the Edit Exclusive Terms dialog opens. Here
you can add and remove terms to be considered for text mining. All other terms
will be disregarded. For example, if we wanted to mine each document for a
specific part name such as “alternator” we would click Add Term on the Edit
Exclusive Terms dialog, then replace “New term” with “alternator” and click
Done to return to the Pre-Processing dialog. During the text mining process,
Text Miner would analyze each document for the term “alternator”, excluding
all other terms.
Leave both Start term/phrase and End term/phrase empty under Text
Location. If this option is used, text appearing before the first occurrence of the
Start Phrase will be disregarded and similarly, text appearing after End Phrase
(if used) will be disregarded. For example, if text mining the transcripts from a
Live Chat service, you would not be particularly interested in any text appearing
before the heading “Chat Transcript” or after the heading “End of Chat
Transcript”. Thus you would enter “Chat Transcript” into the Start Phrase field
and “End of Chat Transcript” into the End Phrase field.
Leave the default setting for Stopword removal. Click Edit to view a list of
commonly used words that will be removed from the documents during pre-
processing. To remove a word from the Stopword list, simply highlight the
desired word, then click Remove Stopword. To add a new word to the list,
click Add Stopword; a new term, “stopword”, will be added. Double-click it to
edit.
Text Miner also allows additional stopwords to be added, or existing ones to be
removed, via a text document (*.txt) by using the Browse button to navigate to
the file. Terms in the text document can be separated by a space, a comma, or
both. If we were supplying three terms in a text document, rather than in the
Edit Stopwords dialog, the terms could be listed as: subject emailterm from, or
subject,emailterm,from, or subject, emailterm, from. For a large list of
additional stopwords, this would be the preferred way to enter the terms.



Click Advanced in the Term Normalization group to open the Term
Normalization – Advanced dialog. Select all options as shown below. Then
click Done. This dialog allows us to indicate to Text Miner that:
• If stemming reduces a term's length to two or fewer characters, the term
will be disregarded (Minimum stemmed term length).
• HTML tags, and the text enclosed, will be removed entirely. HTML
tags and the text contained inside these tags often contain technical,
computer-generated information that is not typically relevant to the
goal of the text mining application.
• URLs will be replaced with the term “urltoken”. The specific form of a
URL does not normally add any meaning, but it is sometimes interesting
to know how many URLs are included in a document.
• Email addresses will be replaced with the term “emailtoken”. Since
the documents in our collection all contain a great many email
addresses (and the distinction between the different emails often has
little use in text mining), these email addresses will be replaced with
the term “emailtoken”.
• Numbers will be replaced with the term “numbertoken”.
• Monetary amounts will be substituted with the term “moneytoken”.

Recall that when we inspected an email from the document collection we saw
several terms such as “subject”, “from” and email addresses. Since all of our
documents contain these terms, including them in the analysis will not provide
any benefit and could bias the analysis. As a result, we will exclude these terms



from all documents by selecting Exclusion list then clicking Edit. The Edit
Exclusion List dialog opens. Click Add Exclusion Term. The label
“exclusionterm” is added. Click to edit and change to “subject”. Then repeat
these same steps to add the term “from”.
We can take the email issue one step further and completely remove the term
“emailtoken” from the collection. Click Add Exclusion Term and edit
“exclusionterm” to “emailtoken”.
To remove a term from the exclusion list, highlight the term and click Remove
Exclusion Term.
We could have also entered these terms into a text document (*.txt) and added
the terms all at once by using the Browse button to navigate to the file and
import the list. Terms in the text document can be separated by a space, a
comma, or both. If, for example we were supplying excluded terms in a
document rather than in the Edit Exclusion List dialog, we would enter the terms
as: subject emailtoken from, or subject,emailtoken,from, or subject, emailtoken,
from. If we had a large list of terms to be excluded, this would be the preferred
way to enter the terms.

Click Done to close the dialog and return to Pre-Processing.


Text Miner also allows the combining of synonyms and full phrases by clicking
Advanced within Vocabulary Reduction. Select Synonym reduction at the top
of the dialog to replace synonyms such as “car”, “automobile”, “convertible”,
“vehicle”, “sedan”, “coupe”, “subcompact”, and “jeep” with “auto”. Click
Add Synonym and replace “rootterm” with “auto” then replace “synonym list”
with “car, automobile, convertible, vehicle, sedan, coupe, subcompact,
jeep” (without the quotes). During pre-processing, Text Miner will replace the
terms “car”, “automobile”, “convertible”, “vehicle”, “sedan”, “coupe”,
“subcompact” and “jeep” with the term “auto”. To remove a synonym from the
list, highlight the term and click Remove Synonym.
If adding synonyms from a text file, each line must be of the form
rootterm:synonymlist, or using our example: auto:car automobile convertible
vehicle sedan coupe or auto:car,automobile,convertible,vehicle,sedan,coupe.
Note that the terms in the synonym list may be separated by a space, a comma, or
both. For a large list of synonyms, this would be the preferred way to
enter the terms.
Text Miner also allows the combining of words into phrases that indicate a
singular meaning such as “station wagon” which refers to a specific type of car
rather than two distinct tokens – station and wagon. To add a phrase in the
Vocabulary Reduction – Advanced dialog, select Phrase reduction and click



Add Phrase. The term “phrasetoken” will appear; click to edit and enter
“wagon”. Click “phrase” to edit and enter “station wagon”. If supplying
phrases through a text file (*.txt), each line of the file must be of the form
phrasetoken:phrase or using our example, wagon:station wagon. If we had a
large list of phrases, this would be the preferred way to enter the terms.

Enter 200 for Maximum Vocabulary Size. Text Miner will reduce the number of
terms in the final vocabulary to the top 200 most frequently occurring in the
collection.
Leave Perform stemming at the selected default. Stemming is the practice of
stripping words down to their “stems” or “roots”. For example, stemming terms
such as “argue”, “argued”, “argues”, “arguing”, and “argus” would result in the
stem “argu”, while “argument” and “arguments” would stem to “argument”.
The stemming algorithm utilized in Text Miner is “smart” in the sense that while
“running” would be stemmed to “run”, “runner” would not. Text Miner uses
the Porter Stemmer 2 algorithm for the English language. For more
information on this algorithm, please see the
webpage: http://tartarus.org/martin/PorterStemmer/
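You can reproduce this behavior outside of Analytic Solver with NLTK's English Snowball stemmer, which implements the same Porter 2 algorithm; a minimal sketch, assuming the nltk package is installed:

```python
from nltk.stem.snowball import SnowballStemmer   # Porter 2 ("English") stemmer

stemmer = SnowballStemmer("english")
for word in ["argue", "argued", "argues", "arguing", "argument", "running", "runner"]:
    print(word, "->", stemmer.stem(word))
# argue/argued/argues/arguing -> argu; argument -> argument
# running -> run, but runner -> runner
```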
Leave the default selection for Normalize case. When this option is checked,
Text Miner converts all text to a consistent (lower) case, so that Term, term,
TERM, etc. are all normalized to the single token “term” before any processing,
rather than being treated as three independent tokens with different cases.
Without this normalization, case variants of the same word can dramatically
distort the frequency distributions of the corpus, leading to biased results.
Enter 3 for Remove terms occurring in less than _% of documents and 97 for
Remove terms occurring in more than _% of documents. For many text mining
applications, the goal is to identify terms that are useful for discriminating
between documents. If a particular term occurs in all or almost all documents, it
may not be possible to highlight the differences. If a term occurs in very few



documents, it will often indicate great specificity of this term, which is not very
useful for some Text Mining purposes.
Enter 20 for Maximum term length. Terms that contain more than 20 characters
will be excluded from the text mining analysis and will not be present in the
final reports. This option can be extremely useful for removing some parts of
text which are not actual English words, for example, URLs or computer-
generated tokens, or to exclude very rare terms such as Latin species or disease
names, i.e. Pneumonoultramicroscopicsilicovolcanoconiosis.

Click Next to advance to the Representation tab, or simply click Representation
at the top.

Representation Tab
Keep the default selection of TF-IDF (Term Frequency – Inverse Document
Frequency) for Term-Document Matrix Scheme. A term-document matrix is a
matrix that displays the frequency-based information of terms occurring in a
document or collection of documents. Each column is assigned a term and each
row a document. If a term appears in a document, a weight is placed in the
corresponding column indicating the term’s importance or contribution. Text
Miner offers four commonly used weighting schemes to represent each value in
the matrix: Presence/Absence, Term Frequency, TF-IDF (the default), and
Scaled term frequency. If Presence/Absence is selected,
Text Miner will enter a 1 in the corresponding row/column if the term appears in
the document and 0 otherwise. This matrix scheme does not take into account
the number of times the term occurs in each document. If Term Frequency is
selected, Text Miner will count the number of times the term appears in the
document and enter this value into the corresponding row/column in the matrix.
The default setting – Term Frequency – Inverse Document Frequency (TF-IDF)
is the product of scaled term frequency and inverse document
frequency. Inverse document frequency is calculated by taking the logarithm of
the total number of documents divided by the number of documents that contain
the term. A high value for TF-IDF indicates that a term that does not occur
frequently in the collection of documents taken as a whole, appears quite



frequently in the specified document. A TF-IDF value close to 0 indicates that
the term appears frequently in the collection or rarely in a specific document. If
Scaled term frequency is selected, Text Miner will normalize (bring to the same
scale) the number of occurrences of a term in the documents (see the table
below).
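To make the schemes concrete, the sketch below computes Presence/Absence, Term Frequency, and TF-IDF weights for a toy count matrix, using the plain log(N/df) form of inverse document frequency described above; Analytic Solver's exact scaling of term frequencies may differ in detail.

```python
import numpy as np

# Toy counts: rows = documents, columns = terms.
tf = np.array([[3, 0, 1],
               [0, 2, 1],
               [1, 1, 1]], dtype=float)
N = tf.shape[0]                       # total number of documents
df = np.count_nonzero(tf, axis=0)     # documents containing each term

presence = (tf > 0).astype(int)       # Presence/Absence scheme
idf = np.log(N / df)                  # inverse document frequency
tfidf = tf * idf                      # high: frequent in a doc, rare in the collection

print(presence)
print(np.round(tfidf, 3))             # last term occurs everywhere, so its TF-IDF is 0
```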
It’s also possible to create your own scheme by clicking the Advanced command
button to open the Term Document Matrix – Advanced dialog. Here users can
select their own choices for local weighting, global weighting, and
normalization. Please see the table below for definitions regarding options for
Term Frequency, Document Frequency and Normalization.

Local Weighting:
• Binary: $lw_{td} = 1$ if $tf_{td} > 0$; $lw_{td} = 0$ if $tf_{td} = 0$
• Raw Frequency: $lw_{td} = tf_{td}$
• Logarithmic: $lw_{td} = \log(1 + tf_{td})$
• Augnorm: $lw_{td} = \frac{(tf_{td} / \max_t tf_{td}) + 1}{2}$

Global Weighting:
• None: $gw_t = 1$
• Inverse: $gw_t = \log_2 \frac{N}{1 + df_t}$
• Normal: $gw_t = \frac{1}{\sqrt{\sum_d tf_{td}^2}}$
• GF-IDF: $gw_t = \frac{cf_t}{df_t}$
• Entropy: $gw_t = 1 + \sum_d \frac{p_{td} \log p_{td}}{\log N}$
• Probabilistic IDF: $gw_t = \log_2 \frac{N - df_t}{df_t}$

Normalization:
• None: $n_d = 1$
• Cosine: $n_d = \frac{1}{\lVert \bar{g}_d \rVert_2}$

Notations:
• $tf_{td}$ – frequency of term $t$ in document $d$;
• $df_t$ – document frequency of term $t$;
• $lw_{td}$ – local weighting of term $t$ in document $d$;
• $gw_t$ – global weighting of term $t$;
• $n_d$ – normalization of the vector of terms representing document $d$;
• $N$ – total number of documents in the collection;
• $cf_t$ – collection frequency of term $t$;
• $p_{td}$ – estimated probability of term $t$ appearing in document $d$ ($p_{td} = tf_{td}/cf_t$);
• $\bar{g}_d$ – vector of terms representing document $d$.



Finally, the element $T_{td}$ of the Term-Document Matrix is computed as
$T_{td} = lw_{td} \cdot gw_t \cdot n_d$ for all $t, d$.
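Expressed in code, a custom scheme of this form is a product of three factors. The sketch below combines the logarithmic local weighting, the “Normal” global weighting, and cosine normalization from the table above; it illustrates the formula rather than Analytic Solver's internal implementation.

```python
import numpy as np

tf = np.array([[3, 0, 1],
               [0, 2, 1],
               [1, 1, 1]], dtype=float)          # rows = documents, columns = terms

lw = np.log(1 + tf)                              # logarithmic local weighting lw_td
gw = 1.0 / np.sqrt((tf ** 2).sum(axis=0))        # "Normal" global weighting gw_t
g = lw * gw                                      # weighted document vectors
norms = np.linalg.norm(g, axis=1, keepdims=True) # cosine normalization n_d = 1/||g_d||
T = g / norms                                    # T_td = lw_td * gw_t * n_d

print(np.round(T, 3))                            # each document row now has unit length
```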
Leave Perform latent semantic indexing selected (the default). When this option
is selected, Text Miner will use Latent Semantic Indexing (LSI) to detect
patterns in the associations between terms and concepts to discover the meaning
of the document.
The statistics produced and displayed in the Term-Document Matrix contain
basic information on the frequency of terms appearing in the document
collection. With this information we can “rank” the significance or importance
of these terms relative to the collection and particular document. Latent
Semantic Indexing, in comparison, uses singular value decomposition (SVD) to
map the terms and documents into a common space to find patterns and
relationships. For example: if we inspected our document collection, we might
find that each time the term “alternator” appeared in an automobile document,
the document also included the terms “battery” and “headlights”. Or each time
the term “brake” appeared in an automobile document, the terms “pads” and
“squeaky” also appeared. However there is no detectable pattern regarding the
use of the terms “alternator” and “brake”. Documents including “alternator”
might not include “brake” and documents including “brake” might not include
“alternator”. Our four terms, battery, headlights, pads, and squeaky describe
two different automobile repair issues: failing brakes and a bad
alternator. Latent Semantic Indexing will attempt to (1) distinguish between
these two different topics, (2) identify the documents that deal with faulty
brakes, alternator problems, or both, and (3) map the terms into a common
semantic space using singular value decomposition. SVD is a tool used by Text
Miner to extract concepts that explain the main dimensions of meaning of the
documents in the collection. The results of LSA are usually hard to examine
because the construction of the concept representations is not fully
explained. Interpreting these results is actually more of an art than a science.
However, Text Miner provides several visualizations that simplify this process
greatly.
Select Maximum number of concepts and leave the default setting of 5. Doing
so will tell Text Miner to retain the top 5 most significant concepts. If
Automatic is selected, Text Miner will calculate the importance of each concept,
take the difference between each and report any concepts above the largest
difference. For example if three concepts were identified (Concept1, Concept2,
and Concept3) and given importance factors of 10, 8, and 2, respectively, Text
Miner would keep Concept1 and Concept2 since the difference between
Concept2 and Concept 3 (8-2=6) is larger than the difference between Concept1
and Concept2 (10-8=2). If Minimum percentage explained is selected, Text
Miner will identify the concepts with singular values that, when taken together,
sum to the minimum percentage explained, 90% is the default.
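Conceptually, this step is a truncated singular value decomposition (SVD) of the weighted term-document matrix. The sketch below extracts concepts from a small random matrix and mimics the “Automatic” rule just described by keeping the concepts before the largest drop in singular values; it is a simplified illustration, not Analytic Solver's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(12345)
X = rng.random((10, 6))        # stand-in: 10 documents x 6 terms (e.g., TF-IDF weights)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("singular values:", np.round(s, 3))

# "Automatic" rule sketch: keep the concepts before the largest gap in importance.
gaps = s[:-1] - s[1:]
k = int(np.argmax(gaps)) + 1
print("concepts kept:", k)

concept_document = U[:, :k] * s[:k]   # document coordinates in the semantic space
concept_term = Vt[:k, :]              # term loadings on each concept
```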



Click Next or the Output Options tab.

Options Tab
Keep Term-Document and Concept-Document selected under Matrices (the
default) and select Term-Concept to print each matrix in the output. The Term-
Document matrix displays the terms across the top of the matrix and the
documents down the left side of the matrix. The Concept – Document and Term
– Concept matrices are output from the Perform latent semantic indexing option
that we selected on the Representation tab. In the first matrix, Concept –
Document, the top concepts (5, per our selection on the Representation tab) will
be listed across the top of the matrix and the documents will be listed down the
left side of the matrix. The values in this
matrix represent concept coordinates in the identified semantic space. In the
Term-Concept matrix, the terms will be listed across the top of the matrix and
the concepts will be listed down the left side of the matrix. The values in this
matrix represent terms in the extracted semantic space.
Keep Term frequency table selected (the default) under Preprocessing Summary
and select Zipf’s plot. Increase the Most frequent terms to 20 and select
Maximum corresponding documents. The Term frequency table will include
the top 20 most frequently occurring terms. The first column, Collection
Frequency, displays the number of times the term appears in the collection. The
2nd column, Document Frequency, will display the number of documents that
include the term. The third column, Top Documents, will display the top 5
documents where the corresponding term appears the most frequently. The Zipf
Plot graphs the document frequency against the term ranks in descending order
of frequency. Zipf’s law states that the frequency of terms used in free-form
text drops off sharply following a power law, i.e. people tend to use a relatively
small number of words extremely frequently and a large number of words very
rarely.
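Zipf's law is easy to check on any corpus: rank the terms by frequency, and frequency × rank should stay roughly constant. A sketch on a hypothetical mini-corpus:

```python
from collections import Counter

# Hypothetical mini-corpus: one string per document.
docs = ["the car needs a new battery",
        "the battery died on the highway",
        "a car and a truck and a trailer"]
counts = Counter(token for doc in docs for token in doc.lower().split())

for rank, (term, freq) in enumerate(counts.most_common(), start=1):
    print(rank, term, freq, freq * rank)   # freq * rank ~ constant under Zipf's law
```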
Keep Show documents summary selected and check Keep a short excerpt under
Documents. Text Miner will produce a table displaying the document ID, the length
of the document, the number of terms, and the first 20 characters of the text of each
document.



Select all plots under Concept Extraction to produce various plots in the output.
Select Write text mining model under Text Miner Model to write the model to an
output sheet.

Click the Finish button to run the Text Mining analysis. Result worksheets are
inserted to the right.

Output Results
The Output Navigator appears at the top of each output worksheet. Clicking any
of these links will allow you to "jump" to the desired output.

Term Count and Document Info


Select the TM_Output tab. The Term Count table shows that the original term
count in the documents was reduced by 14.26% by the removal of stopwords,
excluded terms, synonyms, phrase removal and other specified preprocessing
procedures.

Scroll down to the Documents table. This table lists each document with its
length and number of terms; if Keep a short excerpt is selected on the Output
Options tab and a value is present for Number of characters, an excerpt
from each document will also be displayed.



Term-Document Matrix
Click TM_TDM, to display the Term – Document Matrix. As discussed above,
this matrix lists the 200 most frequently appearing terms across the top and the
document IDs down the left. A portion of this table is shown below. If a term
appears in a document, a weight is placed in the corresponding column
indicating the importance of the term using our selection of TF-IDF on the
Representation dialog.

Vocabulary Matrix
Click TM_Vocabulary to view the Final List of Terms table. This table contains
the top 20 terms occurring in the document collection, the number of documents
that include the term and the top 5 document IDs where the corresponding term
appears most frequently. In this list we see terms such as “car”, “power”,
“engine”, “drive”, and “dealer” which suggests that many of the documents,
even the documents from the electronic newsgroup, were related to autos.

When you click on the TM_Vocabulary tab, the Zipf Plot opens. We see that
our collection of documents obey the power law stated by Zipf (see above). As
we move from left to right on the graph, the documents that contain the most
frequently appearing terms (when ranked from most frequent to least frequent)
drop quite steeply. Hover over each data point to see the detailed information
about the term corresponding to this data point.
Note: To view charts in the Data Mining Cloud app, click Charts on the Ribbon,
select the desired worksheet, in this case TM_Vocabulary, then select the
desired chart.
The term “numbertoken” is the most frequently occurring term in the document
collection appearing in 223 documents (out of 300), 1,083 times total. Compare



this to a less frequently occurring term such as "thing" which appears in only 64
documents and only 82 times total.

Concept Importance
Click TM_LSASummary to view the Concept Importance and Term Importance
tables. The first table, the Concept Importance table, lists each concept, its
singular value, the cumulative singular value and the % singular value
explained. (The number of concepts extracted is the minimum of the number of
documents (300 in our sample) and the number of terms (limited to 200).) These
values are used to determine which concepts should be used in the Concept –
Document Matrix, Concept – Term Matrix and the Scree Plot, according to the
user's selection on the Representation tab. In this example, we kept the default
of 5 for Maximum number of concepts.

The Term Importance table lists the 200 most important terms. (To increase the
number of terms from 200, enter a larger value for Maximum Vocabulary on the
Pre-processing tab of Text Miner.)
When you click the TM_LSASummary tab, the Scree Plot opens. This plot
gives a graphical representation of the contribution or importance of each
concept. The largest “drop” or “elbow” in the plot appears between the 1st and
2nd concept. This suggests that the first top concept explains the leading topic in
our collection of documents. Any remaining concepts have significantly
reduced importance. However, we can always select more than 1 concept to
increase the accuracy of the analysis – it is advised to examine the Concept
Importance table and the “Cumulative Singular Value” in particular to identify
how many top concepts capture enough information for your application.

Concept Document Matrix


Click TM_LSA_CDM to display the Concept – Document Matrix. This matrix
displays the top concepts (as selected on the Representation tab) along the top of
the matrix and the documents down the left side of the matrix.
When you click on the TM_LSA_CDM tab, the Concept-Document Scatter Plot
opens. This graph is a visual representation of the Concept – Document
matrix. Note that Analytic Solver Data Mining normalizes each document
representation so it lies on a unit hypersphere. Documents that appear in the
middle of the plot, with concept coordinates near 0 are not explained well by
either of the shown concepts. The further the magnitude of coordinate from
zero, the more effect that particular concept has for the corresponding document.
In fact, two documents placed at the extremes of a concept (one close to -1 and
the other to +1) indicate strong differentiation between these documents in terms
of the extracted concept. This provides a means for understanding the actual
meaning of the concept and investigating which concepts have the largest
discriminative power when used to represent the documents from the text
collection.

You can examine all extracted concepts by changing the axes of the scatter plot:
click the down-pointing arrow next to Concept 1 to change the concept on the X
axis, or the right-pointing arrow next to Concept 2 to change the concept on the
Y axis. Use your touchscreen or your mouse scroll wheel to zoom in and out.

Term-Concept Matrix
Double click TM_LSA_CTM to display the Concept – Term Matrix which lists
the top 5 most important concepts along the top of the matrix and the top 200
most frequently appearing terms down the side of the matrix.

When you click on the TM_LSA_CTM tab, the Term-Concept Scatter Plot
opens. This graph is a visual representation of the Concept – Term Matrix. It
displays all terms from the final vocabulary in terms of two concepts. Similarly
to the Concept-Document scatter plot, the Concept-Term scatter plot visualizes
the distribution of vocabulary terms in the semantic space of meaning extracted
with LSA. The coordinates are also normalized, so the range of axes is always [-
1,1], where extreme values (close to +/-1) highlight the importance or “load” of
each term to a particular concept. The terms appearing in a zero-neighborhood
of a concept's range do not contribute much to the concept definition. In our example,
if we identify a concept having a set of terms that can be divided into two
groups, one related to “Autos” and the other to “Electronics”, and these groups are
distant from each other on the axis corresponding to this concept, this would
definitely provide evidence that this particular concept “caught” some pattern
in the text collection that is capable of discriminating the topic of an article.
Therefore, the Term-Concept scatter plot is an extremely valuable tool for
examining and understanding the main topics in the collection of documents,
finding similar words that indicate a similar concept, or finding terms that explain
the concept from “opposite sides” (e.g., term1 can be related to cheap, affordable
electronics and term2 to expensive luxury electronics).
Recall that if you want to examine a different pair of concepts, click the down-
pointing arrow next to Concept 1 and the right-pointing arrow next to Concept 2
to change the concepts on either axis. Use your touchscreen or mouse scroll
wheel to zoom in and out.

Stored PMML models for TFIDF and LSA


Since "Write text mining model" on the Output Options tab, two more tabs are
created containing PMML models for the TFIDF and LSA models. These
models can be used when scoring a series of new documents. For more
information on how to process this new data using these two saved models, see
the Text Mining chapter within the Data Mining Reference Guide.

Classification with Concept Document Matrix


From here, we can use any of the six classification algorithms to classify our
documents according to some term or concept using the Term – Document
matrix, Concept – Document matrix or Concept – Term matrix where each
document becomes a “record” and each concept becomes a “variable”. If we
want to classify documents based on a binary variable such as Auto
email/non-Auto email, we would use either the Term – Document or
Concept – Document matrix. If we want to cluster or classify terms, we
would use the Term – Concept matrix. We could even use the transpose of the
Term – Document matrix, where each term would become a “record” and each
column would become a “feature”.
In this example, we will use the Logistic Regression Classification method to
create a classification model using the Concept – Document matrix on
TM_LSA_CDM. Recall that this matrix includes the top concepts
extracted from the document collection across the top of the matrix and each
document in the sample down the left. Each concept will now become a
“feature” and each document will now become a “record”.
First, we’ll need to append a new column with the class that the document is
currently assigned: electronics or autos. Since we sorted our documents at the
beginning of the example starting with Autos, we can simply enter “Autos” into
column I for Document IDs 101553 through 103096 (or cells I13:I162) and
enter “Electronics” into column I for Document IDs 52434 through 53879 (or
cells I163:I312). Give the column the title "Type".



Next, we’ll need to partition our data into two datasets: a training dataset where
the model will be “trained” and a validation dataset where the newly created
model can be tested, or validated. When the model is being trained, the actual
class label assignments are “shown” to the algorithm in order for it to “learn”
which variables (or concepts) result in an “Autos” or “Electronics”
assignment. When the model is being validated or tested, the known
classification is only used to evaluate the performance of the algorithm. Click
Partition – Standard Partition on the Text Miner ribbon to open the Standard
Data Partition dialog. Select all variables in the Variables In Input Data
listbox, then click > to move them to the Selected Variables listbox. Select
Specify percentages under Partitioning percentages when picking up rows
randomly (at the bottom) and enter 80 for Training Set; 20 will automatically
be entered for Validation Set.



Click Finish to partition the data into two randomly selected datasets: The
Training dataset containing 80% of the “records” (or documents) and the
Validation dataset containing 20% of the “records”. (For more information on
partitioning, please see the Standard Partitioning chapter that appears in the
Analytic Solver Data Mining Reference Guide.)
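Outside of Analytic Solver, an equivalent 80/20 random partition can be sketched with scikit-learn; the matrix X and labels y below are synthetic stand-ins for the concept-document matrix and the Type column, not data produced by this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)
X = rng.random((300, 5))                       # stand-in: 300 documents x 5 concepts
y = np.repeat(["Autos", "Electronics"], 150)   # stand-in: the "Type" labels

# 80% training / 20% validation, selected at random with a fixed seed.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=12345)
print(len(X_train), len(X_valid))              # 240 60
```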
Now click Classify – Logistic Regression to open the Logistic Regression – Step
1 of 3 dialog. Select all 5 concepts in the Variables In Input Data listbox and
click > to move them to the Selected Variables listbox. Doing so selects these
variables as inputs to the classification method. Select Type, then click the >
next to Output Variable to add this variable as the Output Variable.

Click Finish to accept all defaults and run Logistic Regression.


Select the DA_ValidationScore tab and scroll down to the Validation Classification
Summary, shown below.
Text Miner used the training dataset to “train” the Logistic Regression model to
classify each “record” (or document) as an “autos” or “electronics” document.
Afterwards, Text Miner tested the newly created Logistic Regression model on
the records in the validation dataset and assigned each record (or document) a
classification.



As you can see in the reports above, Logistic Regression was able to correctly
classify 50 out of a total of 60 documents in the validation partition, which
translates to an overall error of 16.67%. (For more information on how to read
the summary report, see the Logistic Regression chapter later on in this guide.)
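For readers who want to reproduce this style of model outside Excel, a comparable fit-and-score step in scikit-learn might look like the following, continuing the synthetic stand-in data from the partitioning sketch above; it illustrates the workflow, not this example's actual numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)
X = rng.random((300, 5))                        # stand-in concept-document matrix
y = np.repeat(["Autos", "Electronics"], 150)    # stand-in class labels
X[:150, 0] += 0.5                               # give the two classes some separation

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=12345)

model = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", model.score(X_valid, y_valid))
```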
This concludes our example on how to use Analytic Solver Data Mining’s Text
Miner feature. This example has illustrated how Analytic Solver Data Mining
provides powerful tools for importing a collection of documents for
comprehensive text preprocessing, quantitation, and concept extraction, in order
to create a model that can be used to process new documents – all performed
without any manual intervention. When using Text Miner in conjunction with
our classification algorithms, Analytic Solver Data Mining can be used to
classify customer reviews as satisfied/not satisfied, distinguish between which
products garnered the least negative reviews, extract the topics of articles,
cluster the documents/terms, etc. The applications for Text Miner are endless!



Exploring a Time Series Dataset

Introduction
Time series datasets contain a set of observations generated sequentially in time.
Organizations of all types and sizes utilize time series datasets for analysis and
forecasting: predicting next year’s sales figures, raw material demand,
monthly airline bookings, and so on.

Example of a time series dataset: Monthly airline bookings.

A time series model is used first to obtain an understanding of the underlying
forces and structure that produced the data, and second to fit a model that
will predict future behavior. In the first step, the analysis of the data, a model is
created to uncover seasonal patterns or trends in the data, for example, bathing
suit sales in June. In the second step, forecasting, the model is used to predict
the value of the data in the future, for example, next year’s bathing suit sales.
Separate modeling methods are required to create each type of model.
Analytic Solver Data Mining features two techniques for exploring trends in a
dataset, ACF (Autocorrelation function) and PACF (Partial autocorrelation
function). These techniques help the user to explore various patterns in the data
which can be used in the creation of the model. After the data is analyzed, a
model can be fit to the data using Analytic Solver Data Mining’s ARIMA
method.

Autocorrelation (ACF)
Autocorrelation (ACF) is the correlation between neighboring observations in a
time series. When determining whether an autocorrelation exists, the original time
series is compared to the “lagged” series, which is simply the original series
shifted by one or more time periods (Yt compared with Yt−k). Suppose there are 5
time-based observations: 10, 20, 30, 40, and 50. When lag = 1, the original
series is moved forward one time period. When lag = 2, the original series is
moved forward two time periods.



Day   Observed Value   Lag-1   Lag-2
1     10
2     20               10
3     30               20      10
4     40               30      20
5     50               40      30
The autocorrelation is computed according to the formula:

$$r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}, \qquad k = 0, 1, 2, \ldots, n$$

where $Y_t$ is the observed value at time $t$, $\bar{Y}$ is the mean of the observed values,
and $Y_{t-k}$ is the value for Lag-$k$.

For example, using the values above, the autocorrelations for Lag-1 and Lag-2
can be calculated as follows.

$\bar{Y} = (10 + 20 + 30 + 40 + 50)/5 = 30$

$r_1 = \frac{(20-30)(10-30) + (30-30)(20-30) + (40-30)(30-30) + (50-30)(40-30)}{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2} = \frac{400}{1000} = 0.4$

$r_2 = \frac{(30-30)(10-30) + (40-30)(20-30) + (50-30)(30-30)}{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2} = \frac{-100}{1000} = -0.1$
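The same numbers can be reproduced with a few lines of Python; this is a direct, minimal implementation of the formula above.

```python
import numpy as np

def acf(y, k):
    """Autocorrelation of series y at lag k, per the formula above."""
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    num = np.sum((y[k:] - ybar) * (y[:len(y) - k] - ybar))
    den = np.sum((y - ybar) ** 2)
    return num / den

series = [10, 20, 30, 40, 50]
print(acf(series, 1))   # 0.4
print(acf(series, 2))   # -0.1
```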
The two red horizontal lines on the ACF plot delineate the Upper Confidence
Level (UCL) and the Lower Confidence Level (LCL). If the data is random, then
the plot should stay within the UCL and LCL. If the plot exceeds either of these
two levels, then it can be presumed that some correlation exists in the data.

Partial Autocorrelation Function (PACF)


This technique is used to compute and plot the partial autocorrelations between
the original series and the lags. However, PACF eliminates all linear
dependence in the time series beyond the specified lag.
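For reference outside Analytic Solver, the statsmodels package exposes both functions; a minimal sketch on a toy series, assuming statsmodels is installed:

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

series = np.array([10, 20, 30, 40, 50, 40, 30, 20, 10, 20], dtype=float)
print(np.round(acf(series, nlags=3), 3))    # autocorrelations, lags 0..3
print(np.round(pacf(series, nlags=3), 3))   # partial autocorrelations, lags 0..3
```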

ARIMA
An ARIMA (autoregressive integrated moving average) model is a
regression-type model that includes autocorrelation. The basic assumption in
estimating the ARIMA coefficients is that the data are stationary, that is, the
mean and variance are constant over time and not affected by trend or
seasonality. This is generally not true of raw data. To achieve stationary data,
Analytic Solver Data Mining will first apply “differencing”: ordinary, seasonal
or both.

After Analytic Solver Data Mining fits the model, various results will be
available. The quality of the model can be evaluated by comparing the time plot
of the actual values with the forecasted values. If both curves are close, then it
can be assumed that the model is a good fit. The model should expose any
trends and seasonality, if any exist. If the residuals are random then the model
can be assumed a good fit. However, if the residuals exhibit a trend, then the
model should be refined. Fitting an ARIMA model with parameters (0,1,1) will
give the same results as exponential smoothing. Fitting an ARIMA model with
parameters (0,2,2) will give the same results as double exponential smoothing.
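Programmatically, an ARIMA fit follows the same (p, d, q) parameterization. A minimal statsmodels sketch on synthetic data (a stand-in series, not the Income.xlsx data used below), assuming statsmodels is installed:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(12345)
series = np.cumsum(rng.normal(1.0, 0.5, 80))    # synthetic trending series

# p=1 autoregressive term, d=1 ordinary difference, q=0 moving-average terms.
model = ARIMA(series, order=(1, 1, 0)).fit()
print(model.summary())
print(model.forecast(steps=5))                  # forecast five periods ahead
```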

Partitioning
To avoid overfitting the data and to be able to evaluate the predictive
performance of the model on new data, we must first partition the data into
training and validation sets using Analytic Solver Data Mining’s time series
partitioning utility. After the data is partitioned, ACF, PACF, and ARIMA can
be applied to the dataset.

Examples for Time Series Analysis


The examples below illustrate how Analytic Solver Data Mining can be used to
explore the Income.xlsx dataset to uncover trends and seasonalities in a dataset.
Click Help – Examples on the Data Mining ribbon, then Forecasting/Data
Mining Examples and open the example dataset, Income.xlsx. This dataset
contains the average income of tax payers by state.
Typically the following steps are performed in a time series analysis.
1. The data is first partitioned into two sets, with 60% of the data assigned to
the training set and 40% of the data assigned to the validation set.
2. Exploratory techniques are applied to both the training and validation sets.
If the results are in sync, then the model can be fit. If the ACF and PACF
plots are the same, then the same model can be used for both sets.
3. The model is fit using the ARIMA method.
4. When we fit a model using the ARIMA method, Analytic Solver displays
the ACF and PACF plots for the residuals. If these plots are within the band
of the UCL and LCL, this indicates that the residuals are random and the
model is adequate.
5. If the residuals are not within the bands, then some correlation exists, and
the model should be improved.
First we must perform a partition on the data. Click Partition within the Time
Series group on the Data Mining ribbon to open the following dialog.
Select Year under Variables and click > to define the variable as the Time
Variable. Select the remaining variables under Variables and click > to include
them in the partitioned data.
Select Specify #Records under Specify Partitioning Options to specify the
number of records assigned to the training and validation sets. Enter 50 for the
number of Training Set records and 21 for the number of Validation Set records.
If Specify Percentages is selected under Specify Partitioning Options, Analytic
Solver Data Mining will assign a percentage of records to each set according to
the values entered by the user or automatically entered by Analytic Solver Data
Mining under Specify Percentages for Partitioning.

Click OK. TSPartition is inserted into the Model tab of the Solver task pane
under Reports – Time Series Partition – Run 1.

Note in the output above, the partitioning method is sequential (rather than
random). The first 50 observations have been assigned to the training set and
the remaining 21 observations have been assigned to the validation set.
Open the Lag Analysis dialog by clicking ARIMA – Lag Analysis. Select CA
under Variables In input data, then click > to move the variable to Selected
variable. Enter 1 for Minimum Lag and 40 for Maximum Lag under
Parameters: Training and 1 for Minimum Lag and 15 for Maximum Lag under
Parameters: Validation.



Under Charting, select ACF chart, ACVF chart, and PACF chart to include each
chart in the output.

Click OK. TS_Lags is inserted into the task pane under Reports –
Autocorrelations – Run 1.

First, let's take a look at the ACF charts. Note on each chart, the autocorrelation
decreases as the number of lags increase. This suggests that a definite pattern
does exist in each partition. However, since the pattern does not repeat, it can be
assumed that no seasonality is included in the data. In addition, both charts
appear to exhibit a similar pattern.



The PACF functions show a definite pattern which means there is a trend in the
data. However, since the pattern does not repeat, we can conclude that the data
does not show any seasonality.
A plot of the autocovariance values has been added to the output.

All three charts suggest that a definite pattern exists in the data, but no
seasonality. In addition, both datasets exhibit the same behavior in both the
training and validation sets which suggests that the same model could be
appropriate for each. Now we are ready to fit the model.
The ARIMA model accepts three parameters: p – the number of autoregressive
terms, d – the number of non-seasonal differences, and q – the number of lagged
errors (moving averages).
Recall that the ACF plot showed no seasonality in the data, with the
autocorrelation decreasing steadily as the number of lags increases. This
suggests setting q = 0, since there appears to be no lagged error structure. The
PACF plot displayed a large value for the first lag but minimal values for
successive lags, which suggests setting p = 1. With most datasets, setting d = 1 is
sufficient, or can at least be a starting point.
Click back to the TSPartition tab and then click ARIMA – ARIMA Model to
bring up the Time Series – ARIMA dialog.
Select CA under Variables In input data then click > to move the variable to the
Selected Variable field. Under Nonseasonal Parameters set Autoregressive (p)
to 1, Difference (d) to 1 and Moving Average (q) to 0.



Click Advanced to open the ARIMA – Advanced Options dialog. Select Fitted
Values and residuals, Produce forecasts, and Report Forecast Confidence
Intervals. The default Confidence Level setting of 95 is automatically entered.
The option Variance-covariance matrix is selected by default.

Click OK on the ARIMA-Advanced Options dialog and again on the Time Series
– ARIMA dialog. Analytic Solver Data Mining calculates and displays various
parameters and charts in four output sheets, Arima_Output, Arima_Fitted,
Arima_Forecast and Arima_Stored. Click the Arima_Output tab to view the
Output Navigator.

Click the ARIMA Model link on the Output Navigator to move to display the
ARIMA Model and Ljung-Box Test Results on Residuals.



Analytic Solver has calculated the constant term and the AR1 term for our
model, as seen above. These are the constant and φ1 terms of our forecasting
equation. See the output of the Chi-square test that follows.
The very small p-values for the constant term (1.119E-7) and the AR1 term
(1.19E-89) suggest that the model is a good fit to our data.
Click the Lag Analysis: Residuals – Training link. This table plots the actual
and fitted values and the resulting residuals for the training partition. As shown
in the graph below, the Actual and Forecasted values match up fairly well. The
usefulness of the model in forecasting will depend upon how close the actual
and forecasted values are in the Forecast, which we will inspect later.

Use your mouse to select a point on the graph to compare the Actual value to the
Forecasted value.



Take a look at the ACF and PACF plots for Errors found at the bottom of
ARIMA_Output. One additional chart was added starting in V2017 - the ACVF
Plot for the Residuals.

With the exception of Lag1, the majority of the lags in the PACF and ACF
charts are either clearly within the UCL and LCL bands or just outside of these
bands. This suggests that the residuals are random and are not correlated.
Click the Forecast link on the Output Navigator to display the Forecast Data
table and charts.



The table shows the actual and forecasted values along with LCI (Lower
Confidence Interval), UCI (Upper Confidence Interval) and Residual values.
The "Lower" and "Upper" values represent the lower and upper bounds of the
confidence interval. There is a 95% chance that the forecasted value will fall
into this range. The graph to the right plots the Actual values for CA against
the Forecasted values. Again, click any point on either curve to compare the
Actual against the Forecasted values.



Classifying the Iris Dataset

Introduction
Analytic Solver Data Mining's chart wizard allows you to visualize the contents
of your dataset by allowing you to create up to 8 different types of charts
quickly and easily: bar chart, histogram, line chart, parallel coordinates,
scatterplot, scatterplot matrix, boxplot, and variable. Create a bar chart to
compare individual statistics (i.e. mean, count, etc.) across a group of variables
or a box plot to illustrate the shape of a distribution, it's central value and range
of data. Use a histogram to depict the range and scale of your observations at
variable intervals, a line chart to describe a time series dataset, or a parallel
coordinates plot to create a multivariate profile. A scatterplot can be created to
compare the relationship between two variables while a scatterplot matrix
combines several scatterplots into one panel allowing the user to see pairwise
relationships between variables. Finally, use the variable graph to plot each
variable's distribution. Two additional options, Export to PowerBI and Export to
Tableau, may be used to export your data to these 3rd party applications. For
more information, see the Analytic Solver Data Mining Reference Guide.

The following example walks you through a quick tutorial, using both the
Analytic Solver Data Mining Cloud and Desktop apps, creating several
different types of charts to allow a quick and thorough visualization of the well-
known Iris dataset.

Creating the Classification Model


In this example, we will use the well-known Iris dataset to craft a classification
model. The Iris dataset is a multivariate dataset introduced by Sir Ronald
Fisher in 1936 as an example of discriminant analysis. This dataset contains 50
observations from each of three Iris species: Iris setosa, Iris virginica and Iris
versicolor. Four characteristics were recorded from each sample: sepal length
and width, and petal length and width. Our goal is to build a model that will
assign new data points to the correct Iris species.
To open this dataset, click Help – Examples – Forecasting/Data Mining
Examples, then click the Iris link.
A portion of the dataset is shown below.



First, we will explore our dataset with graphical analysis. This is often a
beginning step when creating a classification or prediction model.
Visualizations can help identify patterns within each variable or in relationships
between variables. If a variable is numerical, charts such as histograms or
boxplots may be used to study the distribution of values and to identify outliers.
If the dataset is a time series, a line chart may be used to detect a trend in the
data, for example, the time of year when ski/snowboard sales spike. Bar charts
are typically used when a variable is categorical, such as the number of skis,
snowboards, and poles that are purchased during a winter sale. The iris dataset
includes both categorical and numerical data. Let's start with a simple bar chart
to examine the maximum, minimum and mean values of each numerical
variable, petal_length, petal_width, sepal_length and sepal_width.
• To create a chart in the Cloud app, select a cell within the dataset, say cell
A2, then simply click Explore – Chart Wizard. Select "Count" for
Statistic, "Species_name" for "Color by" and "Petal_length" for "Filter".



• To create a chart in the Desktop app, select a cell within the dataset, say cell
A2, then click Explore – Chart Wizard to open the Chart Wizard. Select
Bar Chart.

On the Y Axis Selection Dialog, select Petal_width, then click Next.



Select Species_name on the X Axis Selection Dialog, then click Finish. A
bar chart displays with Count of Petal width on the y axis and
Species_name on the x axis.

Both charts simply show the number of data points collected for petal_width for
each iris species. This is a quick and easy way to look for missing data points.



In the Cloud app, select "Petal_length" for Filter. In the Desktop app, click the
right pointing arrow on the y axis and select Petal_length.

Again, we see no missing data points.


Data Mining Desktop Data Mining Cloud

We can perform the same steps for both Sepal_width and Sepal_length.
In the Cloud app, change Statistic to "Maximum".
In the Desktop app, go back to the right pointing arrow on the y axis, and select
Petal_width then Maximum to display the mean petal width for each species.

This chart becomes a bit more interesting. Here, we see that the Setosa
species typically has a small petal width, while the Virginica species has a very
large petal width relative to the Setosa species. The petal width of
Versicolor is about midway between the two. This chart supports the idea of
including petal_width as a possible predictor in our classification model.



Data Mining Desktop Data Mining Cloud

We could use similar steps to examine the mean values by species of the
remaining numerical variables: petal_length, sepal_width and sepal_length.
Now let's create a scatterplot matrix to identify any trends in the relationships
between variables.
In the Cloud app, click the back arrow, then select the Scatterplot Matrix chart.
Select Petal_width, Petal_length, Sepal_width, and Sepal_length for Filter.

Now we can clearly see that Setosa iris’ have short, narrow petals and short,
wide sepals, while the Virginica species has long, wide petals and long,
medium-width sepals.
In the Desktop app, click the New Scatterplot Matrix icon in the title bar of the
chart.

On the Variable Selection Dialog, select Petal_width, Petal_length, Sepal_width


and Sepal_length. Then click Finish.



The Scatterplot Matrix opens within the same dialog.

Click the X in the upper right hand corner of the BarChart to remove this chart
from the display.



On the diagonal of the Scatterplot Matrix in Analytic Solver Desktop, we find
histograms of each variable. In Analytic Solver Cloud, the plot is the variable
plotted against itself. Each histogram displays the frequency of values for each
variable. Moving off to the right of the Petal_width histogram in the Desktop
app, or the left in the Cloud app, we find a scatterplot with Petal_width on
the y axis and Petal_length on the x axis. As you can see, there is a distinct
cluster of observations with narrow petal_widths and short petal_lengths.
Analytic Solver Desktop Analytic Solver Cloud

Let’s take a closer look by using our mouse to draw a square around the cluster
of points in the lower left of this scatterplot in each app.
Immediately, the observations that are in this cluster all turn red, if using the
Desktop app, or blue, if using the Cloud app, in each histogram and scatterplot
matrix. As you can see, not only do these observations have narrow petal widths
and short petal lengths but they also have short sepal lengths with slightly wider
sepal widths.

In the Desktop app, the records for each point in this cluster can be found under
Observations on the right of the Chart Wizard. The first thing we see is that this
cluster is made up of 50 points out of a total of 150. If we take a moment to
scroll through these 50 records using the right and left arrows, one common
feature starts to emerge – each of these records belongs to the Setosa species.



To remove the selection, click the icon.
In the Desktop app, let’s use the Color By feature to distinguish the records
between species type. Select Color By in the upper right hand corner of the
Chart Wizard and select Species Name (or Species Number).

From here we can start to customize a little deeper in the Desktop app. For
example, we can look only at the Versicolor species simply by unchecking both
Setosa and Virginica under Species_name.



We can further filter our graph by only looking at observations where
petal_width is between 1.5 and 2.5 and petal_length is between 4 and 6, by
either moving the sliders left and right or clicking the value at each end of the
slider and changing it appropriately.

Click the Reset icon in the top right corner to reset the filters. Now let’s add a
boxplot to our dialog, as an alternative to the histograms in the scatterplot
matrix. A boxplot is similar to a histogram in that it graphically displays a
variable's distribution of values, but a boxplot also includes several additional
useful statistics. In a boxplot, also known as a box and whisker plot, the
whiskers denote the minimum and maximum values and a "box" is used to



designate the 25th and 75th percentiles. In Analytic Solver Data Mining, the
box is shaded in green. The top of the box denotes the 75th percentile and the
bottom of the box denotes the 25th percentile. Inside the box are two lines, a
dotted line, indicating the variable's mean, and a solid line, indicating the
variable's median value. Looking at the three boxplots together, we can see that
the petal_widths for the Virginica and Versicolor species overlap, while the
petal_widths of the Setosa iris do not.
Data Mining Cloud

To close the chart in the Cloud app, simply click the x in the upper right hand
corner.
Analytic Solver Data Mining Desktop



Click the X in the upper right hand corner of the Scatterplot Matrix in the
Desktop app to remove it from the dialog and enlarge the Box Whisker Chart.
Click the right pointing arrow and select Petal_length from the drop-down menu.
In this plot we see that the petal lengths of the Virginica and Versicolor iris
overlap, while the petal length of the Setosa iris is much smaller.
Click the X in the title of the Chart Wizard dialog to close the Chart Wizard.
You will be asked to Save or Discard the chart. To save, enter a name such as
Box Whisker Petal Length in the Name field, then click Save. To discard, click
discard. To open a previously saved chart, click Explore – Existing Charts.
In the data mining field, datasets with large amounts of variables are routinely
encountered. In most cases, the size of the dataset can be reduced by removing
highly correlated or superfluous variables. The accuracy and reliability of a
classification or prediction model produced from this resultant dataset will be
improved by the removal of these redundant and unnecessary variables. In
addition, superfluous variables increase the data-collection and data-processing
costs of deploying a model on a large database. As a result, one of the first steps
in data mining should be finding ways to reduce the number of independent or
input variables used in the model (otherwise known as dimensionality) without
sacrificing accuracy.
Dimensionality Reduction is the process of reducing the number of variables
used as input to a prediction or classification model. One way to reduce the
number of dimensions in a classification model is to fit a model using a
classification tree, which splits on the variables that result in the best model.
Variables that are not included in the classification tree can be removed, as
illustrated in the sketch below.
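A hedged sketch of this idea using scikit-learn's decision tree on the iris data: variables with zero (or near-zero) importance in the fitted tree are candidates for removal.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Importances sum to 1; near-zero values mark variables the tree barely uses.
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```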
One very important issue when fitting a model is how well the newly created
model will behave when applied to new data. To address this issue, the dataset
is divided into multiple partitions: a training partition which is used to create or
"train" the model and a validation partition to test the performance of the model.
(For more information on partitioning, see the Analytic Solver Data Mining
Reference Guide.) For this particular example, we will partition the dataset
according to the Standard Partition defaults of 60% of records assigned to the
Training set and 40% of records assigned to the Validation set.
Click Partition – Standard Partition to open the Standard Data Partition
dialog. Select Species_No, Petal_width, Petal_length, Sepal_width, and
Sepal_length under Variables In Input Data, then click > to include them in the
partition. Click OK to accept the defaults and create the partitions.
STDPartition is inserted in the Model tab of the Analytic Solver task pane under
Data Mining – Results – Text Mining. Click the Partition Summary link in the
Output Navigator to highlight the records selected for the Training Dataset and
click the Validation Data link to highlight the records selected for the Validation
Dataset.

Analytic Solver Data Mining offers three powerful ensemble methods for use
with Classification trees: bagging (bootstrap aggregating), boosting, and
random trees. The Classification Tree Algorithm on its own can be used to find
one model that will result in a good classification of the new data. We can view
the statistics and confusion matrices of the current classifier to see if our model
is a good fit to the data, but how would we know if there is a better classifier?



The answer is – we don't. Ensemble methods, however, allow us to combine
multiple “weak” classification tree models which, when taken together form a
new, more accurate “strong” classification tree model. These methods work by
creating multiple diverse classification models, by taking different samples of
the original dataset, and then combining their outputs. Outputs may be
combined by several techniques, for example, majority vote for classification
and averaging for regression. This combination of models effectively reduces
the variance in the “strong” model. The three different types of ensemble
methods offered in Analytic Solver (bagging, boosting, and random trees) differ
on three points: 1. the selection of training data for each classifier or “weak”
model, 2. how the “weak” models are generated, and 3. how the outputs are
combined. In all three methods, each “weak” model is trained on its own
version of the training dataset and becomes proficient in some portion of the data.
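For example, if five “weak” classifiers predict the classes 1, 1, 0, 1, 0 for the
same record, the ensemble's majority vote classifies that record as a 1; if three
“weak” regression models predict 22.0, 24.0 and 23.0, the ensemble's averaged
prediction is 23.0.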
Click Classify – Ensemble – Boosting to open the Boosting – Data dialog.
Select Petal_width, Petal_length, Sepal_width, and Sepal_length as Selected
Variables and Species_No as the Output Variable.

Click Next to advance to the Classification Tree Boosting – Parameters dialog.


Click the down arrow beneath Weak Learner and select Decision Tree from the
menu. Note: If we had not previously partitioned the dataset, we could have
performed the partitioning from this dialog by clicking Partition Data.



Click Next to accept all default option settings and advance to the Classification
Tree Boosting – Scoring dialog. Select Detailed Report under both Score
Training Data and Score Validation Data.

Click Finish to accept all default option settings and run the Boosting algorithm.
Output sheets containing results of the algorithm will be inserted to the right of
the Data worksheet. Click the Training: Classification Summary and
Validation: Classification Summary links in the Output Navigator.
The algorithm performed perfectly on the training data partition as evidenced by
the overall error of 0.

In the validation set, four records were assigned incorrectly, resulting in a 6.67%
misclassification error.
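Since the 40% Validation partition of the 150-record dataset contains 60
records, the error works out to 4 / 60 = 0.0667, or 6.67%.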



If we repeat the same steps using the two remaining ensemble methods,
Bagging (using Decision Trees as the Weak Learner) and Random Trees, we will
get the following results.
Each ensemble method performed well, as evidenced by the overall errors in the
validation sets, but the Random Trees method performed best.

Bagging



Random Trees



Predicting Housing Prices using
Multiple Linear Regression

Introduction
Prediction algorithms are supervised learning methods which aim to estimate, or
forecast, a continuous output variable. For example, the predicted price of a
house that has just come on the market. Analytic Solver Data Mining includes 4
different prediction algorithms: Multiple Linear Regression, k-Nearest
Neighbors, Regression Tree, and Neural Networks. This example uses Multiple
Linear Regression to fit a prediction model using the Boston_Housing dataset.
The information in this dataset was gathered by the US Census Service from
census tracts within the Boston area. Each of the 14 features (or variables)
describes a characteristic impacting the selling price of a house. A description of
each variable is given in the table below. In addition to these variables, the data
set also contains an additional variable, which has been created by categorizing
median value (MEDV) into two categories – high (MEDV > 30) and low
(MEDV < 30).

CRIM   Per capita crime rate by town
ZN   Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS   Proportion of non-retail business acres per town
CHAS   Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX   Nitric oxides concentration (parts per 10 million)
RM   Average number of rooms per dwelling
AGE   Proportion of owner-occupied units built prior to 1940
DIS   Weighted distances to five Boston employment centers
RAD   Index of accessibility to radial highways
TAX   Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B   1000(Bk - 0.63)^2 where Bk is the proportion of African-Americans by town
LSTAT   % Lower status of the population
MEDV   Median value of owner-occupied homes in $1000's
Click Help – Examples on the Data Mining ribbon, then Forecasting/Data
Mining Examples, to open the Boston_Housing example dataset.

Multiple Linear Regression Example


Open or upload the Boston_Housing.xlsx example dataset. A portion of the
dataset is shown below. The last variable, CAT.MEDV, is a discrete
classification of the MEDV variable and will not be used in this example.



First, we partition the data into training and validation sets using the Standard
Data Partition defaults with percentages of 60% of the data randomly allocated
to the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Mining
Partitioning chapter within the Analytic Solver Data Mining Reference Guide.



Click Predict – Linear Regression to open the Linear Regression Data dialog.
Select MEDV as the Output Variable, CHAS as a Categorical Variable and all
remaining variables (except CAT. MEDV) as Selected variables.
Note: In this specific instance, since the CHAS variable is the only Categorical Variable selected
and its values are binary (0/1), the output results will be the same as if you had selected this variable as a
Selected Variable. If 1. there were more categorical variables, 2. the CHAS variable were
non-binary, or 3. another algorithm besides Multiple Linear Regression or Logistic Regression were
selected, this would not be true.

Click Next to advance to the Parameters dialog.


If the number of rows in the data is less than the number of variables selected as
Input variables, the following message box is displayed.

Select OK to be directed to the Feature Selection dialog.



If Fit Intercept is selected, the intercept term will be fitted, otherwise there will
be no constant term in the equation. Leave this option selected for this example.
Under Regression: Display, select all 6 display options to display each in the
output.
Under Statistics, select:
• ANOVA
• Variance-Covariance Matrix
• Multicollinearity Diagnostics
Under Advanced, select:
• Analysis of Coefficients
• Analysis of Residuals
• Influence Diagnostics
• Confidence/Prediction Intervals
When you have a large number of predictors and you would like to limit the
model to only the significant variables, click Feature Selection to open the
Feature Selection dialog and select Perform Feature Selection at the top of the
dialog. Increment the value for Maximum Subset Size to 12. This option can
take on values of 1 up to N where N is the number of Selected Variables. The
default setting is N.

Analytic Solver Data Mining offers five different selection procedures for
selecting the best subset of variables.
• Backward Elimination in which variables are eliminated one at a time,
starting with the least significant. If this procedure is selected, FOUT
is enabled. A statistic is calculated when variables are eliminated. For
a variable to leave the regression, the statistic’s value must be less than
the value of FOUT (default = 2.71).
• Forward Selection in which variables are added one at a time, starting
with the most significant. If this procedure is selected, FIN is enabled.
On each iteration of the Forward Selection procedure, each variable is
examined for its eligibility to enter the model. The significance of
variables is measured as a partial F-statistic. Given a model at a current
iteration, we perform an F Test, testing the null hypothesis stating that
the regression coefficient would be zero if added to the existing set of
variables and an alternative hypothesis stating otherwise. Each variable
is examined to find the one with the largest partial F-Statistic. The
decision rule for adding this variable into a model is: Reject the null
hypothesis if the F-Statistic for this variable exceeds the critical value
chosen as a threshold for the F Test (FIN value), or Accept the null
hypothesis if the F-Statistic for this variable is less than a threshold. If
the null hypothesis is rejected, the variable is added to the model and
selection continues in the same fashion, otherwise the procedure is
terminated.
• Sequential Replacement in which variables are sequentially replaced
and replacements that improve performance are retained.
• Stepwise selection is similar to Forward selection except that at each
stage, Analytic Solver Data Mining considers dropping variables that
are not statistically significant. When this procedure is selected, the
Stepwise selection options FIN and FOUT are enabled. In the stepwise
selection procedure a statistic is calculated when variables are added or
eliminated. For a variable to come into the regression, the statistic’s
value must be greater than the value for FIN (default = 3.84). For a
variable to leave the regression, the statistic’s value must be less than
the value of FOUT (default = 2.71). The value for FIN must be greater
than the value for FOUT.
• Best Subsets where searches of all combinations of variables are
performed to observe which combination has the best fit. (This option
can become quite time consuming depending on the number of input
variables.) If this procedure is selected, Number of best subsets is
enabled.
Click Done to accept the default procedure, Backward Elimination, with the
Maximum Subset Size of 12 set earlier and the F-out setting of 2.71, and return
to the Parameters dialog. Then click Next to advance to the Scoring dialog.



Select all options under Score Training Data and Score Validation Data to
produce all three reports in the output. Since we did not create a Test Partition,
the options under Score Test Data are disabled. See the chapter "Data Mining
Partitioning" in the Analytic Solver Data Mining Reference Guide for more
information on how to create a test partition.

Click Finish. Results will be inserted to the right of the Data sheet. Double
click LinReg_Output to find the Output Navigator. Click any link here to
display the selected output.

Click the Predictor Screening hyperlink in the Output Navigator to display the
Model Predictors table. In Analytic Solver Data Mining, a new preprocessing
feature selection step was added in V2015 to take advantage of automatic
variable screening and elimination, using Rank-Revealing QR Decomposition.
This allows Analytic Solver Data Mining to identify the variables causing
multicollinearity, rank deficiencies, and other problems that would otherwise
cause the algorithm to fail. Information about “bad” variables is used in
Variable Selection and Multicollinearity Diagnostics, and in computing other
reported statistics. Included and excluded predictors are shown in the Model
Predictors table. In this model there was one excluded predictor, the Intercept.
All remaining predictors were eligible to enter the model, passing the tolerance
threshold of 1.17691E-12. This threshold denotes a tolerance beyond which the
variance-covariance matrix is not exactly singular to within machine precision.
The test is based on the diagonal elements of the triangular factor R resulting
from Rank-Revealing QR Decomposition. Predictors that do not pass the test are
excluded. Note: If a predictor is excluded, the corresponding coefficient
estimates will be 0 in the regression model and the variance-covariance matrix
will contain all zeros in the rows and columns that correspond to the excluded
predictor.
Multicollinearity diagnostics, variable selection and other remaining output will
be calculated for the reduced model.
The design matrix may be rank-deficient for several reasons. The most common
cause of an ill-conditioned regression problem is the presence of feature(s) that
can be exactly or approximately represented by a linear combination of other
feature(s). For example, assume that among predictors you have 3 input
variables X, Y, and Z where Z = a * X + b * Y and a and b are constants. This
will cause the design matrix to not have a full rank. Therefore, one of these 3
variables will not pass the threshold for entrance and will be excluded from the
final regression model.



Click the Training: Prediction Details link to open the Training: Prediction
Details data table and the Validation: Prediction Details link to open the
Validation: Prediction Details data table. Of primary interest in a data-mining
context will be the predicted and actual values for each record, along with the
residual (difference) for each predicted value.

Analytic Solver Data Mining also displays summaries of the total sum of
squared errors for both the training and validation data sets in
LinReg_TrainingScore and LinReg_ValidationScore. The total sum of squared
errors is the sum of the squared deviations between predicted and actual values;
the root mean square error is the square root of the average squared error. The
average error is typically very small, because positive prediction errors tend to
be counterbalanced by negative ones.
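In symbols, with y_i the actual value of the output variable for record i, y_hat_i
the predicted value, and n the number of records in the partition:

SSE = sum over all n records of (y_i - y_hat_i)^2
RMSE = sqrt(SSE / n)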



Select the Feature Selection link on the Output Navigator to display the
Variable Selection table which displays a list of different models generated
using the selections made on the Feature Selection dialog. When Backward
elimination is used, Linear Regression may stop early when there is no variable
eligible for elimination, as evidenced in the table below (i.e. there are no subsets
with fewer than 12 coefficients). Since Fit Intercept was selected on the
Parameters tab, each subset includes an intercept.

The error values calculated are:


• RSS: The residual sum of squares, or the sum of squared deviations
between the predicted and actual values of the output variable.
• Cp: Mallows Cp (Total squared error) is a measure of the error in the
best subset model, relative to the error incorporating all variables.
Adequate models are those for which Cp is roughly equal to the
number of parameters in the model (including the constant), and/or Cp
is at a minimum. (See the formula following this list.)
• R-Squared: R-squared goodness-of-fit.
• Adj. R-Squared: Adjusted R-squared value.
• Probability: A quasi hypothesis test of the proposition that a given
subset is acceptable; if Probability < .05 we can rule out that subset.
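For reference, Mallows Cp is conventionally computed as

Cp = RSS_p / s^2 - n + 2p

where RSS_p is the residual sum of squares for a subset with p coefficients
(including the constant), s^2 is the mean squared error of the full model
containing all variables, and n is the number of records.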
Compare the RSS value as the number of coefficients in the subset decreases
from 13 to 12 (7794.742 to 7801.43). The RSS for 12 coefficients is only
slightly higher than the RSS for 13 coefficients, suggesting that a model with 12
coefficients may be sufficient to fit a regression.
Click the Coefficients link on the Output Navigator to display the Regression
Summary and Coefficients table shown below.



Note: If a variable has been eliminated by Rank-Revealing QR Decomposition,
the variable will appear in red in the Coefficients table with a 0 for Estimate, CI
Lower, CI Upper, Standard Error and N/A for T-Statistic and P-Value.
The Regression Model table contains the estimate, the standard error of the
coefficient, the p-value, confidence intervals, and the Sum of Squared Error for
each variable included in the model.
The Sum of Squared Errors is calculated as each variable is introduced into the
model beginning with the constant term and continuing with each variable as it
appears in the dataset.
Analytic Solver Data Mining produces 95% Confidence Intervals for the
estimated coefficients: with 95% confidence, the true value of each coefficient
lies within the interval reported in the CI Lower and CI Upper columns.

Summary Statistics, in the Regression Summary table, show the residual degrees
of freedom (#observations - #predictors), the R-squared value, a standard
deviation type measure for the model (which typically has a chi-square
distribution), and the Residual Sum of Squares error.
The R-squared value shown here is the standard goodness-of-fit measure for a
linear regression model, defined as
R2 = 1 - RSS/TSS,
where RSS is the residual sum of squares from the fitted model and TSS is the
total sum of squares of the output variable about its mean. R2 measures the
proportion of variance in the output variable that is explained by the model.
Click the Multicollinearity Diags link to display the Multicollinearity
Diagnostics table. This table helps assess whether two or more variables so
closely track one another as to provide essentially the same information.



The columns represent the variance components (related to principal
components in multivariate analysis), while the rows represent the variance
proportion decomposition explained by each variable in the model. The
eigenvalues are those associated with the singular value decomposition of the
variance-covariance matrix of the coefficients, while the condition numbers are
the ratios of the square root of the largest eigenvalue to all the rest. In general,
multicollinearity is likely to be a problem with a high condition number (more
than 20 or 30), and high variance decomposition proportions (say more than 0.5)
for two or more variables.
Click the Residuals link to open the Residuals table. This table displays the
Raw Residuals, Standardized Residuals, Studentized Residuals and Deleted
Residuals.

Studentized residuals are computed by dividing the unstandardized residuals by
quantities related to the diagonal elements of the hat matrix, using a common
scale estimate computed without the ith case in the model. These residuals have
t-distributions with (n-k-1) degrees of freedom. As a result, any residual with
absolute value exceeding 3 usually requires attention.
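In the notation used for DFFits below, the Studentized residual for the ith
observation can be written as

e_i^stud = e_i / (sigma(-i) * sqrt(1 - h_i))

where e_i is the raw residual, sigma(-i) is the scale estimate computed without
the ith case, and h_i is the ith diagonal element of the hat matrix.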
The Deleted residual is computed for the ith observation by first fitting a model
without the ith observation, then using this model to predict the ith observation.
Afterwards the difference is taken between the predicted observation and the
actual observation.
Click the Influence Diagnostics link on the Output Navigator to display the
Influence Diagnostics data table. This table contains various statistics computed
by Analytic Solver Data Mining.
The Cook's Distance for each observation is displayed in this table. This is an
overall measure of the impact of the ith datapoint on the estimated regression
coefficients. In linear models Cook's Distance has, approximately, an F
distribution with k and (n-k) degrees of freedom.
The DFFits for each observation is displayed in the output. DFFits gives
information on how the fitted model would change if a point were not included
in the model.
Analytic Solver Data Mining computes DFFits using the following computation:
DFFITS_i = (y_hat_i - y_hat_i(-i)) / (sigma(-i) * sqrt(h_i)) = e_i^stud * sqrt(h_i / (1 - h_i))
where
y_hat_i = i-th fitted value from full model
y_hat_i(-i) = i-th fitted value from model not including i-th observation
sigma(-i) = estimated error variance of model not including i-th observation
h_i = leverage of i-th point (i.e. {i,i}-th element of Hat Matrix)



e_i = i-th residual from full model
e_i^stud = i-th Studentized residual
The covariance ratios are displayed in the output. This measure reflects the
change in the variance-covariance matrix of the estimated coefficients when the
ith observation is deleted.
The diagonal elements of the hat matrix are displayed under the Leverage
column. This measure is also known as the Leverage of the ith observation.
Click either the Intervals: Training or Intervals: Validation links in the
Output Navigator to view the Intervals report for both the Training and
Validation partitions. Of primary interest in a data-mining context will be the
predicted and actual values for each record, along with the residual (difference)
and Confidence and Prediction Intervals for each predicted value.

Analytic Solver Data Mining produces 95% Confidence and Prediction Intervals
for the predicted values. Typically, Prediction Intervals are more widely utilized
as they are a more robust range for the predicted value. For a given record, the
Confidence Interval gives the mean value estimation with 95% probability. This
means that with 95% probability, the regression line will pass through this
interval. The Prediction Interval takes into account possible future deviations of
the predicted response from the mean. There is a 95% chance that the predicted
value will lie within the Prediction interval.
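Under the standard linear model assumptions, the two intervals for a record with
leverage h_i take the form

Confidence Interval: y_hat_i +/- t * s * sqrt(h_i)
Prediction Interval: y_hat_i +/- t * s * sqrt(1 + h_i)

where s is the residual standard error and t is the appropriate percentile of the
t distribution. The extra 1 under the square root is what makes the Prediction
Interval the wider, more conservative range.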
Lift charts and RROC Curves (on the LinReg_TrainingLiftChart and
LinReg_ValidationLiftChart tabs, respectively) are visual aids for measuring
model performance. Lift Charts consist of a lift curve and a baseline. The greater
the area between the lift curve and the baseline, the better the model. RROC
(regression receiver operating characteristic) curves plot the performance of
regressors by graphing over-estimations (or predicted values that are too high)
versus underestimations (or predicted values that are too low.) The closer the
curve is to the top left corner of the graph (in other words, the smaller the area
above the curve), the better the performance of the model.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted using the predicted output variable value. After sorting, the
actual outcome values of the output variable are cumulated and the lift curve is
drawn as the number of cases versus the cumulated value. The baseline (red line
connecting the origin to the end point of the blue line) is drawn as the number of
cases versus the average of actual output variable values multiplied by the
number of cases. The decilewise lift curve is drawn as the decile number versus
the cumulative actual output variable value divided by the decile's mean output
variable value. The bars in this chart indicate the factor by which the MLR



model outperforms a random assignment, one decile at a time. Refer to the
validation graph below. In the first decile of the validation dataset, taking the
most expensive predicted housing prices, the predictive performance of the
model is about 1.8 times better than simply assigning a random predicted value.
Decile-Wise Lift Chart, ROC Curve and Lift Chart from Training Partition

Decile-Wise Lift Chart, ROC Curve and Lift Chart from Validation
Partition

In an RROC curve, we can compare the performance of a regressor with that of
a random guess (red line), for which under-estimations are equal to over-
estimations, shifted to the minimum under-estimate. Anything to the left of this
line signifies a better prediction and anything to the right signifies a worse
prediction. The best possible prediction performance would be denoted by a
point at the top left of the graph, at the intersection of the x and y axes. This
point is sometimes referred to as the “perfect classification”. Area Over the
Curve (AOC) is the space in the graph that appears above the RROC curve and
is calculated using the formula sigma^2 * n^2 / 2, where n is the number of
records. The smaller the AOC, the better the performance of the model. In this
example we see that the area above the curve in both datasets, or the AOC, is
fairly large, which indicates that this model might not be the best fit to the data.
Two new charts were introduced in V2017: a new Lift Chart and the Gain
Chart. To display these new charts, click the down arrow next to Lift Chart
(Original) in the Original Lift Chart, then select the desired chart.



Select Lift Chart to display Analytic Solver Data Mining's new Lift Chart. Each
of these charts consists of an Optimum Predictor curve, a Fitted Predictor curve,
and a Random Predictor curve. The Optimum Predictor curve plots a
hypothetical model that would provide perfect classification for our data. The
Fitted Predictor curve plots the fitted model and the Random Predictor curve
plots the results from using no model or by using a random guess (i.e. for x% of
selected observations, x% of the total number of positive observations are
expected to be correctly classified).
The Alternative Lift Chart plots Lift against % Cases. The Gain chart plots Gain
Ratio against % Cases.
Lift Chart (Alternative) and Gain Chart for Training Partition

Lift Chart (Alternative) and Gain Chart for Validation Partition

See the “Scoring New Data” chapter on Stored Model Sheets for more
information on LinReg_Stored.



Scoring New Data

Introduction
Analytic Solver Data Mining and AnalyticSolver.com provide a method for
scoring new data in a database or worksheet with any of the Classification or
Prediction algorithms. This facility matches the input variables to the database
(or worksheet) fields and then performs the scoring on the database (or
worksheet).

Scoring to a Database
This example describes the steps required to create a classification model using
the Discriminant Analysis classification algorithm and then uses that model to
score new data. Note that this is only supported in Analytic Solver Desktop.
This is not supported in the Data Mining Cloud app.
The example dataset Boston_Housing.xlsx will be used to illustrate the steps
required. Recall that this example dataset includes 14 variables related to
housing prices collected from census tracts in the Boston area. For more
information on this example dataset and Discriminant Analysis in general,
please see the Discriminant Analysis chapter within the Analytic Solver Data
Mining Reference Guide.
Open Boston_Housing.xlsx, then click Classify – Discriminant Analysis to
open the Discriminant Analysis – Data dialog.
Note: Frontline's example files and datasets contained within \Frontline
Systems\Analytic Solver Platform\Datasets are read-only. Before starting this
example, copy dataset.mdb (located in C:\Program Files\Frontline
Systems\Analytic Solver Platform\Datasets) to a location where write access is
allowed.
Select the CAT. MEDV variable in the Variables In Input Data listbox then
click > to select as the Output Variable. Immediately, the options for Classes in
the Output Variable are enabled. #Classes is prefilled as “2” since the CAT.
MEDV variable contains two classes, 0 and 1.
Specify “Success” Class (for Lift Chart) is selected by default and Class 1 is to
be considered a “success” or the significant class in the Lift Chart. (Note: This
option is enabled when the number of classes in the output variable is equal to
2.)
Specify initial cutoff probability for success accepts a value between 0 and 1.
If the calculated probability of success for an observation is greater than or
equal to this value, then a “success” (or a 1) will be predicted for that
observation. If the calculated probability of success for an observation is less
than this value, then a “non-success” (or a 0) will be predicted for that
observation. The default value is 0.5. (Note: This option is only enabled when
the # of classes is equal to 2.)
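This cutoff rule is equivalent to the worksheet formula below, where cell A2 is
assumed, purely for illustration, to hold the calculated probability of success for
an observation:

=IF(A2 >= 0.5, 1, 0)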
Select CRIM, ZN, INDUS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, &
B in the Variables In Input Data listbox then click > to move to the Selected



Variables listbox. (CHAS, LSTAT, & MEDV should remain in the Variables In
Input Data listbox as shown below.)

Click Next to advance to the Parameters dialog.


Since we did not partition the dataset before we started the classification
method, we can partition the dataset now. Click Partition Data and then select
the Partition Data option to enable the Partitioning Options. Select User
Defined, then enter 80 for Training and 20 for Validation. Click Done to close
the dialog.

Click Prior Probability to open the Prior Probability dialog. Three options
appear: Empirical, Uniform, and Manual.



If the first option is selected, Empirical, Analytic Solver Data Mining will
assume that the probability of encountering a particular class in the dataset is the
same as the frequency with which it occurs in the training data.
If the second option is selected, Uniform, Analytic Solver Data Mining will
assume that all classes occur with equal probability.
Select the third option, Manual, to manually enter the desired class and
probability value.
Under Probability, enter 0.7 for Class 1 and 0.3 for Class 0. Click Done to close
the dialog.

Click Next to advance to the Discriminant Analysis – Scoring dialog.


Since we did not create a test partition, the options for Score test data are
disabled. See the chapter “Data Mining Partitioning” within the Analytic Solver
Data Mining Reference Guide for information on how to create a test partition.



In the Score New Data group, select In Database. The Scoring to Database
dialog opens.
The first step on this dialog is to select the Data source. Once the Data source
is selected, Connect to database… will be enabled.
This example illustrates how to score to an MS-Access database. Select MS-
Access for the Data source, then click Connect to Database…

An Open dialog appears. Browse to the location where dataset.mdb is saved,
then click Open. Note the Login Name and Password fields at the bottom of
the dialog. If your database is password protected, enter the appropriate
information here.
The Scoring to Database dialog re-appears. Select Boston_Housing for
Table/View. The dialog will be populated with variables from the database,
dataset.mdb, under Fields in Table and with variables from the
Boston_Housing.xlsx dataset under Variables In Input Data.

Analytic Solver Data Mining offers three easy techniques to match variables in
the dataset to variables in the database:
1. Matching by Name.
2. Matching Sequentially.
3. Manually Matching.
If Match By Name is clicked, all similarly named variables in
Boston_Housing.xlsx will be matched with similarly named variables in
dataset.mdb, as shown in the screenshot below. Note that the additional
database fields remain in the Fields in Table listbox while all variables in the
Variables In Input Data listbox have been matched.



If Match Sequentially is clicked, the first 11 variables in Boston_Housing.xlsx
will be matched with the first 11 variables in the dataset.mdb database.
The first 11 variables in both the database and the dataset are now matched
under Variables In Input Data. The additional database fields remain under
Fields in Table.
Note: It’s also possible to start the sequential matching with variables that are
not in order in the Fields in Table listbox. For example, if the variables CRIM,
INDUS, and NOX are selected, then Match Sequentially is clicked, the selected
database fields (CRIM, INDUS and NOX) will be matched with the first three
variables in the Input Data and the remaining Table Fields will be matched
sequentially with the remaining Input Data Variables.
To manually map variables from the dataset to the database, select a field from
the database in the Fields in Table listbox, then select the variable to be matched
in the dataset in the Variables In Input Data listbox, then click Match Selected.
For example, to match the CRIM variable in the database to the CRIM variable
in the dataset, select CRIM from the dataset.mdb database in the Fields in Table
listbox, select CRIM from Variables In Input Data, then click Match Selected
to match the two variables.



Notice that CRIM has been removed from the Fields in Table listbox and is now
listed next to CRIM in the Variables In Input Data listbox. Continue with these
steps to match the remaining 10 variables in the Boston_Housing.xlsx dataset.
To unmatch all variables, click Unmatch All. To unmatch a single match,
highlight the match, then click Unmatch Selected.
An Output Field can be selected from the remaining database fields listed under
Fields in Table or a new Output Field can be added. Note: An output field must
be a string.
Click Unmatch All, then click Match By Name. Select Add new field for
output, then enter a name in the field to the right such as “Score Var”.



Click Finish on the New Data dialog. DA_NewScoreDB is inserted into the
Model tab of the Analytic Solver task pane under Reports – Discriminant
Analysis. This worksheet simply includes the name of the database, the table
name, the number of records scored, and the variables.

To view the scored records, open the dataset.mdb database in Microsoft Access
and inspect the Score Var column as shown in the screenshot below. (Click the
Enable Content button to view the dataset.)



Note: If you did not copy dataset.mdb to a location that allows write-access,
you will get the error, This database has been opened read-only.

Scoring to a Worksheet
Analytic Solver Data Mining can also perform scoring on new data in a
worksheet. To illustrate, we'll re-use the Boston Housing example dataset.
Click Predict – Linear Regression to open the Linear Regression - Data dialog
and select the Output, Selected, and Categorical Variables as shown in the
screenshot below.

Click Next to advance to the Parameters dialog. Since we haven't yet
partitioned the dataset, we will do so now. Click Partition Data to open the
Partition Data dialog, then select the Partition Data option to enable the
Standard Partitioning Options. For more information on these partitioning
options, please see the Data Mining Partitioning chapter that occurs earlier in
this guide. Click Done to accept the partitioning defaults.

Under Advanced, select Confidence/Prediction Intervals to include these
intervals for the new predictions.

Click Next to advance to the Scoring dialog.


Select In Worksheet in the Score new data group. The dialog for Match
variables in the New Range appears.

Select New Data for Worksheet at the top of the dialog.



The variables listed under Variables in New Data are from the New Data
worksheet and the variables listed under Continuous Variables In Input Data are
from the Data worksheet. Variables can be matched in three different ways.
1. Matching by Name.
2. Matching Sequentially.
3. Manually Matching.
If Match By Name is clicked, variables with the same names in each set will be
matched.
If Match Sequentially is clicked, the first five variables from each listbox are
matched.
Variables may also be matched manually by selecting a variable under Variables
in New Data, selecting a variable in Variables In Input Data, and clicking Match
Selected. For example, select CRIM under Variables in new data and CRIM
under Variables In input data, then click Match Selected.
Click Match By Name to match the variables.

To unmatch all matched variables, click Unmatch all. To unmatch only one set
of matched variables, select the matched variables in the Variables In input data
listbox, then select Unmatch Selected.
Click Finish.
Output containing the Linear Regression results can be found beneath Scoring –
Run 1 within the Analytic Solver Task pane. Click LinReg_NewScore to view
the output as shown below.



Click the Intervals: New link on the Output Navigator to view the Confidence
and Prediction Intervals for the new predictions.

Here we see the 95% Confidence and Prediction Intervals. Typically, Prediction
Intervals are more widely utilized as they are a more robust range for the
predicted value. For a given record, the Confidence Interval gives the mean
value estimation with 95% probability. This means that with 95% probability,
the regression line will pass through this interval. The Prediction Interval takes
into account possible future deviations of the predicted response from the mean.
There is a 95% chance that the predicted value will lie within the Prediction
interval.
For more information on the rest of the Linear Regression output, please see the
Multiple Linear Regression chapter that appears earlier in this guide.

Scoring Test Data


When Analytic Solver Data Mining calculates prediction, classification,
forecasting and transformation results, internal values and coefficients are
generated and used in the computations. These values are saved to output sheets
named X_Stored, where X is the abbreviated name of the data mining method.
For example, the name given to the stored model sheet for Linear Regression is
"LinReg_Stored".
Note: In previous versions of XLMiner, this utility was a separate add-on
application named XLMCalc. Starting with XLMiner V12, this utility is
included free of charge and can be accessed under Score in the Tools section of
the XLMiner ribbon.

For example, assume the Linear Regression prediction method has just finished.
The Stored Model Sheet (LinReg_Stored) will contain the regression equation in



PMML format. When the Score Test Data utility is invoked or when the
PsiPredict() function is present (see below), Analytic Solver will apply this
equation from the Stored Model Sheet to the test data.
Along with values required to generate the output, the Stored Model Sheet also
contains information associated with the input variables that were present in the
training data. The dataset on which the scoring will be performed should
contain at least these original Input variables. Analytic Solver Data Mining
offers a “matching” utility that will match the Input variables in the training set
to the variables in the new dataset so the variable names are not required to be
identical in both data sets (training and test). See the sections above for more
information.

Scoring Test Data Example


This example illustrates how to score test data using a stored model sheet using
output from a Multiple Linear Regression. This procedure may be repeated
using any stored model sheets generated using Analytic Solver Data Mining.
Click Help – Examples on the Data Mining ribbon, then Forecasting/Data
Mining Examples and open the example file Scoring.xlsx.
LinReg_Stored was generated while performing the steps in the “Multiple Linear
Regression Prediction Method” chapter. See this chapter for details on
performing a Multiple Linear Regression.
Scoring.xlsx also contains a New Data worksheet with 10 new records. Our
goal is to score this new dataset to come up with a predicted housing price for
each of the 10 new records.
Click Score on the Data Mining ribbon. Under Data to be scored, confirm that
New Data appears as the Worksheet, Scoring.xlsx as the Workbook, the Data
range is A1:M11 and LinReg_Stored is selected in the Worksheet drop down
menu under Stored Model.



Variables in the New Data may be matched with Variables in Stored Model
using three easy techniques: by name, by sequence or manually.
If Match By Name is clicked, all similar named variables in the stored model
sheet will be matched with similar named variables in the new dataset.
If Match Sequentially is clicked, the Variables in the stored model will be
matched with the Variables in the new data in order that they appear in the two
listboxes. For example, the variable CRIM from the new dataset will be
matched with the variable CRIM from the stored model sheet, the variable ZN
from the new data will be matched with the variable ZN from the stored model
sheet and so on.
To manually map variables from the stored model sheet to the new data set,
select a variable from the new data set in the Variables in New Data listbox,
then select the variable to be matched in the Variables in Stored Model listbox,
and click Match Selected. For example, to match the CRIM variable in the new
dataset to the CRIM variable in the stored model sheet, select CRIM from the
Variables in New Data listbox, select CRIM in the Variables in Stored Model
listbox, then click Match Selected to match the two variables.
To unmatch all variables click Unmatch all. To unmatch two specific variables,
select the matched variables, then click Unmatch Selected.



Click OK.
Scoring_LinearRegression will be inserted to the right of the New Data tab.
The output is shown below. The results from scoring can be found under
Prediction: MEDV. These are the predicted prices for each of the ten new
records.

Using Data Mining Psi Functions in Excel


Analytic Solver Data Mining utilizes XML PMML format to store the supported
models and use them for “scoring” (classifying, predicting, forecasting,
transforming) new data using four new generic scoring functions: PsiPredict(),
PsiForecast(), PsiTransform() and PsiPosteriors(). PsiPredict() and
PsiForecast() provide previously available functionality plus new additional
functionality such as storing and scoring ensemble models with any available
weak learner and computing the fitted values for new time series data.



PsiTransform() and PsiPosteriors() provide new functionality and the
availability of new models for storing or scoring.

PSI Scoring Function   Description

PsiPredict()   Predicts the target for input data using a Classification or Regression model, and computes the fitted values for a Time Series model, stored in PMML format.
PsiForecast()   Computes the forecasts for the input data using a Time Series model stored in PMML format.
PsiPosteriors()   Computes the posterior probabilities for the input data using a Classification model stored in PMML format.
PsiTransform()   Transforms the input data using a Transformation model stored in PMML format.
It’s possible to score data with a prediction or classification method or perform a
time series forecast manually (without the need to click the Score icon on the
ribbon) by entering a Psi Solver function into an Excel cell as an array.

Scoring Data Using Psi Functions


Using the example above, click back to the New Data worksheet and select a
blank cell on the worksheet. Enter "=PsiPredict(LinReg_Stored!B12:B74,'New
Data'!A1:M11)". If using a version of Excel that supports Dynamic Arrays, the
formula result will "spill" into the cells below. If using a version of Excel that
does not support Dynamic Arrays, select cells N2:N11, enter the formula, and
press SHIFT + CTRL + ENTER to enter the formula as an array into all 10 cells
(N2:N11).
The first argument, LinReg_Stored!B12:B74, is the range of cells used by
Analytic Solver Data Mining to store the linear regression model on the
LinReg_Stored worksheet. Clearly this data range will change as the
classification or prediction method changes and as the number of features
included in the dataset changes.
The second argument, 'New Data'!A1:M11, is the range containing the new data
on the New Data worksheet. The new data must contain at least one row of data
containing the same number of features (or columns) as the data used to create
the model. In this example, we included 13 features in our model (CRIM, ZN,
INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B & LSTAT).
As a result, our new data also contained these same exact features. We could
have performed this prediction on only one row of new data, 'New Data'!A2:M2,
but chose to use all 10 rows.
It’s also possible to enter this formula using the Insert Function dialog by
clicking Formulas – Insert Function, selecting PSI Data Mining for Category,
then PsiPredict.



Note: The Insert Function dialog is not supported when using Data Mining
Cloud. To use this function or any other Psi function, simply type the function
directly into the cell.
The scoring results are shown in the screenshot below in the N column.

The PsiPredict() function is interactive, meaning that if a variable value is
changed, for example if the first LSTAT value in cell M2 changes from 4.5 to
9.5, the Predicted Value in cell N2 will immediately update to reflect a new
predicted value.
The remaining PSI Data Mining functions can be used in the same way using
models from their respective stored model sheets. See below for specifications
for PsiPredict(), PsiPosteriors() and PsiTransform(). See the section below for
information on PsiForecast().
Note: Using Psi Data Mining Functions in the Cloud or in versions
of Desktop Excel that support Dynamic Arrays.
The Cloud apps take advantage of Excel's newly introduced Dynamic
Arrays. To use this function in the Cloud, you need only enter the Psi
function in one cell as a normal function, i.e., not as a control
array. In the example above, we would enter the following formula
into cell N1 =PsiPredict(LinReg_Stored!B12:B74, A1:M11). If this
formula is entered into cell N1, the contents of the PsiPredict()
Dynamic Array will "spill" down into cells N2:N11.

PsiPredict()
PsiPredict(Model, Input_Data, [Header])
Predicts the response, target, output or dependent variable for Input_Data
whether it is continuous (Regression) or categorical (Classification) when the
model is stored in PMML format. In addition, this function also computes the
fitted values for a Time Series model when the model is stored in PMML
format.



Model: Range containing the stored Classification, Regression or TimeSeries
model in PMML format.
Input_Data: Range containing the new data for computing predictions.
Range must contain a header row with column names and at least one row of
data containing the exact same features (or columns) as the data used to create
the model.
Header: If True, a heading will be inserted above the predicted values. If
omitted or False, a heading will not appear.
In Data Mining Cloud and in newer versions of desktop Excel, PsiPredict()
returns a Dynamic Array (see Note in section heading, above). To use this
function in the Cloud, you need only enter the Psi function in one cell as a
normal function, i.e., not as a control array. The contents of the Dynamic Array
will "spill" down the column. If a nonblank cell is "blocking" the contents of
the Dynamic Array, PsiPredict() will return #SPILL! until the blockage is
removed.
Output: A single column containing the header (if the header argument is set to
TRUE) and predicted/fitted values for each record in Input_Data.
To know if the result of the prediction is continuous or categorical, you must
know what kind of model you are passing as an argument to the scoring function
– if you previously fitted the classification model and are now predicting the
new feature vectors, you should expect to get the compatible categorical
response. On the other hand, you should expect the continuous response from
the new data prediction when using a fitted regression model. In previous
versions, the user was expected to know the exact model type, such as Multiple
Linear Regression or Discriminant Analysis, to know what kind of output will
be produced, whereas in V2017 and later, it is sufficient to know whether you’re
pointing to a classification or regression model in order to determine the type of
the response. Note: If the user intends to use an “unknown” model for scoring,
the stored worksheets contain the complete information about the model
including several clear indications of the model type and data dictionaries with
the types of features and response.
In addition, PsiPredict() can compute the fitted values for the new time series
based on the provided Time Series model. Unlike future-looking forecasting,
provided by PsiForecast(), PsiPredict() computes a model prediction for each
observation in the provided new time series.
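For example, re-using the illustrative ranges from the Scoring Test Data
example earlier in this chapter, a call that also requests a heading row would be:

=PsiPredict(LinReg_Stored!B12:B74, 'New Data'!A1:M11, TRUE)

which returns a heading followed by one predicted value per record in the input
range.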

Supported Models:
• Classification:
o Linear Discriminant Analysis
o Logistic Regression
o K-Nearest Neighbors
o Classification Tree
o Naïve Bayes
o Neural Network
o Random Trees
o Bagging (with any supported weak learner)
o Boosting (with any supported weak learner)
• Regression:
o Linear Regression
o Regression Tree
o K-Nearest Neighbors



o Neural Network
o Bagging (with any supported weak learner)
o Boosting (with any supported weak learner)
• Time Series (fitted values)
o ARIMA
o Exponential Smoothing
o Double Exponential Smoothing
o Holt-Winters Smoothing

Previous related Psi Scoring functions:


• Classification: PsiClassifyLR, PsiClassifyDA, PsiClassifyCT,
PsiClassifyNB, PsiClassifyNN, PsiClassifyCTEnsemble,
PsiClassifyNNEnsemble
• Regression: PsiPredictMLR, PsiPredictRT, PsiPredictNN,
PsiPredictNNEnsemble, PsiPredictRTEnsemble

Prediction/Classification/Time Series Algorithm   Stored Model Sheet

Linear Discriminant Analysis Classification   DA_Stored
Logistic Regression Classification   LogReg_Stored
k-Nearest Neighbors Classification   KNNC_Stored
Classification Trees   CT_Stored
Naïve Bayes Classification   NB_Stored
Neural Networks Classification   NNC_Stored
Ensemble Methods for Classification   CBoosting_Stored, CBagging_Stored, CRandTrees_Stored
Linear Regression   LinReg_Stored
k-Nearest Neighbors Regression   KNNP_Stored
Regression Tree   RT_Stored
Neural Network Regression   NNP_Stored
Ensemble Methods for Regression   RBoosting_Stored, RBagging_Stored, RRandTrees_Stored
ARIMA   ARIMA_Stored
Exponential Smoothing   Expo_Stored
Double Exponential Smoothing   DoubleExpo_Stored
Moving Average Smoothing   MovingAvg_Stored
Holt Winters Smoothing   MultHoltWinters_Stored, AddHoltWinters_Stored, NoTrendHoltWinters_Stored



PsiPosteriors()
PsiPosteriors(Model, Input_Data, [Header])
Computes the posterior probabilities for Input_Data using a Classification
model stored in PMML format.

Model: Range containing the stored Classification model in PMML format.


Input_Data: Range containing the new data for computing posterior
probabilities. Range must contain a header with column names and at least one
row of data containing the exact same features (or columns) as the data used to
create the model.
Header: If True, a heading is inserted in the output above the posterior
probabilities. If False or omitted, a heading is not inserted into the output.
In Data Mining Cloud and in newer versions of desktop Excel, PsiPosteriors()
returns a Dynamic Array (see Note in section heading, above). To use this
function in the Cloud, you need only enter the Psi function in one cell as a
normal function, i.e., not as a control array. The contents of the Dynamic Array
will "spill" down the column. If a nonblank cell is "blocking" the contents of
the Dynamic Array, PsiPosteriors() will return #SPILL! until the blockage is
removed.
Output: Multiple columns containing a header with class labels and estimated
posterior probabilities for each class label for all records in Input_Data.
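For example, assuming a Classification Tree model stored on the CT_Stored
sheet, with a model range that is shown here purely for illustration (the actual
range depends on the stored model sheet):

=PsiPosteriors(CT_Stored!B12:B74, 'New Data'!A1:M11, TRUE)

returns one column per class label, with a posterior probability in each column
for every record in the input range.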

Supported Models:
• Classification:
o Linear Discriminant Analysis
o Logistic Regression
o K-Nearest Neighbors
o Classification Tree
o Naïve Bayes
o Neural Network
o Random Trees
o Bagging (with any supported weak learner)
o Boosting (with any supported weak learner)

Previous related Psi Scoring functions: N/A


Classification Algorithm   Stored Model Sheet

Linear Discriminant Analysis Classification   DA_Stored
Logistic Regression Classification   LogReg_Stored
k-Nearest Neighbors Classification   KNNC_Stored
Classification Trees   CT_Stored
Naïve Bayes Classification   NB_Stored
Neural Networks Classification   NNC_Stored
Ensemble Methods for Classification   CBoosting_Stored, CBagging_Stored, CRandTrees_Stored

PsiTransform()
PsiTransform(Model, Input_Data, [Header])
Transforms the Input_Data using a Transformation model stored in PMML
format.
Model: Range containing the stored Transformation model in PMML format.
Input_Data: Range containing the new data for transformation. Range must
contain a header with column names and at least one row of data containing the
exact same features (or columns) as the data used to create the model.
Header: If True, a heading is inserted above the transformed values. If False or
omitted, a heading is not inserted.
In Data Mining Cloud and in newer versions of desktop Excel, PsiTransform()
returns a Dynamic Array (see Note in section heading, above). To use this
function in the Cloud, you need only enter the Psi function in one cell as a
normal function, i.e., not as a control array. The contents of the Dynamic Array
will "spill" down the column. If a nonblank cell is "blocking" the contents of
the Dynamic Array, PsiTransform() will return #SPILL! until the blockage is
removed.
Output: One or multiple columns containing a header and transformed data.
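For example, assuming a Rescaling model stored on the Rescaling_Stored sheet,
with a model range shown purely for illustration:

=PsiTransform(Rescaling_Stored!B12:B40, 'New Data'!A1:M11, TRUE)

returns the rescaled values, one column per transformed variable.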

Supported Models:
• Transformation:
o Rescaling
• Text Mining
o TF-IDF Vectorization (input data – text variable with the
corpus of documents)
o LSA Concept Extraction (input data – term-document matrix,
where columns represent terms and rows represent documents)

Previous related Psi Scoring functions: N/A

Algorithm   Stored Model Sheet

Rescaling   Rescaling_Stored
Text Mining   TFIDF_Stored, LSA_Stored

Time Series Forecasting


Starting with version 2014-R2, Analytic Solver Data Mining includes the ability
to forecast a future point in a time series in one of your spreadsheet formulas
(without using the Score button on the Ribbon) using a PsiForecast() function in
conjunction with a model created using ARIMA or one of our smoothing
methods (Exponential, Double Exponential, Moving Average, or Holt Winters).



PsiForecast() is similar to the previous PSIForecastXXX functions supported in
V2014, 2015, and 2016: it will compute future-looking forecasts based on the
fitted model, using the provided new time series observations as initial points.
The result of PsiForecast() can be deterministic, if the Simulate argument is
FALSE, or non-deterministic, if the Simulate argument is TRUE, in which case
the forecasts are adjusted with random normally distributed errors, defined by
the forecasts’ statistics.
Open the Airpass.xlsx example dataset by clicking Help – Examples on the Data
Mining ribbon, then clicking Forecasting/Data Mining Examples. This example
dataset includes International Airline Passenger information by month for the
years 1949 – 1960. Since the number of airline passengers increases during
certain times of the year (for example, in spring, summer, and December), we
can say that this dataset includes "seasonality".
First, we will partition this dataset into two datasets: a training dataset and a
validation dataset. We’ll use the training dataset to create the ARIMA model
and then we’ll apply the model to the validation dataset to forecast six future
data points, or one half year of data.
Click Partition in the Time Series section of the Data Mining ribbon to open the
Time Series Partition Data dialog. Select Passengers for the Variables in the
Partition Data and Month for the Time Variable.

Click OK to accept the defaults for Specify Partitioning Options and Specify
Percentages for Partitioning. Recall that when a time series dataset is
partitioned, the dataset is partitioned sequentially. Therefore 60%, or the first 86
records, will be assigned to the training dataset, and the remaining 40%, or 58
records, will be assigned to the validation dataset. (For more information on
partitioning a time series dataset, see the previous chapter, Exploring a Time
Series Dataset.)
The TSPartition worksheet will be inserted into the Model tab of the Analytic
Solver task pane under Transformations – Time Series Partition. Recall the
steps needed to produce the forecast. Click ARIMA – ARIMA to open the
ARIMA dialog. Month has been preselected as the Time variable. Select
Passengers as the Selected variable.
This example will use a SARIMA model, or Seasonal Autoregressive Integrated
Moving Average model, to predict the next six datapoints in the dataset. (For
more information on this type of time series model, please see the earlier
chapter, “Exploring a Time Series Dataset.”) A seasonal ARIMA model
requires 7 parameters: 3 nonseasonal (autoregressive (p), integrated (d), and
moving average (q)), 3 seasonal (autoregressive (P), integrated (D), and moving
average (Q)), and the period (s). Each parameter must be a non-negative
integer. Selecting appropriate values for p, d, q, P, D, Q and s is beyond the
scope of this User Guide. Consequently, this example will use a well-
documented SARIMA model with parameters p = 0, d = 1, q = 1, P = 0, D = 1,
Q = 1 and period (s) = 12. Please refer to the classic time series analysis text
Time Series Analysis: Forecasting and Control by George Box and Gwilym
Jenkins for more information on parameter selection.
Select Fit seasonal model and enter 12 for Period, since it takes a full 12 months
for the seasonal pattern to repeat. Set the Non-seasonal Parameters as
Autoregressive (p) = 0, Difference (d) = 1, Moving Average (q) = 1, and the
Seasonal Parameters as Autoregressive (P) = 0, Difference (D) = 1, and Moving
Average (Q) = 1.
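
For reference, this specification is the classic "airline model" of Box and
Jenkins. In backshift notation (with B the backshift operator, θ and Θ the
nonseasonal and seasonal moving average coefficients estimated from the
training data, and ε_t the error term), the fitted model can be written as:

(1 − B)(1 − B^12) y_t = (1 − θB)(1 − ΘB^12) ε_t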

Click OK to create the SARIMA model.



ARIMA_Output will be inserted into the Model tab of the task pane under
Reports – ARIMA. This output contains the Training Error Measures and Fitted
Model Statistics. (For more information on this report, please see the chapter
Exploring a Time Series Dataset within the Analytic Solver Data Mining
Reference Guide.) ARIMA_Stored contains the stored model parameters.
Now we’ll use this ARIMA model to predict new data points in the validation
dataset using the PsiForecast() function. With Num_Forecasts set to 6, this
function will forecast six future points in the dataset. (Note: The first forecasted
point will be more accurate than the second, the second forecasted point more
accurate than the third, and so on.) The PsiForecast() function is interactive in
the sense that if any of the input values (values passed in the 2nd argument)
change, the forecast will be recomputed.
The PsiForecast() function takes five arguments, three of which are optional:
Model, Input_Data, Simulate, Num_Forecasts, and Header. Select a blank cell
on the Data worksheet and enter =PsiForecast(. If using a version of Excel that
does not support Dynamic Arrays, highlight cells B146:B152, then enter
=PsiForecast(.
The first argument, Model, is the range of cells used by Analytic Solver Data
Mining to store the ARIMA model on the ARIMA_Stored worksheet. This data
range will change as the forecast method changes. Select or enter
ARIMA_Stored!B12:B38, for this argument.
The second argument, Input_Data, is the range containing the initial starting
points from the validation data set. The minimum number of initial points that
should be specified for a seasonal ARIMA model is the larger of p + d + s * (P +
D) and q + s * Q. In this example, p + d + s * (P + D) is equal to 13 (0 + 1 + 12
* (0 + 1)) and q + s * Q is equal to 13 (1 + 12 * 1), therefore the minimum
number of initial starting points required is 13 (MAX(13, 13) = 13). If you provide
fewer than the minimum required number of starting points, PsiForecast() will
return a column of zeros. (See the table below for the minimum number of
initial starting points required by each Forecasting method included in Analytic
Solver Data Mining.) The maximum number of starting points is the number of
points in the validation dataset. All points supplied in the second argument will
be used in the forecast. Select or enter Data!B1:B145, for this argument.
Pass True or False for the third argument, Simulate. Passing False will result in
a static forecast that will only update if a cell passed in the 2nd argument is
changed. If True is passed for this argument, a random error will be included in
the forecasted points. See the Time Series Simulation example below for more
information on passing True for this argument. In this case, pass False for this
argument.
Pass 6 for the next argument, Num_Forecasts, the number of forecasts to return.
Pass True for Header to display a heading at the top of the results.

Your formula should now be the following:


=PsiForecast(ARIMA_Stored!B12:B38, Data!B1:B145, False, 6, True). If
using a version of Excel that does not support Dynamic Arrays, press CTRL +
SHIFT + ENTER to enter this formula as an array in all seven cells
(B146:B152).
It’s also possible to enter this formula using the Insert Function dialog: click
Formulas – Insert Function, select PSI Data Mining for Category, then select
PsiForecast.



The results from this function are displayed below. Because True was passed
for the Header argument, a heading is inserted above the forecasted values.
Notice that the formula is entered into cell B146 and the contents of the
PsiForecast() Dynamic Array "spill" down into cells B147:B152.

If any values change in the ranges ARIMA_Stored!B12:B38 or Data!B1:B145,
the forecast will be recomputed; but if the input argument values stay the same,
the PsiForecast() function will always return the same forecast values. As
mentioned above, the first forecasted value is the most accurate predicted point.
Accuracy declines as the number of forecasted points increases.
See the section below for specifications on PsiForecast().

PsiForecast()
PsiForecast(Model, Input_Data, [Simulate],
[Num_Forecasts], [Header])
Computes the forecasts for Input_Data using a Time Series model stored in
PMML format.
Model: Range containing the stored Time Series model in PMML format.
Input_Data: Range containing the new Time Series data for computing the
forecasts. Range must contain a header with the time series name and a
sufficient number of records for forecasting with the given model.
Simulate: If True, the forecasts are adjusted with random normally
distributed errors. If False or omitted, the forecasts will be deterministic.
Num_Forecasts: Enter the number of desired forecasts.
Header: If True, a heading will be inserted above the forecasted results. If
False or omitted, a heading will not be included in the result.
Output: A single column containing the header, if used, and forecasts for the
input time series. The number of forecasts produced is determined by the
Num_Forecasts argument (or, in legacy array-formula entry, by the number of
selected cells).

Supported Models:
• ARIMA
• Exponential Smoothing
• Double Exponential Smoothing
• Moving Average Smoothing
• Holt Winters Smoothing

Previous related Psi Scoring functions: PsiForecastARIMA, PsiForecastExp,
PsiForecastDoubleExp, PsiForecastMovingAvg, PsiForecastHoltWinters
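
The same signature applies to the smoothing models; only the stored model range changes. For instance, assuming an exponential smoothing model stored on the Expo_Stored worksheet (the model range below is hypothetical), the following single-cell entry would return six deterministic forecasts with a heading:

=PsiForecast(Expo_Stored!B10:B20, Data!B1:B145, FALSE, 6, TRUE)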

Time Series Simulation


Analytic Solver Data Mining includes the ability to perform a time series
simulation, where future points in a time series are forecast on each Monte Carlo
trial, using a model created via ARIMA or one of our smoothing methods
(Exponential, Double Exponential, Moving Average, or Holt Winters).
To run a time series simulation, we must pass “True” as the third argument to
PsiForecast(). When the third argument is set to True, Analytic Solver will add
a random (positive or negative) “epsilon” value to each forecasted point. Each
time a simulation is run, 1000 trial “epsilon” values are generated using the
PsiNormal distribution with parameters mean and standard deviation computed
by the PsiForecast() function. You can view the output of this simulation in the
same way as you would view “normal” simulation results in Analytic Solver
Comprehensive, Analytic Solver Simulation, Analytic Solver Upgrade, or
Analytic Solver Basic, simply by creating a PsiOutput() function and then
double clicking the Output cell to view the Simulation Results dialog.
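
Conceptually, each forecasted point in the simulation behaves like the deterministic forecast plus a PsiNormal() error term whose mean and standard deviation are computed internally by PsiForecast(). The sketch below is only an illustration of that behavior, not the actual implementation; mean_t and sd_t stand for the internally computed error statistics of the t-th forecast:

simulated forecast for period t ≈ deterministic forecast for period t + PsiNormal(mean_t, sd_t)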
Select a blank cell, or Data!C146:C152 if using a version of Excel that does not
support Dynamic Arrays, then click Formulas – Insert Function to display the
Function Argument dialog.
As discussed previously, the first argument, ARIMA_Stored!B12:B38, is the
range of cells used by Analytic Solver to store the ARIMA model on the
ARIMA_Stored worksheet. This data range will change as the forecast method
changes.
For the second argument, the number of initial points in the series must be
greater than the minimum required for a static forecast. For a seasonal ARIMA
model when Simulate = True, the number of initial points must be greater than
Max(p + d + s * (P + D), q + s * Q). In this example, p + d + s * (P + D) is
equal to 13 (0 + 1 + 12 * (0 + 1)) and q + s * Q is equal to 13 (1 + 12 * 1),
therefore the number of initial starting points must exceed 13 (MAX(13, 13)).
However, when PsiForecast() is called with Simulate = True, it is recommended
to add an additional number of datapoints, equal to the #Periods, to the
minimum number required. In this instance the number of initial points will be
25: 13 (minimum # of points) + 12 (# of points for Period in the Time Series –
ARIMA dialog). If you provide fewer than the minimum required number of
starting points (13 in this example), PsiForecast() will return #VALUE!. (See
the table below for the minimum number of initial starting points required by
each forecast method in Analytic Solver.) All points supplied in the second
argument will be used in the forecast. Select or enter Data!B1:B145 for this
argument.
Passing TRUE for the third argument indicates to Analytic Solver Data Mining
that you plan to use this function call in a Monte Carlo simulation, so it should
add a random epsilon value (different on each Monte Carlo trial) to each
forecasted point.
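
Putting these arguments together (and keeping the Num_Forecasts and Header values from the static example above), the completed simulation formula would look like:

=PsiForecast(ARIMA_Stored!B12:B38, Data!B1:B145, TRUE, 6, TRUE)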

In versions of Excel supporting Dynamic Arrays, this formula is entered into
cell C146 and the contents of the PsiForecast() Dynamic Array "spill" down
into cells C147:C152. In versions of Excel that do not support Dynamic Arrays,
the formula must be entered as an Excel array.
To view the results of the simulation, including frequency and sensitivity charts,
statistics, and percentiles for the full range of trial values in any version of
Excel, we must first create an output cell. Select cell C147, then click Analytic
Solver – Results – Referred Cell. Select cell D147 (or any blank cell on the
spreadsheet) to enter the PsiOutput formula. Copy this formula from cell D147
down to cell D152, so that D147 = PsiOutput(C147), D148 = PsiOutput(C148),
and so on.

Click the down arrow on the Simulate icon and select Run Once. Instantly,
Analytic Solver will perform a simulation with 1,000 Monte Carlo trials (the
default number). Since this is the first time a simulation has been performed, the
following dialog opens. Subsequent simulations will not produce this report.
However, it is possible to reopen the individual frequency charts by double
clicking each of the output cells (D147:D152).
Important Note: For users who are familiar with simulation models in Analytic
Solver Simulation, you’ll notice that the time series simulation model that we
just created now includes 6 uncertain functions, C147:C152, which are the cells
containing our PsiForecast() functions. For more information on simulation
with Analytic Solver, please see the Analytic Solver User Guide chapter,
“Examples: Simulation and Risk Analysis”.
PsiForecast() is not recognized as an uncertain function in the Cloud apps. If
the Simulate argument is set to True, the Analytic Solver App will generate a
single random point around the forecast.

This dialog displays frequency charts for each of the six cells containing the
forecasted data points. Double click the chart for cell C147 (top left) to open the
Simulation Results dialog for the PsiForecast() function in cell C147. From here
you can view frequency and sensitivity charts, statistics and percentiles for each
forecasted point.



The frequency chart displays the distribution of all 1,000 trial values for cell
C147, with an observed mean of 440.76 and a standard deviation of 16.90
shown in the Chart Statistics. Select Simulate – Run Once a few more times (or
click the green “play” button on the Solver Pane Model tab). Each time you do,
another 1,000 Monte Carlo trials are run, and a slightly different mean will be
displayed.

Enter 444 for the Lower Cutoff in the Chart Statistics section of the right panel.
A vertical bar appears over the Frequency chart, dividing the trial values that
fell below this cutoff from those that fell above it. You can use these
frequencies as estimates of the probability that the actual value will fall below
or above the cutoff. In this case there was a 58.20% chance that the number of
international airline passengers would be less than 444,000 in January 1961 and
a 41.80% chance that the number of passengers would be greater than 444,000.

Looking to the right, you’ll find the Statistics pane, which includes summary
statistics for the full range of forecasted outcomes. We can see that the
minimum forecasted value during this simulation was 397.78, and the
maximum forecasted value was 493.37. Value at Risk 95% shows that 95% of
the time, the number of international airline passengers was 469.72 or less in
January 1961, in this simulation. The Conditional Value at Risk 95% value
indicates that the average number of passengers we would have seen (up to the
95th percentile) was 438.86. For more information on Analytic Solver’s full
range of features, see the Analytic Solver User Guide chapter, “Examples:
Simulation and Risk Analysis”.



Select cells E147:E156 and then enter the formula =PsiData(C147), then press
CTRL + SHIFT + ENTER to array-enter the formula into all 10 cells. Repeat
the same steps to array-enter “=PsiData(C148)” in cells F147:F156,
“=PsiData(C149)” in cells G147:G156, “=PsiData(C150)” in cells H147:H156,
“=PsiData(C151)” in cells I147:I156, and “=PsiData(C152)” in cells J147:J156.
Then click Simulate – Run Once to run a simulation.

The ten Excel cells in each of these columns will update with trial values for
the PsiForecast() functions in column C. For example, cells E147:E156 will
contain the first 10 trial values for the PsiForecast() function in cell C147, cells
F147:F156 will contain the first 10 trial values for cell C148, and so on. (For
more information on the PsiData() function, please see the Excel Solvers
Reference Guide chapter, “Psi Function Reference.”)

If we create an Excel chart of these values, you’ll see a chart similar to the one
below, where each of Series1 through Series6 represents a different Monte
Carlo trial. The random “epsilon” value added to each forecast value accounts
for all of the variation among the lines. If the third argument were FALSE or
omitted, all of the lines would overlap, assuming that the table of parameters
and the starting values were not changing.



The remaining Forecasting methods can be used in the same way using
PsiForecast() with information from their respective Stored Model sheets.

Forecasting Algorithm       Stored Model Sheet           Minimum # of Initial Points   Minimum # of Initial Points
                                                         when Simulate = False         when Simulate = True
Non-Seasonal ARIMA          ARIMA_Stored                 Max(p + d, q)                 Max(p + d, q)
Seasonal ARIMA              ARIMA_Stored                 Max(p + d + s * (P + D),      1 + Max(p + d + s * (P + D),
                                                         q + s * Q)                    q + s * Q)**
Exponential Smoothing       Expo_Stored                  1                             1
Double Exponential          DoubleExpo_Stored            1                             1
Smoothing
Moving Average Smoothing    MovingAvg_Stored             # of Intervals                # of Intervals
Holt Winters Smoothing      MulHoltWinters_Stored        2 * #Periods                  2 * #Periods
                            AddHoltWinters_Stored
                            NoTrendHoltWinters_Stored

**Adding a number of data points equal to the Number of Periods (as shown on the Time Series – ARIMA dialog)
to the Minimum # of Initial Points when Simulate = True is recommended when calling PsiForecast() with Simulate
= True.
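
As a quick worked example of the table, with illustrative parameter values: a Holt Winters model with #Periods = 12 requires at least 2 * 12 = 24 initial points, and a Moving Average model with an interval of 3 requires only 3, while the seasonal ARIMA model fit earlier (p = 0, d = 1, q = 1, P = 0, D = 1, Q = 1, s = 12) requires 13 initial points when Simulate = False and more than 13 (ideally 13 + 12 = 25) when Simulate = True.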
