Adding a utility function for plotting decision regions of classifiers by rasbt · Pull Request #6338 · scikit-learn/scikit-learn · GitHub

Adding a utility function for plotting decision regions of classifiers #6338


Closed
wants to merge 4 commits

Conversation

@rasbt (Contributor) commented Feb 12, 2016

It's been a while since @amueller asked me about moving this utility function "upstream," but I finally got around to it now (rasbt/mlxtend#8) ...

Such a function is probably less useful in real-world applications (since we typically have more than 1 or 2 features in a dataset), but I think it would be a nice utility for replacing the many repeated lines of code in the tutorials and examples on the scikit-learn website.

So, this is a simple matplotlib-wrapping convenience function to plot a decision surface of a classifier; it looks like this:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC

# Loading some example data
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# Training a classifier
svm = SVC(C=0.5, kernel='linear')
svm.fit(X, y)

# Plotting decision regions
plot_decision_regions(X, y, clf=svm, res=0.02, legend=2)

# Adding axes annotations
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.title('SVM on Iris')
plt.show()

[figure: decision regions of the linear SVM on two Iris features]

It also supports 1D decision regions (if the input array has only one feature column), and you can highlight the test data points, which can be quite useful for teaching (tutorial) purposes.
[figures: 1D decision region example and examples with highlighted test points]




I am looking forward to feedback! Also, I am wondering how (or if) we should implement tests for such a function?
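One common way to test plotting helpers is a smoke test on a non-interactive backend: render the plot and assert that nothing raises and that something was drawn. A minimal sketch, assuming the proposed plot_decision_regions is importable; the assertion details are illustrative only:

import matplotlib
matplotlib.use('Agg')  # headless backend, so the test runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC

def test_plot_decision_regions_smoke():
    rng = np.random.RandomState(0)
    X = rng.randn(30, 2)
    y = np.array([0] * 15 + [1] * 15)
    clf = SVC(kernel='linear').fit(X, y)

    plot_decision_regions(X, y, clf=clf)  # the function under test
    # If we reach this point without an exception, check that artists exist.
    assert len(plt.gca().collections) > 0
    plt.close('all')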

@code-of-kpp

Hello! Is it possible to plot a decision region on a 2D slice of a multidimensional problem?

@rasbt (Contributor, Author) commented Feb 13, 2016

@podshumok I am not sure how you'd take such a slice. For the generalized linear models, you could maybe train your model on the full feature set, take 2 of the model coefficients plus the bias unit, and assign them to a new model that was trained on 2 feature variables?

@code-of-kpp

For example, build a full-dimensional mesh grid with classified markers and plot two or more subspaces, e.g. obtained by fixing all coordinates except two.
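A minimal sketch of that slicing idea (my illustration, not code from this thread): hold every feature except two at a constant value, e.g. the feature means, and predict on a grid in the remaining plane. The helper name and defaults are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

def plot_2d_slice(clf, X, y, i, j, res=0.02):
    """Plot decision regions in the (i, j) feature plane,
    holding every other feature fixed at its mean."""
    xi = np.arange(X[:, i].min() - 1, X[:, i].max() + 1, res)
    xj = np.arange(X[:, j].min() - 1, X[:, j].max() + 1, res)
    gi, gj = np.meshgrid(xi, xj)

    # Fix all coordinates at the feature means, then vary features i and j.
    grid = np.tile(X.mean(axis=0), (gi.size, 1))
    grid[:, i] = gi.ravel()
    grid[:, j] = gj.ravel()

    Z = clf.predict(grid).reshape(gi.shape)
    plt.contourf(gi, gj, Z, alpha=0.3)
    plt.scatter(X[:, i], X[:, j], c=y, edgecolors='k')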

@rasbt (Contributor, Author) commented Feb 14, 2016

@podshumok Yes, maybe a good add-on in the future -- do you think something like this would be useful in practice? The original goal of this utility function was to replace all the repeated lines of code in the documentation examples to make them more consistent and leaner.

@GaelVaroquaux (Member) commented Feb 14, 2016 via email

def plot_decision_regions(X, y, clf, X_highlight=None,
                          res=0.02, legend=1,
                          hide_spines=True,
                          markers='s^oxv<>',

This is unused.

@jnothman (Member)

A related risk is that we make plotting a mysterious black box within our examples. The examples currently make the user understand what is involved in building a plot, arguably. The flipside is that it will remove some code that can be fairly indecipherable for people not very familiar with matplotlib, meshgrid contouring, obfuscations like np.c_, etc. I remain undecided.

Looking through examples with this kind of illustration, it's clear there's some amount of variation that makes some examples arguably less readable than others. Some (such as plot_classifier_comparison) show the contour on the basis of decision_function rather than the hard delineations of predict. plot_forest_iris overlays multiple decision surfaces, so you're not really going to cover that case.
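For reference, this is roughly the boilerplate pattern in question; most examples repeat some variation of it (a paraphrased sketch, assuming a fitted two-feature classifier clf and data X, y):

import numpy as np
import matplotlib.pyplot as plt

h = 0.02  # grid step size
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# np.c_ pastes the raveled grid coordinates into (n_points, 2) samples
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')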

@rasbt (Contributor, Author) commented Feb 19, 2016

The examples currently make the user understand what is involved in building a plot, arguably

I agree with you here. However, I am wondering if "teaching" users "how to plot" is really within the scope of the scikit-learn examples (in contrast to matplotlib tutorials). Although, you are right, one could also see it as a distraction from the main content, and if we provide users with such a plotting function, they wouldn't have to worry about learning how to use matplotlib if they don't want to.

Also, we could make a documentation page that walks through this decision plot function to show people how it generally works? I think that's better than implementing it over and over again.

I see it as somewhat similar to, e.g., the train_test_split function; you could easily do that "manually" in 2-4 lines of NumPy, but it's just extra convenience.

plot_forest_iris overlays multiple decision surfaces, so you're not really going to cover that case.

Even if we leave these "special" cases as they are for now, I think there are plenty of examples that could be streamlined with a plotting function; maybe it would help to make the examples leaner, with focus on the main part.

@GaelVaroquaux (Member)
The examples currently make the user understand what is involved in building a plot, arguably

I agree with you here. However, I am wondering if "teaching" users "how to plot" is really within the scope of the scikit-learn examples (in contrast to matplotlib tutorials).

Well, we aren't really teaching matplotlib, we are teaching people the kind of operations that they need to do to probe a classifier and get data that they can plot.

I see it as somewhat similar to, e.g., the train_test_split function; you could easily do that "manually" in 2-4 lines of NumPy, but it's just extra convenience.

Indeed, it's all a question of tradeoff. If we go down the plotting way, users won't learn much about plotting and hence will always ask for more plotting. The question is whether we are ready to go down that way.

@amueller (Member)

@GaelVaroquaux late answer: I usually need to do multiple Google searches, browse Stack Overflow, and then in the end bug tcaswell about how to do the plot that I want to do. Plotting a confusion matrix like in our example so that it works with arbitrary color maps involves stuff that I couldn't really have figured out myself.

My book contains much more matplotlib code than scikit-learn code, and I think the same is true for our examples. I find it makes the point of the examples much harder to read.

Finally, scikit-learn is part of an iterative process for many people who use sklearn in a notebook. Doing exploration without plotting functions is pretty annoying. I usually end up copying and pasting parts of previous plots into my notebooks so that I have functions I can call to visualize coefficients or confusion matrices or learning curves, etc.

An alternative to having more functions in scikit-learn is creating a separate library of plotting functions.
Not sure if that should live in the contrib repo, as it doesn't really implement any estimators.

@rasbt (Contributor, Author) commented Jul 28, 2016

Yep, whether it's for exploratory analysis in a data mining/ML pipeline or for teaching purposes (e.g., a book or the scikit-learn docs), I find that the matplotlib part often buries the "main" content/message.

I think a separate lib, something like scikit-learn-plotlib or sklearnplots, sounds like a good idea. Beyond simple 1D/2D/3D plots for teaching purposes (such as the decision regions of classifiers, regression fits, or clustering results), there are many other kinds of plots where it would be useful to have "convenience wrapper functions": things like silhouette plots, elbow plots, ROC curves, etc.

@amueller (Member)

ok, so let's not do this here, then. Btw, I have a bunch of them in my BSD-licensed book repo. They need heavy cleanup, though.

@amueller (Member)

@rasbt do you want to start a repo or should I? I'm still behind on reviews (only 9000 notifications to go)

@GaelVaroquaux (Member)

@GaelVaroquaux late answer: I usually need to do multiple Google searches, browse Stack Overflow, and then in the end bug tcaswell about how to do the plot that I want to do. Plotting a confusion matrix like in our example so that it works with arbitrary color maps involves stuff that I couldn't really have figured out myself.

My book contains much more matplotlib code than scikit-learn code, and I think the same is true for our examples. I find it makes the point of the examples much harder to read.

I am sold. +1 for plotting code in sklearn. It would benefit everybody, including me.

Only one rule: it lives in a separate submodule (sklearn.plotting?), and nothing imports from that module.

Is that a good plan?

@amueller (Member)

Ok cool. +1 on sklearn.plotting (or sklearn.plot)

@amueller (Member)

Examples can import from that module, right?

@rasbt (Contributor, Author) commented Jul 29, 2016

do you want to start a repo or should I? I'm still behind on reviews (only 9000 notifications to go)

Whoa! And that's what they call the "finishing touch(es)"!? O.o

I briefly thought about creating such a repo, but yet another side project? :) Thanks @GaelVaroquaux, the sklearn.plotting solution sounds much nicer.

@amueller (Member)

Whoa! And that's what they call the "finishing touch(es)"!? O.o

No, I mean catching up with scikit-learn ;)

I opened #7116 to discuss what we might consider adding.

@PanWu commented Nov 26, 2016

This feature looks great; I'm looking forward to having a sklearn utility function to visualize decision boundaries directly. Which release will it be included in?

Btw, I wrote a small function to do a similar task; for high-dimensional data, I use PCA to do the job. Maybe this could be helpful (https://github.com/PanWu/pylib/blob/master/example/01.plot_decision_boundary.ipynb).

Example for Iris dataset + Logistic regression visualization:

[figure: logistic regression decision boundary on the Iris dataset]

@agramfort (Member) commented Nov 26, 2016 via email

@rasbt (Contributor, Author) commented Nov 26, 2016

This feature looks great, looking forward to have sklearn utility function to visualize decision boundary directly. Which release will it be included in?

Yeah, I think we should pick this up again soon! I'll make a reminder to get back to it in December, when I've checked off some other projects and am back from traveling; it will probably make it into the next sklearn version then (0.19)!?
As far as I remember, we decided to create a new submodule for that (sklearn.plotting), which should be excluded from unit testing.

@PanWu Your plotting function looks nice as well! I haven't looked at the code in detail, but I noticed that you hard-coded the colors. Would it work for an arbitrary number of class labels, then?

@PanWu commented Nov 26, 2016

@rasbt Right, currently the colors are hard-coded, and it is possible to extend this to an arbitrary number of class labels if needed. However, one needs to be careful about how to define a "class color," because a grid point's color is a linear combination of the class colors (weighted by class probability).

For example, given the following 8 basic colors (RGB scale):

0 - [1.0, 0.0, 0.0]
1 - [0.0, 1.0, 0.0]
2 - [0.0, 0.0, 1.0]
3 - [1.0, 1.0, 0.0]
4 - [0.0, 1.0, 1.0]
5 - [1.0, 0.0, 1.0]
6 - [0.0, 0.0, 0.0]
7 - [1.0, 1.0, 1.0]

if we define another class color [0.5, 0.5, 0.0], then a grid point with 50% likelihood for class 3 and 50% likelihood for class 6 will look as if it belongs to the new class with 100% likelihood.
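A small numeric illustration of that ambiguity (my sketch, using the colors listed above):

import numpy as np

# The 8 hard-coded base colors from the comment above (RGB)
colors = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.],
                   [0., 1., 1.], [1., 0., 1.], [0., 0., 0.], [1., 1., 1.]])

# A grid point that is 50/50 between classes 3 and 6 ...
proba = np.zeros(8)
proba[[3, 6]] = 0.5
mixed = proba @ colors  # -> [0.5, 0.5, 0.0]

# ... is indistinguishable from 100% confidence in a hypothetical ninth
# class whose color happens to be [0.5, 0.5, 0.0].
print(mixed)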

@PanWu commented Nov 29, 2016

@rasbt I just added an approach to generate unlimited class colors, so now it works for an arbitrary number of classes. https://github.com/PanWu/pylib/blob/master/pylib/plot.py#L125

@rasbt (Contributor, Author) commented Nov 29, 2016

Nice! Just wondering which would be the preferred function for scikit-learn: the decision_region_plot I posted earlier or the plot_decision_boundary by @PanWu. I don't want to make it too complicated, but maybe a function that draws hard boundaries by default and fuzzy boundaries if use_proba=True or so. (Mainly, I was thinking of a decision region function with regard to making the documentation examples simpler [shorter], so I am not sure how much we really want to implement here -- maybe a separate scikit-contrib project would be better for "sophisticated" visualizations?)

@PanWu commented Nov 29, 2016

Agree with @rasbt. It would be nice to have a simple UI and to add use_proba as a keyword for improved functionality.

Naming-wise, plot_decision_boundary may not be a good function name for the general purpose, since it suggests "boundary" rather than "region." We can choose either decision_region_plot or plot_decision_region based on whether the preferred name in sklearn should start with a verb or a noun.

@amueller (Member) commented Nov 30, 2016

I looked through the code for the plot_decision_boundary that @PanWu wrote, and it looks like it retrains the model on a 2D projection of the high-dimensional data. Is that right?

I'm not sure what that illustrates. I feel it obfuscates more than anything else. In particular, is the model that is passed in discarded?

@amueller (Member)

I have functions for hard and soft boundaries, I think. I would like a function that can easily draw a linear separator as a line, or color regions, or color by distance. The one from my book might need some polish but should be able to do that.

@PanWu commented Dec 1, 2016

@amueller Not exactly: the model (for classification) is not modified; the grid on the 2D surface is inverse-transformed back into the high-dimensional space, and then probabilities & decisions are computed for these high-dimensional grid points (although sparse) with the classification model. These probabilities & decisions are then drawn over the 2D surface to represent the model's probabilities & predictions.

The name dr_model means: dimension-reduction model (such as PCA or kernel PCA). It has both fit_transform and inverse_transform methods. This is how the 2D space represents the model's results over the high-dimensional space.
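Paraphrasing that as code (a sketch, assuming model is a fitted high-dimensional classifier and X has more than two features; the ss/dr_model names mirror @PanWu's code):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

ss = StandardScaler()
X_std = ss.fit_transform(X)

dr_model = PCA(n_components=2)          # the dimension-reduction model
X_2d = dr_model.fit_transform(X_std)    # 2D coordinates used for display

# Build a grid in the 2D projected space ...
gx, gy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 100),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 100))
grid_2d = np.c_[gx.ravel(), gy.ravel()]

# ... lift it back into the original high-dimensional space ...
grid_highdim = ss.inverse_transform(dr_model.inverse_transform(grid_2d))

# ... and let the unmodified classifier label the lifted grid points.
Z = model.predict(grid_highdim).reshape(gx.shape)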

@amueller (Member) commented Dec 1, 2016

@PanWu That's an interesting way to do it. I'm not sure I would advocate this as a default solution, but maybe it's a good idea? How do you set the bandwidth in the kernel PCA?

Could you maybe do a binary classification example where you have data on a 3D or 4D sphere and some data in the center, learn it with a random forest or an SVM, and then project it using KernelPCA or PCA?
I feel like that could yield some insight into how useful this is.
For PCA, I guess the directions are pretty arbitrary and you'll just get some random projection of the ball to 2D ... which would show some overlap of the classes, but not too badly...

I'm not sure if there are other high-dimensional datasets we could look at - I'd like the classes to have more similar densities...

Another interesting thing would be to compare a PCA projection against a LinearDiscriminantAnalysis projection when using a linear classifier and linearly separable dataset. Or maybe just use the classifier itself if it's linear?

@PanWu commented Dec 4, 2016

@amueller Definitely agree; I would not advocate visualizing more than 2D data either. However, if the user really has a high-dimensional use case, I would like to have a simple way to help extend this function into higher dimensions. With PCA, I think linearly separable cases should be fine; for non-linear cases, as long as the user understands PCA's principle (intrinsically linear), such a visualization could still be helpful.

Here is an example showing how 2D, 3D, and 10D circle data (as class 1) vs. a Gaussian distribution around the center (as class 2) looks. I am using PCA for the visualization's dimension reduction, and the classifier is a random forest.

(2D case: clear insights; I would advocate that all users stick to 2D)
[figure]

(3D case: interesting shape ... one can see that the decision boundary is solid, so the green points flying inside the red region suggest such cases are due to the dimension-reduction visualization)
[figure]

(10D case: similar to 3D, just a more exaggerated "green points in red region" case)
[figure]

Here is the code I am using:

import numpy as np
from sklearn import ensemble

def gen_data(N=200, k=2):
    np.random.seed(1)

    # Class 0: Gaussian blob around the origin
    X1 = np.random.randn(N, k) * 2.0
    Y1 = np.zeros([N, 1])

    # Class 1: points scaled out to a noisy sphere of radius ~25
    X2 = np.random.rand(N, k) - 0.5
    r = 25
    X2f = X2 * ((r + np.random.randn(N) * 1.) / np.sqrt((X2 ** 2).sum(axis=1))).reshape(N, 1)
    Y2f = np.ones([N, 1])

    X = np.vstack([X1, X2f])
    Y = np.vstack([Y1, Y2f]).ravel()
    return X, Y

X, Y = gen_data(200, 10)
model = ensemble.RandomForestClassifier(n_estimators=51)
model.fit(X, Y)
# plot_decision_boundary as defined in https://github.com/PanWu/pylib
a = plot_decision_boundary(model, X=X, Y=Y)

@amueller (Member) commented Dec 6, 2016

That's cool, though I do have my doubts about the utility of the last one.

@rasbt (Contributor, Author) commented Dec 6, 2016

Before we keep the discussion going in this PR, I was wondering whether it's worthwhile closing this PR in favor of the alternative solution proposed by @PanWu.

@amueller (Member) commented Dec 6, 2016

The PCA is a bit too magic for my taste to have it as default behavior. We can discuss adding a way to visualize high-dimensional decision boundaries - which I think is very important - but I don't want to jump the gun on this.

The last figure in the series is still a synthetic toy dataset that is much simpler than what you'll see "in the real world" and the visualization is already pretty useless.

We can try it on some more real-world datasets and see how it goes, though. Unfortunately, I couldn't make the code work - though I didn't try that hard:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-1a10ba4d9486> in <module>()
     20 model = ensemble.RandomForestClassifier(n_estimators=51)
     21 model.fit(X, Y)
---> 22 a = plot_decision_boundary(model, X=X, Y=Y)

<ipython-input-6-d727cb8ef4de> in plot_decision_boundary(model, dim_red_method, X, Y, xrg, yrg, Nx, Ny, scatter_sample, figsize, alpha, random_state)
    123     else:
    124         X_full_grid_inverse = ss.inverse_transform(
--> 125             dr_model.inverse_transform(X_full_grid))
    126 
    127         Yp = model.predict(X_full_grid_inverse)

/home/andy/checkout/scikit-learn/sklearn/decomposition/base.py in inverse_transform(self, X, y)
    160                             self.components_) + self.mean_
    161         else:
--> 162             return fast_dot(X, self.components_) + self.mean_

TypeError: unsupported operand type(s) for *: 'zip' and 'float'

Probably some Python 3 issues? I changed the print statements but nothing else.

@amueller (Member) commented Dec 6, 2016

Ah, got it. Though np.c_ would probably be better than the zip.
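For anyone hitting the same error: in Python 2, zip() returned a list, which NumPy could turn into an array; in Python 3 it returns a lazy iterator, which NumPy arithmetic chokes on (hence the "unsupported operand type(s) for *: 'zip' and 'float'" above). A minimal sketch of the difference:

import numpy as np

xx, yy = np.meshgrid(np.arange(3), np.arange(3))

bad = zip(xx.ravel(), yy.ravel())     # Python 3: an iterator, not an array
grid = np.c_[xx.ravel(), yy.ravel()]  # explicit (n_points, 2) array instead
print(grid.shape)  # (9, 2)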

@rasbt (Contributor, Author) commented Dec 6, 2016

The PCA is a bit too magic for my taste to have it as default behavior.

Imho, I would opt for something plain and simple for scikit-learn, mainly something that's intuitive for users to interpret, since the goal I had in mind was to replace the elaborate code in the documentation examples and to have something model-agnostic in 2D for simple teaching purposes. Not that I am against fancier stuff - I think it's an interesting idea - but it's probably better suited for a scikit-visualize contrib project or so. As @GaelVaroquaux mentioned, we probably don't want to go too deep into the plotting stuff :)

@amueller (Member) commented Dec 7, 2016

@rasbt Yeah, I agree, which is why I would rather stick to something closer to your version - plus a helpful error message if the data is higher-dimensional, since that's something that comes up a lot.

@rasbt (Contributor, Author) commented Dec 7, 2016

Makes sense! Just looking at my code in mlxtend, it already has some basic input checking, complaining if the data array has >2 features (https://github.com/rasbt/mlxtend/blob/master/mlxtend/plotting/decision_regions.py). I have made several changes/improvements over time compared to the PR here. Also, I may want to remove the highlighting (circling) of test samples, since it may be too much fluff for a robust, basic decision region plotting function for scikit-learn. In any case, I will put together a fresh commit in the next few days so that you can have a look!

@amueller (Member) commented Dec 7, 2016

Cool.
One feature I definitely want is to easily be able to plot a linear model so that it looks nice ;)

@rasbt (Contributor, Author) commented Dec 7, 2016

Haha, "looking nice" is relative (it still looks like the images at the very top of this PR), but as it is right now, it should:

  • be model-agnostic for 2D and 1D decision boundaries
  • support an arbitrary number of classes

@amueller (Member) commented Dec 7, 2016

Can you also do "soft" decision boundaries that show predict_proba or decision_function? I don't really know how to do that well for more than two classes.

@rasbt (Contributor, Author) commented Dec 7, 2016

Phew, I'd have to think about how to implement that; currently, it wouldn't work with the way I implemented things w.r.t. colors and plt.contourf.
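For the binary case, at least, a soft boundary is straightforward; the hard part is indeed generalizing the coloring beyond two classes. A sketch, assuming a fitted two-class clf with predict_proba and 2D data X, y:

import numpy as np
import matplotlib.pyplot as plt

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))

# Shade by P(class 1); the 0.5 contour is the hard decision boundary.
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
plt.contour(xx, yy, Z, levels=[0.5], colors='k')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolors='k')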

@PanWu commented Dec 7, 2016

Agree with only having the 2D case in sklearn, to make sure things are stable and consistent.

@rasbt, if needed, I have some functions to "support an arbitrary number of classes"; feel free to leverage them for this implementation.
https://github.com/PanWu/pylib/blob/master/pylib/util.py
[figure]
A detailed discussion is here: http://www.magic-analytics.com/blog/rgb-representation-for-classes

@rasbt (Contributor, Author) commented Dec 8, 2016

Agree with only having 2D case in sklearn to make sure things are stable and consistent.

And more interpretable (w.r.t. what it actually does) :). The code I have should also support an arbitrary number of classes, but I will probably come back to your offer regarding the soft boundaries for predict_proba :)

@amueller added the Needs Decision label on Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:48