[MRG + 1] add partial_fit to multioutput module #8054
Conversation
Tests are failing. Also, please use …
Sorry about the typo. And hmm, can you explain more about using …?
Do a …
I see your point. More effort is needed: the current implementation doesn't account for reuse of self.estimators_, and a simple clone passed to the Parallel jobs is why the test cases are failing.
9ac08c2 to 7b42573 (compare)
@amueller can I have another comment?
class MultiOutputEstimator(six.with_metaclass(ABCMeta, BaseEstimator)):

    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs

    @if_delegate_has_method('estimator')
Hi, while you are at it, can you please add the @if_delegate_has_method decorator for the fit and predict functions too? Thanks.
Sure thing
No, we do not need if_delegate_has_method for fit. It is required for every estimator. I think in this context we are dealing with predictors and we do not need if_delegate_has_method for predict.
We also do not need the scope of the PR to be expanded unnecessarily. Throwing in a decorator means that decoration needs testing, for instance.
Sorry for the unnecessary work, @yupbank
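For context, a minimal sketch (an illustrative toy class, not the PR's implementation) of what the decorator buys on partial_fit only, assuming if_delegate_has_method from sklearn.utils.metaestimators as it existed at the time: fit must exist on every estimator, while partial_fit should only be advertised when the wrapped estimator has it.

    # Illustrative only: hasattr(meta, 'partial_fit') mirrors the wrapped estimator.
    from sklearn.base import BaseEstimator, clone
    from sklearn.utils.metaestimators import if_delegate_has_method


    class ToyMetaEstimator(BaseEstimator):
        def __init__(self, estimator):
            self.estimator = estimator

        def fit(self, X, y):
            # always present; no delegation guard needed
            self.estimator_ = clone(self.estimator).fit(X, y)
            return self

        @if_delegate_has_method('estimator')
        def partial_fit(self, X, y):
            # only "exists" when self.estimator itself has partial_fit
            if not hasattr(self, 'estimator_'):
                self.estimator_ = clone(self.estimator)
            self.estimator_.partial_fit(X, y)
            return self

With this, hasattr(ToyMetaEstimator(SGDClassifier()), 'partial_fit') is True while hasattr(ToyMetaEstimator(RandomForestClassifier()), 'partial_fit') is False, which is the delegation behaviour being discussed.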
@@ -83,8 +164,10 @@ def fit(self, X, y, sample_weight=None):
            raise ValueError("Underlying regressor does not support"
A small nitpick: could you replace regressor with estimator, since it is the base class? Thanks again!
Thanks and will do
Thanks, this was my mistake :)
        else:
            estimator = copy.copy(estimator)

        if sample_weight is not None:
I was wondering if estimator.partial_fit(X, y, classes=classes, sample_weight=sample_weight) would suffice, since the partial_fit function of the base estimator would handle it appropriately. Let me know if it sounds okay!
Since the regressor and classifier share the same interface, the choice of argument for classes can't be neglected. As for sample_weight, I might need to check whether it is universal for SGD estimators.
The fit in BaseSGD only takes X, y, so I think I will keep this to make sure all the SGD estimators use this code.
Hi, thanks for looking into it. I understand that the classes argument is not applicable for regressors. But in case someone inadvertently passes a classes argument with a regressor, the estimator would throw an error, right? Perhaps we can have separate partial_fit functions in the subclasses MultiOutputRegressor and MultiOutputClassifier? Just my 2c.
Also, BaseSGD has an abstract definition for fit, which is implemented in the subclasses BaseSGDClassifier and BaseSGDRegressor, right?
Please do let me know what you think. Thanks.
Hi, sorry, I missed the partial_fit function you have added to MultiOutputRegressor. And since it is a helper function, I get the reason for separate code paths based on classes. Sorry for the noise.
classes is required for classifiers' partial_fit, but I think we have no requirement that sample_weight be supported.
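To make the agreed behaviour concrete, here is a rough per-output helper in the spirit of the PR (the name _partial_fit_estimator and the exact signature are illustrative, not necessarily the merged code): classes is forwarded only when given, and sample_weight only when the caller passed one, so regressors and classifiers can share the same path.

    from sklearn.base import clone


    def _partial_fit_estimator(estimator, X, y, classes=None,
                               sample_weight=None, first_time=True):
        # Incrementally fit one per-output estimator on a single target column.
        if first_time:
            estimator = clone(estimator)
        if classes is not None and sample_weight is not None:
            estimator.partial_fit(X, y, classes=classes,
                                  sample_weight=sample_weight)
        elif classes is not None:
            estimator.partial_fit(X, y, classes=classes)
        elif sample_weight is not None:
            estimator.partial_fit(X, y, sample_weight=sample_weight)
        else:
            estimator.partial_fit(X, y)
        return estimator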
    # train the multi_target_linear and also get the predictions.
    half_index = int(X.shape[0] / 2)
    multi_target_linear.partial_fit(
        X[:half_index], y[:half_index], classes=classes)
Should also check:
- sample weight
- that passing classes=None raises an appropriate error

(see the sketch of such tests just below)
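A rough sketch of the two extra checks being requested, assuming an SGDClassifier-based setup like the PR's other tests; the data construction and test names here are illustrative, and the error message is the one asserted in the PR's own test further down.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.utils.testing import assert_raises_regex

    # deterministic toy data: two outputs, every class present in each column
    rng = np.random.RandomState(0)
    X = rng.normal(size=(60, 3))
    y = np.column_stack([np.tile([0, 1], 30), np.tile([0, 1, 2], 20)])
    classes = [np.unique(y[:, i]) for i in range(y.shape[1])]


    def test_partial_fit_accepts_sample_weight():
        clf = MultiOutputClassifier(SGDClassifier(random_state=0))
        clf.partial_fit(X, y, classes=classes,
                        sample_weight=np.ones(X.shape[0]))


    def test_partial_fit_without_classes_raises():
        clf = MultiOutputClassifier(SGDClassifier(random_state=0))
        assert_raises_regex(ValueError,
                            "classes must be passed on the first call",
                            clf.partial_fit, X, y)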
    half_index = 25
    for n in range(3):
        sgr = SGDRegressor(random_state=0)
        sgr.partial_fit(X_train[:half_index], y_train[:half_index, n])
You also need to test sample_weight
            Data.

        y : (sparse) array-like, shape (n_samples, n_outputs)
            Multi-output targets. An indicator matrix turns on multilabel
Multilabel is not appropriate here.
    def __init__(self, estimator, n_jobs=1):
        super(MultiOutputRegressor, self).__init__(estimator, n_jobs)

    def partial_fit(self, X, y, sample_weight=None):
        """ Fit linear model with Stochastic Gradient Descent..
Huh?
will update
@@ -68,7 +148,8 @@ def fit(self, X, y, sample_weight=None):
        """

        if not hasattr(self.estimator, "fit"):
            raise ValueError("The base estimator should implement a fit method")
            raise ValueError(
These cosmetic fixes make it much harder to review your work.
        classes : array, shape (n_classes, n_outputs)
            Classes across all calls to partial_fit.
            Can be obtained by via
            `[np.unique(y[:, i]) for i in range(y.shape[1])]`, where y is the
This is only going to work if there are the same number of classes in each output.
The code in the documentation is going to work even with a different number of classes in each output.
But yeah, I need to change https://github.com/scikit-learn/scikit-learn/pull/8054/files#diff-66150694b846268fe58229e40db8b909R122 into classes[i] if classes is not None else None.

    import numpy as np

    y = np.array([[0, 1, 2, 0], [0, 1, 0, 1], [1, 0, 1, 0]]).T
    # y is:
    # array([[0, 0, 1],
    #        [1, 1, 0],
    #        [2, 0, 1],
    #        [0, 1, 0]])

    [np.unique(y[:, i]) for i in range(y.shape[1])]
    # [array([0, 1, 2]), array([0, 1]), array([0, 1])]
            Multi-output targets. An indicator matrix turns on multilabel
            estimation.

        classes : array, shape (n_classes, n_outputs)
Should specify "of int or string", and perhaps it should not be an array, to allow for different numbers of classes in each output.
            Data.

        y : (sparse) array-like, shape (n_samples, n_outputs)
            Multi-output targets. An indicator matrix turns on multilabel
It doesn't really turn on anything. Multilabel is binary multioutput by definition.
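To illustrate the point (example data made up): multilabel targets are just the binary special case of multi-output targets, so an indicator matrix does not "turn on" anything extra.

    import numpy as np

    y_multilabel = np.array([[0, 1, 1],
                             [1, 0, 1]])    # every column is binary (indicator matrix)
    y_multioutput = np.array([[0, 2, 1],
                              [1, 0, 3]])   # columns may have more than two classes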
class MultiOutputEstimator(six.with_metaclass(ABCMeta, BaseEstimator)):

    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs

    @if_delegate_has_method('estimator')
    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """ Fit linear model with Stochastic Gradient Descent..
Huh?
        if first_time:
            estimator = clone(estimator)
        else:
            estimator = copy.copy(estimator)
This is an interesting implementation detail. I acknowledge that copying like this is the only easy way to maintain consistent behaviour whether or not parallelism is used. However, I think it should be a deepcopy if anything. And I suppose you should ensure this behaviour in tests: that across multiple calls to partial_fit the same estimator objects are not maintained.
I don't think deepcopy is required; it is a waste of memory. joblib.Parallel is going to take care of it and ensure no job interferes with another job's input.

    from sklearn.externals.joblib import Parallel, delayed  # joblib was vendored here at the time

    d = [0]
    a = [d, d, d]

    def change_value(x, i):
        x[0][0] = i
        return x

    w = Parallel(n_jobs=1, backend='threading')(
        delayed(change_value)(a, n) for n in xrange(3))
    w
    # [[0], [0], [0]]
    # [[1], [1], [1]]
    # [[2], [2], [2]]

    w = Parallel(n_jobs=2, backend='multiprocessing')(
        delayed(change_value)(a, n) for n in xrange(3))
    w
    # [[0], [0], [0]]
    # [[1], [1], [1]]
    # [[2], [2], [2]]

    w = []
    for n in xrange(3):
        w.append(change_value(a, n))
    w
    # [[[2], [2], [2]], [[2], [2], [2]], [[2], [2], [2]]]
And yeah, the test case needs to be updated.
joblib.Parallel will not take care of it in the n_jobs=1 case...
Sorry I take that back.
So if joblib is performing a copy regardless, are you sure we need this copy.copy?
Otherwise LGTM
class MultiOutputEstimator(six.with_metaclass(ABCMeta, BaseEstimator)):

    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs

    @if_delegate_has_method('estimator')
    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """ Fit the model to data with Stochastic Gradient Descent.
Still, SGD?
    def __init__(self, estimator, n_jobs=1):
        super(MultiOutputRegressor, self).__init__(estimator, n_jobs)

    def partial_fit(self, X, y, sample_weight=None):
        """ Fit the model to data with Stochastic Gradient Descent..
SGD?
def test_mutli_output_classifiation_partial_fit_no_first_classes_exception():
    sgd_linear_clf = SGDClassifier(loss='log', random_state=1)
    multi_target_linear = MultiOutputClassifier(sgd_linear_clf)
    assert_raises_regex(ValueError, "classes must be passed on the first call to partial_fit.",
PEP8 line length
I looked over the tests yesterday, but can't recall if I checked for one that ensured that, regardless of …
class MultiOutputEstimator(six.with_metaclass(ABCMeta, BaseEstimator)):

    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs

    @if_delegate_has_method('estimator')
    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """ Fit the model to data with SGD.
No, what I'm saying is that it's not always SGD. Why do you mention SGD??
Oh man... I was so obsessed with SGD. I'll delete that. I've been working heavily on SGD-related projects recently, sorry about that.
    def __init__(self, estimator, n_jobs=1):
        super(MultiOutputRegressor, self).__init__(estimator, n_jobs)

    def partial_fit(self, X, y, sample_weight=None):
        """ Fit the model to data with SGD.
No, what I'm saying is that it's not always SGD. Why do you mention SGD??
28031e0 to 59261e2 (compare)
Apart from nitpicks, LGTM.
Please add an "Enhancement" entry in whats_new.rst. Thanks!
class MultiOutputEstimator(six.with_metaclass(ABCMeta, BaseEstimator)):

    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs

    @if_delegate_has_method('estimator')
    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """ Fit the model to data.
Say "Incrementally fit". Also remove space between quotes and words.
Thanks a lot for the PR! A few minor comments and questions.
        if first_time:
            estimator = clone(estimator)
        else:
            estimator = copy.deepcopy(estimator)
Why? Sorry if this question was answered before...
Once the estimators are cloned, can they not be fitted in parallel? I don't understand why you need to copy them? Maybe I'm missing something...
It has been answered. It is to ensure fitting happens in different processes on different estimators, and partial_fit requires being called multiple times.
Yes, you will be partial-fitting one estimator per process, right? Unless you are using the same estimator in 2 processes, I don't see the problem here... Maybe I'm just completely blind to some detail. Would you be kind enough to clarify, please?
The way I see it, we have one estimator per target. Say there are 2 targets: we have est1 and est2. And if you give n_jobs>=2, est1 will partial_fit on process 1 and est2 on process 2, correct? Both of these are cloned from the original est instance and hence should train on the required batch of data without any side effects from parallelism...
Oh... yeah, by the time of the second fit I already have separate estimators, so no clone/copy is needed anymore.
Thanks for the catch 👍
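To summarise where the thread lands, a condensed sketch (simplified and illustrative, not the merged code verbatim; it reuses the _partial_fit_estimator helper sketched earlier in this conversation): clone one estimator per output on the first call, then keep reusing those now-separate fitted estimators on later calls, so no extra copy is needed.

    from sklearn.externals.joblib import Parallel, delayed  # joblib was vendored here at the time


    class MultiOutputEstimatorSketch(object):
        def __init__(self, estimator, n_jobs=1):
            self.estimator = estimator
            self.n_jobs = n_jobs

        def partial_fit(self, X, y, classes=None, sample_weight=None):
            first_time = not hasattr(self, 'estimators_')
            # clone happens inside the helper only when first_time is True;
            # afterwards the already-separate fitted estimators are reused
            self.estimators_ = Parallel(n_jobs=self.n_jobs)(
                delayed(_partial_fit_estimator)(
                    self.estimator if first_time else self.estimators_[i],
                    X, y[:, i],
                    classes[i] if classes is not None else None,
                    sample_weight, first_time)
                for i in range(y.shape[1]))
            return self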
@@ -44,6 +44,10 @@ New features
Enhancements
............

- :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier`
  now support incremental fit using `partial_fit`.
"support online learning using partial_fit
" sounds sexy? (nitpick of the highest order. Feel free to ignore)
            Multi-output targets.

        classes : list of numpy arrays, shape (n_outputs)
            each array is unique classes for one output in str/int
Each (capital E)
        classes : list of numpy arrays, shape (n_outputs)
            each array is unique classes for one output in str/int
            Can be obtained by via
            `[np.unique(y[:, i]) for i in range(y.shape[1])]`, where y is the
Double backticks
        self : object
            Returns self.
        """

blank line can be removed
def test_multi_output_classification_partial_fit_parallelism():
    sgd_linear_clf = SGDClassifier(loss='log', random_state=1)
    mor = MultiOutputClassifier(sgd_linear_clf)
Why is n_jobs not set for the parallelism test?
        sgd_linear_clf.partial_fit(
            X[:half_index], y[:half_index, i], classes=classes[i])
        sgd_linear_clf.partial_fit(X[half_index:], y[half_index:, i])
        assert_equal(list(sgd_linear_clf.predict(X)), list(predictions[:, i]))
You can use assert_array_equal here...
    multi_target_linear = MultiOutputClassifier(sgd_linear_clf)

    # train the multi_target_linear and also get the predictions.
    half_index = int(X.shape[0] / 2)
You could use X.shape[0] // 2 instead of using int...
(If not already done, use from __future__ import division at the top of the file.)
    half_index = int(X.shape[0] / 2)
    multi_target_linear.partial_fit(
        X[:half_index], y[:half_index], classes=classes)
    multi_target_linear.partial_fit(X[half_index:], y[half_index:])
Would it be better to test that the predictions are the same for both multi_target_linear and sgd_linear_clf after each partial fit?
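A sketch of what that per-batch comparison could look like, reusing X, y, classes, and half_index from the surrounding test (illustrative only; assert_array_equal as used elsewhere in the test file):

    from sklearn.utils.testing import assert_array_equal

    # one reference single-target classifier per output column
    refs = [SGDClassifier(loss='log', random_state=1) for _ in range(y.shape[1])]
    batches = [slice(0, half_index), slice(half_index, None)]
    for b, batch in enumerate(batches):
        if b == 0:
            multi_target_linear.partial_fit(X[batch], y[batch], classes=classes)
        else:
            multi_target_linear.partial_fit(X[batch], y[batch])
        predictions = multi_target_linear.predict(X)
        for i, ref in enumerate(refs):
            if b == 0:
                ref.partial_fit(X[batch], y[batch, i], classes=classes[i])
            else:
                ref.partial_fit(X[batch], y[batch, i])
            # after every batch, each output of the wrapper should match a
            # single-target SGDClassifier trained on that column alone
            assert_array_equal(ref.predict(X), predictions[:, i])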
    clf = MultiOutputClassifier(sgd_linear_clf)
    clf.fit(X, y)
    X_test = [[1.5, 2.5, 3.5]]
    assert_almost_equal(clf.predict(X_test), clf_w.predict(X_test))
assert_array_almost_equal
@@ -15,13 +15,15 @@
# License: BSD 3 clause

import numpy as np
import copy
Can this be removed now?
With that I'm +1 for merge once CIs pass...
        sgd_linear_clf.partial_fit(
            X[:half_index], y[:half_index, i], classes=classes[i])
        sgd_linear_clf.partial_fit(X[half_index:], y[half_index:, i])
        assert_array_equal(sgd_linear_clf.predict(X), predictions[:, i])
Could you try and put both these blocks in a loop? It'd save us some redundancy in code.
(sorry for the nagging)
sure thing :)
…either, so I'm not sure it is the right thing to do to add that to online learners.
@yupbank Sorry indeed, forget what I said :).
Thanks a lot... I'm merging!
Thx @yupbank
* add partial_fit to multioupt module
* fix range in python3
* fix flake8
* fix the comments
* fix according to comments
* fix lint
* remove pytest
* fix ValueException message
* py 3.5 compatiable classes
* fix stuff
* fix according the comments
* remove used copy
* flake8..
* fix docs
* eventually, i use deepcopy to ensure the parallel
* lint..
* address final comment
* fix addressing the comments
* update confirmed separate estimators
* finally remove copy
* compact test
Reference Issue
Fixes: #8053
What does this implement/fix? Explain your changes.
add partial_fit
Any other comments?