8000 clone should not really deepcopy constructor parameters · Issue #5563 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
clone should not really deepcopy constructor parameters #5563
Closed
@ogrisel

Description

@ogrisel

As of now, clone systematically triggers a copy of the input parameters:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cntvec = CountVectorizer(vocabulary={'g': 0, 'a': 1, 't': 2, 'c': 3})
>>> cntvec.vocabulary is clone(cntvec).vocabulary
False

This could be inefficient on large input datastructures (e.g. for a vocabular with 100k+ tokens instead of 4 as in the previous example).

Furthermore, it forbidden to have estimators that mutate the input params datastructures and we actually have a hash-based check in our official tests:

>>> from sklearn.utils.estimator_checks import check_estimators_overwrite_params
>>> class EvilMutationEstimator(BaseEstimator):
...     def __init__(self, a=np.ones(42)):
...         self.a = a
...     def fit(self, X, y=None):
...         self.a[:] = 0
...         return self
...
>>> check_estimators_overwrite_params('bad', EvilMutationEstimator)
Traceback (most recent call last):
  File "<ipython-input-35-fd40a6bd7830>", line 1, in <module>
    check_estimators_overwrite_params('bad', EvilMutationEstimator)
  File "/Users/ogrisel/code/scikit-learn/sklearn/utils/estimator_checks.py", line 1331, in check_estimators_overwrite_params
    % (name, param_name, original_value, new_value))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/unittest/case.py", line 817, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/unittest/case.py", line 1190, in assertMultiLineEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/unittest/case.py", line 662, in fail
    raise self.failureException(msg)
AssertionError: '7f743b92849194794b6276898d494d6f' != '810ecd082b511abb9c3960f8f4adfee1'
- 7f743b92849194794b6276898d494d6f
+ 810ecd082b511abb9c3960f8f4adfee1
 : Estimator bad should not change or mutate  the parameter a from [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.] to [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.] during fit.

So I as sklearn treats constructor params as immutable data-structures I really do not see the need for clone to do a deepcopy on them.

WDYT?

Metadata

Metadata

Assignees

No one assigned

    Labels

    EasyWell-defined and straightforward way to resolve

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0