Closed
Description
As of now, clone
systematically triggers a copy of the input parameters:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cntvec = CountVectorizer(vocabulary={'g': 0, 'a': 1, 't': 2, 'c': 3})
>>> cntvec.vocabulary is clone(cntvec).vocabulary
False
This could be inefficient on large input datastructures (e.g. for a vocabular with 100k+ tokens instead of 4 as in the previous example).
Furthermore, it forbidden to have estimators that mutate the input params datastructures and we actually have a hash-based check in our official tests:
>>> from sklearn.utils.estimator_checks import check_estimators_overwrite_params
>>> class EvilMutationEstimator(BaseEstimator):
... def __init__(self, a=np.ones(42)):
... self.a = a
... def fit(self, X, y=None):
... self.a[:] = 0
... return self
...
>>> check_estimators_overwrite_params('bad', EvilMutationEstimator)
Traceback (most recent call last):
File "<ipython-input-35-fd40a6bd7830>", line 1, in <module>
check_estimators_overwrite_params('bad', EvilMutationEstimator)
File "/Users/ogrisel/code/scikit-learn/sklearn/utils/estimator_checks.py", line 1331, in check_estimators_overwrite_params
% (name, param_name, original_value, new_value))
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/unittest/case.py", line 817, in assertEqual
assertion_func(first, second, msg=msg)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/unittest/case.py", line 1190, in assertMultiLineEqual
self.fail(self._formatMessage(msg, standardMsg))
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/unittest/case.py", line 662, in fail
raise self.failureException(msg)
AssertionError: '7f743b92849194794b6276898d494d6f' != '810ecd082b511abb9c3960f8f4adfee1'
- 7f743b92849194794b6276898d494d6f
+ 810ecd082b511abb9c3960f8f4adfee1
: Estimator bad should not change or mutate the parameter a from [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.] to [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.] during fit.
So I as sklearn treats constructor params as immutable data-structures I really do not see the need for clone to do a deepcopy on them.
WDYT?