imblearn SMOTE throwing error when n_jobs > 1 · Issue #10916 · scikit-learn/scikit-learn · GitHub

Closed
datNurd opened this issue Apr 4, 2018 · 7 comments

datNurd commented Apr 4, 2018

Description

imblearn SMOTE throws an error when n_jobs > 1:

from imblearn.over_sampling import SMOTE
from sklearn import svm
sm = SMOTE(random_state=12, kind="svm", svm_estimator=svm.SVC(C=0.1, kernel="linear"), n_jobs=6)
X_res, y_res = sm.fit_sample(X, y)
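
A possible workaround, not confirmed anywhere in this thread: the traceback below shows `check_array` casting the sparse matrix to float, which leads SciPy to call `sort_indices()` in place while `self.data` is a `memmap` inside the worker process. Assuming that read-only buffer is what triggers the `ValueError`, converting the matrix to float64 with sorted, de-duplicated indices in the parent process before calling SMOTE may avoid the in-place write. The helper name `make_canonical_float_csr` and the toy matrix are hypothetical stand-ins, not part of the reporter's script.

```python
import numpy as np
import scipy.sparse as sp


def make_canonical_float_csr(X):
    """Return a writable float64 CSR copy with sorted, de-duplicated indices.

    Hypothetical helper: the idea is to do the dtype cast and the in-place
    index sort here, in the parent process, so that workers have no reason
    to modify the (possibly read-only) arrays they receive.
    """
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    X.sum_duplicates()  # sorts indices and merges duplicates on our own copy
    return X


# Toy stand-in for the vectorizer output: integer counts, unsorted column indices.
data = np.array([1, 2, 1, 3], dtype=np.int64)
indices = np.array([3, 0, 2, 1], dtype=np.int32)
indptr = np.array([0, 2, 4], dtype=np.int32)
X_counts = sp.csr_matrix((data, indices, indptr), shape=(2, 4))

X_ready = make_canonical_float_csr(X_counts)
print(X_ready.dtype, X_ready.has_sorted_indices, X_ready.has_canonical_format)
```

In the reporter's setting this would mean passing `make_canonical_float_csr(X)` to `fit_sample` instead of `X`; whether that is enough to avoid the error with `n_jobs > 1` is untested here.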

Expected Results

Actual Results

Error:
multiprocessing.pool.RemoteTraceback:

Traceback (most recent call last):
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 223, in euclidean_distances
    X, Y = check_pairwise_arrays(X, Y)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 110, in check_pairwise_arrays
    warn_on_dtype=warn_on_dtype, estimator=estimator)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py", line 431, in check_array
    force_all_finite)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py", line 296, in _ensure_sparse_format
    spmatrix = spmatrix.astype(dtype)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py", line 71, in astype
    self._deduped_data().astype(dtype, casting=casting, copy=copy),
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py", line 34, in _deduped_data
    self.sum_duplicates()
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 1009, in sum_duplicates
    self.sort_indices()
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 1055, in sort_indices
    self.indices, self.data)
ValueError: WRITEBACKIFCOPY base is read-only

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 359, in __call__
    raise TransportableException(text, e_type)
sklearn.externals.joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
ValueError                                         Wed Apr  4 09:09:04 2018
PID: 20131                    Python 3.5.2: /home/ubuntu/ML/venv/bin/python
...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function euclidean_distances>, (<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>), {'squared': True})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function euclidean_distances>
        args = (<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>)
        kwargs = {'squared': True}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, Y_norm_squared=None, squared=True, X_norm_squared=None)
    218 
    219     See also
    220     --------
    221     paired_distances : distances betweens pairs of elements of X and Y.
    222     """
--> 223     X, Y = check_pairwise_arrays(X, Y)
        X = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
        Y = <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>
    224 
    225     if X_norm_squared is not None:
    226         XX = check_array(X_norm_squared)
    227         if XX.shape == (1, X.shape[0]):

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, precomputed=False, dtype=<class 'float'>)
    105     if Y is X or Y is None:
    106         X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
    107                             warn_on_dtype=warn_on_dtype, estimator=estimator)
    108     else:
    109         X = check_array(X, accept_sparse='csr', dtype=dtype,
--> 110                         warn_on_dtype=warn_on_dtype, estimator=estimator)
        warn_on_dtype = False
        estimator = 'check_pairwise_arrays'
    111         Y = check_array(Y, accept_sparse='csr', dtype=dtype,
    112                         warn_on_dtype=warn_on_dtype, estimator=estimator)
    113 
    114     if precomputed:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, accept_sparse='csr', dtype=<class 'float'>, order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator='check_pairwise_arrays')
    426         estimator_name = "Estimator"
    427     context = " by %s" % estimator_name if estimator is not None else ""
    428 
    429     if sp.issparse(array):
    430         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 431                                       force_all_finite)
        force_all_finite = True
    432     else:
    433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, accept_sparse=['csr'], dtype=<class 'float'>, copy=False, force_all_finite=True)
    291                          "boolean or list of strings. You provided "
    292                          "'accept_sparse={}'.".format(accept_sparse))
    293 
    294     if dtype != spmatrix.dtype:
    295         # convert dtype
--> 296         spmatrix = spmatrix.astype(dtype)
        spmatrix = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
        spmatrix.astype = <bound method _data_matrix.astype of <5409x11723...stored elements in Compressed Sparse Row format>>
        dtype = <class 'float'>
    297     elif copy and not changed_format:
    298         # force copy
    299         spmatrix = spmatrix.copy()
    300 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py in astype(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, dtype=dtype('float64'), casting='unsafe', copy=True)
     66 
     67     def astype(self, dtype, casting='unsafe', copy=True):
     68         dtype = np.dtype(dtype)
     69         if self.dtype != dtype:
     70             return self._with_data(
---> 71                 self._deduped_data().astype(dtype, casting=casting, copy=copy),
        self._deduped_data.astype = undefined
        dtype = dtype('float64')
        casting = 'unsafe'
        copy = True
     72                 copy=copy)
     73         elif copy:
     74             return self.copy()
     75         else:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py in _deduped_data(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
     29         self.data.dtype = newtype
     30     dtype = property(fget=_get_dtype, fset=_set_dtype)
     31 
     32     def _deduped_data(self):
     33         if hasattr(self, 'sum_duplicates'):
---> 34             self.sum_duplicates()
        self.sum_duplicates = <bound method _cs_matrix.sum_duplicates of <5409...stored elements in Compressed Sparse Row format>>
     35         return self.data
     36 
     37     def __abs__(self):
     38         return self._with_data(abs(self._deduped_data()))

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py in sum_duplicates(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
   1004 
   1005         The is an *in place* operation
   1006         """
   1007         if self.has_canonical_format:
   1008             return
-> 1009         self.sort_indices()
        self.sort_indices = <bound method _cs_matrix.sort_indices of <5409x1...stored elements in Compressed Sparse Row format>>
   1010 
   1011         M, N = self._swap(self.shape)
   1012         _sparsetools.csr_sum_duplicates(M, N, self.indptr, self.indices,
   1013                                         self.data)

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py in sort_indices(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
   1050         """Sort the indices of this matrix *in place*
   1051         """
   1052 
   1053         if not self.has_sorted_indices:
   1054             _sparsetools.csr_sort_indices(len(self.indptr) - 1, self.indptr,
-> 1055                                           self.indices, self.data)
        self.indices = array([110400, 110390, 110345, ...,  18292,  13241,  13236], dtype=int32)
        self.data = memmap([1, 1, 2, ..., 1, 1, 1])
   1056             self.has_sorted_indices = True
   1057 
   1058     def prune(self):
   1059         """Remove empty space after all non-zero elements.

ValueError: WRITEBACKIFCOPY base is read-only
___________________________________________________________________________
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
sklearn.externals.joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
ValueError                                         Wed Apr  4 09:09:04 2018
PID: 20131                    Python 3.5.2: /home/ubuntu/ML/venv/bin/python
...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function euclidean_distances>, (<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>), {'squared': True})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function euclidean_distances>
        args = (<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>)
        kwargs = {'squared': True}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, Y_norm_squared=None, squared=True, X_norm_squared=None)
    218 
    219     See also
    220     --------
    221     paired_distances : distances betweens pairs of elements of X and Y.
    222     """
--> 223     X, Y = check_pairwise_arrays(X, Y)
        X = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
        Y = <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>
    224 
    225     if X_norm_squared is not None:
    226         XX = check_array(X_norm_squared)
    227         if XX.shape == (1, X.shape[0]):

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, precomputed=False, dtype=<class 'float'>)
    105     if Y is X or Y is None:
    106         X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
    107                             warn_on_dtype=warn_on_dtype, estimator=estimator)
    108     else:
    109         X = check_array(X, accept_sparse='csr', dtype=dtype,
--> 110                         warn_on_dtype=warn_on_dtype, estimator=estimator)
        warn_on_dtype = False
        estimator = 'check_pairwise_arrays'
    111         Y = check_array(Y, accept_sparse='csr', dtype=dtype,
    112                         warn_on_dtype=warn_on_dtype, estimator=estimator)
    113 
    114     if precomputed:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, accept_sparse='csr', dtype=<class 'float'>, order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator='check_pairwise_arrays')
    426         estimator_name = "Estimator"
    427     context = " by %s" % estimator_name if estimator is not None else ""
    428 
    429     if sp.issparse(array):
    430         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 431                                       force_all_finite)
        force_all_finite = True
    432     else:
    433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, accept_sparse=['csr'], dtype=<class 'float'>, copy=False, force_all_finite=True)
    291                          "boolean or list of strings. You provided "
    292                          "'accept_sparse={}'.".format(accept_sparse))
    293 
    294     if dtype != spmatrix.dtype:
    295         # convert dtype
--> 296         spmatrix = spmatrix.astype(dtype)
        spmatrix = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
        spmatrix.astype = <bound method _data_matrix.astype of <5409x11723...stored elements in Compressed Sparse Row format>>
        dtype = <class 'float'>
    297     elif copy and not changed_format:
    298         # force copy
    299         spmatrix = spmatrix.copy()
    300 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py in astype(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, dtype=dtype('float64'), casting='unsafe', copy=True)
     66 
     67     def astype(self, dtype, casting='unsafe', copy=True):
     68         dtype = np.dtype(dtype)
     69         if self.dtype != dtype:
     70             return self._with_data(
---> 71                 self._deduped_data().astype(dtype, casting=casting, copy=copy),
        self._deduped_data.astype = undefined
        dtype = dtype('float64')
        casting = 'unsafe'
        copy = True
     72                 copy=copy)
     73         elif copy:
     74             return self.copy()
     75         else:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py in _deduped_data(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
     29         self.data.dtype = newtype
     30     dtype = property(fget=_get_dtype, fset=_set_dtype)
     31 
     32     def _deduped_data(self):
     33         if hasattr(self, 'sum_duplicates'):
---> 34             self.sum_duplicates()
        self.sum_duplicates = <bound method _cs_matrix.sum_duplicates of <5409...stored elements in Compressed Sparse Row format>>
     35         return self.data
     36 
     37     def __abs__(self):
     38         return self._with_data(abs(self._deduped_data()))

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py in sum_duplicates(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
   1004 
   1005         The is an *in place* operation
   1006         """
   1007         if self.has_canonical_format:
   1008             return
-> 1009         self.sort_indices()
        self.sort_indices = <bound method _cs_matrix.sort_indices of <5409x1...stored elements in Compressed Sparse Row format>>
   1010 
   1011         M, N = self._swap(self.shape)
   1012         _sparsetools.csr_sum_duplicates(M, N, self.indptr, self.indices,
   1013                                         self.data)

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py in sort_indices(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
   1050         """Sort the indices of this matrix *in place*
   1051         """
   1052 
   1053         if not self.has_sorted_indices:
   1054             _sparsetools.csr_sort_indices(len(self.indptr) - 1, self.indptr,
-> 1055                                           self.indices, self.data)
        self.indices = array([110400, 110390, 110345, ...,  18292,  13241,  13236], dtype=int32)
        self.data = memmap([1, 1, 2, ..., 1, 1, 1])
   1056             self.has_sorted_indices = True
   1057 
   1058     def prune(self):
   1059         """Remove empty space after all non-zero elements.

ValueError: WRITEBACKIFCOPY base is read-only
___________________________________________________________________________

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "sampling.py", line 42, in <module>
    sample_smote(X,y,class_names)
  File "sampling.py", line 16, in sample_smote
    X_res, y_res = sm.fit_sample(X, y)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/base.py", line 88, in fit_sample
    return self.fit(X, y).sample(X, y)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/base.py", line 64, in sample
    return self._sample(X, y)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/over_sampling/smote.py", line 598, in _sample
    return self._sample_svm(X, y)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/over_sampling/smote.py", line 513, in _sample_svm
    kind='noise')
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/over_sampling/smote.py", line 202, in _in_danger_noise
    x = self.nn_m_.kneighbors(samples, return_distance=False)[:, 1:]
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/neighbors/base.py", line 357, in kneighbors
    n_jobs=n_jobs, squared=True)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 1247, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 1096, in _parallel_pairwise
    for s in gen_even_slices(Y.shape[0], n_jobs))
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 740, in retrieve
    raise exception
sklearn.externals.joblib.my_exceptions.JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/home/ubuntu/ML/sampling.py in <module>()
     37 	X = vectorizer.fit_transform(msgs)
     38 	y = train.messagetype_num.values
     39 	class_names = train.messagetype.values
     40 	print(datetime.datetime.now().time())
     41 	sys.stdout.flush()
---> 42 	sample_smote(X,y,class_names)

...........................................................................
/home/ubuntu/ML/sampling.py in sample_smote(X=<236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, y=array([8, 7, 5, ..., 8, 1, 1]), class_names=array(['Debit', 'Credit', 'Warning', ..., 'Debit...ayment_due',
       'Payment_due'], dtype=object))
     11 
     12 def sample_smote(X,y,class_names):
     13 	print('Original dataset shape {}'.format(Counter(class_names)))
     14 	sys.stdout.flush()
     15 	sm = SMOTE(random_state=12,kind="svm",svm_estimator=svm.SVC(C=0.1,kernel="linear"),n_jobs=6)
---> 16 	X_res, y_res = sm.fit_sample(X, y)
        X_res = undefined
        y_res = undefined
        sm.fit_sample = <bound method SamplerMixin.fit_sample of SMOTE(k...ne, shrinking=True,
  tol=0.001, verbose=False))>
        X = <236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>
        y = array([8, 7, 5, ..., 8, 1, 1])
     17 	print(datetime.datetime.now().time())
     18 	sys.stdout.flush()
     19 	print('Resampled dataset shape {}'.format(Counter(y_res)))
     20 	save_classifier = open("messagetype_X_res.pickle","wb")

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/base.py in fit_sample(self=SMOTE(k=None, k_neighbors=5, kind='svm', m=None,...one, shrinking=True,
  tol=0.001, verbose=False)), X=<236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, y=array([8, 7, 5, ..., 8, 1, 1]))
     83         y_resampled : array-like, shape (n_samples_new,)
     84             The corresponding label of `X_resampled`
     85 
     86         """
     87 
---> 88         return self.fit(X, y).sample(X, y)
        self.fit = <bound method BaseSampler.fit of SMOTE(k=None, k...ne, shrinking=True,
  tol=0.001, verbose=False))>
        X = <236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>
        y.sample = undefined
        y = array([8, 7, 5, ..., 8, 1, 1])
     89 
     90     @abstractmethod
     91     def _sample(self, X, y):
     92         """Resample the dataset.

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/base.py in sample(self=SMOTE(k=None, k_neighbors=5, kind='svm', m=None,...one, shrinking=True,
  tol=0.001, verbose=False)), X=<236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, y=array([8, 7, 5, ..., 8, 1, 1]))
     59         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc'])
     60 
     61         check_is_fitted(self, 'ratio_')
     62         self._check_X_y(X, y)
     63 
---> 64         return self._sample(X, y)
        self._sample = <bound method SMOTE._sample of SMOTE(k=None, k_n...ne, shrinking=True,
  tol=0.001, verbose=False))>
        X = <236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>
        y = array([8, 7, 5, ..., 8, 1, 1])
     65 
     66     def fit_sample(self, X, y):
     67         """Fit the statistics and resample the data directly.
     68 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/over_sampling/smote.py in _sample(self=SMOTE(k=None, k_neighbors=5, kind='svm', m=None,...one, shrinking=True,
  tol=0.001, verbose=False)), X=<236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, y=array([8, 7, 5, ..., 8, 1, 1]))
    593         if self.kind == 'regular':
    594             return self._sample_regular(X, y)
    595         elif self.kind == 'borderline1' or self.kind == 'borderline2':
    596             return self._sample_borderline(X, y)
    597         elif self.kind == 'svm':
--> 598             return self._sample_svm(X, y)
        self._sample_svm = <bound method SMOTE._sample_svm of SMOTE(k=None,...ne, shrinking=True,
  tol=0.001, verbose=False))>
        X = <236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>
        y = array([8, 7, 5, ..., 8, 1, 1])

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/over_sampling/smote.py in _sample_svm(self=SMOTE(k=None, k_neighbors=5, kind='svm', m=None,...one, shrinking=True,
  tol=0.001, verbose=False)), X=<236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, y=array([8, 7, 5, ..., 8, 1, 1]))
    508                 y[self.svm_estimator_.support_] == class_sample]
    509             support_vector = safe_indexing(X, support_index)
    510 
    511             self.nn_m_.fit(X)
    512             noise_bool = self._in_danger_noise(support_vector, class_sample, y,
--> 513                                                kind='noise')
    514             support_vector = safe_indexing(
    515                 support_vector,
    516                 np.flatnonzero(np.logical_not(noise_bool)))
    517             danger_bool = self._in_danger_noise(support_vector, class_sample,

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/imblearn/over_sampling/smote.py in _in_danger_noise(self=SMOTE(k=None, k_neighbors=5, kind='svm', m=None,...one, shrinking=True,
  tol=0.001, verbose=False)), samples=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, target_class=0, y=array([8, 7, 5, ..., 8, 1, 1]), kind='noise')
    197         -------
    198         output : ndarray, shape (n_samples,)
    199             A boolean array where True refer to samples in danger or noise.
    200 
    201         """
--> 202         x = self.nn_m_.kneighbors(samples, return_distance=False)[:, 1:]
        x = undefined
        self.nn_m_.kneighbors = <bound method KNeighborsMixin.kneighbors of Near...None, n_jobs=6, n_neighbors=11, p=2, radius=1.0)>
        samples = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
    203         nn_label = (y[x] != target_class).astype(int)
    204         n_maj = np.sum(nn_label, axis=1)
    205 
    206         if kind == 'danger':

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/neighbors/base.py in kneighbors(self=NearestNeighbors(algorithm='auto', leaf_size=30,...=None, n_jobs=6, n_neighbors=11, p=2, radius=1.0), X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, n_neighbors=11, return_distance=False)
    352         n_jobs = _get_n_jobs(self.n_jobs)
    353         if self._fit_method == 'brute':
    354             # for efficiency, use squared euclidean distances
    355             if self.effective_metric_ == 'euclidean':
    356                 dist = pairwise_distances(X, self._fit_X, 'euclidean',
--> 357                                           n_jobs=n_jobs, squared=True)
        n_jobs = 6
    358             else:
    359                 dist = pairwise_distances(
    360                     X, self._fit_X, self.effective_metric_, n_jobs=n_jobs,
    361                     **self.effective_metric_params_)

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, metric='euclidean', n_jobs=6, **kwds={'squared': True})
   1242         if n_jobs == 1 and X is Y:
   1243             return distance.squareform(distance.pdist(X, metric=metric,
   1244                                                       **kwds))
   1245         func = partial(distance.cdist, metric=metric, **kwds)
   1246 
-> 1247     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
        X = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
        Y = <236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>
        func = <function euclidean_distances>
        n_jobs = 6
        kwds = {'squared': True}
   1248 
   1249 
   1250 # These distances recquire boolean arrays, when using scipy.spatial.distance
   1251 PAIRWISE_BOOLEAN_FUNCTIONS = [

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<236345x117239 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, func=<function euclidean_distances>, n_jobs=6, **kwds={'squared': True})
   1091 
   1092     # TODO: in some cases, backend='threading' may be appropriate
   1093     fd = delayed(func)
   1094     ret = Parallel(n_jobs=n_jobs, verbose=0)(
   1095         fd(X, Y[s], **kwds)
-> 1096         for s in gen_even_slices(Y.shape[0], n_jobs))
        Y.shape = (236345, 117239)
        n_jobs = 6
   1097 
   1098     return np.hstack(ret)
   1099 
   1100 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=6), iterable=<generator object _parallel_pairwise.<locals>.<genexpr>>)
    784             if pre_dispatch == "all" or n_jobs == 1:
    785                 # The iterable was consumed all at once by the above for loop.
    786                 # No need to wait for async callbacks to trigger to
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=6)>
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time
    792             self._print('Done %3i out of %3i | elapsed: %s finished',
    793                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Wed Apr  4 09:09:04 2018
PID: 20131                    Python 3.5.2: /home/ubuntu/ML/venv/bin/python
...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function euclidean_distances>, (<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>), {'squared': True})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function euclidean_distances>
        args = (<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>)
        kwargs = {'squared': True}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, Y_norm_squared=None, squared=True, X_norm_squared=None)
    218 
    219     See also
    220     --------
    221     paired_distances : distances betweens pairs of elements of X and Y.
    222     """
--> 223     X, Y = check_pairwise_arrays(X, Y)
        X = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
        Y = <39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>
    224 
    225     if X_norm_squared is not None:
    226         XX = check_array(X_norm_squared)
    227         if XX.shape == (1, X.shape[0]):

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, Y=<39391x117239 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, precomputed=False, dtype=<class 'float'>)
    105     if Y is X or Y is None:
    106         X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
    107                             warn_on_dtype=warn_on_dtype, estimator=estimator)
    108     else:
    109         X = check_array(X, accept_sparse='csr', dtype=dtype,
--> 110                         warn_on_dtype=warn_on_dtype, estimator=estimator)
        warn_on_dtype = False
        estimator = 'check_pairwise_arrays'
    111         Y = check_array(Y, accept_sparse='csr', dtype=dtype,
    112                         warn_on_dtype=warn_on_dtype, estimator=estimator)
    113 
    114     if precomputed:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, accept_sparse='csr', dtype=<class 'float'>, order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator='check_pairwise_arrays')
    426         estimator_name = "Estimator"
    427     context = " by %s" % estimator_name if estimator is not None else ""
    428 
    429     if sp.issparse(array):
    430         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 431                                       force_all_finite)
        force_all_finite = True
    432     else:
    433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, accept_sparse=['csr'], dtype=<class 'float'>, copy=False, force_all_finite=True)
    291                          "boolean or list of strings. You provided "
    292                          "'accept_sparse={}'.".format(accept_sparse))
    293 
    294     if dtype != spmatrix.dtype:
    295         # convert dtype
--> 296         spmatrix = spmatrix.astype(dtype)
        spmatrix = <5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>
        spmatrix.astype = <bound method _data_matrix.astype of <5409x11723...stored elements in Compressed Sparse Row format>>
        dtype = <class 'float'>
    297     elif copy and not changed_format:
    298         # force copy
    299         spmatrix = spmatrix.copy()
    300 

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py in astype(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>, dtype=dtype('float64'), casting='unsafe', copy=True)
     66 
     67     def astype(self, dtype, casting='unsafe', copy=True):
     68         dtype = np.dtype(dtype)
     69         if self.dtype != dtype:
     70             return self._with_data(
---> 71                 self._deduped_data().astype(dtype, casting=casting, copy=copy),
        self._deduped_data.astype = undefined
        dtype = dtype('float64')
        casting = 'unsafe'
        copy = True
     72                 copy=copy)
     73         elif copy:
     74             return self.copy()
     75         else:

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/data.py in _deduped_data(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
     29         self.data.dtype = newtype
     30     dtype = property(fget=_get_dtype, fset=_set_dtype)
     31 
     32     def _deduped_data(self):
     33         if hasattr(self, 'sum_duplicates'):
---> 34             self.sum_duplicates()
        self.sum_duplicates = <bound method _cs_matrix.sum_duplicates of <5409...stored elements in Compressed Sparse Row format>>
     35         return self.data
     36 
     37     def __abs__(self):
     38         return self._with_data(abs(self._deduped_data()))

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py in sum_duplicates(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
   1004 
   1005         The is an *in place* operation
   1006         """
   1007         if self.has_canonical_format:
   1008             return
-> 1009         self.sort_indices()
        self.sort_indices = <bound method _cs_matrix.sort_indices of <5409x1...stored elements in Compressed Sparse Row format>>
   1010 
   1011         M, N = self._swap(self.shape)
   1012         _sparsetools.csr_sum_duplicates(M, N, self.indptr, self.indices,
   1013                                         self.data)

...........................................................................
/home/ubuntu/ML/venv/lib/python3.5/site-packages/scipy/sparse/compressed.py in sort_indices(self=<5409x117239 sparse matrix of type '<class 'nump... stored elements in Compressed Sparse Row format>)
   1050         """Sort the indices of this matrix *in place*
   1051         """
   1052 
   1053         if not self.has_sorted_indices:
   1054             _sparsetools.csr_sort_indices(len(self.indptr) - 1, self.indptr,
-> 1055                                           self.indices, self.data)
        self.indices = array([110400, 110390, 110345, ...,  18292,  13241,  13236], dtype=int32)
        self.data = memmap([1, 1, 2, ..., 1, 1, 1])
   1056             self.has_sorted_indices = True
   1057 
   1058     def prune(self):
   1059         """Remove empty space after all non-zero elements.

ValueError: WRITEBACKIFCOPY base is read-only
___________________________________________________________________________

Steps/Code to Reproduce

sm = SMOTE(random_state=12,kind="svm",svm_estimator=svm.SVC(C=0.1,kernel="linear"),n_jobs = 6)
X_res, y_res = sm.fit_sample(X, y)

<----- Version----->

Versions

Linux-4.4.0-1052-aws-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.2
SciPy 1.0.1
Scikit-Learn 0.19.1
@glemaitre
Member

I'm not sure you are in the right issue tracker:
https://github.com/scikit-learn-contrib/imbalanced-learn/issues

But since it is here, I might need the skills of @lesteve or @ogrisel. We are using n_jobs with some NN search. I have the impression that the error is related to joblib's auto-memmapping, which makes the memmap read-only.

But I am not really sure what is going on.
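
To illustrate the suspicion above, here is a minimal sketch (NumPy only, not the joblib code path itself) of how an array opened as a read-only memmap rejects in-place writes. The temporary file and the variable names are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

# Write a small array to disk so that it can be memory-mapped.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(4, dtype=np.int64).tofile(path)

# mode="r" opens the file read-only, similar in spirit to what a worker
# process receives when its input has been memmapped read-only.
readonly = np.memmap(path, dtype=np.int64, mode="r")
print("writeable:", readonly.flags.writeable)

# Any in-place modification is refused with a ValueError.
try:
    readonly[0] = 99
    write_error = None
except ValueError as exc:
    write_error = exc
print("error type:", type(write_error).__name__)
```

Any routine that sorts or rewrites such a buffer in place would hit the same kind of refusal, which would be consistent with the ValueError at the bottom of the traceback.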

@jnothman
Member
jnothman commented Apr 4, 2018 via email

@datNurd
Author
datNurd commented Apr 4, 2018

The current imblearn requirement is scipy >= 0.19.0.

@lesteve
Member
lesteve commented Apr 4, 2018

As a work-around I think using n_jobs=1 in pairwise_distances may be as fast (see #8216 for more details).

The error seems similar to #6614. At the time, as I said in #6614 (comment), the only work-around I could find was to make sure the CSR matrix had its indices sorted before reaching the problematic step of the Pipeline.
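
A minimal sketch of that work-around, assuming the matrix is still writable at the point where you control it (that is, before it is handed to any parallel step): sort the CSR indices in place up front, so that later code has nothing left to sort. The toy matrix below is illustrative only.

```python
from scipy import sparse

# A small CSR matrix whose column indices are deliberately unsorted per row.
data = [2, 1, 4, 3]
indices = [1, 0, 1, 0]
indptr = [0, 2, 4]
X = sparse.csr_matrix((data, indices, indptr))

was_sorted = X.has_sorted_indices
print("sorted before:", was_sorted)

# Sort in place while the underlying arrays are still writable.
X.sort_indices()
print("sorted after:", X.has_sorted_indices)
```

Once has_sorted_indices is True, sort_indices becomes a no-op, so a later call on a read-only copy should no longer attempt an in-place write for sorting.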

The root cause of the problem lies in scipy: it looks like csr_matrix.astype does not play nicely with read-only memmaps. Here is a snippet that demonstrates the problem:

import numpy as np

from scipy import sparse

from sklearn.externals import joblib

filename = '/tmp/test.pkl'

data = [2, 1, 4, 3]
indices = [1, 0, 1, 0]
indptr = [0, 2, 4]
matrix = sparse.csr_matrix((data, indices, indptr))
print('matrix.todense():\n', repr(matrix.todense()))
# To trigger the error you need to make sure that the indices are not sorted
print('matrix.has_sorted_indices:', matrix.has_sorted_indices)

joblib.dump(matrix, filename)
mmap_backed_matrix = joblib.load(filename, mmap_mode='r')
mmap_backed_matrix.astype(np.float64)
Stack-trace
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/test.py in <module>()
     12 mmap_backed_matrix = joblib.load(filename, mmap_mode='r')
     13 mmap_backed_matrix.has_sorted_indices = False
---> 14 mmap_backed_matrix.astype(np.float64)

/home/local/lesteve/miniconda3/lib/python3.6/site-packages/scipy/sparse/data.py in astype(self, dtype, casting, copy)
     69         if self.dtype != dtype:
     70             return self._with_data(
---> 71                 self._deduped_data().astype(dtype, casting=casting, copy=copy),
     72                 copy=copy)
     73         elif copy:

/home/local/lesteve/miniconda3/lib/python3.6/site-packages/scipy/sparse/data.py in _deduped_data(self)
     32     def _deduped_data(self):
     33         if hasattr(self, 'sum_duplicates'):
---> 34             self.sum_duplicates()
     35         return self.data
     36 

/home/local/lesteve/miniconda3/lib/python3.6/site-packages/scipy/sparse/compressed.py in sum_duplicates(self)
   1007         if self.has_canonical_format:
   1008             return
-> 1009         self.sort_indices()
   1010 
   1011         M, N = self._swap(self.shape)

/home/local/lesteve/miniconda3/lib/python3.6/site-packages/scipy/sparse/compressed.py in sort_indices(self)
   1053         if not self.has_sorted_indices:
   1054             _sparsetools.csr_sort_indices(len(self.indptr) - 1, self.indptr,
-> 1055                                           self.indices, self.data)
   1056             self.has_sorted_indices = True
   1057 

ValueError: WRITEBACKIFCOPY base is read-only

@lesteve
Member
lesteve commented Apr 4, 2018

For the record I opened scipy/scipy#8678.

@glemaitre
Member

Closing, since scikit-learn cannot do much about it.
Thanks @lesteve and @jnothman for the debugging.

@lesteve
Member
lesteve commented Apr 23, 2018

Closing, since scikit-learn cannot do much about it.

For the record, I think we could in principle work around the problem (for example, a try/except around the .astype call in _ensure_sparse_format), but I haven't looked at it in more detail, and there may well be plenty of .astype calls on sparse matrices outside of check_array.
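
A rough sketch of what such a work-around could look like. This is not scikit-learn code: the helper name astype_with_copy_fallback is hypothetical, and the read-only flags are set by hand only to imitate a read-only input.

```python
import numpy as np
from scipy import sparse


def astype_with_copy_fallback(spmatrix, dtype):
    """Hypothetical helper: retry the conversion on a writable copy."""
    try:
        return spmatrix.astype(dtype)
    except ValueError:
        # copy() allocates fresh, writable data/indices/indptr arrays,
        # so in-place sorting inside astype can succeed on the copy.
        return spmatrix.copy().astype(dtype)


# CSR matrix with unsorted indices, as in the snippet earlier in the thread.
X = sparse.csr_matrix(([2, 1, 4, 3], [1, 0, 1, 0], [0, 2, 4]))

# Imitate a read-only input by marking the underlying arrays non-writable.
for arr in (X.data, X.indices, X.indptr):
    arr.flags.writeable = False

X_float = astype_with_copy_fallback(X, np.float64)
print(X_float.dtype)
print(X_float.toarray())
```

The fallback costs an extra copy, and it would only cover the call sites that go through the helper, which is the limitation noted above.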
