[MRG] ENH: Added block_size parameter for lesser memory consumption #7979
Conversation
I'd think the main thing needing work in the docstring is the return value: we can't return a monolithic matrix. |
Yes, that needs to be changed. But since it depends on the block_size, I am unable to assign a specific value to it. Any suggestion you may have? The workaround I could think of was to assign n_samples_x and n_samples_y, which depend on the block_size parameter. |
I would have thought you'd return a generator of chunks which, when concatenated, would produce the full matrix.
|
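The generator-of-chunks idea can be sketched as follows. This is a pure-NumPy stand-in: `chunked_distances` and `chunk_rows` are illustrative names, not the PR's API, and the Euclidean computation stands in for `pairwise_distances`.

```python
import numpy as np

def chunked_distances(X, Y=None, chunk_rows=100):
    # Yield horizontal slabs of the full Euclidean distance matrix, one per
    # chunk of rows of X; vstacking all slabs reproduces the full matrix.
    Y = X if Y is None else Y
    for start in range(0, X.shape[0], chunk_rows):
        chunk = X[start:start + chunk_rows]
        # (n_chunk, n_Y) Euclidean distances via broadcasting
        yield np.sqrt(((chunk[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

X = np.random.RandomState(0).random_sample((250, 4))
full = np.vstack(list(chunked_distances(X, chunk_rows=100)))
assert full.shape == (250, 250)
```

Only one `chunk_rows`-row slab of distances exists at a time unless the caller chooses to stack them.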
@jnothman I currently don't have any experience with multithreading (in practice) but am able to understand most of your changes in #7177. So, do you suggest I follow up from your changes to get started and then try to fix anything that breaks? |
Multithreading isn't really the issue here. If you're uncertain, write a test case and ask for a review of that.
|
@jnothman I added the code for the generator and checked it by running an example. It seemed to work fine. I could not add tests or check with nosetests because of the following error:
======================================================================
ERROR: Failure: ValueError (numpy.dtype has the wrong size, try recompiling. Expected 88, got 96)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/loader.py", line 414, in loadTestsFromName
addr.filename, addr.module)
File "/usr/lib/python2.7/dist-packages/nose/importer.py", line 47, in importFromPath
return self.importFromDir(dir_path, fqname)
File "/usr/lib/python2.7/dist-packages/nose/importer.py", line 94, in importFromDir
mod = load_module(part_fqname, fh, filename, desc)
File "/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/__init__.py", line 57, in <module>
from .base import clone
File "/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/base.py", line 12, in <module>
from .utils.fixes import signature
File "/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/__init__.py", line 10, in <module>
from .murmurhash import murmurhash3_32
File "__init__.pxd", line 155, in init sklearn.utils.murmurhash (sklearn/utils/murmurhash.c:6319)
ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96
----------------------------------------------------------------------
Ran 1 test in 0.001s
FAILED (errors=1)
Please help me with this error. |
Build scikit-learn first: run `make in`
|
I still get the same error. I get this while compiling:
Any suggestions? @jnothman |
It's fixed now. |
flake8 errors |
Yes, I had noticed those errors before committing. But I didn't change it, as I had just copied the existing docstring:
Y : array [n_samples_b, n_features], optional
    An optional second feature array. Only allowed if metric != "precomputed". |
flake8 is run in diff mode so that we only encourage you to fix errors you have introduced to the code (although it doesn't catch everything). You are still required to fix those issues. |
I have added the tests for |
sklearn/metrics/pairwise.py
Outdated
pairwise_distances(X, y, metric, n_jobs)
but uses much less memory.
"but may use less memory"
sklearn/metrics/pairwise.py
Outdated
def pairwise_distances_blockwise(X, Y=None, metric='euclidean', n_jobs=1,
                                 block_size=DEFAULT_BLOCK_SIZE, **kwds):
    """ Compute the distance matrix from a vector array X and optional Y.
remove space between """ and text (PEP 257)
sklearn/metrics/pairwise.py
Outdated
Returns
-------
D : generator of blocks based on the ``block_size`` parameter. The blocks,
first line is reserved for type description
sklearn/metrics/pairwise.py
Outdated
Parameters
----------
X : array [n_samples_a, n_samples_a] if metric == "precomputed", or, \
Remove , after "or"
sklearn/metrics/pairwise.py
Outdated
This method takes either a vector array or a distance matrix, and returns
a distance matrix. If the input is a vector array, the distances are
computed. If the input is a distances matrix, it is returned instead.
"returned in blocks"
sklearn/metrics/pairwise.py
Outdated
block_size=DEFAULT_BLOCK_SIZE, **kwds):
    """ Compute the distance matrix from a vector array X and optional Y.

This method takes either a vector array or a distance matrix, and returns
"returns" -> "generates blocks of"
def test_pairwise_distances_blockwise_invalid_block_size():
    rng = np.random.RandomState(0)
    X = rng.random_sample((400, 4))
this might as well be empty or zeros
rng = np.random.RandomState(0)
X = rng.random_sample((400, 4))
y = rng.random_sample((200, 4))
gen = pairwise_distances_blockwise(X, y, block_size=0, metric='euclidean')
I'd rather see a test where block_size=1 or greater
I was testing this for invalid block size hence used 0.
but 1 is invalid too if X has many samples
X = rng.random_sample((400, 4))
y = rng.random_sample((200, 4))
gen = pairwise_distances_blockwise(X, y, block_size=0, metric='euclidean')
assert_raise_message(ValueError, 'block_size should be at least n_samples '
I'd rather this error be raised upon calling pairwise_distances_blockwise. That means you need to not use a generator function immediately, but use a secondary function call.
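The "secondary function call" pattern the reviewer describes can be sketched generically: validation in the outer function runs eagerly at call time, while the generator body is deferred until iteration. Names here (`blockwise`, `_generate`) are illustrative, not the PR's.

```python
def blockwise(X, block_size):
    # Eager: runs as soon as blockwise(...) is called, so an invalid
    # block_size raises here rather than on the first next().
    if block_size < 1:
        raise ValueError('block_size should be at least 1')
    return _generate(X, block_size)

def _generate(X, block_size):
    # Deferred: this body only runs once the generator is iterated.
    for start in range(0, len(X), block_size):
        yield X[start:start + block_size]

try:
    blockwise([1, 2, 3], block_size=0)  # raises immediately, no iteration needed
    raised_eagerly = False
except ValueError:
    raised_eagerly = True
```

Had `blockwise` itself contained the `yield`, the `ValueError` would only surface once the caller started consuming the generator.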
sklearn/metrics/pairwise.py
Outdated
for start in range(0, n_samples, block_n_rows):
    # get distances from block to every other sample
    stop = min(start + block_n_rows, X.shape[0])
    yield pairwise_distances(X[start:stop], Y, metric, n_jobs, **kwds)
This will use at most block_size memory overall, not per job. I'm not sure this is the best way to do parallelism here, though. I.e. ideally we would use a parallel generator, but I don't think joblib.Parallel supports that atm.
X = rng.random_sample((400, 4))
gen = pairwise_distances_blockwise(X, block_size=1, metric="euclidean")
S = np.empty((0, X.shape[0]))
for row in gen:
just use np.vstack(list(gen))
gen = pairwise_distances_blockwise(X, block_size=1, metric="euclidean")
S = np.empty((0, X.shape[0]))
for row in gen:
    S = np.vstack((S, row))
Firstly, please check that block_size is upheld for each block. Secondly, instead of repeated vstack, add to a list, then vstack the list.
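The vstack suggestion in a small self-contained sketch: repeated vstack copies the accumulated array on every iteration (quadratic copying), while collecting blocks in a list and stacking once copies each element only once.

```python
import numpy as np

# stand-in for the blockwise generator: three 2x3 blocks
blocks = (np.full((2, 3), i, dtype=float) for i in range(3))

# Anti-pattern (O(n^2) copying): growing S by vstack inside the loop
#     S = np.empty((0, 3))
#     for block in blocks:
#         S = np.vstack((S, block))

# Preferred: one concatenation at the end
S = np.vstack(list(blocks))
assert S.shape == (6, 3)
```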
@@ -370,6 +372,49 @@ def test_pairwise_distances_argmin_min():
    np.testing.assert_almost_equal(dist_orig_val, dist_chunked_val, decimal=7)


def check_invalid_block_size_generator(generator):
isn't this the same as next?
rng = np.random.RandomState(0)
# Euclidean distance should be equivalent to calling the function.
X = rng.random_sample((400, 4))
gen = pairwise_distances_blockwise(X, block_size=1, metric="euclidean")
we should be testing at least one other metric. I think you would be best with a helper function like check_pairwise_distances_blockwise, then call it for different sets of parameters including metric, Y (None or otherwise), and block_size.
@jnothman Thanks for the review. I've mostly implemented the changes you mentioned. However, the part about using a parallel generator was a bit unclear to me. Since you say it's not supported atm, should I be adding anything else there? |
In terms of parallelism: the current approach to parallelism means that … The parallelism used in #7177 was performed across blocks, but on the basis that each block was reduced to a negligible memory overhead before returning. For the sake of parallelism or otherwise, a more useful interface may be something like:
def pairwise_distances_reduce(X, Y, reduce_func, metric, n_jobs, block_size):
    ...
wherein … |
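A hedged sketch of what the proposed pairwise_distances_reduce interface could look like. The name and argument order come from the comment above; the body, the pure-NumPy distance helper, the row-count `block_n_rows` (in place of the PR's memory-based block_size, with metric and n_jobs dropped), and the example reduce_func are all illustrative, not code from this PR.

```python
import numpy as np

def euclidean(a, b):
    # dense Euclidean distances between rows of a and rows of b
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

def pairwise_distances_reduce(X, Y, reduce_func, block_n_rows=100):
    # Apply reduce_func to each block of distances so only the (small)
    # reduced results are retained, never the full n_X * n_Y matrix.
    out = []
    for start in range(0, X.shape[0], block_n_rows):
        D_block = euclidean(X[start:start + block_n_rows], Y)
        out.append(reduce_func(D_block))
    return np.concatenate(out)

rng = np.random.RandomState(0)
X, Y = rng.random_sample((250, 4)), rng.random_sample((30, 4))
# e.g. index of each row's nearest neighbour, without the full 250 x 30 matrix
argmins = pairwise_distances_reduce(X, Y, lambda D: D.argmin(axis=1))
assert argmins.shape == (250,)
```

The key property is that peak memory is bounded by one block's distances plus the reduced outputs, which is what makes the interface useful for parallelising across blocks.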
@jnothman Thanks for the detailed explanation. I mostly understood what you are explaining. However, looking at #7177 I couldn't see where in the code each block has been reduced to a negligible memory overhead. Could you please point me to it so that I can understand better and implement the same? |
It gets translated into n_samples * n_clusters (intra_clust_dists, inter_clust_dists), rather than n_samples * n_samples.
|
I understood it now. So, do you have any such reduce_func in mind for pairwise_distances, or should I try incorporating it for silhouette_samples directly? |
The pairwise_distances case has no reduce_func.
|
@jnothman Oh! So, should I get started on incorporating it for silhouette_samples? |
@jnothman I intended to ask whether I should get started on the rewrite of silhouette_samples via pairwise_distances_reduce. |
Sure, go ahead, but I think using this helper in nearest neighbors would be more practically useful.
|
@jnothman I'd been trying to think about what to do next here, but am a bit confused (as your PR for silhouette_samples is not yet merged). I assume you are saying that I replace pairwise_distances in nearest neighbors with pairwise_distances_blockwise, since we don't have a reduce_func for pairwise_distances. Could you please brief me on what I should be doing next in this PR? |
Nearest neighbors reduces pairwise_distances's output to the nearest neighbors. So it can be implemented with a reduce_func.
|
In nearest neighbors, approximate.py, base.py and nearest_centroid.py use pairwise_distances. Just so I understand what I need to do clearly, nearest_centroid.py uses it in the following way:
return self.classes_[pairwise_distances(
    X, self.centroids_, metric=self.metric).argmin(axis=1)]
Could you please give me a head start on how I could replace this with a generator of blocks with a reduce_func? |
Well, that one can directly make use of pairwise_distances_argmin, which uses blocks.
|
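Applied to the nearest_centroid.py snippet quoted above, the point is that the argmin over centroids can be taken one block of rows at a time, so the full (n_samples, n_centroids) distance matrix never needs to exist at once. This is a pure-NumPy sketch; in scikit-learn the blockwise argmin is what pairwise_distances_argmin provides, and `blockwise_argmin` plus the stand-in data below are my illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((200, 4))          # samples to classify
centroids = rng.random_sample((3, 4))    # stand-in for self.centroids_
classes = np.array([10, 20, 30])         # stand-in for self.classes_

def blockwise_argmin(X, Y, block_n_rows=50):
    # argmin over rows of Y for each row of X, one block of X rows at a time
    out = []
    for start in range(0, X.shape[0], block_n_rows):
        chunk = X[start:start + block_n_rows]
        D = np.sqrt(((chunk[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
        out.append(D.argmin(axis=1))
    return np.concatenate(out)

# equivalent to classes[pairwise_distances(X, centroids).argmin(axis=1)]
pred = classes[blockwise_argmin(X, centroids)]
assert pred.shape == (200,)
```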
Seems great, working on them. One doubt, though: by tests, do you mean apart from those in the gist linked above? (which I have added in |
Ha I'd forgotten I actually wrote a gist with unit tests :P |
Changing to WIP while you make those substantial additions to the PR.
Thanks for this, and sorry for the slow review. There continues to be interest in this kind of feature, with issues raised wrt silhouette_score and in the implementation of large margin nearest neighbors.
I don't think you need to integrate #7177 here; I think we can mark it MRG again.
if batch_size is not None:
    warnings.warn("'batch_size' was deprecated in version 0.19 and will "
                  "be removed in version 0.21.", DeprecationWarning)
You should set block_size according to batch_size, i.e. block_size = int(ceil(batch_size ** 2 / BYTES_PER_FLOAT / 1024 / 1024)) (I think).
@@ -457,10 +431,13 @@ def pairwise_distances_argmin(X, Y, axis=1, metric="euclidean",
    sklearn.metrics.pairwise_distances
    sklearn.metrics.pairwise_distances_argmin_min
    """
    if batch_size is not None:
Just forward the parameters as provided onto argmin_min and let it do the warning and conversions.
@@ -256,8 +257,19 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    return distances if squared else np.sqrt(distances, out=distances)


def _argmin_min_reduce_min(dist):
Should this just be _argmin_min_reduce?
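For context, the kind of reduce function being discussed for argmin_min would presumably look something like this. This is my sketch of the idea, not necessarily the PR's exact code.

```python
import numpy as np

def _argmin_min_reduce(dist):
    # Reduce a block of distances to (index of nearest, nearest distance)
    # per row: an (n_rows, n_Y) block shrinks to two length-n_rows arrays.
    indices = dist.argmin(axis=1)
    values = dist[np.arange(dist.shape[0]), indices]
    return indices, values

D = np.array([[3.0, 1.0, 2.0],
              [0.5, 4.0, 4.0]])
idx, val = _argmin_min_reduce(D)
assert idx.tolist() == [1, 0]
assert val.tolist() == [1.0, 0.5]
```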
Returns
-------
D : array-like or sparse matrix or tuple
    A distance matrix D such that D_{i, j} is the distance between the
This is not accurate. The only thing we can say about the output here is that its first axis (or the first axis of each entry in a tuple) corresponds to rows in X.
are computed. If the input is a distances matrix, it is returned in blocks
instead.

This is equivalent to calling:
"vstacking the chunks generated by this function is equivalent to calling:"
pairwise_distances(X, y, metric, n_jobs)

but may use less memory.
but avoids storing the entire distance matrix in memory
@@ -1131,6 +1109,214 @@ def _pairwise_callable(X, Y, metric, **kwds):
    'sokalsneath', 'sqeuclidean', 'yule', "wminkowski"]


def _generate_pairwise_distances_blockwise(X, Y=None, metric='euclidean',
                                           n_jobs=1,
                                           block_size=DEFAULT_BLOCK_SIZE,
block_size isn't used here
D : generator of blocks based on the ``block_size`` parameter.

"""
if metric != 'precomputed' and Y is None:
I'd do this in pairwise_distances_blockwise
n_samples = X.shape[0]
for start in range(0, n_samples, block_n_rows):
    # get distances from block to every other sample
    stop = min(start + block_n_rows, X.shape[0])
does just using start + block_n_rows break anything? Usually slice notation does not raise an error if stop is greater than the length.
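The reviewer's point in two lines: Python (and NumPy) slicing silently clamps an out-of-range stop instead of raising, so the min() guard is redundant for slicing purposes.

```python
import numpy as np

X = np.arange(10)
# a stop past the end is clamped; no IndexError is raised
assert X[8:12].tolist() == [8, 9]
assert X[8:min(8 + 4, len(X))].tolist() == [8, 9]  # identical result
# (plain indexing, by contrast, does raise: X[12] -> IndexError)
```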
@@ -1131,6 +1109,214 @@ def _pairwise_callable(X, Y, metric, **kwds):
    'sokalsneath', 'sqeuclidean', 'yule', "wminkowski"]


def _generate_pairwise_distances_blockwise(X, Y=None, metric='euclidean',
I think you should inline this function in pairwise_distances_blockwise.
@dalmia if this is considered for attention during an upcoming sprint, would you like to help finish it, or let someone else take it on? |
@dalmia, I will mark this for someone else to finish. |
@jnothman please do the same. I won't be able to continue work on this |
As noted at #8602 (comment), this continues to be an in-demand feature, but I've received next-to-no feedback from other core devs on whether I've led this PR in the right direction. I've also wondered whether we should make the batch size a global option (e.g. |
Closing as superseded by #10280 |
Reference Issue
Fixes #7287
What does this implement/fix? Explain your changes.
This intends to add a function
pairwise_distances_blockwise
with an additional block_size parameter to avoid storing all the O(n^2) pairs' distances. This would then be used byneighbors
,silhouette_score
andpairwise_distances_argmin_min
.Any other comments?
- pairwise_distances_reduce
- sklearn/neighbors
- pairwise_distances_reduce
- pairwise_distances_reduce