[MRG] ENH: Added block_size parameter for lesser memory consumption #7979

Status: Closed · wants to merge 39 commits

Conversation

@dalmia (Contributor) commented Dec 5, 2016

Reference Issue

Fixes #7287

What does this implement/fix? Explain your changes.

This adds a function pairwise_distances_blockwise with an additional block_size parameter, to avoid storing distances for all O(n^2) pairs at once. It would then be used by neighbors, silhouette_score and pairwise_distances_argmin_min.
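For illustration, a minimal sketch of the intended behaviour (the helper name and the block-size arithmetic below are placeholders for this description, not the PR's final code):

    from sklearn.metrics import pairwise_distances

    def pairwise_distances_blockwise_sketch(X, Y=None, metric='euclidean',
                                            block_size=64, **kwds):
        """Yield row-blocks of the distance matrix, ~block_size MiB each."""
        Y = X if Y is None else Y
        # One row of the distance matrix holds Y.shape[0] float64 values.
        bytes_per_row = Y.shape[0] * 8
        block_n_rows = max(1, int(block_size * 2 ** 20) // bytes_per_row)
        for start in range(0, X.shape[0], block_n_rows):
            yield pairwise_distances(X[start:start + block_n_rows], Y,
                                     metric=metric, **kwds)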

Any other comments?

  • Stacking within pairwise_distances_reduce
  • Revert changes to sklearn/neighbors
  • Docstring for pairwise_distances_reduce
  • Tests for pairwise_distances_reduce

@jnothman (Member) commented Dec 5, 2016

I'd think the main thing needing work in the docstring is the return value: we can't return a monolithic matrix.

@dalmia (Contributor Author) commented Dec 6, 2016

Yes, that needs to be changed. But since the output depends on block_size, I am unable to state a specific shape for it. Do you have any suggestions?
The workaround I could think of was to describe the block shapes as n_samples_x and n_samples_y, which depend on the block_size parameter.

@jnothman (Member) commented Dec 6, 2016 via email

@dalmia dalmia changed the title [WIP] ENH: Added template for pairwise_distances_blockwise with docstring c… [WIP] ENH: Added block_size parameter for lesser memory consumption Dec 6, 2016
@dalmia (Contributor Author) commented Dec 6, 2016

@jnothman I don't yet have hands-on experience with multithreading in practice, but I can follow most of your changes in #7177. Do you suggest I start from your changes and then try to fix whatever breaks?

@jnothman (Member) commented Dec 6, 2016 via email

@dalmia (Contributor Author) commented Dec 7, 2016

@jnothman I added the code for the generator and checked it by running an example; it seemed to work fine. I could not add tests or run the test suite because of the following error from nosetests:

======================================================================
ERROR: Failure: ValueError (numpy.dtype has the wrong size, try recompiling. Expected 88, got 96)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "/usr/lib/python2.7/dist-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/usr/lib/python2.7/dist-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/__init__.py", line 57, in <module>
    from .base import clone
  File "/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/base.py", line 12, in <module>
    from .utils.fixes import signature
  File "/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/__init__.py", line 10, in <module>
    from .murmurhash import murmurhash3_32
  File "__init__.pxd", line 155, in init sklearn.utils.murmurhash (sklearn/utils/murmurhash.c:6319)
ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

Please help me with this error.

@jnothman (Member) commented Dec 7, 2016 via email

@dalmia (Contributor Author) commented Dec 8, 2016

I still get the same error. I get this while compiling:

 #warning "Using deprecated NumPy API, disable it by " \

Any suggestions? @jnothman

@dalmia (Contributor Author) commented Dec 8, 2016

It's fixed now.

@jnothman (Member) commented Dec 8, 2016

flake8 errors

@dalmia (Contributor Author) commented Dec 8, 2016

Yes, I had noticed those errors before committing, but I didn't change the line because I had copied the docstring verbatim from pairwise_distances. Is there some way of overriding the flake8 errors? The same line in the pairwise_distances docstring doesn't give this error:

    Y : array [n_samples_b, n_features], optional
        An optional second feature array. Only allowed if metric != "precomputed".
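One possible fix is simply to wrap the offending line under 79 characters:

    Y : array [n_samples_b, n_features], optional
        An optional second feature array. Only allowed if
        metric != "precomputed".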

@jnothman (Member) commented Dec 8, 2016

flake8 is run in diff mode so that we only encourage you to fix errors you have introduced to the code (although it doesn't catch everything). You are still required to fix those issues.

@dalmia dalmia changed the title [WIP] ENH: Added block_size parameter for lesser memory consumption [MRG] ENH: Added block_size parameter for lesser memory consumption Dec 8, 2016
@dalmia (Contributor Author) commented Dec 8, 2016

I have added the tests for pairwise_distances_blockwise. I couldn't think of a better way of testing for invalid block size in a generator. Please review.
Thanks.

@dalmia (Contributor Author) commented Dec 10, 2016

ping @jnothman @amueller @lesteve


pairwise_distances(X, y, metric, n_jobs)

but uses much less memory.
Member: "but may use less memory"


def pairwise_distances_blockwise(X, Y=None, metric='euclidean', n_jobs=1,
                                 block_size=DEFAULT_BLOCK_SIZE, **kwds):
    """ Compute the distance matrix from a vector array X and optional Y.
Member: remove space between """ and text (PEP257)


Returns
-------
D : generator of blocks based on the ``block_size`` parameter. The blocks,
Member: first line is reserved for type description


Parameters
----------
X : array [n_samples_a, n_samples_a] if metric == "precomputed", or, \
Member: Remove , after "or"


This method takes either a vector array or a distance matrix, and returns
a distance matrix. If the input is a vector array, the distances are
computed. If the input is a distances matrix, it is returned instead.
Member: "returned in blocks"

                                 block_size=DEFAULT_BLOCK_SIZE, **kwds):
    """ Compute the distance matrix from a vector array X and optional Y.

    This method takes either a vector array or a distance matrix, and returns
Member: "returns" -> "generates blocks of"


def test_pairwise_distances_blockwise_invalid_block_size():
    rng = np.random.RandomState(0)
    X = rng.random_sample((400, 4))
Member: this might as well be empty or zeros

    rng = np.random.RandomState(0)
    X = rng.random_sample((400, 4))
    y = rng.random_sample((200, 4))
    gen = pairwise_distances_blockwise(X, y, block_size=0, metric='euclidean')
Member: I'd rather see a test where block_size=1 or greater

Contributor Author: I was testing this for invalid block size, hence used 0.

Member: but 1 is invalid too if X has many samples

    X = rng.random_sample((400, 4))
    y = rng.random_sample((200, 4))
    gen = pairwise_distances_blockwise(X, y, block_size=0, metric='euclidean')
    assert_raise_message(ValueError, 'block_size should be at least n_samples '
Member: I'd rather this error be raised upon calling pairwise_distances_blockwise. That means the public function should not itself be a generator function; it should validate eagerly and delegate iteration to a secondary function call.
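The reason a secondary call is needed (a minimal standalone demonstration, not the PR's code): a generator function's body does not execute until the first next(), so validation written inside it is silently deferred:

    def blockwise_gen(block_size):
        # Validation inside a generator function is deferred...
        if block_size < 1:
            raise ValueError('block_size should be at least ...')
        yield block_size

    g = blockwise_gen(0)   # no error yet: the body has not started running
    try:
        next(g)            # the ValueError only surfaces on first iteration
    except ValueError:
        pass

    def blockwise(block_size):
        # Plain wrapper: validates at call time, then delegates iteration
        # to the generator, so callers get the error immediately.
        if block_size < 1:
            raise ValueError('block_size should be at least ...')
        return blockwise_gen(block_size)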

    for start in range(0, n_samples, block_n_rows):
        # get distances from block to every other sample
        stop = min(start + block_n_rows, X.shape[0])
        yield pairwise_distances(X[start:stop], Y, metric, n_jobs, **kwds)
Member: This will use at most block_size memory overall, not per job. I'm not sure this is the best way to do parallelism here, though. Ideally we would use a parallel generator, but I don't think joblib.Parallel supports that atm.

    X = rng.random_sample((400, 4))
    gen = pairwise_distances_blockwise(X, block_size=1, metric="euclidean")
    S = np.empty((0, X.shape[0]))
    for row in gen:
Member: just use np.vstack(list(gen))
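That is, assuming the PR's pairwise_distances_blockwise is in scope, the whole check reduces to something like:

    import numpy as np
    from numpy.testing import assert_array_almost_equal
    from sklearn.metrics import pairwise_distances

    rng = np.random.RandomState(0)
    X = rng.random_sample((400, 4))
    gen = pairwise_distances_blockwise(X, block_size=1, metric='euclidean')
    # One vstack over the collected blocks, instead of growing S in a loop.
    S = np.vstack(list(gen))
    assert_array_almost_equal(S, pairwise_distances(X, metric='euclidean'))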

    gen = pairwise_distances_blockwise(X, block_size=1, metric="euclidean")
    S = np.empty((0, X.shape[0]))
    for row in gen:
        S = np.vstack((S, row))
Member: Firstly, please check that block_size is upheld for each block.

Secondly, instead of repeated vstack, append the blocks to a list, then vstack the list once.

@@ -370,6 +372,49 @@ def test_pairwise_distances_argmin_min():
np.testing.assert_almost_equal(dist_orig_val, dist_chunked_val, decimal=7)


def check_invalid_block_size_generator(generator):
Member: isn't this the same as next?

    rng = np.random.RandomState(0)
    # Euclidean distance should be equivalent to calling the function.
    X = rng.random_sample((400, 4))
    gen = pairwise_distances_blockwise(X, block_size=1, metric="euclidean")
Member: we should be testing at least one other metric. I think you would do best with a helper function like check_pairwise_distances_blockwise, then call it for different sets of parameters including metric, Y None or otherwise, and block_size.
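A sketch of such a helper (names are hypothetical; assumes block_size is a MiB budget and the PR's pairwise_distances_blockwise is in scope):

    import numpy as np
    from numpy.testing import assert_array_almost_equal
    from sklearn.metrics import pairwise_distances

    def check_pairwise_distances_blockwise(X, Y, block_size, metric):
        blocks = list(pairwise_distances_blockwise(X, Y,
                                                   block_size=block_size,
                                                   metric=metric))
        # Each block must respect the memory budget (block_size in MiB)...
        for block in blocks:
            assert block.nbytes <= block_size * 2 ** 20
        # ...and the stacked blocks must reproduce the unblocked result.
        assert_array_almost_equal(np.vstack(blocks),
                                  pairwise_distances(X, Y, metric=metric))

    rng = np.random.RandomState(0)
    X = rng.random_sample((400, 4))
    Y = rng.random_sample((200, 4))
    for metric in ('euclidean', 'manhattan', 'cosine'):
        for Y_arg in (None, Y):
            check_pairwise_distances_blockwise(X, Y_arg, block_size=1,
                                               metric=metric)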

@dalmia (Contributor Author) commented Dec 12, 2016

@jnothman Thanks for the review. I've implemented most of the changes you mentioned. However, the part about using a parallel generator was a bit unclear to me. Since you say it isn't supported atm, should I be adding anything else there?

@jnothman (Member):
In terms of parallelism: the current approach means that n_jobs jobs are started for every block, which is a lot of overhead. It may be possible to use this kind of within-block parallelism only if we reuse the same jobs (threads or processes) from block to block. I think this is possible, but it would require passing a Parallel instance into pairwise_distances.

The parallelism used in #7177 was performed across blocks, but on the basis that each block was reduced to a negligible memory overhead before returning. For the sake of parallelism or otherwise, a more useful interface may be something like:

def pairwise_distances_reduce(X, Y, reduce_func, metric, n_jobs, block_size):
    ...

wherein reduce_func is applied to each block calculated in parallel processes. This could be used in silhouette_samples or in a rewrite of pairwise_distances_argmin_min.
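A sequential sketch of that interface (block_n_rows stands in for the block_size-to-rows conversion, and the parallel dispatch over blocks is omitted):

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def pairwise_distances_reduce(X, Y, reduce_func, metric='euclidean',
                                  block_n_rows=100, **kwds):
        # Each block is reduced as soon as it is computed, so only the
        # small reduced results -- never the full O(n^2) matrix -- are kept.
        results = []
        for start in range(0, X.shape[0], block_n_rows):
            D_block = pairwise_distances(X[start:start + block_n_rows], Y,
                                         metric=metric, **kwds)
            results.append(reduce_func(D_block))
        return results

    # E.g. a blockwise pairwise_distances_argmin_min:
    rng = np.random.RandomState(0)
    X, Y = rng.random_sample((400, 4)), rng.random_sample((200, 4))
    chunks = pairwise_distances_reduce(
        X, Y, lambda D: (D.argmin(axis=1), D.min(axis=1)))
    argmin = np.concatenate([a for a, _ in chunks])
    dist = np.concatenate([d for _, d in chunks])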

@dalmia (Contributor Author) commented Dec 15, 2016

@jnothman Thanks for the detailed explanation. I mostly follow what you are explaining. However, looking at #7177 I couldn't see where in the code each block is reduced to a negligible memory overhead. Could you please point me to it so that I can understand better and implement the same?

@jnothman (Member) commented Dec 15, 2016 via email

@dalmia (Contributor Author) commented Dec 15, 2016

I understood it now. So, do you have any such reduce_func in mind for pairwise_distances or should I try incorporating it for silhouette_samples directly?

@jnothman (Member) commented Dec 18, 2016 via email

@dalmia (Contributor Author) commented Dec 18, 2016

@jnothman Oh! So, should I get started on incorporating it for silhouette_samples or pairwise_distances_argmin_min via pairwise_distances_reduce?

@jnothman (Member) commented Dec 20, 2016 via email


@dalmia (Contributor Author) commented Dec 22, 2016

@jnothman I've been trying to think about what to do next here, but I'm a bit confused (as your PR for silhouette_samples is not yet merged). I assume you are saying that I should replace pairwise_distances in nearest neighbors with pairwise_distances_blockwise, since we don't have a reduce_func for pairwise_distances. Could you please brief me on what I should do next in this PR?

@jnothman (Member) commented Dec 22, 2016 via email

@jnothman (Member) commented Jan 5, 2017 via email

@dalmia (Contributor Author) commented Feb 20, 2017

Seems great, working on them. One doubt though:

> If we are to keep it, it needs tests.

By tests, do you mean something apart from those in the gist linked above (which I have added in utils/test_utils.py)?

@jnothman (Member):

> By tests, do you mean something apart from those in the gist linked above (which I have added in utils/test_utils.py)?

Ha, I'd forgotten I actually wrote a gist with unit tests :P

@jnothman jnothman changed the title [MRG] ENH: Added block_size parameter for lesser memory consumption [WIP] ENH: Added block_size parameter for lesser memory consumption Feb 23, 2017
@jnothman (Member) left a comment:

Changing to WIP while you make those substantial additions to the PR.

@jnothman (Member) left a comment:

Thanks for this, and sorry for the slow review. There continues to be interest in this kind of feature, with issues raised wrt silhouette_score and in the implementation of large margin nearest neighbors.

I don't think you need to integrate #7177 here; I think we can mark it MRG again.


    if batch_size is not None:
        warnings.warn("'batch_size' was deprecated in version 0.19 and will "
                      "be removed in version 0.21.", DeprecationWarning)
Member: You should set block_size according to batch_size, i.e. block_size = int(ceil(batch_size ** 2 / BYTES_PER_FLOAT / 1024 / 1024)) (I think).
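A sketch of that deprecation path (constants assumed; note that a batch_size x batch_size block of float64 occupies batch_size ** 2 * BYTES_PER_FLOAT bytes, so the conversion below multiplies where the hedged formula above divides):

    import warnings
    from math import ceil

    BYTES_PER_FLOAT = 8      # assumed value of the PR's constant
    DEFAULT_BLOCK_SIZE = 64  # assumed default, in MiB

    def _resolve_block_size(block_size=DEFAULT_BLOCK_SIZE, batch_size=None):
        if batch_size is not None:
            warnings.warn("'batch_size' was deprecated in version 0.19 and "
                          "will be removed in version 0.21.",
                          DeprecationWarning)
            # MiB needed for a batch_size x batch_size float64 block.
            block_size = int(ceil(batch_size ** 2 * BYTES_PER_FLOAT
                                  / 1024 / 1024))
        return block_size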

@@ -457,10 +431,13 @@ def pairwise_distances_argmin(X, Y, axis=1, metric="euclidean",
sklearn.metrics.pairwise_distances
sklearn.metrics.pairwise_distances_argmin_min
"""
if batch_size is not None:
Member: Just forward the parameters as provided onto argmin_min and let it do the warning and conversions.
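For instance, a sketch against the PR's assumed signatures (block_size=None standing in for "use the callee's default"):

    def pairwise_distances_argmin(X, Y, axis=1, metric='euclidean',
                                  batch_size=None, block_size=None,
                                  metric_kwargs=None):
        # Forward batch_size/block_size untouched: argmin_min owns the
        # deprecation warning and the batch_size -> block_size conversion.
        return pairwise_distances_argmin_min(X, Y, axis=axis, metric=metric,
                                             batch_size=batch_size,
                                             block_size=block_size,
                                             metric_kwargs=metric_kwargs)[0]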

@@ -256,8 +257,19 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
return distances if squared else np.sqrt(distances, out=distances)


def _argmin_min_reduce_min(dist):
Member: Should this just be _argmin_min_reduce?
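For reference, such a reduce function might look like this sketch:

    import numpy as np

    def _argmin_min_reduce(dist):
        # Reduce one block of distances to per-row (argmin, min) pairs;
        # the block itself can then be discarded.
        indices = dist.argmin(axis=1)
        values = dist[np.arange(dist.shape[0]), indices]
        return indices, values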

Returns
-------
D : array-like or sparse matrix or tuple
A distance matrix D such that D_{i, j} is the distance between the
Member: This is not accurate. The only thing we can say about the output here is that its first axis (or the first axis of each entry in a tuple) corresponds to rows in X.

are computed. If the input is a distances matrix, it is returned in blocks
instead.

This is equivalent to calling:
Member: "vstacking the chunks generated by this function is equivalent to calling:"


pairwise_distances(X, y, metric, n_jobs)

but may use less memory.
Member: "but avoids storing the entire distance matrix in memory"

@@ -1131,6 +1109,214 @@ def _pairwise_callable(X, Y, metric, **kwds):
'sokalsneath', 'sqeuclidean', 'yule', "wminkowski"]


def _generate_pairwise_distances_blockwise(X, Y=None, metric='euclidean',
                                           n_jobs=1,
                                           block_size=DEFAULT_BLOCK_SIZE,
Member: block_size isn't used here

D : generator of blocks based on the ``block_size`` parameter.

"""
if metric != 'precomputed' and Y is None:
Member: I'd do this in pairwise_distances_blockwise

    n_samples = X.shape[0]
    for start in range(0, n_samples, block_n_rows):
        # get distances from block to every other sample
        stop = min(start + block_n_rows, X.shape[0])
Member: does just using start + block_n_rows break anything? Usually slice notation does not raise an error if stop is greater than the length.
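(For the record, it doesn't break: NumPy, like Python sequences, clips slice bounds rather than raising, so the min() is indeed redundant.)

    import numpy as np

    X = np.arange(10)
    print(X[8:15])  # [8 9] -- a stop past the end is clipped, no error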

@@ -1131,6 +1109,214 @@ def _pairwise_callable(X, Y, metric, **kwds):
'sokalsneath', 'sqeuclidean', 'yule', "wminkowski"]


def _generate_pairwise_distances_blockwise(X, Y=None, metric='euclidean',
Member: I think you should inline this function in pairwise_distances_blockwise.

@jnothman jnothman changed the title [WIP] ENH: Added block_size parameter for lesser memory consumption [MRG] ENH: Added block_size parameter for lesser memory consumption May 24, 2017
@jnothman jnothman added the Sprint label Jun 3, 2017
@jnothman (Member) commented Jun 3, 2017

@dalmia if this is considered for attention during an upcoming sprint, would you like to help finish it, or let someone else take it on?

@jnothman (Member) commented Jun 8, 2017

@dalmia, I will mark this for someone else to finish.

@dalmia (Contributor Author) commented Jun 8, 2017

@jnothman please do; I won't be able to continue work on this.

@jnothman jnothman mentioned this pull request Jun 12, 2017
4 tasks
@jnothman jnothman added this to the 0.19 milestone Jun 13, 2017
@jnothman jnothman removed the Sprint label Jun 19, 2017
@jnothman jnothman modified the milestones: 0.19, 0.20 Jun 27, 2017
@jnothman (Member):
As noted at #8602 (comment), this continues to be an in-demand feature, but I've received next-to-no feedback from other core devs on whether I've led this PR in the right direction. I've also wondered whether we should make the batch size a global option (e.g. sklearn.set_config(chunk_memory='64M')) and then use it wherever operations can be chunked. Your thoughts on whether this is a priority; its design (at least in terms of the magical flexible_vstack I introduced); and whether configuration should be global are very welcome, @amueller, @lesteve, @GaelVaroquaux, ...
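A sketch of the global-option idea (chunk_memory is hypothetical here; scikit-learn later adopted a similar option, set_config(working_memory=...), via the superseding PR #10280):

    _config = {'chunk_memory': 64 * 2 ** 20}  # global budget, in bytes

    def set_config(chunk_memory=None):
        # Global setter in the spirit of sklearn.set_config.
        if chunk_memory is not None:
            _config['chunk_memory'] = chunk_memory

    def get_chunk_n_rows(row_bytes):
        # Any chunkable operation asks how many rows fit within the budget.
        return max(1, _config['chunk_memory'] // row_bytes)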

@jnothman (Member) commented Feb 6, 2018

Closing as superseded by #10280

@jnothman jnothman closed this Feb 6, 2018
Successfully merging this pull request may close: Batch pairwise_distances in neighbors to reduce memory consumption (#7287).