ENH: Add np.scale function #8196
Conversation
numpy/lib/function_base.py
Raises
------
ValueError
    When ``min > max`` or when values of ``min``, ``max``, and ``axis``
`min > max` seems like an unnecessary restriction
Right. I was thinking about that but then apparently forgot about it. 😄
numpy/lib/function_base.py
    out[...] = (arr - mins) / (maxs - mins) * (max - min) + min
else:
    arr = np.rollaxis(arr, axis)
    out[...] = np.rollaxis(
I feel like this could be done better by just broadcasting with `min` and `max`
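A minimal sketch of the broadcasting approach being suggested (the `scale_broadcast` name, signature, and sample data are mine, not the PR's):

```python
import numpy as np

def scale_broadcast(arr, new_min=0.0, new_max=1.0, axis=None):
    # Reduce with keepdims=True so the extremes broadcast directly
    # against `arr`; no np.rollaxis juggling is needed.
    mins = np.nanmin(arr, axis=axis, keepdims=True)
    maxs = np.nanmax(arr, axis=axis, keepdims=True)
    return (arr - mins) / (maxs - mins) * (new_max - new_min) + new_min

a = np.array([[0., 5.], [10., 20.]])
scale_broadcast(a, axis=0)  # each column mapped onto [0, 1]
```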
Thanks. I agree it's much better. Have I gotten the code correct and concise enough?
@@ -1,4 +1,4 @@
-from __future__ import division, absolute_import, print_function
+from __future__ import division, absolute_import, print_function, with_statement
Pretty sure numpy doesn't need 2.5 support any more
THANKS.txt
@@ -49,6 +49,7 @@ Alan McIntyre for updating the NumPy test framework to use nose, improve
 Joe Harrington for administering the 2008 Documentation Sprint.
 Mark Wiebe for the new NumPy iterator, the float16 data type, improved
 low-level data type operations, and other NumPy core improvements.
+Kernc for scale.
Is there a threshold for being added to this list? What is that threshold?
There's not; the contributor guidelines suggest you add yourself. Full name would be better than GitHub handle.
We should probably change that file in some way or get rid of it completely. We acknowledge and thank all contributors in the release notes; this file is often not updated; when you do update it in a PR there's often a merge conflict.
I see authors mentioned in 1.8.0-notes.rst, but not in 1.7.0 and earlier or in 1.9.0 through 1.12.0. Is that a problem? Given the low update frequency, it's much more likely the conflict would arise in the release notes. :)
@kernc can you please propose adding this function on the numpy-discussion mailing list? That's where we decide on enhancements. The name
numpy/lib/function_base.py
    its shape must match that of projection of `arr` on `axis`.
axis : int, optional
    Axis along which to scale `arr`. If `None`, scaling is done over
    the flattened array.
"all axes" instead of "the flattened array". The only thing worth mentioning about flattening is that the returned array for n-D input is 1-D.
numpy/lib/function_base.py
mins, maxs = np.nanmin(arr, axis), np.nanmax(arr, axis)

if axis is not None:
    shape = [slice(None)] * arr.ndim
`shape` is a bad name here. `slice` or `idx` would be better
numpy/lib/function_base.py
if out is None:
    out = np.empty_like(arr, dtype=float)

mins, maxs = np.nanmin(arr, axis), np.nanmax(arr, axis)
Add `keepdims=True` here, and you don't need the `mins, maxs = mins[shape], maxs[shape]` line below
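A small illustration of this suggestion (the sample array is mine):

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 8., 12.]])
# Without keepdims the reduced axis is dropped, so the result must be
# re-indexed before it broadcasts against arr:
np.nanmin(arr, axis=1).shape                  # (2,)
# With keepdims=True the result broadcasts directly:
mins = np.nanmin(arr, axis=1, keepdims=True)  # shape (2, 1)
maxs = np.nanmax(arr, axis=1, keepdims=True)
scaled = (arr - mins) / (maxs - mins)         # each row mapped onto [0, 1]
```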
Thanks.
numpy/lib/function_base.py
if axis is None and (min.ndim or max.ndim):
    raise ValueError('min and max must be scalar when axis is None')
if axis is not None:
    valid_shapes = ((), arr.shape[:axis] + arr.shape[axis + 1:])
Would it make more sense to enforce that the shape is `arr.shape[:axis] + (1,) + arr.shape[axis + 1:]`?
I was considering it, but then everyone would have to do this hard-to-reason-about reshaping themselves?
I mean, in a simple 2d case, it seems easier to obtain a 1d vector of extremes than a properly shaped matrix ...

On the other hand, `keepdims` looks ubiquitous enough. I think you're right. 👍
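To make the two candidate shapes concrete (the example array is mine):

```python
import numpy as np

arr = np.zeros((4, 3))
axis = 1
# Shape with the reduced axis dropped, as the PR currently accepts:
dropped = arr.shape[:axis] + arr.shape[axis + 1:]      # (4,)
# Shape with the reduced axis kept as length 1 (keepdims style):
kept = arr.shape[:axis] + (1,) + arr.shape[axis + 1:]  # (4, 1)
# A dropped-axis vector converts to the keepdims shape in one call:
np.expand_dims(np.ones(dropped), axis).shape           # (4, 1)
```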
Perhaps both could be supported?
The transformation where something is linearly changed in size is normally called scaling. People search for it. With vectors, it's also called normalization, but the meaning of this term seems more domain dependent (what is normal?) and implies scaling to unit length or to 1, whereas this function is meant for scaling to arbitrary intervals given interval extremes. What would you call it? I'll write to the mailing list, thanks. |
I'm still not sure this is warranted for numpy, but for the name, I think
I feel like the function that's actually wanted here is:

    def scale(data, min, max, axis):
        return arduino_map(data,
                           in_min=data.min(axis, keepdims=True),
                           in_max=data.max(axis, keepdims=True),
                           out_min=min,
                           out_max=max)

Most of the time you don't want to rescale to the min and max of the data - you want to use some predefined min and max that you would expect your data to always lie within. Using ...

Perhaps:

    def rescale(array, in_min, in_max, out_min=0, out_max=1):
        return out_min + (array - in_min) * (out_max - out_min) / (in_max - in_min)

Or perhaps the signature should be ... Of course, this whole thing is just an interpolation, so it is perhaps better expressed through a sufficiently-broadcasting interpolate function of some kind.
☔ The latest upstream changes (presumably #8182) made this pull request unmergeable. Please resolve the merge conflicts.
1.12.x has been branched, you should move the comments in the 1.12.0-notes to the 1.13.0-notes.
FWIW, I use this function pretty regularly. That does a different operation, by default returning an array with zero mean and a standard deviation of 1. That seems like a different use case. Not sure if this function can or should include the sklearn functionality. In either event, the name `scale` to me means the sklearn version.
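For contrast, the sklearn-style operation referred to here (standardization) can be sketched in plain NumPy; the sample data is mine:

```python
import numpy as np

a = np.array([1., 2., 3., 4., 5.])
# Standardize: zero mean and unit standard deviation,
# rather than mapping onto a fixed interval.
standardized = (a - a.mean()) / a.std()
```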
@mattharrigan Both methods are valid, mostly depending on the use case. I agree with @eric-wieser's idea of additional prior min / prior max arguments. Thanks; will update accordingly. Two use cases I experienced in the past week:
For posterity, the way to achieve this simply:

    rescaled = np.interp(a, (a.min(), a.max()), (0, 1))

@shoyer,
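The same `np.interp` one-liner generalizes to any target interval, e.g. [-1, 1] (the sample values are mine):

```python
import numpy as np

a = np.array([3., 7., 5.])
# a.min() maps to -1, a.max() maps to +1, values in between linearly.
rescaled = np.interp(a, (a.min(), a.max()), (-1, 1))
```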
I'm not strongly opposed, just leaning against it. Consider this a -0 vote. (Note that you shouldn't take my opinion as a definitive pronouncement here, but you will need to convince at least a plurality of maintainers that this is worth adding.) My reasons:
I feel pretty much the same as @shoyer about this. Long term, my concern is that numpy may slowly become bloated with small functions. That said, the numpy polynomials do this sort of scaling but they also need to store the scaling parameters. That points up the problem of providing a single function for this simple operation when the use cases may differ in small details from project to project.
Could you elaborate with an example?
@eric-wieser The various polynomial classes scale the domain to the window. For instance, the default window for Chebyshev polynomials is [-1, 1], while the domain can be any interval. This sort of scaling is essential for numerical stability, especially when fitting polynomials to data. I suppose that instead of storing the scaling parameters in the polynomial classes one could store a scaling function generated by a scaling function factory.
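A quick illustration of the domain-to-window mapping in the polynomial classes (the sample fit is mine):

```python
import numpy as np
from numpy.polynomial import Chebyshev

x = np.linspace(0., 100., 50)
y = 2. * x ** 2 + 1.
c = Chebyshev.fit(x, y, 2)
# fit() records the scaling parameters: the data domain [0, 100]
# is mapped onto the default Chebyshev window [-1, 1] internally.
c.domain  # array([  0., 100.])
c.window  # array([-1.,  1.])
```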
Oh, I see, by "do this sort of scaling", you mean "use it internally", not "already implement the feature of this PR"
Yep. Scaling and shifting of some sort is quite common, but I think the questions here are
Note that scaling might not only be scaling to intervals, but scaling by standard deviation or some other norm.
Pushing the
Although the raw form of
I have my own implementation of this function and I use it often. It would be nice to have it in numpy. btw, there are similar implementations of this:
numpy/lib/function_base.py
            .format(valid_shapes[-1]))

if out is None:
    out = np.empty_like(arr, dtype=float)
You could move the test for `out is None` to the return so that you wouldn't create an unnecessary `out` matrix in the most likely case when `out` is None:

    temp = ...  # compute
    if out is not None:
        out[...] = temp  # copy to out
        return out
    return temp
@bashtage: I don't think `out` makes any sense as an argument at all unless it offers some performance improvement over `out[...] = func(...)`, which this does not
Thanks. 👍
This pull request has had the "needs decision" tag since November 2017. My impression is that the function is at the edge of what is considered worthwhile adding to NumPy.

So this comment is really a ping to the NumPy devs: is it desirable to add such a linear rescaling function to NumPy? The lack of a decision is effectively a "no", but maybe a few devs taking a fresh look will think otherwise. @kernc, if the answer turns out to be "yes", are you still interested in working on this?

If the answer to both those questions is yes, there is still some work to do, starting with a rebase to fix the merge conflict. The function should do a better job of handling a constant input. Currently it generates several warnings and returns all
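One possible way to handle the constant-input case mentioned above (the midpoint choice and the `rescale_safe` name are my own sketch, not a decision of this PR):

```python
import numpy as np

def rescale_safe(arr, new_min=0.0, new_max=1.0):
    lo, hi = np.nanmin(arr), np.nanmax(arr)
    if hi == lo:
        # Constant input: map to the midpoint of the target interval
        # instead of dividing by zero and producing warnings/nans.
        return np.full(np.shape(arr), (new_min + new_max) / 2.0)
    return (arr - lo) / (hi - lo) * (new_max - new_min) + new_min

rescale_safe(np.array([7., 7., 7.]))  # -> array([0.5, 0.5, 0.5])
```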
This PR adds a scale function for scaling non-NaN array values in any dimension onto a desired interval. I found myself somewhat regularly using something like this and copying it into projects. I think a robust implementation should fit into numpy quite nicely.
I hope you too won't find it unimaginable to have something like this in numpy and at roughly the proposed location.