ENH: Add np.scale function by kernc · Pull Request #8196 · numpy/numpy · GitHub

ENH: Add np.scale function #8196


Closed
wants to merge 3 commits into from

Conversation

@kernc (Author) commented Oct 21, 2016:

This PR adds a scale function for scaling non-NaN array values along any dimension onto a desired interval. I found myself using something like this fairly regularly and copying it between projects, and I think a robust implementation would fit into NumPy quite nicely.

I hope you too won't find it unimaginable to have something like this in NumPy, at roughly the proposed location.
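
A rough sketch of the kind of helper being proposed (illustrative only; the exact signature, defaults, and NaN handling here are assumptions, not the PR's code):

import numpy as np

def scale(arr, min=0, max=1, axis=None):
    # Linearly rescale the non-NaN values of `arr` onto [min, max].
    arr = np.asarray(arr, dtype=float)
    # keepdims=True makes the extremes broadcast against `arr` for any axis.
    lo = np.nanmin(arr, axis=axis, keepdims=True)
    hi = np.nanmax(arr, axis=axis, keepdims=True)
    return (arr - lo) / (hi - lo) * (max - min) + min

scale([1.0, 2.0, np.nan, 5.0])     # array([0.  , 0.25,  nan, 1.  ])
scale([[1, 2], [3, 4]], axis=0)    # each column scaled onto [0, 1]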

Raises
------
ValueError
When ``min > max`` or when values of ``min``, ``max``, and ``axis``
Member:

min > max seems like an unnecessary restriction

Author:

Right. I was thinking about that but then apparently forgot about it. 😄
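
For illustration, nothing in the linear formula actually requires min <= max; a reversed target interval simply mirrors the data (a quick sketch, not the PR's code):

import numpy as np

a = np.array([0.0, 0.5, 1.0])
lo, hi = 1.0, 0.0   # "reversed" target interval
# Same formula as in the PR; min > max just mirrors the values.
(a - a.min()) / (a.max() - a.min()) * (hi - lo) + lo   # array([1. , 0.5, 0. ])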

out[...] = (arr - mins) / (maxs - mins) * (max - min) + min
else:
arr = np.rollaxis(arr, axis)
out[...] = np.rollaxis(
Member:

I feel like this could be done better by just broadcasting with min and max

Author:

Thanks. I agree it's much better. Is the code now correct and concise enough?

@@ -1,4 +1,4 @@
from __future__ import division, absolute_import, print_function
from __future__ import division, absolute_import, print_function, with_statement
Member:

Pretty sure numpy doesn't need 2.5 support any more

THANKS.txt Outdated
@@ -49,6 +49,7 @@ Alan McIntyre for updating the NumPy test framework to use nose, improve
Joe Harrington for administering the 2008 Documentation Sprint.
Mark Wiebe for the new NumPy iterator, the float16 data type, improved
low-level data type operations, and other NumPy core improvements.
Kernc for scale.
Member:

Is there a threshold for being added to this list? What is that threshold?

Member:

There's not; the contributor guidelines suggest you add yourself. Your full name would be better than your GitHub handle.

We should probably change that file in some way or get rid of it completely. We acknowledge and thank all contributors in the release notes; this file is often not updated, and when you do update it in a PR there's often a merge conflict.

@kernc (Author), Oct 22, 2016:

I see authors mentioned in 1.8.0-notes.rst, but not in 1.7.0 and earlier, or in 1.9.0 through 1.12.0. Is that a problem? Given the low update frequency, it's much more likely the conflict would arise in the release notes. :)

@rgommers (Member):

@kernc can you please propose adding this function on the numpy-discussion mailing list? That's where we decide on enhancements.

The name scale is a bit too generic imho.

its shape must match that of projection of `arr` on `axis`.
axis : int, optional
Axis along which to scale `arr`. If `None`, scaling is done over
the flattened array.
Member:
"all axes" instead of "the flattened array". Only need to mention flattening is the returned array for n-D input is 1-D.

mins, maxs = np.nanmin(arr, axis), np.nanmax(arr, axis)

if axis is not None:
shape = [slice(None)] * arr.ndim
Member:

shape is a bad name here. slice or idx would be better

if out is None:
out = np.empty_like(arr, dtype=float)

mins, maxs = np.nanmin(arr, axis), np.nanmax(arr, axis)
Member:

Add keepdims=True here, and you don't need the mins, maxs = mins[shape], maxs[shape] line below

Author:

Thanks.
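
For reference, a sketch of the suggested keepdims approach (illustrative, not the PR's exact diff):

import numpy as np

arr = np.arange(12, dtype=float).reshape(3, 4)
# With keepdims=True the reduced extremes keep a size-1 axis, so they
# broadcast directly against `arr`; no extra indexing step is needed.
mins = np.nanmin(arr, axis=1, keepdims=True)   # shape (3, 1)
maxs = np.nanmax(arr, axis=1, keepdims=True)   # shape (3, 1)
out = (arr - mins) / (maxs - mins)             # each row scaled onto [0, 1]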

if axis is None and (min.ndim or max.ndim):
raise ValueError('min and max must be scalar when axis is None')
if axis is not None:
valid_shapes = ((), arr.shape[:axis] + arr.shape[axis + 1:])
Member:

Would it make more sense to enforce that the shape is arr.shape[:axis] + (1,) + arr.shape[axis + 1:]?

Author:

I was considering it, but then anyone would have to do this hard-to-reason-about reshaping themselves?

Author:

I mean, in a simple 2d case, it seems easier to obtain a 1d vector of extremes than a properly shaped matrix ...

On the other hand, keepdims looks ubiquitous enough. I think you're right. 👍

Member:

Perhaps both could be supported?

@kernc (Author) commented Oct 22, 2016:

> The name scale is a bit too generic imho.

The transformation where something is linearly changed in size is normally called scaling, and people search for it by that name. With vectors it's also called normalization, but the meaning of that term seems more domain-dependent (what is normal?) and implies scaling to unit length or to 1, whereas this function is meant for scaling to arbitrary intervals given the interval extremes. What would you call it?

I'll write to the mailing list, thanks.

@shoyer (Member) commented Oct 22, 2016:

I'm still not sure this is warranted for numpy, but for the name, I think rescale would be more appropriate.

@eric-wieser (Member) commented Oct 22, 2016:

I feel like the function that's actually wanted here is the remap provided in Arduino (its map function), which this function can be written in terms of:

def scale(data, min, max, axis):
    return arduino_map(data,
        in_min=data.min(axis, keepdims=True),
        in_max=data.max(axis, keepdims=True),
        out_min=min,
        out_max=max
    )

Most of the time you don't want to rescale to the min and max of the data - you want to use some predefined min and max that you would expect your data to always lie within. Using remap gives you the freedom to do this.

Perhaps rescale is the right name for this arduino_map. The implementation would be as follows (possibly with the addition of careful handling of overflow)

def rescale(array, in_min, in_max, out_min=0, out_max=1):
    return out_min + (array - in_min) * (out_max - out_min) / (in_max - in_min)

Or perhaps the signature should be rescale(array, in_range=(min, max), out_range=(min, max)), which could be complemented with a np.span function that returns np.min, np.max (and does so more efficiently than calling the two separately).


Of course, this whole thing is just an interpolation, so perhaps it is better expressed through a sufficiently-broadcasting interpolate function of some kind.
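
A sketch of that tuple-range spelling (hypothetical names and defaults, not a committed API):

import numpy as np

def rescale(array, in_range, out_range=(0, 1)):
    # Linearly map values from `in_range` onto `out_range`.
    (in_min, in_max), (out_min, out_max) = in_range, out_range
    array = np.asarray(array, dtype=float)
    return out_min + (array - in_min) * (out_max - out_min) / (in_max - in_min)

# E.g. map 12-bit ADC readings onto percentages, independent of the
# extremes actually present in the data.
rescale([0, 1024, 4095], in_range=(0, 4095), out_range=(0, 100))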

@homu (Contributor) commented Nov 5, 2016:

☔ The latest upstream changes (presumably #8182) made this pull request unmergeable. Please resolve the merge conflicts.

@charris (Member) commented Nov 5, 2016:

1.12.x has been branched, you should move the comments in the 1.12.0-notes to the 1.13.0-notes.

@charris added the "56 - Needs Release Note" label Nov 5, 2016
@mattharrigan (Contributor):
FWIW, I use sklearn.preprocessing.scale pretty regularly. That does a different operation, by default returning an array with zero mean and a standard deviation of 1. That seems like a different use case. I'm not sure whether this function can or should include the sklearn functionality. In any event, the name scale to me means the sklearn version.

@kernc (Author) commented Nov 10, 2016:

@mattharrigan Both methods are valid, mostly depending on the use case.

I agree with @eric-wieser's idea of additional prior-min / prior-max arguments. Thanks; I will update accordingly. Two use cases I've run into in the past week:

  1. Have a long vector of data. Because it's really long, work with samples. To visualize a sample, associate a color intensity (opacity, 0-255) with each value according to the peak-to-peak range of the original full vector.
  2. Use scipy.optimize.differential_evolution. Provide varied, non-normalized bounds for N variables. Provide a loss function that works within those bounds, and consider including some regularization of the parameters. Because not all parameters are from the same interval space (i.e. some are bounded by [0, 1], others by [-1e5, 1e5], ...), I want to scale the parameters according to the input bounds so that all parameters exert the same regularization influence. I could pre-scale the bounds, of course, but that might also make the loss function less intuitive.

For posterity, a simple way to achieve this:

rescaled = np.interp(a, (a.min(), a.max()), (0, 1))
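
Note that np.interp covers only the flattened 1-D case; per-axis scaling can still be had by broadcasting keepdims extremes, e.g. (a sketch):

import numpy as np

a = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
lo = a.min(axis=0, keepdims=True)   # per-column minima, shape (1, 2)
hi = a.max(axis=0, keepdims=True)   # per-column maxima, shape (1, 2)
rescaled = (a - lo) / (hi - lo)     # each column mapped onto [0, 1]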

@shoyer, rescale it is. How strongly do you feel against this in numpy and why?

@shoyer (Member) commented Nov 10, 2016:

> How strongly do you feel against this in numpy and why?

I'm not strongly opposed, just leaning against it. Consider this a -0 vote. (Note that you shouldn't take my opinion as a definitive pronouncement here, but you will need to convince at least a plurality of maintainers that this is worth adding.)

My reasons:

  1. It's not unambiguous what scale or even rescale means. The existence of sklearn.preprocessing.scale doing something totally different is actually a strong point against including this in NumPy. NumPy already has too many utility functions with domain-specific logic that we are stuck maintaining in perpetuity (for a recent example, see atleast_3d).
  2. This is a small utility function, which is quite easy to write (and rewrite) in user or domain-specific packages (e.g., sklearn.preprocessing.scale). So the incremental value of including it in NumPy is minimal.

@charris (Member) commented Nov 10, 2016:

I feel pretty much the same as @shoyer about this. Long term, my concern is that numpy may slowly become bloated with small functions. That said, the numpy polynomials do this sort of scaling but they also need to store the scaling parameters. That points up the problem of providing a single function for this simple operation when the use cases may differ in small details from project to project.

@eric-wieser (Member):

> the numpy polynomials do this sort of scaling

Could you elaborate with an example?

@charris (Member) commented Nov 10, 2016:

@eric-wieser The various polynomial classes scale the domain to the window. For instance, the default window for Chebyshev polynomials is [-1, 1], while the domain can be any interval. This sort of scaling is essential for numerical stability, especially when fitting polynomials to data.

I suppose that instead of storing the scaling parameters in the polynomial classes one could store a scaling function generated by a scaling function factory.
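
For reference, the polynomial package already exposes this linear mapping as numpy.polynomial.polyutils.mapdomain, for instance:

import numpy as np
from numpy.polynomial import polyutils

x = np.linspace(0, 10, 5)
# Map data from its domain [0, 10] onto the default Chebyshev window [-1, 1].
polyutils.mapdomain(x, (0, 10), (-1, 1))   # array([-1. , -0.5,  0. ,  0.5,  1. ])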

@eric-wieser (Member):

Oh, I see, by "do this sort of scaling", you mean "use it internally", not "already implement the feature of this PR"

@charris (Member) commented Nov 10, 2016:

Yep. Scaling and shifting of some sort is quite common, but I think the questions here are:

  • Is scaling complicated enough to justify a function in NumPy?
  • Can we make a function of sufficient generality to cover most use cases?

Note that scaling might not only be scaling to intervals, but scaling by standard deviation or some other norm.

@eric-wieser (Member) commented Nov 11, 2016:

Pushing the rescale(array, in_min, in_max, out_min=0, out_max=1) I suggested earlier: rescaling to zero mean and unit deviation is expressed as:

np.rescale(arr, in_min=mu, in_max=mu+std)

Although the raw form of (arr - mu)/std is way clearer here. It's really only the four-argument form that I think provides value over the raw expression.

@mherkazandjian:

I have my own implementation of this function and I use it often. It would be nice to have it in numpy.
But to avoid having lots of small functions in numpy, I think it would be nice to have a flexible design for this "rescale" functionality: a design that would allow providing a custom scaling function, i.e. the scaling could be uniform, normal, etc.

btw, there are similar implementations of this:
http://www.harrisgeospatial.com/docs/BYTSCL.html
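
One possible shape for such a flexible design (purely a sketch; the method names here are made up):

import numpy as np

def rescale(arr, method="minmax", axis=None):
    # Sketch of a pluggable rescaler; `method` selects the scaling rule.
    arr = np.asarray(arr, dtype=float)
    if method == "minmax":    # map extremes onto [0, 1]
        lo = np.nanmin(arr, axis=axis, keepdims=True)
        hi = np.nanmax(arr, axis=axis, keepdims=True)
        return (arr - lo) / (hi - lo)
    if method == "zscore":    # zero mean, unit standard deviation
        mu = np.nanmean(arr, axis=axis, keepdims=True)
        return (arr - mu) / np.nanstd(arr, axis=axis, keepdims=True)
    if callable(method):      # user-supplied scaling rule
        return method(arr, axis)
    raise ValueError("unknown method: %r" % (method,))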

.format(valid_shapes[-1]))

if out is None:
out = np.empty_like(arr, dtype=float)
Contributor:

You could move the test for out is None to the return, so that you don't create an unnecessary out array in the most likely case, when out is None.

temp = ... # Compute
if out is not None:
    out[...] = temp # Copy to out
    return out
return temp

Member:

@bashtage: I don't think out makes any sense as an argument at all unless it offers some performance improvement over out[...] = func, which this does not.

Author:

Thanks. 👍

@kernc force-pushed the scale branch 5 times, most recently from 289c2d1 to ff44639 on December 10, 2017
@WarrenWeckesser (Member):

This pull request has had the "needs decision" tag since November 2017. My impression is that the function is at the edge of what is considered worthwhile adding to NumPy. So this comment is really a ping to the NumPy devs: is it desirable to add such a linear rescaling function to NumPy? The lack of a decision is effectively a "no", but maybe a few devs taking a fresh look will think otherwise.

@kernc, if the answer turns out to be "yes", are you still interested in working on this?

If the answer to both those questions is yes, there is still some work to do, starting with a rebase to fix the merge conflict.

The function should do a better job of handling a constant input. Currently it generates several warnings and returns all nans:

In [2]: np.rescale([10, 10, 10])                                                                                                   
/Users/warren/mc37numpy/lib/python3.7/site-packages/numpy-1.15.0.dev0+967809c-py3.7-macosx-10.7-x86_64.egg/numpy/lib/function_base.py:4479: RuntimeWarning: divide by zero encountered in double_scalars
  offset = (in_max * out_min - in_min * out_max) / oldlen
/Users/warren/mc37numpy/lib/python3.7/site-packages/numpy-1.15.0.dev0+967809c-py3.7-macosx-10.7-x86_64.egg/numpy/lib/function_base.py:4480: RuntimeWarning: divide by zero encountered in true_divide
  scale = newlen / oldlen
/Users/warren/mc37numpy/lib/python3.7/site-packages/numpy-1.15.0.dev0+967809c-py3.7-macosx-10.7-x86_64.egg/numpy/lib/function_base.py:4482: RuntimeWarning: invalid value encountered in add
  res = arr * scale + offset
Out[2]: array([nan, nan, nan])
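
One way to guard the degenerate constant-input case (a sketch; mapping constants to the midpoint of the output range is an arbitrary but common choice):

import numpy as np

def rescale_safe(arr, out_min=0.0, out_max=1.0):
    # Sketch only: like the PR's rescale, but a constant input maps to
    # the midpoint of the output range instead of producing NaNs.
    arr = np.asarray(arr, dtype=float)
    lo, hi = np.nanmin(arr), np.nanmax(arr)
    if hi == lo:
        return np.full_like(arr, (out_min + out_max) / 2.0)
    return (arr - lo) / (hi - lo) * (out_max - out_min) + out_min

rescale_safe([10, 10, 10])   # array([0.5, 0.5, 0.5])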

@rgommers (Member):

Both @shoyer and @charris were leaning towards no, or at least "meh". I asked for proposing on the mailing list, which IIRC wasn't done. I agree with the reservations of @charris and @shoyer. So I propose to close this.

@seberg closed this Aug 18, 2019