ENH: Add np.scale function by kernc · Pull Request #8196 · numpy/numpy · GitHub

ENH: Add np.scale function #8196


Closed
wants to merge 3 commits into from

Conversation

@kernc (Author) commented Oct 21, 2016:

This PR adds a scale function for scaling non-NaN array values along any dimension onto a desired interval. I found myself using something like this fairly regularly and copying it between projects, and I think a robust implementation would fit into NumPy quite nicely.

I hope you too won't find it unimaginable to have something like this in NumPy, at roughly the proposed location.
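
A rough sketch of the kind of helper being proposed (illustrative only; the exact signature, defaults, and NaN handling here are assumptions, not the PR's code):

import numpy as np

def scale(arr, min=0, max=1, axis=None):
    # Linearly rescale the non-NaN values of `arr` onto [min, max].
    arr = np.asarray(arr, dtype=float)
    # keepdims=True makes the extremes broadcast against `arr` for any axis.
    lo = np.nanmin(arr, axis=axis, keepdims=True)
    hi = np.nanmax(arr, axis=axis, keepdims=True)
    return (arr - lo) / (hi - lo) * (max - min) + min

scale([1.0, 2.0, np.nan, 5.0])     # array([0.  , 0.25,  nan, 1.  ])
scale([[1, 2], [3, 4]], axis=0)    # each column scaled onto [0, 1]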

Raises
------
ValueError
When ``min > max`` or when values of ``min``, ``max``, and ``axis``
Member:

min > max seems like an unnecessary restriction

Author:

Right. I was thinking about that but then apparently forgot about it. 😄
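
For illustration, nothing in the linear formula actually requires min <= max; a reversed target interval simply mirrors the data (a quick sketch, not the PR's code):

import numpy as np

a = np.array([0.0, 0.5, 1.0])
lo, hi = 1.0, 0.0   # "reversed" target interval
# Same formula as in the PR; min > max just mirrors the values.
(a - a.min()) / (a.max() - a.min()) * (hi - lo) + lo   # array([1. , 0.5, 0. ])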

out[...] = (arr - mins) / (maxs - mins) * (max - min) + min
else:
arr = np.rollaxis(arr, axis)
out[...] = np.rollaxis(
Member:

I feel like this could be done better by just broadcasting with min and max

Author:

Thanks. I agree it's much better. Is the code now correct and concise enough?

@@ -1,4 +1,4 @@
from __future__ import division, absolute_import, print_function
from __future__ import division, absolute_import, print_function, with_statement
Member:

Pretty sure numpy doesn't need 2.5 support any more

THANKS.txt Outdated
@@ -49,6 +49,7 @@ Alan McIntyre for updating the NumPy test framework to use nose, improve
Joe Harrington for administering the 2008 Documentation Sprint.
Mark Wiebe for the new NumPy iterator, the float16 data type, improved
low-level data type operations, and other NumPy core improvements.
Kernc for scale.
Member:

Is there a threshold for being added to this list? What is that threshold?

Member:

There's not; the contributor guidelines suggest you add yourself. Your full name would be better than your GitHub handle.

We should probably change that file in some way or get rid of it completely. We acknowledge and thank all contributors in the release notes; this file is often not updated, and when you do update it in a PR there's often a merge conflict.

@kernc (Author), Oct 22, 2016:

I see authors mentioned in 1.8.0-notes.rst, but not in 1.7.0 and earlier, or in 1.9.0 through 1.12.0. Is that a problem? Given the low update frequency, it's much more likely the conflict would arise in the release notes. :)

@rgommers (Member):

@kernc can you please propose adding this function on the numpy-discussion mailing list? That's where we decide on enhancements.

The name scale is a bit too generic imho.

its shape must match that of projection of `arr` on `axis`.
axis : int, optional
Axis along which to scale `arr`. If `None`, scaling is done over
the flattened array.
Member:
"all axes" instead of "the flattened array". Only need to mention flattening is the returned array for n-D input is 1-D.

mins, maxs = np.nanmin(arr, axis), np.nanmax(arr, axis)

if axis is not None:
shape = [slice(None)] * arr.ndim
Member:

shape is a bad name here. slice or idx would be better

if out is None:
out = np.empty_like(arr, dtype=float)

mins, maxs = np.nanmin(arr, axis), np.nanmax(arr, axis)
Member:

Add keepdims=True here, and you don't need the mins, maxs = mins[shape], maxs[shape] line below

Author:

Thanks.
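
For reference, a sketch of the suggested keepdims approach (illustrative, not the PR's exact diff):

import numpy as np

arr = np.arange(12, dtype=float).reshape(3, 4)
# With keepdims=True the reduced extremes keep a size-1 axis, so they
# broadcast directly against `arr`; no extra indexing step is needed.
mins = np.nanmin(arr, axis=1, keepdims=True)   # shape (3, 1)
maxs = np.nanmax(arr, axis=1, keepdims=True)   # shape (3, 1)
out = (arr - mins) / (maxs - mins)             # each row scaled onto [0, 1]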

if axis is None and (min.ndim or max.ndim):
raise ValueError('min and max must be scalar when axis is None')
if axis is not None:
valid_shapes = ((), arr.shape[:axis] + arr.shape[axis + 1:])
Member:

Would it make more sense to enforce that the shape is arr.shape[:axis] + (1,) + arr.shape[axis + 1:]?

Author:

I was considering it, but then anyone would have to do this hard-to-reason-about reshaping themselves?

Author:

I mean, in a simple 2d case, it seems easier to obtain a 1d vector of extremes than a properly shaped matrix ...

On the other hand, keepdims looks ubiquitous enough. I think you're right. 👍

Member:

Perhaps both could be supported?

@kernc (Author) commented Oct 22, 2016:

> The name scale is a bit too generic imho.

The transformation where something is linearly changed in size is normally called scaling, and people search for it by that name. With vectors it's also called normalization, but the meaning of that term seems more domain-dependent (what is normal?) and implies scaling to unit length or to 1, whereas this function is meant for scaling to arbitrary intervals given the interval extremes. What would you call it?

I'll write to the mailing list, thanks.

@shoyer (Member) commented Oct 22, 2016:

I'm still not sure this is warranted for numpy, but for the name, I think rescale would be more appropriate.

@eric-wieser (Member) commented Oct 22, 2016:

I feel like the function that's actually wanted here is the remap provided in Arduino (its map function), which this function can be written in terms of:

def scale(data, min, max, axis):
    return arduino_map(data,
        in_min=data.min(axis, keepdims=True),
        in_max=data.max(axis, keepdims=True),
        out_min=min,
        out_max=max
    )

Most of the time you don't want to rescale to the min and max of the data - you want to use some predefined min and max that you would expect your data to always lie within. Using remap gives you the freedom to do this.

Perhaps rescale is the right name for this arduino_map. The implementation would be as follows (possibly with the addition of careful handling of overflow)

def rescale(array, in_min, in_max, out_min=0, out_max=1):
    return out_min + (array - in_min) * (out_max - out_min) / (in_max - in_min)

Or perhaps the signature should be rescale(array, in_range=(min, max), out_range=(min, max)), which could be complemented with a np.span function that returns np.min, np.max (and does so more efficiently than calling the two separately).


Of course, this whole thing is just an interpolation, so perhaps it is better expressed through a sufficiently-broadcasting interpolate function of some kind.
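
A sketch of that tuple-range spelling (hypothetical names and defaults, not a committed API):

import numpy as np

def rescale(array, in_range, out_range=(0, 1)):
    # Linearly map values from `in_range` onto `out_range`.
    (in_min, in_max), (out_min, out_max) = in_range, out_range
    array = np.asarray(array, dtype=float)
    return out_min + (array - in_min) * (out_max - out_min) / (in_max - in_min)

# E.g. map 12-bit ADC readings onto percentages, independent of the
# extremes actually present in the data.
rescale([0, 1024, 4095], in_range=(0, 4095), out_range=(0, 100))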

@homu (Contributor) commented Nov 5, 2016:

☔ The latest upstream changes (presumably #8182) made this pull request unmergeable. Please resolve the merge conflicts.

@charris (Member) commented Nov 5, 2016:

1.12.x has been branched, you should move the comments in the 1.12.0-notes to the 1.13.0-notes.

@charris added the "56 - Needs Release Note" label Nov 5, 2016
@mattharrigan (Contributor):
FWIW, I use sklearn.preprocessing.scale pretty regularly. That does a different operation, by default returning an array with zero mean and a standard deviation of 1. That seems like a different use case. I'm not sure whether this function can or should include the sklearn functionality. In any event, the name scale to me means the sklearn version.

@kernc (Author) commented Nov 10, 2016:

@mattharrigan Both methods are valid, mostly depending on the use case.

I agree with @eric-wieser's idea of additional prior-min / prior-max arguments. Thanks; I will update accordingly. Two use cases I've run into in the past week:

  1. Have a long vector of data. Because it's really long, work with samples. To visualize a sample, associate a color intensity (opacity, 0-255) with each value according to the peak-to-peak range of the original full vector.
  2. Use scipy.optimize.differential_evolution. Provide varied, non-normalized bounds for N variables. Provide a loss function that works within those bounds, and consider including some regularization of the parameters. Because not all parameters are from the same interval space (i.e. some are bounded by [0, 1], others by [-1e5, 1e5], ...), I want to scale the parameters according to the input bounds so that all parameters exert the same regularization influence. I could pre-scale the bounds, of course, but that might also make the loss function less intuitive.

For posterity, a simple way to achieve this:

rescaled = np.interp(a, (a.min(), a.max()), (0, 1))
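
Note that np.interp covers only the flattened 1-D case; per-axis scaling can still be had by broadcasting keepdims extremes, e.g. (a sketch):

import numpy as np

a = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
lo = a.min(axis=0, keepdims=True)   # per-column minima, shape (1, 2)
hi = a.max(axis=0, keepdims=True)   # per-column maxima, shape (1, 2)
rescaled = (a - lo) / (hi - lo)     # each column mapped onto [0, 1]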

@shoyer, rescale it is. How strongly do you feel against this in numpy and why?

@shoyer (Member) commented Nov 10, 2016:

> How strongly do you feel against this in numpy and why?

I'm not strongly opposed, just leaning against it. Consider this a -0 vote. (Note that you shouldn't take my opinion as a definitive pronouncement here, but you will need to convince at least a plurality of maintainers that this is worth adding.)

My reasons:

  1. It's not unambiguous what scale or even rescale means. The existence of sklearn.preprocessing.scale doing something totally different is actually a strong point against including this in NumPy. NumPy already has too many utility functions with domain-specific logic that we are stuck maintaining in perpetuity (for a recent example, see atleast_3d).
  2. This is a small utility function, which is quite easy to write (and rewrite) in user or domain-specific packages (e.g., sklearn.preprocessing.scale). So the incremental value of including it in NumPy is minimal.

@charris (Member) commented Nov 10, 2016:

I feel pretty much the same as @shoyer about this. Long term, my concern is that numpy may slowly become bloated with small functions. That said, the numpy polynomials do this sort of scaling but they also need to store the scaling parameters. That points up the problem of providing a single function for this simple operation when the use cases may differ in small details from project to project.

@eric-wieser (Member):

> the numpy polynomials do this sort of scaling

Could you elaborate with an example?

@charris (Member) commented Nov 10, 2016:

@eric-wieser The various polynomial classes scale the domain to the window. For instance, the default window for Chebyshev polynomials is [-1, 1], while the domain can be any interval. This sort of scaling is essential for numerical stability, especially when fitting polynomials to data.

I suppose that instead of storing the scaling parameters in the polynomial classes one could store a scaling function generated by a scaling function factory.
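
For reference, the polynomial package already exposes this linear mapping as numpy.polynomial.polyutils.mapdomain, for instance:

import numpy as np
from numpy.polynomial import polyutils

x = np.linspace(0, 10, 5)
# Map data from its domain [0, 10] onto the default Chebyshev window [-1, 1].
polyutils.mapdomain(x, (0, 10), (-1, 1))   # array([-1. , -0.5,  0. ,  0.5,  1. ])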

@eric-wieser (Member):

Oh, I see, by "do this sort of scaling", you mean "use it internally", not "already implement the feature of this PR"

@charris (Member) commented Nov 10, 2016:

Yep. Scaling and shifting of some sort is quite common, but I think the questions here are:

  • Is scaling complicated enough to justify a function in NumPy?
  • Can we make a function of sufficient generality to cover most use cases?

Note that scaling might not only be scaling to intervals, but scaling by standard deviation or some other norm.

@eric-wieser (Member) commented Nov 11, 2016:

Pushing the rescale(array, in_min, in_max, out_min=0, out_max=1) I suggested earlier: rescaling to zero mean and unit deviation is expressed as:

np.rescale(arr, in_min=mu, in_max=mu+std)

Although the raw form of (arr - mu)/std is way clearer here. It's really only the four-argument form that I think provides value over the raw expression.

@mherkazandjian:

I have my own implementation of this function and I use it often. It would be nice to have it in numpy.
But to avoid having lots of small functions in numpy, I think it would be nice to have a flexible design for this "rescale" functionality: a design that would allow providing a custom scaling function, i.e. the scaling could be uniform, normal, etc.

btw, there are similar implementations of this:
http://www.harrisgeospatial.com/docs/BYTSCL.html
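
One possible shape for such a flexible design (purely a sketch; the method names here are made up):

import numpy as np

def rescale(arr, method="minmax", axis=None):
    # Sketch of a pluggable rescaler; `method` selects the scaling rule.
    arr = np.asarray(arr, dtype=float)
    if method == "minmax":    # map extremes onto [0, 1]
        lo = np.nanmin(arr, axis=axis, keepdims=True)
        hi = np.nanmax(arr, axis=axis, keepdims=True)
        return (arr - lo) / (hi - lo)
    if method == "zscore":    # zero mean, unit standard deviation
        mu = np.nanmean(arr, axis=axis, keepdims=True)
        return (arr - mu) / np.nanstd(arr, axis=axis, keepdims=True)
    if callable(method):      # user-supplied scaling rule
        return method(arr, axis)
    raise ValueError("unknown method: %r" % (method,))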

.format(valid_shapes[-1]))

if out is None:
out = np.empty_like(arr, dtype=float)
Contributor:

You could move the test for out is None to the return, so that you don't create an unnecessary out array in the most likely case, when out is None.

temp = ... # Compute
if out is not None:
    out[...] = temp # Copy to out
    return out
return temp

Member:

@bashtage: I don't think out makes any sense as an argument at all unless it offers some performance improvement over out[...] = func, which this does not.

Author:

Thanks. 👍

@kernc force-pushed the scale branch 5 times, most recently from 289c2d1 to ff44639 on December 10, 2017
@WarrenWeckesser (Member):

This pull request has had the "needs decision" tag since November 2017. My impression is that the function is at the edge of what is considered worthwhile adding to NumPy. So this comment is really a ping to the NumPy devs: is it desirable to add such a linear rescaling function to NumPy? The lack of a decision is effectively a "no", but maybe a few devs taking a fresh look will think otherwise.

@kernc, if the answer turns out to be "yes", are you still interested in working on this?

If the answer to both those questions is yes, there is still some work to do, starting with a rebase to fix the merge conflict.

The function should do a better job of handling a constant input. Currently it generates several warnings and returns all nans:

In [2]: np.rescale([10, 10, 10])                                                                                                   
/Users/warren/mc37numpy/lib/python3.7/site-packages/numpy-1.15.0.dev0+967809c-py3.7-macosx-10.7-x86_64.egg/numpy/lib/function_base.py:4479: RuntimeWarning: divide by zero encountered in double_scalars
  offset = (in_max * out_min - in_min * out_max) / oldlen
/Users/warren/mc37numpy/lib/python3.7/site-packages/numpy-1.15.0.dev0+967809c-py3.7-macosx-10.7-x86_64.egg/numpy/lib/function_base.py:4480: RuntimeWarning: divide by zero encountered in true_divide
  scale = newlen / oldlen
/Users/warren/mc37numpy/lib/python3.7/site-packages/numpy-1.15.0.dev0+967809c-py3.7-macosx-10.7-x86_64.egg/numpy/lib/function_base.py:4482: RuntimeWarning: invalid value encountered in add
  res = arr * scale + offset
Out[2]: array([nan, nan, nan])
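
One way to guard the degenerate constant-input case (a sketch; mapping constants to the midpoint of the output range is an arbitrary but common choice):

import numpy as np

def rescale_safe(arr, out_min=0.0, out_max=1.0):
    # Sketch only: like the PR's rescale, but a constant input maps to
    # the midpoint of the output range instead of producing NaNs.
    arr = np.asarray(arr, dtype=float)
    lo, hi = np.nanmin(arr), np.nanmax(arr)
    if hi == lo:
        return np.full_like(arr, (out_min + out_max) / 2.0)
    return (arr - lo) / (hi - lo) * (out_max - out_min) + out_min

rescale_safe([10, 10, 10])   # array([0.5, 0.5, 0.5])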

@rgommers (Member):

Both @shoyer and @charris were leaning towards no, or at least "meh". I asked for proposing on the mailing list, which IIRC wasn't done. I agree with the reservations of @charris and @shoyer. So I propose to close this.

@seberg closed this Aug 18, 2019