FIX: Do not rely on strides for contiguous arrays #1458

seberg · 2012-12-10T15:23:55Z

When an array is contiguous in memory but has `shape[dim] == 1`, then
its `strides[dim]` is not really used, so it can be considered arbitrary
even if the array is contiguous. The same is generally true for 0-sized arrays.

If there is no chance of 0-sized (or 1-sized and 1-dimensional) arrays, then
this code is perfectly correct in current numpy. But it's a dangerous
design choice and it would be nice to be able to change numpy at some point.
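
To make the point about unused strides concrete, here is a minimal pure-Python sketch (it does not use numpy and is not numpy's implementation). The helper `byte_offsets` is a hypothetical name introduced only for illustration, and the stride values are made up; it computes the byte offset of every element from a shape and a strides tuple.

```python
from itertools import product


def byte_offsets(shape, strides):
    """Return the byte offset of every element, in index order."""
    return [
        sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in shape))
    ]


shape = (1, 4)  # length-1 leading dimension, 8-byte items

# The usual C-order strides for this shape ...
canonical = byte_offsets(shape, (32, 8))
# ... and a made-up stride for the length-1 dimension.
arbitrary = byte_offsets(shape, (12345, 8))

# The index along a length-1 dimension is always 0, so its stride
# never contributes to any offset.
assert canonical == arbitrary

# A 0-sized array has no elements at all, so no stride is ever used.
assert byte_offsets((0, 4), (999, 8)) == []
```

Both stride tuples address exactly the same memory, which is why code should not read meaning into the stride of such a dimension.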

This fixes gh-1406 (without a numpy change, which I am sure will come soon, but there is no reason to change it here anyway). Using .itemsize is much clearer anyway, so the only reason to not do it is to do some premature optimization because Cython doesn't compile it away.
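
As a rough illustration of why `.itemsize` is the safer source for the element size, consider the sketch below. It is not the actual patch: the two helper functions are hypothetical, the arrays are represented only by their strides tuples, and the stride value `0` for the length-1 dimension is an invented example of an "arbitrary" stride.

```python
def column_step_via_strides(strides):
    """Elements per column step, treating strides[0] as the element size.

    Only valid when the first dimension has more than one element.
    """
    return strides[1] // strides[0]


def column_step_via_itemsize(strides, itemsize):
    """Elements per column step, using the item size directly."""
    return strides[1] // itemsize


itemsize = 8

# Fortran-ordered (3, 2) array of 8-byte items: strides are (8, 24).
normal = (8, 24)
assert column_step_via_strides(normal) == 3
assert column_step_via_itemsize(normal, itemsize) == 3

# Fortran-ordered (1, 2) array: strides[0] is unused. Suppose (hypothetically)
# that it is reported as 0.
degenerate = (0, 8)
assert column_step_via_itemsize(degenerate, itemsize) == 1

try:
    column_step_via_strides(degenerate)
    failed = False
except ZeroDivisionError:
    failed = True  # the stride-based version breaks on the unused stride
```

The itemsize-based version gives the same answer in the normal case and keeps working when the unused stride holds an unexpected value.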

(Recreated the c files with cython 0.17.1)

Btw. all over this file <int> casts seem to be used for memory pointers. Is this really correct? Since it should be npy_intp from a numpy perspective.

amueller · 2012-12-10T15:35:20Z

There shouldn't be any casts, only memory views ;)

I didn't really look at the code but if we want to improve the pointer-fu, trying to use memory-views would be my first approach at refactoring.

amueller · 2012-12-10T15:36:26Z

Thanks for the fix :)
I'm sure @pprett, @larsmans and @glouppe are interested ;)

larsmans · 2012-12-10T16:06:02Z

Btw. all over this file `<int>` casts seem to be used for memory pointers.

I don't see any of this...

seberg · 2012-12-10T16:10:12Z

@larsmans `cdef int X_stride = <int> X.strides[1] / <int> X.itemsize` is obviously used to address memory later. And if this is a 32-bit integer and not an npy_intp equivalent, you will get overflows for enormous arrays.

It actually seems likely that memoryviews only use the buffer interface, and I think for those numpy should probably allow you to rely on strides. But Cython currently uses something in between buffers and arrays for np.ndarray, which means that numpy could not redefine the contiguous flags. Either way, that doesn't change that it's clearer what the code means when written like this...
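
The overflow can be illustrated without numpy. The sketch below emulates a C `<int>` cast with `ctypes.c_int32`, which assumes that `int` is 32 bits wide on the platform in question; `as_c_int` is a hypothetical helper, and the array dimensions are invented for the example.

```python
import ctypes


def as_c_int(value):
    """Emulate a C <int> cast by wrapping to 32 bits (assumes 32-bit int)."""
    return ctypes.c_int32(value).value


n_rows = 300_000_000  # rows of a Fortran-ordered array of 8-byte items
itemsize = 8

# Column stride in bytes: 2_400_000_000, which exceeds 2**31 - 1.
stride_bytes = n_rows * itemsize

# The cast silently wraps around to a negative number ...
wrapped = as_c_int(stride_bytes)
# ... so the element step derived from it is wrong as well.
step = wrapped // as_c_int(itemsize)

# A 64-bit signed integer (a stand-in here for npy_intp on a 64-bit
# platform) holds the value without loss.
wide = ctypes.c_int64(stride_bytes).value
```

Here `wrapped` and `step` come out negative while `wide` still equals `stride_bytes`, so any pointer arithmetic based on the 32-bit value would address the wrong memory.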

GaelVaroquaux · 2012-12-10T16:29:11Z

@seberg: thanks a lot for pitching in. I haven't had time to look at this
code myself, but it's good to have a numpy expert looking at the code.

larsmans · 2012-12-10T16:33:44Z

Ah, that's what you mean. Yes, a lot of our code uses int to index into arrays. I've never been happy with that, but it's a choice that was made at some point. The tree code in particular was written for maximum speed, even at the expense of safety. (Can't say I agree with that design choice.)

pprett · 2012-12-10T16:43:45Z

@larsmans I don't think this was intentional - just a lack of expertise (at least on my side) - if there is a safety risk in our current codebase we should fix that.

You propose that we use npy_intp as the data type to index into numpy arrays (i.e. the contiguous memory segment)?

seberg · 2012-12-10T16:55:54Z

@pprett for everything to do with indexing. For normal indexing, strides, and shape/size too, I believe you need to use npy_intp (in Cython I guess np.intp_t). Alternatively, if you want to keep int, add a check here and inform the user that the specific function is not designed for such humongous arrays when int is not large enough. I did not try it, but I expect you can easily segfault the tree module as is, even if it may take a very large array (the datatype itemsize factors in).
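
The "add a check" alternative could look roughly like the following sketch. It is only an assumption about how such a guard might be written: `check_fits_c_int` and `C_INT_MAX` are hypothetical names, and the limit assumes a 32-bit C `int`.

```python
C_INT_MAX = 2**31 - 1  # assumption: C int is 32 bits on the target platform


def check_fits_c_int(shape, itemsize):
    """Raise ValueError if the array's total size in bytes exceeds C_INT_MAX.

    This is a conservative bound: if the total byte size fits, every byte
    offset into a contiguous array of this shape fits as well.
    """
    n_bytes = itemsize
    for n in shape:
        n_bytes *= n
    if n_bytes > C_INT_MAX:
        raise ValueError(
            "array of %d bytes is too large for int-based indexing "
            "(limit is %d bytes)" % (n_bytes, C_INT_MAX)
        )
    return n_bytes


# A small array passes the check and reports its size in bytes.
small = check_fits_c_int((1000, 10), 8)

# A very large array is rejected with an informative error instead of
# overflowing later.
try:
    check_fits_c_int((300_000_000, 1), 8)
    rejected = False
except ValueError:
    rejected = True
```

Note that the item size is part of the product, so an array with fewer than 2**31 elements can still be rejected.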

glouppe · 2012-12-11T08:49:59Z

Thanks for the hints! We should indeed use npy_intp where it is needed.

Regarding this PR, this looks good to me. +1

pprett · 2012-12-11T08:58:59Z

@seberg thanks

I'd propose we postpone the np.intp_t refactoring until #522 is merged (lots of changes to _tree.pyx).

GaelVaroquaux · 2012-12-11T13:31:11Z

I'd propose we postpone the np.intp_t refactoring until #522 is merged (lots of
changes to _tree.pyx).

We should create a ticket with the 'EasyFix' label.

larsmans · 2012-12-11T13:42:18Z

Isn't there a release coming up? If so, we should (IMHO) merge this, then do the release, then merge in new features.

pprett · 2012-12-11T13:57:44Z

I wasn't clear: +1 on merging this; -1 for doing the refactoring before
merging #522; IMHO #522 should be merged before the next release

amueller · 2012-12-11T14:10:57Z

On 11.12.2012 14:42, Lars Buitinck wrote:

Isn't there a release coming up? If so, we should (IMHO) merge this,
then do the release, then merge in new features.

Yes, there is. Not sure when, though ;) Was planned for before Christmas.

GaelVaroquaux · 2012-12-11T14:21:38Z

Yes, there is. Not sure when, though ;) Was planned for before Christmas.

I won't really be able to help before the beginning of January. Sorry.

glouppe · 2012-12-12T07:53:32Z

I am merging this, since both @pprett and I are fine with it. I'll edit the what's new myself and open an issue regarding the use of np.intp_t.

FIX: Do not rely on strides for contiguous arrays

glouppe added a commit that referenced this pull request Dec 12, 2012

Merge pull request #1458 from seberg/contig_strides

67fccac

FIX: Do not rely on strides for contiguous arrays

glouppe merged commit 67fccac into scikit-learn:master Dec 12, 2012

This was referenced Dec 12, 2012

Floating point exception in GradientBoostingClassifier during nosetests on macosx 10.8.2 64bit (and others?) #1406

Closed

_tree.pyx: Use np.intp_t instead of int #1466

Closed

glouppe mentioned this pull request Mar 18, 2014

Trees incompatible between 32bit and 64bit version #2972

Closed
