8000 FIX: Do not rely on strides for contiguous arrays by seberg · Pull Request #1458 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

FIX: Do not rely on strides for contiguous arrays #1458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 12, 2012

Conversation

seberg
Copy link
Contributor
@seberg seberg commented Dec 10, 2012

When an array is contiguous in memory but has shape[dim] == 1, then
its strides[dim] is not really used, so it can be considered arbitrary
even if the array is contiguous. This is generally true for 0-sized arrays.

If there is no chance of 0-sized (or 1-sized and 1-dimensioal) arrays, then
this code is perfectly correct in current numpy. But its a dangerous
design choice and it would be nice to be able to change numpy at some point.


This fixes gh-1406 (without a numpy change, which I am sure will come soon, but there is no reason to change it here anyway). Using .itemsize is much clearer anyway, so the only reason to not do it is to do some premature optimization because Cython doesn't compile it away.

(Recreated the c files with cython 0.17.1)

Btw. all over this file <int> casts seem to be used for memory pointers. Is this really correct? Since it should be npy_intp from a numpy perspective.

When an array is contiguous in memory but has `shape[dim] == 1`, then
its `strides[dim]` is not really used, so it can be considered arbitrary
even if the array is contiguous. This is generally true for 0-sized arrays.

If there is no chance of 0-sized (or 1-sized and 1-dimensioal) arrays, then
this code is perfectly correct in current numpy. But its a dangerous
design choice and it would be nice to be able to change numpy at some point.
@amueller
Copy link
Member

There shouldn't be any casts, only memory views ;)

  • I didn't really look at the code but if we want to improve the pointer-fu, trying to use memory-views would be my first approach at refactoring.

@amueller
Copy link
Member

Thanks for the fix :)
I'm sure @pprett, @larsmans and @glouppe are interested ;)

@larsmans
Copy link
Member

Btw. all over this file casts seem to be used for memory pointers.

I don't see any of this...

@seberg
Copy link
Contributor Author
seberg commented Dec 10, 2012

@larsman cdef int X_stride = <int> X.strides[1] / <int> X.itemsize is obviously used to address memory later. And if this is a 32-bit integer and not a npy_intp equivalent, you will get overflows for enormous arrays.

It actually seems likely that memoryviews only use the buffer interface, and I think for those numpy should probably allow you to rely on strides, but cython currently uses something in between buffers and arrays for np.ndarray which means that numpy could not redefine the contiguous flags. But that doesn't change that its clearer what it means like this in any case...

@GaelVaroquaux
Copy link
Member

@seberg: thanks a lot for pitching in. I haven't had time to look at this
code myself, but it's good to have a numpy expert looking at the code.

@larsmans
Copy link
Member

Ah, that's what you mean. Yes, a lot of our code uses int to index into arrays. I've never been happy with that, but it's a choice that was made at some point. The tree code in particular was written for maximum speed, even at the expense of safety. (Can't say I agree with that design choice.)

@pprett
Copy link
Member
pprett commented Dec 10, 2012

@larsmans I don't think this was intentional - just a lack of expertise (at least on my side) - if there is a safety risk in our current codebase we should fix that.

you propose that we use npy_intp as the data type to index into numpy arrays (i.e. the contiguous memory segment) ?

@seberg
Copy link
Contributor Author
seberg commented Dec 10, 2012

@pprett for everything to do with indexing. Also normal indexing, strides, shape/size too, I believe you need to use npy_intp (in Cython I guess np.intp_t). Or you add a check here if you want to use int and inform the user that the specific function is not designed for such humongous arrays if int is not large enough, I guess, but though I did not try it I expect you can easily segfault the tree module as is, even if it may have to be a very large array (because of datatype itemsize factoring in).

@glouppe
Copy link
Contributor
glouppe commented Dec 11, 2012

Thanks for the hints! We should indeed use npy_intp where it is needed.

Regarding this PR, this looks good to me. +1

@pprett
Copy link
Member
pprett commented Dec 11, 2012

@seberg thanks

I'd propose we postpone the np.intp_t refactoring until #522 is merged (lots of changes to _tree.pyx) .

@GaelVaroquaux
Copy link
Member

I'd propose we postpone the np.intp_t refactoring until #522 is merged (lots of
changes to _tree.pyx) .

We should create a ticket with the 'EasyFix' label.

@larsmans
Copy link
Member

Isn't there a release coming up? If so, we should (IMHO) merge this, then do the release, then merge in new features.

8000
@pprett
Copy link
Member
pprett commented Dec 11, 2012

I wasn't clear: +1 on merging this; -1 for doing the refactoring before
merging #522; IMHO #522 should be merged before the next release

2012/12/11 Lars Buitinck notifications@github.com

Isn't there a release coming up? If so, we should (IMHO) merge this, then
do the release, then merge in new features.


Reply to this email directly or view it on GitHubhttps://github.com//pull/1458#issuecomment-11243526.

Peter Prettenhofer

@amueller
Copy link
Member

Am 11.12.2012 14:42, schrieb Lars Buitinck:

Isn't there a release coming up? If so, we should (IMHO) merge this,
then do the release, then merge in new features.

Yes, there is. Not sure when, though ;) Was planned for before Christmas.

@GaelVaroquaux
Copy link
Member

Yes, there is. Not sure when, though ;) Was planned for before Christmas.

I won't be really able to help before beginning of January. Sorry.

@glouppe
Copy link
Contributor
glouppe commented Dec 12, 2012

I am merging this, since both @pprett and I are fine with it. I'll edit the what's new myself and open an issue regarding the use of np.intp_t.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Floating point exception in GradientBoostingClassifier during nosetests on macosx 10.8.2 64bit (and others?)
6 participants
0