8000 Trees incompatible between 32bit and 64bit version · Issue #2972 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Trees incompatible between 32bit and 64bit version #2972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amueller opened this issue Mar 17, 2014 · 18 comments
Closed

Trees incompatible between 32bit and 64bit version #2972

amueller opened this issue Mar 17, 2014 · 18 comments

Comments

@amueller
Copy link
Member

I'm not sure if this a known issue but I think trees build on 64bit can not be unpickled on 32bit.
How hard would it be to allow that? I think the problem is that Size_t is different between the two platforms.

@pprett
Copy link
Member
pprett commented Mar 17, 2014

that's an ugly one... it would be much nicer if we would support this

@amueller
Copy link
Member Author

I'll try to look into it today...

@jnothman
Copy link
Member

On a 64bit OSX system, replacing the SIZE_ts in struct Node with INT32_t (and correspondingly in NODE_DTYPE) compiles, passes tests, and has no apparent effect on the covertype benchmark.

@jnothman jnothman reopened this Mar 18, 2014
@glouppe
Copy link
Contributor
glouppe commented Mar 18, 2014

We already had this discussion... We switched from int to np.npy_intp (for which SIZE_t is an alias) because it is the recommended type for indices.

See #1458

@jnothman
Copy link
Member

And it looks like the remaining data pickled as SIZE_t arrays are just n_classes and n_outputs.

@jnothman
Copy link
Member

We already had this discussion...

I think that discussion just says we can't allow the tree to grow larger than an int32 can represent...

@glouppe
Copy link
Contributor
glouppe commented Mar 18, 2014

CC @larsmans: Would love to have your opinion on this

@amueller
Copy link
Member Author

We could either use int32 or int64 on all platforms, right? Or is there a problem with using int64 indices on 32bit platforms?

@larsmans
Copy link
Member

The problem with int32 is that it just doesn't scale up. If we want to support sparse inputs, then switching to 32 bits when SciPy is just switching to 64 is a bad idea (see #2969).

The problem with int64 is that it's slower on 32-bit hardware, but I haven't tested to see how slow (even incrementing a uint64_t takes multiple instructions and ties up 2 out of 4 registers on x86).

The clean solution is to fix the serialization. The second best, IMHO, is to use np.uint64_t for all indices from now on. That's both forward- and backward-compatible, except with models trained on 32-bit boxen.

@amueller
Copy link
Member Author

How would you fix the serialization? When loading the class, convert to the native format?

8000

@larsmans
Copy link
Member

Yes. And always store as 64-bit, then check for too large indices.

@amueller
Copy link
Member Author

Why do you need to always store in 64bit if you convert anyhow?

@amueller
Copy link
Member Author

So basically I need to overwrite __setstate__ and __getstate__.
I'll give it a shot but as this would be an Amazon contribution it will take me some time to do a PR.

@larsmans
Copy link
Member

To get a consistent file format. It's not strictly necessary but it simplifies things.

@amueller
Copy link
Member Author
amueller commented Dec 8, 2017

Don't think it's worth it.

@amueller amueller closed this as completed Dec 8, 2017
@Schmetzler
Copy link

It would still be nice to support this. I am calculating data on a 64-bit machine, afterwards I calculate a tree on that data and use feature_importance_ to remove unnecessary features. In the next step I calculate another tree on the stripped down data. This tree should be used on a 32-bit microcontroller. The problem I can't load the tree into the 32-bit system.

So I tried to recalculate the tree on 32-bit, but I cannot load my data in because it's over 4GB. So I'm out of luck here.

What should I do then?

@jnothman
Copy link
Member
jnothman commented Jun 27, 2018 via email

@Schmetzler
Copy link

I found a solution... so I need save my stripped down features for calculation in a model put it on my device and recalculate the tree on the device itself (so I also avoid version incompatibilities).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
0