Need a condensed representation for storing random forest classifiers #6276


Closed
dgehlhaar opened this issue Feb 3, 2016 · 15 comments · Fixed by #8437
Labels
Documentation, Easy (Well-defined and straightforward way to resolve)
Milestone
0.19

Comments

@dgehlhaar

scikit-learn version: 0.16.1
python: 3.4.4 (anaconda 2.3.0)
machine: Linux RH 6, more memory than you can shake a stick at

There should be an option to store random forest classifiers, for use in prediction, in a condensed representation, e.g. with each decision point holding the descriptor, operator, and cutoff value, and each leaf holding the classification. I trained a 300-tree model using default parameters, with about 100K data points (about 20 descriptors each). This trained fine, but the resulting model, output with compression by joblib, is 182 MB (!). I then tried training another model with ten times the data points (same number of trees). It finished building, but joblib choked with the dreaded "OverflowError: Size does not fit in an unsigned int" error (supposedly this was fixed in Python 3?).

By the looks of it, scikit-learn is storing the data used to build the model as part of the model, with no option to disable this. Is there any way around it? I want to build a model with 4M data points...
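
As an illustration of the kind of condensed representation being asked for, the sketch below (not existing scikit-learn functionality; the synthetic data and sizes are assumptions) pulls out only the per-node arrays each fitted tree needs for prediction: split feature, threshold, children, and leaf values.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

condensed = []
for est in clf.estimators_:
    t = est.tree_
    condensed.append({
        "feature": t.feature.copy(),              # split feature per node (-2 marks a leaf)
        "threshold": t.threshold.copy(),          # decision rule is x[feature] <= threshold
        "children_left": t.children_left.copy(),  # node index of the left child (-1 for leaves)
        "children_right": t.children_right.copy(),
        "value": t.value.copy(),                  # per-class counts, used for the prediction at leaves
    })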

@dgehlhaar
Author

I have tried all kinds of variants, using gzip and pickle, e.g.:

import gzip
import pickle

def saveModel(self, fileName):
    assert self.classifier is not None
    try:
        with gzip.open(fileName, "wb", 3) as gz:
            pickle.dump(self.classifier, gz, protocol=pickle.HIGHEST_PROTOCOL)
        return True
    except IOError as err:
        print("ERROR: Dump of classifier model failed: {}".format(err))
        return False

This gives the same error... Aargh...

@jnothman
Member
jnothman commented Feb 4, 2016

I really don't think the training data is being stored with the model. But you say you build with default parameters. Have you considered that the default parameters -- with min_samples_split=2, min_samples_leaf=1, max_leaf_nodes=None, max_depth=None -- might tend towards large, overfit models if the data are not easily separable? Could you try joblib.dumping after limiting the depth?
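
A minimal sketch of that suggestion, using synthetic data (the sample count, depths, and file names here are assumptions, not values from the thread): train with and without a depth limit and compare the size of the joblib dump.

import os
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=0)

for max_depth in (None, 10):
    clf = RandomForestClassifier(n_estimators=50, max_depth=max_depth,
                                 n_jobs=-1, random_state=0).fit(X, y)
    path = "forest_depth_{}.joblib".format(max_depth)
    joblib.dump(clf, path, compress=3)
    # With no depth limit the dump is typically much larger.
    print(max_depth, round(os.path.getsize(path) / 1e6, 1), "MB")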


@glouppe
Contributor
glouppe commented Feb 4, 2016

Hi,

With default parameters, the size of the forest will be O(M * N * log(N)), where M is the number of trees and N is the number of samples. So yes, it is expected for the model to get large if you build it on an even larger dataset. As Joel points out, you can usually reduce the size of the model by setting min_samples_split, max_leaf_nodes or max_depth.

Gilles
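
To see that growth directly, one can count the nodes actually stored, since the serialized size is dominated by the per-node arrays. A rough sketch with synthetic data (the sample sizes are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

for n_samples in (10000, 100000):
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
    clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)
    total_nodes = sum(est.tree_.node_count for est in clf.estimators_)
    print(n_samples, total_nodes)  # node count (and hence model size) grows with n_samples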


@dgehlhaar
Author

Thanks everyone, I will play with parameters and see what comes out. I tried limiting the depth a bit previously but my error rate went way up. I will look again.

@dgehlhaar
Author

All, I toyed with parameters, and came up with the following, which gives good performance and "reasonable" file size: 300 trees; min 5 points / leaf; max 22 levels; max 1000 leaves.

Thanks!
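
For reference, those settings presumably map onto the estimator parameters roughly as follows (the mapping, in particular "levels" to max_depth, is an assumption):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=300,      # 300 trees
    min_samples_leaf=5,    # min 5 points per leaf
    max_depth=22,          # max 22 levels
    max_leaf_nodes=1000,   # max 1000 leaves per tree
)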

@dgehlhaar
Author

This issue may be closed.

@GaelVaroquaux
Member
GaelVaroquaux commented Feb 17, 2016 via email

@dgehlhaar
Author

GaelVaroquaux, I agree, especially the expected growth in the tree size with default parameters. I don't have enough experience to know if any of my results can be generalized. But it is likely that most people will just try with default parameters. Thanks!

@amueller added the Easy (Well-defined and straightforward way to resolve), Documentation, and Need Contributor labels on Oct 8, 2016
@amueller
Member
amueller commented Oct 8, 2016

This is indeed a very frequent question.

@amueller added this to the 0.19 milestone on Oct 8, 2016
@Morikko
Contributor
Morikko commented Feb 19, 2017

Is it still up to date? I can take it, if it is.

@jnothman
Member
jnothman commented Feb 19, 2017 via email

@jnothman
Member

Absolutely, removing the model will reduce model size. But your prediction step does not work:

AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_'

@bsaunders23

Yeah, never mind, I'm being stupid.

@SamirArcadia

Using Python 3.6.5, gzipping the pickle appears to compress the model to about 20-25% of its original size.

  1. Example of writing gzip from an already pickled file (filename):

import gzip
import pickle

with open(filename, 'rb') as f:  # filename points to already pickled data
    with gzip.open(filename + '.gz', 'wb') as g:
        pickle.dump(pickle.load(f), g)

  2. Example of writing gzip directly from data (D):

with gzip.open('data.pickle.gz', 'wb') as g:
    pickle.dump(D, g)  # D is the data to pickle

  3. Example of loading from gzip:

with gzip.open('data.pickle.gz', 'rb') as g:
    D = pickle.load(g)

@rth
Member
rth commented Apr 21, 2020

Also see the documentation of the compress argument in joblib.dump. xz should allow higher compression ratios than gzip.
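
For example, something along these lines (clf stands for an already fitted classifier; the file name and compression level are arbitrary choices):

import joblib

# compress takes an int (zlib level) or a (method, level) tuple;
# ("xz", 3) selects the LZMA-based xz compressor at level 3.
joblib.dump(clf, "forest.joblib", compress=("xz", 3))

# joblib.load decompresses transparently.
clf = joblib.load("forest.joblib")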
