Need a condensed representation for storing random forest classifiers #6276


Closed
dgehlhaar opened this issue Feb 3, 2016 · 15 comments · Fixed by #8437
Labels
Documentation, Easy (Well-defined and straightforward way to resolve)
Milestone
0.19

Comments

@dgehlhaar

scikit-learn version: 0.16.1
python: 3.4.4 (anaconda 2.3.0)
machine: Linux RH 6, more memory than you can shake a stick at

There should be an option to store random forest classifiers, for use in prediction, in a condensed representation, e.g. with each decision point holding the descriptor, operator, and cutoff value, and each leaf holding the classification. I trained a 300-tree model using default parameters, with about 100K data points (about 20 descriptors each). This trained fine, but the resulting model, output with compression by joblib, is 182 MB (!). I then tried training another model with ten times the data points (same number of trees). It finished building, but joblib choked with the dreaded "OverflowError: Size does not fit in an unsigned int" error (supposedly this was fixed in Python 3?).

By the looks of it, scikit-learn is storing the data used to build the model as part of the model, with no option to disable this. Is there any way around it? I want to build a model with 4M data points...
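
As an illustration of the kind of condensed representation being asked for, the sketch below (not existing scikit-learn functionality; the synthetic data and sizes are assumptions) pulls out only the per-node arrays each fitted tree needs for prediction: split feature, threshold, children, and leaf values.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

condensed = []
for est in clf.estimators_:
    t = est.tree_
    condensed.append({
        "feature": t.feature.copy(),              # split feature per node (-2 marks a leaf)
        "threshold": t.threshold.copy(),          # decision rule is x[feature] <= threshold
        "children_left": t.children_left.copy(),  # node index of the left child (-1 for leaves)
        "children_right": t.children_right.copy(),
        "value": t.value.copy(),                  # per-class counts, used for the prediction at leaves
    })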

@dgehlhaar
Author

I have tried all kinds of variants, using gzip and pickle, e.g.:

import gzip
import pickle

def saveModel(self, fileName):
    assert self.classifier is not None
    try:
        with gzip.open(fileName, "wb", 3) as gz:
            pickle.dump(self.classifier, gz, protocol=pickle.HIGHEST_PROTOCOL)
        return True
    except IOError as err:
        print("ERROR: Dump of classifier model failed: {}".format(err))
        return False

This gives the same error... Aargh...

@jnothman
Member
jnothman commented Feb 4, 2016

I really don't think the training data is being stored with the model. But you say you build with default parameters. Have you considered that the default parameters -- with min_samples_split=2, min_samples_leaf=1, max_leaf_nodes=None, max_depth=None -- might tend towards large, overfit models if the data are not easily separable? Could you try joblib.dumping after limiting the depth?
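
A minimal sketch of that suggestion, using synthetic data (the sample count, depths, and file names here are assumptions, not values from the thread): train with and without a depth limit and compare the size of the joblib dump.

import os
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=0)

for max_depth in (None, 10):
    clf = RandomForestClassifier(n_estimators=50, max_depth=max_depth,
                                 n_jobs=-1, random_state=0).fit(X, y)
    path = "forest_depth_{}.joblib".format(max_depth)
    joblib.dump(clf, path, compress=3)
    # With no depth limit the dump is typically much larger.
    print(max_depth, round(os.path.getsize(path) / 1e6, 1), "MB")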


@glouppe
Contributor
glouppe commented Feb 4, 2016

Hi,

With default parameters, the size of the forest will be O(M * N * log(N)), where M is the number of trees and N is the number of samples. So yes, it is expected for the model to get large if you build it on an even larger dataset. As Joel points out, you can usually reduce the size of the model by setting min_samples_split, max_leaf_nodes or max_depth.

Gilles
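
To see that growth directly, one can count the nodes actually stored, since the serialized size is dominated by the per-node arrays. A rough sketch with synthetic data (the sample sizes are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

for n_samples in (10000, 100000):
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
    clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)
    total_nodes = sum(est.tree_.node_count for est in clf.estimators_)
    print(n_samples, total_nodes)  # node count (and hence model size) grows with n_samples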


@dgehlhaar
Author

Thanks everyone, I will play with parameters and see what comes out. I tried limiting the depth a bit previously but my error rate went way up. I will look again.

@dgehlhaar
Author

All, I toyed with parameters, and came up with the following, which gives good performance and "reasonable" file size: 300 trees; min 5 points / leaf; max 22 levels; max 1000 leaves.

Thanks!
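
For reference, those settings presumably map onto the estimator parameters roughly as follows (the mapping, in particular "levels" to max_depth, is an assumption):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=300,      # 300 trees
    min_samples_leaf=5,    # min 5 points per leaf
    max_depth=22,          # max 22 levels
    max_leaf_nodes=1000,   # max 1000 leaves per tree
)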

@dgehlhaar
Author

This issue may be closed.

@GaelVaroquaux
Member
GaelVaroquaux commented Feb 17, 2016 via email

@dgehlhaar
Author

GaelVaroquaux, I agree, especially the expected growth in the tree size with default parameters. I don't have enough experience to know if any of my results can be generalized. But it is likely that most people will just try with default parameters. Thanks!

@amueller added the Easy (Well-defined and straightforward way to resolve), Documentation, and Need Contributor labels on Oct 8, 2016
@amueller
Member
amueller commented Oct 8, 2016

This is indeed a very frequent question.

@amueller added this to the 0.19 milestone on Oct 8, 2016
@Morikko
Contributor
Morikko commented Feb 19, 2017

Is it still up to date? I can take it, if it is.

@jnothman
Member
jnothman commented Feb 19, 2017 via email

@jnothman
Member

Absolutely, removing the model will reduce model size. But your prediction step does not work:

AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_'

@bsaunders23

Yeah, never mind, I'm being stupid.

@SamirArcadia

Using Python 3.6.5, gzipping the pickle appears to compress the model to about 20-25% of its original size.

  1. Example of writing gzip from an already pickled file (filename):

import gzip
import pickle

with open(filename, 'rb') as f:  # filename points to already pickled data
    with gzip.open(filename + '.gz', 'wb') as g:
        pickle.dump(pickle.load(f), g)

  2. Example of writing gzip directly from data (D):

with gzip.open('data.pickle.gz', 'wb') as g:
    pickle.dump(D, g)  # D is the data to pickle

  3. Example of loading from gzip:

with gzip.open('data.pickle.gz', 'rb') as g:
    D = pickle.load(g)

@rth
Member
rth commented Apr 21, 2020

Also see the documentation of the compress argument in joblib.dump. xz should allow higher compression ratios than gzip.
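
For example, something along these lines (clf stands for an already fitted classifier; the file name and compression level are arbitrary choices):

import joblib

# compress takes an int (zlib level) or a (method, level) tuple;
# ("xz", 3) selects the LZMA-based xz compressor at level 3.
joblib.dump(clf, "forest.joblib", compress=("xz", 3))

# joblib.load decompresses transparently.
clf = joblib.load("forest.joblib")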
