Need a condensed representation for storing random forest classifiers #6276
Comments
I have tried all kinds of variants using gzip and pickle, e.g. a def saveModel(self, fileName): method. This gives the same error... Aargh...
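A minimal sketch of the gzip-plus-pickle approach described above; since the original snippet is truncated, this saveModel/loadModel pair is a hypothetical reconstruction, written as plain functions rather than methods:

```python
import gzip
import pickle

def saveModel(model, fileName):
    # Pickle the fitted forest and gzip-compress it in one pass.
    with gzip.open(fileName, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

def loadModel(fileName):
    # Decompress and unpickle the forest for later predictions.
    with gzip.open(fileName, "rb") as f:
        return pickle.load(f)
```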
I really don't think the training data is being stored with the model.
Hi, with default parameters the size of the forest will be O(M * N * log(N)), where M is the number of trees and N is the number of samples. Gilles
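To see that growth concretely, the node counts of a fitted forest can be inspected through the public estimators_ and tree_.node_count attributes; a small sketch (the dataset and sizes are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

# With default parameters, trees are grown until leaves are pure, so the
# node count per tree scales roughly with N * log(N).
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
total_nodes = sum(est.tree_.node_count for est in clf.estimators_)
print("%d nodes across %d trees" % (total_nodes, len(clf.estimators_)))
```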
Thanks everyone, I will play with parameters and see what comes out. I tried limiting the depth a bit previously but my error rate went way up. I will look again.
All, I toyed with parameters and came up with the following, which gives good performance and a "reasonable" file size: 300 trees; min 5 points per leaf; max 22 levels; max 1000 leaves. Thanks!
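For anyone landing here later, those settings map onto the RandomForestClassifier constructor roughly as follows (a sketch, not a recommendation for every dataset):

```python
from sklearn.ensemble import RandomForestClassifier

# 300 trees; min 5 points per leaf; max 22 levels; max 1000 leaves.
clf = RandomForestClassifier(
    n_estimators=300,
    min_samples_leaf=5,
    max_depth=22,
    max_leaf_nodes=1000,
)
```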
This issue may be closed.
We should add a note in the documentation about this.
GaelVaroquaux, I agree, especially about the expected growth in tree size with default parameters. I don't have enough experience to know whether my results generalize, but it is likely that most people will just try the default parameters. Thanks!
This is indeed a very frequent question.
Is this still up to date? I can take it, if it is.
I don't think this has been fixed. A note in the docs is still welcome.
Absolutely, removing that part of the model will reduce its size. But then your prediction step does not work.
Yeah, never mind, I'm being stupid.
Using Python 3.6.5, gzipping the pickled file appears to compress the model to about 20-25% of its original size.
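joblib can also compress in place via its compress argument, which avoids gzipping by hand; a sketch, assuming a fitted estimator clf (the filename is arbitrary):

```python
import joblib

# compress takes 0 (off) to 9 (max); 3 is a common speed/size trade-off.
joblib.dump(clf, "forest.joblib", compress=3)
clf = joblib.load("forest.joblib")
```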
Also see the documentation of the …
scikit-learn version: 0.16.1
python: 3.4.4 (anaconda 2.3.0)
machine: Linux RH 6, more memory than you can shake a stick at
There should be an option to store random forest classifiers used for prediction in a condensed representation, e.g. with each decision point holding the descriptor, operator, and cutoff value, and the leaves holding the classification. I trained a 300-tree model using default parameters on about 100K data points (about 20 descriptors each). This trained fine, but the resulting model, output with compression by joblib, is 182 MB (!). I then tried training another model with ten times the data points (same number of trees). This finished building, but joblib choked with the dreaded "OverflowError: Size does not fit in an unsigned int" error (supposedly this was fixed with Python 3?).
By the looks of it, scikit-learn is storing the data used to build the model as part of the model, with no option to disable this. Is there any way around this? I want to build a model with 4M data points...
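For what it's worth, the per-node data the issue asks for (split feature, threshold, and leaf class counts) is exposed on each fitted tree, so a condensed export can be sketched from the public tree_ arrays; this assumes a fitted forest clf and is only an illustration, not a supported serialization format:

```python
def condensed_trees(clf):
    # Keep only what prediction needs from each tree: the split feature
    # (-2 marks a leaf), the threshold, the child indices (-1 at leaves),
    # and the per-node class counts.
    return [
        {
            "feature": est.tree_.feature,
            "threshold": est.tree_.threshold,
            "children_left": est.tree_.children_left,
            "children_right": est.tree_.children_right,
            "value": est.tree_.value,
        }
        for est in clf.estimators_
    ]
```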