DOC: Consolidate description of missing values in tree-based models i… · scikit-learn/scikit-learn@85fb4da · GitHub

Commit 85fb4da

Rishab260 and adam2392 authored
DOC: Consolidate description of missing values in tree-based models in _forest.py (#30955)
Co-authored-by: Adam Li <adam2392@gmail.com>
1 parent ade815c commit 85fb4da

1 file changed: +30 -0 lines changed
sklearn/ensemble/_forest.py

Lines changed: 30 additions & 0 deletions
@@ -1188,6 +1188,13 @@ class RandomForestClassifier(ForestClassifier):
 For a comparison between tree-based ensemble models see the example
 :ref:`sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py`.

+This estimator has native support for missing values (NaNs). During training,
+the tree grower learns at each split point whether samples with missing values
+should go to the left or right child, based on the potential gain. When predicting,
+samples with missing values are assigned to the left or right child consequently.
+If no missing values were encountered for a given feature during training, then
+samples with missing values are mapped to whichever child has the most samples.
+
 Read more in the :ref:`User Guide <forest>`.

 Parameters
@@ -1572,6 +1579,13 @@ class RandomForestRegressor(ForestRegressor):
 `bootstrap=True` (default), otherwise the whole dataset is used to build
 each tree.

+This estimator has native support for missing values (NaNs). During training,
+the tree grower learns at each split point whether samples with missing values
+should go to the left or right child, based on the potential gain. When predicting,
+samples with missing values are assigned to the left or right child consequently.
+If no missing values were encountered for a given feature during training, then
+samples with missing values are mapped to whichever child has the most samples.
+
 For a comparison between tree-based ensemble models see the example
 :ref:`sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py`.

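The paragraph added to the two random forest docstrings describes a per-split decision about where missing values go. The following is a minimal sketch of that idea in plain Python, not scikit-learn's implementation: it fixes a single threshold, uses variance reduction as the gain, and breaks ties toward the left child. The helper names `variance`, `split_gain`, and `choose_missing_direction` are hypothetical and exist only for this illustration.

```python
import math


def variance(values):
    """Population variance of a list of targets (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(left, right):
    """Variance reduction obtained by splitting the parent into left/right."""
    n = len(left) + len(right)
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
    return variance(left + right) - weighted


def choose_missing_direction(xs, ys, threshold):
    """Decide whether NaN samples go "left" or "right" at one threshold.

    Assumption for this sketch: the threshold is given. A real tree grower
    also searches over thresholds and features.
    """
    left = [y for x, y in zip(xs, ys) if not math.isnan(x) and x <= threshold]
    right = [y for x, y in zip(xs, ys) if not math.isnan(x) and x > threshold]
    missing = [y for x, y in zip(xs, ys) if math.isnan(x)]

    if not missing:
        # No NaNs seen for this feature during training: NaNs met at predict
        # time follow the child that received the most samples.
        direction = "left" if len(left) >= len(right) else "right"
        return direction, split_gain(left, right)

    # NaNs seen during training: try both directions, keep the higher gain.
    gain_left = split_gain(left + missing, right)
    gain_right = split_gain(left, right + missing)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


nan = float("nan")
direction, gain = choose_missing_direction(
    [1.0, 2.0, 3.0, 4.0, nan, nan], [0.0, 0.0, 1.0, 1.0, 1.0, 1.0], threshold=2.5
)
print(direction, round(gain, 4))
```

In this toy data the samples with missing values share the target of the right-hand group, so sending them right yields two pure children and the larger gain.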
@@ -1929,6 +1943,14 @@ class ExtraTreesClassifier(ForestClassifier):
 of the dataset and uses averaging to improve the predictive accuracy
 and control over-fitting.

+This estimator has native support for missing values (NaNs) for
+random splits. During training, a random threshold will be chosen
+to split the non-missing values on. Then the non-missing values will be sent
+to the left and right child based on the randomly selected threshold, while
+the missing values will also be randomly sent to the left or right child.
+This is repeated for every feature considered at each split. The best split
+among these is chosen.
+
 Read more in the :ref:`User Guide <forest>`.

 Parameters
@@ -2302,6 +2324,14 @@ class ExtraTreesRegressor(ForestRegressor):
 of the dataset and uses averaging to improve the predictive accuracy
 and control over-fitting.

+This estimator has native support for missing values (NaNs) for
+random splits. During training, a random threshold will be chosen
+to split the non-missing values on. Then the non-missing values will be sent
+to the left and right child based on the randomly selected threshold, while
+the missing values will also be randomly sent to the left or right child.
+This is repeated for every feature considered at each split. The best split
+among these is chosen.
+
 Read more in the :ref:`User Guide <forest>`.

 Parameters
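The extra-trees paragraph describes a different procedure: one random threshold and one random direction for missing values per candidate feature, with the best of those candidates kept. A rough sketch under stated assumptions follows. It is not scikit-learn's code: the gain is variance reduction, the threshold is drawn uniformly between the observed minimum and maximum, and the names `variance`, `split_gain`, and `random_split_with_missing` are hypothetical.

```python
import math
import random


def variance(values):
    """Population variance of a list of targets (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(left, right):
    """Variance reduction obtained by splitting the parent into left/right."""
    n = len(left) + len(right)
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
    return variance(left + right) - weighted


def random_split_with_missing(X, y, feature_indices, rng):
    """Draw one random split per candidate feature and return the best one."""
    best = None
    for j in feature_indices:
        column = [row[j] for row in X]
        observed = [v for v in column if not math.isnan(v)]
        if not observed or min(observed) == max(observed):
            continue  # nothing to split on for this feature

        # Random threshold over the non-missing values only.
        threshold = rng.uniform(min(observed), max(observed))
        # Missing values are sent to a randomly chosen child.
        missing_goes_left = rng.random() < 0.5

        left, right = [], []
        for value, target in zip(column, y):
            if math.isnan(value):
                goes_left = missing_goes_left
            else:
                goes_left = value <= threshold
            (left if goes_left else right).append(target)
        if not left or not right:
            continue  # degenerate split, skip it

        gain = split_gain(left, right)
        if best is None or gain > best["gain"]:
            best = {
                "feature": j,
                "threshold": threshold,
                "missing_direction": "left" if missing_goes_left else "right",
                "gain": gain,
            }
    return best


nan = float("nan")
X = [[1.0, 10.0], [2.0, nan], [nan, 30.0], [4.0, 40.0]]
y = [0.0, 0.0, 1.0, 1.0]
print(random_split_with_missing(X, y, [0, 1], random.Random(0)))
```

Unlike the random forest sketch, nothing here compares the two directions for missing values within a feature. The only selection step is picking the best candidate across features.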

0 commit comments

Comments
 (0)
0