DOC Release highlights for version 0.24 (#18795) · franslarsson/scikit-learn@76e82ba · GitHub


Commit 76e82ba

cmarmo, jnothman, NicolasHug, and ogrisel authored
DOC Release highlights for version 0.24 (scikit-learn#18795)
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
1 parent da562b4 commit 76e82ba

File tree

3 files changed: +254 -1 lines changed


doc/glossary.rst

Lines changed: 6 additions & 0 deletions
@@ -328,6 +328,12 @@ General Concepts
     * sometimes in the :ref:`User Guide <user_guide>` (built from ``doc/``)
       alongside a technical description of the estimator.
 
+    experimental
+        An experimental tool is already usable but its public API, such as
+        default parameter values or fitted attributes, is still subject to
+        change in future versions without the usual :term:`deprecation`
+        warning policy.
+
     evaluation metric
     evaluation metrics
         Evaluation metrics give a measure of how well a model performs. We may
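The ``experimental`` status corresponds to scikit-learn's explicit opt-in import
pattern, used for example by the new Halving search estimators highlighted in
this release (a minimal sketch)::

    # enable an experimental feature before importing it; its API may change
    # in a later release without the usual deprecation cycle
    from sklearn.experimental import enable_halving_search_cv  # noqa
    from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV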

doc/modules/partial_dependence.rst

Lines changed: 3 additions & 1 deletion
@@ -105,6 +105,8 @@ generated. The ``values`` field returned by
 used in the grid for each input feature of interest. They also correspond to
 the axis of the plots.
 
+.. _individual_conditional:
+
 Individual conditional expectation (ICE) plot
 =============================================
 
@@ -180,7 +182,7 @@ values are defined by :math:`x_S` for the features in :math:`X_S`, and by
 
 Computing this integral for various values of :math:`x_S` produces a PDP plot
 as above. An ICE line is defined as a single :math:`f(x_{S}, x_{C}^{(i)})`
-evaluated at at :math:`x_{S}`.
+evaluated at :math:`x_{S}`.
 
 Computation methods
 ===================

examples/release_highlights/plot_release_highlights_0_24_0.py

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
# flake8: noqa
"""
========================================
Release Highlights for scikit-learn 0.24
========================================

.. currentmodule:: sklearn

We are pleased to announce the release of scikit-learn 0.24! Many bug fixes
and improvements were added, as well as some new key features. We detail
below a few of the major features of this release. **For an exhaustive list of
all the changes**, please refer to the :ref:`release notes <changes_0_24>`.

To install the latest version (with pip)::

    pip install --upgrade scikit-learn

or with conda::

    conda install -c conda-forge scikit-learn
"""

##############################################################################
# Successive Halving estimators for tuning hyper-parameters
# ---------------------------------------------------------
# Successive Halving, a state-of-the-art method, is now available to
# explore the parameter space and identify the best parameter combination.
# :class:`~sklearn.model_selection.HalvingGridSearchCV` and
# :class:`~sklearn.model_selection.HalvingRandomSearchCV` can be
# used as drop-in replacements for
# :class:`~sklearn.model_selection.GridSearchCV` and
# :class:`~sklearn.model_selection.RandomizedSearchCV`.
# Successive Halving is an iterative selection process illustrated in the
# figure below. The first iteration is run with a small amount of resources,
# where the resource typically corresponds to the number of training samples,
# but can also be an arbitrary integer parameter such as `n_estimators` in a
# random forest. Only a subset of the parameter candidates are selected for the
# next iteration, which will be run with an increasing amount of allocated
# resources. Only a subset of candidates will last until the end of the
# iteration process, and the best parameter candidate is the one that has the
# highest score on the last iteration.
#
# Read more in the :ref:`User Guide <successive_halving_user_guide>` (note:
# the Successive Halving estimators are still :term:`experimental
# <experimental>`).
#
# .. figure:: ../model_selection/images/sphx_glr_plot_successive_halving_iterations_001.png
#    :target: ../model_selection/plot_successive_halving_iterations.html
#    :align: center

import numpy as np
from scipy.stats import randint
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

rng = np.random.RandomState(0)

X, y = make_classification(n_samples=700, random_state=rng)

clf = RandomForestClassifier(n_estimators=10, random_state=rng)

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rsh = HalvingRandomSearchCV(estimator=clf, param_distributions=param_dist,
                            factor=2, random_state=rng)
rsh.fit(X, y)
rsh.best_params_

##############################################################################
# Native support for categorical features in HistGradientBoosting estimators
# --------------------------------------------------------------------------
# :class:`~sklearn.ensemble.HistGradientBoostingClassifier` and
# :class:`~sklearn.ensemble.HistGradientBoostingRegressor` now have native
# support for categorical features: they can consider splits on non-ordered
# categorical data. Read more in the :ref:`User Guide
# <categorical_support_gbdt>`.
#
# .. figure:: ../ensemble/images/sphx_glr_plot_gradient_boosting_categorical_001.png
#    :target: ../ensemble/plot_gradient_boosting_categorical.html
#    :align: center
#
# The plot shows that the new native support for categorical features leads to
# fitting times that are comparable to models where the categories are treated
# as ordered quantities, i.e. simply ordinal-encoded. Native support is also
# more expressive than both one-hot encoding and ordinal encoding. However, to
# use the new `categorical_features` parameter, it is still required to
# preprocess the data within a pipeline, as demonstrated in this :ref:`example
# <sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py>`.

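# A minimal usage sketch (not part of this changeset), assuming the 0.24 API in
# which the HistGradientBoosting estimators still require the experimental
# opt-in import and `categorical_features` accepts a per-column boolean mask:

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
# first column: an unordered category already encoded as small integers
# (so no encoder pipeline is needed here); second column: a numerical feature
X = np.c_[rng.randint(0, 4, size=500), rng.randn(500)]
y = ((X[:, 0] == 2) ^ (X[:, 1] > 0)).astype(int)

# mark only the first column as categorical
hgb = HistGradientBoostingClassifier(categorical_features=[True, False])
hgb.fit(X, y)
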
##############################################################################
# Improved performance of HistGradientBoosting estimators
# --------------------------------------------------------
# The memory footprint of :class:`ensemble.HistGradientBoostingRegressor` and
# :class:`ensemble.HistGradientBoostingClassifier` has been significantly
# improved during calls to `fit`. In addition, histogram initialization is now
# done in parallel, which results in slight speed improvements.
# See more in the `Benchmark page
# <https://scikit-learn.org/scikit-learn-benchmarks/>`_.

##############################################################################
# New self-training meta-estimator
# --------------------------------
# A new self-training implementation, based on `Yarowsky's algorithm
# <https://doi.org/10.3115/981658.981684>`_, can now be used with any
# classifier that implements :term:`predict_proba`. The sub-classifier
# will behave as a semi-supervised classifier, allowing it to learn from
# unlabeled data. Read more in the :ref:`User Guide <self_training>`.

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
iris.target[random_unlabeled_points] = -1
svc = SVC(probability=True, gamma="auto")
self_training_model = SelfTrainingClassifier(svc)
self_training_model.fit(iris.data, iris.target)

##############################################################################
# New SequentialFeatureSelector transformer
# -----------------------------------------
# A new iterative transformer to select features is available:
# :class:`~sklearn.feature_selection.SequentialFeatureSelector`.
# Sequential Feature Selection can add features one at a time (forward
# selection) or remove features from the list of available features
# (backward selection), based on a cross-validated score maximization.
# See the :ref:`User Guide <sequential_feature_selection>`.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)
feature_names = X.columns
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2)
sfs.fit(X, y)
print("Features selected by forward sequential selection: "
      f"{feature_names[sfs.get_support()].tolist()}")

##############################################################################
# New PolynomialCountSketch kernel approximation function
# -------------------------------------------------------
# The new :class:`~sklearn.kernel_approximation.PolynomialCountSketch`
# approximates a polynomial expansion of a feature space when used with linear
# models, but uses much less memory than
# :class:`~sklearn.preprocessing.PolynomialFeatures`.

from sklearn.datasets import fetch_covtype
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.linear_model import LogisticRegression

X, y = fetch_covtype(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(),
                     PolynomialCountSketch(degree=2, n_components=300),
                     LogisticRegression(max_iter=1000))
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=5000,
                                                    test_size=10000,
                                                    random_state=42)
pipe.fit(X_train, y_train).score(X_test, y_test)

##############################################################################
# For comparison, here is the score of a linear baseline for the same data:

linear_baseline = make_pipeline(MinMaxScaler(),
                                LogisticRegression(max_iter=1000))
linear_baseline.fit(X_train, y_train).score(X_test, y_test)

##############################################################################
# Individual Conditional Expectation plots
# ----------------------------------------
# A new kind of partial dependence plot is available: the Individual
# Conditional Expectation (ICE) plot. ICE plots visualize the dependence of the
# prediction on a feature for each sample separately, with one line per sample.
# See the :ref:`User Guide <individual_conditional>`.

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.inspection import plot_partial_dependence

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms']
est = RandomForestRegressor(n_estimators=10)
est.fit(X, y)
display = plot_partial_dependence(
    est, X, features, kind="individual", subsample=50,
    n_jobs=3, grid_resolution=20, random_state=0
)
display.figure_.suptitle(
    'Partial dependence of house value on non-location features\n'
    'for the California housing dataset, with a random forest'
)
display.figure_.subplots_adjust(hspace=0.3)

##############################################################################
# New Poisson splitting criterion for DecisionTreeRegressor
# ---------------------------------------------------------
# The integration of Poisson regression estimation continues from version 0.23.
# :class:`~sklearn.tree.DecisionTreeRegressor` now supports a new `'poisson'`
# splitting criterion. Setting `criterion="poisson"` might be a good choice
# if your target is a count or a frequency.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
regressor = DecisionTreeRegressor(criterion='poisson', random_state=0)
regressor.fit(X_train, y_train)

##############################################################################
# New documentation improvements
# ------------------------------
#
# New examples and documentation pages have been added, in a continuous effort
# to improve the understanding of machine learning practices:
#
# - a new section about :ref:`common pitfalls and recommended
#   practices <common_pitfalls>`,
# - an example illustrating how to :ref:`statistically compare the performance of
#   models <sphx_glr_auto_examples_model_selection_plot_grid_search_stats.py>`
#   evaluated using :class:`~sklearn.model_selection.GridSearchCV`,
# - an example on how to :ref:`interpret coefficients of linear models
#   <sphx_glr_auto_examples_inspection_plot_linear_model_coefficient_interpretation.py>`,
# - an :ref:`example
#   <sphx_glr_auto_examples_cross_decomposition_plot_pcr_vs_pls.py>`
#   comparing Principal Component Regression and Partial Least Squares.