# flake8: noqa
"""
========================================
Release Highlights for scikit-learn 0.24
========================================

.. currentmodule:: sklearn

We are pleased to announce the release of scikit-learn 0.24! Many bug fixes
and improvements were added, as well as some new key features. We detail
below a few of the major features of this release. **For an exhaustive list of
all the changes**, please refer to the :ref:`release notes <changes_0_24>`.

To install the latest version (with pip)::

    pip install --upgrade scikit-learn

or with conda::

    conda install -c conda-forge scikit-learn
"""

##############################################################################
# Successive Halving estimators for tuning hyper-parameters
# ---------------------------------------------------------
# Successive Halving, a state-of-the-art method, is now available to
# explore the parameter space and identify the best parameter combination.
# :class:`~sklearn.model_selection.HalvingGridSearchCV` and
# :class:`~sklearn.model_selection.HalvingRandomSearchCV` can be
# used as drop-in replacements for
# :class:`~sklearn.model_selection.GridSearchCV` and
# :class:`~sklearn.model_selection.RandomizedSearchCV`.
# Successive Halving is an iterative selection process illustrated in the
# figure below. The first iteration is run with a small amount of resources,
# where the resource typically corresponds to the number of training samples,
# but can also be an arbitrary integer parameter such as `n_estimators` in a
# random forest. Only a subset of the parameter candidates is selected for the
# next iteration, which is run with an increasing amount of allocated
# resources. Only a subset of candidates lasts until the end of the iteration
# process, and the best parameter candidate is the one with the highest score
# at the last iteration.
#
# Read more in the :ref:`User Guide <successive_halving_user_guide>` (note:
# the Successive Halving estimators are still :term:`experimental
# <experimental>`).
#
# .. figure:: ../model_selection/images/sphx_glr_plot_successive_halving_iterations_001.png
#   :target: ../model_selection/plot_successive_halving_iterations.html
#   :align: center

import numpy as np
from scipy.stats import randint
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

rng = np.random.RandomState(0)

X, y = make_classification(n_samples=700, random_state=rng)

clf = RandomForestClassifier(n_estimators=10, random_state=rng)

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rsh = HalvingRandomSearchCV(estimator=clf, param_distributions=param_dist,
                            factor=2, random_state=rng)
rsh.fit(X, y)
rsh.best_params_

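# As a quick look at the iterative process described above (an illustrative
# sketch, not part of the original example), the fitted search exposes the
# schedule it followed: the number of candidates evaluated and the amount of
# resources allocated at each iteration.
print(f"Number of iterations: {rsh.n_iterations_}")
print(f"Candidates per iteration: {rsh.n_candidates_}")
print(f"Resources per iteration: {rsh.n_resources_}")
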
##############################################################################
# Native support for categorical features in HistGradientBoosting estimators
# --------------------------------------------------------------------------
# :class:`~sklearn.ensemble.HistGradientBoostingClassifier` and
# :class:`~sklearn.ensemble.HistGradientBoostingRegressor` now have native
# support for categorical features: they can consider splits on non-ordered,
# categorical data. Read more in the :ref:`User Guide
# <categorical_support_gbdt>`.
#
# .. figure:: ../ensemble/images/sphx_glr_plot_gradient_boosting_categorical_001.png
#   :target: ../ensemble/plot_gradient_boosting_categorical.html
#   :align: center
#
# The plot shows that the new native support for categorical features leads to
# fitting times that are comparable to models where the categories are treated
# as ordered quantities, i.e. simply ordinal-encoded. Native support is also
# more expressive than both one-hot encoding and ordinal encoding. However, to
# use the new `categorical_features` parameter, it is still required to
# preprocess the data within a pipeline as demonstrated in this :ref:`example
# <sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py>`.

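# Below is a minimal sketch of the new `categorical_features` parameter (added
# for illustration, not part of the original example). It assumes a toy
# dataset whose first column already holds integer-encoded categories in
# [0, n_categories); string categories would still need an ordinal-encoding
# step in a pipeline first, as explained above.
import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa (still required in 0.24)
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X_cat = rng.randint(0, 4, size=(500, 1))  # integer-encoded categorical column
X_num = rng.randn(500, 1)                 # numerical column
X_toy = np.hstack([X_cat, X_num])
y_toy = (X_cat.ravel() == 2).astype(int)  # the target depends on the category

# mark column 0 as categorical so splits treat it as non-ordered
hist_clf = HistGradientBoostingClassifier(categorical_features=[0])
hist_clf.fit(X_toy, y_toy)
hist_clf.score(X_toy, y_toy)
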
##############################################################################
# Improved performance of HistGradientBoosting estimators
# --------------------------------------------------------
# The memory footprint of :class:`ensemble.HistGradientBoostingRegressor` and
# :class:`ensemble.HistGradientBoostingClassifier` has been significantly
# reduced during calls to `fit`. In addition, histogram initialization is now
# done in parallel, which results in slight speed improvements.
# See more in the `Benchmark page
# <https://scikit-learn.org/scikit-learn-benchmarks/>`_.

##############################################################################
# New self-training meta-estimator
# --------------------------------
# A new self-training implementation, based on `Yarowsky's algorithm
# <https://doi.org/10.3115/981658.981684>`_, can now be used with any
# classifier that implements :term:`predict_proba`. The wrapped classifier
# will behave as a semi-supervised classifier, allowing it to learn from
# unlabeled data. Read more in the :ref:`User Guide <self_training>`.

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
iris.target[random_unlabeled_points] = -1
svc = SVC(probability=True, gamma="auto")
self_training_model = SelfTrainingClassifier(svc)
self_training_model.fit(iris.data, iris.target)

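# As a brief follow-up (an illustrative sketch, not part of the original
# example), we can check how many of the unlabeled points received a
# pseudo-label during self-training: `labeled_iter_` is 0 for samples labeled
# in the original dataset and -1 for samples that were never labeled.
n_pseudo_labeled = int(np.sum(self_training_model.labeled_iter_ > 0))
print(f"Points pseudo-labeled during self-training: {n_pseudo_labeled}")
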
##############################################################################
# New SequentialFeatureSelector transformer
# -----------------------------------------
# A new iterative transformer to select features is available:
# :class:`~sklearn.feature_selection.SequentialFeatureSelector`.
# Sequential Feature Selection can add features one at a time (forward
# selection) or remove features from the list of available features
# (backward selection), based on maximizing a cross-validated score.
# See the :ref:`User Guide <sequential_feature_selection>`.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)
feature_names = X.columns
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2)
sfs.fit(X, y)
print("Features selected by forward sequential selection: "
      f"{feature_names[sfs.get_support()].tolist()}")

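# The selector can also work in the other direction (an illustrative sketch,
# not part of the original example): passing `direction="backward"` starts
# from all features and removes them one at a time.
sfs_backward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                         direction="backward")
sfs_backward.fit(X, y)
print("Features selected by backward sequential selection: "
      f"{feature_names[sfs_backward.get_support()].tolist()}")
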
##############################################################################
# New PolynomialCountSketch kernel approximation function
# --------------------------------------------------------
# The new :class:`~sklearn.kernel_approximation.PolynomialCountSketch`
# approximates a polynomial expansion of a feature space when used with linear
# models, but uses much less memory than
# :class:`~sklearn.preprocessing.PolynomialFeatures`.

from sklearn.datasets import fetch_covtype
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.linear_model import LogisticRegression

X, y = fetch_covtype(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(),
                     PolynomialCountSketch(degree=2, n_components=300),
                     LogisticRegression(max_iter=1000))
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=5000,
                                                    test_size=10000,
                                                    random_state=42)
pipe.fit(X_train, y_train).score(X_test, y_test)

##############################################################################
# For comparison, here is the score of a linear baseline for the same data:

linear_baseline = make_pipeline(MinMaxScaler(),
                                LogisticRegression(max_iter=1000))
linear_baseline.fit(X_train, y_train).score(X_test, y_test)

##############################################################################
# Individual Conditional Expectation plots
# ----------------------------------------
# A new kind of partial dependence plot is available: the Individual
# Conditional Expectation (ICE) plot. ICE plots visualize the dependence of
# the prediction on a feature for each sample separately, with one line per
# sample. See the :ref:`User Guide <individual_conditional>`.

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.inspection import plot_partial_dependence

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms']
est = RandomForestRegressor(n_estimators=10)
est.fit(X, y)
display = plot_partial_dependence(
    est, X, features, kind="individual", subsample=50,
    n_jobs=3, grid_resolution=20, random_state=0
)
display.figure_.suptitle(
    'Partial dependence of house value on non-location features\n'
    'for the California housing dataset, with a random forest'
)
display.figure_.subplots_adjust(hspace=0.3)

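# ICE lines can also be overlaid with the usual averaged partial dependence
# (an illustrative sketch, not part of the original example) by passing
# kind="both" for a single feature:
display_both = plot_partial_dependence(
    est, X, ['MedInc'], kind="both", subsample=50,
    grid_resolution=20, random_state=0
)
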
##############################################################################
# New Poisson splitting criterion for DecisionTreeRegressor
# ---------------------------------------------------------
# The integration of Poisson regression support, started in version 0.23,
# continues: :class:`~sklearn.tree.DecisionTreeRegressor` now supports a new
# `'poisson'` splitting criterion. Setting `criterion="poisson"` might be a
# good choice if your target is a count or a frequency.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5], with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
regressor = DecisionTreeRegressor(criterion='poisson', random_state=0)
regressor.fit(X_train, y_train)

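# For a quick sanity check (an illustrative sketch, not part of the original
# example), compare the held-out R^2 score of the Poisson-criterion tree with
# a tree grown using the default squared-error criterion on the same split:
default_regressor = DecisionTreeRegressor(random_state=0)
default_regressor.fit(X_train, y_train)
print(f"R^2 with criterion='poisson': {regressor.score(X_test, y_test):.3f}")
print(f"R^2 with default criterion: {default_regressor.score(X_test, y_test):.3f}")
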
##############################################################################
# New documentation improvements
# ------------------------------
#
# New examples and documentation pages have been added, in a continuous effort
# to improve the understanding of machine learning practices:
#
# - a new section about :ref:`common pitfalls and recommended
#   practices <common_pitfalls>`,
# - an example illustrating how to :ref:`statistically compare the performance
#   of models <sphx_glr_auto_examples_model_selection_plot_grid_search_stats.py>`
#   evaluated using :class:`~sklearn.model_selection.GridSearchCV`,
# - an example on how to :ref:`interpret coefficients of linear models
#   <sphx_glr_auto_examples_inspection_plot_linear_model_coefficient_interpretation.py>`,
# - an :ref:`example
#   <sphx_glr_auto_examples_cross_decomposition_plot_pcr_vs_pls.py>`
#   comparing Principal Component Regression and Partial Least Squares.