Merge branch 'scikit-learn:main' into pdp_sw · scikit-learn/scikit-learn@aecc0df

Commit aecc0df

Merge branch 'scikit-learn:main' into pdp_sw
2 parents 4dd1bc9 + ecb9a70 commit aecc0df

16 files changed, +263 -73 lines changed

doc/conf.py

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -331,6 +331,7 @@
331331
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
332332
"joblib": ("https://joblib.readthedocs.io/en/latest/", None),
333333
"seaborn": ("https://seaborn.pydata.org/", None),
334+
"skops": ("https://skops.readthedocs.io/en/stable/", None),
334335
}
335336

336337
v = parse(release)

doc/model_persistence.rst

Lines changed: 31 additions & 0 deletions
@@ -92,6 +92,37 @@ serialization methods, please refer to this
9292
`talk by Alex Gaynor
9393
<https://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
9494

95+
96+
A more secure format: `skops`
97+
.............................
98+
99+
`skops <https://skops.readthedocs.io/en/stable/>`__ provides a more secure
100+
format via the :mod:`skops.io` module. It avoids using :mod:`pickle` and only
101+
loads files which have types and references to functions which are trusted
102+
either by default or by the user. The API is very similar to ``pickle``, and
103+
you can persist your models as explained in the `docs
104+
<https://skops.readthedocs.io/en/stable/persistence.html>`__ using
105+
:func:`skops.io.dump` and :func:`skops.io.dumps`::
106+
107+
import skops.io as sio
108+
obj = sio.dumps(clf)
109+
110+
And you can load them back using :func:`skops.io.load` and
111+
:func:`skops.io.loads`. However, you need to specify the types which are
112+
trusted by you. You can get existing unknown types in a dumped object / file
113+
using :func:`skops.io.get_untrusted_types`, and after checking its contents,
114+
pass it to the load function::
115+
116+
unknown_types = sio.get_untrusted_types(obj)
117+
clf = sio.loads(obj, trusted=unknown_types)
118+
119+
If you trust the source of the file / object, you can pass ``trusted=True``::
120+
121+
clf = sio.loads(obj, trusted=True)
122+
123+
Please report issues and feature requests related to this format on the `skops
124+
issue tracker <https://github.com/skops-dev/skops/issues>`__.
125+
95126
Interoperable formats
96127
---------------------
97128
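Note on the skops section above: the key idea in the added text is that loading only succeeds for types that are trusted "either by default or by the user". The following is a minimal sketch of that allow-list idea using only the standard library. It is an analogy, not how skops works (the added text says skops avoids :mod:`pickle` entirely); the names `TRUSTED`, `AllowListUnpickler` and `restricted_loads` are hypothetical and exist only for this illustration.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allow-list for this sketch: (module, name) pairs we choose to trust.
TRUSTED = {("collections", "OrderedDict")}


class AllowListUnpickler(pickle.Unpickler):
    """Refuse to resolve any global that is not explicitly trusted."""

    def find_class(self, module, name):
        if (module, name) not in TRUSTED:
            raise pickle.UnpicklingError(f"untrusted type: {module}.{name}")
        return super().find_class(module, name)


def restricted_loads(data: bytes):
    """Load pickled bytes, resolving only allow-listed types."""
    return AllowListUnpickler(io.BytesIO(data)).load()


# A trusted type round-trips normally.
ok = restricted_loads(pickle.dumps(OrderedDict(a=1)))

# A reference to a type outside the allow-list is rejected at load time.
try:
    restricted_loads(pickle.dumps(complex))  # pickles a reference to builtins.complex
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

The workflow described in the diff has the same shape: inspect which types a file references first, then decide explicitly which of them to trust before loading.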

doc/related_projects.rst

Lines changed: 6 additions & 0 deletions
@@ -115,6 +115,10 @@ enhance the functionality of scikit-learn's estimators.
115115
Scikit-learn pipelines to `ONNX <https://onnx.ai/>`_ for interchange and
116116
prediction.
117117

118+
- `skops.io <https://skops.readthedocs.io/en/stable/persistence.html>`__ A
119+
persistence model more secure than pickle, which can be used instead of
120+
pickle in most common cases.
121+
118122
- `sklearn2pmml <https://github.com/jpmml/sklearn2pmml>`_
119123
Serialization of a wide variety of scikit-learn estimators and transformers
120124
into PMML with the help of `JPMML-SkLearn <https://github.com/jpmml/jpmml-sklearn>`_
@@ -356,6 +360,8 @@ and promote community efforts.
356360
(`source <https://github.com/mehrdad-dev/scikit-learn>`__)
357361
- `Spanish translation <https://qu4nt.github.io/sklearn-doc-es/>`_
358362
(`source <https://github.com/qu4nt/sklearn-doc-es>`__)
363+
- `Korean translation <https://panda5176.github.io/scikit-learn-korean/>`_
364+
(`source <https://github.com/panda5176/scikit-learn-korean>`__)
359365

360366

361367
.. rubric:: Footnotes

doc/whats_new/v1.3.rst

Lines changed: 12 additions & 1 deletion
@@ -19,6 +19,11 @@ parameters, may produce different models from the previous version. This often
1919
occurs due to changes in the modelling logic (bug fixes or enhancements), or in
2020
random sampling procedures.
2121

22+
- |Fix| The `categories_` attribute of :class:`preprocessing.OneHotEncoder` now
23+
always contains an array of `object`s when using predefined categories that
24+
are strings. Predefined categories encoded as bytes will no longer work
25+
with `X` encoded as strings. :pr:`25174` by :user:`Tim Head <betatim>`.
26+
2227
Changes impacting all modules
2328
-----------------------------
2429

@@ -36,6 +41,13 @@ Changelog
3641
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
3742
where 123456 is the *pull request* number, not the issue number.
3843
44+
:mod:`sklearn.ensemble`
45+
.......................
46+
- |Feature| Compute a custom out-of-bag score by passing a callable to
47+
:class:`ensemble.RandomForestClassifier`, :class:`ensemble.RandomForestRegressor`,
48+
:class:`ensemble.ExtraTreesClassifier` and :class:`ensemble.ExtraTreesRegressor`.
49+
:pr:`25177` by :user:`Tim Head <betatim>`.
50+
3951
:mod:`sklearn.pipeline`
4052
.......................
4153
- |Feature| :class:`pipeline.FeatureUnion` can now use indexing notation (e.g.
@@ -44,7 +56,6 @@ Changelog
4456

4557
:mod:`sklearn.preprocessing`
4658
............................
47-
4859
- |Enhancement| Added support for `sample_weight` in
4960
:class:`preprocessing.KBinsDiscretizer`. This allows specifying the parameter
5061
`sample_weight` for each sample to be used while fitting. The option is only

sklearn/compose/_column_transformer.py

Lines changed: 19 additions & 10 deletions
@@ -6,6 +6,7 @@
66
# Author: Andreas Mueller
77
# Joris Van den Bossche
88
# License: BSD
9+
from numbers import Integral, Real
910
from itertools import chain
1011
from collections import Counter
1112

@@ -20,6 +21,7 @@
2021
from ..utils import Bunch
2122
from ..utils import _safe_indexing
2223
from ..utils import _get_column_indices
24+
from ..utils._param_validation import HasMethods, Interval, StrOptions, Hidden
2325
from ..utils._set_output import _get_output_config, _safe_set_output
2426
from ..utils import check_pandas_support
2527
from ..utils.metaestimators import _BaseComposition
@@ -212,6 +214,20 @@ class ColumnTransformer(TransformerMixin, _BaseComposition):
212214

213215
_required_parameters = ["transformers"]
214216

217+
_parameter_constraints: dict = {
218+
"transformers": [list, Hidden(tuple)],
219+
"remainder": [
220+
StrOptions({"drop", "passthrough"}),
221+
HasMethods(["fit", "transform"]),
222+
HasMethods(["fit_transform", "transform"]),
223+
],
224+
"sparse_threshold": [Interval(Real, 0, 1, closed="both")],
225+
"n_jobs": [Integral, None],
226+
"transformer_weights": [dict, None],
227+
"verbose": ["verbose"],
228+
"verbose_feature_names_out": ["boolean"],
229+
}
230+
215231
def __init__(
216232
self,
217233
transformers,
@@ -406,6 +422,7 @@ def _validate_transformers(self):
406422
if not (hasattr(t, "fit") or hasattr(t, "fit_transform")) or not hasattr(
407423
t, "transform"
408424
):
425+
# Used to validate the transformers in the `transformers` list
409426
raise TypeError(
410427
"All estimators should implement fit and "
411428
"transform, or can be 'drop' or 'passthrough' "
@@ -432,16 +449,6 @@ def _validate_remainder(self, X):
432449
Validates ``remainder`` and defines ``_remainder`` targeting
433450
the remaining columns.
434451
"""
435-
is_transformer = (
436-
hasattr(self.remainder, "fit") or hasattr(self.remainder, "fit_transform")
437-
) and hasattr(self.remainder, "transform")
438-
if self.remainder not in ("drop", "passthrough") and not is_transformer:
439-
raise ValueError(
440-
"The remainder keyword needs to be one of 'drop', "
441-
"'passthrough', or estimator. '%s' was passed instead"
442-
% self.remainder
443-
)
444-
445452
self._n_features = X.shape[1]
446453
cols = set(chain(*self._transformer_to_input_indices.values()))
447454
remaining = sorted(set(range(self._n_features)) - cols)
@@ -688,6 +695,7 @@ def fit(self, X, y=None):
688695
self : ColumnTransformer
689696
This estimator.
690697
"""
698+
self._validate_params()
691699
# we use fit_transform to make sure to set sparse_output_ (for which we
692700
# need the transformed data) to have consistent output type in predict
693701
self.fit_transform(X, y=y)
@@ -714,6 +722,7 @@ def fit_transform(self, X, y=None):
714722
any result is a sparse matrix, everything will be converted to
715723
sparse matrices.
716724
"""
725+
self._validate_params()
717726
self._check_feature_names(X, reset=True)
718727

719728
X = _check_X(X)
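Note on the change above: the diff declares `_parameter_constraints` on `ColumnTransformer`, calls `self._validate_params()` at the start of `fit` and `fit_transform`, and removes the hand-written `remainder` check from `_validate_remainder`. The sketch below illustrates the general pattern of declarative constraints checked in one place. It is a simplified stand-in under stated assumptions, not scikit-learn's implementation: `OneOf`, `HasAttrs`, `ClosedInterval`, `CONSTRAINTS`, `validate` and `DummyTransformer` are hypothetical names, and the error type raised here is a plain `ValueError` chosen for the sketch.

```python
from numbers import Real


class OneOf:
    """Hypothetical stand-in for a string-options constraint."""

    def __init__(self, options):
        self.options = set(options)

    def is_satisfied_by(self, value):
        return isinstance(value, str) and value in self.options


class HasAttrs:
    """Hypothetical stand-in for a 'has these methods' constraint."""

    def __init__(self, names):
        self.names = list(names)

    def is_satisfied_by(self, value):
        return all(callable(getattr(value, name, None)) for name in self.names)


class ClosedInterval:
    """Hypothetical stand-in for a real interval closed on both ends."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def is_satisfied_by(self, value):
        return isinstance(value, Real) and self.low <= value <= self.high


# Each parameter maps to a list of alternatives; satisfying any one is enough.
CONSTRAINTS = {
    "remainder": [
        OneOf({"drop", "passthrough"}),
        HasAttrs(["fit", "transform"]),
        HasAttrs(["fit_transform", "transform"]),
    ],
    "sparse_threshold": [ClosedInterval(0, 1)],
}


def validate(params):
    """Raise ValueError if any parameter satisfies none of its constraints."""
    for name, value in params.items():
        if not any(c.is_satisfied_by(value) for c in CONSTRAINTS[name]):
            raise ValueError(f"Invalid value for {name!r}: {value!r}")


class DummyTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


validate({"remainder": "drop", "sparse_threshold": 0.3})
validate({"remainder": DummyTransformer(), "sparse_threshold": 1})

# remainder=1 is neither an allowed string nor a transformer-like object.
try:
    validate({"remainder": 1, "sparse_threshold": 0.3})
    remainder_rejected = False
except ValueError:
    remainder_rejected = True
```

Centralising the check this way is consistent with the test changes in this commit, which drop the assertions on the old hand-written `remainder` error message.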

sklearn/compose/tests/test_column_transformer.py

Lines changed: 17 additions & 18 deletions
@@ -137,6 +137,23 @@ def test_column_transformer():
137137
assert len(both.transformers_) == 1
138138

139139

140+
def test_column_transformer_tuple_transformers_parameter():
141+
X_array = np.array([[0, 1, 2], [2, 4, 6]]).T
142+
143+
transformers = [("trans1", Trans(), [0]), ("trans2", Trans(), [1])]
144+
145+
ct_with_list = ColumnTransformer(transformers)
146+
ct_with_tuple = ColumnTransformer(tuple(transformers))
147+
148+
assert_array_equal(
149+
ct_with_list.fit_transform(X_array), ct_with_tuple.fit_transform(X_array)
150+
)
151+
assert_array_equal(
152+
ct_with_list.fit(X_array).transform(X_array),
153+
ct_with_tuple.fit(X_array).transform(X_array),
154+
)
155+
156+
140157
def test_column_transformer_dataframe():
141158
pd = pytest.importorskip("pandas")
142159

@@ -812,15 +829,6 @@ def test_column_transformer_special_strings():
812829
assert len(ct.transformers_) == 2
813830
assert ct.transformers_[-1][0] != "remainder"
814831

815-
# None itself / other string is not valid
816-
for val in [None, "other"]:
817-
ct = ColumnTransformer([("trans1", Trans(), [0]), ("trans2", None, [1])])
818-
msg = "All estimators should implement"
819-
with pytest.raises(TypeError, match=msg):
820-
ct.fit_transform(X_array)
821-
with pytest.raises(TypeError, match=msg):
822-
ct.fit(X_array)
823-
824832

825833
def test_column_transformer_remainder():
826834
X_array = np.array([[0, 1, 2], [2, 4, 6]]).T
@@ -865,15 +873,6 @@ def test_column_transformer_remainder():
865873
assert ct.transformers_[-1][1] == "passthrough"
866874
assert_array_equal(ct.transformers_[-1][2], [1])
867875

868-
# error on invalid arg
869-
ct = ColumnTransformer([("trans1", Trans(), [0])], remainder=1)
870-
msg = "remainder keyword needs to be one of 'drop', 'passthrough', or estimator."
871-
with pytest.raises(ValueError, match=msg):
872-
ct.fit(X_array)
873-
874-
with pytest.raises(ValueError, match=msg):
875-
ct.fit_transform(X_array)
876-
877876
# check default for make_column_transformer
878877
ct = make_column_transformer((Trans(), [0]))
879878
assert ct.remainder == "drop"

sklearn/datasets/_kddcup99.py

Lines changed: 1 addition & 0 deletions
@@ -81,6 +81,7 @@ def fetch_kddcup99(
8181
data_home : str, default=None
8282
Specify another download and cache folder for the datasets. By default
8383
all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
84+
8485
.. versionadded:: 0.19
8586
8687
shuffle : bool, default=False
