ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") by DeaMariaLeon · Pull Request #1 · DeaMariaLeon/scikit-learn · GitHub

ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") #1


Closed · wants to merge 47 commits

Changes from all commits

Commits (47)
3522d2d
Adding sample_weight parameter support in fit method of KBinsDiscretizer
Seladus Dec 21, 2021
4366bdc
Adding support for sample_weights in the case of an array-like n_bins
Seladus Dec 21, 2021
17e429b
Adding test cases for sample_weights parameter in fit method in KBins…
Seladus Dec 21, 2021
fe6076d
Black formatting and clarifications in documentation
Seladus Dec 21, 2021
e7d5003
Minor fix for PEP8 compatibility
Seladus Dec 21, 2021
85fe2b7
Adding fix to correct misunderstanding of the task and to make better…
Seladus Dec 23, 2021
83a0bba
Adding parameter copy=True in check for sample_weights + formatting w…
Seladus Dec 23, 2021
524cb73
Removing unused imports
Seladus Dec 23, 2021
dbe2eb0
Adding entry to the changelog
Seladus Dec 23, 2021
6bff3a3
Adding dtype in bin edges construction
Seladus Dec 23, 2021
ea5f446
Update sklearn/preprocessing/_discretization.py
Seladus Dec 23, 2021
e7e9eee
Application of suggestions : sooner check for valid strategy + interl…
Seladus Dec 23, 2021
e78c263
Movins subsample check for other strategy than quantile in its own if…
Seladus Dec 24, 2021
1a417f4
Change for linter
Seladus Dec 24, 2021
b5677e3
Merge branch 'main' into support_sample_weight_in_kbinsdiscretizer
Seladus Dec 31, 2021
b16321c
Update sklearn/preprocessing/_discretization.py
Seladus Jan 5, 2022
0d3919b
Update _discretization.py
Seladus Jan 5, 2022
30edabc
Adding TODO comment in tests and removing uselesss call to np.array
Seladus Jan 6, 2022
b600a0d
Changes for linter
Seladus Jan 6, 2022
aee7f59
DOC changed n_iter to max_iter to resolve a deprecation warning (#24837)
star1327p Nov 5, 2022
353b05b
DOC moved legend to upper center (#24847)
star1327p Nov 7, 2022
2e481f1
DOC fix class reference in changelog 1.1 (#24850)
GaelVaroquaux Nov 7, 2022
3e47fa9
DOC Improve interaction constraint doc for HistGradientBoosting* (#24…
betatim Nov 8, 2022
80f9247
Merge branch 'main' into deas-kbins
DeaMariaLeon Nov 8, 2022
5c573bd
Revert pre-commit file change
DeaMariaLeon Nov 8, 2022
f8986ee
FIX adapt epsilon value depending of the dtype of the input (#24354)
Safikh Nov 10, 2022
a923f9e
FIX Calls super().__init_subclass__ in _SetOutputMixin (#24863)
thomasjpfan Nov 10, 2022
539cd6c
DOC Improve narrative of DBSCAN example narrative (#24874)
ArturoAmorQ Nov 10, 2022
13b5b61
ENH Allow 0 < p < 1 for Minkowski distance for `algorithm="brute"` in…
RudreshVeerkhare Nov 10, 2022
64432e1
DOC Fix FutureWarning in 'examples/bicluster/plot_bicluster_newsgroup…
kianelbo Nov 10, 2022
e947074
FIX Make sure that set_output is keyword only everywhere (#24890)
thomasjpfan Nov 11, 2022
c98b910
DOC Fix FutureWarning in 'applications/plot_cyclical_feature_engineer…
Ti-Ion Nov 12, 2022
2718d9b
DOC remove FutureWarning in cluster/plot_bisect_kmeans.py (#24891)
aditya-anulekh Nov 12, 2022
4c96d7c
DOC remove FutureWarning in plot_colour_quantization example (#24893)
GeorgiaMayDay Nov 12, 2022
cf04165
DOC Fix FutureWarning in ensemble/plot_gradient_boosting_categorical …
DhanshreeA Nov 12, 2022
7ec1bfc
CI Update to python 3.11 docker windows image (#24900)
cmarmo Nov 12, 2022
84a7a7a
ENH Specify categorical features with feature names in HGBDT (#24889)
ogrisel Nov 13, 2022
980ded1
DOC add additional pointers in Forest regarding how to use `warm_star…
Nov 13, 2022
49aae1c
DOC improve `GammaRegressor` docstring (#24789)
Badr-MOUFAD Nov 13, 2022
70442b9
DOC Fix FutureWarning in decomposition/plot_ica_blind_source_separati…
MaximSmolskiy Nov 13, 2022
1bf4ebd
DOC Fix FutureWarning in plot_linear_model_coefficient_interpretation…
SarahRemus Nov 14, 2022
bf573bd
DOC Fix FutureWarning in manifold/plot_compare_methods.py (#24909)
MaximSmolskiy Nov 14, 2022
dbde1da
DOC Fix FutureWarning in cluster/plot_dict_face_patches.html (#24910)
Ti-Ion Nov 14, 2022
c4d3a69
DOC fix spelling in `DecisionBoundaryDisplay` docstring (#24921)
glemaitre Nov 15, 2022
bfcb5a4
DOC Fix -- make OPTICS plots consistent (#24926)
espg Nov 15, 2022
9498a89
removed test_invalid_n_bins() & trailing spaces to test_discretizatio…
DeaMariaLeon Nov 15, 2022
7ab4801
Finishing PR 22048 - updating my branch with upstream
DeaMariaLeon Nov 15, 2022
6 changes: 0 additions & 6 deletions build_tools/github/build_minimal_windows_image.sh
@@ -14,12 +14,6 @@ cp $WHEEL_PATH $WHEEL_NAME
# Dot the Python version for identifying the base Docker image
PYTHON_VERSION=$(echo ${PYTHON_VERSION:0:1}.${PYTHON_VERSION:1:2})

# TODO: Remove when 3.11 images will be available for
# windows (for now the docker image is tagged as 3.11-rc)
if [[ "$PYTHON_VERSION" == "3.11" ]]; then
PYTHON_VERSION=$(echo ${PYTHON_VERSION}-rc)
fi

# Build a minimal Windows Docker image for testing the wheels
docker build --build-arg PYTHON_VERSION=$PYTHON_VERSION \
--build-arg WHEEL_NAME=$WHEEL_NAME \
5 changes: 4 additions & 1 deletion doc/developers/develop.rst
@@ -670,7 +670,10 @@ when defining a custom subclass::
...

The default value for `auto_wrap_output_keys` is `("transform",)`, which automatically
wraps `fit_transform` and `transform`.
wraps `fit_transform` and `transform`. The `TransformerMixin` uses the
`__init_subclass__` mechanism to consume `auto_wrap_output_keys` and pass all other
keyword arguments to its super class. A super class's `__init_subclass__` should
**not** depend on `auto_wrap_output_keys`.

For transformers that return multiple arrays in `transform`, auto wrapping will
only wrap the first array and not alter the other arrays.
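As a brief illustration of the paragraph above, a minimal sketch (not part of this diff) of a custom transformer that opts out of automatic output wrapping; the class and data here are hypothetical:

    from sklearn.base import BaseEstimator, TransformerMixin

    # Passing auto_wrap_output_keys=None at class definition time is consumed
    # by TransformerMixin.__init_subclass__ and disables wrapping of the
    # `transform` / `fit_transform` outputs; any other class keywords are
    # forwarded to the remaining super classes.
    class PassthroughTransformer(
        TransformerMixin, BaseEstimator, auto_wrap_output_keys=None
    ):
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X  # returned as-is, never wrapped into a DataFrame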
2 changes: 1 addition & 1 deletion doc/whats_new/v1.1.rst
@@ -400,7 +400,7 @@ Changelog
:mod:`sklearn.cluster`
......................

- |MajorFeature| :class:`BisectingKMeans` introducing Bisecting K-Means algorithm
- |MajorFeature| :class:`cluster.BisectingKMeans` introducing Bisecting K-Means algorithm
:pr:`20031` by :user:`Michal Krawczyk <michalkrawczyk>`,
:user:`Tom Dupre la Tour <TomDLT>`
and :user:`Jérémie du Boisberranger <jeremiedbb>`.
25 changes: 25 additions & 0 deletions doc/whats_new/v1.2.rst
@@ -54,6 +54,10 @@ random sampling procedures.
scores will all be set to the maximum possible rank.
:pr:`24543` by :user:`Guillaume Lemaitre <glemaitre>`.

- |Enhancement| The default of `eps` in :func:`metrics.log_loss` will change
from `1e-15` to `"auto"`, which defaults to `np.finfo(y_pred.dtype).eps`.
:pr:`24354` by :user:`Safiuddin Khaja <Safikh>` and :user:`gsiisg <gsiisg>`.

Changes impacting all modules
-----------------------------

@@ -295,6 +299,11 @@ Changelog
- |Efficiency| Improve runtime performance of :class:`ensemble.IsolationForest`
by avoiding data copies. :pr:`23252` by :user:`Zhehao Liu <MaxwellLZH>`.

- |Enhancement| Make it possible to pass the `categorical_features` parameter
of :class:`ensemble.HistGradientBoostingClassifier` and
:class:`ensemble.HistGradientBoostingRegressor` as feature names.
:pr:`24889` by :user:`Olivier Grisel <ogrisel>`.
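As an aside on the `categorical_features` entry above, a minimal sketch (not part of this diff); note that in this release the categorical values themselves must still be ordinal-encoded as small non-negative integers, only their selection by column name is new:

    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingClassifier

    X = pd.DataFrame(
        {
            "color": [0, 1, 0, 2],  # e.g. red=0, blue=1, green=2
            "size": [1.0, 2.5, 3.0, 0.5],
        }
    )
    y = [0, 1, 0, 1]

    # The categorical column is selected by its feature name instead of an
    # index or boolean mask.
    clf = HistGradientBoostingClassifier(categorical_features=["color"])
    clf.fit(X, y)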

- |Enhancement| :class:`ensemble.StackingClassifier` now supports
multilabel-indicator target
:pr:`24146` by :user:`Nicolas Peretti <nicoperetti>`,
@@ -457,6 +466,12 @@ Changelog
:pr:`22710` by :user:`Conroy Trinh <trinhcon>` and
:pr:`23461` by :user:`Meekail Zain <micky774>`.

- |Enhancement| Adds an `"auto"` option to `eps` in :func:`metrics.log_loss`.
This option will automatically set the `eps` value depending on the data
type of `y_pred`. In addition, the default value of `eps` is changed from
`1e-15` to the new `"auto"` option.
:pr:`24354` by :user:`Safiuddin Khaja <Safikh>` and :user:`gsiisg <gsiisg>`.

- |FIX| :func:`metrics.log_loss` with `eps=0` now returns a correct value of 0 or
`np.inf` instead of `nan` for predictions at the boundaries (0 or 1). It also accepts
integer input.
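To make the two `log_loss` entries above concrete, a minimal sketch (not part of this diff), assuming the release this changelog describes:

    import numpy as np
    from sklearn.metrics import log_loss

    y_true = [0, 1, 1]
    y_pred = np.array([[1.0, 0.0], [0.1, 0.9], [0.2, 0.8]], dtype=np.float32)

    # eps="auto" (the new default) clips probabilities with the machine
    # epsilon of y_pred's dtype, here np.finfo(np.float32).eps, instead of a
    # fixed 1e-15.
    print(log_loss(y_true, y_pred, eps="auto"))

    # eps=0 disables clipping: predictions exactly at the 0/1 boundary now
    # yield 0 or np.inf rather than nan.
    print(log_loss(y_true, y_pred, eps=0))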
@@ -520,6 +535,11 @@ Changelog
:pr:`10468` by :user:`Ruben <icfly2>` and :pr:`22993` by
:user:`Jovan Stojanovic <jovan-stojanovic>`.

- |Enhancement| :class:`neighbors.NeighborsBase` now accepts
Minkowski semi-metric (i.e. when :math:`0 < p < 1` for
`metric="minkowski"`) for `algorithm="auto"` or `algorithm="brute"`.
:pr:`24750` by :user:`Rudresh Veerkhare <RudreshVeerkhare>`.
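A minimal sketch (not part of this diff) of the Minkowski semi-metric entry above:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.RandomState(0)
    X = rng.rand(20, 3)

    # For 0 < p < 1 the Minkowski "distance" is only a semi-metric (the
    # triangle inequality does not hold), which brute-force search can still
    # handle.
    nn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="minkowski", p=0.5)
    nn.fit(X)
    distances, indices = nn.kneighbors(X[:2])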

- |Efficiency| :class:`neighbors.NearestCentroid` is faster and requires
less memory as it better leverages CPUs' caches to compute predictions.
:pr:`24645` by :user:`Olivier Grisel <ogrisel>`.
@@ -564,6 +584,11 @@ Changelog
- |Fix| :class:`preprocessing.LabelEncoder` correctly encodes NaNs in `transform`.
:pr:`22629` by `Thomas Fan`_.

- |Enhancement| Added support for `sample_weight` in :class:`preprocessing.KBinsDiscretizer`.
This allows specifying weights for each sample to be used while
fitting. The option is only available when `strategy` is set to `quantile`.
:pr:`22048` by :user:`Seladus <seladus>`.
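A minimal sketch (not part of this diff) of the feature this PR adds; `sample_weight` is only honoured with `strategy="quantile"`:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
    # Up-weighting the last sample pulls the weighted quantiles (and hence
    # the bin edges) towards its value.
    sample_weight = np.array([1.0, 1.0, 1.0, 1.0, 5.0])

    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
    est.fit(X, sample_weight=sample_weight)
    print(est.bin_edges_)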

:mod:`sklearn.svm`
..................

12 changes: 8 additions & 4 deletions examples/applications/plot_cyclical_feature_engineering.py
@@ -37,7 +37,7 @@


fig, ax = plt.subplots(figsize=(12, 4))
average_week_demand = df.groupby(["weekday", "hour"]).mean()["count"]
average_week_demand = df.groupby(["weekday", "hour"])["count"].mean()
average_week_demand.plot(ax=ax)
_ = ax.set(
title="Average hourly bike demand during the week",
@@ -209,11 +209,15 @@
("categorical", ordinal_encoder, categorical_columns),
],
remainder="passthrough",
# Use short feature names to make it easier to specify the categorical
# variables in the HistGradientBoostingRegressor in the next
# step of the pipeline.
verbose_feature_names_out=False,
),
HistGradientBoostingRegressor(
categorical_features=range(4),
categorical_features=categorical_columns,
),
)
).set_output(transform="pandas")

# %%
#
@@ -263,7 +267,7 @@ def evaluate(model, X, y, cv):
import numpy as np


one_hot_encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)
one_hot_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
alphas = np.logspace(-6, 6, 25)
naive_linear_pipeline = make_pipeline(
ColumnTransformer(
2 changes: 1 addition & 1 deletion examples/applications/plot_model_complexity_influence.py
@@ -266,7 +266,7 @@ def plot_influence(conf, mse_values, prediction_times, complexities):
ax2.tick_params(axis="y", colors=line2.get_color())

plt.legend(
(line1, line2), ("prediction error", "prediction latency"), loc="upper right"
(line1, line2), ("prediction error", "prediction latency"), loc="upper center"
)

plt.title(
4 changes: 3 additions & 1 deletion examples/bicluster/plot_bicluster_newsgroups.py
@@ -81,7 +81,9 @@ def build_tokenizer(self):
cocluster = SpectralCoclustering(
n_clusters=len(categories), svd_method="arpack", random_state=0
)
kmeans = MiniBatchKMeans(n_clusters=len(categories), batch_size=20000, random_state=0)
kmeans = MiniBatchKMeans(
n_clusters=len(categories), batch_size=20000, random_state=0, n_init=3
)

print("Vectorizing...")
X = vectorizer.fit_transform(newsgroups.data)
2 changes: 1 addition & 1 deletion examples/cluster/plot_bisect_kmeans.py
@@ -44,7 +44,7 @@

for i, (algorithm_name, Algorithm) in enumerate(clustering_algorithms.items()):
for j, n_clusters in enumerate(n_clusters_list):
algo = Algorithm(n_clusters=n_clusters, random_state=random_state)
algo = Algorithm(n_clusters=n_clusters, random_state=random_state, n_init=3)
algo.fit(X)
centers = algo.cluster_centers_

4 changes: 3 additions & 1 deletion examples/cluster/plot_color_quantization.py
@@ -52,7 +52,9 @@
print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0, n_samples=1_000)
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
kmeans = KMeans(n_clusters=n_colors, n_init="auto", random_state=0).fit(
image_array_sample
)
print(f"done in {time() - t0:0.3f}s.")

# Get labels for all points
87 changes: 62 additions & 25 deletions examples/cluster/plot_dbscan.py
@@ -1,39 +1,53 @@
# -*- coding: utf-8 -*-
"""
===================================
Demo of DBSCAN clustering algorithm
===================================

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
finds core samples of high density and expands clusters from them.
This algorithm is good for data which contains clusters of similar density.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
samples in regions of high density and expands clusters from them. This
algorithm is good for data which contains clusters of similar density.

See the :ref:`sphx_glr_auto_examples_cluster_plot_cluster_comparison.py` example
for a demo of different clustering algorithms on 2D datasets.

"""

import numpy as np
# %%
# Data generation
# ---------------
#
# We use :class:`~sklearn.datasets.make_blobs` to create 3 synthetic clusters.

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


# %%
# Generate sample data
# --------------------
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(
n_samples=750, centers=centers, cluster_std=0.4, random_state=0
)

X = StandardScaler().fit_transform(X)

# %%
# We can visualize the resulting data:

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1])
plt.show()

# %%
# Compute DBSCAN
# --------------
#
# One can access the labels assigned by :class:`~sklearn.cluster.DBSCAN` using
# the `labels_` attribute. Noisy samples are given the label :math:`-1`.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
@@ -42,23 +56,46 @@

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))

# %%
# Clustering algorithms are fundamentally unsupervised learning methods.
# However, since :class:`~sklearn.datasets.make_blobs` gives access to the true
# labels of the synthetic clusters, it is possible to use evaluation metrics
# that leverage this "supervised" ground truth information to quantify the
# quality of the resulting clusters. Examples of such metrics are the
# homogeneity, completeness, V-measure, Rand-Index, Adjusted Rand-Index and
# Adjusted Mutual Information (AMI).
#
# If the ground truth labels are not known, evaluation can only be performed
# using the model results itself. In that case, the Silhouette Coefficient comes
# in handy.
#
# For more information, see the
# :ref:`sphx_glr_auto_examples_cluster_plot_adjusted_for_chance_measures.py`
# example or the :ref:`clustering_evaluation` module.

print(f"Homogeneity: {metrics.homogeneity_score(labels_true, labels):.3f}")
print(f"Completeness: {metrics.completeness_score(labels_true, labels):.3f}")
print(f"V-measure: {metrics.v_measure_score(labels_true, labels):.3f}")
print(f"Adjusted Rand Index: {metrics.adjusted_rand_score(labels_true, labels):.3f}")
print(
"Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels)
"Adjusted Mutual Information:"
f" {metrics.adjusted_mutual_info_score(labels_true, labels):.3f}"
)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
print(f"Silhouette Coefficient: {metrics.silhouette_score(X, labels):.3f}")

# %%
# Plot result
# -----------
import matplotlib.pyplot as plt
# Plot results
# ------------
#
# Core samples (large dots) and non-core samples (small dots) are color-coded
# according to the assigned cluster. Samples tagged as noise are represented in
# black.

# Black removed and is used for noise instead.
unique_labels = set(labels)
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
if k == -1:
@@ -87,5 +124,5 @@
markersize=6,
)

plt.title("Estimated number of clusters: %d" % n_clusters_)
plt.title(f"Estimated number of clusters: {n_clusters_}")
plt.show()
2 changes: 1 addition & 1 deletion examples/cluster/plot_dict_face_patches.py
@@ -42,7 +42,7 @@

print("Learning the dictionary... ")
rng = np.random.RandomState(0)
kmeans = MiniBatchKMeans(n_clusters=81, random_state=rng, verbose=True)
kmeans = MiniBatchKMeans(n_clusters=81, random_state=rng, verbose=True, n_init=3)
patch_size = (20, 20)

buffer = []
6 changes: 3 additions & 3 deletions examples/cluster/plot_optics.py
@@ -88,10 +88,10 @@
ax2.set_title("Automatic Clustering\nOPTICS")

# DBSCAN at 0.5
colors = ["g", "greenyellow", "olive", "r", "b", "c"]
for klass, color in zip(range(0, 6), colors):
colors = ["g.", "r.", "b.", "c."]
for klass, color in zip(range(0, 4), colors):
Xk = X[labels_050 == klass]
ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3, marker=".")
ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax3.plot(X[labels_050 == -1, 0], X[labels_050 == -1, 1], "k+", alpha=0.1)
ax3.set_title("Clustering at 0.5 epsilon cut\nDBSCAN")

10 changes: 5 additions & 5 deletions examples/decomposition/plot_faces_decomposition.py
@@ -153,7 +153,7 @@ def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):

# %%
batch_pca_estimator = decomposition.MiniBatchSparsePCA(
n_components=n_components, alpha=0.1, n_iter=100, batch_size=3, random_state=rng
n_components=n_components, alpha=0.1, max_iter=100, batch_size=3, random_state=rng
)
batch_pca_estimator.fit(faces_centered)
plot_gallery(
@@ -171,7 +171,7 @@ def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):

# %%
batch_dict_estimator = decomposition.MiniBatchDictionaryLearning(
n_components=n_components, alpha=0.1, n_iter=50, batch_size=3, random_state=rng
n_components=n_components, alpha=0.1, max_iter=50, batch_size=3, random_state=rng
)
batch_dict_estimator.fit(faces_centered)
plot_gallery("Dictionary learning", batch_dict_estimator.components_[:n_components])
@@ -272,7 +272,7 @@ def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
dict_pos_dict_estimator = decomposition.MiniBatchDictionaryLearning(
n_components=n_components,
alpha=0.1,
n_iter=50,
max_iter=50,
batch_size=3,
random_state=rng,
positive_dict=True,
@@ -294,7 +294,7 @@ def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
dict_pos_code_estimator = decomposition.MiniBatchDictionaryLearning(
n_components=n_components,
alpha=0.1,
n_iter=50,
max_iter=50,
batch_size=3,
fit_algorithm="cd",
random_state=rng,
@@ -318,7 +318,7 @@ def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
dict_pos_estimator = decomposition.MiniBatchDictionaryLearning(
n_components=n_components,
alpha=0.1,
n_iter=50,
max_iter=50,
batch_size=3,
fit_algorithm="cd",
random_state=rng,
2 changes: 1 addition & 1 deletion examples/decomposition/plot_ica_blind_source_separation.py
@@ -44,7 +44,7 @@
from sklearn.decomposition import FastICA, PCA

# Compute ICA
ica = FastICA(n_components=3)
ica = FastICA(n_components=3, whiten="arbitrary-variance")
S_ = ica.fit_transform(X) # Reconstruct signals
A_ = ica.mixing_ # Get estimated mixing matrix
