Add missing value support for AdaBoost? · Issue #30477 · scikit-learn/scikit-learn

Closed
jcbioinformatics opened this issue Dec 12, 2024 · 7 comments

@jcbioinformatics

Describe the bug

I am working on classifying samples in various datasets using the AdaBoostClassifier with the DecisionTreeClassifier as the base estimator.

The DecisionTreeClassifier can handle np.nan values, so I assumed the AdaBoostClassifier would be able to as well.

However, that does not seem to be the case: AdaBoostClassifier raises ValueError: Input X contains NaN when I try to use it with data containing NaNs.

I asked if this was the intended behavior here but have yet to receive a response.

If this isn't intentional, could AdaBoostClassifier be updated to support missing values when the base estimator does?

Steps/Code to Reproduce

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import numpy as np


iris = load_iris()

# Set first position to nan
iris.data[0,0] = np.nan

# Confirm DecisionTreeClassifier still works
clf_works = DecisionTreeClassifier(max_depth=1)
cross_val_score(clf_works, iris.data, iris.target, cv=3)

# Explicitly call DecisionTreeClassifier as the base estimator
clf = AdaBoostClassifier(random_state=0, estimator=DecisionTreeClassifier(max_depth=1)) 

# Attempt to use AdaBoostClassifier w/ data containing nan
cross_val_score(clf, iris.data, iris.target, cv=3)

Expected Results

No error should be thrown when DecisionTreeClassifier(max_depth=1) is used as the base estimator, since DecisionTreeClassifier can handle np.nan values.

Because of that, I expected AdaBoostClassifier to fit and score successfully too.

Actual Results

C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\model_selection\_validation.py:976: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\metrics\_scorer.py", line 144, in __call__
    score = scorer(estimator, *args, **routed_params.get(name).score)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\metrics\_scorer.py", line 472, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\base.py", line 513, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
                             ^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\ensemble\_weight_boosting.py", line 628, in predict
    pred = self.decision_function(X)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\ensemble\_weight_boosting.py", line 689, in decision_function
    X = self._check_X(X)
        ^^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\ensemble\_weight_boosting.py", line 94, in _check_X
    return validate_data(
           ^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 2933, in validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 1106, in check_array
    _assert_all_finite(
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 120, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 169, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
AdaBoostClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

  warnings.warn(
C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\model_selection\_validation.py:526: FitFailedWarning: 
2 fits failed out of a total of 3.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\model_selection\_validation.py", line 864, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\base.py", line 1330, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\ensemble\_weight_boosting.py", line 130, in fit
    X, y = validate_data(
           ^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 2950, in validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 1369, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 1106, in check_array
    _assert_all_finite(
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 120, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "C:\Users\pacea\miniconda3\envs\jupyter\Lib\site-packages\sklearn\utils\validation.py", line 169, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
AdaBoostClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

  warnings.warn(some_fits_failed_message, FitFailedWarning)
array([nan, nan, nan])

Versions

System:
    python: 3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:24:51) [MSC v.1940 64 bit (AMD64)]
executable: C:\Users\pacea\miniconda3\envs\jupyter\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.6.0rc1
          pip: 24.2
   setuptools: 74.1.2
        numpy: 1.26.4
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.9.2
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
    num_threads: 8
         prefix: libblas
       filepath: C:\Users\pacea\miniconda3\envs\jupyter\Library\bin\libblas.dll
        version: 2024.1-Product
threading_layer: intel

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: vcomp
       filepath: C:\Users\pacea\miniconda3\envs\jupyter\vcomp140.dll
        version: None
@jcbioinformatics jcbioinformatics added Bug Needs Triage Issue requires triage labels Dec 12, 2024
@lesteve
Member
lesteve commented Dec 13, 2024

Indeed, missing values are not supported in AdaBoostClassifier; see our list of estimators that handle missing values.

If you can, I would suggest using HistGradientBoostingClassifier, which supports missing values, as the error message suggests.
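
For reference, a minimal sketch adapting the reproduction script above to the suggested alternative:

from sklearn.datasets import load_iris
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

iris = load_iris()
iris.data[0, 0] = np.nan  # same missing value as in the reproduction script

# HistGradientBoostingClassifier handles NaN inputs natively, so
# cross-validation completes without raising ValueError.
clf = HistGradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, iris.data, iris.target, cv=3))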

Even though trees support missing values, adding missing value support to AdaBoost is not trivial, and I would think it is unlikely to be a priority for us.

To get an idea of the kind of work we are talking about, see #26391 for RandomForest and #27966 for ExtraTrees.

@lesteve lesteve added New Feature and removed Bug labels Dec 13, 2024
@lesteve lesteve changed the title AdaBoost does not accept NaN values even when base estimator supports them Add missing value support for AdaBoost Dec 13, 2024
@lesteve lesteve changed the title Add missing value support for AdaBoost Add missing value support for AdaBoost? Dec 13, 2024
@jcbioinformatics
Author

Thank you for the reply and for providing those links.

For my current use case, I'm essentially benchmarking a bunch of different classification methods with various datasets, and I was hoping to include AdaBoostClassifier (HistGradientBoostingClassifier is already in the comparison).

I was able to get AdaBoostClassifier to work with missing values locally by changing two parts of the code.

In BaseWeightBoosting._check_X, changing

    def _check_X(self, X):
        # Only called to validate X in non-fit methods, therefore reset=False
        return validate_data(
            self,
            X,
            accept_sparse=["csr", "csc"],
            ensure_2d=True,
            allow_nd=True,
            dtype=None,
            reset=False,
        )

to

    def _check_X(self, X):
        #if self.estimator_._support_missing_values(X):
        if type(self.estimator_) is DecisionTreeClassifier:
            ensure_all_finite = "allow-nan"
        else:
            ensure_all_finite = True
        # Only called to validate X in non-fit methods, therefore reset=False
        return validate_data(
            self,
            X,
            accept_sparse=["csr", "csc"],
            ensure_2d=True,
            allow_nd=True,
            dtype=None,
            reset=False,
            ensure_all_finite=ensure_all_finite,
        )

And in BaseWeightBoosting.fit, changing

        _raise_for_unsupported_routing(self, "fit", sample_weight=sample_weight)
        X, y = validate_data(
            self,
            X,
            y,
            accept_sparse=["csr", "csc"],
            ensure_2d=True,
            allow_nd=True,
            dtype=None,
            y_numeric=is_regressor(self),
        )

        sample_weight = _check_sample_weight(
            sample_weight, X, np.float64, copy=True, ensure_non_negative=True
        )
        sample_weight /= sample_weight.sum()

        # Check parameters
        self._validate_estimator()

to

        self._validate_estimator()

        if type(self.estimator_) is DecisionTreeClassifier:
            ensure_all_finite = "allow-nan"
        else:
            ensure_all_finite = True
                
        _raise_for_unsupported_routing(self, "fit", sample_weight=sample_weight)
        X, y = validate_data(
            self,
            X,
            y,
            accept_sparse=["csr", "csc"],
            ensure_2d=True,
            allow_nd=True,
            dtype=None,
            y_numeric=is_regressor(self),
            ensure_all_finite=ensure_all_finite,
        )

        sample_weight = _check_sample_weight(
            sample_weight, X, np.float64, copy=True, ensure_non_negative=True
        )
        sample_weight /= sample_weight.sum()

This approach worked for my own use. However, I know it would be better to check directly whether the base estimator supports missing values, instead of just checking that it is a DecisionTreeClassifier.

I also did not do any benchmarking to see how the change affects runtime performance, so all I can say for now is that my changes allow AdaBoostClassifier to run with missing values. And I didn't add a parameter for toggling missing value support, since I know missing values are expected in my case.

So I realize there's a big gap between what I did and something that could be merged as a pull request. But there was also a thread on Stack Overflow asking about using AdaBoostClassifier with missing values, so I think there could be broader interest in supporting this functionality.

@lesteve lesteve removed the Needs Triage Issue requires triage label Dec 16, 2024
@lesteve
Member
lesteve commented Dec 16, 2024

Great to hear that you managed to get it to work for your use case. In my opinion, hackability is an underrated aspect of scikit-learn: people find their way around the code and can tweak parts of it to explore ideas. That doesn't mean every possible variation should go into the library, because we need to balance features against maintainability.

I would expect that in your benchmarks HistGradientBoosting outperforms AdaBoost, both in terms of generalization performance and computational efficiency (time taken by fit or predict).
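
For anyone curious, a rough timing sketch along these lines (a toy comparison on synthetic data, not a rigorous benchmark):

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset without missing values, since AdaBoostClassifier
# currently rejects NaN inputs.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for name, clf in [
    ("AdaBoost", AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1), random_state=0)),
    ("HistGradientBoosting", HistGradientBoostingClassifier(random_state=0)),
]:
    start = perf_counter()
    clf.fit(X, y)
    print(f"{name}: fit took {perf_counter() - start:.2f}s")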

@jcbioinformatics
Author

Thanks!

I agree it's really cool that scikit-learn is easy enough to modify that I could get AdaBoostClassifier working like that. You're right, too, that HistGradientBoosting performed better anyway, at least in terms of accuracy, and likely in computational speed as well, judging from how long each seemed to take to run.

Part of me would still like to see missing value support eventually added, but I wouldn't be able to attempt it myself anytime in the near future (and whatever I'd come up with would probably need at least a little tweaking, even if it were considered worth adding). And I realize that in most cases HistGradientBoosting is expected to outperform AdaBoostClassifier.

@ogrisel
Member
ogrisel commented Apr 4, 2025

I had a quick glance at your code change, and I think the meta-estimator should inspect the __sklearn_tags__ of the base estimator instead of testing if type(self.estimator_) is DecisionTreeClassifier.
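
For illustration, a minimal sketch of such a tag-based check, assuming the tags API available as of scikit-learn 1.6 (sklearn.utils.get_tags exposing input_tags.allow_nan):

from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import get_tags

# Ask the estimator itself whether it accepts NaN inputs, rather than
# hard-coding a type check against DecisionTreeClassifier.
estimator = DecisionTreeClassifier(max_depth=1)
if get_tags(estimator).input_tags.allow_nan:
    ensure_all_finite = "allow-nan"
else:
    ensure_all_finite = True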

You are welcome to open a PR if you have the time to contribute your fix to scikit-learn.

Otherwise, let's close this issue if nobody else is interested enough to write a PR.

@jcbioinformatics
Author

Shoot, yes, that would have been the better check. I'm on a different project now, so I'm not sure if or when I'd be able to get this into good enough shape to be merged. I'm fine with the issue being closed if no one else is able to tackle it either.

@ogrisel
Member
ogrisel commented Apr 7, 2025

Let's close, but if someone else is interested in fixing this, feel free to open a PR and link to the discussion in this issue from the PR description.

@ogrisel ogrisel closed this as completed Apr 7, 2025