FIX `param_distribution` param of `HalvingRandomSearchCV` accepts list of dicts #26893

StefanieSenger · 2023-07-24T23:29:37Z

What does this implement/fix? Explain your changes.

Closes #26885
Fixes HalvingRandomSearchCV so that its param_distributions parameter accepts a list of dicts, and updates the documentation accordingly.
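For context, a list of dicts describes several independent sub-spaces, and each candidate is drawn from exactly one of them. The snippet below is a minimal pure-Python sketch of that idea, not scikit-learn's implementation: `sample_candidates` is a hypothetical helper, lists stand in for discrete choices, and callables that take a random generator stand in for distributions such as `expon(scale=10)`.

```python
import random


def sample_candidates(param_distributions, n_candidates, seed=0):
    """Draw candidates from a dict or a list of dicts (illustrative only)."""
    rng = random.Random(seed)
    # Wrap a single dict so both input forms share one code path.
    if isinstance(param_distributions, dict):
        param_distributions = [param_distributions]

    candidates = []
    for _ in range(n_candidates):
        # First pick one sub-space, then sample every parameter inside it.
        space = rng.choice(param_distributions)
        candidate = {}
        for name, dist in sorted(space.items()):
            if callable(dist):
                candidate[name] = dist(rng)  # stand-in for a distribution
            else:
                candidate[name] = rng.choice(dist)  # discrete list of values
        candidates.append(candidate)
    return candidates


# Hypothetical search space mirroring the rbf/poly split discussed in this PR.
params = [
    {"kernel": ["rbf"], "C": lambda rng: rng.expovariate(1 / 10)},
    {"kernel": ["poly"], "degree": [2, 3]},
]
candidates = sample_candidates(params, n_candidates=20)
```

Each sampled candidate only carries the keys of the sub-space it came from, so an rbf candidate has no `degree` and a poly candidate has no `C`.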

I also tried to implement a test, using test_random_search_cv_results as a template, as you suggested @glemaitre, but I ran into several problems that I could not resolve.

The template implementation calls two helper functions (check_cv_results_array_types and check_cv_results_keys) that check and compare which params occur. But those params might not always be present: for example, 'param_degree' is only a key in cv_results for the poly kernel, not for rbf. (The halving search's cv_results also has two additional keys compared to the other searches these helpers are used for: "iter" and "n_resources".)
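As I understand it, the `param_*` columns of cv_results cover the union of the parameters that were actually sampled, with missing entries masked. Here is a pure-Python sketch of that layout, where a `(values, mask)` pair of lists stands in for a NumPy masked array and `build_param_columns` is a hypothetical helper, not a scikit-learn function.

```python
def build_param_columns(candidates):
    """Return {"param_<name>": (values, mask)} over the union of all keys."""
    names = sorted({name for cand in candidates for name in cand})
    columns = {}
    for name in names:
        # None marks a missing value; the mask is True where it is missing.
        values = [cand.get(name) for cand in candidates]
        mask = [name not in cand for cand in candidates]
        columns["param_" + name] = (values, mask)
    return columns


candidates = [
    {"kernel": "rbf", "C": 1.0, "gamma": 0.1},
    {"kernel": "poly", "degree": 2},
]
columns = build_param_columns(candidates)
degree_values, degree_mask = columns["param_degree"]
# degree_mask is [True, False]: masked for the rbf row, present for poly.

# If only rbf candidates happen to be sampled, there is no "param_degree"
# column at all, which is how a lookup of that key can raise KeyError.
only_rbf = build_param_columns(candidates[:1])
```

This is why a test that hard-codes `param_degree` can fail when the sampled candidates do not include the poly kernel.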

I cannot see a way to reuse the assertions at the end of the template test, because HalvingGridSearchCV masks some of the candidates as part of the process. So I assume checking for this is not going to work.

I determined the value n_proportion = 6 by looking at cv_results[key].shape, which is surely the wrong way around.

I have commented out the test and hope for your advice. At the moment the test fails with a KeyError ('param_degree') in one of the util functions.

I could write a much simpler test that only checks that param_distribution can be passed as a list of dicts.

github-actions · 2023-07-24T23:31:07Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5afd7d0. Link to the linter CI: here

jeremiedbb · 2023-07-25T14:36:05Z

I think the failures mainly come from a dataset that is too small and from too few candidates. Here's a slightly modified version:

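# Imports this snippet relies on; check_cv_results_keys and
# check_cv_results_array_types are helpers defined in test_search.py.
from scipy.stats import expon
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.svm import SVC


def test_halving_random_search_cv_results():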
    X, y = make_classification(n_samples=150, n_features=4, random_state=42)

    params = [
        {"kernel": ["rbf"], "C": expon(scale=10), "gamma": expon(scale=0.1)},
        {"kernel": ["poly"], "degree": [2, 3]},
    ]
    param_keys = ("param_C", "param_degree", "param_gamma", "param_kernel")
    score_keys = (
        "mean_test_score",
        "mean_train_score",
        "rank_test_score",
        "split0_test_score",
        "split1_test_score",
        "split2_test_score",
        "split0_train_score",
        "split1_train_score",
        "split2_train_score",
        "std_test_score",
        "std_train_score",
        "mean_fit_time",
        "std_fit_time",
        "mean_score_time",
        "std_score_time",
    )
    extra_keys = ("n_resources", "iter")

    search = HalvingRandomSearchCV(
        SVC(),
        cv=3,
        param_distributions=params,
        return_train_score=True,
        random_state=0,
    )
    search.fit(X, y)
    n_candidates = sum(search.n_candidates_)

    cv_results = search.cv_results_
    # Check results structure
    check_cv_results_keys(
        cv_results, param_keys, score_keys, n_candidates, extra_keys
    )
    check_cv_results_array_types(search, param_keys, score_keys)

    assert all(
        (
            cv_results["param_C"].mask[i]
            and cv_results["param_gamma"].mask[i]
            and not cv_results["param_degree"].mask[i]
        )
        for i in range(n_candidates)
        if cv_results["param_kernel"][i] == "poly"
    )
    assert all(
        (
            not cv_results["param_C"].mask[i]
            and not cv_results["param_gamma"].mask[i]
            and cv_results["param_degree"].mask[i]
        )
        for i in range(n_candidates)
        if cv_results["param_kernel"][i] == "rbf"
    )

- More samples, for two reasons:
  - We're testing something that does cross-validation, so the dataset will be split, and we need both classes in the train and test sets as often as possible.
  - Successive halving uses n_samples as a resource and relies on it to determine the number of candidates for the first round. It therefore starts with fewer samples, which brings us back to the previous point. Also, a dataset that is too small can lead to only a single type of kernel being selected for the first round, for instance.
- n_candidates is not a parameter, so we can't easily know in advance what it will be. Better to just use the n_candidates_ attribute.
- Set random_state to ensure reproducible results. I tested with 100 different seeds and the test always passed.
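To make the link between n_samples and the number of candidates concrete, here is a simplified sketch of a successive-halving schedule. It is an illustration under assumptions, not scikit-learn's code: `halving_schedule` is a hypothetical helper, `factor=3` mirrors the documented default, and `min_resources=12` assumes the smallest-resource heuristic works out to `n_splits * n_classes * 2` for a binary classifier with `cv=3`.

```python
import math


def halving_schedule(n_samples, min_resources, factor=3):
    """Return [(n_resources, n_candidates), ...] for each halving round."""
    # Assumption: the first round tries as many candidates as the data allows.
    n_candidates = n_samples // min_resources
    n_resources = min_resources
    schedule = []
    while n_candidates >= 1 and n_resources <= n_samples:
        schedule.append((n_resources, n_candidates))
        if n_candidates == 1:
            break
        # Keep the best 1/factor of the candidates, give them factor x data.
        n_candidates = math.ceil(n_candidates / factor)
        n_resources *= factor
    return schedule


large = halving_schedule(n_samples=150, min_resources=12)
small = halving_schedule(n_samples=50, min_resources=12)
total_candidates = sum(n for _, n in large)
```

Under these assumptions, 150 samples give three rounds with 12, 4 and 2 candidates (18 rows in total), while 50 samples give only two rounds with 4 and 2 candidates, which makes it much more likely that one kernel never shows up in the results.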

I'd rather set iter and n_resources as extra_keys and modify check_cv_results_keys instead.

def check_cv_results_keys(cv_results, param_keys, score_keys, n_cand, extra_keys=()):
    # Test the search.cv_results_ contains all the required results
    all_keys = param_keys + score_keys + extra_keys
    assert_array_equal(
        sorted(cv_results.keys()), sorted(all_keys + ("params",))
    )
    assert all(cv_results[key].shape == (n_cand,) for key in all_keys)
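The effect of the `extra_keys` argument can be tried out without scikit-learn. The following is a plain-Python analogue, offered only as a sketch: `check_keys` and `fake_results` are hypothetical, and `len()` on lists replaces the `.shape` check on arrays.

```python
def check_keys(cv_results, param_keys, score_keys, n_cand, extra_keys=()):
    # Same idea as the helper above, but for plain lists instead of arrays.
    all_keys = param_keys + score_keys + extra_keys
    assert sorted(cv_results) == sorted(all_keys + ("params",))
    assert all(len(cv_results[key]) == n_cand for key in all_keys)


# Made-up results with the two halving-specific keys.
fake_results = {
    "param_C": [1.0, 2.0],
    "mean_test_score": [0.8, 0.9],
    "iter": [0, 0],
    "n_resources": [12, 12],
    "params": [{"C": 1.0}, {"C": 2.0}],
}

# Passes once the halving-specific keys are declared as extra_keys.
check_keys(
    fake_results,
    ("param_C",),
    ("mean_test_score",),
    2,
    extra_keys=("iter", "n_resources"),
)

# Without extra_keys the key comparison fails, as it would for halving results.
try:
    check_keys(fake_results, ("param_C",), ("mean_test_score",), 2)
    passed_without_extra_keys = True
except AssertionError:
    passed_without_extra_keys = False
```

Keeping `extra_keys` as an optional argument with an empty default means the existing grid and random search tests can call the helper unchanged.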

jeremiedbb

Please also add a changelog entry for 1.3.1 in v1.3.rst

sklearn/model_selection/tests/test_search.py

StefanieSenger · 2023-07-26T19:17:07Z

Thanks for the review, @jeremiedbb, and for your help and explanations. I have made the changes according to your suggestions and mostly understood your reasoning.

I still haven't understood the asserts for the masked arrays though (all the candidates appeared as masked_array for both kernels, when I checked), and I will talk with @adrinjalali about it tomorrow.

StefanieSenger · 2023-07-27T12:26:24Z

After reviewing this together with @adrinjalali, I have also modified the two not fully functioning tests I had indicated in the issue (#26885). Please have a look. :)

sklearn/model_selection/tests/test_search.py

StefanieSenger · 2023-08-01T10:42:41Z

@glemaitre

glemaitre

LGTM on my side.

sklearn/model_selection/tests/test_successive_halving.py

glemaitre · 2023-08-01T13:20:39Z

And you would need to solve the conflict in the changelog.

betatim

Thanks for the fix! LGTM (looks good to me)

StefanieSenger · 2023-08-04T11:39:01Z

I've finished the last few things. Thanks everyone for your support. :)

StefanieSenger added 2 commits July 24, 2023 23:36

HalvingRandomSearchCV with list format input and test attempted

62db7fa

swapped to HalvingRandomSearchCV, test outcommented, now fails

6c7e4cb

github-actions bot added the module:model_selection label Jul 24, 2023

StefanieSenger changed the title ~~FIX param_distribution param of HalvingRandomSearchCV accepts lists of dicts~~ FIX param_distribution param of HalvingRandomSearchCV accepts list of dicts Jul 25, 2023

jeremiedbb reviewed Jul 25, 2023

sklearn/model_selection/tests/test_search.py (outdated)

changes after review

6227903

StefanieSenger marked this pull request as ready for review July 26, 2023 18:52

bug fix and cleanup in test_grid_search_cv_results and test_random_search_cv_results

98f9412

adrinjalali reviewed Jul 28, 2023

sklearn/model_selection/tests/test_search.py (outdated)

StefanieSenger and others added 2 commits July 28, 2023 15:22

specific length test

4d90784

Merge branch 'main' into HalvingRandomSearchCV

c675e36

glemaitre self-requested a review August 1, 2023 11:42

glemaitre reviewed Aug 1, 2023

sklearn/model_selection/tests/test_successive_halving.py (outdated)

betatim approved these changes Aug 3, 2023

StefanieSenger and others added 2 commits August 4, 2023 13:30

Update sklearn/model_selection/tests/test_successive_halving.py

564ab35

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Merge branch 'main' into HalvingRandomSearchCV

5afd7d0

adrinjalali approved these changes Aug 7, 2023

adrinjalali merged commit 3725ac1 into scikit-learn:main Aug 7, 2023

TamaraAtanasoska pushed a commit to TamaraAtanasoska/scikit-learn that referenced this pull request Aug 21, 2023

FIX param_distribution param of HalvingRandomSearchCV accepts list of dicts (scikit-learn#26893)

03ed58b

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

StefanieSenger deleted the HalvingRandomSearchCV branch August 23, 2023 11:16

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023

FIX param_distribution param of HalvingRandomSearchCV accepts list of dicts (scikit-learn#26893)

671a6e2

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

jeremiedbb pushed a commit that referenced this pull request Sep 20, 2023

FIX param_distribution param of HalvingRandomSearchCV accepts list of dicts (#26893)

d82fd2a

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

FIX param_distribution param of HalvingRandomSearchCV accepts list of dicts (scikit-learn#26893)

6400cb1

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
