Provide examples on how to customize the scikit-learn classes

Describe the issue linked to the documentation

Recently I had to implement a custom CV Splitter for a project I'm working on. My first instinct was to look in the documentation for examples of how this could be done. I could not find anything concrete, but before long I found the Glossary of Common Terms and API Elements. Although not exactly what I hoped to find, it does have a section on CV Splitters. From there I learned that they are expected to have split and get_n_splits methods, and by following some other links in the docs I could find what arguments they take and what they should return.
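To make that contract concrete, here is a small sketch that inspects a built-in splitter. KFold and the toy array are my own choices for illustration; the point is only that split yields pairs of train and test index arrays, and get_n_splits returns the number of such pairs.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data for illustration only: 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)

cv = KFold(n_splits=3)

# get_n_splits returns how many (train, test) pairs split will yield.
n_splits = cv.get_n_splits(X)

# split is a generator of (train_indices, test_indices) tuples,
# where each element is a 1-D integer array indexing the rows of X.
folds = list(cv.split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A custom splitter that exposes the same two methods with the same return types should be usable wherever scikit-learn accepts a cv argument.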

Although all the information is in fact there, I believe less experienced users may find it harder to piece it all together. I was wondering whether it would be beneficial for all users to have a section in the documentation with examples of how to customize the scikit-learn classes to suit the user's needs. After all, I understand the library was developed with an API in mind that allows for exactly this flexibility and customization.

I know this is not a small task and may add a non-trivial maintenance burden for the team, but I would like to understand how the maintainers would feel about a space in the documentation for these customization examples. Of course, as the person suggesting it, I would be happy to contribute to this.

Suggest a potential alternative/fix

One way I could see this taking shape is a dedicated page in the documentation where examples of customized classes are demonstrated. I think it's also important to show how the customized class would be used as part of a larger pipeline, and to let the user copy and paste the code into their working environment.
I'll leave an example of a custom CV Splitter below for discussion. The idea would be to then expand to the most commonly used classes.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

class CustomSplitter:
    """Minimal CV splitter: contiguous folds, no shuffling."""

    def __init__(self, n_folds=5) -> None:
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        # X is required; y and groups are accepted only for API compatibility.
        idxs = np.arange(X.shape[0])
        folds = np.array_split(idxs, self.get_n_splits())
        for fold_idx, test_idxs in enumerate(folds):
            # Train on every fold except the one held out for testing.
            train_idxs = np.concatenate(
                [fold for idx, fold in enumerate(folds) if idx != fold_idx]
            )
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds

clf = LogisticRegression(random_state=42)
scores = cross_val_score(clf, X, y, cv=CustomSplitter(n_folds=5))
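As a rough sketch of the "larger pipeline" use case, the same kind of splitter could be passed to GridSearchCV around a preprocessing pipeline. The class name ContiguousFoldSplitter, the parameter grid, and the shuffling step are my own assumptions for illustration, not something taken from the scikit-learn docs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle


class ContiguousFoldSplitter:
    """Hypothetical splitter for illustration: contiguous, unshuffled folds."""

    def __init__(self, n_folds=5):
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        folds = np.array_split(np.arange(len(X)), self.n_folds)
        for i, test_idxs in enumerate(folds):
            # Train on every fold except the held-out one.
            train_idxs = np.concatenate(
                [fold for j, fold in enumerate(folds) if j != i]
            )
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds


X, y = load_iris(return_X_y=True)
# The iris samples are ordered by class, so shuffle once up front;
# otherwise contiguous folds would hold out a single class at a time.
X, y = shuffle(X, y, random_state=0)

# Scaling and the classifier are tuned together, with the custom splitter
# deciding which rows are used for fitting and which for scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=ContiguousFoldSplitter(n_folds=5),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because the splitter is just an object with split and get_n_splits, nothing else in the pipeline needs to know it is custom.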
