Provide examples on how to customize the scikit-learn classes · Issue #28828 · scikit-learn/scikit-learn · GitHub
Provide examples on how to customize the scikit-learn classes #28828
Open
@miguelcsilva

Description

Describe the issue linked to the documentation

Recently I had to implement a custom CV splitter for a project I'm working on. My first instinct was to look in the documentation for examples of how this could be done. I could not find anything concrete, but I soon found the Glossary of Common Terms and API Elements. Although not exactly what I hoped to find, it does have a section on CV splitters. From there I learned that they are expected to have split and get_n_splits methods, and by following some other links in the docs I could find what arguments they take and what they should return.
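For reference, the contract described above can be observed on a built-in splitter. This is a minimal sketch using `KFold` on toy arrays that are made up for illustration: `get_n_splits` returns the number of folds, and `split` yields one `(train_idxs, test_idxs)` pair of integer index arrays per fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data, invented for this illustration only: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

cv = KFold(n_splits=5)

# get_n_splits returns the number of (train, test) pairs that split will yield
n_splits = cv.get_n_splits(X, y)

# split is a generator of (train indices, test indices) tuples
folds = list(cv.split(X, y))
for train_idxs, test_idxs in folds:
    print("train:", train_idxs, "test:", test_idxs)
```

A custom splitter only needs to reproduce this behavior: the same two methods, with index arrays as the yielded values.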

Although all the information is in fact there, I believe that less experienced users may find it harder to piece it all together. I was wondering whether it would be beneficial for all users to have a section in the documentation with examples of how to customize the scikit-learn classes to suit the user's needs. After all, I understand the library was developed with an API in mind that would allow exactly this flexibility and customization.

I know this is not a small task and may add a non-trivial maintenance burden for the team, but I would like to understand how the maintainers would feel about a space in the documentation for these customization examples. Of course, as the person suggesting it, I would be happy to contribute to this.

Suggest a potential alternative/fix

One way I could see this taking shape is a dedicated page in the documentation where examples of customized classes are demonstrated. I think it's also important to show how the customized class would be used as part of a larger pipeline, and to allow the user to copy and paste the code into their working environment.
Below is an example of a custom CV splitter for discussion. The idea would then be to expand to the most commonly used classes.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

class CustomSplitter:
    """Minimal CV splitter: contiguous, unshuffled folds."""

    def __init__(self, n_folds=5) -> None:
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        # y and groups are accepted for API compatibility; only X is required
        if y is not None and X.shape[0] != y.shape[0]:
            raise ValueError("X and y must have the same number of samples")
        idxs = np.arange(X.shape[0])
        folds = np.array_split(idxs, self.get_n_splits())
        for fold_idx, test_idxs in enumerate(folds):
            # Train on every fold except the one held out for testing
            train_idxs = np.concatenate([fold for idx, fold in enumerate(folds) if idx != fold_idx])
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds

clf = LogisticRegression(random_state=42)
scores = cross_val_score(clf, X, y, cv=CustomSplitter(n_folds=5))
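To illustrate the "larger pipeline" usage mentioned above, here is a rough sketch of the same kind of splitter passed to both `cross_val_score` and `GridSearchCV` around a scaling + logistic regression pipeline. The class name `ContiguousSplitter`, the shuffle seed, the `max_iter` value, and the `C` grid are all assumptions made for this illustration, not part of the original proposal. The data is shuffled once up front because iris is ordered by class and this splitter does not shuffle.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class ContiguousSplitter:
    """Hypothetical splitter: contiguous, unshuffled folds."""

    def __init__(self, n_folds=5):
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        folds = np.array_split(np.arange(len(X)), self.n_folds)
        for i, test_idxs in enumerate(folds):
            # Train on every fold except the held-out one
            train_idxs = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds


X, y = datasets.load_iris(return_X_y=True)

# iris is sorted by class, so shuffle once (seed chosen arbitrarily)
order = np.random.default_rng(0).permutation(len(X))
X, y = X[order], y[order]

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = ContiguousSplitter(n_folds=5)

# The splitter object is passed through the usual cv= argument
scores = cross_val_score(pipe, X, y, cv=cv)

# The same object can drive a hyperparameter search (illustrative grid)
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=cv)
search.fit(X, y)
print(scores.mean(), search.best_params_)
```

Because the splitter is duck-typed (it only provides `split` and `get_n_splits`), it does not need to inherit from any scikit-learn base class to be accepted here.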

Labels: Documentation, Moderate (anything that requires some knowledge of conventions and best practices), help wanted
