Description
Describe the issue linked to the documentation
Recently I had to implement a custom CV splitter for a project I'm working on. My first instinct was to look in the documentation for examples of how this could be done. I could not find anything concrete, but before long I found the Glossary of Common Terms and API Elements. Although not exactly what I hoped to find, it does have a section on CV splitters. From there I learned that they are expected to have split and get_n_splits methods, and by following some other links in the docs I could find what arguments they take and what they should return.
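As a rough illustration of the interface described above, inspecting a built-in splitter such as KFold shows what the two methods return: get_n_splits gives the number of folds, and split yields pairs of train and test index arrays. The toy array X is made up for this sketch and is not part of my project.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (an assumption for illustration): 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)
cv = KFold(n_splits=3)

# get_n_splits returns the number of train/test pairs the splitter produces.
n_splits = cv.get_n_splits(X)

# split yields (train_idxs, test_idxs) tuples of integer index arrays.
pairs = list(cv.split(X))

for train_idxs, test_idxs in pairs:
    print("train:", train_idxs, "test:", test_idxs)
```

A custom splitter only needs to mimic this behaviour: same method names, same kind of return values.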
Although all the information is in fact there, I believe that less experienced users may find it harder to piece everything together, and I was wondering whether it would be beneficial for all users to have a section in the documentation with examples of how to customize the scikit-learn classes to suit the user's needs. After all, I understand the library was designed with an API that allows for exactly this kind of flexibility and customization.
I know this is not a small task, and it may add a non-trivial maintenance burden for the team, but I would like to understand how the maintainers would feel about a space in the documentation for these customization examples. Of course, as the person suggesting it, I would be happy to contribute to this.
Suggest a potential alternative/fix
One way I could see this taking shape is a dedicated page in the documentation where examples of customized classes are demonstrated. I think it's also important to show how the customized class would be used as part of a larger pipeline, and to let the user copy and paste the code into their working environment.
I'll leave an example of a custom CV splitter below for discussion. The idea would then be to expand to the most commonly used classes.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)


class CustomSplitter:
    """Minimal CV splitter producing contiguous, unshuffled folds."""

    def __init__(self, n_folds=5) -> None:
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        # Yield one (train_idxs, test_idxs) pair of index arrays per fold.
        if y is not None:
            assert X.shape[0] == y.shape[0]
        idxs = np.arange(X.shape[0])
        folds = np.array_split(idxs, self.get_n_splits())
        for test_fold, test_idxs in enumerate(folds):
            # Train on every fold except the one held out for testing.
            train_idxs = np.concatenate(
                [fold for i, fold in enumerate(folds) if i != test_fold]
            )
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds


clf = LogisticRegression(random_state=42)
scores = cross_val_score(clf, X, y, cv=CustomSplitter(n_folds=5))
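To give an idea of the "larger pipeline" part of the proposal, here is a minimal sketch of a custom splitter passed to GridSearchCV around a Pipeline. ContiguousSplitter is a hypothetical name for a compact variant of the class above, and the shuffling step, the scaler and the parameter grid are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class ContiguousSplitter:
    """Hypothetical splitter: contiguous folds, no shuffling."""

    def __init__(self, n_folds=5):
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        folds = np.array_split(np.arange(len(X)), self.n_folds)
        for i, test_idxs in enumerate(folds):
            # Concatenate all folds except the i-th to form the training set.
            train_idxs = np.concatenate(folds[:i] + folds[i + 1:])
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds


X, y = load_iris(return_X_y=True)

# Iris is ordered by class, so shuffle once before using contiguous folds.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
X, y = X[order], y[order]

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=ContiguousSplitter(n_folds=5),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same cv object could be passed to cross_val_score or cross_validate, since they accept any object that exposes split and get_n_splits.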