Description
So I am trying to understand the behavior of `TimeSeriesSplit`, especially the `max_train_size` parameter. I was initially surprised that it is an absolute number of samples and not a ratio, as it is in other splitting operations.
I traced this parameter to issue #8249 and PR #8282 and realized that it was added to support window-based splitting, as described here. This was very surprising to me, because the documentation does not make it clear that this is what the parameter is for. Moreover, I found the parameters `initialWindow`, `horizon`, and `fixedWindow` much easier to understand, especially with that image.
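To make the behavior concrete, here is a small sketch comparing the default expanding window with a capped one. It uses only `TimeSeriesSplit`, `n_splits`, and `max_train_size`; the toy array and the `describe` helper are made up for illustration. I have not re-run this against v0.20.3 specifically, so treat the sizes in the comments as what I would expect rather than a guarantee.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 ordered samples (illustrative only).
X = np.arange(12).reshape(-1, 1)


def describe(splitter):
    """Hypothetical helper: collect (train, test) index lists for each split."""
    return [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]


# Default: the training window grows with every split (3, 6, 9 samples here).
expanding = describe(TimeSeriesSplit(n_splits=3))

# max_train_size=4 caps the training set at the 4 most recent samples,
# which turns the later splits into a sliding window (3, 4, 4 samples here).
sliding = describe(TimeSeriesSplit(n_splits=3, max_train_size=4))

for name, splits in (("expanding", expanding), ("sliding", sliding)):
    for train, test in splits:
        print(name, "train:", train, "test:", test)
```

Note that the cap is counted in samples, so the same value gives a very different window on datasets of different length.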
I would suggest that:
- Documentation is improved here. A visualization like the one shown in https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series would really help a lot.
- We consider using either sample-based parameters, like `initialWindow`, `horizon`, and `fixedWindow`, or ratio/fold-based ones, but not both, because mixing them is very confusing.
If splitting is done by number of folds (which I prefer, because it adapts to different dataset sizes automatically), then the window size should also be expressed in folds. The parameters could then be:
- How many folds to do.
- Number of folds used in horizon, i.e., used in test data. It looks like this is currently fixed to 1 in this splitting operation and cannot really be configured. I suggest we allow this to be configured.
- Number of folds used in the window, i.e., training data. Default could be None, which would mean a non-fixed window and would mean to use all folds before the test data. Or you could fix it to get a sliding window.
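A rough sketch of what such a fold-based splitter could look like follows. Everything here is hypothetical: `fold_window_split`, `n_folds`, `test_folds`, and `train_folds` are names I made up, none of this exists in sklearn, and I am assuming that the data is cut into `n_folds` roughly equal chunks and that the test window advances by one fold per split.

```python
def fold_window_split(n_samples, n_folds, test_folds=1, train_folds=None):
    """Hypothetical fold-based time series splitter (not part of sklearn).

    n_folds     -- number of roughly equal chunks the samples are cut into
    test_folds  -- number of consecutive folds used as test data (the horizon)
    train_folds -- number of folds used for training; None means every fold
                   before the test data (expanding window)
    Yields (train_indices, test_indices) as lists of ints.
    """
    if n_folds < 2 or n_folds > n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    if not 1 <= test_folds < n_folds:
        raise ValueError("test_folds must be between 1 and n_folds - 1")
    if train_folds is not None and train_folds < 1:
        raise ValueError("train_folds must be None or at least 1")

    # Fold boundaries in sample indices, e.g. [0, 3, 6, 9, 12] for 12 samples / 4 folds.
    bounds = [i * n_samples // n_folds for i in range(n_folds + 1)]

    # The first fold is always training data; the test window then moves
    # forward one fold at a time (an assumption of this sketch).
    for first_test in range(1, n_folds - test_folds + 1):
        if train_folds is None:
            first_train = 0  # expanding window
        else:
            first_train = max(0, first_test - train_folds)  # sliding window
        train = list(range(bounds[first_train], bounds[first_test]))
        test = list(range(bounds[first_test], bounds[first_test + test_folds]))
        yield train, test


# Expanding window, one test fold: the same splits I would expect from
# TimeSeriesSplit(n_splits=3) on 12 samples.
for train, test in fold_window_split(12, n_folds=4):
    print("train:", train, "test:", test)
```

With `train_folds=1` the same call gives a sliding window of one fold, and because everything is expressed in folds the window scales with the dataset instead of being a fixed sample count.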
Versions
This relates to the behavior in sklearn v0.20.3.