Description
Describe the workflow you want to enable
If I'm using TruncatedSVD in a pipeline, it'd be nice to have an option that automatically sets n_components to a value below n_features whenever n_components >= n_features.
For example, the docs for sklearn.manifold.TSNE suggest using TruncatedSVD to reduce the dimensionality of the input to 50. This is easy to do with a pipeline:
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline
from scipy.sparse import random
tsne = make_pipeline(TruncatedSVD(n_components=50), TSNE(n_iter=250))
wide_data = random(1000, 100)
wide_tsne = tsne.fit_transform(wide_data)
However, let's say later we get a narrower dataset:
narrow_data = random(1000, 10)
narrow_tsne = tsne.fit_transform(narrow_data)
This raises the error: n_components must be < n_features; got 50 >= 10
Describe your proposed solution
I'd like to add a parameter to the __init__ of TruncatedSVD, with a name like excess_n_components. The default would be something like excess_n_components="error", which preserves the current behavior. However, with excess_n_components="reduce_n_components" (or some other good way to specify it), at fit time we'd automatically reset n_components to X.shape[1] - 1. (With maybe a special case for when X.shape[1] == 1?)
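To make the idea concrete, here is a minimal sketch of the proposed behavior as a user-side subclass. CappedTruncatedSVD is hypothetical, and the excess_n_components name and its values are assumptions from this proposal, not an existing scikit-learn API:

```python
from sklearn.decomposition import TruncatedSVD


class CappedTruncatedSVD(TruncatedSVD):
    """Hypothetical sketch of the proposed excess_n_components option."""

    def __init__(self, n_components=2, excess_n_components="error"):
        super().__init__(n_components=n_components)
        self.excess_n_components = excess_n_components

    def fit_transform(self, X, y=None):
        if (
            self.excess_n_components == "reduce_n_components"
            and self.n_components >= X.shape[1]
        ):
            # Reset n_components to X.shape[1] - 1, the largest value
            # TruncatedSVD accepts. (X.shape[1] == 1 would still need
            # a special case, since 0 components is invalid.)
            self.n_components = X.shape[1] - 1
        return super().fit_transform(X, y)
```

With excess_n_components="reduce_n_components", the narrow-data example above would then produce a 9-component embedding instead of raising.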
Describe alternatives you've considered, if relevant
Writing a wrapper for Pipeline that generates different pipelines depending on the source data. This gets difficult, however, if the pipeline contains intermediate steps that may increase or decrease the dimensionality of the data.
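For reference, a minimal sketch of that wrapper approach (make_tsne_pipeline is a hypothetical helper name, and it only works when no step before TruncatedSVD changes the dimensionality):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline


def make_tsne_pipeline(X, n_components=50):
    """Build a TruncatedSVD -> TSNE pipeline sized for this particular X."""
    # Cap n_components at n_features - 1, the largest value TruncatedSVD allows.
    k = min(n_components, X.shape[1] - 1)
    return make_pipeline(TruncatedSVD(n_components=k), TSNE())
```

The pipeline must then be rebuilt for each dataset, which is exactly the boilerplate the proposed option would remove.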