-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Should we make most functions private? #14897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
cluster_optics_xi enables users to obtain clusters according to different "thresholds" without recalculating some statistics. |
In the robust_covariance.py I would like to make the first |
But is it really useful? Is it worth to publicly support it? Tools like that which aren't in the user guide are virtually impossible to be discovered. |
It would be very useful if the user knows what they're doing. I agree that they should be documented, but I think they're useful and should be public. We had quite long discussions about having those functions, so I think they were justified. |
Ok. Let's document them then please. |
Having a |
yes it was a design decision from the early days. core algorithms were
supposed to be available without the object API. It's for example why you
have ridge, lasso, fastica, etc... exposed as plain functions. Basically objects
should wrap the plain functions to have the "scikit-learn API".
|
What is the reason for that? Is it just for convenience? Do you think it is possible to reconsider whether it's worth having them? As mentioned earlier, a lot of them are designed in such a way that it is impossible to change the estimator without breaking backward compatibility of the function. |
I think that backwards compatibility issue is only really true if the
estimator calls the public function. It does not need to be true if the
public function and estimator both call some shared, private routines.
|
@NicolasHug could you maybe give an example? I can't easily think of a case where moving the logic implemented in the function to the class and just using the class from that function wouldn't solve the issue. It's how the optics function was implemented (although it was removed in later revisions). |
Indeed, that would solve the issue, but some classes are designed the other way around. That's a good first step but I'd like to go one step further and just make private anything that isn't obviously useful. The problem then disappears. |
if we have functions that call the objects internally for me it should
never have been the case (at least in theory...)
objects should call plain public functions and not the other way around
… |
But that's precisely what makes it hard to change anything to the object. I don't understand the benefit of this design? Our first class citizens are the objects, not the functions. |
I also don't understand why. I just wanted to fix something in DBSCAN, and it has the same issue, and I'd rather move the logic to the class and have the function just use the class. |
@agramfort I'm pretty sure we have it going both directions. I find the shortcuts that use the estimators useful, but I have never called the functions that are implementation details, and I'm not sure why they should be public. |
@agramfort <https://github.com/agramfort> I'm pretty sure we have it
going both directions.
What is that theory that you speak of?
well my understanding of the philosophy is the following:
- it's different skillsets to write and maintain an API with object
hierarchy etc... than it is to write efficient numerical code
- if I am a numerical guy used to fortran / matlab or whatever then I
optimize / edit plain functions
- if i want a fast kmeans / fastica I just rely on the functions that also
tend to do less input checks than can slow things down (a bit)
I know there's some public functions that are used as implementation
details, and those are usually not well-tested. There's also some functions
that are shortcuts (in the scalers for example) that reuse the estimators.
these shortcuts are maybe the true issue as they defeat the purpose I state
above.
my 2c
… |
I agree with all these points @agramfort but this is not the kind of functions we are dealing with here. The I think you are talking about fast helpers e.g. cythonized routines, which are private anyway (or not explicitly exposed, at least). But this is not the problem we're having. What I'm trying to argue is that
|
what do you mean by "we can't change anything to KMeans." ?
KMeans is allowed to call any public/private functions to do what it has to
do.
There is for me no contract between API of functions and estimators. For
example
we don't enforce by design that outputs of kmeans are attributes of the
estimator.
… |
Please look at KMeans.fit, it's pretty obvious. The fit function is def fit(self, X):
self.attr_1, self.attr_2 ... = k_means(...)
return self Sure, we could make |
Well it would be a dependency of both KMeans and k_means. I don't think we
gain anything from changing existing functions
|
If we don't make the functions use the classes, the input validation of the two are gonna diverge from each other soon. That's probably not a good idea. |
So you're fine with the current state of things @jnothman??
Yes! In #13603, we are now forced to do the input validation twice: once in KMeans.fit, and once in k_means (there are other ways, but they are all just as bad). Same for fastica. And that's just the tip of the iceberg IMO. I'm sorry to ask again but: do we really want to support |
I am also not so sure about the actual value those functions (
This actually brings some value to the user on the other hand. Let's have a look at the estimator-related functions documented in our official public API: https://scikit-learn.org/dev/modules/classes.html
It's hard to give a definite answer. In any case, whenever we decide to keep the class / function dual API, we need to:
|
+1, why can't we make k_means private? Maybe we should encourage users to use classes instead of functions? (except for some special cases, e.g., make_pipeline, scale) |
I buy the argument of avoiding to duplicate input checks. I've seen before
a copy done in the fit method and then in the subsequent plain function.
it's also not a great option have a public parameter validate=True | False.
It's nicer to have a private function in that case that does no input
validation as for a cython function call.
I would consider the context manager that skips all checks but it's maybe
an unnecessary pain...
now we need a coherent opinion over the package. What is allowed to be a
plain p
8000
ublic function and what is not...
my 1.5c
… |
As a first step I would be happy with starting to deprecate all the functions of |
I don't have a strong opinion. I think we should do an audit on github on how many times those functions are used by other projects / code snippets and notebooks that use scikit-learn. It seems that it is non-zero but it's hard to write a specific github search query to only display the function calls. |
BTW, public object using public function = ugly code and API like that of if whiten:
if compute_sources:
S = np.dot(np.dot(W, K), X).T
else:
S = None
if return_X_mean:
if return_n_iter:
return K, W, S, X_mean, n_iter
else:
return K, W, S, X_mean
else:
if return_n_iter:
return K, W, S, n_iter
else:
return K, W, S
else:
if compute_sources:
S = np.dot(W, X).T
else:
S = None
if return_X_mean:
if return_n_iter:
return None, W, S, None, n_iter
else:
return None, W, S, None
else:
if return_n_iter:
return None, W, S, n_iter
else:
return None, W, S I've opened PRs to put them in the correct order, but I'm stuck by this potential bug so please take a look at #14988 . |
We have a lot of functions referenced in our API reference.
For our own sanity, I would like to propose to make private any function that either:
k_means
orfastica
. There should be one obvious way to do things. (Some of them might be subject to debate, likescale
which is arguably convenient.)cluster_optics_xi
should be private (it's not even mentioned in the user guide).There are plenty of other functions, and even more that aren't in the API ref.
Motivation:
k_means
andfastica
are making my life difficult right now as I'm working on #13603.KMeans.fit()
does nothing but callingk_means
which does all the work.I cannot change anything to the
KMeans
class now (e.g. add an attribute), because I would I need to change the signature / interface / behaviour ofk_means
, which is a public function.(I think that in general the design of having the estimator call a public helper is bad (it should be the other way around), but that's another issue.)
The text was updated successfully, but these errors were encountered: