Should we make most functions private? · Issue #14897 · scikit-learn/scikit-learn · GitHub


Should we make most functions private? #14897



Open
NicolasHug opened this issue Sep 5, 2019 · 28 comments

@NicolasHug
Member

We have a lot of functions referenced in our API reference.

For our own sanity, I would like to propose to make private any function that either:

  • has an obvious estimator equivalent, e.g. k_means or fastica. There should be one obvious way to do things. (Some of them might be subject to debate, like scale which is arguably convenient.)
  • isn't immediately useful for users. For example, I don't know much about OPTICS but I feel like cluster_optics_xi should be private (it's not even mentioned in the user guide).

There are plenty of other functions, and even more that aren't in the API ref.


Motivation:
k_means and fastica are making my life difficult right now as I'm working on #13603. KMeans.fit() does nothing but call k_means, which does all the work.

I cannot change anything in the KMeans class now (e.g. add an attribute), because I would need to change the signature / interface / behaviour of k_means, which is a public function.

(I think that in general the design of having the estimator call a public helper is bad (it should be the other way around), but that's another issue.)
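The coupling can be sketched in a minimal, hypothetical example (these names are illustrative, not actual scikit-learn code): the public function owns all the logic and the estimator's fit() merely unpacks its return value, so any new fitted attribute forces a change to the public function's return signature.

```python
# Hypothetical sketch of the pattern described above: the public function
# does all the work, and the estimator's fit() just unpacks its return value.
# Adding a new fitted attribute to the estimator would then require changing
# the public function's return signature.

def k_means_like(X, n_clusters):
    """Stand-in for a public helper that does the actual work."""
    # (a real implementation would run Lloyd's algorithm; this is a fake)
    centers = X[:n_clusters]
    labels = [i % n_clusters for i in range(len(X))]
    inertia = 0.0
    return centers, labels, inertia  # adding e.g. n_iter means changing this

class KMeansLike:
    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def fit(self, X):
        # The estimator is a thin wrapper: all logic lives in the public
        # function, so the class cannot evolve independently of it.
        self.cluster_centers_, self.labels_, self.inertia_ = k_means_like(
            X, self.n_clusters)
        return self
```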

@qinhanmin2014
Member

isn't immediately useful for users. For example, I don't know much about OPTICS but I feel like cluster_optics_xi should be private (it's not even mentioned in the user guide).

cluster_optics_xi enables users to obtain clusters according to different "thresholds" without recalculating some statistics.
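That reuse can be sketched with the public compute_optics_graph and cluster_optics_xi functions (the dataset and parameter values here are illustrative): the expensive reachability graph is computed once, and flat clusterings for several xi thresholds are extracted from it cheaply.

```python
# Sketch of the reuse described above: compute the OPTICS reachability graph
# once, then extract flat clusterings for several xi values without
# re-running the expensive neighbor computations.
import numpy as np
from sklearn.cluster import compute_optics_graph, cluster_optics_xi

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Expensive step: done once.
ordering, core_distances, reachability, predecessor = compute_optics_graph(
    X, min_samples=2, max_eps=np.inf, metric="minkowski", p=2,
    metric_params=None, algorithm="auto", leaf_size=30, n_jobs=None)

# Cheap step: repeated for different xi thresholds on the same graph.
for xi in (0.01, 0.05):
    labels, clusters = cluster_optics_xi(
        reachability=reachability, predecessor=predecessor,
        ordering=ordering, min_samples=2, xi=xi)
    print(xi, labels)
```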

@albertcthomas
Contributor
albertcthomas commented Sep 6, 2019

In robust_covariance.py I would like to make the first c_step function as well as the select_candidates function private. I do not think anyone uses them...

@NicolasHug
Member Author

cluster_optics_xi enables users to obtain clusters according to different "thresholds" without recalculating some statistics.

But is it really useful? Is it worth supporting it publicly?
If we keep it, the function should be documented, at the very least in the user guide.

Tools like that which aren't in the user guide are virtually impossible to discover.

@adrinjalali
Member

It would be very useful if the user knows what they're doing. I agree that they should be documented, but I think they're useful and should be public. We had quite long discussions about having those functions, so I think they were justified.

@NicolasHug
Member Author

Ok. Let's document them then please.

@rth
Member
rth commented Sep 6, 2019

For our own sanity, I would like to propose to make private any function that either:
has an obvious estimator equivalent, e.g. k_means or fastica. There should be one obvious way to do things. (Some of them might be subject to debate, like scale which is arguably convenient.)
k_means and fastica are making my life difficult right now as I'm working on

Having a KMeans and a stateless function k_means was an explicit design choice at the time as far as I understood. Maybe @agramfort would have some comments on it.

@agramfort
Member
agramfort commented Sep 7, 2019 via email

@NicolasHug
Member Author

core algorithms were supposed to be available without the object API

What is the reason for that? Is it just for convenience?

Do you think it is possible to reconsider whether it's worth having them?

As mentioned earlier, a lot of them are designed in such a way that it is impossible to change the estimator without breaking backward compatibility of the function.

@jnothman
Member
jnothman commented Sep 8, 2019 via email

@adrinjalali
Member
adrinjalali commented Sep 8, 2019

@NicolasHug could you maybe give an example? I can't easily think of a case where moving the logic implemented in the function to the class and just using the class from that function wouldn't solve the issue. It's how the optics function was implemented (although it was removed in later revisions).

@NicolasHug
Member Author

I can't easily think of a case where moving the logic implemented in the function to the class and just using the class from that function wouldn't solve the issue

Indeed, that would solve the issue, but some classes are designed the other way around.

That's a good first step but I'd like to go one step further and just make private anything that isn't obviously useful. The problem then disappears.

@agramfort
Member
agramfort commented Sep 9, 2019 via email

@NicolasHug
Member Author

objects should call plain public functions and not the other way around

But that's precisely what makes it hard to change anything in the object. I don't understand the benefit of this design: our first-class citizens are the objects, not the functions.

@adrinjalali
Member

I also don't understand why. I just wanted to fix something in DBSCAN, and it has the same issue, and I'd rather move the logic to the class and have the function just use the class.

@amueller
Member

@agramfort I'm pretty sure we have it going in both directions.
What is that theory that you speak of?
I know there's some public functions that are used as implementation details, and those are usually not well-tested. There's also some functions that are shortcuts (in the scalers for example) that reuse the estimators.

I find the shortcuts that use the estimators useful, but I have never called the functions that are implementation details, and I'm not sure why they should be public.

@agramfort
Member
agramfort commented Sep 11, 2019 via email

@NicolasHug
Member Author

I agree with all these points @agramfort but this is not the kind of functions we are dealing with here.

The k_means() function does all the work, including input checking. It is not faster than the estimator in any way.

I think you are talking about fast helpers e.g. cythonized routines, which are private anyway (or not explicitly exposed, at least). But this is not the problem we're having.

What I'm trying to argue is that

  • k_means() is useless because we have KMeans
  • k_means() should then be private
  • if we really want to keep the k_means() helper, it should use the estimator's code (which can use as many fast private helpers as it wants), not the other way around, because right now we can't change anything in KMeans.
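The inverted design argued for in the last bullet can be sketched as follows, assuming a hypothetical k_means wrapper around the existing KMeans estimator (this is a sketch, not the actual scikit-learn implementation):

```python
# Sketch of the inverted design: the public helper is a thin convenience
# wrapper around the estimator, so the class remains free to grow new fitted
# attributes without touching the function's signature.
import numpy as np
from sklearn.cluster import KMeans

def k_means(X, n_clusters, **kwargs):
    """Hypothetical functional shortcut delegating to the estimator."""
    est = KMeans(n_clusters=n_clusters, **kwargs).fit(X)
    # Only the values the function promises are exposed; any new fitted
    # attribute on KMeans stays an implementation detail of the class.
    return est.cluster_centers_, est.labels_, est.inertia_

# Usage: same results as the estimator, two lines shorter.
centers, labels, inertia = k_means(
    np.array([[0.0], [0.1], [5.0], [5.1]]), 2, n_init=10, random_state=0)
```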

@agramfort
Member
agramfort commented Sep 12, 2019 via email

@NicolasHug
Member Author

what do you mean by "we can't change anything to KMeans"?

Please look at KMeans.fit; it's pretty obvious. The fit function is:

def fit(self, X):
    self.attr_1, self.attr_2 ... = k_means(...)
    return self

Sure, we could make fit call something other than k_means and change that, but that "something else" would be 99% redundant with k_means.

@jnothman
Member
jnothman commented Sep 12, 2019 via email

@adrinjalali
Member

If we don't make the functions use the classes, the input validation of the two are gonna diverge from each other soon. That's probably not a good idea.

@NicolasHug
Member Author

So you're fine with the current state of things @jnothman?

If we don't make the functions use the classes, the input validation of the two are gonna diverge from each other soon

Yes! In #13603, we are now forced to do the input validation twice: once in KMeans.fit, and once in k_means (there are other ways, but they are all just as bad). Same for fastica. And that's just the tip of the iceberg IMO.


I'm sorry to ask again but: do we really want to support k_means and other such functions? I think we don't. They save users 2 lines of code for an increased cost of maintenance.

@ogrisel
Member
ogrisel commented Sep 13, 2019

I am also not so sure about the actual value those functions (k_means, fastica) bring to the user.

cluster_optics_xi enables users to obtain clusters according to different "thresholds" without recalculating some statistics.

This actually brings some value to the user on the other hand.

Let's have a look at the estimator-related functions documented in our official public API:

https://scikit-learn.org/dev/modules/classes.html

  • For covariance matrix estimation, it kind of makes sense to provide a function that directly returns the estimated covariance matrix.

  • For sparse coding, it makes sense to have a public function that just returns the code.

  • For matrix factorization methods, having a functional short-hand also kind of makes sense.

  • For clustering algorithms I am not so sure those functions are really useful in general.

It's hard to give a definite answer.

In any case, whenever we decide to keep the class / function dual API, we need to:

  • make sure that input validation is never duplicated.

  • ensure that the two public APIs actually share the same implementation, possibly via some private helper called by both of them.
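One way to satisfy both bullet points, sketched with hypothetical names rather than scikit-learn's actual layout: a private helper owns the computation, the class owns input validation, and the function delegates to the class, so validation runs exactly once and both APIs share one implementation.

```python
# Sketch of the shared-private-helper layout (hypothetical names).
import numpy as np

def _fit_algorithm(X, n_components):
    """Private helper: assumes X has already been validated."""
    return X[:n_components]  # placeholder for the real computation

class Estimator:
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)  # validation happens exactly once
        self.components_ = _fit_algorithm(X, self.n_components)
        return self

def estimator_function(X, n_components=2):
    """Public functional shortcut sharing the class's implementation."""
    return Estimator(n_components=n_components).fit(X).components_
```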

@qinhanmin2014
Member

I'm sorry to ask again but: do we really want to support k_means and other such functions? I think we don't. They save users 2 lines of code for an increased cost of maintenance.

+1, why can't we make k_means private? Maybe we should encourage users to use classes instead of functions? (except for some special cases, e.g., make_pipeline, scale)

@agramfort
Member
agramfort commented Sep 15, 2019 via email

@NicolasHug
Member Author

As a first step I would be happy with starting to deprecate all the functions of cluster that have an obvious equivalent.

@ogrisel
Member
ogrisel commented Sep 16, 2019

I don't have a strong opinion. I think we should do an audit on GitHub of how many times those functions are used by other projects / code snippets and notebooks that use scikit-learn.

It seems that it is non-zero, but it's hard to write a specific GitHub search query to only display the function calls.

@NicolasHug
Member Author

BTW, a public object using a public function leads to ugly code and an API like that of fastica (below): you need to add a return_blahblah parameter every time you want to add a new attribute to the object.

    if whiten:
        if compute_sources:
            S = np.dot(np.dot(W, K), X).T
        else:
            S = None
        if return_X_mean:
            if return_n_iter:
                return K, W, S, X_mean, n_iter
            else:
                return K, W, S, X_mean
        else:
            if return_n_iter:
                return K, W, S, n_iter
            else:
                return K, W, S

    else:
        if compute_sources:
            S = np.dot(W, X).T
        else:
            S = None
        if return_X_mean:
            if return_n_iter:
                return None, W, S, None, n_iter
            else:
                return None, W, S, None
        else:
            if return_n_iter:
                return None, W, S, n_iter
            else:
                return None, W, S
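One hedged alternative to the nested flag-dependent returns above (a sketch, not anything that was adopted): bundle every output into a single lightweight result object, so adding a field later does not change the function's signature for existing callers.

```python
# Sketch: replace the combinatorial tuple returns with one result object.
# The inputs here (W, K, X, ...) are stand-ins for values the real algorithm
# would compute; only the return path is illustrated.
from collections import namedtuple

import numpy as np

FastICAResult = namedtuple("FastICAResult",
                           ["K", "W", "S", "X_mean", "n_iter"])

def fastica_like(W, K, X, X_mean, n_iter, whiten=True, compute_sources=True):
    """Hypothetical return path replacing the nested if/else blocks above."""
    if compute_sources:
        S = np.dot(np.dot(W, K), X).T if whiten else np.dot(W, X).T
    else:
        S = None
    if not whiten:
        # Without whitening there is no unmixing matrix K or mean to report.
        K, X_mean = None, None
    return FastICAResult(K=K, W=W, S=S, X_mean=X_mean, n_iter=n_iter)
```

Callers then access fields by name (res.S, res.n_iter), and new fields can be appended without breaking anyone.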

I've opened PRs to put them in the correct order, but I'm stuck on this potential bug, so please take a look at #14988.
