Should we make most functions private? · Issue #14897 · scikit-learn/scikit-learn · GitHub


Should we make most functions private? #14897



Open
NicolasHug opened this issue Sep 5, 2019 · 28 comments

@NicolasHug
Member

We have a lot of functions referenced in our API reference.

For our own sanity, I would like to propose to make private any function that either:

  • has an obvious estimator equivalent, e.g. k_means or fastica. There should be one obvious way to do things. (Some of them might be subject to debate, like scale which is arguably convenient.)
  • isn't immediately useful for users. For example, I don't know much about OPTICS but I feel like cluster_optics_xi should be private (it's not even mentioned in the user guide).

There are plenty of other functions, and even more that aren't in the API ref.


Motivation:
k_means and fastica are making my life difficult right now as I'm working on #13603. KMeans.fit() does nothing but call k_means, which does all the work.

I cannot change anything in the KMeans class now (e.g. add an attribute), because I would need to change the signature / interface / behaviour of k_means, which is a public function.

(I think that in general the design of having the estimator call a public helper is bad (it should be the other way around), but that's another issue.)
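The coupling can be sketched in a minimal, hypothetical example (these names are illustrative, not actual scikit-learn code): the public function owns all the logic and the estimator's fit() merely unpacks its return value, so any new fitted attribute forces a change to the public function's return signature.

```python
# Hypothetical sketch of the pattern described above: the public function
# does all the work, and the estimator's fit() just unpacks its return value.
# Adding a new fitted attribute to the estimator would then require changing
# the public function's return signature.

def k_means_like(X, n_clusters):
    """Stand-in for a public helper that does the actual work."""
    # (a real implementation would run Lloyd's algorithm; this is a fake)
    centers = X[:n_clusters]
    labels = [i % n_clusters for i in range(len(X))]
    inertia = 0.0
    return centers, labels, inertia  # adding e.g. n_iter means changing this

class KMeansLike:
    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def fit(self, X):
        # The estimator is a thin wrapper: all logic lives in the public
        # function, so the class cannot evolve independently of it.
        self.cluster_centers_, self.labels_, self.inertia_ = k_means_like(
            X, self.n_clusters)
        return self
```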

@qinhanmin2014
Member

isn't immediately useful for users. For example, I don't know much about OPTICS but I feel like cluster_optics_xi should be private (it's not even mentioned in the user guide).

cluster_optics_xi enables users to obtain clusters according to different "thresholds" without recalculating some statistics.
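That reuse can be sketched with the public compute_optics_graph and cluster_optics_xi functions (the dataset and parameter values here are illustrative): the expensive reachability graph is computed once, and flat clusterings for several xi thresholds are extracted from it cheaply.

```python
# Sketch of the reuse described above: compute the OPTICS reachability graph
# once, then extract flat clusterings for several xi values without
# re-running the expensive neighbor computations.
import numpy as np
from sklearn.cluster import compute_optics_graph, cluster_optics_xi

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Expensive step: done once.
ordering, core_distances, reachability, predecessor = compute_optics_graph(
    X, min_samples=2, max_eps=np.inf, metric="minkowski", p=2,
    metric_params=None, algorithm="auto", leaf_size=30, n_jobs=None)

# Cheap step: repeated for different xi thresholds on the same graph.
for xi in (0.01, 0.05):
    labels, clusters = cluster_optics_xi(
        reachability=reachability, predecessor=predecessor,
        ordering=ordering, min_samples=2, xi=xi)
    print(xi, labels)
```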

@albertcthomas
Contributor
albertcthomas commented Sep 6, 2019

In robust_covariance.py I would like to make the first c_step function as well as the select_candidates function private. I do not think anyone uses them...

@NicolasHug
Member Author

cluster_optics_xi enables users to obtain clusters according to different "thresholds" without recalculating some statistics.

But is it really useful? Is it worth supporting it publicly?
If we keep it, the function should be documented, at the very least in the user guide.

Tools like that which aren't in the user guide are virtually impossible to discover.

@adrinjalali
Member

It would be very useful if the user knows what they're doing. I agree that they should be documented, but I think they're useful and should be public. We had quite long discussions about having those functions, so I think they were justified.

@NicolasHug
Member Author

Ok. Let's document them then please.

@rth
Member
rth commented Sep 6, 2019

For our own sanity, I would like to propose to make private any function that either:
has an obvious estimator equivalent, e.g. k_means or fastica. There should be one obvious way to do things. (Some of them might be subject to debate, like scale which is arguably convenient.)
k_means and fastica are making my life difficult right now as I'm working on

Having a KMeans and a stateless function k_means was an explicit design choice at the time as far as I understood. Maybe @agramfort would have some comments on it.

@agramfort
Member
agramfort commented Sep 7, 2019 via email

@NicolasHug
Member Author

core algorithms were supposed to be available without the object API

What is the reason for that? Is it just for convenience?

Do you think it is possible to reconsider whether it's worth having them?

As mentioned earlier, a lot of them are designed in such a way that it is impossible to change the estimator without breaking backward compatibility of the function.

@jnothman
Member
jnothman commented Sep 8, 2019 via email

@adrinjalali
Member
adrinjalali commented Sep 8, 2019

@NicolasHug could you maybe give an example? I can't easily think of a case where moving the logic implemented in the function to the class and just using the class from that function wouldn't solve the issue. It's how the optics function was implemented (although it was removed in later revisions).

@NicolasHug
Member Author

I can't easily think of a case where moving the logic implemented in the function to the class and just using the class from that function wouldn't solve the issue

Indeed, that would solve the issue, but some classes are designed the other way around.

That's a good first step but I'd like to go one step further and just make private anything that isn't obviously useful. The problem then disappears.

@agramfort
Member
agramfort commented Sep 9, 2019 via email

@NicolasHug
Member Author

objects should call plain public functions and not the other way around

But that's precisely what makes it hard to change anything in the object. I don't understand the benefit of this design: our first-class citizens are the objects, not the functions.

@adrinjalali
Member

I also don't understand why. I just wanted to fix something in DBSCAN, and it has the same issue, and I'd rather move the logic to the class and have the function just use the class.

@amueller
Member

@agramfort I'm pretty sure we have it going in both directions.
What is that theory that you speak of?
I know there's some public functions that are used as implementation details, and those are usually not well-tested. There's also some functions that are shortcuts (in the scalers for example) that reuse the estimators.

I find the shortcuts that use the estimators useful, but I have never called the functions that are implementation details, and I'm not sure why they should be public.

@agramfort
Member
agramfort commented Sep 11, 2019 via email

@NicolasHug
Member Author

I agree with all these points @agramfort but this is not the kind of functions we are dealing with here.

The k_means() function does all the work, including input checking. It is not faster than the estimator in any way.

I think you are talking about fast helpers e.g. cythonized routines, which are private anyway (or not explicitly exposed, at least). But this is not the problem we're having.

What I'm trying to argue is that

  • k_means() is useless because we have KMeans
  • k_means() should then be private
  • if we really want to keep the k_means() helper, it should use the estimator's code (which can use as many fast private helpers as it wants), not the other way around, because right now we can't change anything in KMeans.
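The inverted design argued for in the last bullet can be sketched as follows, assuming a hypothetical k_means wrapper around the existing KMeans estimator (this is a sketch, not the actual scikit-learn implementation):

```python
# Sketch of the inverted design: the public helper is a thin convenience
# wrapper around the estimator, so the class remains free to grow new fitted
# attributes without touching the function's signature.
import numpy as np
from sklearn.cluster import KMeans

def k_means(X, n_clusters, **kwargs):
    """Hypothetical functional shortcut delegating to the estimator."""
    est = KMeans(n_clusters=n_clusters, **kwargs).fit(X)
    # Only the values the function promises are exposed; any new fitted
    # attribute on KMeans stays an implementation detail of the class.
    return est.cluster_centers_, est.labels_, est.inertia_

# Usage: same results as the estimator, two lines shorter.
centers, labels, inertia = k_means(
    np.array([[0.0], [0.1], [5.0], [5.1]]), 2, n_init=10, random_state=0)
```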

@agramfort
Member
agramfort commented Sep 12, 2019 via email

@NicolasHug
Member Author

what do you mean by "we can't change anything to KMeans"?

Please look at KMeans.fit; it's pretty obvious. The fit function is:

def fit(self, X):
    self.attr_1, self.attr_2 ... = k_means(...)
    return self

Sure, we could make fit call something other than k_means and change that, but that "something else" would be 99% redundant with k_means.

@jnothman
Member
jnothman commented Sep 12, 2019 via email

@adrinjalali
Member

If we don't make the functions use the classes, the input validation of the two are gonna diverge from each other soon. That's probably not a good idea.

@NicolasHug
Member Author

So you're fine with the current state of things @jnothman?

If we don't make the functions use the classes, the input validation of the two are gonna diverge from each other soon

Yes! In #13603, we are now forced to do the input validation twice: once in KMeans.fit, and once in k_means (there are other ways, but they are all just as bad). Same for fastica. And that's just the tip of the iceberg IMO.


I'm sorry to ask again but: do we really want to support k_means and other such functions? I think we don't. They save users 2 lines of code for an increased cost of maintenance.

@ogrisel
Member
ogrisel commented Sep 13, 2019

I am also not so sure about the actual value those functions (k_means, fastica) bring to the user.

cluster_optics_xi enables users to obtain clusters according to different "thresholds" without recalculating some statistics.

This actually brings some value to the user on the other hand.

Let's have a look at the estimator-related functions documented in our official public API:

https://scikit-learn.org/dev/modules/classes.html

  • For covariance matrix estimation, it kind of makes sense to provide a function that directly returns the estimated covariance matrix.

  • For sparse coding, it makes sense to have a public function that just returns the code.

  • For matrix factorization methods, having a functional short-hand also kind of makes sense.

  • For clustering algorithms I am not so sure those functions are really useful in general.

It's hard to give a definite answer.

In any case, whenever we decide to keep the class / function dual API, we need to:

  • make sure that input validation is never duplicated.

  • ensure that the two public APIs actually share the same implementation, possibly via some private helper called by both of them.
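One way to satisfy both bullet points, sketched with hypothetical names rather than scikit-learn's actual layout: a private helper owns the computation, the class owns input validation, and the function delegates to the class, so validation runs exactly once and both APIs share one implementation.

```python
# Sketch of the shared-private-helper layout (hypothetical names).
import numpy as np

def _fit_algorithm(X, n_components):
    """Private helper: assumes X has already been validated."""
    return X[:n_components]  # placeholder for the real computation

class Estimator:
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)  # validation happens exactly once
        self.components_ = _fit_algorithm(X, self.n_components)
        return self

def estimator_function(X, n_components=2):
    """Public functional shortcut sharing the class's implementation."""
    return Estimator(n_components=n_components).fit(X).components_
```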

@qinhanmin2014
Member

I'm sorry to ask again but: do we really want to support k_means and other such functions? I think we don't. They save users 2 lines of code for an increased cost of maintenance.

+1, why can't we make k_means private? Maybe we should encourage users to use classes instead of functions? (except for some special cases, e.g., make_pipeline, scale)

@agramfort
Member
agramfort commented Sep 15, 2019 via email

@NicolasHug
Member Author

As a first step I would be happy with starting to deprecate all the functions of cluster that have an obvious equivalent.

@ogrisel
Member
ogrisel commented Sep 16, 2019

I don't have a strong opinion. I think we should do an audit on GitHub of how many times those functions are used by other projects / code snippets and notebooks that use scikit-learn.

It seems that it is non-zero, but it's hard to write a specific GitHub search query to only display the function calls.

@NicolasHug
Member Author

BTW, a public object using a public function leads to ugly code and an API like that of fastica (below): you need to add a return_blahblah parameter every time you want to add a new attribute to the object.

    if whiten:
        if compute_sources:
            S = np.dot(np.dot(W, K), X).T
        else:
            S = None
        if return_X_mean:
            if return_n_iter:
                return K, W, S, X_mean, n_iter
            else:
                return K, W, S, X_mean
        else:
            if return_n_iter:
                return K, W, S, n_iter
            else:
                return K, W, S

    else:
        if compute_sources:
            S = np.dot(W, X).T
        else:
            S = None
        if return_X_mean:
            if return_n_iter:
                return None, W, S, None, n_iter
            else:
                return None, W, S, None
        else:
            if return_n_iter:
                return None, W, S, n_iter
            else:
                return None, W, S
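One hedged alternative to the nested flag-dependent returns above (a sketch, not anything that was adopted): bundle every output into a single lightweight result object, so adding a field later does not change the function's signature for existing callers.

```python
# Sketch: replace the combinatorial tuple returns with one result object.
# The inputs here (W, K, X, ...) are stand-ins for values the real algorithm
# would compute; only the return path is illustrated.
from collections import namedtuple

import numpy as np

FastICAResult = namedtuple("FastICAResult",
                           ["K", "W", "S", "X_mean", "n_iter"])

def fastica_like(W, K, X, X_mean, n_iter, whiten=True, compute_sources=True):
    """Hypothetical return path replacing the nested if/else blocks above."""
    if compute_sources:
        S = np.dot(np.dot(W, K), X).T if whiten else np.dot(W, X).T
    else:
        S = None
    if not whiten:
        # Without whitening there is no unmixing matrix K or mean to report.
        K, X_mean = None, None
    return FastICAResult(K=K, W=W, S=S, X_mean=X_mean, n_iter=n_iter)
```

Callers then access fields by name (res.S, res.n_iter), and new fields can be appended without breaking anyone.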

I've opened PRs to put them in the correct order, but I'm stuck on this potential bug, so please take a look at #14988.
