10000 RFC CLI Tool Proposal · Issue #24828 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

RFC CLI Tool Proposal #24828

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Micky774 opened this issue Nov 3, 2022 · 10 comments
Closed

RFC CLI Tool Proposal #24828

Micky774 opened this issue Nov 3, 2022 · 10 comments
Labels
RFC workflow Development workflow changes

Comments

@Micky774
Copy link
Contributor
Micky774 commented Nov 3, 2022

Introduction

Other OSS projects such as scipy and numpy have made use of development CLI tools such as runtests.py and dev.py to provide a collected and singular entry point into development tasks for contributors. I find these tools very helpful personally, and think they add a large amount of value, especially for onboarding new contributors. Recently Pandas has also begun discussion and implementation of their own development CLI tool. @noatamir has made a great writeup on some of the benefits and justifications for such a tool (see: pandas-dev/pandas#47700). I highly recommend taking the time to read their writeup.

Some of the benefits of such a tool include:

  1. Increased discoverability of tools, decreasing time spent jumping around documentation. This is especially helpful for those who may not read every single word of documentation (such is my bad habit 😅).
  2. Consistency for CI and development. We can establish a 1-1 between CI checks and dev.py commands, e.g. dev.py lint which could then impose both black and isort (cf. RFC isort as linter and import sorter #22853)
  3. Ease of installation. Users could install whatever dependencies are necessary depending on what workflow they want enabled. A call to dev.py doc build could check for the presence of the necessary dependencies and install if needed (or optionally prompt).
  4. Reduced barrier to entry. Quite frankly, having to run various commands for tasks, anywhere from linting to building C sources, can be overwhelming, especially to new developers.

Recently scipy explored a new solution for developing a unified development CLI tool (cf. scipy/scipy#15489) based on do.it. Pandas is also exploring a do.it based solution. There's been good feedback on this framework and I would propose the same.

Encompassed Actions

Some actions which would be wrapped by such a development CLI include:

  • Dependency installation (could be per-workflow)
  • Testing
  • Linting
  • Extension building
  • Documentation building
  • Pre-commit installation

Considerations

As with all things, the question is whether the benefits outweigh the maintenance cost. While projects like scipy with many complex workflows benefit the most from such a tool, personally I think anything to decrease the barrier to entry for scikit-learn development is worth considering.

If we do opt to build such a tool, we can make use of the work already done in scipy as well as the recently released separate dev.py package. Either way, the largest modification we'd need to make is alter the build command to use setuptools instead of meson, though that shouldn't be too problematic.

@Micky774 Micky774 added RFC workflow Development workflow changes labels Nov 3, 2022
@glemaitre
Copy link
Member

I still don't have a strong opinion on the matter. I would need to sleep on it. I would like to like the CLI proposal :)

On the downside first, I would be under the impression that we are going to wrap up commands from Python tools, that we use in scikit-learn. In the end, it would be cumbersome for newcomers because they will have to learn a new set of commands for every project to which they want to contribute to. So why not directly learn, the pytest, sphinx, etc, commands?

Consistency for CI and development. We can establish a 1-1 between CI checks and dev.py commands, e.g. dev.py lint which could then impose both black and isort (cf. #22853)

I assume that the comment of @jorisvandenbossche is valid for the 1-1 CI checks: pandas-dev/pandas#47700 (comment). For the linting, it could be OK but further than that, it would be too optimistic.

But I see an upside: if we get a unified CLI for the pydata ecosystem and even if in scikit-learn, we would use a subset of the complex CLI required for SciPy or Pandas, then it starts to be interesting. A newcomer is learning a CLI interface allowing them to contribute to any pydata project.

Also, we created in the past a Makefile with the exact same intention. However, it was not cross-platform. So a CLI based on Python commands would probably unlock some of those issues.

@adrinjalali
Copy link
Member

I still prefer a Makefile, and it's cross platform if windows users use a git bash, and it's designed to simplify workflows.

I have compiled multiple projects, and I've never used their tool. As @glemaitre mentions, it's hard to learn a new CLI for each project one needs to compile/contribute to, whereas I know how pytest works, how to install packages, and I know I can probably find stuff in a Makefile. We should care about transferable skills.

I do recognize that there's a mismatch between what an experienced user like our core dev community can do and expects, and the average first time contributor to the project.

These things complicate stuff, quite substantially. I used to understand our CI, and I did it the first time I looked at them. But I haven't followed the recent changes to our CI, and now I have absolutely no clue how we update dependencies in our CI, where all those different lock files come from, and what the CI does; and I think adding this CLI to the CI would complicate things even more.

I know this sounds like I'm a grumpy old man, but not looking at the development for a few months, coming back, and noticing you don't understand a thing, is not a good thing, and I used to work quite a bit on our CI side. I'm not sure where things are documented, and the PRs people open to update deps don't mention how they're done. We have a discoverability issue there.

I don't necessarily understand why we need new shiny things when the old tools work brilliantly (aka Makefile), which has the benefit that it contains scripts which you can basically copy/paste and customize in your shell. Whereas extracting a shell script from a python script is going to be impossible as soon as things get complicated.

For new contributors:

  • CLI has the benefit that they can copy/paste commands from docs and run things, and we can have complex logic behind those commands to hide them away.
  • At the same time, it hides away what's being done in the background. I think a value for first time contributors is for them to learn pytest, flake8, isort, black, build process, pre-commit hooks, etc.

If a new contributor comes, fixes a documentation issue, and runs a few custom CLI commands, the only thing they get out of it is that they can put scikit-learn on their github's profile page and say that they have contributed to the project. They haven't actually learned anything, is that a good thing? I'm not sure. I feel like we're expecting less and less from our contributors, and that might not be in their benefit.

An alternative for some of the issues we have, is to have bots which parse the output of our CI and comment on PRs with clear instructions on how a contributor needs to fix them. Example: conda-forge/skops-feedstock#1 (comment)

image

cc @scikit-learn/core-devs

Also cc @reshamas , @cmarmo , since I value their input in such matters as they might have a better idea of what our average contributors might need and want to have.

@jjerphan
Copy link
Member
jjerphan commented Nov 4, 2022

I still prefer a Makefile, and it's cross platform if windows users use a git bash, and it's designed to simplify workflows.

+1. I prefer canonical Makefiles and make command over custom scripts (like a dev.py). I am not against the creation of a dev.py, but I do not think such a CLI will allow people to work better. I think we should strive for simplicity and I do not think that tools' command indirection helps, especially for the reasons given by @adrinjalali in #24828 (comment) which I endorse, notably:

At the same time, it hides away what's being done in the background. I think a value for first time contributors is for them to learn pytest, flake8, isort, black, build process, pre-commit hooks, etc.

and

If a new contributor comes, fixes a documentation issue, and runs a few custom CLI commands, the only thing they get out of it is that they can put scikit-learn on their github's profile page and say that they have contributed to the project. They haven't actually learned anything, is that a good thing? I'm not sure. I feel like we're expecting less and less from our contributors, and that might not be in their benefit.

Yet, I am not against the creation of a dev.py if that's a good solution for people.

@betatim
Copy link
Member
betatim commented Nov 4, 2022

I think I'm joining Adrin in the "grumpy person" camp. I'm not a fan of "hiding" what the actual commands are inside tools. It almost doesn't matter what they are, but "old" things like Makefiles might be the least tricky ones because they mostly contain shell commands.

When I come to a new project (or return to one) I want to quickly see what I need to do to setup the dependencies, build the project, run the tests, etc. So the number one problem I have is quickly finding this information. And a lot of projects aren't great for this. Some of the worst are the ones that mention "Read CONTRIBUTING.md" in their README, and then in CONTRIBUTING.md they link to the docs, where there then is a four screen long wall of text. Lots of clicks, lots of media breaks, lots of risk that I'll just send the PR without doing anything and wait to see what the CI robots say.

So I think improving the "time to discovery" is important. Above all else. Once that time is small, then I think using "vanilla" commands is better than custom stuff. If the docs say "Setup the development dependencies with pip install foo biz buz bar" that is great for me. I can either copy&paste or adjust it "on the fly" because I happen to love using mamba and conda envs. The copy&paste works for newcomers and "translate on the fly" works for "old-timers" who want to tweak things. If the instruction says python dev.py run-tests then I have to go and find out what the actual command is you are running, then extract that, then do what I really wanted to do (run a subset of the tests, which is the most common test thing I do).

So overall I think improving discoverability is important. For me this is a documentation task. And it should include things like "how to bump CI deps", maybe in a section for maintainers. It also means that you need to keep the complexity low enough that you can explain it in one or two sentences.

Adding a layer on top to deal with the problem that "it is too difficult to explain" is IMHO trying to treat a patient by reducing their symptoms, not the actual illness.

@betatim
Copy link
Member
betatim commented Nov 4, 2022

A situation where this or a make build would be useful is that I can't remember the command to build scikit-learn. Maybe it would be possible to shorten it from pip install --no-build-isolation -e . by configuring some build config defaults? Or making an alias? The reason I bring this up is because I "solve" my problem of not being able to remember the command by searching my shell history. Which I guess is a sign that the ergonomics could be improved.

@GaelVaroquaux
Copy link
Member
GaelVaroquaux commented Nov 4, 2022

edit I had mis-understood the proposal. Rewriting my comment (and sorry for the mis-understanding, that was a bit dense of me).

Standardizing our CLIs / shell scripts

I think that the actual proposal above it to standardize the CLIs / shell scripts / Makefiles that we use.

I'm not sure where I sit here. My thoughts:

  • Makefile & shell are difficult to maintain and require specific skills that are not share amongst even good Python devs. An argument to get rid of them. Also, they typically do not work on all systems (eg Windows)
  • If it ain't broken don't fix it. A large reengineering project like this always smells bad. It's likely to consume a lot of energy and be inefficient: in the short term because we won't see gains (gains will be from better maintainability), in the long term because requirements will evolve.

I think that the good solution is to consider case-by-case settings: replacing elements of our toolchain as we go. Ideally the discussion should be case-by-case.

I personally would be fine with having a general policy that we try to prefer Python code to shell. However, what really matters is that the current pool of active devs feel comfortable with changes and are happy about them.

Making a CLI out of scikit-learn is out of scope.

(not the proposal, but a possible misinterpretation).

We are a Python library. We focus on doing high quality numerical algorithms and the tools around them that are necessary to use them to have valid statistical usage an operationalization.

We should not embark on this. It will be a strong distraction (and we have plenty on our plate to satisfy the above mission), and it will lead to frustration from the users because we will likely develop half-baked answer (difficult to have a proper expressivity in a CLI given the complexity of pipelines + model evaluation).

Finally, such projects can be developed outside scikit-learn. The policy these last years has been to develop new ideas outside, to avoid adding even more weight to an already overburden project.

@ogrisel
Copy link
Member
ogrisel commented Nov 4, 2022

These things complicate stuff, quite substantially. I used to understand our CI, and I did it the first time I looked at them. But I haven't followed the recent changes to our CI, and now I have absolutely no clue how we update dependencies in our CI, where all those different lock files come from, and what the CI does; and I think adding this CLI to the CI would complicate things even more.

It's all in build_tools/update_environments_and_lock_files.py now.

@cmarmo
Copy link
Contributor
cmarmo commented Nov 4, 2022

I have been tagged, I dare comment then... :)

I believe adding layers to the development environment makes more difficult to solve the issues in the code.
We often saw that already, when users report possible bugs using, eg, Jupyter Notebooks in docker containers hosting conda environments. This is an old issue and produced some tense exchanges in the past between maintainers and packagers.
Those tools are great for users, in teaching applications or in production, but explicitly hide the mechanics of the code, which is what we don't want at the development stage.

From the contributor point of view I agree with @adrinjalali comment:

If a new contributor comes, fixes a documentation issue, and runs a few custom CLI commands, the only thing they get out of it is that they can put scikit-learn on their github's profile page and say that they have contributed to the project. They haven't actually learned anything, is that a good thing? I'm not sure. I feel like we're expecting less and less from our contributors, and that might not be in their benefit.

I'd rather improve the documentation, not only the one for users but for maintainers too.
I'm afraid sometime back and forth from documentation are necessary to understand how things work, I guess we are all guilty of not reading it as much as is needed, but there is no other solution than going back and looking again at it... and this is true also at the editing stage: the contributor documentation could benefit of quicker updates, even if it is not always perfect, so it could be used by more people and improved in faster iterations.

In particular the PR checks could be better explained and backed by bots to explicit the errors, as said above: I am often doing the job of the bot ... :) ... I believe this is a very valuable information for contributors who don't always come from a background of software development and face for the first time the challenge of the integration.

Finally, I believe that the backlog of pull requests and issues to be addressed is already quite big, without adding new workloads for the developers.

@Micky774
Copy link
Contributor Author

Thanks everyone for your perspectives on this! To summarize (very very briefly), it seems the greatest concerns are:

  1. A custom CLI tool introduces a sizeable maintenance cost.
  2. A custom CLI tool may stifle the growth of new contributors by forcing them to familiarize themselves with our particular abstractions, rather than learning the underlying tools.

As mentioned, I suppose the fundamental issue this would try to solve is the inherit difficulty of discoverability and operability of some tools, however there are preferable solutions:

  1. A "quick-start guide" such as what we have for some sprints may be a valuable addition, providing a place to briefly demonstrate such tools and defer further detail to secondary linked resources (which may just be more of our own documentation).
  2. Bots which can automatically post explicit clarifications on the specific errors which may be present in a PR (e.g. linting).

I do not think that a CLI tool such as this harms a new contributor too much in their journey to master the underlying tools too much, but either way I think we've already arrived at an overall consensus wherein such a tool is not worth the cost, and indeed I agree that the alternative solutions are better implemented first.

Thanks again for the great discussion :)

@jjerphan
Copy link
Member
jjerphan commented Feb 1, 2023

I am coming back to this RFC to complete my point of view.

After having used the dev.py script from SciPy: as a user, it feels nice using this tool mainly because the built trace is actually providing useful and readable information: this gives some mental/psychological comfort which might be useful for some contributors. This is not what we provide at the moment with scikit-learn, but I think we could.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RFC workflow Development workflow changes
Projects
None yet
Development

No branches or pull requests

8 participants
0