10000 pip install sklearn behaviour · Issue #8215 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

pip install sklearn behaviour #8215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lesteve opened this issue Jan 19, 2017 · 36 comments
Closed

pip install sklearn behaviour #8215

lesteve opened this issue Jan 19, 2017 · 36 comments

Comments

@lesteve
Copy link
Member
lesteve commented Jan 19, 2017

I was surprised that pip install sklearn would do something without giving an error, which I learned from
#8185 (comment). All in all it is not so bad because scikit-learn is in install_requires of sklearn setup.py so you do end-up with scikit-learn installed. Still it is a bit confusing that pip list shows both scikit-learn 0.18.1 and sklearn 0.0.

@chrisfilo if I believe https://pypi.org/project/sklearn it looks like you uploaded it. I see a similar approach is done for bids and pybids https://pypi.org/project/bids. Is it a common thing to do prevent package names to be squatted on PyPI?

There must be a way to have a slightly clever setup.py that allows upload but prevents install with an error saying something like "Please use pip install scikit-learn".

@glemaitre
Copy link
Member

Maybe you could make sklearn an alias of scikit-learn.
I went across the following package.
I didn't check the mechanism thought.

8000

@lesteve
Copy link
Member Author
lesteve commented Jan 19, 2017

Quickly looking at the pypi-alias code it may well be that pypi-alias was used to generate the sklearn PyPI package indeed.

@glemaitre
Copy link
Member

Indeed you are right.

At the end, pip install sklearn or pip install scikit-learn --- apart from the annoying sklearn (0.0) showed in the pip list --- will install the latest available build from PyPI. I would have thought that this is the expected behaviour.

However, I could find a case which I find annoying. As a user, I would expect pip install sklearn==0.17 to do something as pip install scikit-learn==0.17. This will throw an error since that sklearn as only a (0.0) version.

Regarding the issue in #8185, I was checking on the doc if it was mentioned that you should use pip install . to trigger the install of the local version. I couldn't find it. Did I miss it?

@glemaitre
Copy link
Member
glemaitre commented Jan 19, 2017

And I forgot that pip uninstall sklearn will not do what you are expecting as well if you expect to remove scikit-learn.

@hugovk
Copy link
Contributor
hugovk commented Nov 15, 2017

Maybe it would be wise to phase out the sklearn package.

How about:

  • first allowing installation through pip install sklearn but with a warning,
  • then some time later don't install and just show an error,
  • and finally remove the package altogether.

@lesteve
Copy link
Member Author
lesteve commented Nov 15, 2017

My earlier comment is still valid:

There must be a way to have a slightly clever setup.py that allows upload but prevents install with an error saying something like "Please use pip install scikit-learn".

@hugovk if you can find or put together a setup.py that just raises an exception that would be a step forward on this issue.

Just to be clear I don't think we should have a warning phase (since the problem is super easy to fix and to be honest I suspect very few people are doing pip install sklearn). We should not remove the package it is there for a good reason: people might still do pip install sklearn and/or it could be squatted by not so benevolent people.

@hugovk
Copy link
Contributor
hugovk commented Nov 15, 2017

Here's PyPI downloads stats (via pypinfo) for the last month suggesting over a quarter of scikit-learn installs are through sklearn!

package download_count
sklearn 283,395
scikit-learn 1,082,341

How about something like this in sklearn's setup.py?

raise RuntimeError("sklearn is now known as scikit-learn. Please run: pip install scikit-learn")

@chrisgorgo
Copy link
Member

Happy to transfer ownership of the sklearn pypi to someone more involved in the project.

My two cents: if it ain't broken don't fix it. I see little benefit from preventing people from using sklearn and a lot of broken CI setups and packages if you do implement this exception.
I also vaguely remember that we added this meta package because of some meta-programming pattern that was looking for packages, by the name of the the module (which is sklearn). This would also break - sadly I don't remember the details.

@amueller
Copy link
Member

@lesteve do you have an account? @chrisfilo you could also transfer to me. (t3kcit I think).

I would argue that keeping it as an alias is fine. And yes, I think a lot of CI scripts will break and it'll be no benefit to anyone imho.

@lesteve
Copy link
Member Author
lesteve commented Nov 15, 2017

Here are a few ways this is broken:

pip install sklearn==0.18.1

Output:

 Could not find a version that satisfies the requirement sklearn==0.18.1 (from versions: 0.0)
No matching distribution found for sklearn==0.18.1
pip install sklearn
pip uninstall -y sklearn
python -c "import sklearn"  # Oh wait why does this work I thought I just uninstalled sklearn

pip list ouput is also is misleading (why do I have two versions of scikit-learn and why is one 0.0, which was actually the question from #8185 (comment)):

scikit-learn (0.19.1)
sklearn (0.0)

If we believe the numbers from PyPI (1 in 5 scikit-learn pip install is through pip install sklearn really ???), maybe it's worth having a warning for a while. Because the fix is so simple and could be very well explained in the error message I would slightly be in favour of just raising an exception.

My two cents: if it ain't broken don't fix it. I see little benefit from preventing people from using sklearn and a lot of broken CI setups and packages if you do implement this exception.

@chrisfilo I tend to be with you on this but in this case I am not, not sure exactly why. Probably part of the story is the mix of surprise and WTF I felt when I realised pip install sklearn was actually working. It's a good point that sklearn in requirements.txt or setup.py could be an issue. Maybe it is worth setting using github advanced search or using the github API to figure out how much people do that.

@lesteve do you have an account? @chrisfilo you could also transfer to me. (t3kcit I think).

I do have an account on PyPI which is lesteve.

@chrisgorgo
Copy link
Member

Ownership transferred. Remember: With great power comes... ability to mess up a lot of packages ;)

@ghost
Copy link
ghost commented Jan 29, 2018

pip install -U scikit-learn==0.18

@rth
Copy link
Member
rth commented Jun 27, 2018

For future reference, as I couldn't find a version under version control, I have made a copy of the current contents of https://pypi.org/project/sklearn/ under https://github.com/rth/scikit-learn-invalid-pypi

The consensus about this issue seems to be not to do anything, but if we decide to change something in that PyPi package in the future, please make a PR there first and ping relevant maintainers so the changes can be reviewed / commented on before uploading to PyPi.

@lesteve
Copy link
Member Author
lesteve commented Jun 27, 2018

For future reference, as I couldn't find a version under version control, I have made a copy of the current contents of https://pypi.org/project/sklearn/ under https://github.com/rth/scikit-learn-invalid-pypi

Thanks! For compleness' sake, last time I looked I got the feeling that they were done through a pypi alias tool, maybe https://pypi.org/project/pypi-alias?

The consensus about this issue seems to be not to do anything,

Although it makes me slightly sad, I think that this is not worth the pain indeed.

If someone has the time, it would be great to:

  • look at PyPI downloads to see whether pip install sklearn behaviour #8215 (comment) is still true
  • see whether a github search can help evaluating the number of packages using the sklearn package on PyPI, e.g. how many packages have sklearn in their setup.py

Other ideas:

  • adding a warning in setup.py could be a middle ground solution and reevaluating in a few years. I doubt a warning would have significant impact though
  • coupling the breaking change to a scikit-learn release so that we can advertise it and people are more aware of it.

Just for the story, I was on the wrong side on msgpack-python being renamed to msgpack on PyPI and this was clearly not my definition of fun (bonuses were: interaction between conda keeping the old name and pip, different behaviour whether you install a wheel or from source, see dask/distributed#1913 (comment) for the short version if you really care).

@zhenruiliao
Copy link

An issue that could cause significant confusion is that if a new version is released after you install using this alias, pip install --upgrade sklearn fails silently (the only requirement is some version of scikit-learn, not necessarily the most recent version). I suppose it would be possible to continually release new versions of sklearn requiring the most recent version of scikit-learn, but that would be absurd

@rth
Copy link
Member
rth commented Aug 16, 2018

Another possibility without breaking peoples code could be to detect whether the sklearn package is installed (e.g. by looking for site-packages/sklearn-0.0.dist-info/ or with some more standard python way), and raise a warning at import time if that is the case. This may add some small overhead for the import though.

@timkpaine
Copy link

for reference from airflow

import os
from setuptools import setup, find_packages

with open("README.md", "r") as fh:
    long_description = fh.read()

setup(
    name="airflow",
    version="0.6",
    author="Apache Author",
    author_email="dev@airflow.incubator.apache.org",
    description="Placeholder for the old Airflow package",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/apache/incubator-airflow",
    packages=find_packages(),
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Environment :: Web Environment',
        'Intended Audience :: Developers',
        'Intended Audience :: System Administrators',
        'License :: OSI Approved :: Apache Software License',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Topic :: System :: Monitoring',
    ]
)

if not os.getenv("OVERRIDE"):
    raise RuntimeError('Please install package apache-airflow instead of airflow')

@lesteve
Copy link
Member Author
lesteve commented Jun 21, 2021

I guess maybe the 1.0 release could be an opportunity to do this cleanup + breaking change?

I am imagining something like to make the disruption a bit smoother:

  • soon: uploading sklearn 0.1 on PyPI which raises an error explaining the change and allow an easy way out through an environment variable (better suggestions welcome for the easy way out)
  • in combination with the scikit-learn 1.0 release (I guess in a few months?): upload sklearn 0.2 on PyPI which always raises an error (no easy way out). Optionally remove sklearn 0.1 from PyPI and maybe sklearn 0.0.

Alternatives considered but discarded: add a warning in setup.py, the warning will only be visible when using pip install -v (see https://stackoverflow.com/a/44617404) and will probably be ignored anyway (pip install -v is very verbose) ...

One possible thing to look at is sklearn dependents from https://libraries.io/pypi/sklearn. One caveat is that there are plenty of false positives which seems like historical dependents of sklearn. For example pycaret is listed probably because sklearn was amongst the dependencies at one point (2 years ago):
https://github.com/pycaret/pycaret/blob/b9756afeae6b24107d86c74544deb83fe5527b50/requirements.txt#L7
but this has been fixed since.

For completeness, I checked the numbers pypistats.org, roughly 29% of the scikit-learn installs come from the sklearn package ... which is in the same ballpark as #8215 (comment). I find this both amazing and slightly sad at the same time 😉.

sklearn

https://pypistats.org/packages/sklearn
Downloads last month: 6,528,058

scikit-learn

https://pypistats.org/packages/scikit-learn
Downloads last month: 22,800,647

@rth
Copy link
Member
rth commented Jun 21, 2021

It would be good to fix this. I think if we error we always need a way out, because some popular but no longer maintained libraries can have sklearn in their list of dependencies and there is not much one can do about it.

Maybe we could make 1 sklearn release,

  • that changes the description in https://pypi.org/project/sklearn/ to be much more verbose and explicitly say that bad things will happen if you use it.
  • Always keep the env variable to disable the error
  • say starting from September 1st start producing an error each Monday between 3PM CEST to 7PM CEST . This should raise enough attention so that issues are opened in popular libraries and this is fixed.
  • Then progressively increase the frequency of failure, from once per week to every other day, every day etc, with a sufficiently long period. All this can be defined initially in that 0.1 sklearn release.

This would give time libraries that have sklearn in their dependencies to update and make a new release. I forgot what's the name for such voluntary intermittent failures is.

I think it would be better to do this completely independently from any scikit-learn release, to avoid implying this is somehow a breaking change / regression, though we can mention it in release notes.

@hugovk
Copy link
Contributor
hugovk commented Jun 21, 2021

I forgot what's the name for such voluntary intermittent failures is.

Brownout? Compare:

To help make sure all affected customers are aware of this change, we will temporarily suspend workflows on Ubuntu 16.04 LTS environment for two short 'brownout' periods ahead of the final removal. Builds that are scheduled to run during the brownout periods will fail.

https://github.blog/changelog/2021-04-29-github-actions-ubuntu-16-04-lts-virtual-environment-will-be-removed-on-september-20-2021/

@rth
Copy link
Member
rth commented Jun 21, 2021

Yes, thanks. For some reason my brain only suggested hashbrown when I tried to find the term :)

@lesteve
Copy link
Member Author
lesteve commented Jun 21, 2021

I think if we error we always need a way out, because some popular but no longer maintained libraries can have sklearn in their list of dependencies and there is not much one can do about it.

Good point, this also applies to old package versions that at one point required sklearn and unfortunately we can not fix the past 😉.

I think it would be better to do this completely independently from any scikit-learn release, to avoid implying this is somehow a breaking change / regression, though we can mention it in release notes.

OK I am fine with this.

About the brownout, why not ... I'd rather have something simpler if possible. It feels slightly weird to have something failing and then working the next day (I do get the increasing failure frequency thing but still ...). IMO one of the reason it works for github is that they can detect you are still not using the recommended way and they can send you emails from time to time to nudge you into the right direction (e.g. creating a token to use git on the command line if you are using https).

So maybe a fixed schedule with a few different environment variables like IGNORE_SKLEARN_PACKAGE_WARNING, IGNORE_SKLEARN_PACKAGE_SECOND_WARNING, IGNORE_SKLEARN_PACKAGE_LAST_WARNING ... one small advantage is that then you can easily google who is actually using these environment variables (I am slightly annoyed I was not able to find a good way to know who is the root cause of all of these sklearn package PyPI downloads, it must be the CI of a popular package and of its dependents ...).

@rth
Copy link
Member
rth commented Jun 21, 2021

IMO one of the reason it works for github is that they can detect you are still not using the recommended way and they can send you emails from time to time to nudge you

Brownout is the nudge. Other per-user notifications are not expected. PyPi did a similar thing for deprecating non SNI compatible clients pypi/support#978. It quite suitable for situations where you want some change to happen, make sure users are aware but don't want to make it a hard blocker either. It also allows one to monitor the migration as it happens (in our case PyPi stats) and adjust the schedule if necessary.

If we just fail permanently, it becomes a blocker for all maintainers who happened to depend on sklearn. They would be expected to drop everything and start working on their CI, updating documentation, responding to user complaints etc. Which I find is not very nice, particularly that there is no urgency to fix this. In our case, we don't really care if this migration happens even say by mid 2022 as long as it does happens.

So maybe a fixed schedule with a few different environment variables like IGNORE_SKLEARN_PACKAGE_WARNING, IGNORE_SKLEARN_PACKAGE_SECOND_WARNING, IGNORE_SKLEARN_PACKAGE_LAST_WARNING

That would be a bit confusing for downstream packages to document IMHO :) I would be rather +1 for a single env variable.

I think we can search for packages that currently use sklearn as a dependency with BigQuery, assuming one can filter files to requirements*.txt, setup.py only and search inside the file contents.

@ogrisel
Copy link
Member
ogrisel commented Jun 22, 2021

+1 for the brownout strategy. We can setup a wiki page or an issue that documents the planned brownout schedule and announce it on twitter + the mailing list.

@ogrisel
Copy link
Member
ogrisel commented Jun 22, 2021

We should discuss this on the next online dev meeting.

@hugovk
Copy link
Contributor
hugovk commented Jun 24, 2021

sklearn had 7,232,064 downloads over 30 days.

I've created PRs for four packages which had a total 7,039,690 downloads over 30 days:

These don't necessarily account for 7,039,690 sklearn downloads, but it'll help!

@rth
Copy link
Member
rth commented Jun 24, 2021

Thanks a lot @hugovk!

@lesteve
Copy link
Member Author
lesteve commented Jun 25, 2021

Very nice indeed @hugovk! Out of curiosity (and in case we need to do it again in the future), how did you find the sklearn dependents with the most PyPI dowloads?

@hugovk
Copy link
Contributor
hugovk commented Jun 26, 2021

I knocked up a script to find which sdist and wheel packages in https://github.com/DavHau/pypi-deps-db have sklearn as a dependency, and filtered out only those in the top 4k most-downloaded: https://github.com/hugovk/top-pypi-packages:

jsonpickle {'jsonpickle-1.5.2-py2.py3-none-any.whl'}
xgboost {'xgboost-1.0.0rc1-py2.py3-none-manylinux1_x86_64.whl'}
shap {'0.5'}
hmmlearn {'hmmlearn-0.2.0-cp34-cp34m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl', 'hmmlearn-0.2.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl', 'hmmlearn-0.2.0-cp33-cp33m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl'}
moviepy {'moviepy-0.2.3.2-py2.py3-none-any.whl'}
fastai {'0.7.0'}
pyldavis {'3.3.0'}
dbnd {'dbnd-0.42.1-py2.py3-none-any.whl'}
pdpbox {'0.2.1'}
datarobot {'datarobot-2.22.0-py2-none-any.whl', 'datarobot-2.25.0-py3-none-any.whl'}
caer {'1.3.6'}
flair {'0.4.1'}
recordlinkage {'0.2'}

Some of these are older releases, and they've since replaced sklearn. I made PRs for the rest, except for https://pypi.org/project/datarobot/ (108k downloads in 30 days) which might be closed source (or I couldn't find the repo). Could be worth emailing them.

Here's also a list of all matching packages: sklearn-all.txt

(To run the script, clone the repo and check the setup info in the top-level docstring. Likely Python 3.9+ but easy to adjust for 3.6+. If anyone would like a hand running the script, feel free to open an issue in its repo!)

@rth
Copy link
Member
rth commented Jun 26, 2021

except for pypi.org/project/datarobot (108k downloads in 30 days) which might be closed source (or I couldn't find the repo). Could be worth emailing them.

Email sent. It's only used in their extra examples, so most of their users probably wouldn't download it.

@lesteve
Copy link
Member Author
lesteve commented Jun 28, 2021

Looking at #8215 (comment) I had some big hope that most (6.7M over 7.2M) sklearn downloads were coming from jsonpickle but actually sklearn is only in their testing extras_require (see https://github.com/jsonpickle/jsonpickle/blob/master/setup.cfg#L59) so this is quite unlikely that jsonpickle is the main culprit.

@ahjota
Copy link
ahjota commented Sep 13, 2021

except for pypi.org/project/datarobot (108k downloads in 30 days) which might be closed source (or I couldn't find the repo). Could be worth emailing them.

Email sent. It's only used in their extra examples, so most of their users probably wouldn't download it.

Speaking as one of the maintainers, we just released datarobot v2.26.0 which is no longer dependent on sklearn. Same with our related datarobot-early-access package in v2.26.0.2021.9.13. Thanks for the heads up.

@eikevons
Copy link

For reference, there is a anti-typo-squatting package for NLTK that performs the "fail with helpful error" strategy: ntlk

@lesteve
Copy link
Member Author
lesteve commented Oct 13, 2022

I have created a repo to to implement the sklearn package deprecation brownout:
https://github.com/scikit-learn/sklearn-pypi-package

Feed-back on the repo, more than welcome. The start date is currently set to November 1st with a brownout period of one year but this is open to discussion.

@lesteve
Copy link
Member Author
lesteve commented Nov 9, 2022

sklearn 0.0.post1 has been uploaded, the brownout will start December 1st. I think we can close this even if the brownout period will end December 1st 2023.

More details can be found on the repo: https://github.com/scikit-learn/sklearn-pypi-package

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

0