Re: [Scikit-learn-general] Exclusivity of scikit-learn

Satrajit Ghosh Wed, 03 Dec 2014 18:45:03 -0800

hi joel,

> I don't see what's hard about comparing models from outside scikit-learn,
> on the assumption that all the packages worth comparing are trivial to
> install, and listed in scikit-learn's "Extension Library".
>


i was referring to the scenario where this wasn't a standalone package but
simply a fork of scikit-learn that someone coded a new model into. i agree
if extensions are built as standalone packages, it would be trivial to
install and use.
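
to illustrate the standalone-package case, here is a minimal sketch, assuming every model follows the same fit/predict convention no matter which package it ships in. both classes are hypothetical stand-ins written in plain python (not scikit-learn code, not any real extension package), and the names `MeanRegressor`, `LastValueRegressor` and `compare` are made up for the example.

```python
from statistics import mean


class MeanRegressor:
    """Hypothetical stand-in for an in-library model: predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LastValueRegressor:
    """Hypothetical stand-in for a model shipped in a separate package."""

    def fit(self, X, y):
        self.last_ = y[-1]
        return self

    def predict(self, X):
        return [self.last_ for _ in X]


def mean_squared_error(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def compare(models, X_train, y_train, X_test, y_test):
    """Fit every model on the same split and return {name: test MSE}."""
    scores = {}
    for name, model in models.items():
        predictions = model.fit(X_train, y_train).predict(X_test)
        scores[name] = mean_squared_error(y_test, predictions)
    return scores


X_train, y_train = [[0], [1], [2], [3]], [0.0, 1.0, 2.0, 3.0]
X_test, y_test = [[4], [5]], [4.0, 5.0]

scores = compare(
    {"in_library": MeanRegressor(), "external": LastValueRegressor()},
    X_train, y_train, X_test, y_test,
)
```

the comparison loop never needs to know where a model came from, which is why standalone packages are cheap to compare and merged forks are not.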

cheers,

satra

>
> On 4 December 2014 at 12:01, Satrajit Ghosh <sa...@mit.edu> wrote:
>
>> hi gael and joel,
>>
>> i'll insert a short response here. i actually agree with all the things
>> both of you said. i will however comment on two things:
>>
>> 1. algorithmic scenarios:
>>
>> a. adding algorithms that can be built directly on top of the scikit-learn api
>> b. adding algorithms that require refactoring some, but not all, underlying
>> pieces.
>>
>> in case a), i could simply have a python script, i don't need a fork, but
>> in case b), i need a fork.
>>
>> 2. i love decentralization, but the current architecture doesn't allow me
>> to handle a very simple use-case: i want to compare models in scikit-learn to
>> models outside scikit-learn. what's nice about the api is that it makes
>> comparing models easy, since i can search over various models. however, if i
>> have to install or merge 5 different scikit-learn forks to be able to compare
>> algorithms that are not in scikit-learn, that becomes expensive. if i
>> could do this in an easier manner, i wouldn't really ask for a common
>> bleeding repo.
>>
>> cheers,
>>
>> satra
>>
>> On Wed, Dec 3, 2014 at 6:55 PM, Joel Nothman <joel.noth...@gmail.com>
>> wrote:
>>
>>> While anything is better than publishing an extended fork of the main
>>> repository, I would like to see someone cite an instance where an
>>> open-slather contrib repository has been particularly successful
>>> (especially one where diverse contributions are assured). In line with
>>> Gaël's experience of sandbox coderot, I think it provides very little
>>> benefit over distributed open-source repositories.
>>>
>>> For example, let's say someone has implemented an algorithm (Affinity
>>> Propagation is what triggered this discussion so you might consider that).
>>> Someone else wants to come and add features to it, or even just clean the
>>> code, but by this time the original contributor has moved onto greener
>>> pastures and is not interested in responding to a pull request. Who has the
>>> right, and who the responsibility, to say that this change should be
>>> allowed? Does the contrib repository, too, require an army of maintainers
>>> to familiarise themselves with a vast collection of moderate-quality code?
>>> Without strict gatekeepers, a centralised repository provides almost
>>> nothing, and with strict gatekeepers it entails exactly the issue that we
>>> are trying to solve.
>>>
>>> The model of a distributed plugin library (think Django) seems much more
>>> successful when diversity and changing/variant needs are inevitable. Each
>>> contribution is published individually on PyPI and/or open-source hosting,
>>> and someone curates or facilitates a centralised library (like
>>> djangopackages.com). When a contributor doesn't want to maintain
>>> anymore, the project is forked; and the fittest survive.
>>>
>>> At the same time, scikit-learn is already trying to facilitate external
>>> contributions:
>>>
>>>    - it is working towards an estimator verification API
>>>    <https://github.com/scikit-learn/scikit-learn/issues/3810> so that
>>>    it is easy to test that externally-contributed estimators conform to many
>>>    scikit-learn API standards. Contributions to developing this are welcome!
>>>    - Gaël has commissioned a sphinx plugin
>>>    <https://github.com/sphinx-gallery/sphinx-gallery> to make it easy
>>>    for projects to build documentation by example as in scikit-learn's
>>>    example gallery <http://scikit-learn.org/stable/auto_examples/>. Perhaps
>>>    this could facilitate also displaying external examples in the contrib
>>>    library (but only if someone is willing to code up such a feature!).
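
The idea of checking that an external estimator conforms to API conventions can be sketched as follows. This is only an illustration under stated assumptions, not scikit-learn's actual verification code: the function `find_convention_problems` and both example classes are hypothetical, and the three conventions checked (methods named `fit` and `predict`, `fit` taking `X, y`, `fit` returning `self`) are a small subset chosen for the example.

```python
import inspect


def find_convention_problems(estimator, X, y):
    """Return a list of convention violations; an empty list means none found."""
    problems = []
    for name in ("fit", "predict"):
        if not callable(getattr(estimator, name, None)):
            problems.append(f"missing method: {name}")
    if problems:
        return problems

    # For a bound method, inspect.signature already omits `self`.
    fit_params = list(inspect.signature(estimator.fit).parameters)
    if fit_params[:2] != ["X", "y"]:
        problems.append("fit should take (X, y) as its first arguments")
        return problems

    if estimator.fit(X, y) is not estimator:
        problems.append("fit should return self")
    elif len(estimator.predict(X)) != len(X):
        problems.append("predict should return one value per sample")
    return problems


class GoodEstimator:
    """Hypothetical estimator: predicts the most frequent training label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class ForgetfulEstimator(GoodEstimator):
    """Same model, but fit forgets to return self."""

    def fit(self, X, y):
        super().fit(X, y)  # implicitly returns None


X, y = [[0], [1], [2]], [0, 1, 1]
good_report = find_convention_problems(GoodEstimator(), X, y)
bad_report = find_convention_problems(ForgetfulEstimator(), X, y)
```

A shared check like this lets an external package run the same conformance tests in its own test suite, without its code living in the main repository.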
>>>
>>> Making a template repository that people can clone to get started
>>> writing an external package might be a nice extension of these ideas.
>>> Another idea would be to have a conventional prefix for packages that
>>> extend scikit-learn (just as django packages tend to be prefixed in PyPI by
>>> django-).
>>>
>>> Still, I think facilitating the construction of, and access to, external
>>> projects will be much less unwieldy than a centralised contribs repo, and may
>>> even streamline contribution back to the main repository.
>>>
>>> On 4 December 2014 at 03:18, Gael Varoquaux <
>>> gael.varoqu...@normalesup.org> wrote:
>>>
>>>> On Wed, Dec 03, 2014 at 09:56:55AM -0500, Satrajit Ghosh wrote:
>>>> > - let the community (to put zero additional burden on the current
>>>> > maintainers) maintain a fork of scikit-learn that provides no
>>>> > guarantees other than that it is kept up to date with
>>>> > scikit-learn/master.
>>>>
>>>> The problem with this is that we are still going to have our tracker
>>>> filled with problems that are related to the fork, and not master. To put
>>>> things in perspective, our tracker has 336 issues open, and 1318 closed.
>>>> Just keeping track of those issues is very hard.
>>>>
>>>> Thus the need for a different repo (e.g. scikit-learn-contrib, as
>>>> suggested by Mathieu).
>>>>
>>>> > - people are welcome to add any algorithms to this (trivial,
>>>> > non-trivial, recent)
>>>>
>>>> What you are suggesting is very similar to things that have been tried as
>>>> a 'sandbox', for instance in scipy. Experience has shown that the code
>>>> rots, because nobody feels responsible for it. It's been tried, it
>>>> fails, but if you feel like doing it, you should go ahead. Do you need
>>>> anything from us?
>>>>
>>>> I would believe more in separate repos in a 'scikit-learn-contrib' github
>>>> organization, because it would give a feeling of responsibility to the
>>>> different owners of the repos.
>>>>
>>>> > - folks don't have to recreate packaging
>>>>
>>>> I don't understand: if there are releases, and packaging, someone has to
>>>> do it. It doesn't just happen. It's actually a lot of work.
>>>>
>>>> If it's just a fork, without any releases, what's the gain? In addition,
>>>> if somebody is not doing the work of making sure that it builds and runs
>>>> on various platforms, quite quickly it will stop working on different
>>>> versions of Python and different platforms.
>>>>
>>>> > - it brings all the folks who are forking anyway together instead of
>>>> > splitting off into forks (multiple forks are harder to use)
>>>>
>>>> But someone has to be making the merges :). So the work is there.
>>>>
>>>> > - it makes for increased availability of algorithms that may be useful
>>>> > in practice but never make it out because the world is biased towards
>>>> > loudspeakers
>>>>
>>>> Probably, provided that the project actually flies. But I really fear
>>>> coderot. The amount of work to keep the scikit-learn project going is
>>>> just huge. If nobody is doing this work, coderot would come in very
>>>> quickly.
>>>>
>>>> > - it doesn't add anything to the current maintainers' plates, nor take
>>>> > away anything from the main project. perhaps those wishing to add things
>>>> > will take it upon themselves to maintain this fork.
>>>>
>>>> As long as it is called differently, and _has a different import name_.
>>>> If not, I can easily forecast the situation where users are complaining
>>>> about scikit-learn and after a long debugging session we find that they
>>>> are running some weird fork.
>>>>
>>>>
>>>> I think that there is something flawed in the way you see the life of a
>>>> project like scikit-learn. You seem to think that it is just an
>>>> accumulation of code. That putting code together is enough to make a
>>>> project successful. But if that's the case, why don't you just create
>>>> something else, just anything else, and accumulate code? More
>>>> importantly, why do you want algorithms in scikit-learn? Why aren't you
>>>> happy with just code on the Internet that you can download? If you ask
>>>> yourself these questions, you will probably find where the value of
>>>> scikit-learn lies, and this will also tell you why there is a huge effort
>>>> in maintaining scikit-learn.
>>>>
>>>>
>>>> Things like this, e.g. sandboxes where there is no feeling of belonging to
>>>> a global project and no harmonizing effort, have been tried in the past.
>>>> They fail because of coderot. Actually, to give some historical
>>>> perspective, a long time ago, there was a scipy 'sandbox', in the scipy
>>>> SVN. It didn't have much working, mostly dead code. We hypothesized that
>>>> it was because of lack of visibility, so the 'sandbox' was cleaned, given
>>>> some structure, and renamed 'scikits'. Scikits weren't getting much
>>>> traction inside the scipy codebase, because people were having a hard
>>>> time working there (back then it was an SVN, but there was also the
>>>> problem of compiling scipy, which is a bit hard). So we started pulling
>>>> things out of the SVN. And that's how the current scikits were born. Some
>>>> of these scikits took off, because they had clear project management:
>>>> releases, documentation, quality.
>>>>
>>>> It's interesting that almost ten years later, we are running into the same
>>>> problems. I think that this is not by chance. The reasons that these
>>>> evolutions happen are the following:
>>>>
>>>> 1. Projects are non-linearly hard to evolve. Bigger projects are
>>>>    significantly harder to drive than small projects. This is a very
>>>>    true law of project management and is really underestimated by too
>>>>    many [1].
>>>>
>>>> 2. People want different things, and that's perfectly legitimate. The
>>>>    statsmodels guys wanted control over p-values. The scikit-learn guys
>>>>    wanted good prediction. Both use cases are valid (I am an avid user of
>>>>    statsmodels), but doing both in the same project was much, much harder
>>>>    than doing two projects.
>>>>
>>>> Thus I think that it is natural that some ecosystem of different
>>>> projects, from general to specific, shapes up. Yes, it's very important to
>>>> keep in mind the big picture, and that people with close enough goals
>>>> unite, but only in balance with point 1.
>>>>
>>>> By the way, I care very much about the ecosystem. When we split off HMMs,
>>>> I spent half a day making them a separate package, with setup.py, travis,
>>>> a README, examples, documentation:
>>>> https://github.com/hmmlearn
>>>> It did take a good 4 hours. Nothing happens for free. I did this even
>>>> though I do not use HMMs at all.
>>>>
>>>>
>>>> In terms of action points, to summarize my position:
>>>>
>>>> - You are free to create a fork. I strongly ask that you change the
>>>>   import name, otherwise you will be putting a burden on the main
>>>>   scikit-learn maintainers.
>>>>
>>>> - What I think could work would be a scikit-learn-contrib organization
>>>>   with different repositories in it. I see that Mathieu and Andy have the
>>>>   same feeling. I think we all agree that it should be done. I am ready to
>>>>   create the organization, and give you (and many others) the keys of the
>>>>   kingdom.
>>>>
>>>> Gaël
>>>>
>>>>
>>>> [1] This has actually been studied. Here is one paper (out of probably
>>>>     many): http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1702600
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>>>> with Interactivity, Sharing, Native Excel Exports, App Integration &
>>>> more
>>>> Get technology previously reserved for billion-dollar corporations, FREE
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>
>>
>
