8000 [MRG] remove warnings in univariate feature selection by larsmans · Pull Request #2369 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] remove warnings in univariate feature selection #2369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 19, 2013

Conversation

larsmans
Copy link
Member

These warnings are practically always triggered when doing text classification or any task with lots of boolean features. I suggest to just remove them, since in those cases the warning is so confusing that it does more harm than good.

@ogrisel
Copy link
Member
ogrisel commented Aug 19, 2013

+1

@GaelVaroquaux
Copy link
Member

👍 for removal, as long as we use a stable, non-random, sort. The reason is that I want to have 100% reproducibility. The default sort used by argsort is quicksort which is not stable. Should we switch to a heapsort, which is stable, but has the drawback of requiring p/2 work space in memory? I think that the work space requirement is not too bad, is it is in O(p) and not O(n p).

@larsmans
Copy link
Member Author

What is p in this formula? According to Wikipedia, heapsort should require O(1) auxiliary space (apart from the n indices allocated by argsort, of course).

@GaelVaroquaux
Copy link
Member

What is p in this formula?

Number of features in the learning problem.

According to Wikipedia, heapsort should require O(1) auxiliary space

Correct, I made a mistake and meant mergesort rather than heapsort, which
is the only stable sort implemented in numpy.

@larsmans
Copy link
Member Author

Actually there's a heapsort in NumPy master and it seems to have been there since the days of numarray. But since argsort takes linear space for its output anyway, I suggest we just take the fastest option. I'll profile a bit.

@larsmans
Copy link
Member Author

Timings:

>>> scores = np.random.randn(10000000)
>>> %timeit np.argsort(scores, kind='quicksort')
1 loops, best of 3: 3.35 s per loop
>>> %timeit np.argsort(scores, kind='heapsort')
1 loops, best of 3: 18.1 s per loop
>>> %timeit np.argsort(scores, kind='mergesort')
1 loops, best of 3: 3.08 s per loop

Again, with fresh random numbers:

>>> scores = np.random.randn(10000000)
>>> %timeit np.argsort(scores, kind='quicksort')
1 loops, best of 3: 3.32 s per loop
>>> %timeit np.argsort(scores, kind='heapsort')
1 loops, best of 3: 17.9 s per loop
>>> %timeit np.argsort(scores, kind='mergesort')
1 loops, best of 3: 3.08 s per loop

Memory usage:

$ cat testsort.py 
import numpy as np
import sys

rng = np.random.RandomState(0xCAFE)

scores = rng.randn(10000000)
np.argsort(scores, kind=sys.argv[1])

$ /usr/bin/time python testsort.py quicksort
3.91user 0.10system 0:04.02elapsed 99%CPU (0avgtext+0avgdata 169604maxresident)k
0inputs+0outputs (0major+42856minor)pagefaults 0swaps

$ /usr/bin/time python testsort.py heapsort
18.70user 0.11system 0:18.84elapsed 99%CPU (0avgtext+0avgdata 169608maxresident)k
0inputs+0outputs (0major+42858minor)pagefaults 0swaps

$ /usr/bin/time python testsort.py mergesort
3.65user 0.14system 0:03.80elapsed 99%CPU (0avgtext+0avgdata 208676maxresident)k
0inputs+0outputs (0major+52624minor)pagefaults 0swaps

Without the argsort, the test script takes just over a second to generate these random numbers. So indeed, mergesort takes more memory (40MB per megafeature), but it can actually be faster than quicksort. Heapsort is dead slow.

@GaelVaroquaux
Copy link
Member

So indeed, mergesort takes more memory (40MB per megafeature), but it can actually be faster than quicksort. Heapsort is really slow.

So let's use mergesort. I don't find the memory-usage numbers
unreasonnable, and the behavior (stable sort) will be less suprising to
users.

@agramfort
Copy link
Member

while you're at it I'd also like to have a stable sort in StratifiedKFold :)

On Mon, Aug 19, 2013 at 2:33 PM, Gael Varoquaux
notifications@github.com wrote:

So indeed, mergesort takes more memory (40MB per megafeature), but it can
actually be faster than quicksort. Heapsort is really slow.

So let's use mergesort. I don't find the memory-usage numbers
unreasonnable, and the behavior (stable sort) will be less suprising to
users.


Reply to this email directly or view it on GitHub.

@GaelVaroquaux
Copy link
Member

while you're at it I'd also like to have a stable sort in StratifiedKFold :)

PR welcomed :P

@agramfort
Copy link
Member

PR welcomed :P

I've heard this before ;)
Seriously I'll do it when I'm back at work unless Lars beats me to it.

These warnings are issued practically always when using
frequency-valued or boolean data.

Switched to a stable sort to get reproducible results.
@larsmans
Copy link
Member Author

Force-pushed a new version. Time to go back to the actual experiment I was performing, @agramfort stratified k-fold is yours :p

@GaelVaroquaux
Copy link
Member

👍 for merge. Thanks!

ogrisel added a commit that referenced this pull request Aug 19, 2013
[MRG] remove warnings in univariate feature selection
@ogrisel ogrisel merged commit 9033baf into scikit-learn:master Aug 19, 2013
@ogrisel
Copy link
Member
ogrisel commented Aug 19, 2013

I pushed the green button as travis was happy.

@larsmans larsmans deleted the no-warnings-in-fs branch October 2, 2015 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Assignees
No one assigned
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0