8000 BUG: fix "ValueError: Bin edges must be unique" exception by luca-s · Pull Request #104 · quantopian/alphalens · GitHub
[go: up one dir, main page]

Skip to content

BUG: fix "ValueError: Bin edges must be unique" exception #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

luca-s
Copy link
Collaborator
@luca-s luca-s commented Nov 11, 2016

Same issue as #87

Fix "ValueError: Bin edges must be unique" exception

One caveat: we might have identical factor values go into different
quantiles. This is not a big issue but just to highlight the behaviour.

One caveat: we might have identical factor values go into different
quantiles. This is not a big issue but just to highlight the behaviour.
@jameschristopher
Copy link
Contributor
jameschristopher commented Nov 22, 2016

@luca-s should I hold off on merging this PR until issue #105 is resolved?

@luca-s
Copy link
Collaborator Author
luca-s commented Nov 23, 2016

I see this as an unrelated issue, but if you think it's better to wait I am fine with that.

return pd.qcut(x, quantiles, labels=False) + 1
except ValueError:
# This is supposed to fix "ValueError: Bin edges must be unique"
return pd.qcut(x.rank(method='first'), quantiles, labels=False) + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the error occurs when there are not enough unique values to generate the quantiles. .rank() then probably makes them unique? Not sure I follow that. Also, if that's the case I think a better approach would be raise an exception that there are not unique values to generate the requested number of quantiles. Or at least a warning that we're doing some changes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting here:

"Suppose we have a list with too many duplicates, say we want to split [1,2,3,3,3,3,3,3,4,5,6,7] into quartiles. Right now qcut fails, because the second-lowest quartile consists entirely of '3's, duplicating the bin edges. But there is a natural way to assign the quartiles if we allow duplicate values to be split among different bins: [1,2,3], [3,3,3], [3,3,4], [5,6,7]."

The same is true for this set of values [1,1,1,1,1,2,2,3,3]. pandas.qcut fails with any number of quantiles in this scenario (see log below).

So, the point is that there might be cases where the bin edges fall in the middle of identical values and pandas.qcut will fail there. Ranking them will fix it, but we have that same values will go in different quantiles.

>>> pd.__version__
'0.17.1'
>>> s = pd.Series([1,1,1,1,1,2,2,3,3])
>>> pd.qcut(s,2, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([1, 1, 3])
>>> pd.qcut(s,3, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1.,  1.,  2.,  3.])
>>> pd.qcut(s,4, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([1, 1, 1, 2, 3])
>>> pd.qcut(s,5, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1. ,  1. ,  1. ,  1.8,  2.4,  3. ])
>>> pd.qcut(s,6, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1.        ,  1.        ,  1.        ,  1.        ,  2.        ,
        2.66666667,  3.        ])
>>> pd.qcut(s,7, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1.        ,  1.        ,  1.        ,  1.        ,  1.57142857,
        2.        ,  2.85714286,  3.        ])
>>> pd.qcut(s,8, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([1, 1, 1, 1, 1, 2, 2, 3, 3])
>>> len(s)
9
>>> pd.qcut(s,9, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
    include_lowest=True)
  File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1.        ,  1.        ,  1.        ,  1.        ,  1.        ,
        1.44444444,  2.        ,  2.22222222,  3.        ,  3.        ])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that if you are creating an alpha factor then you are implying there is meaning to the values you are assigning, whatsmore you are also implying there is meaning in the magnitude or distance between values. I'm worried that by arbitrarily throwing equities with the same values into different quantiles it would distort/skew the interpretation.

Copy link
Collaborator Author
@luca-s luca-s Nov 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameschristopher I understand that and it is the reason why I explicitly stated this behaviour in my comment on the top. Also I actually opened this PR mostly to listen to your opinions and have a discussion on this exception. I don't really expect to have this PR merged if the pros are less than the cons.

Imagine we have a factor whose values are 50% identical. If we run the tear sheet with a quantiles value different than 2 we'll get a "ValueError: Bin edges must be unique" excepion because the identical values would split in more than one quantile and this would lead to bin edges with the same value.

With this PR we would avoid the exception but the quantiles that contain (fully or partially) the identical values are arbitrary, as we could swap securities with identical factor values between quantiles. The consequence of having generated arbitrary quantiles is that we can only trust the alphalens statistics calculated for the quantiles that do not contain the identical values (the quantiles that are not arbitrary) and we have to look at the quantiles with identical values as just one of the multiple possibilities the identical values can be split into.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks the useful discussion.

I agree we should not merge this. Instead, I think we should catch the exception and raise a new one that says that not enough unique values are provided for the required number of quantiles and that the number of quantiles should be reduced.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok then, I'll close this PR.

@luca-s luca-s closed this Nov 29, 2016
@luca-s luca-s deleted the qcut_same_bins_fix branch January 20, 2017 10:26
@luca-s luca-s mentioned this pull request Feb 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0