-
Notifications
You must be signed in to change notification settings - Fork 1.2k
BUG: fix "ValueError: Bin edges must be unique" exception #104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
One caveat: we might have identical factor values go into different quantiles. This is not a big issue but just to highlight the behaviour.
I see this as an unrelated issue, but if you think it's better to wait I am fine with that. |
return pd.qcut(x, quantiles, labels=False) + 1 | ||
except ValueError: | ||
# This is supposed to fix "ValueError: Bin edges must be unique" | ||
return pd.qcut(x.rank(method='first'), quantiles, labels=False) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the error occurs when there are not enough unique values to generate the quantiles. .rank()
then probably makes them unique? Not sure I follow that. Also, if that's the case I think a better approach would be raise an exception that there are not unique values to generate the requested number of quantiles. Or at least a warning that we're doing some changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quoting here:
"Suppose we have a list with too many duplicates, say we want to split [1,2,3,3,3,3,3,3,4,5,6,7] into quartiles. Right now qcut fails, because the second-lowest quartile consists entirely of '3's, duplicating the bin edges. But there is a natural way to assign the quartiles if we allow duplicate values to be split among different bins: [1,2,3], [3,3,3], [3,3,4], [5,6,7]."
The same is true for this set of values [1,1,1,1,1,2,2,3,3]. pandas.qcut fails with any number of quantiles in this scenario (see log below).
So, the point is that there might be cases where the bin edges fall in the middle of identical values and pandas.qcut will fail there. Ranking them will fix it, but we have that same values will go in different quantiles.
>>> pd.__version__
'0.17.1'
>>> s = pd.Series([1,1,1,1,1,2,2,3,3])
>>> pd.qcut(s,2, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([1, 1, 3])
>>> pd.qcut(s,3, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1., 1., 2., 3.])
>>> pd.qcut(s,4, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([1, 1, 1, 2, 3])
>>> pd.qcut(s,5, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1. , 1. , 1. , 1.8, 2.4, 3. ])
>>> pd.qcut(s,6, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1. , 1. , 1. , 1. , 2. ,
2.66666667, 3. ])
>>> pd.qcut(s,7, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1. , 1. , 1. , 1. , 1.57142857,
2. , 2.85714286, 3. ])
>>> pd.qcut(s,8, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([1, 1, 1, 1, 1, 2, 2, 3, 3])
>>> len(s)
9
>>> pd.qcut(s,9, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 169, in qcut
include_lowest=True)
File "/usr/lib/python2.7/dist-packages/pandas/tools/tile.py", line 189, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 1. , 1. , 1. , 1. , 1. ,
1.44444444, 2. , 2.22222222, 3. , 3. ])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that if you are creating an alpha factor then you are implying there is meaning to the values you are assigning, whatsmore you are also implying there is meaning in the magnitude or distance between values. I'm worried that by arbitrarily throwing equities with the same values into different quantiles it would distort/skew the interpretation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameschristopher I understand that and it is the reason why I explicitly stated this behaviour in my comment on the top. Also I actually opened this PR mostly to listen to your opinions and have a discussion on this exception. I don't really expect to have this PR merged if the pros are less than the cons.
Imagine we have a factor whose values are 50% identical. If we run the tear sheet with a quantiles value different than 2 we'll get a "ValueError: Bin edges must be unique" excepion because the identical values would split in more than one quantile and this would lead to bin edges with the same value.
With this PR we would avoid the exception but the quantiles that contain (fully or partially) the identical values are arbitrary, as we could swap securities with identical factor values between quantiles. The consequence of having generated arbitrary quantiles is that we can only trust the alphalens statistics calculated for the quantiles that do not contain the identical values (the quantiles that are not arbitrary) and we have to look at the quantiles with identical values as just one of the multiple possibilities the identical values can be split into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks the useful discussion.
I agree we should not merge this. Instead, I think we should catch the exception and raise a new one that says that not enough unique values are provided for the required number of quantiles and that the number of quantiles should be reduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok then, I'll close this PR.
Same issue as #87
Fix "ValueError: Bin edges must be unique" exception
One caveat: we might have identical factor values go into different
quantiles. This is not a big issue but just to highlight the behaviour.