-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
MNT: add a logging call if a categorical string array is all convertible #13026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT: add a logging call if a categorical string array is all convertible #13026
Conversation
Added a test. The categorical tests are not very easy to follow, and I wanted to suppress the warnings from spanning the test output so I littered the thing with |
I don't think I can support this. The whole point of categories is that any sting type list or array is interpreted as categorical. It's not matplotlib that should decide whether my list of strings is acceptable to be plotted as categories. |
@ImportanceOfBeingErnest This PR just emits a warning in those cases, but still treats the strings as categoricals. Yes, I admit its annoying for the knowledgable user, but the default behaviour is also quite mysterious, as per many StackOverflow questions. Worse, there is not a good way to google what the problem is, but if this warning came up, the users could google the warning and figure out what the problem was. But I'm willing for this to not go in if its deemed too intrusive. |
24db221
to
244fd21
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would extract the whole convertible checking to a separate function. That makes the logic cleaner. Also, it simplifies the code because you can break out early with a return and don‘t need the convertible state variable.
lib/matplotlib/category.py
Outdated
@@ -168,6 +170,20 @@ def __init__(self, data=None): | |||
if data is not None: | |||
self.update(data) | |||
|
|||
@staticmethod | |||
def _str_is_convertable(val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_str_is_convertible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - the new version bails the check early as soon as a non-convertible string is found.
I‘m +1 on the idea of the PR. If designing from scratch, it would probably have been better not to allow categoricals in places that we expect number-like values. And then guess what it should be based on the type. The problem of loading number-likes as text from a file is very real and something inexperienced user stumble over in particular. Diven the existing API I think it‘s still better to test and issue a warning. You can either ignore it, or filter it out. The second best approach would be to suppress the check using a kwarg or a rcParam, but I would consider this a bit over-engineered. @ImportanceOfBeingErnest you gave two good examples why one might actually use number or date categoricals. Nevertheless, I assume that‘s less common than reading from a csv or similar. |
Honestly, I would seriously question the sanity of any project that asks their users to add something like
to their code. |
@ImportanceOfBeingErnest I understand your concern, but what is the solution, if not a warning like this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Structurally, this is good.
Personally, I also think, it‘s the right thing to do. However, that discussion should be brought to a conclusion before merging.
I'm with @ImportanceOfBeingErnest and am very skeptical of this, giving users warnings on correct use of the library is extremely bad. I suspect that I could be convinced on the 'cast-to-float' case, but I am 👎 on the date checking.
This is as question of what the user intended to do which is something we can not deduce from the input. To quote the zen of python "In the face of ambiguity, refuse the temptation to guess.". I am still partially of a mind that ' xref https://gitter.im/matplotlib/matplotlib?at=5c1a74f94e0d6621ba1ae119 |
The use case is someone calls csvread and gets a list off strings and tries to plot it, the data is all out of order and they turn to stack overflow or github/Gitter to see what is wrong with matplotlib. I agree that we should never have cast to float and I’m fine with strings being interpreted as categorical, but the practical issue remains. I’m not passionate about this PR, just making sure we are on same page as to why it’s needed. There are quite a few stackoverflow requests that are basically this misunderstanding. |
That's out of question. It's superseded by categoricals.
From a pure API perspective, there is no ambiguity and no guessing nowadays: However, I believe that most of the time that a list of string-formatted numbers is passed, the users do not intend to do a categorical plot. They simply don't realize that they are passing strings. While indeed we would guess what the user is trying to achieve, we're not changing the resulting plot. This is just saying "Technically the input is valid, but be reminded that you are getting a categorical plot with numbers as categories" The question is, do we want to help in this situation with a warning (at the cost that it might be a false alert) or do we leave it up to the user? |
@jklymak my reasoning for convertor lock is @anntzer's comment that the case where it's safest to assume that they didn't intend categorical is when they're treating the axis as if it's a datetime or float axis. I'm net neutral on this going in. Like I kinda wish this had gone in as a heads up warning or something (a deprecation warning where the old behavior isn't supported?) back when categorical was first introduced, because as is I don't think this warning should stick around. I wonder if as annoying as it is, correcting users via stackoverflow is the documentation route to go for this. |
My problem with just correcting users on stackoverflow is that there is nothing for them to google on other than “matplotlib data plotted out of order”. So it’s almost completely non-discoverable. Hence the warning message would provide something searchable. |
I think what needs to be done here is to improve the documentation. There are a few principles one needs to know to be able to use matplotlib, and right after (1) You need to have some basic knowledge of python, and (2) matplotlib plots lists or numpy arrays would come (3) matplotlib interpretes the data type of your input and if it is strings, it will assume them to be categorical. This last point would need to be added to the user guide. Second, there is a FAQ around in the docs, and this should be updated with this very problem. Also I think FAQ should be one of the main points on the homepage, such that it'll look like The to-be-created FAQ should show a picture of what can happen if you use strings for plotting, such that users can directly see that it is relevant for them. The text can be written in such a way as to include possible wordings of this problem. Ideally, if you'd google "My plot doesn't look like I want it to" you'd still end up at the FAQ. Concerning possible wordings, there is a number of questions on stackoverflow (this list) that I or others have linked to one of the first occurances of this kind of question. Those can be analysed to find possible formulations that should equally be taken into account for that new FAQ entry. Last, I think the problem is by now widely known. So if a question about it is being asked on stackoverflow, it'll be answered or redirected to existings answers within some hours. |
8000
@ImportanceOfBeingErnest Thanks, and your experience dealing with this stuff on stackoverflow is very welcome. OTOH, 57 questions marked as duplicates, with the latest from 3 days ago doesn't argue strongly to me that people are getting it. 😉 More importantly, I think this is really hard to google to get the right answer for. |
lib/matplotlib/category.py
Outdated
if not isinstance(val, (str, bytes)): | ||
raise TypeError("{val!r} is not a string".format(val=val)) | ||
if val not in self._mapping: | ||
self._mapping[val] = next(self._counter) | ||
# check if we can convert all strings to number or date... | ||
if self._strs_are_convertible(data): | ||
cbook._warn_external('using category units to plot a list of ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will turn this inot a logging call at INFO level Done
2051814
to
df85b66
Compare
lib/matplotlib/category.py
Outdated
if not isinstance(val, (str, bytes)): | ||
raise TypeError("{val!r} is not a string".format(val=val)) | ||
if val not in self._mapping: | ||
self._mapping[val] = next(self._counter) | ||
# check if we can convert all strings to number or date... | ||
if self._strs_are_convertible(data): | ||
_log.info('using category units to plot a list of ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use self.__mapping.keys() instead of data so you at least don't have to double iterate over everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keys will include keys from previous calls to the converter, which may indeed be valid. So thats different than the check here, which tests only the most recent data that are sent to the converter. Is that what you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...I'm just really not liking the second loop, but a second call to unique under the hood would be doing the same thing. :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, someone told me to take it out of the loop, but... new version has it in the loop and only calls the check so long as the strings are convertable to numbers. (i.e. stops checking as soon as a non-number is found).
df85b66
to
0ec990a
Compare
0ec990a
to
5ddbb86
Compare
5ddbb86
to
17a7c54
Compare
@timhoffm are you ok with this the way it is now? Marking your review as stale... |
Trvais failure is un-related. |
@@ -189,11 +210,22 @@ def update(self, data): | |||
""" | |||
data = np.atleast_1d(np.array(data, dtype=object)) | |||
|
|||
# check if convertable to number: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arriving after the battle, but this should have been named "convertible", not "convertable"...
I'm surprised this got merged without explicit approval from Tom considering he raised concerns and is the BDFL. Has this issue been discussed more thoroughly elsewhere in a none public channel? |
ping @tacaswell though we did discuss on a weekly call, and hence we changed from warning to a logging call. |
Note also #13129 was opened in response to that same call.. |
For the record, these log messages can be suppressed as follows: import logging
logging.getLogger('matplotlib.category').setLevel(logging.WARNING) As of now this does not affect any other log messages. (In contrast, |
Users somertimes do:
and are surprised when they get a plot that looks like:
This now emits (if
logging.basicConfig(level=logging.INFO)
):emits the same info message,
but
does not.
Note that we decided to use logging as less intrusive to knowledgable users. See #13129 for more user-friendly way to activate matplotlib logging.
PR Summary
PR Checklist