8000 MNT: add a logging call if a categorical string array is all convertible by jklymak · Pull Request #13026 · matplotlib/matplotlib · GitHub
[go: up one dir, main page]

Skip to content

MNT: add a logging call if a categorical string array is all convertible #13026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 20, 2019

Conversation

jklymak
Copy link
Member
@jklymak jklymak commented Dec 20, 2018

Users somertimes do:

plt.plot(['1', '3', '2'])

and are surprised when they get a plot that looks like:

figure_1

This now emits (if logging.basicConfig(level=logging.INFO)):

INFO:matplotlib.category:using category units to plot a list of strings that is all floats or parsable as dates. If you do not mean these to be categories, cast to the approriate data type before plotting.
plt.plot(['10/20/10'])

emits the same info message,

but

plt.plot(['1', 'a', '2'])

does not.

Note that we decided to use logging as less intrusive to knowledgable users. See #13129 for more user-friendly way to activate matplotlib logging.

PR Summary

PR Checklist

  • Has Pytest style unit tests
  • Code is Flake 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@jklymak
Copy link
Member Author
jklymak commented Dec 20, 2018

Added a test. The categorical tests are not very easy to follow, and I wanted to suppress the warnings from spanning the test output so I littered the thing with @pytest.mark.filterwarnings('ignore::UserWarning'), not all of which are necessary. s

@tacaswell tacaswell added this to the v3.1 milestone Dec 20, 2018
@ImportanceOfBeingErnest
Copy link
Member

I don't think I can support this. The whole point of categories is that any sting type list or array is interpreted as categorical. It's not matplotlib that should decide whether my list of strings is acceptable to be plotted as categories.
Or in other words: What's the recommended way to plot my list of samples ["2016-07-18", "2016-07-22","2016-08-05"] (which happen to be dates, yes, because sample numbers are simply the dates when I produced that sample)? Or someone's product numbers ["6632", "1322", "9744"] (sorted by market share)?

@jklymak
Copy link
Member Author
jklymak commented Dec 20, 2018

@ImportanceOfBeingErnest This PR just emits a warning in those cases, but still treats the strings as categoricals.

Yes, I admit its annoying for the knowledgable user, but the default behaviour is also quite mysterious, as per many StackOverflow questions. Worse, there is not a good way to google what the problem is, but if this warning came up, the users could google the warning and figure out what the problem was.

But I'm willing for this to not go in if its deemed too intrusive.

@jklymak jklymak force-pushed the mnt-add-warning-for-categorical branch 3 times, most recently from 24db221 to 244fd21 Compare December 21, 2018 02:48
Copy link
Member
@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would extract the whole convertible checking to a separate function. That makes the logic cleaner. Also, it simplifies the code because you can break out early with a return and don‘t need the convertible state variable.

@@ -168,6 +170,20 @@ def __init__(self, data=None):
if data is not None:
self.update(data)

@staticmethod
def _str_is_convertable(val):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_str_is_convertible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - the new version bails the check early as soon as a non-convertible string is found.

@timhoffm
Copy link
Member

I‘m +1 on the idea of the PR.

If designing from scratch, it would probably have been better not to allow categoricals in places that we expect number-like values. And then guess what it should be based on the type. The problem of loading number-likes as text from a file is very real and something inexperienced user stumble over in particular.

Diven the existing API I think it‘s still better to test and issue a warning. You can either ignore it, or filter it out. The second best approach would be to suppress the check using a kwarg or a rcParam, but I would consider this a bit over-engineered.

@ImportanceOfBeingErnest you gave two good examples why one might actually use number or date categoricals. Nevertheless, I assume that‘s less common than reading from a csv or similar.

@ImportanceOfBeingErnest
Copy link
Member

Honestly, I would seriously question the sanity of any project that asks their users to add something like

warnings.filterwarnings("ignore", message="You are using the library exactly as intended.")

to their code.

@jklymak
Copy link
Member Author
jklymak commented Dec 22, 2018

@ImportanceOfBeingErnest I understand your concern, but what is the solution, if not a warning like this?

timhoffm
timhoffm previously approved these changes Dec 22, 2018
Copy link
Member
@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structurally, this is good.

Personally, I also think, it‘s the right thing to do. However, that discussion should be brought to a conclusion before merging.

@tacaswell
Copy link
Member

I'm with @ImportanceOfBeingErnest and am very skeptical of this, giving users warnings on correct use of the library is extremely bad. I suspect that I could be convinced on the 'cast-to-float' case, but I am 👎 on the date checking.

Nevertheless, I assume that‘s less common than reading from a csv or similar.

This is as question of what the user intended to do which is something we can not deduce from the input. To quote the zen of python "In the face of ambiguity, refuse the temptation to guess.".

I am still partially of a mind that 'List[Str] that happens to be numbers' ever worked was a bug (or at least a miss-feature). I am not sure what the use case for float-encoded-as-strings that we need to be sure to support it. If the users have ended up with a list of floats-encoded-as-strings and they want to do anything but plotting it they will have to convert to numbers anyway.

xref https://gitter.im/matplotlib/matplotlib?at=5c1a74f94e0d6621ba1ae119

@jklymak
Copy link
Member Author
jklymak commented Dec 25, 2018

The use case is someone calls csvread and gets a list off strings and tries to plot it, the data is all out of order and they turn to stack overflow or github/Gitter to see what is wrong with matplotlib.

I agree that we should never have cast to float and I’m fine with strings being interpreted as categorical, but the practical issue remains.

I’m not passionate about this PR, just making sure we are on same page as to why it’s needed. There are quite a few stackoverflow requests that are basically this misunderstanding.

@timhoffm
Copy link
Member
timhoffm commented Dec 25, 2018

I am still partially of a mind that 'List[Str] that happens to be numbers' ever worked was a bug (or at least a miss-feature).

That's out of question. It's superseded by categoricals.

To quote the zen of python "In the face of ambiguity, refuse the temptation to guess.".

From a pure API perspective, there is no ambiguity and no guessing nowadays:
Lists of numbers result in a numeric axis, lists of strings result in a categorical axis. And we're not going to change this.

However, I believe that most of the time that a list of string-formatted numbers is passed, the users do not intend to do a categorical plot. They simply don't realize that they are passing strings. While indeed we would guess what the user is trying to achieve, we're not changing the resulting plot. This is just saying "Technically the input is valid, but be reminded that you are getting a categorical plot with numbers as categories" The question is, do we want to help in this situation with a warning (at the cost that it might be a false alert) or do we leave it up to the user?

@story645
Copy link
Member

@jklymak my reasoning for convertor lock is @anntzer's comment that the case where it's safest to assume that they didn't intend categorical is when they're treating the axis as if it's a datetime or float axis.

I'm net neutral on this going in. Like I kinda wish this had gone in as a heads up warning or something (a deprecation warning where the old behavior isn't supported?) back when categorical was first introduced, because as is I don't think this warning should stick around. I wonder if as annoying as it is, correcting users via stackoverflow is the documentation route to go for this.

@jklymak
Copy link
Member Author
jklymak commented Dec 26, 2018

My problem with just correcting users on stackoverflow is that there is nothing for them to google on other than “matplotlib data plotted out of order”. So it’s almost completely non-discoverable. Hence the warning message would provide something searchable.

@ImportanceOfBeingErnest
Copy link
Member

I think what needs to be done here is to improve the documentation. There are a few principles one needs to know to be able to use matplotlib, and right after (1) You need to have some basic knowledge of python, and (2) matplotlib plots lists or numpy arrays would come (3) matplotlib interpretes the data type of your input and if it is strings, it will assume them to be categorical. This last point would need to be added to the user guide.

Second, there is a FAQ around in the docs, and this should be updated with this very problem. Also I think FAQ should be one of the main points on the homepage, such that it'll look like
image.

The to-be-created FAQ should show a picture of what can happen if you use strings for plotting, such that users can directly see that it is relevant for them. The text can be written in such a way as to include possible wordings of this problem. Ideally, if you'd google "My plot doesn't look like I want it to" you'd still end up at the FAQ. Concerning possible wordings, there is a number of questions on stackoverflow (this list) that I or others have linked to one of the first occurances of this kind of question. Those can be analysed to find possible formulations that should equally be taken into account for that new FAQ entry.

Last, I think the problem is by now widely known. So if a question about it is being asked on stackoverflow, it'll be answered or redirected to existings answers within some hours.

@jklymak
Copy link
Member Author
jklymak commented Jan 7, 2019
8000

@ImportanceOfBeingErnest Thanks, and your experience dealing with this stuff on stackoverflow is very welcome.

OTOH, 57 questions marked as duplicates, with the latest from 3 days ago doesn't argue strongly to me that people are getting it. 😉 More importantly, I think this is really hard to google to get the right answer for.

newest_linked_questions_-_stack_overflow

if not isinstance(val, (str, bytes)):
raise TypeError("{val!r} is not a string".format(val=val))
if val not in self._mapping:
self._mapping[val] = next(self._counter)
# check if we can convert all strings to number or date...
if self._strs_are_convertible(data):
cbook._warn_external('using category units to plot a list of '
Copy link
Member Author
@jklymak jklymak Jan 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will turn this inot a logging call at INFO level Done

@jklymak jklymak force-pushed the mnt-add-warning-for-categorical branch from 2051814 to df85b66 Compare January 9, 2019 05:11
@jklymak jklymak changed the title MNT: add a warning if a categorical string array is all convertible MNT: add a logging call if a categorical string array is all convertible Jan 10, 2019
@jklymak
Copy link
Member Author
jklymak commented Jan 16, 2019

@story645 is your block still active now this has been changed from warning to logging?

See also #13129 for a simpler way to help users turn on logging....

if not isinstance(val, (str, bytes)):
raise TypeError("{val!r} is not a string".format(val=val))
if val not in self._mapping:
self._mapping[val] = next(self._counter)
# check if we can convert all strings to number or date...
if self._strs_are_convertible(data):
_log.info('using category units to plot a list of '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use self.__mapping.keys() instead of data so you at least don't have to double iterate over everything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keys will include keys from previous calls to the converter, which may indeed be valid. So thats different than the check here, which tests only the most recent data that are sent to the converter. Is that what you want?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I'm just really not liking the second loop, but a second call to unique under the hood would be doing the same thing. :/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, someone told me to take it out of the loop, but... new version has it in the loop and only calls the check so long as the strings are convertable to numbers. (i.e. stops checking as soon as a non-number is found).

@jklymak jklymak force-pushed the mnt-add-warning-for-categorical branch from df85b66 to 0ec990a Compare January 18, 2019 22:05
@jklymak jklymak force-pushed the mnt-add-warning-for-categorical branch from 0ec990a to 5ddbb86 Compare January 18, 2019 22:05
@jklymak jklymak force-pushed the mnt-add-warning-for-categorical branch from 5ddbb86 to 17a7c54 Compare January 18, 2019 22:08
@jklymak
Copy link
Member Author
jklymak commented Jan 19, 2019

@timhoffm are you ok with this the way it is now? Marking your review as stale...

@dstansby
Copy link
Member

Trvais failure is un-related.

@dstansby dstansby merged commit df10f3e into matplotlib:master Jan 20, 2019
@@ -189,11 +210,22 @@ def update(self, data):
"""
data = np.atleast_1d(np.array(data, dtype=object))

# check if convertable to number:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arriving after the battle, but this should have been named "convertible", not "convertable"...

@NelleV
Copy link
Member
NelleV commented Jan 21, 2019

I'm surprised this got merged without explicit approval from Tom considering he raised concerns and is the BDFL. Has this issue been discussed more thoroughly elsewhere in a none public channel?

@jklymak
Copy link
Member Author
jklymak commented Jan 21, 2019

ping @tacaswell though we did discuss on a weekly call, and hence we changed from warning to a logging call.

@jklymak
Copy link
Member Author
jklymak commented Jan 21, 2019

Note also #13129 was opened in response to that same call..

@treszkai
Copy link

For the record, these log messages can be suppressed as follows:

import logging
logging.getLogger('matplotlib.category').setLevel(logging.WARNING)

As of now this does not affect any other log messages. (In contrast, plt.set_loglevel('WARNING') turns off all INFO-level messages from Matplotlib.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants
0