[go: up one dir, main page]

Page MenuHomePhabricator

Remove all non-trivial parsers in jquery.tablesorter
Open, MediumPublic

Description

$.tablesorter does a lot of guesswork trying to infer what does the content in a table mean. It sometimes work, mostly for English; however, trying to (for example) guess if a column contains dates just plain doesn't work for some, if not most, languages.

Apart from numbers, $.tablesorter tries to guess dates and times in four distinct formats, currencies, IP addresses, and for some reasons URLs.

I'd kill everything but detecting and sorting numeric values, and even this I'd do differently (using "natural sorting" - identifying runs of digits and non-digits and sorting them separately). [http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html]

For sorting any remotely complicated data one should use a data-sort-value attribute on the <td> (which will be used as the sortkey for given cell), or a template that wraps it.

Bugs probably caused by or related to this misfeature (I didn't double-check):

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:39 AM
bzimport set Reference to bz45161.
bzimport added a subscriber: Unknown Object (MLST).

That's not particularly user-friendly. If a column contains dates that can be parsed, we shouldn't be forcing users to add a data-sort-value for every cell to get it to sort right.

Maybe the guessing could be toned down or removed though. I haven't looked to see how well/poorly it works.

Let's go through all the parsers, then.

  • 'text', obviously. It also has some rudimentary collation capabilities (which have to be configured per-wiki), and while extremely limited this is actually useful and used e.g. on pl.wikipedia. This one is good.
  • 'IPAddress'. It actually only supports IPv4 addresses. If natural sorting was implemented, it would be entirely 100% useless. Kill it.
  • 'currency'. By $.tablesorter definitions this actually means pounds, dollars, euros and yens; all other currency signs are not detected and treated as 'text'. It would be similarly superseded by natural sorting. Kill it.
  • 'url'. It just removes ftp:, file:, http: and https: protocols and then sorts as text. I have absolutely no idea why would anybody consider this useful. Kill it.
  • 'isoDate'. That is YYYY-M(M)-D(D), where (X) means that this part is optional. Would be similarly superseded by natural sorting, kill it.
  • 'usLongDate'. It uses a 114-character regex I'm not going to decipher right now, but the regex ends with (AM|PM). English-specific, completely useless for other languages, kill it.
  • 'date'. It uses some generated regexes based on wgMonthNames and wgMonthNamesShort. Assumes that all languages use a XXX[,.'-/]XXX[,.'-/]YY(YY) date formats, where X can be either D or M depending on wgDefaultDateFormat and wgContentLanguage==en. I have no idea why exactly those separators were used. Probably doesn't work for most languages; I didn't test carefully, but bug 42607 is due to this malfunctioning. Kill it.
  • 'time'. Matches a HH:MM time with optional AM/PM. Again, would be obsoleted by natural sorting, let's kill it.
  • 'number'. Would be obsoleted by natural sorting, let's kill it.

Is that enough justification?

You seem to be making some major assumptions on how this natural sorting will work, particularly with respect to punctuation, that may or may not be warranted.

(In reply to comment #2)

  • 'IPAddress'. It actually only supports IPv4 addresses. If natural sorting

was
implemented, it would be entirely 100% useless. Kill it.

Should be improved to support IPv6, which by no stretch will be handled correctly by "natural sorting".

I also note that "10.123.234.255" in locales that use '.' for grouping by thousands looks like the number 10123234225. While "255.0.0.1" does not look like any number, so it would probably be interpreted as 255 followed by some other junk, and be wrongly sorted before 10.123.234.255.

And in locales that use '.' for a decimal separator, is "10.23.45.6" the numbers "10.23" and "45.6" separated by a period, or is it four integers? This makes a difference if the table also contains "10.3.0.0".

  • 'currency'. By $.tablesorter definitions this actually means pounds,

dollars,
euros and yens; all other currency signs are not detected and treated as
'text'. It would be similarly superseded by natural sorting. Kill it.

If it's actually the case that your natural sorting handles it.

  • 'url'. It just removes ftp:, file:, http: and https: protocols and then

sorts
as text. I have absolutely no idea why would anybody consider this useful.
Kill
it.

That does seem like an odd choice. Perhaps the implementors should be asked for a reason behind it.

  • 'isoDate'. That is YYYY-M(M)-D(D), where (X) means that this part is

optional. Would be similarly superseded by natural sorting, kill it.

It also handles YYYY/MM/DD.

  • 'usLongDate'. It uses a 114-character regex I'm not going to decipher right

now, but the regex ends with (AM|PM). English-specific, completely useless
for
other languages, kill it.

  • 'date'. It uses some generated regexes based on wgMonthNames and

wgMonthNamesShort. Assumes that all languages use a
XXX[,.'-/]XXX[,.'-/]YY(YY)
date formats, where X can be either D or M depending on wgDefaultDateFormat
and
wgContentLanguage==en. I have no idea why exactly those separators were used.
Probably doesn't work for most languages; I didn't test carefully, but bug
42607 is due to this malfunctioning. Kill it.

*Lots* of assumptions there.

What we probably need are specific parsers for various date formats, and/or a way for a wiki to call $.tablesorter.addDateParser( 'name', 'format-string' ) to easily add additional date parsers for their language, and/or a general parser that reads the format string from a "data-date-format" property on the header cell.

  • 'time'. Matches a HH:MM time with optional AM/PM. Again, would be obsoleted

by natural sorting, let's kill it.

In what world would your natural sorting not put 2AM after 1PM? Or is it smart enough to decide it's looking at a time, as well?

  • 'number'. Would be obsoleted by natural sorting, let's kill it.

Your natural sorting is smart enough to handle numbers like 6.022e23 correctly? Interesting.

Is that enough justification?

Not really.

All in all, it seems that the existing parsers started out as relatively English-centric and were then adjusted to try to address other situations. But getting rid of all that to rely entirely on a poorly-specified "natural sort" algorithm doesn't seem like much of an improvement.

Krinkle set Security to None.
Krinkle removed a subscriber: Unknown Object (MLST).

Support of local date and number formats seems like a reasonable goal for a software that prides itself on its i18n capabilities. (Time too, but sorting times is not something that happens much.) And natural sort would not be any help - you can't tell which of 123 and 4,567 is larger without knowing the locale.

Krinkle renamed this task from Kill all non-trivial parsers in $.tablesorter to Remove all non-trivial parsers in jquery.tablesorter.Jan 10 2019, 7:05 PM

Tentatively proposing for 1.33, hoping it'll save some overhead during page initialisation since this is a fairly widely (pre)loaded module. Migration is expected to be easy (it degrades gracefully to other sorting mechanism).

This is a "it just works" feature to most people.

What migration path is proposed? Is it really that "easy"?

As I understand it, it will continue to "just work".

Is there a specific wiki article with a sortable table on it you fear would no longer sort in a way that makes sense to readers?

Anomie's objections above seem to be rather... uh... interesting.

Where is the specification for 'natural' order? Is that what the code does without the "non-trivial" paths?

Someone should add some logging to check how much each filter is used though.. As most table sorting guesses column type, it's hard to figure out how much they are in use right now, making it difficult to assess impact.

Personally I have no problem with deprecating usLongDate, IPAddress, currency and URL.

hoping it'll save some overhead during page initialisation

In bytes or performance ? Because I doubt either will be a lot honestly.. Most overhead is in analysing the layout of the table and virtualising that I suspect. If you completely remove ALL guess work, then maybe, but that seems pretty involved to me and high risk, unless you invest a ton of time in community shepherding...

I'd be interested in reevaluating the performance of localeCompare, and it's capabilities to use collations and numeric sorting (last time i looked, performance was bad and actual functioning locale and numeric support was abysmal).

Change 837760 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] jquery.tablesorter: Remove 'url' parser for transforming sorted text

https://gerrit.wikimedia.org/r/837760

Change 837764 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] jquery.tablesorter: Remove custom 'isoDate' parser for text sorting

https://gerrit.wikimedia.org/r/837764

Change 837760 merged by jenkins-bot:

[mediawiki/core@master] jquery.tablesorter: Remove 'url' parser for transforming sorted text

https://gerrit.wikimedia.org/r/837760

Change 837764 merged by jenkins-bot:

[mediawiki/core@master] jquery.tablesorter: Remove custom 'isoDate' parser for text sorting

https://gerrit.wikimedia.org/r/837764