charset=“wtf-8”

69

charset=“wtf-8” programming wtf-8.xn--stpie-k0a81a.com
via bitfield 2 days ago | 42 comments

1. 21
  
  freddyb 2 days ago | link
  
  Not to be confused with https://simonsapin.github.io/wtf-8/.
  1. 9
    
    fanf 2 days ago | link
    
    Nor with https://fanf2.user.srcf.net/hermes/doc/qsmtp/draft-fanf-wtf8.html
  2. gerikson 26 hours ago | link
    
    Discussed here: https://lobste.rs/s/kcuhls
2. 19
  
  mc680x0 2 days ago | link
  
  I have one extended-latin character in my name. It’s ü.
  
  It is frequently parsed as Ã¼ when labels are printed by Australia Post.
  
  Alternatively, websites that want my name tell me it’s not a letter.
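
A minimal sketch of how this kind of garbling can arise, assuming the label software reads UTF-8 bytes as Windows-1252 (the real pipeline is unknown; the variable names are illustrative only):

```python
# Assumption: the text was stored as UTF-8 but decoded as Windows-1252.
name = "ü"
utf8_bytes = name.encode("utf-8")        # two bytes: 0xC3 0xBC
garbled = utf8_bytes.decode("cp1252")    # each byte becomes its own character
print(garbled)

# The damage is reversible as long as no byte was lost along the way.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == name
```

The same round trip applied to a whole name shows why only the non-ASCII letter is affected: ASCII bytes mean the same thing in both encodings.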
  1. 8
    
    sny 2 days ago | link
    
    I have an ö in my last name, and I’ve often received packages where it was replaced with o, oe or even dropped completely. A few months ago I got a package that replaced it with ÷. I’m still curious as to what happened there…
    1. 27
      
      jmillikin edited 2 days ago | link
      
      A few months ago I got a package that replaced [ö] with ÷. I’m still curious as to what happened there…
      
      This is supposition, but ÷ is U+00F7, which was the byte assigned to œ in early drafts of Windows-28605 (aka ISO 8859-15).
      
      I can imagine a path from ö -> oe -> œ -> (encode to iso-8859-15-draft) -> 0xF7 -> (decode from iso-8859-15) -> ÷ existing in some ancient pre-Unicode mainframe software.
      
      It could also just be a bitflip, ö is 0b11110110 and ÷ is 0b11110111.
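
Both hypotheses are easy to check. A small sketch, using Latin-1 to get the single-byte values and Python's `iso8859-15` codec for the final decoding step (the draft encoding itself is not available as a codec, so only the last hop is shown):

```python
o_umlaut = "ö".encode("latin-1")[0]   # 0xF6
division = "÷".encode("latin-1")[0]   # 0xF7

# XOR leaves a 1 in every bit position where the two bytes differ.
diff = o_umlaut ^ division
flipped_bits = bin(diff).count("1")
print(f"{o_umlaut:#010b} vs {division:#010b}: {flipped_bits} bit differs")

# Last hop of the encoding theory: byte 0xF7 read as ISO 8859-15.
decoded = bytes([0xF7]).decode("iso8859-15")
print(decoded)
```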
      1. 24
        
        enpo 2 days ago | link
        
        We should rule out bit flip: @sny must order another package to test :)
      2. 9
        
        carlana 2 days ago | link
        
        I also wouldn’t rule out OCR error.
    2. ekuber 22 hours ago | link
      
      My last name is Küber. I was traveling back to the US through Germany once, where my passport said “Küber”, my plane ticket said “Kueber”, and my US immigration documents said “Kuber”. It took significant arguing to convince them to allow me to board my flight.
    3. axelsvensson 20 hours ago | link
      
      Throwback to the early 2000s! I’ve seen this often. ö is encoded as 0xf6 in latin-1 (and is U+00F6 in Unicode), while ÷ is encoded as 0xf6 in CP437, CP850, and perhaps more. Perhaps they printed the label on an old system where the code page is determined by the printer port mode.
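
The code page theory can be checked by decoding the single byte 0xF6 under each of the code pages named above. This is only a sketch of the byte-level mismatch, not of any particular printer setup:

```python
byte = bytes([0xF6])

# Decode the same byte under each single-byte code page.
decoded = {codec: byte.decode(codec) for codec in ("latin-1", "cp437", "cp850")}
for codec, char in decoded.items():
    print(f"{codec}: {char} (U+{ord(char):04X})")
```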
  2. ekuber 23 hours ago | link
    
    I do so too. I got a kick when I lived in the UK and got a UKIP mailer addressed to Mr. KÃ¼ber trying to convince me that immigrants were a problem.
3. 14
  
  TotallyAsymetricSymbol 2 days ago | link
  
  :) This reminds me of when KLM issued an airline ticket for the name Jan StÃ¼£ instead of Jan Stępień.
  1. cpurdy 28 hours ago | link
    
    … a true quarter-pounder!
4. 13
  
  skip 2 days ago | link
  
  Although my legal first name is entirely ASCII, it’s nearly 15 characters long and has a space in it. I have filled out multiple forms from the government (both digital and HTML) that were unable to accommodate this. 1Password and a ton of other software that tries to parse my name tend to get this wrong, usually interpreting my “second first name” as my middle name, while ignoring my actual middle name entirely.
  
  This sort of thing always gets me thinking of the “clash” between culture and computing, and how those things can be at odds with each other.
5. 11
  
  majaha 2 days ago | link
  
  Ooh dear, lobste.rs doesn’t display his name correctly in the URL either! That’s a bit cutting.
  
  I never did like punycode. Seems like such a lazy, second-class-citizen-making kludge.
  1. 11
    
    pushcx Sysop 2 days ago | link
    
    We have an open issue available for handling punycode + deduping better.
  2. 7
    
    fanf 45 hours ago | link
    
    I agree that punycode is a second-class-citizen-making kludge, but I don’t think it’s lazy.
    
    Serious work on internationalized domain names started shortly after the work on IPv6 support for the DNS. The plan was to support easy renumbering of IPv6 networks by adding extra indirection into the DNS: a lot of complicated machinery involving A6 and DNAME records and bitstring labels. Eventually (around the time IDN work was getting started) it became clear that all this IPv6 stuff could not be deployed because it required upgrading all software across the entire Internet first. The plan was discarded and we continued to use the stopgap AAAA records instead. A lot of the problems were related to bitstring labels and pervasive assumptions about the syntax of DNS names encoded into lots of software operating at most levels of the stack.
    
    Hence IDN became IDNA, internationalization of domain names in applications, that is, i18n without trying to upgrade all the lower-level non-application software that makes assumptions about the DNS. Based on the lessons learned from bitstring labels, the belief was that it would not be possible to deploy any i18n in a reasonable amount of time without greatly reducing the amount of software that needed to know about it.
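
To illustrate the "in applications" part: the conversion between the Unicode name and the ASCII form happens in application code, and the DNS only ever sees ASCII labels. A minimal sketch using Python's built-in `idna` codec, which implements the older IDNA 2003 rules, applied to the hostname of the linked story:

```python
# ASCII form as it appears in the story URL.
ascii_host = b"wtf-8.xn--stpie-k0a81a.com"

# The application decodes each "xn--" label back to Unicode for display.
unicode_host = ascii_host.decode("idna")
print(unicode_host)

# Going the other way, the application encodes before any DNS lookup.
round_trip = unicode_host.encode("idna")
assert round_trip == ascii_host
```

Only the label that contains non-ASCII characters gets the `xn--` prefix; `wtf-8` and `com` pass through unchanged.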
  3. cpurdy 27 hours ago | link
    
    The problem isn’t Punycode itself (which is a horrible encoding method). The problem is that there are at least 60 different character rulesets/encoding methods in HTTP/HTML alone, and that’s just for text.
6. 8
  
  timvisee 2 days ago | link
  
  Many websites don’t support a + in email addresses either, and only support some custom subset of characters instead. I’ve already collected more than 25 cases personally, which is extremely annoying.
  1. 11
    
    zk 45 hours ago | link
    
    I had one that allowed + in the email for signup, but the password reset form did not accept emails with a +.
    1. abyss 25 hours ago | link
      
      I’ve seen this same problem with unsubscribe forms. Absolutely ridiculous. Straight to the spam bin.
7. 5
  
  liquidev 2 days ago | link
  
  I can’t even begin to imagine how services that don’t properly support diacritics process Chinese, Japanese, Korean, Arabic, Thai, …, names.
  
  I’ve never flown before, but what happens if you’re being checked at an airport and they cannot read your name? Do passports from countries speaking non-Latin-based languages have romanisations of names written on them?
  1. 12
    
    hsivonen 2 days ago | link
    
    The optically-readable part at the bottom of the photo page of an IATA-compliant travel document has an ASCII-only form conforming to IATA constraints. This means that even if the name is in the Latin script, there’s an IATA ASCII version. For example Finnish ä becomes AE and ö becomes OE even though this German convention does not make sense for Finnish.
    1. 2
      
      sknebel 2 days ago | link
      
      Is that really a generally applied rule? I thought that was determined by the issuer, and some countries afaik even let you specify what you want it to be (within reasonable variants of writing your name obviously)?
      1. 6
        
        jmillikin 2 days ago | link
        
        ICAO recommends (but does not require) that Ä be converted to A or AE (and Ö to O or OE) in the machine-readable part of a travel document. Maybe the rule is stricter for EU passports?
        
        Relevant document of ICAO requirements and recommendations for names in travel documents: https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf
        
        5
        
        hsivonen 46 hours ago | link
        
        Sorry about getting IATA and ICAO mixed up above.
        
        As the spec shows, the transliteration is algorithmic even if there are options for some characters. The options aren’t available to individuals: Finland does AE for ä without asking the passport applicant.
        
        Not sure how much of a mess it would be to change the transliteration. (IIRC, at some point Russian passports switched from French-based transliteration to English-based transliteration. There must be some failure stories about that.)
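
A rough sketch of what "algorithmic" means here. The table below is hypothetical and contains only the two mappings mentioned in this thread (Ä to AE, Ö to OE); the function name and the sample name are made up for illustration, and the real ICAO tables cover far more characters and offer alternatives for some of them:

```python
# Hypothetical table: only the two mappings discussed in this thread.
MRZ_MAP = {"Ä": "AE", "Ö": "OE"}

def to_mrz_name(name: str) -> str:
    """Transliterate a name into A-Z plus the '<' filler used in the MRZ."""
    out = []
    for ch in name.upper():
        if ch in MRZ_MAP:
            out.append(MRZ_MAP[ch])
        elif "A" <= ch <= "Z":
            out.append(ch)
        elif ch == " ":
            out.append("<")  # spaces become the filler character
        else:
            # Anything outside this tiny table is rejected, not guessed.
            raise ValueError(f"no transliteration defined for {ch!r}")
    return "".join(out)

print(to_mrz_name("Väinö"))
```

Because the issuing country picks one option per character, the same input always yields the same machine-readable form, which is the point made above about applicants not getting a choice.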
        
        sknebel 22 hours ago | link
        
        If I remember right, Japan was an example of a country that does let the applicant specify. But of course there it’s a much bigger “gap” being bridged by the transliteration.
  2. 8
    
    jmillikin 2 days ago | link
    
    Do passports from countries speaking non Latin-based languages have romanisations of names written on them?
    
    Yes, the important fields (name, country of citizenship, etc) are in Latin characters. A search for 「日本国旅券見本」 (Japanese passport sample) yields https://www.city.kishiwada.osaka.jp/uploaded/image/56222.png, which is detailed enough to get the gist.
    
    Depending on the language of origin there can be some subtleties in how names are rendered into Latin – either the government has an official romanization scheme, or they let the preferred romanization be written into the passport application (within reason).
    1. 4
      
      0x2ba22e11 2 days ago | link
      
      Thanks, that’s informative.
      
      I find it a little surprising that they don’t have both the romanisation and the local writing system.
      1. 5
        
        gerikson 2 days ago | link
        
        It would not surprise me if there’s a limitation to freakin’ Baudot code considering the age of some air traffic systems…
      2. 4
        
        stephenr 2 days ago | link
        
        My wife and son’s Thai passports have their respective names in both languages, but other details in English only.
        
        Their Thai ID cards have more details in both languages, but they also cram much more information in, and I believe the Thai ID system revolves much more around just knowing the ID number than relying on being able to read the details on the card quickly.
    2. 1
      
      gepardo 44 hours ago | link
      
      It becomes much more interesting when a country has more than one official language, and the Latin transliteration differs depending on which language is used for it.
      1. 3
        
        technomancy 41 hours ago | link
        
        Even within a single language you can still be stuck with multiple romanization schemes; for example there are 4 or 5 in regular use with Thai, and they are all roughly equally decent if you stick with one consistently, except the main one used for road signs, which frequently renders multiple different Thai sounds with the same Latin letters.
  3. 4
    
    gerikson 2 days ago | link
    
    At least for Russian, the passport transliteration is a bit of a mess
    
    https://en.wikipedia.org/wiki/Romanization_of_Russian#Transliteration_of_names_on_Russian_passports
8. 5
  
  owl 2 days ago | link
  
  I’m considering changing my last name so that it’s ASCII-compatible, to deal with all the tech.
  
  I wish I could just translate my name when needed instead, then it would easily adapt to any language and associated text encoding. And it would be easy to pronounce for speakers of the target language, too.
  1. 14
    
    ahelwer 2 days ago | link
    
    Growing up in Canada there was a series of books people liked called Macdonald Hall by Gordon Korman, about hijinks at a boarding school for boys. One book in particular is relevant here: The War with Mr. Wizzle, published in 1982, about the school adopting a new (and very buggy) computer system. The system had a length limit for people’s names, and one student (I recall with the name of Hankenschleimer or similar) had a name too long for the computer to accept. The sleazy computer salesperson shortened it to Hank, and then took the liberty of advising the student to change their actual name to accommodate this brave new world. At the time I thought this was quite ridiculous. But here, four decades later, we have people considering a similar choice!
    1. 6
      
      stig 44 hours ago | link
      
      My mother, at 70, changed her name to drop two of her middle names for similar reasons. Years before she failed to book plane tickets online as the form asked for her name “exactly as in her passport” (52 characters) but only accepted 50 characters. She never wanted to experience that again.
      1. 3
        
        ahelwer 44 hours ago | link
        
        Oh yeah I also have two long middle names and don’t provide them when traveling, for the same reason. Otherwise it’s a guaranteed call to the kiosk for an ID check.
9. 3
  
  eigil 2 days ago | link
  
  I love that this page has an RSS feed.
10. 1
  
  deepchasm 45 hours ago | link
  
  Perhaps Unicode, a twenty year old hack, is not the solution?
  1. 10
    
    gepardo 44 hours ago | link
    
    Why is it a hack and what would the solution look like?
11. franta 15 hours ago | link
  
  I have seen so many packages with my name and address corrupted… some were definitely lost or returned to the sender, some were finally delivered to me, just late… One would expect that in the 21st century, Unicode would be supported everywhere… but it is not. Such addresses (for example) go through so many systems and databases (e-shop, delivery company, label printing software, drivers, printer firmware…) that the probability of text corruption is too high. So when ordering something from abroad, I usually provide my name and address in ASCII.
  
  In our country, Unicode works reliably, but when ordering something from English-speaking countries or Germany, there is often a problem.
  
  I would recommend that sellers store the Unicode address provided by the customer in UTF-8 in the database, but convert it to ASCII for delivery (unless they are sure that Unicode will go all the way through undamaged).
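
One possible way to do such a conversion, as a sketch only: decompose each character and drop the combining marks. The function name `ascii_fallback` is made up for this example, and the approach is lossy. Letters with no decomposition (such as Ł) fall through to `?`, so a real shop would likely want a hand-written table for those.

```python
import unicodedata

def ascii_fallback(text: str) -> str:
    """Best-effort ASCII version of text for label printing (lossy)."""
    # NFKD splits e.g. "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Whatever is still non-ASCII is replaced with "?" instead of crashing.
    return stripped.encode("ascii", "replace").decode("ascii")

print(ascii_fallback("Küber"))
print(ascii_fallback("Stępień"))
```

The original UTF-8 value stays in the database; only the copy sent to the label printer is flattened.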