V8: Replace icudtl.dat with full ICU data file 64.2 #11046

Simran-B · 2020-02-03T18:36:20Z

Taken from https://github.com/unicode-org/icu/releases/tag/release-64-2
(icu4c-64_2-src.zip/icu/source/data/in/icudt64l.dat)

This is for little endian systems only (l). If needed, it seems like icupkg -tb could be used to convert it to big endianess.

Start server with new data folder and --default-language is, then run the following AQL query:

RETURN CONCAT(
    FOR letter IN SPLIT("YKÁÉVÝLIÓFEÖPNÐRXGÍJÚOSÞBDUHÆTMA", "") /* random order */
    SORT letter
    RETURN letter /* expected is:    AÁBDÐEÉFGHIÍJKLMNOÓPRSTUÚVXYÝÞÆÖ
                     wrong would be: AÁÆBDÐEÉFGHIÍJKLMNOÓÖPRSTUÚVXYÝÞ */
)

With the icudtl.dat replaced in 3rdParty\V8\v7.9.317\third_party\icu\common (10 MB -> 26 MB) I get the correct result. Might need to add specific tests to ensure that it's actually working and to raise flags if we accidentally overwrite the data file with a subset version some time later. Could test Icelandic and German phone book sorting de-DE-u-co-phonebk.

Simran-B · 2020-02-04T07:11:24Z

Something isn't right with ArangoSearch... I created a collection coll and ran:

FOR letter IN SPLIT("YKÁÉVÝLIÓFEÖPNÐRXGÍJÚOSÞBDUHÆTMA", "")
INSERT { letter } INTO coll

I also created two Analyzers:

var analyzers = require("@arangodb/analyzers");
analyzers.save("text_is","text",{ locale: "is.utf-8", stopwords:[] },["frequency","norm","position"]);
analyzers.save("text_is2","text",{ locale:"is.utf-8", accent: true, stopwords:[]},["frequency","norm","position"])

And a View:

  "links": {
    "coll": {
      "fields": {
        "letter": {
          "analyzers": [
            "identity",
            "text_en",
            "text_is",
            "text_is2"
          ]
        }
      },

Sorting the collection or View is okay:

FOR doc IN view // or: IN coll
SORT doc.letter
RETURN doc

Both return the documents in the expected order if the server was started with --default-language is.

However, limiting it to a sub-range of letters doesn't work correctly in the View case:

FOR doc IN coll
FILTER doc.letter >= "X" AND doc.letter <= "Þ"
SORT doc.letter
RETURN doc.letter // [ "X", "Y", "Ý", "Þ" ]

FOR doc IN coll
SEARCH ANALYZER(doc.letter >= "X" AND doc.letter <= "Þ", <analyzer>)
SORT doc.letter
RETURN doc.letter
/* different Analyzers:
identity = [ "Á", "Ð", "É", "Í", "Ó", "Ú", "Y", "Ý", "Þ", "Æ", "Ö"]
text_en = [ "A", "Á", "B", "D", "E", "É", "F", "G", "H", "I", "Í", "J", "K", "L", "M", "N", "O", "Ó", "P", "R", "S", "T", "U", "Ú", "V", "X", "Y", "Ý", "Ö" ]
text_is = [ "A", "Á", "B", "D", "E", "É", "F", "G", "H", "I", "Í", "J", "K", "L", "M", "N", "O", "Ó", "P", "R", "S", "T", "U", "Ú", "V", "X", "Y", "Ý", "Ö" ]
text_is2 = [ "A", "B", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "R", "S", "T", "U", "V", "X", "Y" ]
*/

BTW. it is allowed to do SORT ANALYZER(doc.letter, "text_is"), but I'm not sure if it's supposed to do anything nor if it actually does.

dothebart · 2020-02-04T11:59:38Z

http://jenkins01.arangodb.biz:8080/view/PR/job/arangodb-matrix-pr/8425/

Simran-B · 2020-02-10T17:34:10Z

On a related note, can we get rid of this warning that shows up on every server startup?

WARNING [581d1] failed to locate 'icudtl.dat' at 'C:\Windows\icudtl.dat'

We ship our own data file and it's uncommon to have one in your system I suppose. As long as either exists, no warning should be raised IMO. An INFO log would be okay though, to tell the user which icudtl.dat it picked, should that be of relevance (or what was the reason to add above warning in the first place?)

dothebart · 2020-02-10T19:11:10Z

huh? do you have an environment variable configured that points there? We don't search there by default.

Simran-B · 2020-02-12T13:56:15Z

@dothebart Heureka!

Looks like I created this env var? Probably because it was mentioned in https://www.arangodb.com/docs/stable/installation-compiling-windows.html#building-arangodb. I do have icudt54l.dat in C:\Windows, but recent ArangoDB versions will be looking for other versions.

Is the information from the docs still relevant, or does a locally compiled ArangoDB find the ICU data file from the repository anyway?

yes, this used to be a part of our original windows debug installation instructions.

Taken from https://github.com/unicode-org/icu/releases/tag/release-64-2 (icu4c-64_2-src.zip/icu/source/data/in/icudt64l.dat

Simran-B · 2020-03-19T12:02:41Z

@dothebart According to Andrey it is an (internally) known issue that the SEARCH operation does not take --default-language nor the Analyzer locale into account for the relevant operators: arangodb/docs#374 (review)

For SORT as well as regular FILTER it appears to work just fine, so nothing should stop us from merging the full ICU data file, as the ArangoSearch related issues existed before.

srl295 · 2020-03-24T19:09:25Z

Hi! You might want to see the progress on nodejs/full-icu-npm#36 — I have not started publishing the data binaries as described in ICU-20600 yet. How far back would you need them?

Also note that for Node.js in https://github.com/nodejs/node/pull/29522/files#diff-947d0de2be4d9fa41699237a83d77ac5 I used a .bz2 compression of the data file. Only 9.3Mb

Also by the way, ICU has a data build tool, https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md that can be used to customize a subset of the data. If you only need collation (or collation plus something) you could make a more compact data file.

Simran-B · 2020-03-25T15:47:33Z

Hi @srl295,
thanks for getting in touch and sharing some links!

TL;DR - we don't need anything older than ICU v64.2 (and we already have that).
An easy way to get the full ICU data file in the future is all we want.

I don't know all the details but here some background information:

V8 is embedded into our system, but we rarely upgrade it:
- …
- v4.1.0.27 in v2.5 (Mar 2015)
- v4.3.61 in v2.7 (Oct 2015)
- v5.0.71.39 in v3.0 (Jul 2016)
- v5.7.492.77 in v3.2 (Jul 2017)
- v7.1.302.28 in v3.5 (Aug 2019)
- v7.9.317 in v3.7 (~May 2020)
Upgrading is a manual process and mostly about dealing with API breaking changes
We don't use the latest available version on upgrades, but whatever Node.js uses, because of their existing build scripts etc. AFAIK
We use a single ICU data file for multiple components (V8, IResearch/ArangoSearch, server core)
A user asked us about support for Icelandic a while ago. It turned out that our ICU data file was only a subset. No code changes necessary to support the language, just ship the full file.
We would like to ship full ICU but don't want to compile it ourselves.
The size of the ICU data file does not really matter to us, unless it would be >50 MB or so.
Node.js 13.0.0 (released 2019-10-22) was the first version to default to full-icu and used V8 v7.8. Our v7.9 has the smaller 10 MB ICU data file however, not sure why.
We want to replace it by the full 25 MB file (this PR). It's fine for us to download, extract and replace the file whenever we upgrade V8 by hand, but it would be convenient if it was the full version out of the box.
We target x86 64-bit little-endian systems. A big-endian data file isn't needed at the moment.

srl295 · 2020-03-26T15:18:05Z

Then you'll find the full little endian data file within icu-src .tgz or .zip in the ICU download.

srl295 · 2020-04-10T18:56:57Z

please note ICU67rc has the new data file zips https://github.com/unicode-org/icu/releases/tag/release-67-rc

Simran-B · 2020-05-06T15:44:59Z

@KVS85 Meanwhile, I added several of these warnings to the docs:

While the larger ICU data file adds support for more languages, the lexicographic order is still not taken into account. There is no regression because of this PR, so it should be safe to merge.

…ture/full-icu-64-2

KVS85 · 2020-05-13T16:06:20Z

http://jenkins01.arangodb.biz:8080/view/PR/job/arangodb-matrix-pr/9918/

KVS85 · 2020-05-13T18:17:02Z

http://jenkins01.arangodb.biz:8080/job/arangodb-matrix-pr-linux/11669/

srl295 · 2020-05-13T18:44:50Z

@KVS85 FWIW are the Jenkins URLs supposed to be broken? I get host not found…

Simran-B · 2020-05-13T23:16:16Z

@srl295 The links work within our company network only. Our Jenkins is not public.

Simran-B · 2020-05-14T09:52:32Z

@srl295 I added instructions for future updates of the ICU data file. The separate distribution of just the data file is really handy, thanks a lot!

Simran-B · 2020-05-14T11:36:09Z

In the previous PR run, the following error occurred:

* Test "catch"
    [FAILED]  basics

      "test" failed: exit code was 1

    [FAILED]  IResearchQueryLateMaterializationTest

      "test" failed: /work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true

A test run using the latest devel state did not show this error:

http://jenkins01.arangodb.biz:8080/job/arangodb-matrix-pr-linux/11687/

Merged devel into this PR once more. Let's see if the problem persists. If it does, then the ICU data file is to blame - even though I can't think of a good reason why that should be the case:

http://jenkins01.arangodb.biz:8080/job/arangodb-matrix-pr-linux/11689/

Simran-B · 2020-05-14T13:15:11Z

It seems to be caused by the ICU data file after:

* Test "catch"
    [FAILED]  basics

      "test" failed: exit code was 1

    [FAILED]  IResearchQueryLateMaterializationTest

      "test" failed: /work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true

    [FAILED]  IResearchQueryNoMaterializationTest

      "test" failed: /work/ArangoDB/tests/IResearch/IResearchQueryNoMaterialization-test.cpp:240
Value of: memcmp(expectedValue->getCharPtr(), resStr, length) == 0
  Actual: false
Expected: true

   Suites failed: 3 Tests Failed: 0

The tests IResearchQueryNoMaterialization-test.cpp and IResearchQueryLateMaterialization-test.cpp don't seem to deal with non-ASCII characters, so I'm puzzled about these failures.

Simran-B · 2020-05-14T14:34:59Z

With the ICU file applied on top of #11603 it still produces above errors 😕

srl295 · 2020-05-14T16:46:54Z

Welcome. You’ve gotten to use the file before I have :) you’re user #1 of it I think.

dothebart · 2020-05-15T12:15:36Z

Our jenkins is only reachable internally or via VPN. Hence if there are noteworthy results, we copy them here.

Simran-B · 2020-12-11T15:08:07Z

It seems like the official ICU data file and what we currently use is somehow incompatible, and it's unclear if we want to put in the effort of building Chromium with tweaks to get a compatible, complete file.

https://arangodb.atlassian.net/browse/AR-115 (internal)

srl295 · 2020-12-11T15:33:07Z

What kind of incompatibilities ?

…

On Fri, Dec 11, 2020 at 9:08 AM Simran ***@***.***> wrote: It seems like the official ICU data file and what we currently use is somehow incompatible, and it's unclear if we want to put in the effort of building Chromium with tweaks to get a compatible, complete file. https://arangodb.atlassian.net/browse/AR-115 (internal) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#11046 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGQZM4766REBOZFFSSUMMDSUIYWRANCNFSM4KPJ7XVQ> .

Simran-B · 2020-12-11T22:23:35Z

@srl295 If I substitute our Chromium/V8 ICU data file (which is a subset) with the matching version of the official full ICU data file, then suddenly bizarre errors occur in some of our unit tests, which don't even deal with non-ASCII characters. Furthermore, some collations seem to be broken. I can't remember which language exactly, but I believe Russian or Ukrainian did not sort correctly anymore (the first letter of the alphabet ended up at the very end, the rest looked okay however). None of these issues occur with the current ICU data file 😩

Simran-B added 1 Feature 9 WIP labels Feb 3, 2020

Simran-B added this to the 3.7 milestone Feb 3, 2020

Simran-B requested a review from dothebart February 3, 2020 18:36

Simran-B self-assigned this Feb 3, 2020

dothebart approved these changes Feb 3, 2020

View reviewed changes

Copy link

Contributor

dothebart commented Feb 12, 2020

yes, this used to be a part of our original windows debug installation instructions.

Replace icudtl.dat with full ICU data file 64.2

fb1f243

Taken from https://github.com/unicode-org/icu/releases/tag/release-64-2 (icu4c-64_2-src.zip/icu/source/data/in/icudt64l.dat

Simran-B force-pushed the feature/full-icu-64-2 branch from 33a7bb4 to fb1f243 Compare March 18, 2020 14:44

Simran-B mentioned this pull request Mar 18, 2020

Snowball 2: Update supported languages arangodb/docs#374

Merged

Simran-B removed the 9 WIP label Mar 19, 2020

Simran-B marked this pull request as ready for review March 19, 2020 12:00

Simran-B added 3 Core 3 Search IResearch / Fulltext index / Analyzers labels Mar 19, 2020

Simran-B added the 9 Ready to Merge label May 6, 2020

Simran-B requested a review from KVS85 May 6, 2020 15:42

KVS85 requested review from gnusi and jsteemann May 13, 2020 15:55

Merge branch 'devel' of https://github.com/arangodb/arangodb into fea…

d9b57e3

…ture/full-icu-64-2

dothebart approved these changes May 13, 2020

View reviewed changes

Simran-B mentioned this pull request May 14, 2020

Wrong sorting for the Icelandic alphabet #7111

Open

Simran-B added 2 commits May 14, 2020 11:14

Update CHANGELOG

fa708b2

Update README_maintainers.md

ee24225

Merge remote-tracking branch 'origin/devel' into feature/full-icu-64-2

17e78a0

Simran-B added 9 Waiting for Jenkins and removed 9 Ready to Merge labels May 14, 2020

Merge branch 'devel' into feature/full-icu-64-2

03f8362

Simran-B added 9 WIP and removed 9 Waiting for Jenkins labels May 22, 2020

Simran-B marked this pull request as draft May 22, 2020 13:30

Simran-B closed this Dec 11, 2020

Simran-B deleted the feature/full-icu-64-2 branch December 11, 2020 15:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

V8: Replace icudtl.dat with full ICU data file 64.2 #11046

V8: Replace icudtl.dat with full ICU data file 64.2 #11046

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

V8: Replace icudtl.dat with full ICU data file 64.2 #11046

V8: Replace icudtl.dat with full ICU data file 64.2 #11046

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!