-
Notifications
You must be signed in to change notification settings - Fork 853
V8: Replace icudtl.dat with full ICU data file 64.2 #11046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Something isn't right with ArangoSearch... I created a collection
I also created two Analyzers: var analyzers = require("@arangodb/analyzers");
analyzers.save("text_is","text",{ locale: "is.utf-8", stopwords:[] },["frequency","norm","position"]);
analyzers.save("text_is2","text",{ locale:"is.utf-8", accent: true, stopwords:[]},["frequency","norm","position"]) And a View:
Sorting the collection or View is okay:
Both return the documents in the expected order if the server was started with However, limiting it to a sub-range of letters doesn't work correctly in the View case: FOR doc IN coll
FILTER doc.letter >= "X" AND doc.letter <= "Þ"
SORT doc.letter
RETURN doc.letter // [ "X", "Y", "Ý", "Þ" ] FOR doc IN coll
SEARCH ANALYZER(doc.letter >= "X" AND doc.letter <= "Þ", <analyzer>)
SORT doc.letter
RETURN doc.letter
/* different Analyzers:
identity = [ "Á", "Ð", "É", "Í", "Ó", "Ú", "Y", "Ý", "Þ", "Æ", "Ö"]
text_en = [ "A", "Á", "B", "D", "E", "É", "F", "G", "H", "I", "Í", "J", "K", "L", "M", "N", "O", "Ó", "P", "R", "S", "T", "U", "Ú", "V", "X", "Y", "Ý", "Ö" ]
text_is = [ "A", "Á", "B", "D", "E", "É", "F", "G", "H", "I", "Í", "J", "K", "L", "M", "N", "O", "Ó", "P", "R", "S", "T", "U", "Ú", "V", "X", "Y", "Ý", "Ö" ]
text_is2 = [ "A", "B", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "R", "S", "T", "U", "V", "X", "Y" ]
*/ BTW. it is allowed to do |
On a related note, can we get rid of this warning that shows up on every server startup?
We ship our own data file and it's uncommon to have one in your system I suppose. As long as either exists, no warning should be raised IMO. An INFO log would be okay though, to tell the user which icudtl.dat it picked, should that be of relevance (or what was the reason to add above warning in the first place?) |
huh? do you have an environment variable configured that points there? We don't search there by default. |
@dothebart Heureka! Looks like I created this env var? Probably because it was mentioned in https://www.arangodb.com/docs/stable/installation-compiling-windows.html#building-arangodb. I do have Is the information from the docs still relevant, or does a locally compiled ArangoDB find the ICU data file from the repository anyway? |
yes, this used to be a part of our original windows debug installation instructions. |
Taken from https://github.com/unicode-org/icu/releases/tag/release-64-2 (icu4c-64_2-src.zip/icu/source/data/in/icudt64l.dat
33a7bb4
to
fb1f243
Compare
@dothebart According to Andrey it is an (internally) known issue that the SEARCH operation does not take For SORT as well as regular FILTER it appears to work just fine, so nothing should stop us from merging the full ICU data file, as the ArangoSearch related issues existed before. |
Hi! You might want to see the progress on nodejs/full-icu-npm#36 — I have not started publishing the data binaries as described in ICU-20600 yet. How far back would you need them? Also note that for Node.js in https://github.com/nodejs/node/pull/29522/files#diff-947d0de2be4d9fa41699237a83d77ac5 I used a .bz2 compression of the data file. Only 9.3Mb Also by the way, ICU has a data build tool, https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md that can be used to customize a subset of the data. If you only need collation (or collation plus something) you could make a more compact data file. |
Hi @srl295, TL;DR - we don't need anything older than ICU v64.2 (and we already have that). I don't know all the details but here some background information:
|
Then you'll find the full little endian data file within icu-src .tgz or .zip in the ICU download. |
please note ICU67rc has the new data file zips https://github.com/unicode-org/icu/releases/tag/release-67-rc |
@KVS85 Meanwhile, I added several of these warnings to the docs: While the larger ICU data file adds support for more languages, the lexicographic order is still not taken into account. There is no regression because of this PR, so it should be safe to merge. |
…ture/full-icu-64-2
@KVS85 FWIW are the Jenkins URLs supposed to be broken? I get host not found… |
@srl295 The links work within our company network only. Our Jenkins is not public. |
@srl295 I added instructions for future updates of the ICU data file. The separate distribution of just the data file is really handy, thanks a lot! |
In the previous PR run, the following error occurred:
A test run using the latest devel state did not show this error: http://jenkins01.arangodb.biz:8080/job/arangodb-matrix-pr-linux/11687/ Merged devel into this PR once more. Let's see if the problem persists. If it does, then the ICU data file is to blame - even though I can't think of a good reason why that should be the case: http://jenkins01.arangodb.biz:8080/job/arangodb-matrix-pr-linux/11689/ |
It seems to be caused by the ICU data file after:
The tests IResearchQueryNoMaterialization-test.cpp and IResearchQueryLateMaterialization-test.cpp don't seem to deal with non-ASCII characters, so I'm puzzled about these failures. |
With the ICU file applied on top of #11603 it still produces above errors 😕 |
Welcome. You’ve gotten to use the file before I have :) you’re user #1 of
it I think.
|
Our jenkins is only reachable internally or via VPN.
Hence if there are noteworthy results, we copy them here.
|
It seems like the official ICU data file and what we currently use is somehow incompatible, and it's unclear if we want to put in the effort of building Chromium with tweaks to get a compatible, complete file. https://arangodb.atlassian.net/browse/AR-115 (internal) |
What kind of incompatibilities ?
…On Fri, Dec 11, 2020 at 9:08 AM Simran ***@***.***> wrote:
It seems like the official ICU data file and what we currently use is
somehow incompatible, and it's unclear if we want to put in the effort of
building Chromium with tweaks to get a compatible, complete file.
https://arangodb.atlassian.net/browse/AR-115 (internal)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11046 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGQZM4766REBOZFFSSUMMDSUIYWRANCNFSM4KPJ7XVQ>
.
|
@srl295 If I substitute our Chromium/V8 ICU data file (which is a subset) with the matching version of the official full ICU data file, then suddenly bizarre errors occur in some of our unit tests, which don't even deal with non-ASCII characters. Furthermore, some collations seem to be broken. I can't remember which language exactly, but I believe Russian or Ukrainian did not sort correctly anymore (the first letter of the alphabet ended up at the very end, the rest looked okay however). None of these issues occur with the current ICU data file 😩 |
Taken from https://github.com/unicode-org/icu/releases/tag/release-64-2
(icu4c-64_2-src.zip/icu/source/data/in/icudt64l.dat)
This is for little endian systems only (
l
). If needed, it seems like icupkg -tb could be used to convert it to big endianess.Start server with new data folder and
--default-language is
, then run the following AQL query:With the icudtl.dat replaced in
3rdParty\V8\v7.9.317\third_party\icu\common
(10 MB -> 26 MB) I get the correct result. Might need to add specific tests to ensure that it's actually working and to raise flags if we accidentally overwrite the data file with a subset version some time later. Could test Icelandic and German phone book sortingde-DE-u-co-phonebk
.Related: