Snowball 2: Update supported languages + KI (#374) · arangodb/docs@e1d9567 · GitHub
This repository was archived by the owner on Dec 13, 2023. It is now read-only.

Commit e1d9567

Snowball 2: Update supported languages + KI (#374)
1 parent 16b7ddf commit e1d9567

17 files changed: +254 -18 lines changed

3.5/aql/functions-arangosearch.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -260,6 +260,14 @@ Match documents where the attribute at **path** is greater than (or equal to)
260260
*low* and *high* can be numbers or strings (technically also `null`, `true`
261261
and `false`), but the data type must be the same for both.
262262

263+
{% hint 'warning' %}
264+
The alphabetical order of characters is not taken into account by ArangoSearch,
265+
i.e. range queries in SEARCH operations against Views will not follow the
266+
language rules as per the defined Analyzer locale nor the server language
267+
(startup option `--default-language`)!
268+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
269+
{% endhint %}
270+
263271
- **path** (attribute path expression):
264272
the path of the attribute to test in the document
265273
- **low** (number\|string): minimum value of the desired range
@@ -406,6 +414,14 @@ is processed by a tokenizing Analyzer (type `"text"` or `"delimiter"`) or if it
406414
is an array, then a single token/element starting with the prefix is sufficient
407415
to match the document.
408416

417+
{% hint 'warning' %}
418+
The alphabetical order of characters is not taken into account by ArangoSearch,
419+
i.e. range queries in SEARCH operations against Views will not follow the
420+
language rules as per the defined Analyzer locale nor the server language
421+
(startup option `--default-language`)!
422+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
423+
{% endhint %}
424+
409425
- **path** (attribute path expression): the path of the attribute to compare
410426
against in the document
411427
- **prefix** (string): a string to search at the start of the text
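
The warning added to `IN_RANGE()` and `STARTS_WITH()` above says that range queries do not follow language-specific alphabetical order. The following Python sketch illustrates what that can mean in practice. It assumes, purely for illustration, that values are compared by their raw UTF-8 bytes; the source does not state which ordering ArangoSearch actually uses, and the helper name `in_range_bytewise` is hypothetical.

```python
def in_range_bytewise(value: str, low: str, high: str) -> bool:
    """Inclusive range check on raw UTF-8 bytes, with no language collation.

    Assumption for illustration only: this is not ArangoSearch's implementation.
    """
    v, lo, hi = (s.encode("utf-8") for s in (value, low, high))
    return lo <= v <= hi


words = ["apfel", "zebra", "äpfel", "Apfel"]

# A German speaker would expect "äpfel" to sort near "apfel", inside a..zz.
# Bytewise, "ä" encodes as 0xC3 0xA4, which sorts after "z" (0x7A),
# and "A" (0x41) sorts before "a" (0x61), so both fall outside the range.
matches = [w for w in words if in_range_bytewise(w, "a", "zz")]
print(matches)
```

Under this assumed ordering, only `apfel` and `zebra` fall inside the range, which is the kind of unexpected result the warning describes. Whether case differences matter in a real query also depends on the Analyzer in use.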

3.5/aql/operations-search.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,14 @@ are supported:
6464
- `!=`
6565
- `IN` (array or range), also `NOT IN`
6666

67+
{% hint 'warning' %}
68+
The alphabetical order of characters is not taken into account by ArangoSearch,
69+
i.e. range queries in SEARCH operations against Views will not follow the
70+
language rules as per the defined Analyzer locale nor the server language
71+
(startup option `--default-language`)!
72+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
73+
{% endhint %}
74+
6775
```js
6876
FOR doc IN viewName
6977
SEARCH ANALYZER(doc.text == "quick" OR doc.text == "brown", "text_en")

3.5/arangosearch-analyzers.md

Lines changed: 37 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -134,7 +134,7 @@ attributes:
134134
- `locale` (string): a locale in the format
135135
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
136136
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
137-
meaningful in ArangoDB.
137+
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
138138

139139
### Norm
140140

@@ -147,7 +147,7 @@ attributes:
147147
- `locale` (string): a locale in the format
148148
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
149149
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
150-
meaningful in ArangoDB.
150+
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
151151
- `accent` (boolean, _optional_):
152152
- `true` to preserve accented characters (default)
153153
- `false` to convert accented characters to their base characters
@@ -194,16 +194,13 @@ An Analyzer capable of breaking up strings into individual words while also
194194
optionally filtering out stop-words, extracting word stems, applying
195195
case conversion and accent removal.
196196

197-
Stemming support is provided by
198-
[Snowball](https://snowballstem.org/){:target="_blank"}.
199-
200197
The *properties* allowed for this Analyzer are an object with the following
201198
attributes:
202199

203200
- `locale` (string): a locale in the format
204201
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
205202
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
206-
meaningful in ArangoDB.
203+
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
207204
- `accent` (boolean, _optional_):
208205
- `true` to preserve accented characters
209206
- `false` to convert accented characters to their base characters (default)
@@ -281,3 +278,37 @@ Name | Type | Language
281278
`text_ru` | `text` | Russian
282279
`text_sv` | `text` | Swedish
283280
`text_zh` | `text` | Chinese
281+
282+
Supported Languages
283+
-------------------
284+
285+
Analyzers rely on [ICU](http://site.icu-project.org/){:target="_blank"} for
286+
language-dependent tokenization and normalization. The ICU data file
287+
`icudtl.dat` that ArangoDB ships with contains information for a lot of
288+
languages, which are technically all supported.
289+
290+
{% hint 'warning' %}
291+
The alphabetical order of characters is not taken into account by ArangoSearch,
292+
i.e. range queries in SEARCH operations against Views will not follow the
293+
language rules as per the defined Analyzer locale nor the server language
294+
(startup option `--default-language`)!
295+
Also see [Known Issues](release-notes-known-issues35.html#arangosearch).
296+
{% endhint %}
297+
298+
Stemming support is provided by [Snowball](https://snowballstem.org/){:target="_blank"},
299+
which supports the following languages:
300+
301+
Code | Language
302+
------|-----------
303+
`de` | German
304+
`en` | English
305+
`es` | Spanish
306+
`fi` | Finnish
307+
`fr` | French
308+
`it` | Italian
309+
`nl` | Dutch
310+
`no` | Norwegian
311+
`pt` | Portuguese
312+
`ru` | Russian
313+
`sv` | Swedish
314+
`zh` | Chinese

3.5/release-notes-known-issues35.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ ArangoSearch
2323
| **Date Added:** 2018-12-03 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** Using a loop variable in expressions within a corresponding SEARCH condition is not supported <br> **Affected Versions:** 3.4.x, 3.5.x <br> **Fixed in Versions:** - <br> **Reference:** [arangodb/backlog#318](https://github.com/arangodb/backlog/issues/318){:target="_blank"} (internal) |
2424
| **Date Added:** 2019-06-25 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** The `primarySort` attribute in ArangoSearch View definitions can not be set via the web interface. The option is immutable, but the web interface does not allow to set any View properties upfront (it creates a View with default parameters before the user has a chance to configure it). <br> **Affected Versions:** 3.5.x <br> **Fixed in Versions:** - <br> **Reference:** N/A |
2525
| **Date Added:** 2019-11-06 <br> **Component:** ArangoSearch <br> **Deployment Mode:** Cluster <br> **Description:** There is a possibility to get into deadlocks during Coordinator execution if a custom Analyzer was created (and is present in the `_analyzers` system collection). It is recommended not to use custom Analyzers in production environments in affected versions. <br> **Affected Versions:** 3.5.x <br> **Fixed in Versions:** 3.5.3 <br> **Reference:** [arangodb/backlog#651](https://github.com/arangodb/backlog/issues/651){:target="_blank"} (internal) |
26+
| **Date Added:** 2020-03-19 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** Operators and functions in `SEARCH` clauses of AQL queries which compare values such as `>`, `>=`, `<`, `<=`, `IN_RANGE()` and `STARTS_WITH()` neither take the server language (`--default-language`) nor the Analyzer locale into account. The alphabetical order of characters as defined by a language is thus not honored and can lead to unexpected results in range queries. <br> **Affected Versions:** 3.5.x <br> **Fixed in Versions:** - <br> **Reference:** [arangodb/backlog#679](https://github.com/arangodb/backlog/issues/679){:target="_blank"} (internal) |
2627

2728
AQL
2829
---

3.6/aql/functions-arangosearch.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -260,6 +260,14 @@ Match documents where the attribute at **path** is greater than (or equal to)
260260
*low* and *high* can be numbers or strings (technically also `null`, `true`
261261
and `false`), but the data type must be the same for both.
262262

263+
{% hint 'warning' %}
264+
The alphabetical order of characters is not taken into account by ArangoSearch,
265+
i.e. range queries in SEARCH operations against Views will not follow the
266+
language rules as per the defined Analyzer locale nor the server language
267+
(startup option `--default-language`)!
268+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
269+
{% endhint %}
270+
263271
- **path** (attribute path expression):
264272
the path of the attribute to test in the document
265273
- **low** (number\|string): minimum value of the desired range
@@ -438,6 +446,14 @@ is processed by a tokenizing Analyzer (type `"text"` or `"delimiter"`) or if it
438446
is an array, then a single token/element starting with the prefix is sufficient
439447
to match the document.
440448

449+
{% hint 'warning' %}
450+
The alphabetical order of characters is not taken into account by ArangoSearch,
451+
i.e. range queries in SEARCH operations against Views will not follow the
452+
language rules as per the defined Analyzer locale nor the server language
453+
(startup option `--default-language`)!
454+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
455+
{% endhint %}
456+
441457
- **path** (attribute path expression): the path of the attribute to compare
442458
against in the document
443459
- **prefix** (string): a string to search at the start of the text

3.6/aql/operations-search.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,14 @@ are supported:
6464
- `!=`
6565
- `IN` (array or range), also `NOT IN`
6666

67+
{% hint 'warning' %}
68+
The alphabetical order of characters is not taken into account by ArangoSearch,
69+
i.e. range queries in SEARCH operations against Views will not follow the
70+
language rules as per the defined Analyzer locale nor the server language
71+
(startup option `--default-language`)!
72+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
73+
{% endhint %}
74+
6775
```js
6876
FOR doc IN viewName
6977
SEARCH ANALYZER(doc.text == "quick" OR doc.text == "brown", "text_en")

3.6/arangosearch-analyzers.md

Lines changed: 36 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -135,7 +135,7 @@ attributes:
135135
- `locale` (string): a locale in the format
136136
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
137137
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
138-
meaningful in ArangoDB.
138+
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
139139

140140
### Norm
141141

@@ -148,7 +148,7 @@ attributes:
148148
- `locale` (string): a locale in the format
149149
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
150150
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
151-
meaningful in ArangoDB.
151+
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
152152
- `accent` (boolean, _optional_):
153153
- `true` to preserve accented characters (default)
154154
- `false` to convert accented characters to their base characters
@@ -215,16 +215,13 @@ An Analyzer capable of breaking up strings into individual words while also
215215
optionally filtering out stop-words, extracting word stems, applying
216216
case conversion and accent removal.
217217

218-
Stemming support is provided by
219-
[Snowball](https://snowballstem.org/){:target="_blank"}.
220-
221218
The *properties* allowed for this Analyzer are an object with the following
222219
attributes:
223220

224221
- `locale` (string): a locale in the format
225222
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
226223
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
227-
meaningful in ArangoDB.
224+
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
228225
- `accent` (boolean, _optional_):
229226
- `true` to preserve accented characters
230227
- `false` to convert accented characters to their base characters (default)
@@ -367,3 +364,36 @@ Name | Type | Language
367364
`text_ru` | `text` | Russian
368365
`text_sv` | `text` | Swedish
369366
`text_zh` | `text` | Chinese
367+
368+
Supported Languages
369+
-------------------
370+
371+
Analyzers rely on [ICU](http://site.icu-project.org/){:target="_blank"} for
372+
language-dependent tokenization and normalization. The ICU data file
373+
`icudtl.dat` that ArangoDB ships with contains information for a lot of
374+
languages, which are technically all supported.
375+
376+
{% hint 'warning' %}
377+
The alphabetical order of characters is not taken into account by ArangoSearch,
378+
i.e. range queries in SEARCH operations against Views will not follow the
379+
language rules as per the defined Analyzer locale nor the server language
380+
(startup option `--default-language`)!
381+
Also see [Known Issues](release-notes-known-issues36.html#arangosearch).
382+
{% endhint %}
383+
384+
Stemming support is provided by [Snowball](https://snowballstem.org/){:target="_blank"},
385+
which supports the following languages:
386+
387+
Code | Language
388+
------|-----------
389+
`de` | German
390+
`en` | English
391+
`es` | Spanish
392+
`fi` | Finnish
393+
`fr` | French
394+
`it` | Italian
395+
`nl` | Dutch
396+
`no` | Norwegian
397+
`pt` | Portuguese
398+
`ru` | Russian
399+
`sv` | Swedish

3.6/release-notes-known-issues35.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ ArangoSearch
2323
| **Date Added:** 2018-12-03 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** Using a loop variable in expressions within a corresponding SEARCH condition is not supported <br> **Affected Versions:** 3.4.x, 3.5.x <br> **Fixed in Versions:** - <br> **Reference:** [arangodb/backlog#318](https://github.com/arangodb/backlog/issues/318){:target="_blank"} (internal) |
2424
| **Date Added:** 2019-06-25 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** The `primarySort` attribute in ArangoSearch View definitions can not be set via the web interface. The option is immutable, but the web interface does not allow to set any View properties upfront (it creates a View with default parameters before the user has a chance to configure it). <br> **Affected Versions:** 3.5.x <br> **Fixed in Versions:** - <br> **Reference:** N/A |
2525
| **Date Added:** 2019-11-06 <br> **Component:** ArangoSearch <br> **Deployment Mode:** Cluster <br> **Description:** There is a possibility to get into deadlocks during Coordinator execution if a custom Analyzer was created (and is present in the `_analyzers` system collection). It is recommended not to use custom Analyzers in production environments in affected versions. <br> **Affected Versions:** 3.5.x <br> **Fixed in Versions:** 3.5.3 <br> **Reference:** [arangodb/backlog#651](https://github.com/arangodb/backlog/issues/651){:target="_blank"} (internal) |
26+
| **Date Added:** 2020-03-19 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** Operators and functions in `SEARCH` clauses of AQL queries which compare values such as `>`, `>=`, `<`, `<=`, `IN_RANGE()` and `STARTS_WITH()` neither take the server language (`--default-language`) nor the Analyzer locale into account. The alphabetical order of characters as defined by a language is thus not honored and can lead to unexpected results in range queries. <br> **Affected Versions:** 3.5.x <br> **Fixed in Versions:** - <br> **Reference:** [arangodb/backlog#679](https://github.com/arangodb/backlog/issues/679){:target="_blank"} (internal) |
2627

2728
AQL
2829
---

3.6/release-notes-known-issues36.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ ArangoSearch
2121
| **Date Added:** 2018-12-03 <br> **Component:** ArangoSearch <br> **Deployment Mode:** Cluster <br> **Description:** Score values evaluated by corresponding score functions (BM25/TFIDF) may differ in single-server and cluster with a collection having more than 1 shard <br> **Affected Versions:** 3.4.x, 3.5.x, 3.6.x <br> **Fixed in Versions:** - <br> **Reference:** [arangodb/backlog#508](https://github.com/arangodb/backlog/issues/508){:target="_blank"} (internal) |
2222
| **Date Added:** 2018-12-03 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** Using a loop variable in expressions within a corresponding SEARCH condition is not supported <br> **Affected Versions:** 3.4.x, 3.5.x, 3.6.x <br> **Fixed in Versions:** - <br> **Reference:** [arangodb/backlog#318](https://github.com/arangodb/backlog/issues/318){:target="_blank"} (internal) |
2323
| **Date Added:** 2019-06-25 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** The `primarySort` attribute in ArangoSearch View definitions can not be set via the web interface. The option is immutable, but the web interface does not allow to set any View properties upfront (it creates a View with default parameters before the user has a chance to configure it). <br> **Affected Versions:** 3.5.x, 3.6.x <br> **Fixed in Versions:** - <br> **Reference:** N/A |
24+
| **Date Added:** 2020-03-19 <br> **Component:** ArangoSearch <br> **Deployment Mode:** All <br> **Description:** Operators and functions in `SEARCH` clauses of AQL queries which compare values such as `>`, `>=`, `<`, `<=`, `IN_RANGE()` and `STARTS_WITH()` neither take the server language (`--default-language`) nor the Analyzer locale into account. The alphabetical order of characters as defined by a language is thus not honored and can lead to unexpected results in range queries. <br> **Affected Versions:** 3.5.x, 3.6.x <br> **Fixed in Versions:** - <br> **Reference:** [arangodb/backlog#679](https://github.com/arangodb/backlog/issues/679){:target="_blank"} (internal) |
2425

2526
AQL
2627
---

3.7/aql/functions-arangosearch.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -260,6 +260,14 @@ Match documents where the attribute at **path** is greater than (or equal to)
260260
*low* and *high* can be numbers or strings (technically also `null`, `true`
261261
and `false`), but the data type must be the same for both.
262262

263+
{% hint 'warning' %}
264+
The alphabetical order of characters is not taken into account by ArangoSearch,
265+
i.e. range queries in SEARCH operations against Views will not follow the
266+
language rules as per the defined Analyzer locale nor the server language
267+
(startup option `--default-language`)!
268+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
269+
{% endhint %}
270+
263271
- **path** (attribute path expression):
264272
the path of the attribute to test in the document
265273
- **low** (number\|string): minimum value of the desired range
@@ -438,6 +446,14 @@ is processed by a tokenizing Analyzer (type `"text"` or `"delimiter"`) or if it
438446
is an array, then a single token/element starting with the prefix is sufficient
439447
to match the document.
440448

449+
{% hint 'warning' %}
450+
The alphabetical order of characters is not taken into account by ArangoSearch,
451+
i.e. range queries in SEARCH operations against Views will not follow the
452+
language rules as per the defined Analyzer locale nor the server language
453+
(startup option `--default-language`)!
454+
Also see [Known Issues](../release-notes-known-issues35.html#arangosearch).
455+
{% endhint %}
456+
441457
- **path** (attribute path expression): the path of the attribute to compare
442458
against in the document
443459
- **prefix** (string): a string to search at the start of the text

3.7/aql/operations-search.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -65,6 +65,14 @@ are supported:
6565
- `IN` (array or range), also `NOT IN`
6666
- `LIKE` (introduced in v3.7.0), also `NOT LIKE`
6767

68+
{% hint 'warning' %}
69+
The alphabetical order of characters is not taken into account by ArangoSearch,
70+
i.e. range queries in SEARCH operations against Views will not follow the
71+
language rules as per the defined Analyzer locale nor the server language
72+
(startup option `--default-language`)!
73+
Also see [Known Issues](../release-notes-known-issues37.html#arangosearch).
74+
{% endhint %}
75+
6876
```js
6977
FOR doc IN viewName
7078
SEARCH ANALYZER(doc.text == "quick" OR doc.text == "brown", "text_en")
