[Question] ArangoSearch "==" same as phrase() when "text_en" analyzer used? · Issue #7488 · arangodb/arangodb · GitHub

[Question] ArangoSearch "==" same as phrase() when "text_en" analyzer used? #7488

Closed
dvstans opened this issue Nov 27, 2018 · 5 comments

@dvstans
dvstans commented Nov 27, 2018

I'm unsure whether the following two ArangoSearch queries should be equivalent (where "field" has been indexed using the "text_en" analyzer and "something" is the search value):

  1. for i in myview search analyzer( i.field == "something", "text_en") return i

  2. for i in myview search analyzer( phrase( i.field, "something"), "text_en" ) return i

In my testing, I've found cases where these two return the same result, but then I've also found cases where they do not. In the cases where they differ, the first form fails to return a match, but the second does. More specifically, the second form using "phrase" always returns what I expect, but the first form occasionally does not (depending on the value of "something"). If these are supposed to be equivalent, I can work up a test case.

@KVS85
Contributor
KVS85 commented Nov 30, 2018

Hello @dvstans. Can you please provide your data where results are supposed to be equal?

Generally, i.field == "something" will give you only exact-match results, while PHRASE(i.field, "something", "text_en") will match all documents where the stemmed form of "something" (according to the text_en analyzer) is present in the field.

@dvstans
Copy link
Author
dvstans commented Nov 30, 2018

@KVS85 Given a document where 'field' contains: "some words: apple red bike letter." running the two queries with "red" as the value results in a match to this document for both; however, if I run the queries with "apple" as the value, only query #2 (using phrase function) matches the document.

This behavior is independent of which indexed field I use and also word order in the field. I thought perhaps my index was corrupt, but after rebuilding it this behavior persisted. I have seen this selectivity with different words as well - not just "apple". :)

@KVS85
Contributor
KVS85 commented Nov 30, 2018

@dvstans Thank you for the clarification. Now I see that everything works as expected here.

Actually, in FOR i IN myview SEARCH ANALYZER( i.field == "something", "text_en") RETURN i the ANALYZER function is not applied to the value in the i.field == "something" comparison. That is because the comparison itself does not run the search value through the analyzer, so "something" is matched as-is. In the case of PHRASE, the analyzer context does apply to the search value (or the analyzer can be given as the last parameter of the function).

Therefore, since the "text_en" analyzer uses stemming, the two queries behave identically for some words and differently for others. For "apple" the stemmed value is "appl", while for "red" it is still "red".

In order to make these queries similar (for a single word), you can use the following approach:

  1. FOR i IN myview SEARCH i.field == TOKENS("something", "text_en")[0] RETURN i
  2. FOR i IN myview SEARCH PHRASE(i.field, "something", "text_en") RETURN i

The TOKENS function, with the analyzer name as its second parameter, shows you how the input is processed and what is actually being searched for.

Please also note that searching indexed data with a specific analyzer is only possible if the data was indexed with that analyzer. By default, only the "identity" analyzer is applied.
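
To make the difference concrete, here is a minimal Python sketch of the behavior described above. It is a toy model and not ArangoDB code. The names `toy_text_en`, `equals_match`, `phrase_match` and `equals_with_tokens` are hypothetical. The stand-in analyzer hard-codes only the stem mentioned in this thread ("apple" becomes "appl") and leaves every other word unchanged, which is an assumption and not the real stemmer.

```python
import re

# Hypothetical stand-in for the "text_en" analyzer: lowercase, split on
# non-letters, then stem. Only the stem given in this thread is hard-coded;
# all other words are assumed to stay unchanged.
KNOWN_STEMS = {"apple": "appl"}


def toy_text_en(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [KNOWN_STEMS.get(word, word) for word in words]


# Terms that end up in the index for the example document from this thread.
indexed_terms = set(toy_text_en("some words: apple red bike letter."))


def equals_match(value):
    # Models `i.field == value`: the value is looked up as-is, unstemmed.
    return value in indexed_terms


def phrase_match(value):
    # Models PHRASE for a single word: the value is analyzed first.
    return all(token in indexed_terms for token in toy_text_en(value))


def equals_with_tokens(value):
    # Models `i.field == TOKENS(value, "text_en")[0]`.
    return toy_text_en(value)[0] in indexed_terms


for word in ("red", "apple"):
    print(word, equals_match(word), phrase_match(word), equals_with_tokens(word))
```

In this model "red" matches in all three forms, while "apple" fails only with the plain comparison, because the index holds "appl" and not "apple".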

@dvstans
Author
dvstans commented Nov 30, 2018

@KVS85 Ah OK! I didn't really understand the distinction between "==" and phrase() when wrapped in an analyzer, so this makes sense now. Thanks!

@graetzer graetzer added the 2 Solved Resolution label Dec 1, 2018
@Simran-B
Contributor
Simran-B commented Dec 3, 2018

@KVS85 Do you think your explanation is needed/useful for the documentation?
