Sorting in SEARCH by TFIDF or BM25 in combination with LIMIT no longer works correctly #14427

Open
admtech opened this issue Jun 29, 2021 · 11 comments
Labels
1 Analyzing 3 Search IResearch / Fulltext index / Analyzers Waiting User Reply

Comments

@admtech
admtech commented Jun 29, 2021

My Environment

  • ArangoDB Version: 3.7.12
  • Storage Engine: RocksDB
  • Deployment Mode: Single Server
  • Deployment Strategy: Manual Start
  • Configuration:
  • Infrastructure: own Datacenter, Server over VMWare Cloud
  • Operating System: Ubuntu 18.04.5 LTS
  • Total RAM in your machine: 60 GB
  • Disks in use: virtual Disk (NAS)
  • Used Package: Ubuntu

Component, Query & Data

Affected feature: Server

Size of your Dataset on disk: 6.2 GB

Problem

When a view query is sorted only by TFIDF or BM25 and combined with LIMIT, some rows are missing from the result. If I additionally sort by an arbitrary, non-existent field, it works again.

AQL query (if applicable):

Simplified:

Does not work (only a few results):

FOR doc IN viewName
  SEARCH ...
  SORT BM25(doc) DESC
  LIMIT 40
  RETURN doc

Works (40 results and all correctly sorted):

FOR doc IN viewName
  SEARCH ...
  SORT doc.anyfieldname, BM25(doc) DESC
  LIMIT 40
  RETURN doc

If I leave out the additional sort by "doc.anyfieldname", fullCount reports that 763 elements were found, but the LIMIT is not honored and only 27 are actually returned. Note that "doc.anyfieldname" is a field that does not exist in the collection.

The search was performed for a term that occurs very often, so many results share the same score.

I first noticed the error with version 3.7.12.

Real example:

Search term: query = "Windows 11"

LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
doc.art IN ['tutorial','report','tip','info','imho'] AND 
doc.status=='ok' LET score = TFIDF(doc) 
SORT doc.blabla DESC, score DESC
LIMIT 40 
RETURN {title: doc.title, score: score}

Result: 40 articles

Windows 10: Spartan Rendering Engine im Internet Explorer 11 aktivieren | 17.690162658691406
Windows 11 steht im Insider Preview bereit | 17.428010940551758
Nokia 4.2 u. Android 11 - WLAN Automatische Verbindung defekt | 15.535337448120117
IOS 11 und macOS 10.13: Apple zwingt zu neuem Authentifizierungsverfahren | 14.322734832763672
Windows Datenträgerverwaltung buggy! (Windows Vista bis Windows 10) | 14.114057540893555
Apple Special Event vom 10.09.2019: Arcade, TV+, iPad und iPadOS, Watch und iPhone 11 | 14.060583114624023
Exchange Server 2016 Probleme auf Server 2016 mit iOS 11 Mailapp | 14.060583114624023
Infos zu Patchday-Problemen und CCleaner (10. u. 11. September 2018) | 14.060583114624023
...

If I leave out "doc.blabla DESC," then I only ..

LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
doc.art IN ['tutorial','report','tip','info','imho'] AND 
doc.status=='ok' LET score = TFIDF(doc) 
SORT score DESC
LIMIT 40 
RETURN {title: doc.title, score: score}

.. get 27 results, and many articles are missing.

Windows 11 steht im Insider Preview bereit | 17.427152633666992
Nokia 4.2 u. Android 11 - WLAN Automatische Verbindung defekt | 15.534043312072754
Symantec Endpoint Protection 11 - Nach der Installation stoppt der Service SEMSRV immer wieder | 14.321438789367676
Veeam 11 ist verfügbar | 14.05964183807373
Windows Server 2016 Suche funktioniert nicht und ist ausgegraut - Windows Server 2016 Search not work | 13.043990135192871
Windows Server 2K8 u. SMTP-Dienst - ACHTUNG bei Windows-Update KB976323 | 13.043990135192871
Windows XP und Windows 7 parallel in verschiedenen Partitionen installieren, Brenner Probleme | 11.831385612487793
Windows 7 - freigegebener Drucker reagiert nicht aus Anwendung nach Windows Update KB3177725 | 11.831385612487793
Windows 10 Build 10565 akzeptiert Keys von Windows 7, 8 und 8.1 | 11.831385612487793
...

If I omit the LIMIT completely:

LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
doc.art IN ['tutorial','report','tip','info','imho'] AND 
doc.status=='ok' LET score = TFIDF(doc) 
SORT score DESC
RETURN {title: doc.title, score: score}

All 763 elements are returned correctly:

Nokia 4.2 u. Android 11 - WLAN Automatische Verbindung defekt | 15.535677909851074
Windows Datenträgerverwaltung buggy! (Windows Vista bis Windows 10) | 14.114253997802734
Sicherheitsupdates für Exchange Server 11. Mai 2021 | 14.060922622680664
Veeam 11 ist verfügbar | 14.060922622680664
Bluescreen "STOP: 0x000000D1" bei Upgrade Windows Vista auf Windows 7 als VirtualBox Gast | 13.043953895568848
Windows Server 2016 Suche funktioniert nicht und ist ausgegraut - Windows Server 2016 Search not work | 13.043953895568848
Windows Server 2K8 u. SMTP-Dienst - ACHTUNG bei Windows-Update KB976323 | 13.043953895568848
...

I have no idea where the error is, but the fact is that if I put a non-existent field in front of the sort criteria, it works again (workaround).

Regards,
Frank

@admtech changed the title from "Die Sortierung bei SEARCH nach TFIDF oder BM25 in Kombination mit LIMIT funktioniert nicht mehr richtig" to "Sorting in SEARCH by TFIDF or BM25 in combination with LIMIT no longer works correctly" on Jun 29, 2021
@admtech
Author
admtech commented Jun 29, 2021

Does not work:

Result: 22 elements

LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
doc.art IN ['tutorial','report','tip','info','imho'] AND 
doc.status=='ok' LET score = TFIDF(doc) 
SORT score DESC
LIMIT 40
RETURN {title: doc.title, score: score}
Query String (283 chars, cacheable: true):
 LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
 SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
 doc.art IN ['tutorial','report','tip','info','imho'] AND 
 doc.status=='ok' LET score = TFIDF(doc) 
 SORT score DESC
 LIMIT 40
 RETURN {title: doc.title, score: score}
 

Execution plan:
 Id   NodeType              Est.   Comment
  1   SingletonNode            1   * ROOT
  3   EnumerateViewNode   270197     - FOR doc IN con_create SEARCH ((doc.`art` IN [ "imho", "info", "report", "tip", "tutorial" ]) && (doc.`status` == "ok") && ANALYZER((doc.`title` IN [ "windows", "11" ]), "text_de")) LET #5 = TFIDF(doc)   /* view query with late materialization */
  5   SortNode            270197       - SORT #5 DESC   /* sorting strategy: constrained heap */
  6   LimitNode               40       - LIMIT 0, 40
  9   MaterializeNode         40       - MATERIALIZE doc
  7   CalculationNode         40       - LET #3 = { "title" : doc.`title`, "score" : #5 }   /* simple expression */
  8   ReturnNode              40       - RETURN #3

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   remove-unnecessary-calculations
  2   handle-arangosearch-views
  3   remove-unnecessary-calculations-2
  4   sort-limit
  5   late-document-materialization-arangosearch
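
The plan above shows the "constrained heap" sorting strategy. As a rough illustration of why tied scores alone should not shrink the result, here is a minimal JavaScript sketch of a bounded heap that keeps the k best rows. It is not ArangoDB's implementation; topKByScore, the row shape and the synthetic data are assumptions made purely for illustration.

```javascript
// Illustrative only: a bounded min-heap that keeps the k highest-scoring rows.
// NOT ArangoDB's code; topKByScore and the row shape are hypothetical.
function topKByScore(rows, k) {
  if (k <= 0) return [];
  const heap = []; // min-heap on score: heap[0] is the weakest row kept so far
  const swap = (i, j) => {
    const tmp = heap[i];
    heap[i] = heap[j];
    heap[j] = tmp;
  };
  const siftUp = (i) => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent].score <= heap[i].score) break;
      swap(parent, i);
      i = parent;
    }
  };
  const siftDown = (i) => {
    for (;;) {
      const left = 2 * i + 1;
      const right = left + 1;
      let smallest = i;
      if (left < heap.length && heap[left].score < heap[smallest].score) smallest = left;
      if (right < heap.length && heap[right].score < heap[smallest].score) smallest = right;
      if (smallest === i) return;
      swap(i, smallest);
      i = smallest;
    }
  };
  for (const row of rows) {
    if (heap.length < k) {
      heap.push(row);
      siftUp(heap.length - 1);
    } else if (row.score > heap[0].score) {
      heap[0] = row; // evict the weakest row
      siftDown(0);
    }
    // A row that merely ties with the weakest kept row is skipped,
    // but the heap still holds k rows.
  }
  return heap.slice().sort((a, b) => b.score - a.score);
}

// Synthetic data: 763 rows, most of them sharing one score.
const rows = Array.from({ length: 763 }, (_, i) => ({
  title: "doc-" + i,
  score: i % 50 === 0 ? 14 : 11.8,
}));
const top = topKByScore(rows, 40);
console.log(top.length); // 40
```

With at least k matching rows, a correct bounded heap always yields exactly k rows, no matter how many scores tie, which is why a result of 22 or 27 rows for LIMIT 40 looks like a defect rather than an effect of ties.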

@admtech
Author
admtech commented Jun 29, 2021

Works:

Result: 40 elements

LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
doc.art IN ['tutorial','report','tip','info','imho'] AND 
doc.status=='ok' LET score = TFIDF(doc) 
SORT doc.blubblub, score DESC
LIMIT 40
RETURN {title: doc.title, score: score}
Query String (297 chars, cacheable: true):
 LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
 SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
 doc.art IN ['tutorial','report','tip','info','imho'] AND 
 doc.status=='ok' LET score = TFIDF(doc) 
 SORT doc.blubblub, score DESC
 LIMIT 40
 RETURN {title: doc.title, score: score}
 

Execution plan:
 Id   NodeType              Est.   Comment
  1   SingletonNode            1   * ROOT
  3   EnumerateViewNode   270197     - FOR doc IN con_create SEARCH ((doc.`art` IN [ "imho", "info", "report", "tip", "tutorial" ]) && (doc.`status` == "ok") && ANALYZER((doc.`title` IN [ "windows", "11" ]), "text_de")) LET #7 = TFIDF(doc)   /* view query */
  5   CalculationNode     270197       - LET #3 = doc.`blubblub`   /* attribute expression */
  6   SortNode            270197       - SORT #3 ASC, #7 DESC   /* sorting strategy: constrained heap */
  7   LimitNode               40       - LIMIT 0, 40
  8   CalculationNode         40       - LET #5 = { "title" : doc.`title`, "score" : #7 }   /* simple expression */
  9   ReturnNode              40       - RETURN #5

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   remove-unnecessary-calculations
  3   move-calculations-up-2
  4   handle-arangosearch-views
  5   remove-unnecessary-calculations-2
  6   sort-limit

@maxkernbach
Contributor

Hi @admtech,

Could you please check whether the query below still returns only 22 results when you manually disable the optimizer rule late-document-materialization-arangosearch?

LET queries = TOKENS(@query, 'text_de') FOR doc IN con_create 
SEARCH ANALYZER( doc.title IN queries , 'text_de') AND 
doc.art IN ['tutorial','report','tip','info','imho'] AND 
doc.status=='ok' LET score = TFIDF(doc) 
SORT score DESC
LIMIT 40
RETURN {title: doc.title, score: score}

@maxkernbach maxkernbach added 1 Analyzing 3 Search IResearch / Fulltext index / Analyzers Waiting User Reply labels Jun 30, 2021
@admtech
Author
admtech commented Jul 1, 2021

Hmm, this is a production system. Can I turn this on and off without the end users noticing? Do I need to restart ArangoDB after the change? The complete https://administrator.de site runs on it.

What is the exact command in the console?

off:
stmt.explain({ optimizer: { rules: ["-late-document-materialization-arangosearch"] } });

on:
stmt.explain({ optimizer: { rules: ["+late-document-materialization-arangosearch"] } });

@admtech
Author
admtech commented Jul 5, 2021

I am still waiting for an answer.

@maxkernbach
Contributor

Hi @admtech,

Please note that there is no guaranteed SLA for ArangoDB Community Support.
See https://www.arangodb.com/subscriptions/ for details.

Optimizer rules can either be:

  1. disabled permanently by modifying the server configuration (this requires a server restart), or
  2. disabled per query execution (this does not require a server restart; see
    https://www.arangodb.com/docs/stable/aql/execution-and-performance-optimizer.html#turning-specific-optimizer-rules-off)

In order to run the stmt statement, you first have to create a statement with your query. Replace the query and bindVars parameters accordingly.
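
As a minimal sketch of that step, the snippet below builds the descriptor passed to db._createStatement(), using the query from this issue with @query as a bind parameter. buildStatementSpec and searchTerm are hypothetical names chosen for illustration, and the guard only exists so the snippet can also be tried outside arangosh.

```javascript
// Sketch: build the { query, bindVars } descriptor for db._createStatement().
// buildStatementSpec and searchTerm are hypothetical names.
function buildStatementSpec(searchTerm) {
  return {
    query: [
      "LET queries = TOKENS(@query, 'text_de')",
      "FOR doc IN con_create",
      "  SEARCH ANALYZER(doc.title IN queries, 'text_de')",
      "    AND doc.art IN ['tutorial','report','tip','info','imho']",
      "    AND doc.status == 'ok'",
      "  LET score = TFIDF(doc)",
      "  SORT score DESC",
      "  LIMIT 40",
      "  RETURN { title: doc.title, score: score }",
    ].join("\n"),
    bindVars: { query: searchTerm },
  };
}

const spec = buildStatementSpec("Windows 11");

// Inside arangosh the global `db` exists; elsewhere this branch is skipped.
if (typeof db !== "undefined") {
  var stmt = db._createStatement(spec);
}
```

Keeping the search term in bindVars avoids the quoting differences between a term embedded with escaped double quotes and one embedded with single quotes.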

Rules to be enabled need to be prefixed with a +; rules to be disabled need to be prefixed with a -.

Once that is done, you can run:

stmt.explain({ optimizer: { rules: [ "-late-document-materialization-arangosearch"] } });

in order to explain the query with the given disabled optimizer rule.

Similarly, you can execute the query with the disabled optimizer rule:

stmt.execute({ optimizer: { rules: [ "-late-document-materialization-arangosearch"] } });

I am wondering whether the correct result set is returned once optimizer rule late-document-materialization-arangosearch is disabled.
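
To make that comparison repeatable, a small helper along the following lines could run the query once with the rule enabled and once with it disabled, then report both row counts. This is only a sketch: compareRuleEffect and run are hypothetical names, and run is injected so the logic can be tried outside arangosh with a stand-in executor.

```javascript
// Hypothetical helper: compare row counts with the rule enabled and disabled.
// `run` is an injected function (options) => array of rows.
const RULE = "late-document-materialization-arangosearch";

function compareRuleEffect(run, expectedCount) {
  const withRule = run({ optimizer: { rules: ["+" + RULE] } });
  const withoutRule = run({ optimizer: { rules: ["-" + RULE] } });
  return {
    withRule: withRule.length,
    withoutRule: withoutRule.length,
    ruleChangesCount: withRule.length !== withoutRule.length,
    bothComplete:
      withRule.length === expectedCount && withoutRule.length === expectedCount,
  };
}

// In arangosh, `run` could wrap the call shown above (an assumption, untested here):
//   const run = (options) => stmt.execute(options).toArray();

// Stand-in executor for a dry run outside arangosh: always three rows.
const fakeRun = () => [{}, {}, {}];
console.log(compareRuleEffect(fakeRun, 3));
```

Running the helper several times is worthwhile, since a single pair of counts says little if the result size fluctuates between executions.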

@admtech
Author
admtech commented Jul 5, 2021

> Hi @admtech,
>
> Please note that there is no guaranteed SLA for ArangoDB Community Support.
> See https://www.arangodb.com/subscriptions/ for details.

Hi @maxkernbach

I think you are confusing something fundamental here. You are not helping me with my bug; I am helping you find bugs so that you can fix them. Working out the issue was a lot of work and took my time, so it makes absolutely no sense to counter with SLAs now (besides, one of your employees asked me to do that). I thought it was in your interest to fix such basic bugs in your database as soon as possible.

Since this worked fine in the previous version, I don't think I made a mistake.

> I am wondering whether the correct result set is returned once optimizer rule late-document-materialization-arangosearch is disabled.

I will try it out tonight. Thanks for the explanation.

@admtech
Author
admtech commented Jul 13, 2021

So I have now made a few queries via the shell:

Test 1:

arangodb:8529@administrator_core> var stmt = db._createStatement( {"query": "LET queries = TOKENS(\"Windows 11\", 'text_de') FOR doc IN con_create SEARCH ANALYZER( doc.title IN queries , 'text_de') AND doc.art IN ['tutorial','report','tip','info','imho'] AND doc.status=='ok' LET score = TFIDF(doc) SORT score DESC LIMIT 40 RETURN {title: doc.title, score: score}" });

Result:

arangodb:8529@administrator_core> c = stmt.execute();
[object ArangoQueryCursor, count: 38, cached: false, hasMore: true]
[
  {
    "title" : "Microsoft hat Windows 11 offiziell vorgestellt",
    "score" : 18.24855613708496
  },
..

Test 2:

arangodb:8529@administrator_core> var stmt = db._createStatement( {"query": "LET queries = TOKENS('Windows 11', 'text_de') FOR doc IN con_create SEARCH ANALYZER( doc.title IN queries , 'text_de') AND doc.art IN ['tutorial','report','tip','info','imho'] AND doc.status=='ok' LET score = TFIDF(doc) SORT doc.blabla DESC, score DESC LIMIT 40 RETURN {title: doc.title, score: score}" });

Result:

arangodb:8529@administrator_core> c = stmt.execute();
[object ArangoQueryCursor, count: 40, cached: false, hasMore: true]
[
  {
    "title" : "Microsoft hat Windows 11 offiziell vorgestellt",
    "score" : 18.248310089111328
  },
 ...

Interestingly, I now get a different result than before. It is still wrong, though: 38 results without the additional sort, versus 40 results with the additional sort on the non-existent field "doc.blabla DESC".

Now let's switch the optimizer rule off:

arangodb:8529@administrator_core> stmt.execute({ optimizer: { rules: [ "-late-document-materialization-arangosearch"] } });
[object ArangoQueryCursor, count: 38, cached: false, hasMore: true]
[
  {
    "title" : "Windows 10: Spartan Rendering Engine im Internet Explorer 11 aktivieren",
    "score" : 18.216835021972656
  },
  {
    "title" : "Windows 11 Insider Preview Build 22000.65 available",
    "score" : 18.139537811279297
  },
..

And on again:

arangodb:8529@administrator_core> stmt.execute({ optimizer: { rules: [ "+late-document-materialization-arangosearch"] } });
[object ArangoQueryCursor, count: 39, cached: false, hasMore: true]

[
  {
    "title" : "Windows 10: Spartan Rendering Engine im Internet Explorer 11 aktivieren",
    "score" : 18.217409133911133
  },
  {
    "title" : "Windows 11 Insider Preview Build 22000.65 available",
    "score" : 18.139352798461914
  },
...

Now we have one more result (39). When called again, there are 38 results again!?

Here is the output of explain:

arangodb:8529@administrator_core> stmt.explain({ optimizer: { rules: [ "-late-document-materialization-arangosearch"] } });
{
  "plan" : {
    "nodes" : [
      {
        "type" : "SingletonNode",
        "dependencies" : [ ],
        "id" : 1,
        "estimatedCost" : 1,
        "estimatedNrItems" : 1
      },
      {
        "type" : "EnumerateViewNode",
        "dependencies" : [
          1
        ],
        "id" : 3,
        "estimatedCost" : 270544,
        "estimatedNrItems" : 270543,
        "database" : "administrator_core",
        "view" : "con_create",
        "viewId" : "986969875",
        "outVariable" : {
          "id" : 1,
          "name" : "doc",
          "isDataFromCollection" : false
        },
        "viewValuesVars" : [ ],
        "condition" : {
          "type" : "n-ary or",
          "typeID" : 63,
          "subNodes" : [
            {
              "type" : "n-ary and",
              "typeID" : 62,
              "subNodes" : [
                {
                  "type" : "compare in",
                  "typeID" : 31,
                  "sorted" : false,
                  "subNodes" : [
                    {
                      "type" : "attribute access",
                      "typeID" : 35,
                      "name" : "art",
                      "subNodes" : [
                        {
                          "type" : "reference",
                          "typeID" : 45,
                          "name" : "doc",
                          "id" : 1
                        }
                      ]
                    },
                    {
                      "type" : "array",
                      "typeID" : 41,
                      "sorted" : true,
                      "subNodes" : [
                        {
                          "type" : "value",
                          "typeID" : 40,
                          "value" : "imho",
                          "vType" : "string",
                          "vTypeID" : 4
                        },
                        {
                          "type" : "value",
                          "typeID" : 40,
                          "value" : "info",
                          "vType" : "string",
                          "vTypeID" : 4
                        },
                        {
                          "type" : "value",
                          "typeID" : 40,
                          "value" : "report",
                          "vType" : "string",
                          "vTypeID" : 4
                        },
                        {
                          "type" : "value",
                          "typeID" : 40,
                          "value" : "tip",
                          "vType" : "string",
                          "vTypeID" : 4
                        },
                        {
                          "type" : "value",
                          "typeID" : 40,
                          "value" : "tutorial",
                          "vType" : "string",
                          "vTypeID" : 4
                        }
                      ]
                    }
                  ]
                },
                {
                  "type" : "compare ==",
                  "typeID" : 25,
                  "excludesNull" : false,
                  "subNodes" : [
                    {
                      "type" : "attribute access",
                      "typeID" : 35,
                      "name" : "status",
                      "subNodes" : [
                        {
                          "type" : "reference",
                          "typeID" : 45,
                          "name" : "doc",
                          "id" : 1
                        }
                      ]
                    },
                    {
                      "type" : "value",
                      "typeID" : 40,
                      "value" : "ok",
                      "vType" : "string",
                      "vTypeID" : 4
                    }
                  ]
                },
                {
                  "type" : "function call",
                  "typeID" : 47,
                  "name" : "ANALYZER",
                  "subNodes" : [
                    {
                      "type" : "array",
                      "typeID" : 41,
                      "sorted" : false,
                      "subNodes" : [
                        {
                          "type" : "compare in",
                          "typeID" : 31,
                          "sorted" : false,
                          "subNodes" : [
                            {
                              "type" : "attribute access",
                              "typeID" : 35,
                              "name" : "title",
                              "subNodes" : [
                                {
                                  "type" : "reference",
                                  "typeID" : 45,
                                  "name" : "doc",
                                  "id" : 1
                                }
                              ]
                            },
                            {
                              "type" : "array",
                              "typeID" : 41,
                              "sorted" : false,
                              "subNodes" : [
                                {
                                  "type" : "value",
                                  "typeID" : 40,
                                  "value" : "windows",
                                  "vType" : "string",
                                  "vTypeID" : 4
                                },
                                {
                                  "type" : "value",
                                  "typeID" : 40,
                                  "value" : "11",
                                  "vType" : "string",
                                  "vTypeID" : 4
                                }
                              ]
                            }
                          ]
                        },
                        {
                          "type" : "value",
                          "typeID" : 40,
                          "value" : "text_de",
                          "vType" : "string",
                          "vTypeID" : 4
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        "scorers" : [
          {
            "id" : 6,
            "name" : "5",
            "node" : {
              "type" : "function call",
              "typeID" : 47,
              "name" : "TFIDF",
              "subNodes" : [
                {
                  "type" : "array",
                  "typeID" : 41,
                  "subNodes" : [
                    {
                      "type" : "reference",
                      "typeID" : 45,
                      "name" : "doc",
                      "id" : 1
                    }
                  ]
                }
              ]
            }
          }
        ],
        "shards" : [ ],
        "options" : {
          "waitForSync" : false,
          "conditionOptimization" : "auto",
          "collections" : null
        },
        "volatility" : -1
      },
      {
        "type" : "SortNode",
        "dependencies" : [
          3
        ],
        "id" : 5,
        "estimatedCost" : 5152627.270413454,
        "estimatedNrItems" : 270543,
        "elements" : [
          {
            "inVariable" : {
              "id" : 6,
              "name" : "5",
              "isDataFromCollection" : false
            },
            "ascending" : false
          }
        ],
        "stable" : false,
        "limit" : 40,
        "strategy" : "constrained-heap"
      },
      {
        "type" : "LimitNode",
        "dependencies" : [
          5
        ],
        "id" : 6,
        "estimatedCost" : 5152667.270413454,
        "estimatedNrItems" : 40,
        "offset" : 0,
        "limit" : 40,
        "fullCount" : false
      },
      {
        "type" : "CalculationNode",
        "dependencies" : [
          6
        ],
        "id" : 7,
        "estimatedCost" : 5152707.270413454,
        "estimatedNrItems" : 40,
        "expression" : {
          "type" : "object",
          "typeID" : 42,
          "subNodes" : [
            {
              "type" : "object element",
              "typeID" : 43,
              "name" : "title",
              "subNodes" : [
                {
                  "type" : "attribute access",
                  "typeID" : 35,
                  "name" : "title",
                  "subNodes" : [
                    {
                      "type" : "reference",
                      "typeID" : 45,
                      "name" : "doc",
                      "id" : 1
                    }
                  ]
                }
              ]
            },
            {
              "type" : "object element",
              "typeID" : 43,
              "name" : "score",
              "subNodes" : [
                {
                  "type" : "reference",
                  "typeID" : 45,
                  "name" : "5",
                  "id" : 6
                }
              ]
            }
          ]
        },
        "outVariable" : {
          "id" : 4,
          "name" : "3",
          "isDataFromCollection" : false
        },
        "canThrow" : false,
        "expressionType" : "simple"
      },
      {
        "type" : "ReturnNode",
        "dependencies" : [
          7
        ],
        "id" : 8,
        "estimatedCost" : 5152747.270413454,
        "estimatedNrItems" : 40,
        "inVariable" : {
          "id" : 4,
          "name" : "3",
          "isDataFromCollection" : false
        },
        "count" : true
      }
    ],
    "rules" : [
      "remove-unnecessary-calculations",
      "handle-arangosearch-views",
      "remove-unnecessary-calculations-2",
      "sort-limit"
    ],
    "collections" : [
      {
        "name" : "735675652",
        "type" : "read"
      },
      {
        "name" : "con_create",
        "type" : "read"
      },
      {
        "name" : "content",
        "type" : "read"
      }
    ],
    "variables" : [
      {
        "id" : 6,
        "name" : "5",
        "isDataFromCollection" : false
      },
      {
        "id" : 4,
        "name" : "3",
        "isDataFromCollection" : false
      },
      {
        "id" : 2,
        "name" : "score",
        "isDataFromCollection" : false
      },
      {
        "id" : 1,
        "name" : "doc",
        "isDataFromCollection" : false
      },
      {
        "id" : 0,
        "name" : "queries",
        "isDataFromCollection" : false
      }
    ],
    "estimatedCost" : 5152747.270413454,
    "estimatedNrItems" : 40,
    "isModificationQuery" : false
  },
  "warnings" : [ ],
  "stats" : {
    "rulesExecuted" : 40,
    "rulesSkipped" : 1,
    "plansCreated" : 1
  },
  "cacheable" : true
} 

Any other ideas? Were you able to identify the bug? Is there anything else I can do for the ArangoDB team?

@admtech
Author
admtech commented Jul 13, 2021

To rule out the possibility that there are simply not enough matching documents, here is another test with LIMIT 100:

arangodb:8529@administrator_core> var stmt = db._createStatement( {"query": "LET queries = TOKENS('Windows 11', 'text_de') FOR doc IN con_create SEARCH ANALYZER( doc.title IN queries , 'text_de') AND doc.art IN ['tutorial','report','tip','info','imho'] AND doc.status=='ok' LET score = TFIDF(doc) SORT score DESC LIMIT 100 RETURN {title: doc.title, score: score}" });

Result:

arangodb:8529@administrator_core> c = stmt.execute();
[object ArangoQueryCursor, count: 97, cached: false, hasMore: true]
[
  {
    "title" : "Windows 10: Spartan Rendering Engine im Internet Explorer 11 aktivieren",
    "score" : 18.218204498291016
  },
  {
    "title" : "Windows 11 Insider Preview Build 22000.65 verfügbar",
    "score" : 18.139446258544922
  },
...

If I take out the filter "AND doc.art IN ['tutorial','report','tip','info','imho']", I get a different result again:

arangodb:8529@administrator_core> var stmt = db._createStatement( {"query": "LET queries = TOKENS('Windows 11', 'text_de') FOR doc IN con_create SEARCH ANALYZER( doc.title IN queries , 'text_de') AND doc.status=='ok' LET score = TFIDF(doc) SORT score DESC LIMIT 100 RETURN {title: doc.title, score: score}" });

arangodb:8529@administrator_core> c = stmt.execute();
[object ArangoQueryCursor, count: 99, cached: false, hasMore: true]
[
  {
    "title" : "Windows benötigt 11 Minuten zum starten (Windows 7 64bit)",
    "score" : 13.23962116241455
  },
  {
    "title" : "Windows Media Player 11 unter Windows XP funktioniert nicht",
    "score" : 13.23962116241455
  },
...

Here is the correct result with the workaround on sorting (doc.blabla DESC):

arangodb:8529@administrator_core> var stmt = db._createStatement( {"query": "LET queries = TOKENS('Windows 11', 'text_de') FOR doc IN con_create SEARCH ANALYZER( doc.title IN queries , 'text_de') AND doc.status=='ok' LET score = TFIDF(doc) SORT doc.blabla DESC,score DESC LIMIT 100 RETURN {title: doc.title, score: score}" });

arangodb:8529@administrator_core> c = stmt.execute();
[object ArangoQueryCursor, count: 100, cached: false, hasMore: true]
[
  {
    "title" : "Windows benötigt 11 Minuten zum starten (Windows 7 64bit)",
    "score" : 13.239396095275879
  },
  {
    "title" : "KB929399 Microsoft Windows Media Format 11 SDK for Windows XP kann nicht installiert werden",
    "score" : 13.239396095275879
  },
  {
    "title" : "Windows Media Player 11 unter Windows XP funktioniert nicht",
    "score" : 13.239396095275879
  },
...

As mentioned above, the order of documents with the same score differs slightly from run to run. I hope this helps.
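
For context, a varying order among equal scores is expected whenever the sort has no tie-breaker; it does not by itself explain missing rows. The sketch below shows the general idea in plain JavaScript, using three titles from the output above: a secondary key makes the order deterministic. byScoreThenTitle is a hypothetical name; in AQL the analogous step would be a second SORT criterion on a unique attribute.

```javascript
// Sketch: sort by score (descending), then by title (ascending) as a tie-breaker.
// byScoreThenTitle is a hypothetical comparator, not part of ArangoDB.
function byScoreThenTitle(a, b) {
  if (a.score !== b.score) return b.score - a.score;
  if (a.title < b.title) return -1;
  if (a.title > b.title) return 1;
  return 0;
}

const tied = [
  { title: "Windows Media Player 11 unter Windows XP funktioniert nicht", score: 13.239396095275879 },
  { title: "Windows benötigt 11 Minuten zum starten (Windows 7 64bit)", score: 13.239396095275879 },
  { title: "KB929399 Microsoft Windows Media Format 11 SDK for Windows XP kann nicht installiert werden", score: 13.239396095275879 },
];

// The same order results regardless of the input order.
const ordered = tied.slice().sort(byScoreThenTitle);
const orderedAgain = tied.slice().reverse().sort(byScoreThenTitle);
console.log(ordered.map((d) => d.title));
```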

@maxkernbach
Contributor

Hi @admtech,

Thanks for your reply. We tried to reproduce your issue on a different dataset with an analogous query. However, independent of which sorting mechanism was used, the same result set was returned.

One possible cause of seeing a different result set could be that the indexing of your view "con_create" is broken. Could you try to create a new view using the same properties as view "con_create" and re-run the queries (replacing "con_create" with the newly created view)?
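
One way to do this from arangosh is sketched below, assuming the db._view(), view.properties() and db._createView() calls of the arangosh API; cloneView and the target name "con_create_test" are hypothetical. The database object is passed in as a parameter so the function can also be exercised with a stub.

```javascript
// Sketch, assuming arangosh's db._view(), view.properties() and db._createView().
// cloneView and the target view name are hypothetical.
function cloneView(database, sourceName, targetName) {
  const source = database._view(sourceName);
  if (!source) {
    throw new Error("view not found: " + sourceName);
  }
  // properties() is expected to include the links and their analyzers.
  const props = source.properties();
  return database._createView(targetName, "arangosearch", props);
}

// In arangosh (assumption): cloneView(db, "con_create", "con_create_test");
```

The new view indexes the linked collections from scratch, so on a dataset of this size it may take a while before queries against it return complete results.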

In case the issue still occurs with a new view, would you be able to share a data set which reproduces the problem with the queries you stated? You can send a message to hackers@arangodb.com (this mailing list is not public) and attach the dump to that email. This way we can try to reproduce the problem and find the root cause. Please reference the number of this issue in your email.

@admtech
Author
admtech commented Aug 5, 2021

I created a new view and re-ran the queries, but unfortunately the behavior remained the same. I will switch to ArangoDB 3.8 tonight and test it all again.
