10000 Using mixed sort directions ignores indexes · Issue #9451 · arangodb/arangodb · GitHub
[go: up one dir, main page]

Skip to content

Using mixed sort directions ignores indexes #9451

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
erayhanoglu opened this issue Jul 10, 2019 · 13 comments
Open

Using mixed sort directions ignores indexes #9451

erayhanoglu opened this issue Jul 10, 2019 · 13 comments
Labels
1 Feature 3 AQL Query language related 3 Index

Comments

@erayhanoglu
Copy link
erayhanoglu commented Jul 10, 2019

My Environment

  • ArangoDB Version: 3.4.6-1
  • Storage Engine: RocksDB
  • Deployment Mode: Single Server
  • Deployment Strategy: Manual Start
  • Infrastructure: own
  • Operating System: MacOS 10.13.4
  • Total RAM in your machine: <16Gb
  • Disks in use: SSD
  • Used Package: other

Component, Query & Data

Affected feature:
Optimizer

AQL query (if applicable):
FOR c IN Patients
SORT c.name_given, c.name_family desc
RETURN c

AQL explain (if applicable):

Query String:
 FOR c IN Patients
 SORT c.name_given, c.name_family desc
 RETURN c

Execution plan:
 Id   NodeType                  Calls   Items   Runtime [s]   Comment
  1   SingletonNode                 1       1       0.00000   * ROOT
  2   EnumerateCollectionNode       1      98       0.00008     - FOR c IN Patients   /* full collection scan */
  3   CalculationNode               1      98       0.00001       - LET #1 = c.`name_given`   /* attribute expression */   /* collections used: c : Patients */
  4   CalculationNode               1      98       0.00001       - LET #3 = c.`name_family`   /* attribute expression */   /* collections used: c : Patients */
  5   SortNode                      1      98       0.00010       - SORT #1 ASC, #3 DESC
  6   ReturnNode                    1      98       0.00000       - RETURN c

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0          98            0          0         0.00058

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00004
 optimizing ast             0.00000
 loading collections        0.00001
 instantiating plan         0.00002
 optimizing plan            0.00018
 executing                  0.00024
 finalizing                 0.00007
Query String:
 FOR c IN Patients
 SORT c.name_given desc, c.name_family desc
 RETURN c

Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  7   IndexNode             1      98       0.00033     - FOR c IN Patients   /* reverse skiplist index scan */
  3   CalculationNode       1      98       0.00001       - LET #1 = c.`name_given`   /* attribute expression */   /* collections used: c : Patients */
  4   CalculationNode       1      98       0.00001       - LET #3 = c.`name_family`   /* attribute expression */   /* collections used: c : Patients */
  5   SortNode              1      98       0.00006       - SORT #1 DESC, #3 DESC
  6   ReturnNode            1      98       0.00000       - RETURN c

Indexes used:
 By   Type       Collection   Unique   Sparse   Selectivity   Fields             Ranges
  7   skiplist   Patients     false    false       100.00 %   [ `name_given` ]   *

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   use-indexes

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0           98          0         0.00073

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00004
 optimizing ast             0.00000
 loading collections        0.00001
 instantiating plan         0.00002
 optimizing plan            0.00012
 executing                  0.00045
 finalizing                 0.00007

Steps to reproduce

  1. create skiplist indexes for 2 different properties. eg. name_given, name_family
    execute a query which sorts this two properties with same direction.
    etc:
 FOR c IN Patients
 SORT c.name_given desc, c.name_family desc
 RETURN c

You will see, index for first sort property will be used.

  1. execute another query which sorts two properties with different direction
    etc:
 FOR c IN Patients
 SORT c.name_given, c.name_family desc
 RETURN c

You will see, no index used.

Problem:
Optimizer ignores indexes if different sort directions used.

Expected result:

@maxkernbach maxkernbach added 1 Bug 3 AQL Query language related 3 Index labels Jul 19, 2019
@maxkernbach
Copy link
Contributor

Hi @erayhanoglu,

Thanks for reporting, I was able to reproduce this issue.
No index is used when using SORT in mixed directions (ASC,DESC) with skiplist indexes on 2 properties.

@maxkernbach maxkernbach removed the 1 Bug label Jul 19, 2019
@maxkernbach
Copy link
Contributor

However this behavior is expected and documented at:

https://www.arangodb.com/docs/stable/indexing-index-basics.html#skiplist-index

In order to use a skiplist index for sorting, the index attributes must be specified in the SORT clause of the query in the same order as they appear in the index definition. Skiplist indexes are always created in ascending order, but they can be used to access the indexed elements in both ascending or descending order. However, for a combined index (an index on multiple attributes) this requires that the sort orders in a single query as specified in the SORT clause must be either all ascending (optionally omitted as ascending is the default) or all descending.

Therefore using mixed SORT directions requires the creation of mixed-order indexes, which would be a new feature.

@erayhanoglu
Copy link
Author
erayhanoglu commented Jul 19, 2019

Thank you for your reply. Documentation says for a combined index (**an index on multiple attributes**). This issue is not about a single index on multiple attributes. There are 2 indexes on different attributes and optimizer uses only 1 index and ignores the second one. So this issue look like a bug.

@maxkernbach
Copy link
Contributor

The documentation states that it is not possible to run your query on a single skiplist index that is created on two attributes value1 and value2. That is the current state since mixed-order indexes are not supported yet.

Sorting by two skiplist indexes, which are created on one attribute each, (exactly your use case) does not work. Since index 1 is not related to index 2, it is not possible to apply the SORT with a second index to a set, which was created using the first index. This applies to databases in general.

@MrCreosote
Copy link
MrCreosote commented Jul 19, 2019

@maxkernbach Out of curiosity, if you did have an asc/asc compound index (let's call the attributes a and b in index order) and wanted to sort a asc and b desc, would it be possible to add a feature where ADB finds the first match for a and then does a binary search to find the lowest value of b, and traverses the index in the opposite direction to do the sort on b? Then move on to the next value of a and so on. That would make the sort O(n log n) vs O(n) but wouldn't increase memory use, which to me is the main reason to always sort on indexes rather than in memory.

E.g. if the index looks like

a           b
itema1  itemb1   
itema1  itemb2
itema1  itemb3
itema2  itemb1
itema3  itemb2
itema3  itemb4
itema3  itemb7
itema3  itemb8

Then the algorithm, assuming there were no filters on a, would find itema1, then binary search to the 3rd item in the index, and run backwards to sort b desc. Then it'd move to itema2 and so on.

Note I'm not necessarily making a request for this feature, I'm just wondering if it's feasible. It would mean you could do mixed direction sorts without having to have multiple indexes (one for each set of directions you wish to sort on).

@mpoeter
Copy link
Contributor
mpoeter commented Jul 21, 2019

@MrCreosote I don't think that this approach is feasible. But I do think that we can utilize the index on a. The order of b is only relevant for the subsets of items with equal values of a. So if we enumerate the items using the index on a, we have a pre-sorted list where each item is either on its final position, or at least very close.

The index's selectivity estimation is already a good indicator of the average subset size. E.g., with a selectivity estimation of 20%, the average subset size is 5. This means, that for each item the max average distance from its current position to its final position is only 4. This is the perfect use case for insertion sort.

In case we have a combined index for a and b, the situation is even better, because we only have to reverse the items in all subsets of size >1. This can be done either while enumerating the index, or in a separate pass - either way, this can be done in linear time.

@maxkernbach what do you think? Mixed-ordered indexes would certainly be nice, but I would assume that they would definitely require more effort than a simple rule for the query optimizer.

@MrCreosote
Copy link

@mpoeter Thanks for the response!

I don't think that this approach is feasible.

Out of curiosity, why is that? I don't really understand the index structure but I'm def. interested.

So if we enumerate the items using the index on a, we have a pre-sorted list where each item is either on its final position, or at least very close.

Aren't there edge cases that could blow up though? For instance, I have a prod Mongo DB with a compound index on a & b. In most cases for each a there are < 100 items in b, but there are a few cases with bs in the 10k range and one with 4M. A sort or reverse would really blow up in that case, whereas finding the 'end' of the index on a and then traversing the index in the reverse direction would be ok - but would incur the expense of finding the 'end', which I presume would be O(log n).

@mpoeter
Copy link
Contributor
mpoeter commented Jul 22, 2019

@MrCreosote it is mainly a gut feeling. To be honest, I don't see how the second index would be of much help. Yes, it does give you the order of items for b, but looking up a single item in the index doesn't give you much. You would have to scan the b index for items with equal values of a, but where do you stop? This would result in a linear scan of large parts of b index for each distinct value of a. And in each scan, you would have to take into account potential filters.

In the worst case, insertion sort is O(n²), so you are right, there are certainly edge cases where the insertion sort could blow up. But I think these cases could easily be recognized and handled accordingly.

But this cannot happen for the combined index with reverse. A reverse operation can be done in linear time with constant memory overhead.

@MrCreosote
Copy link
MrCreosote commented Jul 22, 2019

@mpoeter

A reverse operation can be done in linear time with constant memory overhead.

Sure, but that could be an extremely large memory overhead if there are, say, 100s of millions of objects for one of the as.

You would have to scan the b index for items with equal values of a, but where do you stop?

Again, I don't understand the index structure, but naively I'd expect you could perform a binary search to find the last b for any given a and then traverse the index in the opposite direction from that point, taking any filters into account.

@mpoeter
Copy link
Contributor
mpoeter commented Jul 22, 2019

You would have to scan the b index for items with equal values of a, but where do you stop?

Again, I don't understand the index structure, but naively I'd expect you could perform a binary search to find the last b for any given a and then traverse the index in the opposite direction from that point, taking any filters into account.

Yes, but traversing the index on b is O(n). Looking up an item in index b takes O(log n), and the initial traversal of the index on a is also O(n). So IMHO you end up with an algorithm that is O(n² log n).

A reverse operation can be done in linear time with constant memory overhead.

Sure, but that could be an extremely large memory overhead if there are, say, 100s of millions of objects for one of the as.

No, a reverse operation can be done with constant memory overhead (you only need a single temporary object) and in linear time, i.e., O(n). Therefore, the complete query (enumerating the items using the compound index + reversing the query) has runtime complexity O(n).

@MrCreosote
Copy link

So IMHO you end up with an algorithm that is O(n² log n).

That would mean that for each entry in the index, every other entry in the index is visited, which isn't the case in the algorithm I proposed. I've realized that it can be simplified to a linear time case once the first item in the returned range is found. Assuming a compound index a, b:

  1. Traverse the index from the first element (a1, b_a1_1) that matches the query until reaching the final element that makes the first entry in the index (a1, b_a1_n).
  2. Traverse in the opposite direction and return results to the user in batches.
  3. Advance to the next element (a2, b_a2_1)
    a) This can be O(1) if the location of the element is stored at the end of 1)
  4. Repeat the traverse, reverse and return
  5. Continue with further elements.

In this case other than finding a1 the time complexity is 2n = O(n).
You could potentially speed things up by using a binary search rather than a linear traversal to find the ai, b_ai_n elements. This'd probably be faster if there are many bs for each a.

No, a reverse operation can be done with constant memory overhead (you only need a single temporary object)

Isn't this true only if you have a mutable array in which the data is stored so you can perform the reversal in place? In the case we're talking about you start with a position in an index, so to perform the reversal the data in the index would have to be loaded into another mutable data structure which could require substantial memory depending on the result set. My assumption (and maybe I'm wrong) is that this doesn't happen for a standard query and the results are loaded from the index in batches and returned to the user, not requiring a memory allocation for the full result set.

@mpoeter
Copy link
Contributor
mpoeter commented Jul 27, 2019

@MrCreosote I should have read your first posting more thoroughly. The initial question was about two separate indexes for the two attributes, and somehow your sketch with with the two columns led me to believe that you were also talking about two separate indexes.

  1. Traverse in the opposite direction and return results to the user in batches.

This basically the same as my reverse approach. I see two different ways to do this - either in a single pass while traversing the index (which is basically the same algorithm you describe), or in a separate pass. Obviously, doing this in a single pass would be more efficient. However, the version with a separate pass might be easier to introduce with the existing code.

My assumption (and maybe I'm wrong) is that this doesn't happen for a standard query and the results are loaded from the index in batches and returned to the user, not requiring a memory allocation for the full result set.

Unless you are using streaming cursors, Arango builds the complete query result in-memory before sending the response.

@MrCreosote
Copy link

@mpoeter Sounds like we were talking past each other for a bit there :P

Unless you are using streaming cursors, Arango builds the complete query result in-memory before sending the response.

So if I have a 100M 10kB / node collection and make a request to traverse and return all the nodes in the collection the server's OOMing? Sorry if this is a newb question, I just started using Arango coming from Mongo where AFAIK the result set is not loaded into memory (where the drawback is that the cursor can return the same document more than once if it is changed while iteration via the cursor is in progress).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 Feature 3 AQL Query language related 3 Index
Projects
None yet
Development

No branches or pull requests

4 participants
0