very simple aql,slow query,took 47s · Issue #10898 · arangodb/arangodb · GitHub

very simple aql,slow query,took 47s #10898


Closed
caozku opened this issue Jan 16, 2020 · 17 comments

@caozku
caozku commented Jan 16, 2020

ArangoDB version 3.5.3. I have more than 290 thousand documents; when I execute a very simple AQL query, it takes 47s or even more. I have created all indexes. Why is it so slow?
This is my AQL:

[8bcee] slow query: 'for t in ocr_task filter t.agentId == "ocrAgent3" and t.processing == 1 and t.hasFail != 1 return t', took: 47.356403 s

@jsteemann
Contributor

Please provide the query execution plan for this query. You can retrieve it from the ArangoShell (arangosh) or the ArangoDB web interface.
This query plan will provide more details about whether and which indexes are used for the query.

Without such information it is impossible to give you a sound answer. The usual suspects for slow query performance are:

  • no indexes used by the query
  • sub-optimal indexes used by the query
  • huge document sizes
  • slow storage layer
  • shortage of RAM
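
For reference, here is a minimal sketch of retrieving the plan and profile from arangosh, using the query string from above (db._profileQuery additionally shows per-node runtimes and may not be available in very old versions):

db._explain('for t in ocr_task filter t.agentId == "ocrAgent3" and t.processing == 1 and t.hasFail != 1 return t');
db._profileQuery('for t in ocr_task filter t.agentId == "ocrAgent3" and t.processing == 1 and t.hasFail != 1 return t');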

@caozku
Author
caozku commented Jan 17, 2020
 for t in ocr_task filter t.agentId == "ocrAgent2" and t.processing == 1 and t.hasFail != 1 return t

Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  6   IndexNode            58   57227      43.85388     - FOR t IN ocr_task   /* hash index scan */
  7   CalculationNode      58   57227       0.07531       - LET #1 = ((t.`hasFail` != 1) && (t.`processing` == 1))   /* simple expression */   /* collections used: t : ocr_task */
  4   FilterNode            1       2       0.04401       - FILTER #1
  5   ReturnNode            1       2       0.00001       - RETURN t

Indexes used:
 By   Name            Type   Collection   Unique   Sparse   Selectivity   Fields          Ranges
  6   idx_657010169   hash   ocr_task     false    false         0.00 %   [ `agentId` ]   (t.`agentId` == "ocrAgent2")

Optimization rules applied:
 Id   RuleName
  1   use-indexes
  2   remove-filter-covered-by-index

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0        57227      57225        44.00610

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00008
 optimizing ast             0.00001
 loading collections        0.00001
 instantiating plan         0.00003
 optimizing plan            0.00020
 executing                 43.97325
 finalizing                 0.03250

@jsteemann
Contributor

Indeed an index is used, but it has a selectivity of 0.00% (only two digits after the decimal separator are shown in the explain output). The optimizer estimates that there are 57,227 documents to return, out of 290,000. This is just an estimate, but it shows that the index in general does not seem to be very helpful. According to these figures, there are only about 6 different values present in the data for agentId.
As the filter condition is on both agentId and processing, it may be helpful to add a combined index on both attributes. This will help if there are only a few documents with a processing value of 1. If you are using the ArangoDB web interface, such an index can be created by entering agentId,processing for the index field(s).
In addition, as this is 3.5.3, the query may benefit from upgrading to 3.6.0, which has a performance improvement especially for post-filtering data; it may help in this particular case as well.

@OmarAyo
Contributor
OmarAyo commented Jan 22, 2020

Hi @caozku,

Did you have the chance to read jsteemann's comment about using combined indexes?

Please let us know
Best,

@caozku
Author
caozku commented Feb 3, 2020

I have already added indexes on agentId and processing; it doesn't work! But how do I use combined indexes?

@jsteemann
Contributor
jsteemann commented Feb 3, 2020

With arangosh, you can use the command:

db.ocr_task.ensureIndex({ type: "hash", fields: ["agentId", "processing"] });

When using the web interface, you can enter agentId,processing into the index's fields attribute when creating the index.
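
To verify the index afterwards, you can list the collection's indexes from arangosh:

db.ocr_task.getIndexes();

The combined hash index on ["agentId", "processing"] should show up in the output next to the primary index.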

@caozku
Author
caozku commented Feb 5, 2020

I added these two at first; it made no difference at all.

@jsteemann
Contributor

Ok, so can you confirm you have that combined index in place?
The execution plan above shows there is a single-attribute index used for the query.
If there are only a few documents with the processing attribute set to a value of 1, the combined index may still help.
Can you check if that combined index (one index covering both attributes agentId and processing) is present? It would be interesting to see the query profile for the query when this index is in place, plus some estimations about the data distribution (how many documents are there with that combination of agentId and processing values as queried).
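For the data distribution, here is a quick sketch of an AQL query that counts documents per agentId/processing combination (using the attribute names from above):

for t in ocr_task
  collect agentId = t.agentId, processing = t.processing with count into n
  return { agentId: agentId, processing: processing, count: n }

The count for the queried combination gives a good idea of how selective the combined index would be.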
Thanks!

@caozku
Author
caozku commented Feb 6, 2020

Thank you very much! This problem is solved. I have another question: if I have two collections to query, and the AQL is “for t in ocr_task filter t.hasDone == 1 for b in ocr_batch filter b._key == t.batchId and b.userKey == "657446338" collect with count into length return length”, how can I make it fast?

@caozku
Author
caozku commented Feb 6, 2020

I have another database; there is a collection with 2.9 million documents. When I execute the AQL "for c in ChinaRoots_gImages filter c._key == "1054747628" update c with {hasDone:1,processing:0,postTime:DATE_NOW(),status:1} in ChinaRoots_gImages OPTIONS { exclusive: true }", it takes more than 110s.

@caozku
Author
caozku commented Feb 6, 2020

The query execution plan for this query is:

Query String (175 chars, cacheable: false):
 for c in ChinaRoots_gImages filter c._key == "1054747628" update c with 
 {hasDone:1,processing:0,postTime:DATE_NOW(),status:1} in ChinaRoots_gImages OPTIONS { exclusive: 
 true }

Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  5   CalculationNode       1       1       0.00002     - LET #5 = { "hasDone" : 1, "processing" : 0, "postTime" : 1580954771463, "status" : 1 }   /* json expression */   /* const assignment */
  7   IndexNode             1       1       0.00005     - FOR c IN ChinaRoots_gImages   /* primary index scan, index only, projections: `_key` */    
  6   UpdateNode            1       1       0.00019       - UPDATE c WITH #5 IN ChinaRoots_gImages 

Indexes used:
 By   Name      Type      Collection           Unique   Sparse   Selectivity   Fields       Ranges
  7   primary   primary   ChinaRoots_gImages   true     false       100.00 %   [ `_key` ]   (c.`_key` == "1054747628")
  6   primary   primary   ChinaRoots_gImages   true     false       100.00 %   [ `_key` ]   c

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   remove-data-modification-out-variables
  3   use-indexes
  4   remove-filter-covered-by-index
  5   remove-unnecessary-calculations-2
  6   patch-update-statements
  7   reduce-extraction-to-projection

Write query options:
 Option                   Value
 ignoreErrors             false
 waitForSync              false
 nullMeansRemove          false
 mergeObjects             true
 ignoreDocumentNotFound   false
 readCompleteInput        false
 useIsRestore             false
 consultAqlWriteFilter    false
 exclusive                true
 overwrite                false
 ignoreRevs               true

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           1            0           0            1          0        85.56753

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00007
 optimizing ast             0.00002
 loading collections       85.56599
 instantiating plan         0.00005
 optimizing plan            0.00030
 executing                  0.00090
 finalizing                 0.00019

@jsteemann
Contributor

Re #10898 (comment):
First of all, the query can be simplified to

UPDATE { _key: "1054747628" } 
WITH {hasDone:1,processing:0,postTime:DATE_NOW(),status:1} 
IN ChinaRoots_gImages 
OPTIONS { exclusive: true }

(i.e. removing the FOR loop entirely), because it is a single-document primary-key update and the update does not refer to any existing values in the to-be-updated document.
However, that will not help much, as a single-document lookup and update are expected to be fast anyway.

The main problem with this query is thus likely the OPTIONS { exclusive: true }. That option makes the query wait for any other write operation on the same collection to finish before it starts. I suggest going without it in general and only using it when you really need uninterrupted modifications of multiple documents in the same collection. If you remove the option from this query, it should be faster, provided there aren't any other queries on the same collection that also use OPTIONS { exclusive: true } and temporarily lock out all other write operations.
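
For illustration, this is the same update without the option, following the simplification above:

UPDATE { _key: "1054747628" }
WITH { hasDone: 1, processing: 0, postTime: DATE_NOW(), status: 1 }
IN ChinaRoots_gImages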

@jsteemann
Contributor

Re #10898 (comment):

for t in ocr_task filter t.hasDone == 1 
for b in ocr_batch filter b._key == t.batchId and b.userKey == "657446338" 
collect with count into length 
return length

I can't tell how to make this fast because I have no idea what indexes exist here or how many documents the underlying collections contain. It would be good to post the query's profile or execution plan in this GitHub issue to get at least some idea about the number of documents plus index usage.
The main thing here is probably to have an index on the hasDone attribute in ocr_task (or even a combined index on hasDone and batchId). If it exists, I expect it to be used for the first FOR loop, and I would expect the primary index on _key to be used for the second. But right now I can't tell if my assumptions are correct. Can you post the profile output of that query here?
Thanks!

@caozku
Author
caozku commented Feb 11, 2020
Query String (389 chars, cacheable: false):
 //for c in ocr_batch
 //        filter c.hasDone != 1 and c.hasFail != 1
 //        return c
 //        let remanCount = c.taskCount-c.finishCount
 //        collect AGGREGATE remainNum = SUM(remanCount)
 //        return remainNum
 for t in ocr_task filter t.hasDone == 1 
 for b in ocr_batch filter b._key == t.batchId and b.userKey == "657446338" 
 collect with count into length 
 return length

Execution plan:
 Id   NodeType        Calls    Items   Runtime [s]   Comment
  1   SingletonNode       1        1       0.00000   * ROOT
 11   IndexNode         303   302683     136.70162     - FOR t IN ocr_task   /* hash index scan, projections: `batchId` */    
 10   IndexNode           1       16       2.25124       - FOR b IN ocr_batch   /* primary index scan, projections: `userKey` */    FILTER (b.`userKey` == "657446338")   /* early pruning */   
  8   CollectNode         1        1       0.00203         - COLLECT  WITH COUNT INTO length   /* count */
  9   ReturnNode          1        1       0.00001         - RETURN length

Indexes used:
 By   Name            Type      Collection   Unique   Sparse   Selectivity   Fields          Ranges
 11   idx_648119924   hash      ocr_task     false    false         0.00 %   [ `hasDone` ]   (t.`hasDone` == 1)
 10   primary         primary   ocr_batch    true     false       100.00 %   [ `_key` ]      (b.`_key` == "672864747")

Optimization rules applied:
 Id   RuleName
  1   use-indexes
  2   remove-filter-covered-by-index
  3   remove-unnecessary-calculations-2
  4   reduce-extraction-to-projection
  5   move-filters-into-enumerate

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0       302699          0       139.20209

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00011
 optimizing ast             0.00002
 loading collections        0.00001
 instantiating plan         0.08230
 optimizing plan            0.00034
 executing                138.95795
 finalizing                 0.16134

@jsteemann
Contributor

The bottleneck in this query seems to be the IndexNode:

 11   IndexNode         303   302683     136.70162     - FOR t IN ocr_task   /* hash index scan, projections: `batchId` */

It will find all matching documents in ocr_task using the index on hasDone. There seem to be 302,683 such documents. For each of those, it will need to look up the documents and extract the batchId attribute from them. If these documents are huge, these extra document lookups and extraction of batchId may take the longest here.
In order to avoid the extra document data lookup, it is possible to include the batchId in the index that contains hasDone, so that both attributes are covered.
What happens if you replace the existing index on batchId with a combined index on hasDone and batchId?
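For reference, such a combined index could be created from arangosh analogously to the earlier one (a sketch using the attribute names from above):

db.ocr_task.ensureIndex({ type: "hash", fields: ["hasDone", "batchId"] });

With batchId covered by the index, the projection can be served from the index itself and the per-document lookups are avoided.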

@maxkernbach
Contributor

Hi @caozku,

Did you have a chance to read jsteemann's latest reply?

Could you please try to replace the existing index on batchId with a combined index on hasDone and batchId and share the profile output of the modified query?

@maxkernbach
Contributor

Hi @caozku,

Since we have not heard back from you recently, I am closing this issue for now.
Feel free to reopen this ticket if needed.
