-
Notifications
You must be signed in to change notification settings - Fork 852
Optimize statistic large data with COLLECT INTO too slow #20337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@son2408 : according to the query profile you supplied in this ticket, the main driver of the query's execution time is the EnumerateViewNode and not the COLLECT. The EnumerateViewNode takes 12 out of the 18 seconds total runtime. I tried running the query locally with the provided dataset, and I get a similar execution plan, with higher execution times (potentially due to my inferior hardware):
It seems that the problem is that the query produces 5M matches in the EnumerateViewNode, but in order to build the resulting documents for them, it will need to do 5M additional document lookups. I just tried this, and created another view (
Using that adjusted view in the query, while keeping everything else the same, dramatically speeds up the query:
I also tried running the query (with the adjusted view) on our current devel, which will become ArangoDB 3.12 at some point in Q1 next year.
|
@jsteemann yes, with your view definition the query speeds up dramatically. If my query have subqueries count then runtime query too slowly. It takes 16.40239s for 5M documents. Let help me optimize it. Query
Query profile
I have multi subqueries. I see it spends alot of time for EnumerateListNode and CalculationNode for each subquery. Do you have the best way to statistic data with multi count by query in a FOR .. IN ? Because I see subqueries similar FOR i IN group |
One idea is to loop over the results only once instead of creating an arbitrary number of subqueries:
I can't estimate how this will impact performance (positively or negatively), but at least it is something you could try if you are worried about the amount of subqueries. |
@jsteemann thanks you, I tried query as your suggestion, but time query not as expected, it takes 26.65674 seconds. It's slower old query :( Profile
|
Please note that the extraction from the view is again the slowest part of your query. |
@jsteemann yes, I adjusted the view defintion.
|
@jsteemann I see if used attributes the more the slow, pls check help me. Thanks you. |
@son2408 it's expected, using additional columns slow down query |
@MBkkt I tried to unify in single column few fields, but it's still too slow. Total runtime is 26.95196 seconds.
View configuration
|
@jsteemann I included used attributes as "stored values", but it can't improve. I see the more atributes, the slower speed. Can you help me try it with data is shared in above, thanks you.
|
Hard to say without any profile what the limiting factor is now. |
@jsteemann yes, i tried to remove some attributes, runtime query speed up. With initial COLLECT is 6 attributes then runtime query is 32s, now attributes numbers is 2 then runtime query is 20s
|
@jsteemann i think "stored values" is no efficient with COLLECT INTO |
"stored values" and COLLECT INTO are unrelated. |
Can you try fusing the two COLLECT statements into a single one, so you will need only one iteration over the dataset? |
@jsteemann two COLLECT on two other collections. COLLECT 1 for initial search statement, COLLECT 2 for group of COLLECT 1. How to fusing them ? |
Have you tried something along the lines of
|
@jsteemann i tried it, runtime query is 4 seconds. thanks you.
|
view used without stored values
view used stored values |
Ok, I think there isn't anything left that I can do now. |
Ok. Let me to try it. Thanks @jsteemann |
hi @jsteemann I upgraded to version 3.12.0-NIGHTLY.20231219. I see runtime query with COLLECT..INTO improved more than old version. But It is not as expect. For detail usecase.
Query
Profile
if this query has two times COLLECT..INTO. It takes 6s.
Profile
|
@jsteemann Can you let me the best way to speedup query with multi COLLECT .. INTO ? . I see the more COLLECT .. INTO the slower runtime query. Thanks you |
Hi @son2408 , Unfortunately, I am not aware of any good way to speed up COLLECT ... INTO further. |
Uh oh!
There was an error while loading. Please reload this page.
My Environment
Component, Query & Data
Affected feature:
AQL query using web interface
AQL query (if applicable):
AQL explain and/or profile (if applicable):
Dataset:
link my data with arangdump
https://drive.google.com/file/d/1PplQ1CbvjAf9kw9YswlC8M5YSyl_LejW/view?usp=drive_link
Problem:
I need to statistic data with about 5 millions documents. Explain this use case, first I want group data by tinhId field and INTO group include daTra field then count by query this daTra field. I see process INTO too slowly. With this use case SOLR APACHE runtime query 7s Let me to optimize it or the best way to resolve this. Thanks all
Expected result:
The text was updated successfully, but these errors were encountered: