Optimizing statistics on large data: COLLECT INTO is too slow #20337

son2408 · 2023-12-19T04:52:37Z

My Environment

ArangoDB Version: 3.11.6
Deployment Mode: Single Server
Deployment Strategy: Manual Start
Infrastructure: own
Operating System: Windows 10
Total RAM in your machine: 64 GB
Disks in use: SSD

Component, Query & Data

Affected feature:
AQL query using web interface

AQL query (if applicable):

FOR i IN search_hosotiepnhan_2
    SEARCH i.tinhId > 0
    LIMIT 5000000
    COLLECT tinhId = i.tinhId
    AGGREGATE count = length(1)
    INTO group = {tinhId: i.tinhId, daTra: i.daTra}
    LET countDaTra = (FOR i IN group
                        FILTER i.daTra == 1
                        COLLECT WITH COUNT INTO countDaTra
                        RETURN countDaTra)
    RETURN {val: tinhId, count : count, countDaTra: countDaTra}

AQL explain and/or profile (if applicable):

Query String (443 chars, cacheable: false):
 FOR i IN search_hosotiepnhan_2
     SEARCH i.tinhId > 0
     LIMIT 5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {tinhId: i.tinhId, daTra: i.daTra}
     LET countDaTra = (FOR i IN group
                         FILTER i.daTra == 1
                         COLLECT WITH COUNT INTO countDaTra
                         RETURN countDaTra)
     RETURN {val: tinhId, count : count, countDaTra: countDaTra}

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000   5000000          0      12.76392     - FOR i IN search_hosotiepnhan_2 SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.07559       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       0.61998       - LET #9 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       2.47113       - LET #13 = { "tinhId" : i.`tinhId`, "daTra" : i.`daTra` }   /* simple expression */
  7   CollectNode             1        36          0       0.64942       - COLLECT tinhId = #9 AGGREGATE count = LENGTH() INTO group = #13   /* hash */
 17   SortNode                1        36          0       0.01707       - SORT tinhId ASC   /* sorting strategy: standard */
 18   SubqueryStartNode      36        72          0       0.02025       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5019   5000036          0       0.72442         - FOR i IN group   /* list iteration */
 10   CalculationNode      5019   5000036          0       0.76680           - LET #15 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             36   4218954     781082       0.09861           - FILTER #15
 12   CollectNode             1        72          0       0.00008           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 19   SubqueryEndNode         1        36          0       0.00002           - RETURN  countDaTra ) /* subquery end */
 15   CalculationNode         1        36          0       0.00002       - LET #17 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra }   /* simple expression */
 16   ReturnNode              1        36          0       0.00001       - RETURN #17

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   handle-arangosearch-views
  4   remove-unnecessary-calculations-2
  5   move-calculations-down
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0     781082      330334208        18.20805

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00008
 optimizing ast                 0.00001
 loading collections            0.00001
 instantiating plan             0.00006
 optimizing plan                0.00039
 instantiating executors        0.00012
 executing                     18.20740
 finalizing                     0.00005

Dataset:
Link to my data (created with arangodump):
https://drive.google.com/file/d/1PplQ1CbvjAf9kw9YswlC8M5YSyl_LejW/view?usp=drive_link

Problem:
I need to compute statistics over about 5 million documents. The use case: I first group the data by the tinhId field, collect the daTra field INTO group, and then count by querying daTra within each group. The INTO step seems to be too slow. For the same use case, Apache Solr answers the query in about 7 s. Please help me optimize the query, or suggest a better way to solve this. Thanks, all.

Expected result:

[
  {
    "tinhId": 1,
    "countTinhId": 3812,
    "countDaTra": 2128
  },
  {
    "tinhId": 4,
    "countTinhId": 2908,
    "countDaTra": 2027
  },
  {
    "tinhId": 10,
    "countTinhId": 69042,
    "countDaTra": 54427
  },
  {
    "tinhId": 14,
    "countTinhId": 36630,
    "countDaTra": 34859
  },
  {
    "tinhId": 15,
    "countTinhId": 20494,
    "countDaTra": 16379
  },
  {
    "tinhId": 19,
    "countTinhId": 304978,
    "countDaTra": 296676
  },
  {
    "tinhId": 20,
    "countTinhId": 3732,
    "countDaTra": 3135
  },
  {
    "tinhId": 22,
    "countTinhId": 49,
    "countDaTra": 10
  },
  {
    "tinhId": 24,
    "countTinhId": 154159,
    "countDaTra": 108525
  },
  {
    "tinhId": 26,
    "countTinhId": 148450,
    "countDaTra": 86090
  },
  {
    "tinhId": 27,
    "countTinhId": 176,
    "countDaTra": 86
  },
  {
    "tinhId": 30,
    "countTinhId": 356,
    "countDaTra": 336
  },
  {
    "tinhId": 31,
    "countTinhId": 160601,
    "countDaTra": 154888
  },
  {
    "tinhId": 34,
    "countTinhId": 93888,
    "countDaTra": 81656
  },
  {
    "tinhId": 35,
    "countTinhId": 725,
    "countDaTra": 725
  }
]

jsteemann · 2023-12-19T11:05:50Z

@son2408 : according to the query profile you supplied in this ticket, the main driver of the query's execution time is the EnumerateViewNode and not the COLLECT. The EnumerateViewNode takes 12 out of the 18 seconds total runtime.
So this should be the key part to optimize.
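
To make the proportion concrete, the per-node runtimes from the profile above can be summed and ranked. This is a small sketch in plain JavaScript (it runs in Node.js); the numbers are copied from the posted execution plan, while `rankByRuntime` is a helper name introduced here only for illustration:

```javascript
// Runtimes [s] copied from the execution plan posted in the issue
// (nodes that took less than 1 ms are omitted).
const nodes = [
  { id: 2, type: "EnumerateViewNode", runtime: 12.76392 },
  { id: 3, type: "LimitNode", runtime: 0.07559 },
  { id: 4, type: "CalculationNode", runtime: 0.61998 },
  { id: 6, type: "CalculationNode", runtime: 2.47113 },
  { id: 7, type: "CollectNode", runtime: 0.64942 },
  { id: 17, type: "SortNode", runtime: 0.01707 },
  { id: 18, type: "SubqueryStartNode", runtime: 0.02025 },
  { id: 9, type: "EnumerateListNode", runtime: 0.72442 },
  { id: 10, type: "CalculationNode", runtime: 0.76680 },
  { id: 11, type: "FilterNode", runtime: 0.09861 },
];

// Hypothetical helper: attach each node's share of the summed runtime
// and sort the nodes from slowest to fastest.
function rankByRuntime(list) {
  const total = list.reduce((sum, n) => sum + n.runtime, 0);
  return list
    .map((n) => ({ ...n, share: n.runtime / total }))
    .sort((a, b) => b.runtime - a.runtime);
}

const ranked = rankByRuntime(nodes);
for (const n of ranked.slice(0, 3)) {
  console.log(`${n.id} ${n.type}: ${n.runtime.toFixed(2)} s (${(n.share * 100).toFixed(1)} %)`);
}
```

With these numbers the EnumerateViewNode accounts for roughly 70 % of the summed runtime, and the CollectNode for only a few percent.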

I tried running the query locally with the provided dataset, and I get a similar execution plan, with higher execution times (potentially due to my inferior hardware):

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000   5000000          0      34.32340     - FOR i IN search_hosotiepnhan_2 SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.56718       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       2.18918       - LET #9 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       4.63421       - LET #13 = { "tinhId" : i.`tinhId`, "daTra" : i.`daTra` }   /* simple expression */
  7   CollectNode             1        37          0       1.83643       - COLLECT tinhId = #9 AGGREGATE count = LENGTH() INTO group = #13   /* hash */
 17   SortNode                1        37          0       0.04750       - SORT tinhId ASC   /* sorting strategy: standard */
 18   SubqueryStartNode      37        74          0       0.05996       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5019   5000037          0       1.03633         - FOR i IN group   /* list iteration */
 10   CalculationNode      5019   5000037          0       1.45216           - LET #15 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37   4228828     771209       0.39480           - FILTER #15
 12   CollectNode             1        74          0       0.00047           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 19   SubqueryEndNode         1        37          0       0.00015           - RETURN  countDaTra ) /* subquery end */
 15   CalculationNode         1        37          0       0.00004       - LET #17 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra }   /* simple expression */
 16   ReturnNode              1        37          0       0.00002       - RETURN #17

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   handle-arangosearch-views
  4   remove-unnecessary-calculations-2
  5   move-calculations-down
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0     771209      329908224        46.54266

It seems that the problem is that the query produces 5M matches in the EnumerateViewNode, but in order to build the resulting documents for them, it will need to do 5M additional document lookups.
This could be avoided by adding the required attributes (tinhId and daTra for this query) to the view definition as "stored values".

I just tried this, and created another view (search_hosotiepnhan_3) with that definition:

db._createView("search_hosotiepnhan_3", "arangosearch", {"storedValues":[["tinhId"],["daTra"]], links: {vertex_hosotiepnhan_2:{analyzers:["identity"], includeAllFields: true, trackListPositions: false}}});

Using that adjusted view in the query, while keeping everything else the same, dramatically speeds up the query:

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000   5000000          0       1.19117     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0) LET #21 = i.`daTra` LET #9 = i.`tinhId`   /* view query without materialization */
  3   LimitNode            5000   5000000          0       0.43068       - LIMIT 0, 5000000
  6   CalculationNode      5000   5000000          0       1.95566       - LET #13 = { "tinhId" : #9, "daTra" : #21 }   /* simple expression */
  7   CollectNode             1        37          0       1.23645       - COLLECT tinhId = #9 AGGREGATE count = LENGTH() INTO group = #13   /* hash */
 17   SortNode                1        37          0       0.04437       - SORT tinhId ASC   /* sorting strategy: standard */
 18   SubqueryStartNode      37        74          0       0.05579       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5018   5000037          0       0.97822         - FOR i IN group   /* list iteration */
 10   CalculationNode      5018   5000037          0       1.37251           - LET #15 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37   4213311     786726       0.37546           - FILTER #15
 12   CollectNode             1        74          0       0.00039           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 19   SubqueryEndNode         1        37          0       0.00014           - RETURN  countDaTra ) /* subquery end */
 15   CalculationNode         1        37          0       0.00003       - LET #17 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra }   /* simple expression */
 16   ReturnNode              1        37          0       0.00002       - RETURN #17

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   handle-arangosearch-views
  4   remove-unnecessary-calculations-2
  5   move-calculations-down
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0     786726      329973760         7.64167

I also tried running the query (with the adjusted view) on our current devel, which will become ArangoDB 3.12 at some point in Q1 next year.
The query performance there will be even better:

Execution plan:
 Id   NodeType            Calls    Par     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1      -         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000      -   5000000          0       1.04602     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0) LET #15 = i.`daTra` LET #9 = i.`tinhId`   /* view query without materialization */
  3   LimitNode            5000      -   5000000          0       0.07833       - LIMIT 0, 5000000
  6   CalculationNode      5000   4999   5000000          0       1.80908       - LET #11 = { "tinhId" : #9, "daTra" : #15 }   /* simple expression */
  7   CollectNode             1      0        37          0       1.16470       - COLLECT tinhId = #9 AGGREGATE count = LENGTH() INTO group = #11   /* hash */
 17   SortNode                1      0        37          0       0.02060       - SORT tinhId ASC   /* sorting strategy: standard */
 18   SubqueryStartNode      37      -        74          0       0.02330       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5018      -   5000037          0       0.77800         - FOR i IN group   /* list iteration */
 10   CalculationNode      5018      -   5000037          0       1.07768           - LET #12 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37      -   4213311     786726       0.31608           - FILTER #12
 12   CollectNode             1      -        74          0       0.00023           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 19   SubqueryEndNode         1      -        37          0       0.00012           - RETURN  countDaTra ) /* subquery end */
 15   CalculationNode         1      0        37          0       0.00003       - LET #13 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra }   /* simple expression */
 16   ReturnNode              1      -        37          0       0.00001       - RETURN #13

Indexes used:
 none

Optimization rules applied:
 Id   Rule Name                                 Id   Rule Name                                 Id   Rule Name                        
  1   move-calculations-up                       4   remove-unnecessary-calculations-2          7   splice-subqueries                
  2   move-calculations-up-2                     5   move-calculations-down           
  3   handle-arangosearch-views                  6   async-prefetch                   

Query Statistics:
 Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
           0               0                 0              0         5000000                  0 / 0        786726         330006528            4.10029

Query Profile:
 Query Stage                Duration [s]         Query Stage                Duration [s]         Query Stage                Duration [s]
 initializing                    0.00001         loading collections             0.00002         instantiating executors         0.00007
 parsing                         0.00009         instantiating plan              0.00004         executing                       4.09961
 optimizing ast                  0.00001         optimizing plan                 0.00042         finalizing                      0.00004

son2408 · 2023-12-20T04:23:19Z

@jsteemann yes, with your view definition the query speeds up dramatically. However, when the query contains several counting subqueries, it is still too slow: it takes 16.40239 s for 5M documents. Please help me optimize it.

Query

FOR i IN search_hosotiepnhan_3
    SEARCH i.tinhId > 0
    limit 0,5000000
    COLLECT tinhId = i.tinhId
    AGGREGATE count = length(1)
    INTO group = {daTra: i.daTra, ngayTra: i.ngayTra_ts, choBoSung:i.choBoSung, thoiDiemChoBoSung:i.thoiDiemChoBoSung_ts, ngayXuLy: i.ngayXuLy_ts, ngayTiepNhan:i.ngayTiepNhan_ts}
    LET countDaTra = (FOR i IN group
                        FILTER i.daTra == 1
                        COLLECT WITH COUNT INTO countDaTra
                        RETURN countDaTra)
    LET countNgayTra = (FOR i IN group
                        FILTER i.ngayTra > 0
                        COLLECT WITH COUNT INTO countNgayTra
                        RETURN countNgayTra)
                        
    LET countChoBoSung = (FOR i IN group
                        FILTER i.choBoSung == 1
                        COLLECT WITH COUNT INTO countChoBoSung
                        RETURN countChoBoSung)
    LET countThoiDiemChoBoSung = (FOR i IN group
                        FILTER i.thoiDiemChoBoSung > 0
                        COLLECT WITH COUNT INTO countThoiDiemChoBoSung
                        RETURN countThoiDiemChoBoSung)
    LET countNgayXuLy = (FOR i IN group
                        FILTER i.ngayXuLy > 0
                        COLLECT WITH COUNT INTO countNgayXuLy
                        RETURN countNgayXuLy)
                        
    LET countNgayTiepNhan = (FOR i IN group
                        FILTER i.ngayTiepNhan > 0
                        COLLECT WITH COUNT INTO countNgayTiepNhan
                        RETURN countNgayTiepNhan)
    RETURN {val: tinhId, count : count, countDaTra: countDaTra, countNgayTra: countNgayTra, countChoBoSung: countChoBoSung,countThoiDiemChoBoSung: countThoiDiemChoBoSung,countNgayXuLy:countNgayXuLy,countNgayTiepNhan:countNgayTiepNhan}

Query profile

Query String (1837 chars, cacheable: false):
 FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {daTra: i.daTra, ngayTra: i.ngayTra_ts, choBoSung:i.choBoSung, 
 thoiDiemChoBoSung:i.thoiDiemChoBoSung_ts, ngayXuLy: i.ngayXuLy_ts, ngayTiepNhan:i.ngayTiepNhan_s}
     LET countDaTra = (FOR i IN group
                         FILTER i.daTra == 1
                         COLLECT WITH COUNT INTO countDaTra
                         RETURN countDaTra)
     LET countNgayTra = (FOR i IN group
                         FILTER i.ngayTra > 0
                         COLLECT WITH COUNT INTO countNgayTra
                         RETURN countNgayTra)
                         
     LET countChoBoSung = (FOR i IN group
                         FILTER i.choBoSung == 1
                         COLLECT WITH COUNT INTO countChoBoSung
                         RETURN countChoBoSung)
     LET countThoiDiemChoBoSung = (FOR i IN group
                      ...

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000   5000000          0       0.99799     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0) LET #68 = i.`ngayTra_ts` LET #70 = i.`choBoSung` LET #34 = i.`tinhId` LET #74 = i.`ngayXuLy_ts` LET #66 = i.`daTra` LET #72 = i.`thoiDiemChoBoSung_ts` LET #76 = i.`ngayTiepNhan_s`   /* view query without materialization */
  3   LimitNode            5000   5000000          0       0.07333       - LIMIT 0, 5000000
  6   CalculationNode      5000   5000000          0       2.80703       - LET #38 = { "daTra" : #66, "ngayTra" : #68, "choBoSung" : #70, "thoiDiemChoBoSung" : #72, "ngayXuLy" : #74, "ngayTiepNhan" : #76 }   /* simple expression */
  7   CollectNode             1        37          0       0.96351       - COLLECT tinhId = #34 AGGREGATE count = LENGTH() INTO group = #38   /* hash */
 52   SortNode                1        37          0       0.07302       - SORT tinhId ASC   /* sorting strategy: standard */
 63   SubqueryStartNode      37        74          0       0.16044       - LET countNgayTiepNhan = ( /* subquery begin */
 44   EnumerateListNode    5020   5000037          0       0.75889         - FOR i IN group   /* list iteration */
 45   CalculationNode      5020   5000037          0       0.76964           - LET #50 = (i.`ngayTiepNhan` > 0)   /* simple expression */
 46   FilterNode             37        37    5000000       0.14499           - FILTER #50
 47   CollectNode             1        74          0       0.00011           - COLLECT AGGREGATE countNgayTiepNhan = LENGTH()   /* count */
 64   SubqueryEndNode         1        37          0       0.00002           - RETURN  countNgayTiepNhan ) /* subquery end */
 61   SubqueryStartNode      37        74          0       0.16176       - LET countNgayXuLy = ( /* subquery begin */
 37   EnumerateListNode    5020   5000037          0       0.75396         - FOR i IN group   /* list iteration */
 38   CalculationNode      5020   5000037          0       0.87656           - LET #48 = (i.`ngayXuLy` > 0)   /* simple expression */
 39   FilterNode             37   5000037          0       0.14957           - FILTER #48
 40   CollectNode             1        74          0       0.00012           - COLLECT AGGREGATE countNgayXuLy = LENGTH()   /* count */
 62   SubqueryEndNode         1        37          0       0.00002           - RETURN  countNgayXuLy ) /* subquery end */
 59   SubqueryStartNode      37        74          0       0.15743       - LET countThoiDiemChoBoSung = ( /* subquery begin */
 30   EnumerateListNode    5020   5000037          0       0.75541         - FOR i IN group   /* list iteration */
 31   CalculationNode      5020   5000037          0       0.86033           - LET #46 = (i.`thoiDiemChoBoSung` > 0)   /* simple expression */
 32   FilterNode             37   1123639    3876398       0.16305           - FILTER #46
 33   CollectNode             1        74          0       0.00012           - COLLECT AGGREGATE countThoiDiemChoBoSung = LENGTH()   /* count */
 60   SubqueryEndNode         1        37          0       0.00002           - RETURN  countThoiDiemChoBoSung ) /* subquery end */
 57   SubqueryStartNode      37        74          0       0.16859       - LET countChoBoSung = ( /* subquery begin */
 23   EnumerateListNode    5020   5000037          0       0.75705         - FOR i IN group   /* list iteration */
 24   CalculationNode      5020   5000037          0       0.84098           - LET #44 = (i.`choBoSung` == 1)   /* simple expression */
 25   FilterNode             37     57277    4942760       0.15686           - FILTER #44
 26   CollectNode             1        74          0       0.00013           - COLLECT AGGREGATE countChoBoSung = LENGTH()   /* count */
 58   SubqueryEndNode         1        37          0       0.00002           - RETURN  countChoBoSung ) /* subquery end */
 55   SubqueryStartNode      37        74          0       0.15836       - LET countNgayTra = ( /* subquery begin */
 16   EnumerateListNode    5020   5000037          0       0.75944         - FOR i IN group   /* list iteration */
 17   CalculationNode      5020   5000037          0       0.90365           - LET #42 = (i.`ngayTra` > 0)   /* simple expression */
 18   FilterNode             37   4219307     780730       0.16860           - FILTER #42
 19   CollectNode             1        74          0       0.00012           - COLLECT AGGREGATE countNgayTra = LENGTH()   /* count */
 56   SubqueryEndNode         1        37          0       0.00002           - RETURN  countNgayTra ) /* subquery end */
 53   SubqueryStartNode      37        74          0       0.08596       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5020   5000037          0       0.76448         - FOR i IN group   /* list iteration */
 10   CalculationNode      5020   5000037          0       0.84268           - LET #40 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37   4211854     788183       0.16787           - FILTER #40
 12   CollectNode             1        74          0       0.00011           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 54   SubqueryEndNode         1        37          0       0.00002           - RETURN  countDaTra ) /* subquery end */
 50   CalculationNode         1        37          0       0.00004       - LET #52 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra, "countNgayTra" : countNgayTra, "countChoBoSung" : countChoBoSung, "countThoiDiemChoBoSung" : countThoiDiemChoBoSung, "countNgayXuLy" : countNgayXuLy, "countNgayTiepNhan" : countNgayTiepNhan }   /* simple expression */
 51   ReturnNode              1        37          0       0.00001       - RETURN #52

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   handle-arangosearch-views
  4   remove-unnecessary-calculations-2
  5   move-calculations-down
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0   15388071     1829371904        16.40413

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00013
 optimizing ast                 0.00002
 loading collections            0.00001
 instantiating plan             0.00012
 optimizing plan                0.00122
 instantiating executors        0.00028
 executing                     16.40239
 finalizing                     0.00007

I have multiple subqueries, and I see that a lot of time is spent in the EnumerateListNode and CalculationNode of each subquery. Is there a better way to compute several counts in a single FOR ... IN loop? All the subqueries are similar: each one iterates with FOR i IN group.

jsteemann · 2023-12-20T09:28:02Z

One idea is to loop over the results only once instead of creating an arbitrary number of subqueries:

FOR i IN search_hosotiepnhan_3
    SEARCH i.tinhId > 0
    limit 0,5000000
    COLLECT tinhId = i.tinhId
    AGGREGATE count = length(1)
    INTO group = {
      daTra: i.daTra, 
      ngayTra: i.ngayTra_ts, 
      choBoSung: i.choBoSung, 
      thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
      ngayXuLy: i.ngayXuLy_ts, 
      ngayTiepNhan:i.ngayTiepNhan_s
    }
    LET counters = (
      FOR i IN group
        COLLECT AGGREGATE 
          countDaTra = SUM(i.daTra == 1 ? 1 : 0),
          countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
          countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
          countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
          countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
          countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
        RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, countNgayTiepNhan}
    )
    RETURN {
      val: tinhId, 
      count: count, 
      countDaTra: counters[0].countDaTra, 
      countNgayTra: counters[0].countNgayTra, 
      countChoBoSung: counters[0].countChoBoSung,
      countThoiDiemChoBoSung: counters[0].countThoiDiemChoBoSung,
      countNgayXuLy: counters[0].countNgayXuLy,
      countNgayTiepNhan: counters[0].countNgayTiepNhan
    }

I can't estimate how this will impact performance (positively or negatively), but at least it is something you could try if you are worried about the amount of subqueries.
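
A further variant of the same idea, offered only as an untested sketch: the conditional sums could be moved into the outer COLLECT ... AGGREGATE, so that no INTO group has to be built or iterated at all. The query text below reuses attribute names from the query above (three of the six counters, for brevity); whether it is actually faster on this dataset has not been measured here, and every attribute it reads would still need to be covered by the view's stored values. The function `countPerGroup` and the sample documents are hypothetical; the function is an in-memory model of what the query computes, showing one pass in which every document updates all counters of its group.

```javascript
// Untested sketch of the AQL variant: conditional sums in the outer COLLECT, no INTO group.
// In arangosh it could be run with db._query(singlePassQuery).
const singlePassQuery = `
FOR i IN search_hosotiepnhan_3
  SEARCH i.tinhId > 0
  LIMIT 0, 5000000
  COLLECT tinhId = i.tinhId
  AGGREGATE
    count = LENGTH(1),
    countDaTra = SUM(i.daTra == 1 ? 1 : 0),
    countNgayTra = SUM(i.ngayTra_ts > 0 ? 1 : 0),
    countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0)
  RETURN { val: tinhId, count: count, countDaTra: countDaTra,
           countNgayTra: countNgayTra, countChoBoSung: countChoBoSung }
`;

// Predicates mirroring the FILTER conditions of the original subqueries.
const predicates = {
  countDaTra: (d) => d.daTra === 1,
  countNgayTra: (d) => d.ngayTra_ts > 0,
  countChoBoSung: (d) => d.choBoSung === 1,
};

// In-memory model of the query: a single pass over all documents,
// where each document increments the matching counters of its group.
function countPerGroup(docs) {
  const groups = new Map();
  for (const d of docs) {
    let g = groups.get(d.tinhId);
    if (!g) {
      g = { val: d.tinhId, count: 0 };
      for (const name of Object.keys(predicates)) g[name] = 0;
      groups.set(d.tinhId, g);
    }
    g.count += 1;
    for (const [name, matches] of Object.entries(predicates)) {
      if (matches(d)) g[name] += 1;
    }
  }
  // Sort by group key, like the SORT tinhId ASC step in the execution plan.
  return [...groups.values()].sort((a, b) => a.val - b.val);
}

// Made-up sample documents, not taken from the real dataset.
const sample = [
  { tinhId: 1, daTra: 1, ngayTra_ts: 1700000000, choBoSung: 0 },
  { tinhId: 1, daTra: 0, ngayTra_ts: 0, choBoSung: 1 },
  { tinhId: 4, daTra: 1, ngayTra_ts: 1700000500, choBoSung: 0 },
];
console.log(countPerGroup(sample));
```

The design point is that the per-group document lists (several gigabytes of peak memory in the profiles above would not be surprising for six attributes times 5M rows) are never materialized; only one small counter object per tinhId is kept.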

son2408 · 2023-12-20T09:56:18Z

@jsteemann thank you. I tried the query you suggested, but the runtime is not what I expected: it takes 26.65674 seconds, which is slower than the old query :(

Profile

Query String (1307 chars, cacheable: false):
 FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {
       daTra: i.daTra, 
       ngayTra: i.ngayTra_ts, 
       choBoSung: i.choBoSung, 
       thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
       ngayXuLy: i.ngayXuLy_ts, 
       ngayTiepNhan:i.ngayTiepNhan_s
     }
     LET counters = (
       FOR i IN group
         COLLECT AGGREGATE 
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
           countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
           countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
           countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
           countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
         RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, 
 countNgayTiepNhan}
     )
     RETURN {
       val: tinhId, 
       count: count, 
       co...

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000   5000000          0      13.14875     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.07553       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       0.62708       - LET #14 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       5.60139       - LET #18 = { "daTra" : i.`daTra`, "ngayTra" : i.`ngayTra_ts`, "choBoSung" : i.`choBoSung`, "thoiDiemChoBoSung" : i.`thoiDiemChoBoSung_ts`, "ngayXuLy" : i.`ngayXuLy_ts`, "ngayTiepNhan" : i.`ngayTiepNhan_s` }   /* simple expression */
  7   CollectNode             1        36          0       0.98457       - COLLECT tinhId = #14 AGGREGATE count = LENGTH() INTO group = #18   /* hash */
 22   SortNode                1        36          0       0.07236       - SORT tinhId ASC   /* sorting strategy: standard */
 24   SubqueryStartNode       1        72          0       0.08490       - LET counters = ( /* subquery begin */
  9   EnumerateListNode    5001   5000036          0       0.74345         - FOR i IN group   /* list iteration */
 10   CalculationNode      5001   5000036          0       0.75794           - LET #20 = ((i.`daTra` == 1) ? 1 : 0)   /* simple expression */
 11   CalculationNode      5001   5000036          0       0.80083           - LET #22 = ((i.`ngayTra` > 0) ? 1 : 0)   /* simple expression */
 12   CalculationNode      5001   5000036          0       0.74203           - LET #24 = ((i.`choBoSung` == 1) ? 1 : 0)   /* simple expression */
 13   CalculationNode      5001   5000036          0       0.75944           - LET #26 = ((i.`thoiDiemChoBoSung` > 0) ? 1 : 0)   /* simple expression */
 14   CalculationNode      5001   5000036          0       0.77572           - LET #28 = ((i.`ngayXuLy` > 0) ? 1 : 0)   /* simple expression */
 15   CalculationNode      5001   5000036          0       0.86129           - LET #30 = ((i.`ngayTiepNhan` > 0) ? 1 : 0)   /* simple expression */
 16   CollectNode             1        72          0       0.62120           - COLLECT AGGREGATE countDaTra = SUM(#20), countNgayTra = SUM(#22), countChoBoSung = SUM(#24), countThoiDiemChoBoSung = SUM(#26), countNgayXuLy = SUM(#28), countNgayTiepNhan = SUM(#30)   /* sorted */
 23   LimitNode               1        72          0       0.00001           - LIMIT 0, 1
 17   CalculationNode         1        72          0       0.00005           - LET #32 = { "countDaTra" : countDaTra, "countNgayTra" : countNgayTra, "countChoBoSung" : countChoBoSung, "countThoiDiemChoBoSung" : countThoiDiemChoBoSung, "countNgayXuLy" : countNgayXuLy, "countNgayTiepNhan" : countNgayTiepNhan }   /* simple expression */
 25   SubqueryEndNode         1        36          0       0.00002           - RETURN  #32 ) /* subquery end */
 20   CalculationNode         1        36          0       0.00009       - LET #34 = { "val" : tinhId, "count" : count, "countDaTra" : counters[0].`countDaTra`, "countNgayTra" : counters[0].`countNgayTra`, "countChoBoSung" : counters[0].`countChoBoSung`, "countThoiDiemChoBoSung" : counters[0].`countThoiDiemChoBoSung`, "countNgayXuLy" : counters[0].`countNgayXuLy`, "countNgayTiepNhan" : counters[0].`countNgayTiepNhan` }   /* simple expression */
 21   ReturnNode              1        36          0       0.00001       - RETURN #34

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   optimize-subqueries
  3   move-calculations-up-2
  4   handle-arangosearch-views
  5   remove-unnecessary-calculations-2
  6   move-calculations-down
  7   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0          0     1805746176        26.65769

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00010
 optimizing ast                 0.00001
 loading collections            0.00001
 instantiating plan             0.00006
 optimizing plan                0.00067
 instantiating executors        0.00015
 executing                     26.65674
 finalizing                     0.00006

jsteemann · 2023-12-20T10:07:26Z

Please note that the extraction from the view is again the slowest part of your query.
You will need to adjust the view definition again so it includes all used attributes as "stored values".

son2408 · 2023-12-20T10:14:04Z

@jsteemann Yes, I adjusted the view definition.

{
  "globallyUniqueId": "h64D5FDE00828/162169",
  "cleanupIntervalStep": 2,
  "writebufferIdle": 64,
  "writebufferActive": 0,
  "id": "162169",
  "consolidationIntervalMsec": 1000,
  "name": "search_hosotiepnhan_3",
  "commitIntervalMsec": 1000,
  "type": "arangosearch",
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "primarySort": [
    {
      "field": "tinhId",
      "asc": true
    }
  ],
  "primarySortCompression": "lz4",
  "storedValues": [
    {
      "fields": [
        "tinhId"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "ngayTra_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "choBoSung"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "thoiDiemChoBoSung_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "ngayXuLy_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "ngayTiepNhan_s"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "dayTiepNhan_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "daTra"
      ],
      "compression": "lz4"
    }
  ],
  "writebufferSizeMax": 33554432,
  "links": {
    "vertex_hosotiepnhan_2": {
      "analyzers": [
        "identity"
      ],
      "fields": {},
      "includeAllFields": true,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
}

son2408 · 2023-12-20T10:18:57Z

@jsteemann I see that the more attributes are used, the slower the query gets. Could you please check and help me? Thank you.

MBkkt · 2023-12-20T10:41:00Z

@son2408 That's expected: reading additional columns slows down the query.
You can try combining several fields into a single column (just write [["a", "b"]] instead of [["a"], ["b"]]). If they are always used together, this is usually faster.
Of course there is a limit, beyond which reading the whole document from RocksDB becomes faster.

son2408 · 2023-12-21T07:20:27Z

@MBkkt I tried combining the fields into a single column, but it's still too slow. Total runtime is 26.95196 seconds.
Profile

Query String (1306 chars, cacheable: false):
 FOR i IN search_hosotiepnhan
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {
       daTra: i.daTra, 
       ngayTra: i.ngayTra_ts, 
       choBoSung: i.choBoSung, 
       thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
       ngayXuLy: i.ngayXuLy_ts, 
       ngayTiepNhan:i.ngayTiepNhan_ts
     }
     LET counters = (
       FOR i IN group
         COLLECT AGGREGATE 
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
           countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
           countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
           countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
           countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
         RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, 
 countNgayTiepNhan}
     )
     RETURN {
       val: tinhId, 
       count: count, 
       cou...

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000   5000000          0      12.71218     - FOR i IN search_hosotiepnhan SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.07812       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       0.63890       - LET #14 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       6.13356       - LET #18 = { "daTra" : i.`daTra`, "ngayTra" : i.`ngayTra_ts`, "choBoSung" : i.`choBoSung`, "thoiDiemChoBoSung" : i.`thoiDiemChoBoSung_ts`, "ngayXuLy" : i.`ngayXuLy_ts`, "ngayTiepNhan" : i.`ngayTiepNhan_ts` }   /* simple expression */
  7   CollectNode             1        37          0       1.08334       - COLLECT tinhId = #14 AGGREGATE count = LENGTH() INTO group = #18   /* hash */
 22   SortNode                1        37          0       0.07627       - SORT tinhId ASC   /* sorting strategy: standard */
 24   SubqueryStartNode       1        74          0       0.08903       - LET counters = ( /* subquery begin */
  9   EnumerateListNode    5001   5000037          0       0.74561         - FOR i IN group   /* list iteration */
 10   CalculationNode      5001   5000037          0       0.75636           - LET #20 = ((i.`daTra` == 1) ? 1 : 0)   /* simple expression */
 11   CalculationNode      5001   5000037          0       0.79988           - LET #22 = ((i.`ngayTra` > 0) ? 1 : 0)   /* simple expression */
 12   CalculationNode      5001   5000037          0       0.74970           - LET #24 = ((i.`choBoSung` == 1) ? 1 : 0)   /* simple expression */
 13   CalculationNode      5001   5000037          0       0.75335           - LET #26 = ((i.`thoiDiemChoBoSung` > 0) ? 1 : 0)   /* simple expression */
 14   CalculationNode      5001   5000037          0       0.77716           - LET #28 = ((i.`ngayXuLy` > 0) ? 1 : 0)   /* simple expression */
 15   CalculationNode      5001   5000037          0       0.92626           - LET #30 = ((i.`ngayTiepNhan` > 0) ? 1 : 0)   /* simple expression */
 16   CollectNode             1        74          0       0.63191           - COLLECT AGGREGATE countDaTra = SUM(#20), countNgayTra = SUM(#22), countChoBoSung = SUM(#24), countThoiDiemChoBoSung = SUM(#26), countNgayXuLy = SUM(#28), countNgayTiepNhan = SUM(#30)   /* sorted */
 23   LimitNode               1        74          0       0.00001           - LIMIT 0, 1
 17   CalculationNode         1        74          0       0.00006           - LET #32 = { "countDaTra" : countDaTra, "countNgayTra" : countNgayTra, "countChoBoSung" : countChoBoSung, "countThoiDiemChoBoSung" : countThoiDiemChoBoSung, "countNgayXuLy" : countNgayXuLy, "countNgayTiepNhan" : countNgayTiepNhan }   /* simple expression */
 25   SubqueryEndNode         1        37          0       0.00002           - RETURN  #32 ) /* subquery end */
 20   CalculationNode         1        37          0       0.00008       - LET #34 = { "val" : tinhId, "count" : count, "countDaTra" : counters[0].`countDaTra`, "countNgayTra" : counters[0].`countNgayTra`, "countChoBoSung" : counters[0].`countChoBoSung`, "countThoiDiemChoBoSung" : counters[0].`countThoiDiemChoBoSung`, "countNgayXuLy" : counters[0].`countNgayXuLy`, "countNgayTiepNhan" : counters[0].`countNgayTiepNhan` }   /* simple expression */
 21   ReturnNode              1        37          0       0.00001       - RETURN #34

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   optimize-subqueries
  3   move-calculations-up-2
  4   handle-arangosearch-views
  5   remove-unnecessary-calculations-2
  6   move-calculations-down
  7   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0          0     1886453760        26.95313

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00012
 optimizing ast                 0.00001
 loading collections            0.00001
 instantiating plan             0.00007
 optimizing plan                0.00069
 instantiating executors        0.00031
 executing                     26.95196
 finalizing                     0.00005

View configuration

{
  "globallyUniqueId": "h64D5FDE00828/196510",
  "cleanupIntervalStep": 2,
  "writebufferIdle": 64,
  "writebufferActive": 0,
  "id": "196510",
  "consolidationIntervalMsec": 1000,
  "name": "search_hosotiepnhan",
  "commitIntervalMsec": 1000,
  "type": "arangosearch",
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "primarySort": [],
  "primarySortCompression": "lz4",
  "storedValues": [
    {
      "fields": [
        "tinhId",
        "daTra",
        "ngayTra_ts",
        "choBoSung",
        "thoiDiemChoBoSung_ts",
        "ngayXuLy_ts",
        "ngayTiepNhan_ts"
      ],
      "compression": "lz4"
    }
  ],
  "writebufferSizeMax": 33554432,
  "links": {
    "vertex_hosotiepnhan_2": {
      "analyzers": [
        "identity"
      ],
      "fields": {},
      "includeAllFields": true,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
}

son2408 · 2023-12-21T07:36:39Z

@jsteemann I included the used attributes as "stored values", but it did not improve. I see that the more attributes are used, the slower the query gets. Could you help me by trying it with the data shared above? Thank you.
Query

FOR i IN search_hosotiepnhan_3
    SEARCH i.tinhId > 0
    limit 0,5000000
    COLLECT tinhId = i.tinhId
    AGGREGATE count = length(1)
    INTO group = {
      daTra: i.daTra, 
      ngayTra: i.ngayTra_ts, 
      choBoSung: i.choBoSung, 
      thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
      ngayXuLy: i.ngayXuLy_ts, 
      ngayTiepNhan:i.ngayTiepNhan_ts
    }
    LET counters = (
      FOR i IN group
        COLLECT AGGREGATE 
          countDaTra = SUM(i.daTra == 1 ? 1 : 0),
          countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
          countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
          countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
          countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
          countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
        RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, countNgayTiepNhan}
    )
    RETURN {
      val: tinhId, 
      count: count, 
      countDaTra: counters[0].countDaTra, 
      countNgayTra: counters[0].countNgayTra, 
      countChoBoSung: counters[0].countChoBoSung,
      countThoiDiemChoBoSung: counters[0].countThoiDiemChoBoSung,
      countNgayXuLy: counters[0].countNgayXuLy,
      countNgayTiepNhan: counters[0].countNgayTiepNhan
    }

jsteemann · 2023-12-21T09:00:53Z

Hard to say without any profile what the limiting factor is now.
You could also try removing the one remaining subquery used for counting and fusing its COLLECT statement with the initial COLLECT. Then only one pass over the data is needed.

son2408 · 2023-12-21T09:18:39Z

@jsteemann Yes, I tried removing some attributes, and the query got faster. With 6 attributes in the initial COLLECT the runtime is 32s; with 2 attributes it is 20s.
Current profile:

Query String (598 chars, cacheable: false):
 FOR i IN search_hosotiepnhan
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {
       daTra: i.daTra, 
       ngayTra: i.ngayTra_ts
     }
     LET counters = (
       FOR i IN group
         COLLECT AGGREGATE 
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0)
         RETURN {countDaTra, countNgayTra}
     )
     RETURN {
       val: tinhId, 
       count: count, 
       countDaTra: counters[0].countDaTra, 
       countNgayTra: counters[0].countNgayTra
     }

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000   5000000          0      12.83193     - FOR i IN search_hosotiepnhan SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.07608       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       0.64449       - LET #10 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       3.21296       - LET #14 = { "daTra" : i.`daTra`, "ngayTra" : i.`ngayTra_ts` }   /* simple expression */
  7   CollectNode             1        37          0       0.81389       - COLLECT tinhId = #10 AGGREGATE count = LENGTH() INTO group = #14   /* hash */
 18   SortNode                1        37          0       0.02310       - SORT tinhId ASC   /* sorting strategy: standard */
 20   SubqueryStartNode       1        74          0       0.02810       - LET counters = ( /* subquery begin */
  9   EnumerateListNode    5001   5000037          0       0.70404         - FOR i IN group   /* list iteration */
 10   CalculationNode      5001   5000037          0       0.70010           - LET #16 = ((i.`daTra` == 1) ? 1 : 0)   /* simple expression */
 11   CalculationNode      5001   5000037          0       0.90486           - LET #18 = ((i.`ngayTra` > 0) ? 1 : 0)   /* simple expression */
 12   CollectNode             1        74          0       0.30604           - COLLECT AGGREGATE countDaTra = SUM(#16), countNgayTra = SUM(#18)   /* sorted */
 19   LimitNode               1        74          0       0.00001           - LIMIT 0, 1
 13   CalculationNode         1        74          0       0.00004           - LET #20 = { "countDaTra" : countDaTra, "countNgayTra" : countNgayTra }   /* simple expression */
 21   SubqueryEndNode         1        37          0       0.00002           - RETURN  #20 ) /* subquery end */
 16   CalculationNode         1        37          0       0.00004       - LET #22 = { "val" : tinhId, "count" : count, "countDaTra" : counters[0].`countDaTra`, "countNgayTra" : counters[0].`countNgayTra` }   /* simple expression */
 17   ReturnNode              1        37          0       0.00000       - RETURN #22

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   optimize-subqueries
  3   move-calculations-up-2
  4   handle-arangosearch-views
  5   remove-unnecessary-calculations-2
  6   move-calculations-down
  7   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0          0      547389440        20.24671

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00010
 optimizing ast                 0.00001
 loading collections            0.00001
 instantiating plan             0.00007
 optimizing plan                0.00046
 instantiating executors        0.00027
 executing                     20.24583
 finalizing                     0.00003

son2408 · 2023-12-21T09:20:51Z

@jsteemann I think "stored values" are not efficient with COLLECT INTO.

jsteemann · 2023-12-21T09:25:05Z

"stored values" and COLLECT INTO are unrelated.

jsteemann · 2023-12-21T09:25:25Z

Can you try fusing the two COLLECT statements into a single one, so you will need only one iteration over the dataset?

son2408 · 2023-12-21T09:38:26Z

@jsteemann The two COLLECTs operate on different inputs: COLLECT 1 on the initial search statement, COLLECT 2 on the group produced by COLLECT 1. How can I fuse them?

jsteemann · 2023-12-21T09:44:14Z

Have you tried something along the lines of

 FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE 
           count = length(1),
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra_ts > 0 ? 1 : 0),
           countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
           countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung_ts > 0 ? 1 : 0),
           countNgayXuLy = SUM(i.ngayXuLy_ts > 0 ? 1 : 0),
           countNgayTiepNhan = SUM(i.ngayTiepNhan_ts > 0 ? 1 : 0)
...

son2408 · 2023-12-21T10:13:07Z

@jsteemann I tried it, and the query runtime is 4 seconds. Thank you.
In a more complex use case I need a subquery that collects by the huyenId field for documents with daTra == 1, so I must use INTO. How can I rebuild this query without INTO, like your query above? With INTO the query runs too slowly.

FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE 
           count = length(1),
           countDaTra = SUM(i.daTra == 1 ? 1 : 0)
     INTO group = {daTra: i.daTra, huyenId: i.huyenId}
     LET statHuyen = (FOR g IN group
                         FILTER g.daTra == 1
                         COLLECT huyenId = g.huyenId
                         AGGREGATE
                              countHuyen = length(1)
                              RETURN {val: huyenId, count: countHuyen }
                     )
      ....
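One possible direction for the question above, as an untested sketch rather than a confirmed answer from the thread: group by the pair (tinhId, huyenId) first, then regroup the pre-aggregated rows by tinhId. INTO then collects one small row per (tinhId, huyenId) pair instead of one row per document. This assumes huyenId is included in the view's stored values; the names cnt, cntDaTra, tinh and perHuyen are illustrative only.

```aql
FOR i IN search_hosotiepnhan_3
    SEARCH i.tinhId > 0
    LIMIT 0, 5000000
    // Pass 1: one row per (tinhId, huyenId) pair, without per-document INTO
    COLLECT tinhId = i.tinhId, huyenId = i.huyenId
    AGGREGATE
        cnt = LENGTH(1),
        cntDaTra = SUM(i.daTra == 1 ? 1 : 0)
    // Pass 2: regroup the pre-aggregated rows by tinhId;
    // INTO now holds one entry per huyenId, not one per document
    COLLECT tinh = tinhId
    AGGREGATE
        count = SUM(cnt),
        countDaTra = SUM(cntDaTra)
    INTO perHuyen = { val: huyenId, count: cntDaTra }
    RETURN {
        val: tinh,
        count: count,
        countDaTra: countDaTra,
        // keep only huyenId groups that contain at least one daTra == 1 document,
        // mirroring the FILTER g.daTra == 1 of the subquery version
        statHuyen: perHuyen[* FILTER CURRENT.count > 0]
    }
```

Whether this is actually faster on this dataset would need to be confirmed with a profile run.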

MBkkt · 2023-12-21T10:13:48Z

/* view query */

This means the view is used without stored values.

/* view query without materialization */

This means the view is used with stored values.

jsteemann · 2023-12-21T11:04:03Z

Ok, I think there isn't anything left that I can do now.
The only suggestion I have is to try the upcoming version of ArangoDB (3.12), which is currently in development. I can't say if that will help or not, but maybe it is worth a try.
Linux nightly builds can be found here: https://download.arangodb.com/nightly/devel/Linux/x86_64/index.html
Please do not run a nightly build on the original data, but on a copy of your data!

son2408 · 2023-12-21T13:38:53Z

OK, let me try it. Thanks @jsteemann

son2408 · 2023-12-26T10:09:02Z

Hi @jsteemann, I upgraded to version 3.12.0-NIGHTLY.20231219. The runtime of queries with COLLECT..INTO has improved compared to the old version, but it is still not as fast as expected. Here is my detailed use case.
My query with three INTO clauses takes about 8.5 seconds.
View Configuration

{
  "globallyUniqueId": "h64D5FDE00828/16843",
  "cleanupIntervalStep": 2,
  "writebufferIdle": 64,
  "writebufferActive": 0,
  "id": "16843",
  "consolidationIntervalMsec": 1000,
  "name": "search_hosotiepnhan",
  "commitIntervalMsec": 1000,
  "type": "arangosearch",
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "primarySort": [
    {
      "field": "hoSoTiepNhanId",
      "asc": true
    }
  ],
  "primarySortCompression": "lz4",
  "storedValues": [
    {
      "fields": [
        "tinhId",
        "huyenId",
        "daTra",
        "ngayTra_ts",
        "choBoSung",
        "thoiDiemChoBoSung_ts",
        "ngayXuLy_ts",
        "ngayTiepNhan_ts",
        "dayTiepNhan",
        "monthTiepNhan",
        "quarterTiepNhan"
      ],
      "compression": "lz4"
    }
  ],
  "writebufferSizeMax": 33554432,
  "links": {
    "vertex_hosotiepnhan": {
      "analyzers": [
        "identity"
      ],
      "fields": {},
      "includeAllFields": true,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
}

Query

FOR i IN search_hosotiepnhan
    limit 0, 5000000
    collect tinhId = i.tinhId
    aggregate
      countTinh = length(1)
    into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, quarterTiepNhan: i.quarterTiepNhan}
    
    //dayTiepNhan
    let statDayTiepNhan = (
      FOR g in group
        collect collDayTiepNhan = g.dayTiepNhan
        aggregate
          countDayTiepNhan = length(1)
        into groupDayTiepNhan = {daTra:g.daTra}
        let statDayTiepNhanDaTra = (
          FOR gDayTiepNhan in groupDayTiepNhan
            collect daTra = gDayTiepNhan.daTra
            aggregate
              countDaTra = length(1)
              return {val: daTra, val: countDaTra}
        )
        return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
    )
    //monthTiepNhan
    let statMonthTiepNhan = (
      FOR g in group
        collect monthTiepNhan = g.monthTiepNhan
        aggregate
          countMonthTiepNhan = length(1)
        into groupMonthTiepNhan = {daTra:g.daTra}
        let statDaTra = (
          FOR gMonthTiepNhan in groupMonthTiepNhan
            collect daTra = gMonthTiepNhan.daTra
            aggregate
              countDaTra = length(1)
            return {val: daTra, val: countDaTra}
        )
        return {val: monthTiepNhan, count: countMonthTiepNhan}
    )
  return {val: tinhId, count: countTinh, statDayTiepNhan: statDayTiepNhan, statMonthTiepNhan: statMonthTiepNhan}

Profile

Query String (1520 chars, cacheable: false):
FOR i IN search_hosotiepnhan
   limit 0, 5000000
   collect tinhId = i.tinhId
   aggregate
     countTinh = length(1)
   into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, 
quarterTiepNhan: i.quarterTiepNhan}
   
   //dayTiepNhan
   let statDayTiepNhan = (
     FOR g in group
       collect collDayTiepNhan = g.dayTiepNhan
       aggregate
         countDayTiepNhan = length(1)
       into groupDayTiepNhan = {daTra:g.daTra}
       let statDayTiepNhanDaTra = (
         FOR gDayTiepNhan in groupDayTiepNhan
           collect daTra = gDayTiepNhan.daTra
           aggregate
             countDaTra = length(1)
             return {val: daTra, val: countDaTra}
       )
       return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
   )
   //monthTiepNhan
   let statMonthTiepNhan = (
     FOR g in group
       collect monthTiepNhan = g.monthTiepNhan
       aggregate
...

Execution plan:
Id   NodeType            Calls    Par     Items   Filtered   Runtime [s]   Comment
1   SingletonNode           1      -         1          0       0.00000   * ROOT
2   EnumerateViewNode    5000      -   5000000          0       0.51076     - FOR i IN search_hosotiepnhan LET #30 = i.`tinhId` LET #110 = i.`monthTiepNhan` LET #109 = i.`dayTiepNhan` LET #108 = i.`daTra` LET #111 = i.`quarterTiepNhan`   /* view query without materialization */
3   LimitNode            5000      -   5000000          0       0.01663       - LIMIT 0, 5000000
6   CalculationNode      5000   4999   5000000          0       2.16173       - LET #32 = { "daTra" : #108, "dayTiepNhan" : #109, "monthTiepNhan" : #110, "quarterTiepNhan" : #111 }   /* simple expression */
7   CollectNode             1      0        37          0       1.00809       - COLLECT tinhId = #30 AGGREGATE countTinh = LENGTH() INTO group = #32   /* hash */
44   SortNode                1      0        37          0       0.04790       - SORT tinhId ASC   /* sorting strategy: standard */
52   SubqueryStartNode       1      -        74          0       0.11422       - LET statMonthTiepNhan = ( /* subquery begin */
26   EnumerateListNode    5001      -   5000037          0       0.74329         - FOR g IN group   /* list iteration */
27   CalculationNode      5001      -   5000037          0       0.37468           - LET #40 = g.`monthTiepNhan`   /* attribute expression */
29   CalculationNode      5001      -   5000037          0       1.03349           - LET #42 = { "daTra" : g.`daTra` }   /* simple expression */
30   CollectNode             1      -       457          0       0.29846           - COLLECT monthTiepNhan = #40 AGGREGATE countMonthTiepNhan = LENGTH()   /* hash */
47   SortNode                1      -       457          0       0.00010           - SORT monthTiepNhan ASC   /* sorting strategy: standard */
39   CalculationNode         1      -       457          0       0.00017           - LET #46 = { "val" : monthTiepNhan, "count" : countMonthTiepNhan }   /* simple expression */
53   SubqueryEndNode         1      -        37          0       0.00005           - RETURN  #46 ) /* subquery end */
48   SubqueryStartNode       1      -        74          0       0.06846       - LET statDayTiepNhan = ( /* subquery begin */
9   EnumerateListNode    5001      -   5000037          0       0.73775         - FOR g IN group   /* list iteration */
10   CalculationNode      5001      -   5000037          0       0.35109           - LET #33 = g.`dayTiepNhan`   /* attribute expression */
12   CalculationNode      5001      -   5000037          0       1.03256           - LET #35 = { "daTra" : g.`daTra` }   /* simple expression */
13   CollectNode             1      -       292          0       0.44655           - COLLECT collDayTiepNhan = #33 AGGREGATE countDayTiepNhan = LENGTH() INTO groupDayTiepNhan = #35   /* hash */
45   SortNode                1      -       292          0       0.00862           - SORT collDayTiepNhan ASC   /* sorting strategy: standard */
50   SubqueryStartNode       1      -       547          0       0.00820           - LET statDayTiepNhanDaTra = ( /* subquery begin */
15   EnumerateListNode    5001      -   5000292          0       0.31771             - FOR gDayTiepNhan IN groupDayTiepNhan   /* list iteration */
16   CalculationNode      5001      -   5000292          0       0.33699               - LET #36 = gDayTiepNhan.`daTra`   /* attribute expression */
18   CollectNode             1      -       793          0       0.27310               - COLLECT daTra = #36 AGGREGATE countDaTra = LENGTH()   /* hash */
46   SortNode                1      -       793          0       0.00023               - SORT daTra ASC   /* sorting strategy: standard */
19   CalculationNode         1      -       793          0       0.00039               - LET #38 = { "val" : daTra, "val" : countDaTra }   /* simple expression */
51   SubqueryEndNode         1      -       292          0       0.00009               - RETURN  #38 ) /* subquery end */
22   CalculationNode         1      -       292          0       0.00016           - LET #39 = { "val" : collDayTiepNhan, "count" : countDayTiepNhan, "buckets" : statDayTiepNhanDaTra }   /* simple expression */
49   SubqueryEndNode         1      -        37          0       0.00005           - RETURN  #39 ) /* subquery end */
42   CalculationNode         1      0        37          0       0.00004       - LET #47 = { "val" : tinhId, "count" : countTinh, "statDayTiepNhan" : statDayTiepNhan, "statMonthTiepNhan" : statMonthTiepNhan }   /* simple expression */
43   ReturnNode              1      -        37          0       0.00000       - RETURN #47

Indexes used:
none

Optimization rules applied:
Id   Rule Name                                 Id   Rule Name                                 Id   Rule Name                        
1   move-calculations-up                       5   remove-collect-variables                   9   async-prefetch                   
2   remove-redundant-calculations              6   handle-arangosearch-views                 10   splice-subqueries                
3   remove-unnecessary-calculations            7   remove-unnecessary-calculations-2
4   move-calculations-up-2                     8   move-calculations-down           

Query Statistics:
Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
         0               0                 0              0         5000000                  0 / 0             0        1563394048            8.63782

Query Profile:
Query Stage                Duration [s]         Query Stage                Duration [s]         Query Stage                Duration [s]
initializing                    0.00000         loading collections             0.00001         instantiating executors         0.00018
parsing                         0.00014         instantiating plan              0.00022         executing                       8.63007
optimizing ast                  0.00001         optimizing plan                 0.00719         finalizing                      0.00004

If the query has only two COLLECT..INTO clauses, it takes 6s.
Query

FOR i IN search_hosotiepnhan
    limit 0, 5000000
    collect tinhId = i.tinhId
    aggregate
      countTinh = length(1)
    into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, quarterTiepNhan: i.quarterTiepNhan}
    
    //dayTiepNhan
    let statDayTiepNhan = (
      FOR g in group
        collect collDayTiepNhan = g.dayTiepNhan
        aggregate
          countDayTiepNhan = length(1)
        into groupDayTiepNhan = {daTra:g.daTra}
        let statDayTiepNhanDaTra = (
          FOR gDayTiepNhan in groupDayTiepNhan
            collect daTra = gDayTiepNhan.daTra
            aggregate
              countDaTra = length(1)
              return {val: daTra, val: countDaTra}
        )
        return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
    )
  return {val: tinhId, count: countTinh, statDayTiepNhan: statDayTiepNhan}

Profile

Query String (925 chars, cacheable: false):
 FOR i IN search_hosotiepnhan
     limit 0, 5000000
     collect tinhId = i.tinhId
     aggregate
       countTinh = length(1)
     into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, 
 quarterTiepNhan: i.quarterTiepNhan}
     
     //dayTiepNhan
     let statDayTiepNhan = (
       FOR g in group
         collect collDayTiepNhan = g.dayTiepNhan
         aggregate
           countDayTiepNhan = length(1)
         into groupDayTiepNhan = {daTra:g.daTra}
         let statDayTiepNhanDaTra = (
           FOR gDayTiepNhan in groupDayTiepNhan
             collect daTra = gDayTiepNhan.daTra
             aggregate
               countDaTra = length(1)
               return {val: daTra, val: countDaTra}
         )
         return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
     )
   return {val: tinhId, count: countTinh, statDayTiepNhan: statDayTiepNhan}

Execution plan:
 Id   NodeType            Calls    Par     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1      -         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000      -   5000000          0       0.51039     - FOR i IN search_hosotiepnhan LET #17 = i.`tinhId` LET #58 = i.`monthTiepNhan` LET #56 = i.`daTra` LET #57 = i.`dayTiepNhan` LET #59 = i.`quarterTiepNhan`   /* view query without materialization */
  3   LimitNode            5000      -   5000000          0       0.01718       - LIMIT 0, 5000000
  6   CalculationNode      5000   4999   5000000          0       2.15355       - LET #19 = { "daTra" : #56, "dayTiepNhan" : #57, "monthTiepNhan" : #58, "quarterTiepNhan" : #59 }   /* simple expression */
  7   CollectNode             1      0        37          0       1.00086       - COLLECT tinhId = #17 AGGREGATE countTinh = LENGTH() INTO group = #19   /* hash */
 27   SortNode                1      0        37          0       0.05242       - SORT tinhId ASC   /* sorting strategy: standard */
 30   SubqueryStartNode       1      -        74          0       0.06866       - LET statDayTiepNhan = ( /* subquery begin */
  9   EnumerateListNode    5001      -   5000037          0       0.72884         - FOR g IN group   /* list iteration */
 10   CalculationNode      5001      -   5000037          0       0.35558           - LET #20 = g.`dayTiepNhan`   /* attribute expression */
 12   CalculationNode      5001      -   5000037          0       1.02633           - LET #22 = { "daTra" : g.`daTra` }   /* simple expression */
 13   CollectNode             1      -       292          0       0.43884           - COLLECT collDayTiepNhan = #20 AGGREGATE countDayTiepNhan = LENGTH() INTO groupDayTiepNhan = #22   /* hash */
 28   SortNode                1      -       292          0       0.00886           - SORT collDayTiepNhan ASC   /* sorting strategy: standard */
 32   SubqueryStartNode       1      -       547          0       0.00816           - LET statDayTiepNhanDaTra = ( /* subquery begin */
 15   EnumerateListNode    5001      -   5000292          0       0.31943             - FOR gDayTiepNhan IN groupDayTiepNhan   /* list iteration */
 16   CalculationNode      5001      -   5000292          0       0.34112               - LET #23 = gDayTiepNhan.`daTra`   /* attribute expression */
 18   CollectNode             1      -       793          0       0.26691               - COLLECT daTra = #23 AGGREGATE countDaTra = LENGTH()   /* hash */
 29   SortNode                1      -       793          0       0.00022               - SORT daTra ASC   /* sorting strategy: standard */
 19   CalculationNode         1      -       793          0       0.00038               - LET #25 = { "val" : daTra, "val" : countDaTra }   /* simple expression */
 33   SubqueryEndNode         1      -       292          0       0.00009               - RETURN  #25 ) /* subquery end */
 22   CalculationNode         1      -       292          0       0.00014           - LET #26 = { "val" : collDayTiepNhan, "count" : countDayTiepNhan, "buckets" : statDayTiepNhanDaTra }   /* simple expression */
 31   SubqueryEndNode         1      -        37          0       0.00004           - RETURN  #26 ) /* subquery end */
 25   CalculationNode         1      0        37          0       0.00002       - LET #27 = { "val" : tinhId, "count" : countTinh, "statDayTiepNhan" : statDayTiepNhan }   /* simple expression */
 26   ReturnNode              1      -        37          0       0.00000       - RETURN #27

Indexes used:
 none

Optimization rules applied:
 Id   Rule Name                                 Id   Rule Name                                 Id   Rule Name                        
  1   move-calculations-up                       4   move-calculations-up-2                     7   move-calculations-down           
  2   remove-redundant-calculations              5   handle-arangosearch-views                  8   async-prefetch                   
  3   remove-unnecessary-calculations            6   remove-unnecessary-calculations-2          9   splice-subqueries                

Query Statistics:
 Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
           0               0                 0              0         5000000                  0 / 0             0        1246724096            6.03617

Query Profile:
 Query Stage                Duration [s]         Query Stage                Duration [s]         Query Stage                Duration [s]
 initializing                    0.00000         loading collections             0.00001         instantiating executors         0.00014
 parsing                         0.00013         instantiating plan              0.00009         executing                       6.03324
 optimizing ast                  0.00001         optimizing plan                 0.00255         finalizing                      0.00004

son2408 · 2023-12-26T10:09:42Z

@jsteemann

son2408 · 2023-12-26T10:13:25Z

@jsteemann Can you tell me the best way to speed up a query with multiple COLLECT ... INTO clauses? I see that the more COLLECT ... INTO clauses a query has, the slower it runs. Thank you.

jsteemann · 2023-12-26T12:15:29Z

Hi @son2408, unfortunately I am not aware of any good way to speed up COLLECT ... INTO further.

dothebart added performance 1 Question 3 Search IResearch / Fulltext index / Analyzers 2 Solved Resolution labels Dec 19, 2023

son2408 closed this as completed Dec 27, 2023
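
One rewrite that may be worth profiling is sketched below. It is an untested assumption, not something proposed or confirmed in this thread. It avoids the large COLLECT ... INTO rather than speeding it up. The first COLLECT groups on all three keys at once using WITH COUNT INTO, so no per-document object has to be built for the 5,000,000 input rows. The later COLLECT ... INTO steps then only see the pre-aggregated rows (the profile above reports 793 rows at the innermost COLLECT). The variable names t, d, tinh, cnt and countDay are made up for illustration, and the inner bucket uses count for its second attribute instead of repeating val as the original query does.

```aql
// Sketch only, not verified against the data set in this issue.
FOR i IN search_hosotiepnhan
  LIMIT 0, 5000000
  // Pass 1: one row per distinct (tinhId, dayTiepNhan, daTra) combination.
  // WITH COUNT INTO only counts; it does not collect group members.
  COLLECT tinhId = i.tinhId, dayTiepNhan = i.dayTiepNhan, daTra = i.daTra
    WITH COUNT INTO cnt
  // Pass 2: regroup the small intermediate result per (tinhId, dayTiepNhan).
  COLLECT t = tinhId, d = dayTiepNhan
    AGGREGATE countDay = SUM(cnt)
    INTO buckets = { val: daTra, count: cnt }
  // Pass 3: regroup per tinhId and nest the per-day statistics.
  COLLECT tinh = t
    AGGREGATE countTinh = SUM(countDay)
    INTO statDayTiepNhan = { val: d, count: countDay, buckets: buckets }
  RETURN { val: tinh, count: countTinh, statDayTiepNhan: statDayTiepNhan }
```

The order of elements inside the nested arrays is not guaranteed to match the original query, so compare the output and the profile of both versions before relying on this.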
