Description
My Environment
- ArangoDB Version: 3.4.8
- Storage Engine: RocksDB
- Deployment Mode: Cluster (3 nodes with 3 Agents, 3 DBServers and 3 Coordinators)
- Deployment Strategy: ArangoDB Starter in Docker
- Configuration: default
- Infrastructure: own
- Operating System: CentOS
- Total RAM in your machine: 128G
- Disks in use: SSD
Size of your Dataset on disk:
one vertex collection: 374M
one edge collection: 37G
Dataset:
The dataset contains a single vertex collection called users with 41,652,230 documents like the following:
{
"_key": "12",
"_id": "users/12",
"_rev": "_Z4it3Eu--K"
}
and a single edge collection, representing the follower relationship, with 1,468,365,182 documents like the following:
{
"_key": "6842768634",
"_id": "follow/6842768634",
"_from": "users/324",
"_to": "users/20",
"_rev": "_Z4FeNU---u",
"vertex": 324
}
The shard key is ["vertex"].
I confirmed that there are no invalid edges.
Replication Factor & Number of Shards (Cluster only):
Replication Factor 1
Shards 81
Problem:
When I run a Pregel algorithm, the status I receive is as follows:
The vertexCount is 41,652,230, which matches the vertex collection, but the edgeCount is 16,695,168, which is far less than the edge collection (about 1.47 billion edges).
Whichever Pregel algorithm I run, the edgeCount is the same. The logs are as follows:
Does the edgeCount parameter represent the total number of edges in the graph? If so, why is the number of edges in the graph so much smaller than the number of documents in the edge collection? Did I do something wrong?
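As a quick sanity check on the figures above, the following Python sketch compares the reported edgeCount with the size of the edge collection and with the average number of edges per shard. The variable names are illustrative and the numbers are copied from this report; nothing here calls ArangoDB.

```python
# Figures copied from the report; variable names are illustrative only.
edge_collection_count = 1_468_365_182  # documents in the edge collection
pregel_edge_count = 16_695_168         # edgeCount reported by Pregel
number_of_shards = 81

# Share of the edge collection that Pregel reports having loaded.
fraction_loaded = pregel_edge_count / edge_collection_count

# Average number of edges per shard, assuming an even distribution.
average_edges_per_shard = edge_collection_count / number_of_shards

print(f"fraction of edges seen by Pregel: {fraction_loaded:.4%}")
print(f"1 / number_of_shards:             {1 / number_of_shards:.4%}")
print(f"average edges per shard:          {average_edges_per_shard:,.0f}")
```

The reported edgeCount comes out at a little over 1% of the edge collection, which is the same order of magnitude as one shard's average share (1/81). Whether that is a coincidence or related to how the edges are sharded is not something this report establishes.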
By the way, how can I get the total number of edges in the graph? I ran the following AQL query, but it timed out because the number of edges is too large.
AQL query (if applicable):
FOR i IN users
  LET ec = (
    FOR v, e, p IN 1..1 OUTBOUND i GRAPH "twitter"
      RETURN DISTINCT e
  )
  RETURN COUNT(ec)
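To make explicit what this query computes: it returns one row per user (that user's number of outbound edges), not a single total, so the graph-wide edge count would be the sum of those rows. A minimal Python sketch of the same logic on hypothetical toy data (the three edges and three users below are made up for illustration, and every `_from` is assumed to refer to an existing user):

```python
from collections import Counter

# Hypothetical toy data standing in for the edge and vertex collections.
edges = [
    {"_from": "users/324", "_to": "users/20"},
    {"_from": "users/324", "_to": "users/12"},
    {"_from": "users/12", "_to": "users/20"},
]
users = ["users/12", "users/20", "users/324"]

# Equivalent of the subquery: count outbound edges per start vertex.
out_degree = Counter(edge["_from"] for edge in edges)

# Equivalent of the outer loop: one count per user, in collection order.
per_user_counts = [out_degree[user] for user in users]

# The total number of edges is the sum of the per-user counts.
total_edges = sum(per_user_counts)

print(per_user_counts, total_edges)
```

On this toy data the sum of the per-user counts equals the number of edge documents, which is why the per-user traversal does a great deal of work just to recover a number the edge collection already holds.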
AQL explain (if applicable):
Execution plan:
 Id   NodeType                  Site       Est.   Comment
  1   SingletonNode             DBS           1   * ROOT
  2   EnumerateCollectionNode   DBS    41652230   - FOR i IN users /* full collection scan, 81 shard(s) */
 14   RemoteNode                COOR   41652230   - REMOTE
 15   GatherNode                COOR   41652230   - GATHER
  8   SubqueryNode              COOR   41652230   - LET ec = ... /* subquery */
  3   SingletonNode             COOR          1   * ROOT
 11   CalculationNode           COOR          1   - LET #15 = true /* json expression */ /* const assignment */
  4   TraversalNode             COOR          9   - FOR v /* vertex */, e /* edge */ IN 1..1 /* min..maxPathDepth */ OUTBOUND i /* startnode */ GRAPH 'twitter'
  6   CollectNode               COOR          9   - COLLECT #11 = e /* distinct */
  7   ReturnNode                COOR          9   - RETURN #15
  9   CalculationNode           COOR   41652230   - LET #13 = COUNT(ec) /* simple expression */
 10   ReturnNode                COOR   41652230   - RETURN #13
Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
  4   edge   follow       false    false            n/a   [ `_from` ]   base OUTBOUND
Functions used:
 Name    Deterministic   Cacheable   Uses V8
 COUNT   true            true        false
Traversals on graphs:
 Id   Depth   Vertex collections   Edge collections   Options                                   Filter / Prune Conditions
  4   1..1    users                follow             uniqueVertices: none, uniqueEdges: path
Optimization rules applied:
 Id   RuleName
  1   remove-unnecessary-calculations
  2   optimize-subqueries
  3   move-calculations-up-2
  4   optimize-traversals
  5   scatter-in-cluster
  6   remove-unnecessary-remote-scatter