Description
My environment running ArangoDB
I'm using ArangoDB version:
- 3.3.8 (subsequently upgraded to 3.3.10)
Mode:
- Cluster
Storage-Engine:
- rocksdb
On this operating system:
- Linux
- other: self-built Docker container using the latest Ubuntu package
this is an installation-related issue:
Hello,
Since upgrading to 3.3 we've noticed a rather drastic increase in memory usage of server nodes over time. I've had a look at some of the existing open tickets and this looks quite similar to #5414 - I've decided to open a separate issue and let you decide though.
We're seeing this problem only in server instances - both the agency and coordinator nodes have very stable memory usage. Our standard deployment model for Arango is 3x agency, 3x coordinator and 3x server. We're using the following config options (I've skipped ones that seem irrelevant to me, like directory locations, IP addresses etc. - let me know if you'd like the full list):
--server.authentication=false
--cluster.my-role PRIMARY
--log.level info
--javascript.v8-contexts 16
--javascript.v8-max-heap 3072
--server.storage-engine rocksdb
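(For readability, here's roughly how those flags would look in config-file form - just a sketch, we actually pass them on the command line as listed above:)

```
[server]
authentication = false
storage-engine = rocksdb

[cluster]
my-role = PRIMARY

[log]
level = info

[javascript]
v8-contexts = 16
v8-max-heap = 3072
```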
Here's a list of sysctls we're setting:
- { name: "vm.max_map_count", value: "262144" }
- { name: "vm.overcommit_memory", value: "0" } # arangodb recommend setting this to 2 but this causes a lot of issues bringing other containers up
- { name: "vm.zone_reclaim_mode", value: "0" }
- { name: "vm.swappiness", value: "1" }
- { name: "net.core.somaxconn", value: "65535" }
net.core.somaxconn is also set (same value) on the Docker container. We're setting the transparent_hugepage defrag and enabled properties to never.
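For completeness, here's roughly how that host-level tuning gets applied in our setup (a sketch only - we actually do this via Ansible, and the image name below is a placeholder):

```
# Host-level sysctls (shown here as plain sysctl calls)
sysctl -w vm.max_map_count=262144
sysctl -w vm.overcommit_memory=0    # ArangoDB recommends 2, see note above
sysctl -w vm.zone_reclaim_mode=0
sysctl -w vm.swappiness=1
sysctl -w net.core.somaxconn=65535

# Transparent hugepages disabled on the host
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# somaxconn is mirrored inside the container when it's started
docker run --sysctl net.core.somaxconn=65535 ... <our-arangodb-image> ...
```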
We upgraded from 3.2.9 to 3.3.0 and have since gone through 3.3.3 and 3.3.4, and are on 3.3.8 now.
Here's the memory usage (RSS) for the server nodes in one of our environments, which was upgraded to 3.3.x around 25th April (note that this environment gets shut down every evening, hence the large gaps in the graph):
This is the above graph zoomed in on the last five working days:
This is another environment, where the upgrade to 3.3.x happened on 3rd April. The change in the memory usage pattern on 27th April was caused by applying a Docker memory limit of 2g.
The above environments have extremely light usage of Arango.
Here's one that gets used a bit more:
As a reference point of sorts, here's what it looks like when we run a load test against our application:
The server nodes eventually level off at 6.4GB and memory usage remains perfectly stable afterwards.
All of the above graphs were taken using the same settings I've mentioned earlier.
Let me know what other information would be useful to provide - I guess disabling statistics and/or Foxx queues is something you might want us to try? If so, shall we disable both at once or try them one by one (and if one by one, which order would you prefer)?
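For reference, I assume these are the startup options we'd be toggling for that experiment (just stating my assumption so we can confirm we mean the same thing):

```
--server.statistics false   # disable statistics gathering
--foxx.queues false         # disable the Foxx queues job manager
```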
Thanks,
Simon
Edit: as part of the investigation I have upgraded from 3.3.8 to 3.3.10.