Description
My environment running ArangoDB
I'm using ArangoDB version:
- 3.3.8 (subsequently upgraded to 3.3.10)
Mode:
- Cluster
Storage-Engine:
- rocksdb
On this operating system:
- Linux
- other: self-built Docker container using the latest Ubuntu package
this is an installation-related issue:
Hello,
Since upgrading to 3.3 we've noticed a rather drastic increase in memory usage of server nodes over time. I've had a look at some of the existing open tickets and this looks quite similar to #5414 - I've decided to open a separate issue and let you decide though.
We're seeing this problem only in server instances - both the agency and coordinator nodes have very stable memory usage. Our standard deployment model for Arango is 3x agency, 3x coordinator and 3x server. We're using the following config options (I've skipped ones that seem irrelevant to me, like directory locations, IP addresses etc. - let me know if you'd like the full list):
--server.authentication=false
--cluster.my-role PRIMARY
--log.level info
--javascript.v8-contexts 16
--javascript.v8-max-heap 3072
--server.storage-engine rocksdb
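(For readability, here's roughly how those flags would look in config-file form - just a sketch, we actually pass them on the command line as listed above:)

```
[server]
authentication = false
storage-engine = rocksdb

[cluster]
my-role = PRIMARY

[log]
level = info

[javascript]
v8-contexts = 16
v8-max-heap = 3072
```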
Here's a list of sysctls we're setting:
- { name: "vm.max_map_count", value: "262144" }
- { name: "vm.overcommit_memory", value: "0" } # arangodb recommend setting this to 2 but this causes a lot of issues bringing other containers up
- { name: "vm.zone_reclaim_mode", value: "0" }
- { name: "vm.swappiness", value: "1" }
- { name: "net.core.somaxconn", value: "65535" }
net.core.somaxconn is also set (same value) on the Docker container. We're setting the transparent_hugepage defrag and enabled properties to never.
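For completeness, here's roughly how that host-level tuning gets applied in our setup (a sketch only - we actually do this via Ansible, and the image name below is a placeholder):

```
# Host-level sysctls (shown here as plain sysctl calls)
sysctl -w vm.max_map_count=262144
sysctl -w vm.overcommit_memory=0    # ArangoDB recommends 2, see note above
sysctl -w vm.zone_reclaim_mode=0
sysctl -w vm.swappiness=1
sysctl -w net.core.somaxconn=65535

# Transparent hugepages disabled on the host
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# somaxconn is mirrored inside the container when it's started
docker run --sysctl net.core.somaxconn=65535 ... <our-arangodb-image> ...
```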
We upgraded from 3.2.9 to 3.3.0 and have since gone through 3.3.3 and 3.3.4, and are on 3.3.8 now.
Here's the memory usage (RSS) for the server nodes in one of our environments, which was upgraded to 3.3.x around 25th April (note that this environment gets shut down every evening, hence the large gaps in the graph):
This is the above graph zoomed in on the last five working days:
This is another environment, where the upgrade to 3.3.x happened on 3rd April. The change in the memory usage pattern on 27th April was caused by applying a Docker memory limit of 2g.
The above environments have extremely light usage of Arango.
Here's one that gets used a bit more:
As a reference point of sorts, here's what it looks like when we run a load test against our application:
The server nodes eventually level off at 6.4GB and memory usage remains perfectly stable afterwards.
All of the above graphs were taken using the same settings I've mentioned earlier.
Let me know what other information would be useful to provide - I guess disabling statistics and/or Foxx queues is something you might want us to try? If so, shall we disable both at once or try them one by one (and if one by one, which order would you prefer)?
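For reference, I assume these are the startup options we'd be toggling for that experiment (just stating my assumption so we can confirm we mean the same thing):

```
--server.statistics false   # disable statistics gathering
--foxx.queues false         # disable the Foxx queues job manager
```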
Thanks,
Simon
Edit: as part of the investigation I have upgraded from 3.3.8 to 3.3.10.