Possible memory leak / regression in 3.3+ · Issue #5579 · arangodb/arangodb · GitHub

Possible memory leak / regression in 3.3+ #5579


Closed · 5 tasks done
choppedpork opened this issue Jun 11, 2018 · 43 comments
Labels: 1 Analyzing, 3 OOM (System runs out of memory / resources)

Comments

@choppedpork commented Jun 11, 2018

my environment running ArangoDB

I'm using ArangoDB version:

  • 3.3.8 (subsequently upgraded to 3.3.10)

Mode:

  • Cluster

Storage-Engine:

  • rocksdb

On this operating system:

  • Linux
    • other: self-built Docker container using the latest Ubuntu package

this is an installation-related issue:

Hello,

Since upgrading to 3.3 we've noticed a rather drastic increase in memory usage of server nodes over time. I've had a look at some of the existing open tickets and this looks quite similar to #5414 - I've decided to open a separate issue and let you decide though.

We're seeing this problem only in server instances - both the agency and coordinator nodes have very stable memory usage. Our standard deployment model for Arango is 3x agency, 3x coordinator and 3x server. We're using the following config options (I've skipped ones that seem irrelevant to me like directory locations, ip addresses etc - let me know if you'd like the full list):

--server.authentication=false
--cluster.my-role PRIMARY
--log.level info
--javascript.v8-contexts 16
--javascript.v8-max-heap 3072
--server.storage-engine rocksdb

Here's a list of sysctls we're setting:

- { name: "vm.max_map_count", value: "262144" }
- { name: "vm.overcommit_memory", value: "0" } # arangodb recommend setting this to 2 but this causes a lot of issues bringing other containers up
- { name: "vm.zone_reclaim_mode", value: "0" }
- { name: "vm.swappiness", value: "1" }
- { name: "net.core.somaxconn", value: "65535" }

net.core.somaxconn is also set (same value) on the Docker container.

We're setting the transparent_hugepage defrag and enabled properties to never.
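
(For reference, these settings can be applied on the host along these lines - standard sysctl and transparent hugepage interfaces, with the values simply mirroring the list above:)

  # apply the sysctls at runtime (as root)
  sysctl -w vm.max_map_count=262144
  sysctl -w vm.overcommit_memory=0
  sysctl -w vm.zone_reclaim_mode=0
  sysctl -w vm.swappiness=1
  sysctl -w net.core.somaxconn=65535

  # disable transparent hugepages until the next reboot
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag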

We've upgraded from 3.2.9 to 3.3.0 and have since used 3.3.3 and 3.3.4, and are on 3.3.8 now.
Here's the memory usage (RSS) for the server nodes in one of our environments (which got upgraded to 3.3.x around 25th April - note that this environment gets shut down in the evening every day, hence the large gaps in the graph):

[screenshot: server node RSS over several weeks]

This is the above graph zoomed in to the last 5 working day period:

[screenshot: the same graph zoomed in to the last 5 working days]

This is another environment; the upgrade to 3.3.x happened on 3rd of April. The change in the memory usage pattern on the 27th April was caused by applying a Docker memory limit of 2g.

[screenshot: server node RSS in a second environment]

The above environments have extremely light usage of Arango.

Here's one which gets used a bit more:
[screenshot: server node RSS in a more heavily used environment]

As a reference point of sorts, here's what it looks like when we run a load test against our application:
[screenshot: server node RSS during a load test]
The server nodes eventually tail off at 6.4GB and memory usage remains perfectly stable afterwards.

All of the above graphs were taken using the same settings I've mentioned earlier.

Let me know what other information would be useful to provide - I guess disabling statistics and/or Foxx queues would be something you might want us to try? If so, shall we disable both at once or try them one by one (and if so, what order would you prefer)?

Thanks,
Simon

edit: as part of the investigation I have upgraded from 3.3.8 to 3.3.10

@vinaypyati added the 3 OOM (System runs out of memory / resources) label on Jun 11, 2018
@graetzer (Contributor) commented Jun 13, 2018

Sorry for the trouble, but I think I will need a little bit more background here:
Can you give us some details about your application, or are you perhaps able to narrow this down to a specific AQL query (or something you do in a Foxx app, etc.)?

@choppedpork (Author)

Hello @graetzer,

Unfortunately there is no easy way for me to tell you what the application(s) are doing exactly as I am not the developer - is there a way to log all queries to a file for a period of time? This might help nail the usage down a bit better.

What I can definitely tell you is that there are no Foxx apps running and our usage is mostly very simple. To the best of my knowledge, we generally use simple document-fetching queries. We have a few graphs, but those are not present in every environment and not used frequently.

The usage would be different across the examples above - the first two graphs are from an environment with a few small databases and not much data.

The third graph is from an environment where Arango is deployed and the applications are connected to it, but it is not actually utilised.

The fourth graph is from an environment with considerably more Arango usage, a few databases and many collections (~500) and a bit more data than in the first one but still a very small amount (a full backup of all of these is... 12M). As opposed to the other environments, there is a small amount of usage of graphs within Arango here.

The last graph is from a load test which inserts documents but also queries them - in this case I can actually tell you what the query is - FOR t IN metadata_104434 FILTER t.objectId == @objectId RETURN t, one object at a time. I've included it here mostly as it seems that there is a natural ceiling to the memory usage - I'm assuming if the other environments had enough headroom they would also reach 6.4G and remain at that level.

I hope this helps - happy to dig in deeper if needed though!

@dothebart (Contributor)

At least for your last use case I can tell you that, for the time being, you should rather access single documents via the single-document API, not via AQL. This will be improved with ArangoDB 3.4.

@choppedpork (Author)

Thanks @dothebart - I will definitely pass this on to the developers. I've just run a query on an Arango cluster I'm executing a load test against, and the document API returns the same data in 12-15ms vs 35-40ms via AQL, so that's definitely worth investigating on our side (separately from this issue).
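
(Roughly what the two access paths look like over HTTP - a sketch only; the coordinator URL and objectId value are illustrative, and the single-document variant assumes the lookup value is stored as the document _key:)

  # AQL via the cursor API
  curl -s -X POST http://localhost:8529/_api/cursor \
    -d '{"query": "FOR t IN metadata_104434 FILTER t.objectId == @objectId RETURN t", "bindVars": {"objectId": "abc123"}}'

  # the same document via the single-document API
  curl -s http://localhost:8529/_api/document/metadata_104434/abc123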

@dothebart (Contributor)

Please also have a look at https://www.arangodb.com/2017/10/performance-analysis-pyarango-usage-scenarios/ - it explains what to monitor and how to do benchmarks.

@martin-macrometa

I am experiencing the same issue in 3.3.9 with RocksDB (I haven't tested with MMFiles). When I disable statistics, there no longer seems to be a memory leak. If I change STATISTICS_INTERVAL in js/server/modules/@arangodb/statistics.js, it influences the rate at which the memory grows. It seems that the leak is related to storing data, or in particular to storing the statistics.

@choppedpork (Author)

@martinkunevtoptal that's encouraging, thanks. I'll try disabling statistics in the most affected environment - we run into memory issues there once a day (on weekdays), so I should be able to confirm by Monday EOB whether we're indeed having the same issue.

@choppedpork (Author) commented Jun 19, 2018

Thanks @martinkunevtoptal - you seem to have hit the nail on the head!
It's quite interesting how differently it affects the different node types:

server:
[screenshot: server node RSS after disabling statistics]

agency:
[screenshot: agency node RSS after disabling statistics]

coordinator:
[screenshot: coordinator node RSS after disabling statistics]

Disabling statistics seems to entirely flatten out the memory usage for server nodes (bar some bumps which are almost certainly caused by actually using the database). On the coordinator nodes disabling stats has zero effect; on the agency nodes, quite the contrary - the difference is quite big. Interestingly, both the agency and coordinators seem to have a slight upward trend with statistics enabled as well as disabled.

Does this help @graetzer @dothebart?

We're going to test the effect of this change over a longer period of time as well - it's looking very promising, but it would be a shame to lose the statistics entirely (we monitor client connections and client request / total / queue / connection times, as well as thread usage and more, from _admin/statistics).

edit: forgot to add - I have also upgraded to 3.3.10 at the same time to make sure we're testing against the latest version.

@dsonet commented Jul 17, 2018

This is exactly what I ran into recently. The memory usage keeps increasing in a 3-node cluster on version 3.3.10 with MMFiles, with almost empty collections and no operations. So it seems it doesn't matter which engine is used; the cause lies elsewhere. I really hope this can be resolved soon.

Thank you very much.

@choppedpork (Author)

Hello,

I don't suppose there have been any updates on this one? We're currently on the verge of disabling statistics on all production environments (we've done it for all of our dev/qa envs), but losing the monitoring data might become a problem should we run into production issues :(

@dothebart (Contributor)

With the upcoming release of ArangoDB 3.4, all JavaScript implementations still in use in 3.3 have been replaced by native implementations. They shouldn't show similar issues. However, a huge change like that won't be backported.
In the end, you can use a regular monitoring system like collectd or Prometheus to gather similar statistics.

@choppedpork (Author)

That sounds great @dothebart! I've just had a look at the What's New in 3.4 document on your website - there are some very exciting changes in there. When you say

a huge change like that won't be backported

do you mean the statistics API format will change? We're currently using https://gitlab.com/flare/arangodb-exporter (slightly modified to work with 3.3) and scraping the data with Prometheus. The exporter relies on the /_admin/statistics and /_admin/statistics-description endpoints.
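
(For context, these are the two endpoints the exporter scrapes - a sketch against a coordinator, URL illustrative, with authentication disabled as in our setup:)

  # current figures (client connections, request/queue/connection times, threads, ...)
  curl -s http://localhost:8529/_admin/statistics

  # metadata describing the groups and figures above
  curl -s http://localhost:8529/_admin/statistics-description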

@dothebart (Contributor)

As @kwando pointed out in #4587, these will go away.
https://docs.arangodb.com/devel/Cookbook/Monitoring/Collectd.html was updated to tell you how to replace them.
However, this should mostly work in 3.3 already, so you can probably change your setup now.

@dothebart (Contributor)

#5414 is probably related to / a duplicate of this issue.

@martin-macrometa

@dothebart So if I understand correctly the memory leak is due to the javascript implementations used on the server side. Is this correct?

@dothebart (Contributor)

Fact is, in 3.3 this is still implemented in .js, and it's a bit hard to tell when references etc. lose their binding in V8. This will definitely go away with the native implementation.

@martin-macrometa

I can confirm the memory leak is still present in v3.4.0-rc1. It seems as of this point the cause is still unknown.

@jsteemann self-assigned this on Nov 6, 2018
@SaulDoesCode commented Nov 6, 2018

v3.4.0-rc4 still has this issue; I've had multiple crashes from what appears to be a serious memory leak.

Memory Leak and Crash

This is the only input command.
The only other activity is minimal writes (logging, essentially) - no worse than what ArangoDB's statistics generate by themselves.

docker run -p 8529:8529 -p 8530:8530 -d --name rango \
  -v /var/lib/arangodb3:/var/lib/arangodb3 \
  -v /etc/letsencrypt:/etc/letsencrypt \
  -e ARANGO_ROOT_PASSWORD=XXXXXX \
  -e ARANGO_STORAGE_ENGINE=rocksdb \
  arangodb/arangodb-preview:v3.4.0-rc.4 \
  arangod --server.endpoint ssl://[::]:8530 --server.authentication true \
    --ssl.keyfile /etc/letsencrypt/live/mysite.io/server.key --ssl.session-cache true \
    --server.maximal-threads 1 --log.level trace

@jsteemann (Contributor)

@SaulDoesCode @martinkunevtoptal @choppedpork: A quick update on this issue:

It seems that there is an issue with the bundled jemalloc memory allocator in some environments.
With a vm.overcommit_memory kernel setting of 2, the allocator had a problem splitting existing memory mappings, which made the number of memory mappings of an arangod process grow over time. This could have led to the kernel refusing to hand out more memory to the arangod process, even if physical memory was still available. The kernel will only grant up to vm.max_map_count memory mappings to each process, which defaults to 65530 on most Linuxes I think.
Another issue when running jemalloc with vm.overcommit_memory set to 2 is that, for some workloads, the amount of memory that the Linux kernel tracks as "committed memory" also grows over time and does not decrease. So eventually an arangod process may hit a wall simply because it reaches the configured overcommit limit (physical RAM * overcommit_ratio + swap space).

As dumb as it sounds, the solution here is to modify the value of vm.overcommit_memory from 2 to either 1 or 0. This will fix both of these problems.
We are still observing ever-increasing virtual memory consumption when using jemalloc with any overcommit setting, but in practice this should not cause problems.
So if you could try adjusting the value of vm.overcommit_memory from 2 to either 0 or 1 (0 is the Linux kernel default btw.) that may already help a lot.

Another way to address the problem, which however requires compilation of ArangoDB from source, is to compile a build without jemalloc (-DUSE_JEMALLOC=Off when cmaking). I am just listing this as an alternative here for completeness. With the system's libc allocator you should see quite stable memory usage. We also tried another allocator, precisely the one from libmusl, and this also shows quite stable memory usage over time. The main problem here which makes exchanging the allocator a non-trivial issue is that jemalloc has very nice performance characteristics otherwise.
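
(A sketch of the suggested adjustment - the sysctl.d file name is arbitrary, and the cmake line is only relevant if you build from source:)

  # switch the overcommit policy at runtime (0 is the kernel default)
  sysctl -w vm.overcommit_memory=0

  # persist the setting across reboots
  echo "vm.overcommit_memory = 0" > /etc/sysctl.d/60-overcommit.conf

  # alternative for source builds: disable the bundled jemalloc allocator
  cmake .. -DUSE_JEMALLOC=Off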

@jsteemann (Contributor) commented Dec 5, 2018

@SaulDoesCode @martinkunevtoptal @choppedpork: Another update on this issue:

We have identified a few issues in this context:

For edge collections, a consumer of memory can be the in-memory edge cache, which can be limited using the startup option --cache.size. This hasn't changed since 3.2.x and shouldn't be a problem if configured sensibly. I am only mentioning it here for the sake of completeness.

  • as noted in my previous comment, an important configuration change is to set vm.overcommit_memory to either 0 or 1.

The biggest long-term consumers of memory while the server is running with the RocksDB engine will likely be the RocksDB block cache plus some in-memory RocksDB write buffers. The max size of the RocksDB block cache can be controlled via the option --rocksdb.block-cache-size since 3.2.x, so this should be good. Up to and including 3.3.19 there was no option in ArangoDB to limit the total amount of memory used by RocksDB for write buffers. 3.3.20 now provides an option --rocksdb.total-write-buffer-size for effectively capping the amount of memory used by write buffers. It will default to 0 in 3.3.20, meaning "unbounded". I suggest setting it to a value such as 30% to 50% of available physical RAM, but the most sensible amount depends on the role of the server and what else is running on it. With these two options in place, the overall memory consumption should be mostly bounded.

We also found that AQL REMOVE queries that delete a lot of documents will read the documents into memory first before starting the actual deletion. This can cause spikes in memory usage, which are ideally avoided. We have a PR for 3.3 (it will likely land in 3.3.21) fixing this: #7643. A workaround until then is to not read the full documents but only the document keys, e.g. instead of using FOR doc IN collection FILTER ... REMOVE doc IN collection just use FOR doc IN collection FILTER ... REMOVE doc._key IN collection. 3.4.1 and onwards will do this automatically, by the way.

Apart from the above-mentioned issues, we are not aware of anything else that would indicate memory usage growth beyond the configured limits.
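
(Putting the knobs from this comment together, a startup sketch for a DB server - the byte values are purely illustrative and should be sized against the host's RAM and the other processes on it; the usual options such as the database directory are omitted:)

  # edge cache cap (256 MB), RocksDB block cache cap (1 GB),
  # and RocksDB total write buffer cap (1 GB, available from 3.3.20 onwards)
  arangod \
    --cache.size 268435456 \
    --rocksdb.block-cache-size 1073741824 \
    --rocksdb.total-write-buffer-size 1073741824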

@beast-in-black commented Feb 23, 2019

@jsteemann @martinkunevtoptal @choppedpork

I can confirm this issue on v3.3.22 with the RocksDB engine. It does not seem to occur with the MMFiles engine; the runaway memory growth seems to be specific to the usage of the RocksDB engine.

In four hours of running a freshly compiled 3.3.22 build and starting it with scripts/startLocalCluster.sh -t ssl -r true -a 1 -d 1 -c 1 -j mysecret, the DB Primary's Resident Set Size (RSS) as reported by ps -eo vsz,rss,comm,pid | grep <DBPRIMARY_PID> went from 216 MB RSS to 649 MB RSS (a rise in excess of 100MB per hour) with absolutely no external activity on the cluster.

I have attached a CSV with the VSZ and RSS captured every 60 seconds (remove the .txt extension).
3.3.22-rocksdb-memprofile.csv.txt
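
(A capture like that can be scripted with a simple loop - a sketch, assuming DBPRIMARY_PID holds the pid found via the ps/grep above:)

  # append a timestamped VSZ/RSS sample for the DB Primary every 60 seconds
  while true; do
    ps -o vsz=,rss= -p "$DBPRIMARY_PID" | awk -v ts="$(date +%s)" '{print ts "," $1 "," $2}' >> 3.3.22-rocksdb-memprofile.csv
    sleep 60
  done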

  • vm.overcommit_memory is set to 0 (Linux kernel default) on my Ubuntu 16.04 laptop
  • GLIBCXX_FORCE_NEW=1 is set and exported.
  • No extra parameters were passed to arangod startup - all parameters are default as set internally in arangod and in the startLocalCluster.sh script.

I will try limiting the RocksDB cache and write buffers as recommended by @jsteemann above, capture data and post it here. I have tried this earlier to no effect, but I can only speak anecdotally since I do not have RSS and other memory usage stats from those attempts.

@jsteemann (Contributor)

@beast-in-black : thanks for the detailed report. I guess if vm.overcommit_memory is set to 0 and there is no activity in the cluster, then the internal statistics are to blame for the memory usage growth. The statistics are calculated and stored every few seconds, and every capture will cause one or two writes into the RocksDB engine. And RocksDB will keep the writes buffered in memory for a while.

This can be attacked from two sides:

  • disabling the statistics altogether via --server.statistics false.
  • using --rocksdb.total-write-buffer-size and setting it to a cap value > 0 (btw. the size of the write buffers is capped by default in 3.4.x, but not in 3.3.x).
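
(In startup-option form, the two approaches might look like this - a sketch only; the write-buffer cap value is illustrative and the remaining startup options are elided:)

  # either: drop the statistics entirely
  arangod --server.statistics false ...

  # or: keep statistics but cap the RocksDB write buffers (value in bytes)
  arangod --rocksdb.total-write-buffer-size 536870912 ...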

@srics commented Feb 28, 2019

@jsteemann fyi, we are hitting this in 3.4.2.1-1 as well and had to disable server.statistics to keep it running on a memory-constrained device. I would suggest the team consider disabling this feature by default and have the user explicitly enable it if there is a need to capture server statistics.

@beast-in-black

@jsteemann apologies for the delay in responding; I have been conducting some tests on ArangoDB in between my other work tasks.

For our use case, turning off statistics is unfortunately not an option. However, I made the following changes (directly changed the defaults in the C++ code):

  • --rocksdb.total-write-buffer-size 314572800 (300 MB)
  • --rocksdb.block-cache-size 268435456 (256 MB)
  • --cache.size 268435456 (256 MB)

With the above changes, I notice the following:

  1. Running on my local Ubuntu laptop with the following configuration (using the same startLocalCluster.sh settings as in my previous comment with 1 DB Primary, 1 agency, 1 coordinator) shows that the memory usage (RSS) rises to about 1.2GB overnight but then stays steady at that limit, oscillating up and down by a couple of hundred MB but more or less clamped at around 1.2GB.

    • System settings:
      • Distro: Ubuntu 16.04 with all the latest updates
      • Kernel:
      Linux <SYSTEM_NAME_REDACTED> 4.15.0-45-generic #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
      
      • system gcc version gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
  2. Running arango as a Docker image with the above settings on AWS under Kubernetes continues to show a memory increase until the DB primary pod is OOM-killed (1 DB Primary, 1 agency, 1 coordinator).

    • AMI Instance runs debian stretch with the following kernel:
    Linux <IP REDACTED> #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
    
    • Docker image is based on the arango helper stretch image, compiled with gcc/gcc 6

My use case requires running it as Docker on K8s as mentioned in (2) above. So #5414 is applicable to me as well (cc @martinkunevtoptal ).

At this point, I am at a loss to explain the following things I have noticed in my testing:

  1. Why it seems to behave itself when run locally but displays the increasing memory usage on K8s/Docker. I even tried deploying an Ubuntu 16.04 AMI (instance kernel 4.4.0-1075-aws #85-Ubuntu SMP ) and using an arango docker image compiled using Ubuntu 16.04 with gcc version 5.4 to make it the same as my dev laptop (I changed the arango docker build framework files to do this), but that did not help either. I am wondering whether the memory issue when run as a docker image under k8s is due to some memory-related interaction between arango and docker/k8s.
  2. When I write some documents to the DB in a collection and then delete the collection, the memory usage does not drop back to what it was before writing the data. This indicates to me that there may be something leaking in the document/collection deletion code path.
  3. Why, if I have clamped rocksdb cache/buffers to a max of 300MB, does the memory usage rise to 4x that amount (1.2GB) and then steady itself at that 4x level on my local laptop when run using startLocalCluster.sh?

Do you have any further suggestions especially for the docker+K8s case?

@Simran-B (Contributor) commented Mar 2, 2019
  1. There is more than just the storage engine to the database system, and also not just write buffers. If you observe 1.2GB memory usage, then that is the total in use by ArangoDB, isn't it? I would expect the memory usage to be at least the sum of all caches plus the size of the executable (at least partially), overhead because of threads and services like HTTP/VelocyStream interfaces, i18n data (ICU), V8 stuff, and there needs to be some space for processing queries and storing intermediate results even if it's just for the server statistics.

@beast-in-black

@Simran-B good point. What I notice is that the RSS memory starts at about 300MB when the cluster is first started up, and then rises in a few hours to the 1.2GB limit. I had assumed that with no external inputs to the system (no data being written/read) then all things being equal, the rise in RSS memory usage from 300MB to 1.2GB was purely being caused by the internal statistics writes. Therefore, the extra memory required by the rest of the arango framework which you've mentioned should not be a factor in the memory rise. Is my assumption in error?

@Simran-B (Contributor) commented Mar 2, 2019

The buffers/caches might start out at 0 MB and grow slowly until they reach the configured limits, most likely driven by the writes for the statistics every other second. In that case the 300 MB at start would be everything else that is put into memory for or by ArangoDB except the buffers/caches.

300 + 300 + 256 + 256 = ~1.1 GB

@bischoje

I think the last few comments may be a tangent from the real issue here. I think the real issue could be what @beast-in-black observed:

When I write some documents to the DB in a collection and then delete the collection, the memory usage does not drop back to what it was before writing the data. This indicates to me that there may be something leaking in the document/collection deletion code path.

I believe we are seeing similar behavior. I logged my issue as a series of comments against #5414.

I think there are some promising observations in the comments in this issue, and I'd like to see the conversation continue.

Maybe we can write a simple test program to add/delete collections in kubernetes to demonstrate the leak?
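
Something along these lines could serve as a starting point - just plain curl against the documented collection/document endpoints; the coordinator URL, collection name, iteration counts and the DBSERVER_PID variable are all placeholders to fill in:

  COORD=http://localhost:8529
  for i in $(seq 1 100); do
    # create a collection, write a batch of documents, then drop the collection
    curl -s -X POST "$COORD/_api/collection" -d '{"name": "leaktest"}' > /dev/null
    for j in $(seq 1 1000); do
      curl -s -X POST "$COORD/_api/document/leaktest" -d "{\"value\": $j}" > /dev/null
    done
    curl -s -X DELETE "$COORD/_api/collection/leaktest" > /dev/null
    # the DB server RSS should return to a baseline here if nothing leaks
    ps -o rss= -p "$DBSERVER_PID"
  done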

@choppedpork (Author)

Woosh - it's almost a year since the last comments!

With the new (and exciting) native Prometheus metrics support in 3.6 we've tried enabling statistics gathering again, but unfortunately we're either encountering the same leak or there's something new at play. I'm currently switching off statistics in this test environment so that I can verify whether the statistics are at fault or not. Interestingly, we do not currently collect those metrics (there's some further work we need to do before we'll be able to integrate this).

Here's the last 30 days of memory usage (this is version 3.4.8 until 27th March, when we upgraded to 3.6.1 via an intermediate upgrade to 3.5.0, as per the docs). Interestingly, both the agency and coordinator nodes have lower memory usage now and do not seem to exhibit the same leaky pattern as the server nodes:

Agency:
[graph: agency memory usage over the last 30 days]

Coordinators:
[graph: coordinator memory usage over the last 30 days]

Servers:
[graph: server memory usage over the last 30 days]

@choppedpork changed the title from "Possible memory leak / regression in 3.3" to "Possible memory leak / regression in 3.3+" on Apr 8, 2020
@choppedpork (Author) commented Apr 8, 2020

Looks like, both unfortunately (no metrics 😢) and fortunately (as this means there's nothing stopping us from upgrading), the observed behaviour is indeed related to the --server.statistics option being set to true. It's not been long since we flipped it back to false, but you can see the leaky sawtooth-shaped pattern is no longer visible:

[graph: server memory usage after flipping --server.statistics back to false]

One thing that got me thinking is the sawtooth shape - this looks a lot like a struggling garbage collector to me. As the javascript stats implementation was replaced with a native one in 3.4 (and when looking at these graphs - the pattern is different now, the leak is much slower and the pattern is a more defined / cleaner sawtooth than in my original posts) - could this be something sitting on the edge between native code and V8?

edit: one more thing... the original issue with version 3.3 had this occurring across all three node types, but we're definitely only seeing this on server nodes now. I wonder if it's possible that there have actually been two separate leaks - the original one, which was fixed by moving from JS to a native implementation, and what we're seeing now, a different issue which seems to have been present in 3.3 as well (the speed of the leak was significantly higher for server nodes back then)?

@beast-in-black

@choppedpork I haven't worked on this since last April and am no longer involved in any ArangoDB work, but back then some of the things I had fruitlessly tried were:

  1. Changed the code to make the V8 GC as well as the RocksDB GC much more aggressive. However, it did not help.
  2. Experimented with tweaking other RocksDB settings but had no joy there either.
  3. A couple of us also instrumented Arango with some profilers as well as went through the main codebase to see if there were any obvious leaks, but didn't find anything that jumped out at us.

With the above tweaks in place, I had found that if Arango was run directly on the machine - i.e. not in a docker container under K8s, the memory (RSS) did stabilize after a while, but the same code inside a docker container in K8s still exhibited the runaway memory growth, leading to an eventual OOM-kill of the container. Also, after the container was killed (or otherwise restarted), the memory drops down but then steadily grows again.

As you and others (including myself) have noted, the runaway memory growth does not occur if stats are turned off, but unfortunately this was not an option for my team in production. There is also the other issue I noticed about the memory usage not dropping back down after adding data to a regular collection and then removing it...

@dothebart (Contributor)

Hi,
thanks for the update. The actual issue is that the RocksDB block cache is set up in the wrong way.
With ArangoDB 3.6.3 we will introduce a workaround - please set the memory actually available to the Docker container via the environment variable ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY.
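
(For a plain docker run, that might look like the sketch below - the image tag and the 4g limit are illustrative; under Kubernetes the same value would go into the pod spec's env section:)

  docker run -d --name arangodb \
    --memory=4g \
    -e ARANGO_ROOT_PASSWORD=XXXXXX \
    -e ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY=4096m \
    arangodb/arangodb:3.6.3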

@dothebart added this to the 3.6.3 milestone on Apr 14, 2020
@choppedpork (Author)

That's great news @dothebart - thanks! We should be able to add this early this week and report back by the end of the week / early next week.

@dothebart (Contributor)

Please note that 3.6.3 is not yet available. However, adding the environment variable already can't hurt.

@maxkernbach (Contributor)

Hi @choppedpork,

Version 3.6.3 has just been released. Could you please upgrade your deployment, add the environment variable ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY as suggested, and check if this helps?

@choppedpork (Author)

Awesome! Thanks for the heads-up @maxkernbach, we're going to deploy the new version this week and will keep you posted.

@sweeneki commented May 1, 2020

Hello everyone! I tried this suggestion (ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY), but unfortunately it didn't work. Turning off the logging did work, though that's not ideal. Let me know if you need any details from my configuration. Regards...

@dothebart (Contributor)

Hi @sweeneki - did you install 3.6.3? Please tell us a bit more about your installation - see the first post or the GitHub issue template.

@sweeneki commented May 5, 2020 via email

@dothebart (Contributor)

Hi,
you should rather try using RocksDB - MMFiles is discontinued as of ArangoDB 3.7.

https://www.arangodb.com/docs/stable/tutorials-reduce-memory-footprint.html
explains how to cut down on resources.

@sweeneki commented May 6, 2020 via email

@choppedpork (Author)

Hello @dothebart, @maxkernbach;

Thanks for your continued support here - apologies it's taken me a few days, but with this issue we're currently talking about memory usage across weeks rather than days, so I wanted to make sure I had some quality data before coming back to you (the first few days were not very promising).
Where we seem to have got to so far is certainly interesting. The memory usage has definitely improved, but at the same time the difference between statistics enabled vs disabled is still pronounced.

[graph: server memory usage across the 3.6.1 to 3.6.3 upgrade]

This is definitely an improvement but it's interesting to see how much more memory is in use with statistics enabled. In the graph above the first part is 3.6.1 w/o statistics, middle part was just an unrelated restart, the rightmost part is 3.6.3 with statistics disabled. ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY is set to 4096m in this case, equal to the memory limit on the containers themselves.

I would like to wait 1-2 weeks before closing this issue, if that's OK with you? The memory usage patterns do seem to change over time, so while it's highly unlikely, there could still be dragons. What are your thoughts about the increased memory usage with statistics enabled? Would that be worth opening a new issue for (I can imagine that, just like me, you'd like to see this one closed soon 😄)?

PS. Having stumbled into some memory usage issues after upgrading (which turned out to be caused by usage of views, which we've decided to scrap as we didn't need them anymore), I've been running some load tests to try and identify the problem, and I do have to say 3.6 is performing 40% faster in comparison to 3.4, all while using less CPU and less memory - impressive work 👏

@dothebart (Contributor)

As discussed in #11577, exposing more RocksDB options on the command line seems to ease the memory situation.

Namely:

  • --rocksdb.cache-index-and-filter-blocks-with-high-priority
  • --rocksdb.pin-l0-filter-and-index-blocks-in-cache
  • --rocksdb.pin-top-level-index-and-filter

These have been added to the now released ArangoDB 3.7 - hence closing this.
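
(For reference, a sketch of passing the new options on a 3.7 startup - assuming they are boolean toggles like the underlying RocksDB settings; combine them with the cache and write-buffer caps discussed earlier in this thread:)

  arangod \
    --rocksdb.cache-index-and-filter-blocks-with-high-priority true \
    --rocksdb.pin-l0-filter-and-index-blocks-in-cache true \
    --rocksdb.pin-top-level-index-and-filter true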
