Possible memory leak / regression in 3.3+ #5579
Sorry for the trouble, but I think I will need a little bit more background here:
Hello @graetzer, unfortunately there is no easy way for me to tell you exactly what the application(s) are doing as I am not the developer - is there a way to log all queries to a file for a period of time? This might help nail the usage down a bit better.

What I can definitely tell you is that there are no Foxx apps running and our usage is mostly very simple. To the best of my knowledge, we generally use simple document fetching queries. We have a few graphs but those are not present in every environment and not used frequently.

The usage differs across the examples above - the first two graphs are from an environment with a few small databases and not much data. The third graph is from an environment where Arango is deployed and the applications are connected to it, but it is not actually utilised. The fourth graph is from an environment with considerably more Arango usage: a few databases, many collections (~500) and a bit more data than in the first one, but still a very small amount (a full backup of all of these is... 12M). As opposed to the other environments, there is a small amount of graph usage within Arango here. The last graph is from a load test which inserts documents but also queries them - in this case I can actually tell you what the query is - I hope this helps - happy to dig in deeper if needed though!
At least for your last use case I can tell you that, for the time being, you should rather access single documents via the single document API, not via AQL. This will be improved with ArangoDB 3.4.
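For anyone comparing the two approaches, here is a rough sketch of the difference over the HTTP API; the collection name `users` and the key `12345` are made up, and authentication flags are omitted.

```sh
# Direct single-document lookup via the document API:
curl -s http://localhost:8529/_api/document/users/12345

# The same lookup routed through AQL via the cursor API:
curl -s -X POST http://localhost:8529/_api/cursor \
  -d '{"query": "FOR u IN users FILTER u._key == \"12345\" RETURN u"}'
```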
Thanks @dothebart - I will definitely pass this on to the developers. I've just run a query on an Arango cluster I'm executing a load test against, and the document API returns the same data in 12-15ms vs 35-40ms via AQL, so that's definitely worth investigating on our side (separately from this issue).
Please also have a look at https://www.arangodb.com/2017/10/performance-analysis-pyarango-usage-scenarios/ - it explains what to monitor and how to do benchmarks.
I am experiencing the same issue in 3.3.9 with RocksDB (haven't tested with MMFiles). When I disable statistics, there no longer seems to be a memory leak. If I change STATISTICS_INTERVAL in js/server/modules/@arangodb/statistics.js, this influences the rate at which the memory grows. It seems that the leak is related to storing data, or in particular to storing statistics.
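For reference, statistics gathering can be switched off at startup; the flag below is the documented option name, shown here only as a sketch.

```sh
# Start arangod with the statistics subsystem disabled
# (on 3.3.x the JS-side interval lives in js/server/modules/@arangodb/statistics.js, as noted above):
arangod --server.statistics false
```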
@martinkunevtoptal that's encouraging, thanks. I'll try disabling statistics in the most affected environment as we run into memory issues there about once a day, so I should be able to confirm by Monday EOB whether we're indeed seeing the same issue.
Thanks @martinkunevtoptal - you seem to have hit the nail on the head! Disabling statistics seems to entirely flatten out the memory usage for server nodes (bar some bumps which would almost certainly be caused by actually using the database). On the coordinator nodes disabling stats has zero effect; on the agency nodes, quite the contrary - the difference is quite big. Interestingly, both the agency and coordinators seem to have a slight upward trend with statistics both enabled and disabled. Does this help @graetzer @dothebart?

We're going to test the effect of this change over a longer period of time as well, as it's looking very promising, but it would be a shame to lose the statistics entirely (we monitor client connections and client request / total / queue / connection times, as well as thread usage and more).

edit: forgot to add - I have also upgraded to 3.3.10 at the same time to make sure we're testing against the latest version.
Exactly what I ran into recently. The memory usage keeps increasing in a 3-node cluster. I am using version 3.3.10 with MMFiles and almost empty collections with no operations, so it seems this does not depend on the storage engine but on something else. I really hope this can be resolved soon. Thank you very much.
Hello, I don't suppose there have been any updates on this one? We're currently on the verge of disabling statistics on all production environments (we've done it for all of our dev/qa envs), but losing the monitoring data might become a problem should we run into production issues :(
With the upcoming release of ArangoDB 3.4, all javascript implementations still in use in 3.3 have been replaced by native implementations. They shouldn't show similar issues. However, a huge change like that won't be backported.
That sounds great @dothebart! I've just had a look at the What's New in 3.4 document on your website, there are some very exciting changes in there. When you say
do you mean the statistics API format will change? We're currently using https://gitlab.com/flare/arangodb-exporter (slightly modified to work with 3.3) and scraping the data with Prometheus. The exporter relies on the
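For context, a plain curl against the statistics endpoints shows the raw data such an exporter would typically poll; whether the exporter uses exactly these paths is an assumption here, these are the documented `/_admin` routes.

```sh
# Raw statistics figures (only populated while statistics gathering is enabled):
curl -s -u root:"$ARANGO_PASSWORD" http://localhost:8529/_admin/statistics

# Descriptions and units for the figures above:
curl -s -u root:"$ARANGO_PASSWORD" http://localhost:8529/_admin/statistics-description
```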
As @kwando pointed out in #4587 these will go away. |
#5414 is probably related to / a duplicate of this issue.
@dothebart So if I understand correctly the memory leak is due to the javascript implementations used on the server side. Is this correct? |
Fact is, in 3.3 this is still implemented in .js; and it's a bit hard to tell when references etc. lose their binding in V8. This will definitely go away with the native implementation.
I can confirm the memory leak is still present in v3.4.0-rc1. It seems that at this point the cause is still unknown.
v3.4.0-rc4 still has this issue; I've had multiple crashes from what appears to be a serious memory leak. This is the only command used: docker run -p 8529:8529 -p 8530:8530 -d --name rango -v /var/lib/arangodb3:/var/lib/arangodb3 -v /etc/letsencrypt:/etc/letsencrypt -e ARANGO_ROOT_PASSWORD=XXXXXX -e ARANGO_STORAGE_ENGINE=rocksdb arangodb/arangodb-preview:v3.4.0-rc.4 arangod --server.endpoint ssl://[::]:8530 --server.authentication true --ssl.keyfile /etc/letsencrypt/live/mysite.io/server.key --ssl.session-cache true --server.maximal-threads 1 --log.level trace
@SaulDoesCode @martinkunevtoptal @choppedpork: A quick update on this issue: it seems that there is an issue with the bundled jemalloc memory allocator in some environments. As dumb as it sounds, the solution here is to modify the value of
Another way to address the problem, which however requires compiling ArangoDB from source, is to compile a build without jemalloc.
@SaulDoesCode @martinkunevtoptal @choppedpork: Another update on this issue: we have identified a few issues in this context. For edge collections, a consumer of memory can be the in-memory edge cache, which can be limited using the corresponding startup option.
The biggest long-term consumers of memory while the server is running with the RocksDB engine will likely be the RocksDB block cache plus some in-memory RocksDB write buffers. The max size of the RocksDB block cache can be controlled via its startup option.

We also found that AQL REMOVE queries that delete a lot of documents will read the documents into memory first, before starting the actual deletion. This can cause spikes in memory usage, which are ideally avoided. We have a PR for 3.3 (it will likely land in 3.3.21) for fixing this: #7643. A workaround until then is to not read the full documents but only the document keys, i.e. REMOVE by key instead of by full document.

Apart from the above-mentioned issues, we are not aware of anything else that would indicate memory usage growth beyond the configured limits.
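A sketch of the keys-only removal pattern described above; the collection name `docs` and the filter are hypothetical, the point is that removing by `_key` avoids materialising the full documents first.

```sh
# Keys-only removal via the cursor API (collection name and filter are made up):
curl -s -X POST http://localhost:8529/_api/cursor \
  -d '{"query": "FOR d IN docs FILTER d.expired == true REMOVE d._key IN docs"}'
```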
@jsteemann @martinkunevtoptal @choppedpork I can confirm this issue on v3.3.22 with the RocksDB engine. It does not seem to occur with the MMFiles engine; the runaway memory growth seems to be specific to the usage of the RocksDB engine. In four hours of running a freshly compiled 3.3.22 build and starting it with
I have attached a CSV with the VSZ and RSS captured every 60 seconds (remove the
I will try limiting the RocksDB cache and write buffers as recommended by @jsteemann above, capture data and post it here. I tried this earlier to no effect, but can only speak anecdotally since I do not have RSS and other memory usage stats from those tries.
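For anyone following along, a sketch of what limiting those consumers could look like; the option names are taken from the ArangoDB documentation rather than from this thread, and the 256 MB values are purely illustrative.

```sh
# Illustrative limits only (256 MB each); tune to the memory actually available:
arangod \
  --rocksdb.block-cache-size 268435456 \
  --rocksdb.total-write-buffer-size 268435456 \
  --cache.size 268435456
```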
@beast-in-black : thanks for the detailed report. I guess if This can be attacked from two sides:
@jsteemann fyi, we are hitting this in 3.4.2.1-1 as well and had to disable statistics.
@jsteemann apologies for the delay in responding; I have been conducting some tests on ArangoDB in between my other work tasks. For our use case, turning off statistics is unfortunately not an option. However, I made the following changes (directly changed the defaults in the C++ code):
With the above changes, I notice the following:
My use case requires running it as Docker on K8s as mentioned in (2) above. So #5414 is applicable to me as well (cc @martinkunevtoptal ). At this point, I am at a loss to explain the following things I have noticed in my testing:
Do you have any further suggestions especially for the docker+K8s case? |
@Simran-B good point. What I notice is that the RSS memory starts at about 300MB when the cluster is first started up, and then rises in a few hours to the 1.2GB limit. I had assumed that with no external inputs to the system (no data being written/read) then all things being equal, the rise in RSS memory usage from 300MB to 1.2GB was purely being caused by the internal statistics writes. Therefore, the extra memory required by the rest of the arango framework which you've mentioned should not be a factor in the memory rise. Is my assumption in error? |
The buffers/caches might start out at 0 MB and grow slowly until they reach the configured limits, most likely caused by the writes for the statistics every other second. In that case the 300 MB at start would be everything else that is put into memory for or by ArangoDB except the buffers/caches. 300 + 300 + 256 + 256 = ~1.1 GB
I think the last few comments may be a tangent from the real issue here. I think the real issue could be what @beast-in-black observed:
I believe we are seeing similar behavior. I logged my issue as a series of comments against #5414. I think there are some promising observations in the comments in this issue, and I'd like to see the conversation continue. Maybe we can write a simple test program to add/delete collections in Kubernetes to demonstrate the leak?
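Something along those lines could be as simple as a loop against the collection REST API; the endpoint, credentials and collection name below are placeholders.

```sh
# Repeatedly create and drop a throwaway collection while watching the RSS of
# the dbserver pods. Endpoint and credentials are placeholders.
ENDPOINT=http://localhost:8529
AUTH="root:$ARANGO_ROOT_PASSWORD"
for i in $(seq 1 10000); do
  curl -s -u "$AUTH" -X POST "$ENDPOINT/_api/collection" \
    -d '{"name":"leaktest"}' > /dev/null
  curl -s -u "$AUTH" -X DELETE "$ENDPOINT/_api/collection/leaktest" > /dev/null
done
```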
Woosh - it's almost a year since the last comments! With the new (and exciting) native Prometheus metrics support in 3.6 we've tried enabling statistics gathering again, but unfortunately we're either encountering the same leak or there's something new at play. I'm currently switching off statistics in this test environment so that I can verify whether the statistics are at fault or not. Interestingly, we do not currently collect those metrics (there's some further work we need to do before we'll be able to integrate this). Here's the last 30 days of memory usage (this is version 3.4.8 until 27th March, when we upgraded to 3.6.1 via an intermediate upgrade to 3.5.0, as per the docs). Interestingly, both the agency and coordinator nodes have lower memory usage now and do not seem to exhibit the same leaky pattern as the server nodes:
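For anyone who wants to eyeball the native metrics mentioned here, they are served over HTTP; the `/_admin/metrics` path is the endpoint documented for 3.6, noted here as an assumption about the setup.

```sh
# Prometheus-format metrics exposed natively since 3.6:
curl -s -u root:"$ARANGO_PASSWORD" http://localhost:8529/_admin/metrics
```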
Looks like both unfortunately (no metrics 😢) and fortunately (as this means there's nothing stopping us from upgrading) the observed behaviour is indeed related to the statistics.

One thing that got me thinking is the sawtooth shape - this looks a lot like a struggling garbage collector to me. As the javascript stats implementation was replaced with a native one in 3.4 (and looking at these graphs, the pattern is different now: the leak is much slower and the shape is a more defined / cleaner sawtooth than in my original posts) - could this be something sitting on the edge between native code and V8?

edit: one more thing... the original issue with version 3.3 had this occurring across all three node types, but we're definitely only seeing this in server nodes now. I wonder if it's possible that there have actually been two separate leaks - the original one, which was fixed by moving from JS to a native implementation, whereas what we're seeing now is a different issue, which seems to have been present in 3.3 as well (the speed of the leak was significantly higher for server nodes back then)?
@choppedpork I haven't worked on this since last April and am no longer involved in any ArangoDB work, but back then some of the things I had fruitlessly tried were:
With the above tweaks in place, I had found that if Arango was run directly on the machine - i.e. not in a Docker container under K8s - the memory (RSS) did stabilize after a while, but the same code inside a Docker container in K8s still exhibited the runaway memory growth, leading to an eventual OOM-kill of the container. Also, after the container was killed (or otherwise restarted), the memory dropped down but then steadily grew again. As you and others (including myself) have noted, the runaway memory growth does not occur if stats are turned off, but unfortunately this was not an option for my team in production. There is also the other issue I noticed about the memory usage not dropping back down after adding data to a regular collection and then removing it...
Hi, |
That's great news @dothebart - thanks! We should be able to add this early this week and report back by the end of the week / early next week.
please note that 3.6.3 is not yet available. However, adding the environment variable already can't hurt. |
Hi @choppedpork, version 3.6.3 has just been released. Could you please upgrade your deployment and add the ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY environment variable?
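For a Docker-based deployment like the ones discussed earlier in the thread, passing the variable could look roughly like this; the image tag, port mapping and the `2G` value are only examples.

```sh
docker run -d --name arangodb \
  -p 8529:8529 \
  -e ARANGO_ROOT_PASSWORD=XXXXXX \
  -e ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY=2G \
  arangodb/arangodb:3.6.3
```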
Awesome! Thanks for the heads-up @maxkernbach, we're going to deploy the new version this week and will keep you posted. |
Hello everyone! I tried this suggestion (ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY) but unfortunately it didn't work. Turning off the logging did work, however that is not ideal. Let me know if you need any details from my configuration. Regards...
hi @sweeneki - did you install 3.6.3? please tell us a bit more about your installation - see first post or github issue template. |
Hi Wilfried,
Thank you for your reply. I am using a gcloud Compute Engine VM, g1-small (1 vCPU, 1.7 GB memory), Linux, arangodb/arangodb:3.6.3, RocksDB, ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY=500MB, CPU platform Intel Broadwell.
I have turned off the logging and the db is now stable. Before, the DB would crash at least once per day. The MMFiles DB ran for months on my laptop, both in Docker and natively. My DB is very small with fewer than 1k records and ~collections.
Please let me know if you need more information. I would be happy to provide it.
Kind regards
Kieran Sweeney
Hi, you should rather try using rocksdb - mmfiles is discontinued as of ArangoDB 3.7. https://www.arangodb.com/docs/stable/tutorials-reduce-memory-footprint.html explains how to cut down on resources.
Hi Wilfried,
I did switch to RocksDB, which is what caused my problems. When I was using MMFiles I didn't have any problems.
Kieran
Hello @dothebart, @maxkernbach, thanks for your continued support here - apologies it's taken me a few days, but with this issue we're currently talking about memory usage across weeks rather than days, so I wanted to make sure I've got some quality data before coming back to you (the first few days were not very promising).

This is definitely an improvement, but it's interesting to see how much more memory is in use with statistics enabled. In the graph above the first part is 3.6.1 w/o statistics, the middle part was just an unrelated restart, and the rightmost part is 3.6.3 with statistics enabled.

I would like to wait 1-2 weeks before closing this issue if that's OK with you? The memory usage patterns do seem to change over time, so while it's highly unlikely, there could still be dragons. What are your thoughts about the increased memory usage with statistics enabled? Would that be worth opening a new issue for (I can imagine that, just like me, you'd like to see this one closed soon 😄)?

PS. Having stumbled into some memory usage issues after upgrading (which turned out to be caused by our usage of views, which we've decided to scrap as we didn't need them anymore), I've been running some load tests to try and identify the problem, and I do have to say 3.6 performs 40% faster than 3.4, all while using less CPU and less memory - impressive work 👏
As discussed in #11577, exposing more RocksDB options on the command line seems to ease the memory situation. Namely:
These have been added to the now released ArangoDB 3.7 - hence closing this. |
my environment running ArangoDB
I'm using ArangoDB version:
Mode:
Storage-Engine:
On this operating system:
this is an installation-related issue:
Hello,
Since upgrading to 3.3 we've noticed a rather drastic increase in memory usage of server nodes over time. I've had a look at some of the existing open tickets and this looks quite similar to #5414 - I've decided to open a separate issue and let you decide though.
We're seeing this problem only in server instances - both the agency and coordinator nodes have very stable memory usage. Our standard deployment model for Arango is 3x agency, 3x coordinator and 3x server. We're using the following config options (I've skipped ones that seem irrelevant to me like directory locations, ip addresses etc - let me know if you'd like the full list):
Here's a list of sysctls we're setting:
`net.core.somaxconn` is also set (same value) on the Docker container. We're setting the `transparent_hugepage` `defrag` and `enabled` properties to `never`. We've upgraded from 3.2.9 to 3.3.0 and have since used 3.3.3, 3.3.4 and are on 3.3.8 now.
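For reference, a sketch of the usual way to apply those transparent_hugepage settings on the host (standard sysfs paths; how exactly our provisioning applies them is beside the point here):

```sh
# Disable transparent hugepages (takes effect immediately, not persistent across reboots):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```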
Here's the memory usage (RSS) for the server nodes in one of our environments (which got upgraded to 3.3.x around 25th April - note that this environment gets shut down in the evening every day, hence the large gaps in the graph):
This is the above graph zoomed in to the last 5 working day period:
This is another environment; the upgrade to 3.3.x happened on the 3rd of April. The change in the memory usage pattern on the 27th of April was caused by applying a Docker memory limit of 2g.
The above environments have extremely light usage of Arango.
Here's one of the environments which gets used a bit more:

As a reference point of sorts, here's what it looks like when we run a load test against our application:

The server nodes eventually tail off at 6.4GB and memory usage remains perfectly stable afterwards.
All of the above graphs were taken using the same settings I've mentioned earlier.
Let me know what other information would be useful to provide - I guess disabling statistics and/or Foxx queues would be something you might want us to try? If so, shall we disable both at once or try them one by one (and if so, what order would you prefer)?
Thanks,
Simon
edit: as part of the investigation I have upgraded from 3.3.8 to 3.3.10