intermittent issues with 1.0.0 #188
Comments
If you are able to reproduce the issue, then it would be good to have the server running with a detailed log level (e.g. by using the argument --log.level trace on server startup) so it will hopefully produce some useful state information. |
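A minimal sketch of that invocation, assuming a foreground start from a shell (only --log.level trace is taken from the thread; capturing the output to a file, and any further path/config options, are assumptions about the local setup):

    # start the server in the foreground with trace-level logging,
    # keeping a copy of the output for later inspection
    # (add whatever database path/config options your setup needs)
    arangod --log.level trace 2>&1 | tee arangod-trace.log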
I'm on it... but I just found another issue with the keys API. It is hard to trace this issue since it happens intermittently.
I've dropped some additional traces, cursed.log and cursed2.log; can you access these? |
When turning off --log.level trace, the problem occurs on the first npm test run (or the following one). |
That sounds like it is timing-related. However, I am currently unable to reproduce the issue locally. |
Does the problem still occur on your end if you start the server with the additional option --javascript.gc-interval 1? |
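For reference, that would mean starting the server with something like the following (combining it with the trace flag is an assumption; a value of 1 presumably makes the embedded V8 garbage collector run as often as possible):

    # assumed combined invocation; adjust the rest of the configuration to your setup
    arangod --log.level trace --javascript.gc-interval 1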
Perhaps... I'm a little short on disk space; I have only 3.5GB left.
Maybe Arango is (pre)allocating a lot of disk space so that it runs short after a while... I noticed after I did some tracing that when I terminated the process it freed around 2GB. Will attempt again with the gc-interval option. |
With --javascript.gc-interval 1, it happens after about 10 npm test runs. |
Seems like the deleted collections are not being freed up. However, the last time ArangoDB stalled I had 1GB free...
|
You could try starting the server with a smaller journal size, e.g. using --database.maximal-journal-size 1048576. More on journal sizes can be found here: |
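Putting the suggestions so far together, the startup could look roughly like this (only the three flags are quoted from the thread; combining them in one invocation is an assumption, and 1048576 bytes is 1 MB):

    # assumed combined invocation: trace logging, aggressive JS garbage
    # collection, and a 1 MB (1048576 byte) maximal journal size
    arangod --log.level trace \
            --javascript.gc-interval 1 \
            --database.maximal-journal-size 1048576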
Tried that... but it went unresponsive again after a while. This time it went bazooka before the cursor create, while setting up a new collection and injecting 50 documents.
However, it usually goes when I create the cursor.
And disk space should not be an issue as shown below after it stalled.
|
Leaving a note regarding a memory allocation problem found in the trace.
I will now attempt to provoke a failure when using a clean build and a freshly rebooted machine with minimal applications running. |
More debug data from last failure
and this
The process just exited after that, no core produced. |
I suspect this is a garbage collection issue. I have a patch ready for 1.0 that hopefully fixes this. The patch is available for download at: The patch introduces a different garbage collection strategy. It hopefully solves the hogging of file descriptors, disk space and memory as in your case. Using the patch, I got some benefits when running your test case in an endless loop. It did not have any effect when running your test suite just once. |
Re Xcode templates: I personally don't use Mac & Xcode, so I don't know about them. |
Started up ArangoDB again, with trace enabled this time; on the first run it stalled.
Running tests again (without interrupting the process) and noticing 'Operation timed out' in the trace.
|
Re Xcode: it seems no one here is using Xcode for ArangoDB development, so we do not have an Instruments template available. |
|
The problem on shutdown seems to be caused by several problems that happen before it. Several operations fail, and as a consequence the shutdown hangs. |
uname -a
ulimit -a
Using the default arango.conf. Starting with...
...or without trace, and different journal sizes |
The thing is, I want the database to be able to run with limited resources and, if possible, degrade gracefully in those situations. I plan to deploy ArangoDB on cloud servers with restricted memory & disk space. The DB should not become unstable under any circumstance... almost. :) That said... I don't think this is an issue with available resources, since I have disk space and memory available. Better garbage collection is great... I'll try the patch. |
I see. If the patch I mentioned does not help, could you still increase the number of open files a bit (say: to 1024)? The other parameters seem ok, but the open files are needed for sockets and collections (multiple files per collection), and that may easily exceed 256. |
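A sketch of that adjustment, assuming a POSIX shell and that arangod is started from the same session (so the raised limit actually applies to it):

    # raise the soft limit on open file descriptors for this shell session,
    # then start the server from the same shell
    ulimit -n 1024
    arangod --log.level trace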
Tried to increase open files, same issue after a while.
ulimit -a -H
With the patch: arangod-20120905-12.00.log.gz |
Did that happen with or without the patch? |
All tests since ~2 hrs ago have been with the patch. |
I'm not sure about ulimit either. I tried reducing my ulimit to some very low value; the command accepted it and also showed the lowered value. However, it did not have an effect when I ran arangod in the same terminal. arangod was able to open a lot more files than should have been allowed. So I am actually not sure how this takes effect. |
Yeah, however the number of open files doesn't explain why it sometimes stalled on the first attempt.
|
The number of files might explain it if you did not restart the database server for each test run but left it running the whole time. If you restarted the db on every test run, it does not. Furthermore, it does not explain the timeouts. I suspected the timeouts to come from some unreleased locks, which might have been there as a side effect of too few available file descriptors or other resources. But that's only a guess. |
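One way to check whether the descriptor limit is actually being approached would be to count the files held open by the running server while the tests loop (the commands below are an assumption, not from the thread; lsof is available by default on Mac OS X):

    # count the file descriptors currently held by the arangod process
    lsof -p "$(pgrep arangod)" | wc -l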
Btw, if you exclude the cursor-related tests from the run, is it any better then? I.e. does the issue only occur if you include the cursor tests in the overall test run? |
Seems to work fine without the cursors. I think it is obvious we are dealing with multiple failures.
And then perhaps a chain of events related to the above. |
I am sorry, I cannot see the output I expected in the logs. Either my git push failed or the patch wasn't yet applied. |
Ok, reapplied the patch after cleanup: arangod-20120906-1140.log.gz |
Ok, the reason for the hiccup is that the execution of some generated JavaScript code used for query execution never returns. This leads to cleanup not being performed, causing the problems on shutdown etc. I assume that the JavaScript code works but probably processes a lot of documents, so it does not return instantly. |
When I tear down the test it should remove the collection.
Looking through my code... This is a bit nasty:
Hehe... when I create the testcursor collection I immediately start to inject 50 documents, and as soon as the for loop is done I call done(). Which means ArangoDB is probably not done processing the documents when I then attempt to create the cursor in the next step. So... the issue could be related to creating a cursor at the same time documents are being injected. A moving target for the cursor. :) |
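To illustrate the suspected race at the HTTP level, a rough sketch (port, endpoint paths and the query are assumptions for a stock 1.0 setup, not taken from the actual test suite): the inserts are fired off without waiting for them to finish, and the cursor is requested immediately afterwards.

    # assumes the testcursor collection already exists
    # fire 50 document inserts without waiting for them to complete...
    for i in $(seq 1 50); do
      curl -s -X POST --data "{\"n\": $i}" \
           "http://localhost:8529/_api/document?collection=testcursor" > /dev/null &
    done
    # ...then immediately ask for a cursor over the same collection
    curl -s -X POST --data '{"query": "FOR d IN testcursor RETURN d"}' \
         "http://localhost:8529/_api/cursor"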
Tried to create a specific test suite for the case I described above.
However, the cursor creation seems to wait until the documents have been injected. |
Normally (if everything works as expected) concurrent creation of documents and using a cursor should not create problems. However, as something definitely goes wrong, there seems to be an issue either in the locking or the document iteration. |
There's a new trace: arangod-20120907-1330.log.gz. A little different this time.
Then, after a couple more rounds, the same as usual.
|
Thanks, I inspected that. |
Ok, great, hopefully you'll nail this. :o) Added file: config.log
This could be a potential reason, but I suspect it's a trivial logic error somewhere in the code. |
I found some definitely locking related issues and I have committed fixes for them in master. |
Sorry, no cigar: arangod-20120907-1930.log.gz |
No, it can't be this. We're not using boost::shared_mutex but pthreads_xxx functions with some wrappers around them. I am still having problems reproducing the issue. I already tried it on some other machines and with all CPUs being used by other processes, but with no luck. |
New trace: arangod-20120908-0700.log.gz. If this is only related to Mac OS X (I'm currently on 10.6.8 & gcc 4.2.1) it is really a non-issue for me, since I'll be using Linux in production. |
Thanks. I can now clearly see the locks waiting for each other from the logfile (arangod-20120908-0700.log.gz):
The problem is that 4) is blocking on 3) in your case. This is a deadlock, because 3) is blocking on 2). Re the corruption: |
I have updated the patch again. As before, the new patch replaces all the previous patches so it should be applied against a clean master installation. |
Looking good so far... have been running tests for ~10 minutes and my computer is starting to heat up. ;) Ok, so after approx 20 minutes of testing, even with two concurrently running test processes, I have to conclude that it works now! I don't think you want the log file ;)
|
Thanks for your patience and support while finding out the root cause of this issue. |
Created separate issue #194 for the corruption problem. |
Fixed in 1.0 and devel. Awaiting pull to master.
Merge to master |
* add warmup documentation - #188
* warn when relinking build directory
* Renamed warmup to loadIndexesInMemory, which is a better name for the current implementation of this feature.
* Adapted WebUI to state 'load indexes in memory' instead of 'warmup'
* Added loadIndexesInMemory documentation.
* Renamed loadIndexesInMemory => loadIndexesIntoMemory
I'm suddenly experiencing failures when creating cursors using the nodejs arango-client.
ArangoDB gets unresponsive and Ctrl-C doesn't terminate the process.
This message is shown several times after issuing Ctrl-C.
To replicate, install the arango-client with devRequirements and run the tests.
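Roughly, assuming a checkout of the arango-client source tree (the exact commands are an assumption based on standard npm conventions; running npm install inside the checkout pulls in the development dependencies as well):

    # from inside the arango-client checkout
    npm install   # installs the package plus its development dependencies
    npm test      # runs the test suite against a locally running arangod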
By the way, the ArangoDB 1.0.0 install procedure (sudo make install) didn't create a log directory, so the process failed to launch.
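A possible workaround until the install target is fixed is to create the directory by hand; the path below is purely hypothetical, so use whatever the log settings in your arangod configuration actually point at:

    # hypothetical path: check the log-related entries in your arangod.conf
    sudo mkdir -p /var/log/arangodb
    # make sure the user arangod runs as can write to it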