intermittent issues with 1.0.0 #188


Closed
kaerus opened this issue Sep 4, 2012 · 62 comments

@kaerus commented Sep 4, 2012

I'm suddenly experiencing failures when creating cursors using the Node.js arango-client.
ArangoDB becomes unresponsive and Ctrl-C doesn't terminate the process.
This message is shown several times after issuing Ctrl-C.

2012-09-04T10:17:33Z [9149] WARNING forcefully shutting down thread 'dispatcher'

To replicate, install the arango-client with its dev dependencies and run the tests.

git clone git://github.com/kaerus/arango-client.git
cd arango-client
npm install -d
npm test

By the way, the ArangoDB 1.0.0 install procedure (sudo make install) didn't create a log directory, so the process failed to launch.

@kaerus commented Sep 4, 2012

Suddenly I'm unable to replicate it myself... will investigate this further. I'll be back.
Ran the tests a couple of times and then... boom!
After the failure in the cursor tests, I'm unable to create a collection.

@jsteemann (Contributor)

If you are able to reproduce the issue, it would be good to have the server running with a detailed log level (e.g. by using the argument --log.level trace on server startup) so it will hopefully produce some useful state information.
If you could then provide the server logs, that would probably help. Thanks.

@kaerus commented Sep 4, 2012

I'm on it... but I just found another issue with the keys API.

It is hard to trace this issue since it happens intermittently.
I've provided some debug output; skip forward to about 2/3 of the file and the problems should start to appear.
I didn't see any obvious problem in the trace information, yet ArangoDB became unresponsive as before.

https://www.dropbox.com/s/qld2lnj7b7wqu8l/arangod.log.gz
https://www.dropbox.com/sh/g40x4z333rqfd27/nNnxoJuLt3

I've dropped some additional traces, cursed.log and cursed2.log. Can you access these?

@kaerus commented Sep 4, 2012

When turning off --log.level trace, the problem occurs on the first npm test run (or the one following).
So this seems to be timing-related somehow.

@jsteemann (Contributor)

That sounds like it is timing-related. However, I am currently unable to reproduce the issue locally.
Btw. the log file has a few lines saying "Cannot allocate memory". That may be bogus, but can you check whether there is indeed enough memory available?

@jsteemann (Contributor)

Does the problem still occur on your end if you start the server with the additional option --javascript.gc-interval 1?

@kaerus commented Sep 4, 2012

Perhaps... I'm a little short on disk space; I have only 3.5 GB left.

K1:arango-client kaerus$ df -k
Filesystem    1024-blocks     Used Available Capacity  Mounted on
/dev/disk0s2     97656252 93896064   3504188    97%    /

Maybe ArangoDB is (pre)allocating a lot of disk space so that it runs short after a while... I noticed after some tracing that when I terminated the process it freed around 2 GB.

Will attempt again with the gc-interval option.

@kaerus commented Sep 4, 2012

With --javascript.gc-interval 1, it happens after about 10 npm test runs.

@kaerus commented Sep 4, 2012

Seems like the deleted collections are not being freed up, since after every test run I have about 200 MB less disk space.

However, the last time ArangoDB stalled I had 1 GB free...

Filesystem    1024-blocks     Used Available Capacity  Mounted on
/dev/disk0s2     97656252 96070304   1329948    99%    /

@jsteemann (Contributor)

You could try starting the server with a smaller journal size, e.g. using --database.maximal-journal-size 1048576.
This will not allocate as much disk space (and RAM, because of memory mapping) for each collection. For test collections that are mostly empty, the default journal size of 32 MB may be too wasteful.
If insufficient memory is the cause of the problem, then reducing the journal size to 1 MB should fix it.

More on journal sizes can be found here:
http://www.arangodb.org/2012/07/08/collection-disk-usage-arangodb
http://www.arangodb.org/2012/08/08/more-datafile-sizes

@kaerus commented Sep 4, 2012

Tried that... but it went unresponsive again after a while.

This time it went bazooka before the cursor create, while setting up a new collection and injecting 50 documents.

 1) Arango cursor "before all" hook:
     Error: timeout of 2000ms exceeded

However, usually it fails when I create the cursor.

1) Arango cursor create:
     Error: timeout of 2000ms exceeded

And disk space should not be an issue, as shown below after it stalled.

Filesystem    1024-blocks     Used Available Capacity  Mounted on
/dev/disk0s2     97656252 94123604   3276648    97%    /

@kaerus commented Sep 5, 2012

Leaving a note regarding the memory allocation problem found in the trace:

[fillReadBuffer@lib/Scheduler/SocketTask.cpp:334] read return 0 with 12 (Cannot allocate memory)

I will now attempt to provoke a failure using a clean build and a freshly rebooted machine with minimal applications running.
By the way, do you happen to have an Instruments (Xcode) trace template for monitoring the arangod process?

@kaerus commented Sep 5, 2012

More debug data from the last failure:

2012-09-05T08:17:00Z [392] WARNING accept failed with 24 (Too many open files)
2012-09-05T08:17:00Z [392] ERROR too many accept failures, stopping logging

and this:

2012-09-05T08:17:01Z [392] ERROR cannot create json file '/usr/local/var/arangodb/collection-4505092519/parameter.json.tmp': 'Too many open files'
2012-09-05T08:17:01Z [392] ERROR cannot save info block '/usr/local/var/arangodb/collection-4505092519/parameter.json': 'Too many open files'
2012-09-05T08:17:01Z [392] ERROR cannot save collection parameter '/usr/local/var/arangodb/collection-4505092519': 'Too many open files'
2012-09-05T08:17:01Z [392] ERROR cannot create document collection
2012-09-05T08:17:01Z [392] ERROR could not read-unlock the read-write lock: Invalid argument

The process just exited after that, no core produced.

@jsteemann (Contributor)

I suspect this is a garbage collection issue. I have a patch ready for 1.0 that hopefully fixes it.
If you could try applying the patch on master or 1.0, that would be great.

The patch is available for download at:
https://raw.github.com/triAGENS/ArangoDB/devel/Documentation/issue188.diff

The patch introduces a different garbage collection strategy. It hopefully solves the hogging of file descriptors, disk space, and memory as in your case. Using the patch, I saw some improvement when running your test case in an endless loop; it did not have any effect when running your test suite just once.

@jsteemann (Contributor)

Re Xcode templates: I personally don't use Mac & Xcode, so I don't know about them.
I'll check whether the colleagues do.

@kaerus commented Sep 5, 2012

Started up ArangoDB again, with trace enabled this time; on the first run it stalled.
It is strange that it says "PUT /_api/cursor/undefined", since I take the id of the cursor from a previous POST request (cursor create). This could indicate that the previous POST _cursor operation failed to return a cursor _id (I need to check for that specifically).
Obviously the POST _cursor operation fails and the consecutive tests fail too.

2012-09-05T08:28:52Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:97] HTTP READ FOR 0x101114da8:\nPOST /_api/cursor/ HTTP/1.1\r\nContent-Type: application/json\r\nContent-Length: 71\r\nConnection: keep-alive\r\n\r\n
2012-09-05T08:28:52Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:111] server port = 8529, client port = 22480
2012-09-05T08:28:52Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:186] {"query":"FOR u IN testcursor RETURN u._id","count":true,"batchSize":1}
2012-09-05T08:28:52Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:141] no direct handler found, trying prefixes
2012-09-05T08:28:52Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:160] no prefix handler found, trying catch all
2012-09-05T08:28:52Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:164] found catch all handler '/'
2012-09-05T08:28:52Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:229] found handler for path '/'
2012-09-05T08:28:52Z [730] TRACE [addJob@lib/Dispatcher/Dispatcher.cpp:159] added job 0x101105cc0 to queue STANDARD
2012-09-05T08:28:52Z [730] TRACE [work@./lib/GeneralServer/GeneralServerJob.h:131] beginning job 0x101105cc0
2012-09-05T08:28:52Z [730] TRACE [enterContext@arangod/V8Server/ApplicationV8.cpp:216] found unused V8 context
2012-09-05T08:28:52Z [730] TRACE [TRI_StartScopeAql@arangod/Ahuacatl/ahuacatl-scope.c:244] starting scope of type main
2012-09-05T08:28:52Z [730] TRACE [TRI_StartScopeAql@arangod/Ahuacatl/ahuacatl-scope.c:244] starting scope of type for
2012-09-05T08:28:52Z [730] TRACE [TRI_EndScopeByReturnAql@arangod/Ahuacatl/ahuacatl-scope.c:314] closing scope of type for
2012-09-05T08:28:52Z [730] TRACE [TRI_EndScopeAql@arangod/Ahuacatl/ahuacatl-scope.c:275] closing scope of type main
2012-09-05T08:28:52Z [730] TRACE [OpenCollections@arangod/Ahuacatl/ahuacatl-collections.c:157] locking collection 'testcursor'
2012-09-05T08:28:52Z [730] TRACE [TRI_ReadLockCollectionsAql@arangod/Ahuacatl/ahuacatl-collections.c:400] read-locking collection 'testcursor'
2012-09-05T08:28:52Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:97] HTTP READ FOR 0x10110c8d8:\nPUT /_api/cursor/undefined HTTP/1.1\r\nContent-Type: application/json\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\n
2012-09-05T08:28:52Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:111] server port = 8529, client port = 22224
2012-09-05T08:28:52Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:186] {}
2012-09-05T08:28:52Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:141] no direct handler found, trying prefixes
2012-09-05T08:28:52Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:160] no prefix handler found, trying catch all
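As a side note, a minimal client-side guard could look like this sketch (hypothetical code, not the actual arango-client internals; it talks directly to the REST API seen in the trace and refuses to issue the PUT when the POST response carries no cursor id):

var http = require('http');

function createCursor(query, callback) {
  var body = JSON.stringify({ query: query, count: true, batchSize: 1 });
  var req = http.request({
    host: 'localhost',
    port: 8529, // default ArangoDB endpoint, as seen in the trace
    method: 'POST',
    path: '/_api/cursor',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body)
    }
  }, function(res) {
    var data = '';
    res.on('data', function(chunk) { data += chunk; });
    res.on('end', function() {
      var result = JSON.parse(data);
      if (result.error || result.id === undefined) {
        // without this check the follow-up request would literally become
        // "PUT /_api/cursor/undefined", as in the trace above
        return callback(new Error('cursor create returned no id'));
      }
      callback(null, result.id);
    });
  });
  req.on('error', callback);
  req.end(body);
}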

Running the tests again (without interrupting the process) and noticing 'Operation timed out' in the trace:


2012-09-05T08:38:36Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:97] HTTP READ FOR 0x101051018:\nDELETE /_api/collection/testcursor HTTP/1.1\r\nContent-Type: application/json\r\nContent-Length: 0\r\nConnection: keep-alive\r\n\r\n
2012-09-05T08:38:36Z [730] TRACE [processRead@lib/HttpServer/HttpCommTask.cpp:111] server port = 8529, client port = 36560
2012-09-05T08:38:36Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:141] no direct handler found, trying prefixes
2012-09-05T08:38:36Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:160] no prefix handler found, trying catch all
2012-09-05T08:38:36Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:164] found catch all handler '/'
2012-09-05T08:38:36Z [730] TRACE [createHandler@lib/HttpServer/HttpHandlerFactory.cpp:229] found handler for path '/'
2012-09-05T08:38:36Z [730] TRACE [addJob@lib/Dispatcher/Dispatcher.cpp:159] added job 0x10104f1e0 to queue STANDARD
2012-09-05T08:38:36Z [730] TRACE [work@./lib/GeneralServer/GeneralServerJob.h:131] beginning job 0x10104f1e0
2012-09-05T08:38:36Z [730] TRACE [enterContext@arangod/V8Server/ApplicationV8.cpp:216] found unused V8 context
2012-09-05T08:38:38Z [730] TRACE [fillReadBuffer@lib/Scheduler/SocketTask.cpp:334] read return 0 with 60 (Operation timed out)
2012-09-05T08:38:38Z [730] TRACE [destroyTask@lib/Scheduler/Scheduler.cpp:308] destroyTask for task 0x101051018
2012-09-05T08:38:38Z [730] DEBUG [beginShutdown@./lib/HttpServer/HttpHandler.h:110] job is still active, trying to shutdown
2012-09-05T08:38:38Z [730] TRACE [beginShutdown@./lib/GeneralServer/GeneralServerJob.h:193] shutdown, job (0x10104f1e0) is still running
2012-09-05T08:38:38Z [730] TRACE [~GeneralCommTask@./lib/GeneralServer/GeneralCommTask.h:104] connection closed, client 31
2012-09-05T08:39:00Z [730] DEBUG [reportStatus@lib/Dispatcher/Dispatcher.cpp:261] dispatcher queue 'STANDARD': threads = 5 started = 0 running = 4 waiting = 1 stopped = 0 special = 0 monopolistic = no

@jsteemann (Contributor)

Re Xcode: it seems no one here is using Xcode for ArangoDB development, so we do not have an Instruments template available.

@kaerus commented Sep 5, 2012

https://www.dropbox.com/sh/g40x4z333rqfd27/nNnxoJuLt3
tracefile: arangod-20120905-1054.log.gz

This run was with the patch applied.

@jsteemann (Contributor)

The problem on shutdown seems to be caused by several problems happening before: several operations fail, and as a consequence the shutdown hangs.
Is there any special configuration you run the server with, e.g. a specific virtual memory restriction, number of open files, etc.? Can you run ulimit -a on your system and send the result? Furthermore, do you use any startup parameters or a configuration file when starting the server?
The problem is that I cannot reproduce the issue locally, so maybe it depends on a specific configuration.

@kaerus commented Sep 5, 2012

uname -a

Darwin K1.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386

ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 266
virtual memory          (kbytes, -v) unlimited

using default arango.conf

Starting with...

sudo /usr/local/sbin/arangod  --database.maximal-journal-size 1048576 --log.level trace

...or without trace, and with different journal sizes.

@kaerus commented Sep 5, 2012

The thing is, I want the database to be able to run with limited resources and, if possible, degrade gracefully in those situations. I plan to deploy ArangoDB on cloud servers with restricted memory & disk space. The DB should not become unstable under any circumstances... almost. :)

That said... I don't think this is an issue with available resources, since I have disk space and memory available.
...and..... it has worked fine before. ;)

Better garbage collection is great... I'll try the patch.
...
Same problem: arangod-20120905-1139.log.gz

@jsteemann (Contributor)

I see. If the patch I mentioned does not help, could you still increase the number of open files a bit (say, to 1024)? The other parameters seem OK, but open files are needed for sockets and collections (multiple files per collection), and that may easily exceed 256.

@kaerus commented Sep 5, 2012

Tried to increase the open files limit; same issue after a while.

K1:~ kaerus$ sudo lsof -p `ps aux|grep arangod|grep -v grep|awk '{print $2}'`|wc -l
     258

ulimit -a -H

K1:ArangoDB kaerus$ sudo ulimit -a -H
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 65532
cpu time               (seconds, -t) unlimited
max user processes              (-u) 532
virtual memory          (kbytes, -v) unlimited

with the patch: arangod-20120905-12.00.log.gz
It complained about too many open files in the trace... yet lsof showed only 258. But I'm a little unsure about how ulimit actually works.
Oh, I forgot to mention... it corrupted the database files... I had to delete the directory containing the collections before starting up again...

@jsteemann (Contributor)

Did that happen with or without the patch?

@kaerus commented Sep 5, 2012

All tests since ~2 hrs ago have been with the patch.
Since you cannot reproduce the problem, you may close this issue if you want.

@jsteemann (Contributor)

I'm not sure about ulimit either. I tried reducing my ulimit to some very low value; the command accepted it and also showed the lowered value. However, it did not have an effect when I ran arangod in the same terminal: arangod was able to open a lot more files than should have been allowed. So I am actually not sure how this takes effect.
All I saw from the logs was that in your situation you at least once ran out of file descriptors, and that would have been a good explanation for the problem. The patch I sent should actually prevent it from running out of file descriptors early, but if you have a lot of collections you could still be above 256 fds. But if all you have is a few collections and the patch is active, running out of fds seems very unlikely.

@kaerus commented Sep 5, 2012

Yeah, but the number of open files doesn't explain why it sometimes stalled on the first attempt.
And I use only ~3-4 collections in my test suite.
Here's another suspicious message from the trace:

2012-09-05T08:38:38Z [730] TRACE [fillReadBuffer@lib/Scheduler/SocketTask.cpp:334] read return 0 with 60 (Operation timed out)

@jsteemann (Contributor)

The number of files might explain it if you did not restart the database server for each test run but left it running. If you restarted the db on every test run, it does not.

Furthermore, it does not explain any timeouts. I suspected the timeouts came from some unreleased locks, which might have been a side effect of too few available file descriptors or other resources. But that's only a guess.

@jsteemann (Contributor)

Btw, if you skip the cursor-related tests, is it any better? I.e. does the issue only occur if you include the cursor tests in the overall test run?

@kaerus commented Sep 5, 2012

Seems to work fine without the cursors.

I think it is obvious we are dealing with multiple failures:

  1. Cursor create failure (could be timeout related)
  2. File descriptors running out (too many open files)
  3. Exhausted disk space (garbage collection)
  4. Corrupted database (no idea why)
  5. Memory running out (cannot allocate memory)

And then perhaps a chain of events related to the above.
What needs to be done is to identify and isolate each case and start over. :)

@jsteemann (Contributor)

I am sorry, I cannot see the output I expected in the logs. Either my git push failed or the patch wasn't applied yet.
May I kindly ask you to try again with the latest version?
Btw. I have also added some output about the V8 version used to the patch, so you don't need to check it yourself.
I strongly hope this is the last iteration we have to do.

@kaerus commented Sep 7, 2012

OK, reapplied the patch after cleanup: arangod-20120906-1140.log.gz

@jsteemann (Contributor)

Ok, the reason for the hiccup is that the execution of some generated JavaScript code used for query execution never returns. This leads to cleanup not being performed, causing the problems on shutdown etc.
You have the same V8 version as I am using (3.9.4), so version differences do not explain it.

I assume that the JavaScript code works but probably processes a lot of documents, so it does not return instantly.
I see from the logs that the test tries to create the "testcursor" collection, but the collection already exists. And each time the test is run, more documents are added.
If you are not deleting the collection between the tests or removing all of its documents, a lot of documents might have piled up. This could be a reason why getting the documents takes a long time.
Can you check how many documents are in that collection? A db.testcursor.count() from arangosh or the web interface should tell you. If there are many documents in it, can you try removing them all and running the test again?
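For instance, in arangosh (a sketch; count() is used elsewhere in this thread, while truncate() is an assumption on my part for this version):

db.testcursor.count();    // how many documents have piled up?
db.testcursor.truncate(); // drop all documents but keep the collection
db.testcursor.count();    // should be 0 afterwards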

@kaerus commented Sep 7, 2012

When I tear down the test it should remove the collection.

arangosh>  db.testcursor.count() 
48

Looking through my code... this is a bit nasty:

db.collection.create(function(err, ret) {
  var data = {};
  for (var i = 0; i < 50; i++) {
    // note: the same 'data' object is mutated and reused for every request
    data.i = i;
    data.msg = "test";
    db.document.create(data).on('error', function(error) {
      assert(error.code, error.message);
    }).on('result', function(res) {
      // result ignored: nothing tracks how many creates have completed
    });
  }
  // called as soon as the loop ends, not when the 50 creates have finished
  done();
});

Hehe... when I create the testcursor collection I immediately start injecting 50 documents, and as soon as the for loop is done I call done(). Which means ArangoDB is probably not done processing the documents when I then attempt to create the cursor in the next step.
That could explain why there are only 48 stray documents and not 50 in the testcursor collection.

So... the issue could be related to creating a cursor at the same time documents are being injected. A moving target for the cursor. :)
And the reason you cannot reproduce it could be that you have a faster machine than I have.
Try testing from a server under moderate load or on a slower PC.
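A possible fix, as a sketch only (assuming the event-emitter style client API behaves as in the snippet above): pass a fresh object per document and call done() only once all 50 creations have reported back.

db.collection.create(function(err, ret) {
  var pending = 50;

  function finished() {
    // fire done() only when every create has either succeeded or failed
    if (--pending === 0) done();
  }

  for (var i = 0; i < 50; i++) {
    // fresh object per iteration, so each request carries its own values
    db.document.create({ i: i, msg: "test" })
      .on('error', function(error) {
        assert(error.code, error.message);
        finished();
      })
      .on('result', finished);
  }
});

That way the cursor created in the next step no longer races against in-flight inserts.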

@kaerus commented Sep 7, 2012

Tried to create a specific test suite for the case I described above.
However, ArangoDB seems to handle that fine; I get some test failures but no lockups this time. :)
Here's a new log: arangod-20120906-1250.log.gz
These are the steps I perform.

  1. Attempt to delete collection 'testcursor'
  2. Create testcursor collection
  3. Inject 50 documents asynchronously
  4. Attempt to create a cursor (while documents are created)
  5. Read first batch from cursor
  6. Delete Cursor
  7. Delete Collection

However, the cursor create seems to wait until the documents have been injected.
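For illustration, steps 3 and 4 boil down to roughly this sketch (hypothetical code: db.cursor.create and its exact signature are made up here, only db.collection.create and db.document.create appear elsewhere in this thread):

db.collection.create('testcursor', function(err, ret) {
  for (var i = 0; i < 50; i++) {
    // fire-and-forget: these inserts may still be in flight below
    db.document.create({ i: i, msg: "test" });
  }
  // issued immediately, while the documents above are still being injected
  db.cursor.create({
    query: "FOR u IN testcursor RETURN u._id",
    count: true,
    batchSize: 1
  }, function(err, cursor) {
    // steps 5-7: read the first batch, delete the cursor, delete the collection
  });
});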

@jsteemann (Contributor)

Normally (if everything works as expected), concurrent creation of documents and use of a cursor should not create problems. However, as something definitely goes wrong, there seems to be an issue in either the locking or the document iteration.
I have updated the patch again with more debug output for these sections. I am sorry, but may I ask you to run the original test again with it? That would be helpful. I think we're really getting close now. At least there are not many more layers I can add debug output for.

@kaerus commented Sep 7, 2012

There's a new trace: arangod-20120907-1330.log.gz

A little different this time.
I got this error first:

1) Arango cursor delete cursor:
     'disposed or unknown cursor'

Then, after a couple more rounds, the same as usual.

1) Arango cursor Create cursor while injecting documents:
     Error: timeout of 2000ms exceeded

@jsteemann (Contributor)

Thanks, I inspected that.
From the logs I can see that at some point the cursor is endlessly waiting for a lock. That lock acquisition request comes from inside a JavaScript function, but that doesn't matter. The problem is that the lock cannot be acquired for the duration of the test, because something else holds it.
That means at some point a lock is either held too long or not released at all. This must be avoided because it causes starvation.
I need to investigate the locking code now. This will probably take some time, and I don't think I will have a result today.

@kaerus commented Sep 7, 2012

Ok, great, hopefully you'll nail this. :o)
By the way, what about gcc optimizations etc.? Bugs can be introduced by the compilation.
And are you using volatile declarations where needed?

Added file: config.log

@jsteemann (Contributor)

This could be a potential reason, but I suspect it's a trivial logic error somewhere in the code.
Btw. I have used gcc (4.5, 4.6 and 4.7) and clang 3.0.6 with -O2 and could not reproduce it. So it's probably really occurring only under some specific race conditions, which might be hard to reproduce on a different box. But I'll try.

@jsteemann (Contributor)

I found some definite locking-related issues and have committed fixes for them in master.
I am not sure whether the problems you had are related to these locking issues; however, it is worth a try. You can pull the latest commit from master and try it, ideally together with the latest patch.
I will continue to look for other locking-related issues, but it will take a while.

@kaerus commented Sep 7, 2012

Sorry, no cigar: arangod-20120907-1930.log.gz
There might be bugs in the boost library.
http://www.boost.org/users/history/version_1_49_0.html
Could be this: https://svn.boost.org/trac/boost/ticket/5502

@jsteemann (Contributor)

No, it can't be that one. We're not using boost::shared_mutex but pthread_xxx functions with some wrappers around them.
I have updated the patch again; it will produce much more output about locks now.

I am still having trouble reproducing the issue. I have already tried on some other machines, and with all CPUs busy with other processes, but with no luck.

@kaerus commented Sep 8, 2012

New trace: arangod-20120908-0700.log.gz
I got two consecutive failures when attempting to delete the cursor, before cursor create failed and the process stalled.

If this is only related to Mac OS X (I'm currently on 10.6.8 & gcc 4.2.1) it is really a non-issue for me, since I'll be using Linux in production.
BTW: I had a corrupted database (had to delete it before running the test); for whatever it is worth, you might want to have a look at these files:
db-corrupted-20120907.log
db-corrupted-20120907.tar.gz
But then again, there are just too many uncertainties going on right now regarding debug patches and consistent hangups.

@jsteemann (Contributor)

Thanks. I can now clearly see the locks waiting for each other in the logfile (arangod-20120908-0700.log.gz):

  1. line 6169: thread A trying to acquire a shared read-lock on the collection (fine)
  2. line 6170: thread A acquired the read-lock (fine)
  3. line 6175: thread B trying to acquire a write-lock on the collection and waiting (fine)
  4. line 6199: thread A trying to acquire a shared read-lock on the collection and waiting (oops!)

The problem is that 4) is blocking on 3) in your case, and this is a deadlock, because 3) is in turn blocking on 2).
When reproducing the same lock call order here, it works without problems, so I suspect it is indeed pthread-implementation-specific. The pthreads library leaves some room for implementation-defined behavior, and I think these specific cases need to be honored in our calling code. Even if it happens only in some environments, once the deadlock state is reached the database becomes unresponsive, and we must avoid that.
I will work on a fix and hope I can provide it by the end of today.

Re the corruption:
It looks like when starting the database there are two collections with the same name (probably one already marked as deleted). Will look into that afterwards.

@jsteemann (Contributor)

I have updated the patch again. As before, the new patch replaces all previous patches, so it should be applied against a clean master installation.
This should fix the deadlocking issues that I have seen in the logs from your system. I have not yet done anything about the corruption issue you mentioned; I will do that tomorrow.
If the patch is helpful, it will soon be committed to the 1.0/master branches.

@kaerus commented Sep 10, 2012

Looking good so far... have been running the tests for ~10 minutes and my computer is starting to heat up. ;)
Seen only some occasional cursor deletion failures (these started to pop up after you changed wait() to wait(uint64_t), I reckon). But that is another issue.

Ok, so after approx. 20 minutes of testing, even with two concurrently running test processes, I have to conclude that it works now!
I must say that your way of handling this issue has boosted my confidence in ArangoDB even further. :)

I don't think you want the log file ;)

K1:arango-client kaerus$ ls -la /usr/local/var/log/arangodb/arangod.log 
-rw-r-----  1 root  wheel  446093487 Sep 10 19:14 /usr/local/var/log/arangodb/arangod.log

@jsteemann (Contributor)

Thanks for your patience and support while finding the root cause of this issue.
I will try to figure out what's happening with the cursor deletion and the reported corruption on startup. Looking at the log file size, I think I don't want the log this time.
I will update this issue when there's news or the patch gets committed to 1.0/master.

@jsteemann (Contributor)

Created separate issue #194 for the corruption problem.

jsteemann added a commit that referenced this issue Sep 12, 2012
@jsteemann (Contributor)

Fixed in 1.0 and devel; awaiting pull to master.

@fceller (Contributor) commented Sep 19, 2012

Merge to master

fceller closed this as completed Sep 19, 2012
ObiWahn added a commit that referenced this issue Jun 2, 2017
ObiWahn added a commit that referenced this issue Jun 2, 2017
mchacki pushed a commit that referenced this issue Jun 8, 2017
mchacki pushed a commit that referenced this issue Jun 12, 2017
fceller pushed a commit that referenced this issue Jun 12, 2017
* add warmup documentation - #188

* warn when relinking build directory

* Renamed warmup to loadIndexesInMemory which is a better name for the current implementation of this feature.

* Adapted WebUI to state 'load indexes in memory' instead of 'warmup'

* Added loadIndexesInMemory documentation.

* Renamed loadIndexesInMemory => loadIndexesIntoMemory