ArangoDB 2.7.3 consistently stops responding #1666

Closed
ottoyiu opened this issue Jan 19, 2016 · 12 comments

@ottoyiu
ottoyiu commented Jan 19, 2016

Hello there,

We're running 2.7.3 and 2.6.12 on CentOS 6 across four separate production environments, and we can consistently make ArangoDB stop responding with a pretty generic workload on every single one of them.
(Linux tat1234 2.6.32-573.el6.x86_64 #1 SMP Thu Jul 23 15:44:03 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux)

Attempting to 'arangosh' into the instance does not return a prompt, nor does the dashboard on port 8529 respond.

I did an strace -p -f on the arangod process and I'm not sure what I'm actually looking at, but it doesn't seem TOO out of the ordinary compared to a responding instance:
http://hastebin.com/raw/erujuhewec

I was thinking it could be garbage collection behaviour, but it doesn't respond even after leaving it for hours. The only thing that fixes it is restarting the daemon.

I'm not sure what steps to take next. Help is much appreciated :)

Thanks in advance,
Otto

@jsteemann
Contributor

Thanks for the strace output. Unfortunately it doesn't reveal anything obvious.

If the issue is reproducible, could you check next time whether the server still responds to the version API? A curl -X GET http://127.0.0.1:8529/_api/version --dump - should do, and if authentication is set up, it will be curl -X GET --basic --user 'username:password' http://127.0.0.1:8529/_api/version --dump -. This is particularly interesting for diagnosis: if the server responds here, the HTTP I/O is still working but all the dispatcher threads are blocked, which already reduces the problem search space.
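For reference, here are the two invocations written out on their own lines (the username and password are of course placeholders):

curl -X GET http://127.0.0.1:8529/_api/version --dump -
curl -X GET --basic --user 'username:password' http://127.0.0.1:8529/_api/version --dump -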

Can you also let me know what the pretty generic workload is, and whether it contains AQL queries or server-side transactions that invoke graph selection or modification operations? This should help rule out potential causes. It would also help to know whether the server has enough RAM to hold the active collections completely in memory. If RAM is limited, operations may get slowed down; if there's enough RAM, we can rule that out, too.

@ottoyiu
Author
ottoyiu commented Jan 20, 2016

Hi Jan,

First off, thank you for your reply :) Much appreciated!

I have two different servers with two different scenarios, I believe. Both servers respond with the version number.

Server A does not respond through arangosh and is no longer serving requests, but it does respond to the version endpoint.

[root@serverA ~]# arangosh --quiet

^C
[root@serverA ~]# curl -X GET http://127.0.0.1:8529/_api/version
{"server":"arango","version":"2.7.3"}

Server B was in the same scenario as Server A, but after leaving it overnight I can now get a prompt in arangosh; however, everything returns a 500 Internal Server Error.

arangosh [_system]> require("org/arangodb/replication").applier.state()
JavaScript exception in file '/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js' at 104,13: [ArangoError 500: Internal Server Error]
!      throw new ArangoError(requestResult);
!            ^
stacktrace: Error
    at Object.exports.checkRequestResult (/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js:104:13)
    at Object.applier.state (/usr/share/arangodb/js/client/modules/org/arangodb/replication.js:122:12)
    at <shell command>:1:45
arangosh [_system]> require("org/arangodb/replication").applier.state()
JavaScript exception in file '/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js' at 104,13: [ArangoError 500: Internal Server Error]
!      throw new ArangoError(requestResult);
!            ^
stacktrace: Error
    at Object.exports.checkRequestResult (/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js:104:13)
    at Object.applier.state (/usr/share/arangodb/js/client/modules/org/arangodb/replication.js:122:12)
    at <shell command>:1:45


arangosh [_system]> require("internal").wal.properties();
{
  "code" : 500,
  "error" : true,
  "errorNum" : 500,
  "errorMessage" : "Internal Server Error"
}

[root@serverB ~]# curl -X GET http://127.0.0.1:8529/_api/version
{"server":"arango","version":"2.7.3"}

Our workload consists of mostly:

  • reads against a single collection (eg. FOR x in collection FILTER x.name=='foo' RETURN x). We have a custom function that extends AQL, just in case that makes a difference (a rough sketch of these query shapes follows after this list)
  • large document writes to a collection that consist of bulk imports for new documents, and batch requests for replacement and deletion of existing documents
  • AQL updates (eg. FOR x in collection FILTER x.name=='foo' UPDATE x with...')
  • Graph Traversals and Edge creations (our team member faced this problem before and opened this issue: RangeError: Invalid string length on large traversal #1415)
    These are the ones I'm aware of; it's definitely not an exhaustive list of the operations we use ArangoDB for.
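To make those shapes concrete, here is a rough sketch of the read and update patterns; the collection name, filter value, and WITH clause below are placeholders, not our real queries:

FOR x IN collection FILTER x.name == 'foo' RETURN x
FOR x IN collection FILTER x.name == 'foo' UPDATE x WITH { counter: x.counter + 1 } IN collection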

We do not use any server-side transactions, only AQL queries to do graph selections and modifications. Is there any way to have ArangoDB log all the queries sent to the server? If not, I will try to dig into our software to see what queries are being called and used.

Both servers have 64GB of RAM, and an arangodump of all our collections is around 12GB in size.

Thanks for your help!

Otto

@fceller
Contributor
fceller commented Jan 20, 2016

Hi Otto,

Thanks a lot for the explanation of your scenario. I've discussed it with Jan, and we might have an idea where the deadlock is caused. Jan will back-port the current deadlock detection to 2.7, but we have to do some more tests to avoid any unwanted side effects. We will then release a new 2.7.

Coming back to your other question: it is possible to log requests by specifying

--log.requests filename

However, that will not log the complete requests, only the request path.
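For example, adding something like the following to the startup options of arangod (the file path is just a placeholder; use any location the arangodb user can write to):

--log.requests /var/log/arangodb/requests.log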

Thanks
Frank

@ottoyiu
Author
ottoyiu commented Jan 20, 2016

Thank you Frank and Jan! I will try the latest beta of 2.8 in staging to see if I can reproduce this there.

@fceller
Contributor
fceller commented Jan 20, 2016

@ottoyiu if you can try out the 2.8 beta, that would be very helpful.

thanks
Frank

@ottoyiu
Author
ottoyiu commented Jan 20, 2016

I just crashed both servers the same way as on 2.7. I'm going to log the requests and see if that helps with the debugging.

[root@serverA ~]# arangosh --quiet

^C
[root@serverA ~]# curl -X GET http://127.0.0.1:8529/_api/version
{"server":"arango","version":"2.8.0-beta3"}[root@serverA ~]#

Is there a way to use the 'queries' module to see a list of running/slow queries, while the server is in this state?
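Something along these lines is what I had in mind, assuming the 2.x module path and that the calls go through while the server is in this state:

arangosh [_system]> var queries = require("org/arangodb/aql/queries");
arangosh [_system]> queries.current();   // currently running queries
arangosh [_system]> queries.slow();      // recently finished slow queries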

@fceller
Contributor
fceller commented Jan 20, 2016

Thanks for testing. We have to investigate in more detail. When the server hangs and you do a

top -H

do you see any arangodb thread that is still running (i. e. has a significant %cpu)?
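To narrow the output to just the arangod threads, something like this should also work (assuming pidof is available and there is a single arangod process):

top -H -p $(pidof arangod)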

@ottoyiu
Author
ottoyiu commented Jan 20, 2016

Doesn't seem like it. Sometimes there's a short spike in CPU usage for one or two dispat_std threads, but nothing significant. The other one is the v8-gc thread, which runs from time to time and takes around 10% CPU.

top - 21:44:37 up 14 days, 22:21,  2 users,  load average: 0.00, 0.04, 0.06
Tasks: 445 total,   1 running, 444 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65952028k total,  3576140k used, 62375888k free,     1296k buffers
Swap:  6143996k total,        0k used,  6143996k free,  2409096k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12855 arangodb  20   0 12.5g 1.3g 929m S  0.3  2.1   0:02.48 [dispat_std]
14944 arangodb  20   0 12.5g 1.3g 929m S  0.3  2.1   0:01.09 [dispat_std]
14945 arangodb  20   0 12.5g 1.3g 929m S  0.3  2.1   0:01.06 [dispat_std]
12801 arangodb  20   0 90088 3132  568 S  0.0  0.0   0:00.00 arangodb [super
12802 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.51 arangod
12803 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.05 [logging]
12804 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.08 [statistics]
12805 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.00 [IndexBuilder]
12806 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.00 [IndexBuilder]
12807 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.04 [cleanup]
12808 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.03 [databases]
12809 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.29 [WalAllocator]
12810 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:05.79 [WalSynchronize
12811 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.86 [WalCollector]
12812 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.08 [WalRemover]
12813 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.03 [compactor]
12814 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.04 V8 WorkerThread
12815 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.04 V8 WorkerThread
12816 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.03 V8 WorkerThread
12817 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.04 V8 WorkerThread
12850 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.10 [scheduler]
12851 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.12 [scheduler]
12852 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.05 [scheduler]
12853 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.06 [scheduler]
12854 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.05 [v8-gc]
12856 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.70 [dispat_std]
12870 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:15.82 [dispat_std]
12959 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.19 [dispat_std]
14941 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.26 [dispat_std]
14942 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.20 [dispat_std]
14943 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.92 [dispat_std]
14946 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.20 [dispat_std]
14947 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.34 [dispat_std]
14948 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.08 [dispat_std]
14949 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:01.13 [dispat_std]
14958 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.82 [dispat_std]
15063 arangodb  20   0 12.5g 1.3g 929m S  0.0  2.1   0:00.77 [dispat_std]

@ottoyiu
Author
ottoyiu commented Jan 20, 2016

And it seems like the servers have recovered... could it be an effect of the deadlock timeouts introduced in 2.8?

@fceller
Contributor
fceller commented Jan 20, 2016

We have to investigate a bit more.

Is there any chance that you could start the server with

--log.level trace

and send us [hackers (at) arangodb.org] the corresponding log file?
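For example, assuming --log.file is available in your setup to direct the trace output to a dedicated file (the path is just a placeholder):

arangod --log.level trace --log.file /var/log/arangodb/arangod-trace.log [your existing options]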

@ottoyiu
Author
ottoyiu commented Jan 20, 2016

Email sent. Thanks!

@dothebart
Contributor

Fixed after private conversation, so progress is not documented here.
