Cluster can not start after crashes · Issue #7293 · arangodb/arangodb · GitHub

Closed
suipinglu opened this issue Nov 12, 2018 · 7 comments

Comments

@suipinglu
suipinglu commented Nov 12, 2018

My Environment

  • ArangoDB Version: 3.3.19
  • Storage Engine: MMFiles
  • Deployment Mode: Cluster
  • Deployment Strategy: Manual Start
  • Infrastructure: own
  • Operating System: Oracle Linux 7

Problem:

2018-11-12T00:10:50Z |INFO| restarting agent component=arangodb
2018-11-12T00:10:50Z |INFO| Looking for a running instance of agent on port 8531 component=arangodb
2018-11-12T00:10:50Z |INFO| Starting agent on port 8531 component=arangodb
2018-11-12T00:10:50Z |INFO| agent has terminated quickly, in 211.886478ms (recent failures: 46) component=arangodb
2018-11-12T00:10:51Z |INFO| ## Start of agent log
        2018-11-12T00:10:50Z [19461] INFO using storage engine mmfiles
        2018-11-12T00:10:50Z [19461] INFO {cluster} Starting up with role AGENT
        2018-11-12T00:10:50Z [19461] INFO {syscall} file-descriptors (nofiles) hard limit is 8192, soft limit is 8192
        2018-11-12T00:10:50Z [19461] INFO {authentication} Authentication is turned on (system only), authentication for unix sockets is turned on
        2018-11-12T00:10:50Z [19461] WARNING {mmap} memory-protecting failed for range 0x7f5b06c4b000 - 0x7f5b08c4b000 (33554432 bytes), file-descriptor 18, flags read,write: Permission denied
        2018-11-12T00:10:50Z [19461] ERROR {datafiles} unable to change memory protection for memory backed by datafile '/data1/db/agent8531/data/journals/logfile-81.db'. please check file permissions and mount options.
        2018-11-12T00:10:50Z [19461] ERROR unable to open logfile '/data1/db/agent8531/data/journals/logfile-81.db': system error
        2018-11-12T00:10:50Z [19461] FATAL could not inspect WAL logfiles: system error
        2018-11-12T00:10:50Z [19512] INFO ArangoDB 3.3.19 [linux] 64bit, using jemalloc, build tags/v3.3.19-0-gfe9657c, VPack 0.1.30, RocksDB 5.6.0, ICU 58.1, V8 5.7.492.77, OpenSSL 1.0.2k-fips  26 Jan 2017
        2018-11-12T00:10:50Z [19512] INFO detected operating system: Linux version 3.10.0-693.17.1.el7.x86_64 (mockbuild@x86-041.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Sun Jan 14 10:36:03 EST 2018
        2018-11-12T00:10:50Z [19512] WARNING {memory} It is recommended to set NUMA to interleaved.
        2018-11-12T00:10:50Z [19512] WARNING {memory} put 'numactl --interleave=all' in front of your command
        2018-11-12T00:10:50Z [19512] INFO using storage engine mmfiles
        2018-11-12T00:10:50Z [19512] INFO {cluster} Starting up with role AGENT
        2018-11-12T00:10:50Z [19512] INFO {syscall} file-descriptors (nofiles) hard limit is 8192, soft limit is 8192
        2018-11-12T00:10:50Z [19512] INFO {authentication} Authentication is turned on (system only), authentication for unix sockets is turned on
        2018-11-12T00:10:50Z [19512] WARNING {mmap} memory-protecting failed for range 0x7fc39961c000 - 0x7fc39b61c000 (33554432 bytes), file-descriptor 18, flags read,write: Permission denied
        2018-11-12T00:10:50Z [19512] ERROR {datafiles} unable to change memory protection for memory backed by datafile '/data1/db/agent8531/data/journals/logfile-81.db'. please check file permissions and mount options.
        2018-11-12T00:10:50Z [19512] ERROR unable to open logfile '/data1/db/agent8531/data/journals/logfile-81.db': system error
        2018-11-12T00:10:50Z [19512] FATAL could not inspect WAL logfiles: system error
## End of agent log component=arangodb
2018-11-12T00:10:51Z |INFO| restarting agent component=arangodb
2018-11-12T00:10:51Z |INFO| Looking for a running instance of agent on port 8531 component=arangodb
2018-11-12T00:10:51Z |INFO| Starting agent on port 8531 component=arangodb
2018-11-12T00:10:51Z |INFO| agent has terminated quickly, in 206.043627ms (recent failures: 47) component=arangodb
2018-11-12T00:10:51Z |INFO| ## Start of agent log
        2018-11-12T00:10:50Z [19512] INFO using storage engine mmfiles
        2018-11-12T00:10:50Z [19512] INFO {cluster} Starting up with role AGENT
        2018-11-12T00:10:50Z [19512] INFO {syscall} file-descriptors (nofiles) hard limit is 8192, soft limit is 8192
        2018-11-12T00:10:50Z [19512] INFO {authentication} Authentication is turned on (system only), authentication for unix sockets is turned on
        2018-11-12T00:10:50Z [19512] WARNING {mmap} memory-protecting failed for range 0x7fc39961c000 - 0x7fc39b61c000 (33554432 bytes), file-descriptor 18, flags read,write: Permission denied
        2018-11-12T00:10:50Z [19512] ERROR {datafiles} unable to change memory protection for memory backed by datafile '/data1/db/agent8531/data/journals/logfile-81.db'. please check file permissions and mount options.
        2018-11-12T00:10:50Z [19512] ERROR unable to open logfile '/data1/db/agent8531/data/journals/logfile-81.db': system error
        2018-11-12T00:10:50Z [19512] FATAL could not inspect WAL logfiles: system error
        2018-11-12T00:10:51Z [19563] INFO ArangoDB 3.3.19 [linux] 64bit, using jemalloc, build tags/v3.3.19-0-gfe9657c, VPack 0.1.30, RocksDB 5.6.0, ICU 58.1, V8 5.7.492.77, OpenSSL 1.0.2k-fips  26 Jan 2017
        2018-11-12T00:10:51Z [19563] INFO detected operating system: Linux version 3.10.0-693.17.1.el7.x86_64 (mockbuild@x86-041.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Sun Jan 14 10:36:03 EST 2018
        2018-11-12T00:10:51Z [19563] WARNING {memory} It is recommended to set NUMA to interleaved.
        2018-11-12T00:10:51Z [19563] WARNING {memory} put 'numactl --interleave=all' in front of your command
        2018-11-12T00:10:51Z [19563] INFO using storage engine mmfiles
        2018-11-12T00:10:51Z [19563] INFO {cluster} Starting up with role AGENT
        2018-11-12T00:10:51Z [19563] INFO {syscall} file-descriptors (nofiles) hard limit is 8192, soft limit is 8192
        2018-11-12T00:10:51Z [19563] INFO {authentication} Authentication is turned on (system only), authentication for unix sockets is turned on
        2018-11-12T00:10:51Z [19563] WARNING {mmap} memory-protecting failed for range 0x7f2a098bb000 - 0x7f2a0b8bb000 (33554432 bytes), file-descriptor 18, flags read,write: Permission denied
        2018-11-12T00:10:51Z [19563] ERROR {datafiles} unable to change memory protection for memory backed by datafile '/data1/db/agent8531/data/journals/logfile-81.db'. please check file permissions and mount options.
        2018-11-12T00:10:51Z [19563] ERROR unable to open logfile '/data1/db/agent8531/data/journals/logfile-81.db': system error
        2018-11-12T00:10:51Z [19563] FATAL could not inspect WAL logfiles: system error

Expected result:

When the cluster is restarted, it should pick up its previous state and continue.

@suipinglu
Author

There should not be any permission issues, because the cluster is started under the root account.
I have already checked the folder permissions.

@kvahed
Contributor
kvahed commented Nov 12, 2018

Just so we get a more complete picture: Is this only happening to the one agent or to all services and all machines?
Would you please collect the setup and logs from all three machines and instances, like so:

cd /data1/db
tar czf $(hostname).tar.gz */arangod.log */arangod.conf */arangod_command.txt

and share the resulting archive with us?

Kind regards,
Kaveh

@suipinglu
Author

Hi Kaveh,

I have checked and it should be OK for us to share.

node1.tar.gz
node2.tar.gz
node3.tar.gz

Please have a look at these files.

Cheers,
Ping

@dhly-etc
Contributor

Hi @pingpongballs, thanks for providing the logs. From what I can tell, it seems we're having trouble changing the memory-mapped protections on the file. We've seen this before with certain filesystems, or with certain mount options (in particular, NOEXEC). Could you provide us with the filesystem type and mount options (e.g. from mount -v | grep '/data1')?

@suipinglu
Author

Hi Daniel,

Thanks for the response.

We are using the "noexec" mount option.

Here is the output:
/dev/sda1 on /data1 type xfs (rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,noquota)

Cheers
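
The check described above can be sketched in shell. This is only an illustration: `has_noexec` is a hypothetical helper name, not part of ArangoDB or any tool mentioned in this thread, and the option string is copied from the mount output above.

```shell
# Hypothetical helper: succeeds if the comma-separated mount
# option list passed as $1 contains the exact option "noexec".
has_noexec() {
  case ",$1," in
    *,noexec,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Option string taken from the mount output reported above.
opts="rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,noquota"

if has_noexec "$opts"; then
  echo "noexec is set"
else
  echo "noexec is not set"
fi
```

On a live system the option string could instead be read with `findmnt -n -o OPTIONS --target /data1`, assuming util-linux is installed.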

@dhly-etc
Contributor

@pingpongballs Would it be possible to remount without the noexec flag? There's a good chance that will fix it.
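
A minimal sketch of what removing the flag could look like, assuming the /data1 mount point from this thread. `strip_noexec` is a hypothetical helper that only rewrites an option string (for example, when preparing an updated /etc/fstab entry); the actual remount command appears as a comment because it needs root and changes the running system.

```shell
# Hypothetical helper: print the comma-separated option list in $1
# with the exact option "noexec" removed.
strip_noexec() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -vx -- 'noexec' | paste -sd, -
}

# Option string taken from the mount output reported in this thread.
current="rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,noquota"
strip_noexec "$current"

# To apply the change to the running system (as root):
#   mount -o remount,exec /data1
# The matching /etc/fstab entry also needs noexec removed,
# otherwise the flag comes back on the next boot.
```

Stopping the ArangoDB processes before remounting is probably the safer order, although the thread does not say whether that was done.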

@suipinglu
Author

Hi Daniel,

Thanks for your direction. After I remounted without "noexec", we can now rejoin the downed node.
We can also do a "Rebalance Shards" via the front-end.

That makes sense to me now, since there are some binaries within the db folders.

Resolved this issue.

Cheers
