Shutting down daemon on Linux is very slow (very large dataset, before upgrade) · Issue #13156 · arangodb/arangodb · GitHub


Closed
matcho opened this issue Dec 7, 2020 · 18 comments

Comments

@matcho commented Dec 7, 2020

My Environment

  • ArangoDB Version: 3.4.1
  • Storage Engine: RocksDB
  • Deployment Mode: Single Server
  • Deployment Strategy: systemd
  • Infrastructure: own
  • Operating System: Ubuntu 18.04.1 LTS
  • Total RAM in your machine: 256 GB
  • Disks in use: RAID 10 of HDDs
  • Used Package: Debian or Ubuntu .deb

Component, Query & Data

Affected feature:
Shutdown process

Size of your Dataset on disk:
1.4 TB

Steps to reproduce

  1. Ensure no queries are running on the server
  2. Run systemctl stop arangodb3
  3. Wait and watch logs
2020-12-07T09:03:24Z [2879] INFO control-c received, beginning shut down sequence
2020-12-07T09:10:53Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:18:53Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:26:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:34:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:42:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:50:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:58:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
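As an editorial aside, the elapsed shutdown time can be read straight off the timestamps in the excerpt above; a small sketch (the two log lines are copied from the excerpt):

```python
from datetime import datetime, timezone

def elapsed_seconds(first_line, last_line):
    """Seconds between the ISO-8601 UTC timestamps leading two arangod log lines."""
    def ts(line):
        # arangod prefixes each line with a timestamp like 2020-12-07T09:03:24Z
        return datetime.strptime(line.split(" ", 1)[0],
                                 "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (ts(last_line) - ts(first_line)).total_seconds()

start = "2020-12-07T09:03:24Z [2879] INFO control-c received, beginning shut down sequence"
end = "2020-12-07T09:58:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress"
print(elapsed_seconds(start, end) / 60)  # → 55.5
```

So already 55.5 minutes had elapsed between the shutdown request and the last warning shown.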

Problem:
Server takes almost 1 hour to shut down, from the moment it receives "control-c" to the moment the systemctl command returns and the process is killed.

This is penalizing when upgrading the ArangoDB package on a production server, which is the case here: we are upgrading from 3.4.1 to 3.7.3, and this leads to a longer-than-expected period of unavailability for our service. No big deal though, as we do not provide a critical service.
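As an aside, systemd itself escalates to SIGKILL once the unit's stop timeout expires, so on hosts where a long clean shutdown is expected the timeout can be raised with a drop-in. A hypothetical fragment (the path and value are illustrative; the packaged arangodb3 unit may already configure its own timeout):

```ini
# Hypothetical drop-in: /etc/systemd/system/arangodb3.service.d/override.conf
[Service]
# Allow arangod up to 2 hours for a clean shutdown before systemd sends SIGKILL
# (the value is illustrative; pick one matching your dataset).
TimeoutStopSec=7200
```

After adding a drop-in, `systemctl daemon-reload` makes systemd pick it up.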

Expected result:
A faster shutdown process.

Our hypothesis is that it might be related to the dataset, which is very large; maybe something to do with writing persistent indexes to disk before shutting down?

Thank you very much,
Greetings,

Mathias

@dothebart (Contributor)

Hi,
please note that 3.4.1 has been superseded by a number of bugfix releases (3.4.11 being the last one), where these issues have probably been addressed.

StatisticsWorker is the thread that aggregates statistics about ArangoDB's resource usage.

However, 3.4 has meanwhile reached EOL, hence it is no longer supported.

That said, I wouldn't see too high a risk in force-terminating the arangod process in this case.

If you experience similar problems with more recent ArangoDB versions, please let us know.

@matcho (Author) commented Dec 7, 2020

Hi,

It's true that 3.4.1 is fairly old, but on a local dev machine with the same dataset I had a 3.4.9 that I recently upgraded to 3.7.3, and the situation was the same: 1 hour to shut down.

We'll see if it gets better next time starting with 3.7.

Thank you !

@Simran-B (Contributor) commented Dec 9, 2020

Closing this issue for now as 3.4 is EoL, but please re-open if you can reproduce it with 3.7. Thanks!

@matcho (Author) commented Jan 6, 2021

Hi,

Reopening this ticket because we just experienced a case where v3.7.3 took around 1 hour to shut down (restart).

Context:

  • same hardware and dataset as described above
  • ArangoDB was getting slower and slower throughout the day, and eventually aardvark stopped responding (we're still investigating to find out what happened), although we could still run arangosh
  • around 10 slow queries had been running for ~200 s, and 3 had been running for more than 1 h
  • the server was under regular load, which is not so heavy, but still many (fast) queries per second

What we did:

  • killed all slow queries via arangosh
  • waited a little; the queries appeared as "status: killed" but did not terminate
  • ran systemctl restart arangodb3
  • waited for arangod to restart, which took around 1 h (see logs below), notably mentioning info like {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 5 feature leases to be released

What does "feature leases to be released" mean? Are the slow queries responsible for the shutdown time? Although I understand this might be dangerous in case there are write queries running, is it possible to terminate them in a rougher way when a shutdown is triggered?

In a production context, a quick shutdown/restart seems to be expected, as 1) triggering a shutdown is probably due to an emergency situation, and 2) more shutdown time means more downtime for the users. Of course this is just our case, and might not be representative of the general case.

Also, I'm asking just out of curiosity, as nobody should probably ever do this, but what happens if I kill -9 arangod? Does RocksDB support brutal failures like this, or does it risk corrupting data?

Thanks a lot for any advice.
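As an editorial aside: the progress of such a shutdown can be tracked by extracting the counters from the AQLFeature lines mentioned above; a minimal sketch, assuming the log-line format quoted in this comment:

```python
import re

# Pattern mirroring the AQLFeature shutdown progress line quoted above.
LEASE_RE = re.compile(
    r"AQLFeature shutdown, waiting for (\d+) registered queries "
    r"to terminate and for (\d+) feature leases to be released"
)

def shutdown_progress(log_line):
    """Return (registered_queries, feature_leases) still blocking shutdown, or None."""
    m = LEASE_RE.search(log_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = ("2021-01-05T16:35:31Z [31177] INFO [63d54] {queries} AQLFeature shutdown, "
        "waiting for 0 registered queries to terminate and for 5 feature leases to be released")
print(shutdown_progress(line))  # → (0, 5)
```

Feeding the tail of arangod.log through this shows whether the lease count is actually decreasing over time.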

arangod.log

2021-01-05T16:23:40Z [31177] INFO [b4133] control-c received, beginning shut down sequence
2021-01-05T16:24:17Z [31177] WARNING [4512c] {queries} timed out after 900 s waiting for exclusive-lock on collection '_fishbowl'
2021-01-05T16:24:18Z [31177] WARNING [ebe22] Failed to update Foxx store from GitHub.
2021-01-05T16:24:46Z [31177] WARNING [8bcee] {queries} slow query: '    FOR u in @@value0     FOR o IN @@value1      FILTER o.partner.author.id != null      FILTER o.partner.author.id == u.email      FILTER o.author.id == null      LIMIT 1      RETURN {       email: o.partner.author.id,          author_id: u._key      }   ', bind vars: {"@value0":"users","@value1":"observations"}, database: mydbv2, user: mydb, took: 19.607377 s
2021-01-05T16:35:31Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 5 feature leases to be released
2021-01-05T16:35:31Z [31177] WARNING [3ad54] {engines} slow background settings sync: 1591.804362 s
2021-01-05T16:35:31Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 5 feature leases to be released

2021-01-05T16:37:55Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 4 feature leases to be released
2021-01-05T16:37:55Z [31177] INFO [5b708] {engines} last RocksDB WAL sync happened longer ago than configured threshold. last sync happened 144883 ms ago, threshold value: 5000 ms
2021-01-05T16:37:56Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 4 feature leases to be released

2021-01-05T16:49:55Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 3 feature leases to be released
2021-01-05T16:49:55Z [31177] ERROR [24213] Service "/_admin/aardvark" encountered error 400 while handling GET http://myserver:8529/_db/mydbv2/_admin/aardvark/foxxes/fishbowl
2021-01-05T16:49:55Z [31177] ERROR [24213] via ArangoError: shutdown in progress
2021-01-05T16:49:55Z [31177] ERROR [24213]     at ArangoStatement.execute (/usr/share/arangodb3/js/server/modules/@arangodb/arango-statement.js:84:16)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at ArangoDatabase._query (/usr/share/arangodb3/js/server/modules/@arangodb/arango-database.js:102:45)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at SimpleQueryAll.execute (/usr/share/arangodb3/js/server/modules/@arangodb/simple-query.js:95:42)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at SimpleQueryAll.SimpleQuery.hasNext (/usr/share/arangodb3/js/common/modules/@arangodb/simple-query-common.js:219:8)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at Object.availableJson (/usr/share/arangodb3/js/common/modules/@arangodb/foxx/store.js:266:19)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at Route._handler (/usr/share/arangodb3/js/apps/system/_admin/aardvark/APP/foxxes.js:376:20)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at next (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/router/tree.js:419:15)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at /usr/share/arangodb3/js/node/node_modules/lodash/lodash.js:10050:25
2021-01-05T16:49:55Z [31177] ERROR [24213]     at Middleware.<anonymous> (/usr/share/arangodb3/js/apps/system/_admin/aardvark/APP/foxxes.js:54:3)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at next (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/router/tree.js:421:15)
2021-01-05T16:49:55Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 3 feature leases to be released

2021-01-05T17:23:40Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 3 feature leases to be released
2021-01-05T17:23:50Z [88651] INFO [e52b0] ArangoDB 3.7.3 [linux] 64bit, using jemalloc, build tags/v3.7.3-0-g8ad8edb71f, VPack 0.1.33, RocksDB 6.8.0, ICU 64.2, V8 7.9.317, OpenSSL 1.1.1g  21 Apr 2020
2021-01-05T17:23:50Z [88651] INFO [75ddc] detected operating system: Linux version 4.15.0-112-generic (buildd@lcy01-amd64-027) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
2021-01-05T17:23:50Z [88651] INFO [25362] {memory} Available physical memory: 270010130432 bytes, available cores: 96
2021-01-05T17:23:50Z [88651] WARNING [118b0] {memory} maximum number of memory mappings per process is 65530, which seems too low. it is recommended to set it to at least 6144000
2021-01-05T17:23:50Z [88651] WARNING [49528] {memory} execute 'sudo sysctl -w "vm.max_map_count=6144000"'
2021-01-05T17:23:50Z [88651] WARNING [3e451] {memory} It is recommended to set NUMA to interleaved.
2021-01-05T17:23:50Z [88651] WARNING [b25a4] {memory} put 'numactl --interleave=all' in front of your command
2021-01-05T17:23:50Z [88651] INFO [43396] {authentication} Jwt secret not specified, generating...
2021-01-05T17:23:50Z [88651] INFO [144fe] using storage engine 'rocksdb'
2021-01-05T17:23:50Z [88651] INFO [3bb7d] {cluster} Starting up with role SINGLE
2021-01-05T17:23:50Z [88651] INFO [6ea38] using endpoint 'http+tcp://0.0.0.0:8529' for non-encrypted requests
2021-01-05T17:23:50Z [88651] INFO [a1c60] {syscall} file-descriptors (nofiles) hard limit is 131072, soft limit is 131072
2021-01-05T17:23:50Z [88651] INFO [3844e] {authentication} Authentication is turned on (system only), authentication for unix sockets is turned on
2021-01-05T17:23:50Z [88651] WARNING [b387d] found existing lockfile '/var/lib/arangodb3/LOCK' of previous process with pid 31177, but that process seems to be dead already
2021-01-05T17:24:51Z [88651] INFO [cf3f4] ArangoDB (version 3.7.3 [linux]) is ready for business. Have fun!

@brunoaduarte commented Nov 9, 2021

This problem still persists on the latest 3.8.2.

2021-11-09T18:17:57Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'PrimaryIndex' to let background writes catch up
2021-11-09T18:18:02Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'PrimaryIndex'
2021-11-09T18:18:14Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'VPackIndex' to let background writes catch up
2021-11-09T18:18:23Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:20:09Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'VPackIndex'
2021-11-09T18:20:10Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:24:48Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'VPackIndex' to let background writes catch up
2021-11-09T18:24:50Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:25:31Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'VPackIndex'
2021-11-09T18:28:33Z [7615] WARNING [9123d] {engines} rocksdb has stopped incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:28:54Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:28:54Z [7615] WARNING [3ad54] {engines} slow background settings sync: 23.310444 s
2021-11-09T18:37:01Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'VPackIndex' to let background writes catch up
2021-11-09T18:37:06Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:37:35Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:37:39Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'VPackIndex'
2021-11-09T18:42:14Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:46:09Z [7615] WARNING [9123d] {engines} rocksdb has stopped incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:47:03Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:47:03Z [7615] WARNING [3ad54] {engines} slow background settings sync: 55.883286 s
2021-11-09T18:47:14Z [7615] INFO [5b708] {engines} last RocksDB WAL sync happened longer ago than configured threshold. last sync happened 5189 ms ago, threshold value: 5000 ms
2021-11-09T18:47:21Z [7615] INFO [5b708] {engines} last RocksDB WAL sync happened longer ago than configured threshold. last sync happened 6707 ms ago, threshold value: 5000 ms

@dothebart (Contributor)

Can you run the server with

--log.level=startup=trace

or the config

[log]
level=startup=trace

so we can better see what the shutdown is being held up by?
You probably need to run the server like this for a while; the trace level should only add some more lines during startup/shutdown.

@matcho (Author) commented Nov 30, 2021

On our side, the last time we shut down v3.8.0 to upgrade it to v3.8.3 (a few minutes ago), the process was quite quick although the server was under heavy load: less than 1 minute.

2021-11-30T11:30:33Z [3642228] INFO [b4133] {general} SIGTERM received (sender pid 1), beginning shut down sequence
2021-11-30T11:31:08Z [3642228] INFO [4bcb9] {general} ArangoDB has been shut down

One important note, though: we replaced the RAID 10 of HDDs with a RAID 10 of NVMe SSDs − maybe there's a connection (compactions running in the background? transactions completing properly?…)

@matcho (Author) commented Aug 23, 2022

Seems fixed for a long time now.

matcho closed this as completed Aug 23, 2022
dothebart added the '2 Solved' resolution label Aug 24, 2022
@dothebart (Contributor)

Thanks for following up.

@pavelnemirovsky

The issue is not fixed. See below: the last node restart in our cluster (version 3.11).

root@arrango-db-03:~# service arangodb3-database status
● arangodb3-database.service - ArangoDB DB Server (Role=Database)
   Loaded: loaded (/lib/systemd/system/arangodb3-database.service; disabled; vendor preset: enabled)
   Active: deactivating (stop-sigterm) since Wed 2023-12-20 18:01:47 UTC; 6min ago
 Main PID: 23525 (arangod)
    Tasks: 51 (limit: 131072)
   CGroup: /system.slice/arangodb3-database.service
           └─23525 /usr/sbin/arangod --configuration /etc/arangodb3/arangod-database.conf --uid arangodb --gid arangodb --pid-file /var/run/arangodb3-database/arangod.pid --temp.path /var/tmp/arangodb3-database --log.foreground-tty true

Dec 20 18:08:15 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:15Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:15 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:15Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1

@matcho (Author) commented Jul 23, 2024

Indeed, it happens again with version 3.11.8.

systemctl restart arangodb3 sends SIGTERM as intended; arangod receives it but keeps saying that "1 feature lease has to be released". After 10 min, sending another SIGTERM through the kill command successfully terminates the server, which restarts in less than 1 min.

Important note: our setup is "leader-follower", which is no longer officially supported starting with version 3.11. The issue might be related to a replication being set up (the machine here is the follower). If this is the case, then we cannot complain, but this comment might still be useful to others.

2024-07-23T07:17:06Z [2196394] INFO [b4133] {general} SIGTERM received (sender pid 1), beginning shut down sequence
2024-07-23T07:18:58Z [2196394] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 1 feature leases to be released
…
2024-07-23T07:29:39Z [2196394] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 1 feature leases to be released
…
2024-07-23T07:29:44Z [2196394] FATAL [11ca3] {general} SIGTERM received during shutdown sequence (sender pid 1240768), terminating!
2024-07-23T07:30:08Z [1241685] INFO [e52b0] {general} ArangoDB 3.11.8 enterprise [linux] 64bit, using jemalloc, build refs/tags/v3.11.8 eb715d099fb, VPack 0.2.1, RocksDB 7.2.0, ICU 64.2, V8 7.9.317, OpenSSL 3.0.13 30 Jan 2024
2024-07-23T07:30:08Z [1241685] INFO [75ddc] {general} detected operating system: Linux version 5.15.0-100-generic (buildd@lcy02-amd64-116) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024, parent process: 1 (systemd)
…
2024-07-23T07:30:34Z [1241685] INFO [cf3f4] {general} ArangoDB (version 3.11.8 enterprise [linux]) is ready for business. Have fun!

Greetings,
Mathias

@jsteemann (Contributor) commented Jul 23, 2024

@matcho: The important log line here is

2024-07-23T07:29:39Z [2196394] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 1 feature leases to be released

It means that 1 AQL query is still ongoing on the machine that does not react to the shutdown signal. Is it possible that there is some long-running AQL query ongoing while the shutdown command is received?
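As an editorial aside, such long-running queries can be inspected before initiating a shutdown: ArangoDB exposes currently running AQL queries via GET /_api/query/current (each entry carries an id, the query text, and a runTime in seconds), and an individual query can be killed via DELETE /_api/query/{query-id}. A minimal sketch of filtering such a listing; the payload below is a hand-made sample in that shape, not real server output:

```python
# Sketch: find queries whose runtime makes them candidates for blocking a
# shutdown. In a real script, `queries` would come from the JSON response of
# GET /_api/query/current; here it is a hand-made sample in the same shape.

def long_running(queries, threshold_secs=300.0):
    """Return ids of queries running longer than threshold_secs."""
    return [q["id"] for q in queries if q["runTime"] > threshold_secs]

sample = [
    {"id": "1234", "query": "FOR o IN observations RETURN o", "runTime": 3912.4},
    {"id": "1235", "query": "RETURN 1", "runTime": 0.002},
]
print(long_running(sample))  # → ['1234']
```

Each id returned could then be passed to DELETE /_api/query/{query-id} (or to require("@arangodb/aql/queries").kill() in arangosh) before stopping the server.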

@matcho (Author) commented Jul 23, 2024

Thanks @jsteemann, yes it's definitely possible; I will have to check that next time. It helps a lot to know that "feature leases to be released" can be translated to "query still running" 🙂

Could it mean then that a long query has trouble terminating, i.e. being killed? Or that the shutdown procedure does not kill queries unless you send a second SIGTERM during the shutdown phase?

@jsteemann (Contributor)

@matcho: I need to check if the shutdown procedure is trying to shut down running queries or not.
For that it would help me to know how you attempt to shut down the server. Is it by sending a SIGTERM signal, or by any other means?

@matcho (Author) commented Jul 23, 2024

@jsteemann

I attempt to shut down (actually restart) the server by typing sudo systemctl restart arangodb3 in a shell on Ubuntu 22.04, using the default systemd scripts installed along with the official .deb package of ArangoDB.

It seems that this effectively sends a SIGTERM, since arangod logs INFO [b4133] {general} SIGTERM received (sender pid 1), beginning shut down sequence.

@jsteemann (Contributor)

@matcho : alright, thanks, I will try to figure out more.

@jsteemann (Contributor)

@matcho: I think it is possible that some queries still run to completion after a shutdown signal was received. Not all queries seem to be aborted when a shutdown signal is received, so the server may just wait for all already-started queries to complete. This can take an arbitrary amount of time, depending on what the queries actually do.
The 2 above PRs should change the situation so that running queries are intentionally killed upon shutdown.
In my experiments that helped expedite the shutdown from "indefinitely long" to just under a second.

@matcho (Author) commented Jul 25, 2024

Thank you very much @jsteemann, that's great news! 🎉
