Shutting down daemon on Linux is very slow (very large dataset, before upgrade) · Issue #13156 · arangodb/arangodb · GitHub


Closed
matcho opened this issue Dec 7, 2020 · 18 comments

Comments

@matcho commented Dec 7, 2020

My Environment

  • ArangoDB Version: 3.4.1
  • Storage Engine: RocksDB
  • Deployment Mode: Single Server
  • Deployment Strategy: systemd
  • Infrastructure: own
  • Operating System: Ubuntu 18.04.1 LTS
  • Total RAM in your machine: 256 GB
  • Disks in use: RAID 10 of HDDs
  • Used Package: Debian or Ubuntu .deb

Component, Query & Data

Affected feature:
Shutdown process

Size of your Dataset on disk:
1.4 TB

Steps to reproduce

  1. Ensure no queries are running on the server
  2. Run systemctl stop arangodb3
  3. Wait and watch logs
2020-12-07T09:03:24Z [2879] INFO control-c received, beginning shut down sequence
2020-12-07T09:10:53Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:18:53Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:26:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:34:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:42:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:50:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
2020-12-07T09:58:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress
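As an editorial aside, the elapsed shutdown time can be read straight off the timestamps in the excerpt above; a small sketch (the two log lines are copied from the excerpt):

```python
from datetime import datetime, timezone

def elapsed_seconds(first_line, last_line):
    """Seconds between the ISO-8601 UTC timestamps leading two arangod log lines."""
    def ts(line):
        # arangod prefixes each line with a timestamp like 2020-12-07T09:03:24Z
        return datetime.strptime(line.split(" ", 1)[0],
                                 "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (ts(last_line) - ts(first_line)).total_seconds()

start = "2020-12-07T09:03:24Z [2879] INFO control-c received, beginning shut down sequence"
end = "2020-12-07T09:58:54Z [2879] WARNING {statistics} caught exception in StatisticsWorker: shutdown in progress"
print(elapsed_seconds(start, end) / 60)  # → 55.5
```

So already 55.5 minutes had elapsed between the shutdown request and the last warning shown.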

Problem:
Server takes almost 1 hour to shut down, from the moment it receives "control-c" to the moment the systemctl command returns and the process is killed.

This is penalizing when upgrading the ArangoDB package on a production server, which is the case here: we are upgrading from 3.4.1 to 3.7.3, and this leads to a longer-than-expected period of unavailability for our service. No big deal though, as we do not provide a critical service.
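As an aside, systemd itself escalates to SIGKILL once the unit's stop timeout expires, so on hosts where a long clean shutdown is expected the timeout can be raised with a drop-in. A hypothetical fragment (the path and value are illustrative; the packaged arangodb3 unit may already configure its own timeout):

```ini
# Hypothetical drop-in: /etc/systemd/system/arangodb3.service.d/override.conf
[Service]
# Allow arangod up to 2 hours for a clean shutdown before systemd sends SIGKILL
# (the value is illustrative; pick one matching your dataset).
TimeoutStopSec=7200
```

After adding a drop-in, `systemctl daemon-reload` makes systemd pick it up.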

Expected result:
A faster shutdown process.

Our hypothesis is that it might be related to the dataset, which is very large; maybe something to do with writing persistent indexes to disk before shutting down?

Thank you very much,
Greetings,

Mathias

@dothebart (Contributor)

Hi,
please note that 3.4.1 has been superseded by a number of bugfix releases (3.4.11 being the last one), where these issues have probably been addressed.

StatisticsWorker is the thread that aggregates statistics about ArangoDB's resource usage.

However, 3.4 has meanwhile reached EOL, hence it is no longer supported.

That said, I wouldn't see too high a risk in force-terminating the arangod process in this case.

If you experience similar problems with more recent ArangoDB versions, please let us know.

@matcho (Author) commented Dec 7, 2020

Hi,

It's true that 3.4.1 is fairly old, but on a local dev machine with the same dataset I had a 3.4.9 that I recently upgraded to 3.7.3, and the situation was the same: 1 hour to shut down.

We'll see if it gets better next time starting with 3.7.

Thank you !

@Simran-B (Contributor) commented Dec 9, 2020

Closing this issue for now as 3.4 is EoL, but please re-open if you can reproduce it with 3.7. Thanks!

@matcho (Author) commented Jan 6, 2021

Hi,

Reopening this ticket because we just experienced a case where v3.7.3 took around 1 hour to shut down (restart).

Context:

  • same hardware and dataset as described above
  • ArangoDB was getting slower and slower throughout the day, and eventually aardvark stopped responding (we're still investigating to find out what happened), although we could still run arangosh
  • around 10 slow queries had been running for ~200 s, and 3 had been running for more than 1 h
  • the server was under regular load, which is not so heavy, but still many (fast) queries per second

What we did:

  • killed all slow queries via arangosh
  • waited a little; the queries appeared as "status: killed" but did not terminate
  • ran systemctl restart arangodb3
  • waited for arangod to restart, which took around 1 h (see logs below), notably mentioning info like {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 5 feature leases to be released

What does "feature leases to be released" mean? Are the slow queries responsible for the shutdown time? Although I understand this might be dangerous in case there are write queries running, is it possible to terminate them in a rougher way when a shutdown is triggered?

In a production context, a quick shutdown/restart seems to be expected, as 1) triggering a shutdown is probably due to an emergency situation, and 2) more shutdown time means more downtime for the users. Of course this is just our case, and might not be representative of the general case.

Also, I'm asking just out of curiosity, as nobody should probably ever do this, but what happens if I kill -9 arangod? Does RocksDB support brutal failures like this, or does it risk corrupting data?

Thanks a lot for any advice.
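As an editorial aside: the progress of such a shutdown can be tracked by extracting the counters from the AQLFeature lines mentioned above; a minimal sketch, assuming the log-line format quoted in this comment:

```python
import re

# Pattern mirroring the AQLFeature shutdown progress line quoted above.
LEASE_RE = re.compile(
    r"AQLFeature shutdown, waiting for (\d+) registered queries "
    r"to terminate and for (\d+) feature leases to be released"
)

def shutdown_progress(log_line):
    """Return (registered_queries, feature_leases) still blocking shutdown, or None."""
    m = LEASE_RE.search(log_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = ("2021-01-05T16:35:31Z [31177] INFO [63d54] {queries} AQLFeature shutdown, "
        "waiting for 0 registered queries to terminate and for 5 feature leases to be released")
print(shutdown_progress(line))  # → (0, 5)
```

Feeding the tail of arangod.log through this shows whether the lease count is actually decreasing over time.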

arangod.log

2021-01-05T16:23:40Z [31177] INFO [b4133] control-c received, beginning shut down sequence
2021-01-05T16:24:17Z [31177] WARNING [4512c] {queries} timed out after 900 s waiting for exclusive-lock on collection '_fishbowl'
2021-01-05T16:24:18Z [31177] WARNING [ebe22] Failed to update Foxx store from GitHub.
2021-01-05T16:24:46Z [31177] WARNING [8bcee] {queries} slow query: '    FOR u in @@value0     FOR o IN @@value1      FILTER o.partner.author.id != null      FILTER o.partner.author.id == u.email      FILTER o.author.id == null      LIMIT 1      RETURN {       email: o.partner.author.id,          author_id: u._key      }   ', bind vars: {"@value0":"users","@value1":"observations"}, database: mydbv2, user: mydb, took: 19.607377 s
2021-01-05T16:35:31Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 5 feature leases to be released
2021-01-05T16:35:31Z [31177] WARNING [3ad54] {engines} slow background settings sync: 1591.804362 s
2021-01-05T16:35:31Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 5 feature leases to be released

2021-01-05T16:37:55Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 4 feature leases to be released
2021-01-05T16:37:55Z [31177] INFO [5b708] {engines} last RocksDB WAL sync happened longer ago than configured threshold. last sync happened 144883 ms ago, threshold value: 5000 ms
2021-01-05T16:37:56Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 4 feature leases to be released

2021-01-05T16:49:55Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 3 feature leases to be released
2021-01-05T16:49:55Z [31177] ERROR [24213] Service "/_admin/aardvark" encountered error 400 while handling GET http://myserver:8529/_db/mydbv2/_admin/aardvark/foxxes/fishbowl
2021-01-05T16:49:55Z [31177] ERROR [24213] via ArangoError: shutdown in progress
2021-01-05T16:49:55Z [31177] ERROR [24213]     at ArangoStatement.execute (/usr/share/arangodb3/js/server/modules/@arangodb/arango-statement.js:84:16)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at ArangoDatabase._query (/usr/share/arangodb3/js/server/modules/@arangodb/arango-database.js:102:45)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at SimpleQueryAll.execute (/usr/share/arangodb3/js/server/modules/@arangodb/simple-query.js:95:42)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at SimpleQueryAll.SimpleQuery.hasNext (/usr/share/arangodb3/js/common/modules/@arangodb/simple-query-common.js:219:8)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at Object.availableJson (/usr/share/arangodb3/js/common/modules/@arangodb/foxx/store.js:266:19)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at Route._handler (/usr/share/arangodb3/js/apps/system/_admin/aardvark/APP/foxxes.js:376:20)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at next (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/router/tree.js:419:15)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at /usr/share/arangodb3/js/node/node_modules/lodash/lodash.js:10050:25
2021-01-05T16:49:55Z [31177] ERROR [24213]     at Middleware.<anonymous> (/usr/share/arangodb3/js/apps/system/_admin/aardvark/APP/foxxes.js:54:3)
2021-01-05T16:49:55Z [31177] ERROR [24213]     at next (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/router/tree.js:421:15)
2021-01-05T16:49:55Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 3 feature leases to be released

2021-01-05T17:23:40Z [31177] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 3 feature leases to be released
2021-01-05T17:23:50Z [88651] INFO [e52b0] ArangoDB 3.7.3 [linux] 64bit, using jemalloc, build tags/v3.7.3-0-g8ad8edb71f, VPack 0.1.33, RocksDB 6.8.0, ICU 64.2, V8 7.9.317, OpenSSL 1.1.1g  21 Apr 2020
2021-01-05T17:23:50Z [88651] INFO [75ddc] detected operating system: Linux version 4.15.0-112-generic (buildd@lcy01-amd64-027) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
2021-01-05T17:23:50Z [88651] INFO [25362] {memory} Available physical memory: 270010130432 bytes, available cores: 96
2021-01-05T17:23:50Z [88651] WARNING [118b0] {memory} maximum number of memory mappings per process is 65530, which seems too low. it is recommended to set it to at least 6144000
2021-01-05T17:23:50Z [88651] WARNING [49528] {memory} execute 'sudo sysctl -w "vm.max_map_count=6144000"'
2021-01-05T17:23:50Z [88651] WARNING [3e451] {memory} It is recommended to set NUMA to interleaved.
2021-01-05T17:23:50Z [88651] WARNING [b25a4] {memory} put 'numactl --interleave=all' in front of your command
2021-01-05T17:23:50Z [88651] INFO [43396] {authentication} Jwt secret not specified, generating...
2021-01-05T17:23:50Z [88651] INFO [144fe] using storage engine 'rocksdb'
2021-01-05T17:23:50Z [88651] INFO [3bb7d] {cluster} Starting up with role SINGLE
2021-01-05T17:23:50Z [88651] INFO [6ea38] using endpoint 'http+tcp://0.0.0.0:8529' for non-encrypted requests
2021-01-05T17:23:50Z [88651] INFO [a1c60] {syscall} file-descriptors (nofiles) hard limit is 131072, soft limit is 131072
2021-01-05T17:23:50Z [88651] INFO [3844e] {authentication} Authentication is turned on (system only), authentication for unix sockets is turned on
2021-01-05T17:23:50Z [88651] WARNING [b387d] found existing lockfile '/var/lib/arangodb3/LOCK' of previous process with pid 31177, but that process seems to be dead already
2021-01-05T17:24:51Z [88651] INFO [cf3f4] ArangoDB (version 3.7.3 [linux]) is ready for business. Have fun!

@brunoaduarte commented Nov 9, 2021

This problem still persists on the latest 3.8.2.

2021-11-09T18:17:57Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'PrimaryIndex' to let background writes catch up
2021-11-09T18:18:02Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'PrimaryIndex'
2021-11-09T18:18:14Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'VPackIndex' to let background writes catch up
2021-11-09T18:18:23Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:20:09Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'VPackIndex'
2021-11-09T18:20:10Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:24:48Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'VPackIndex' to let background writes catch up
2021-11-09T18:24:50Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:25:31Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'VPackIndex'
2021-11-09T18:28:33Z [7615] WARNING [9123d] {engines} rocksdb has stopped incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:28:54Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:28:54Z [7615] WARNING [3ad54] {engines} slow background settings sync: 23.310444 s
2021-11-09T18:37:01Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'VPackIndex' to let background writes catch up
2021-11-09T18:37:06Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:37:35Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:37:39Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'VPackIndex'
2021-11-09T18:42:14Z [7615] WARNING [9123c] {engines} rocksdb is slowing incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:46:09Z [7615] WARNING [9123d] {engines} rocksdb has stopped incoming writes to column family 'default' to let background writes catch up
2021-11-09T18:47:03Z [7615] INFO [9123e] {engines} rocksdb is resuming normal writes for column family 'default'
2021-11-09T18:47:03Z [7615] WARNING [3ad54] {engines} slow background settings sync: 55.883286 s
2021-11-09T18:47:14Z [7615] INFO [5b708] {engines} last RocksDB WAL sync happened longer ago than configured threshold. last sync happened 5189 ms ago, threshold value: 5000 ms
2021-11-09T18:47:21Z [7615] INFO [5b708] {engines} last RocksDB WAL sync happened longer ago than configured threshold. last sync happened 6707 ms ago, threshold value: 5000 ms

@dothebart (Contributor)

Can you run the server with

--log.level=startup=trace

or the config

[log]
level=startup=trace

so we can better see what the shutdown is being held up by?
You probably need to run the server like this for a while; the trace level should only add some more lines during startup/shutdown.

@matcho (Author) commented Nov 30, 2021

On our side, the last time we shut down v3.8.0 to upgrade it to v3.8.3 (a few minutes ago), the process was quite quick although the server was under heavy load: less than 1 minute.

2021-11-30T11:30:33Z [3642228] INFO [b4133] {general} SIGTERM received (sender pid 1), beginning shut down sequence
2021-11-30T11:31:08Z [3642228] INFO [4bcb9] {general} ArangoDB has been shut down

One important note, though: we replaced the RAID 10 of HDDs with a RAID 10 of NVMe SSDs − maybe there's a connection (compactions running in the background? transactions completing properly?…)

@matcho (Author) commented Aug 23, 2022

Seems fixed for a long time now.

matcho closed this as completed Aug 23, 2022
dothebart added the '2 Solved' resolution label Aug 24, 2022
@dothebart (Contributor)

Thanks for following up.

@pavelnemirovsky

The issue is not fixed. See below: the last node restart in our cluster (version 3.11).

root@arrango-db-03:~# service arangodb3-database status
● arangodb3-database.service - ArangoDB DB Server (Role=Database)
   Loaded: loaded (/lib/systemd/system/arangodb3-database.service; disabled; vendor preset: enabled)
   Active: deactivating (stop-sigterm) since Wed 2023-12-20 18:01:47 UTC; 6min ago
 Main PID: 23525 (arangod)
    Tasks: 51 (limit: 131072)
   CGroup: /system.slice/arangodb3-database.service
           └─23525 /usr/sbin/arangod --configuration /etc/arangodb3/arangod-database.conf --uid arangodb --gid arangodb --pid-file /var/run/arangodb3-database/arangod.pid --temp.path /var/tmp/arangodb3-database --log.foreground-tty true

Dec 20 18:08:15 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:15Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:15 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:15Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:16 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:16Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1
Dec 20 18:08:17 arrango-db-03.dmetrics.internal arangod[23525]: 2023-12-20T18:08:17Z [23525] INFO [43bf8] {aql} number of remaining queries in query registry at shutdown: 1

@matcho (Author) commented Jul 23, 2024

Indeed, it happens again with version 3.11.8.

systemctl restart arangodb3 sends SIGTERM as intended; arangod receives it but keeps saying that "1 feature lease has to be released". After 10 min, sending another SIGTERM through the kill command successfully terminates the server, which restarts in less than 1 min.

Important note: our setup is "leader-follower", which is no longer officially supported starting with version 3.11. The issue might be related to a replication being set up (the machine here is the follower). If this is the case, then we cannot complain, but this comment might still be useful to others.

2024-07-23T07:17:06Z [2196394] INFO [b4133] {general} SIGTERM received (sender pid 1), beginning shut down sequence
2024-07-23T07:18:58Z [2196394] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 1 feature leases to be released
…
2024-07-23T07:29:39Z [2196394] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 1 feature leases to be released
…
2024-07-23T07:29:44Z [2196394] FATAL [11ca3] {general} SIGTERM received during shutdown sequence (sender pid 1240768), terminating!
2024-07-23T07:30:08Z [1241685] INFO [e52b0] {general} ArangoDB 3.11.8 enterprise [linux] 64bit, using jemalloc, build refs/tags/v3.11.8 eb715d099fb, VPack 0.2.1, RocksDB 7.2.0, ICU 64.2, V8 7.9.317, OpenSSL 3.0.13 30 Jan 2024
2024-07-23T07:30:08Z [1241685] INFO [75ddc] {general} detected operating system: Linux version 5.15.0-100-generic (buildd@lcy02-amd64-116) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024, parent process: 1 (systemd)
…
2024-07-23T07:30:34Z [1241685] INFO [cf3f4] {general} ArangoDB (version 3.11.8 enterprise [linux]) is ready for business. Have fun!

Greetings,
Mathias

@jsteemann (Contributor) commented Jul 23, 2024

@matcho: The important log line here is

2024-07-23T07:29:39Z [2196394] INFO [63d54] {queries} AQLFeature shutdown, waiting for 0 registered queries to terminate and for 1 feature leases to be released

It means that 1 AQL query is still ongoing on the machine that does not react to the shutdown signal. Is it possible that there is some long-running AQL query ongoing while the shutdown command is received?
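As an editorial aside, such long-running queries can be inspected before initiating a shutdown: ArangoDB exposes currently running AQL queries via GET /_api/query/current (each entry carries an id, the query text, and a runTime in seconds), and an individual query can be killed via DELETE /_api/query/{query-id}. A minimal sketch of filtering such a listing; the payload below is a hand-made sample in that shape, not real server output:

```python
# Sketch: find queries whose runtime makes them candidates for blocking a
# shutdown. In a real script, `queries` would come from the JSON response of
# GET /_api/query/current; here it is a hand-made sample in the same shape.

def long_running(queries, threshold_secs=300.0):
    """Return ids of queries running longer than threshold_secs."""
    return [q["id"] for q in queries if q["runTime"] > threshold_secs]

sample = [
    {"id": "1234", "query": "FOR o IN observations RETURN o", "runTime": 3912.4},
    {"id": "1235", "query": "RETURN 1", "runTime": 0.002},
]
print(long_running(sample))  # → ['1234']
```

Each id returned could then be passed to DELETE /_api/query/{query-id} (or to require("@arangodb/aql/queries").kill() in arangosh) before stopping the server.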

@matcho (Author) commented Jul 23, 2024

Thanks @jsteemann, yes it's definitely possible; I will have to check that next time. It helps a lot to know that "feature leases to be released" can be translated to "query still running" 🙂

Could it mean then that a long query has trouble terminating, i.e. being killed? Or that the shutdown procedure does not kill queries unless you send a second SIGTERM during the shutdown phase?

@jsteemann (Contributor)

@matcho: I need to check if the shutdown procedure is trying to shut down running queries or not.
For that it would help me to know how you attempt to shut down the server. Is it by sending a SIGTERM signal, or by any other means?

@matcho (Author) commented Jul 23, 2024

@jsteemann

I attempt to shut down (actually restart) the server by typing sudo systemctl restart arangodb3 in a shell on Ubuntu 22.04, using the default systemd scripts installed along with the official .deb package of ArangoDB.

It seems that this effectively sends a SIGTERM, since arangod logs INFO [b4133] {general} SIGTERM received (sender pid 1), beginning shut down sequence.

@jsteemann (Contributor)

@matcho : alright, thanks, I will try to figure out more.

@jsteemann (Contributor)

@matcho: I think it is possible that some queries still run to completion after a shutdown signal was received. Not all queries seem to be aborted when a shutdown signal is received, so the server may just wait for all already-started queries to complete. This can take an arbitrary amount of time, depending on what the queries actually do.
The 2 above PRs should change the situation so that running queries are intentionally killed upon shutdown.
In my experiments that helped expedite the shutdown from "indefinitely long" to just under a second.

@matcho (Author) commented Jul 25, 2024

Thank you very much @jsteemann, that's great news! 🎉
