feat(core/dns): new dns client library #12305

chobits · 2024-01-08T09:34:54Z

Summary

This feature will only be considered for default activation in version 4.0 at the earliest. For now, it is just an alternative to the DNS library, so we are focusing on making it complete and simple. Compatibility with the old DNS library is not being considered at this time.

Introduced global caching for DNS records across workers, significantly reducing the query load on DNS servers.
Introduced observable statistics for the new DNS client, and a new Admin API /status/dns to retrieve them.
Deprecated the dns_no_sync option. Multiple DNS queries for the same name will always be synchronized (even across workers). This remains functional with the legacy DNS client library.
Deprecated the dns_not_found_ttl option. It uses the dns_error_ttl option for all error responses. This option remains functional with the legacy DNS client library.
Deprecated the dns_order option. By default, SRV, A, and AAAA are supported. Only names in the SRV format (_service._proto.name) enable resolving of DNS SRV records.

Checklist

The Pull Request has tests
A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix KAG-3220

For reviewers

README.md (include APIs and logic of _M:resolve())
current error report from the upper caller balancer subsystem:

2024/02/28 16:35:23 [error] 77335#0: *1336 [kong] init.lua:405 [aws-lambda] Lambda:invoke() failed to connect to 'https://lambda.ab-cdef-1.amazonaws.com:10001': 
[cosocket] DNS resolution failed: dns server error: 3 name error. 
Tried: [["lambda.ab-cdef-1.amazonaws.com:A","dns server error: 3 name error"]], 
client: 127.0.0.1, server: kong, request: "GET /get?key1=some_value1 HTTP/1.1", host: "lambda15.test", request_id: "3333b77722d0bc23a44fad31953ccf0b"

if query DNS server failed, also added "Query Time: ms" into error log, like

DNS server error: failed to receive reply from UDP server 198.51.100.0:53: timeout, Query Time: 136ms

Design doc: link

kong/resty/dns_client/init.lua

thibaultcha · 2024-01-26T17:39:09Z

Overall I cannot stress hard enough that mlcache users in an application as complex as Kong should only use :get(), which is an extremely, extremely powerful, clean, and performant interface that should (if properly used) cover the great majority of caching use-cases and guarantee fantastic performance, flexibility, and stability. Most of the mlcache problems that I have seen in the Gateway (in the past or recently) and other large applications that use it at other companies are actually due to improper usage of the library. Only using :get() (or :get_bulk()) should be enough, and should guarantee very few errors, and if there are errors they can easily be mitigated with fallbacks and/or retries in the code itself. I strongly feel like this should be heavily taught to mlcache users especially Kong Gateway developers who often misunderstand how mlcache works and how it should be used.

I am afraid that by misusing mlcache, developers will slowly lose trust in the library (at their own fault) and thus try to rewrite a solution that will eventually not be as stable and powerful as mlcache really is if mlcache is used properly. This is a dangerous pattern that always had me concerned and this concern keeps increasing as I keep seeing mlcache being misused and incorrectly blamed. Perhaps the documentation is not clear enough that the write operations (set(), delete(), update()) should only be used when users really know what they are doing. An mlcache user who truly needs these write functions and who really knows what they are doing would not use these functions without also implementing opts.ipc and using a non-message-based IPC mechanism, which is the very minimum for using these write functions in a production OpenResty application. But again, with most use-cases only get() or :get_bulk() should be necessary, and they are written to be extremely efficient, 100% JIT-compatible, very flexible, and most of all extremely resilient to spurious errors. Only using :get() avoids the need for any IPC, which is the prominent source of errors encountered by users and developers, and the original primary goal behind mlcache development. Since the errors originate from the "mlcache" namespace, users tend to blame mlcache when in fact the library is actually improperly used and mlcache is not at fault. The logic of mlcache is extremely sound and has no known-bug (besides minor ones such as the l1_serializer recently reported), most of these errors are due to improperly configured timeouts, or improper use and/or over-confidence in the default bundled IPC mechanism (because it is messaged-based and not suitable to production applications).

If I were still working in the Gateway, I would absolutely forbid the use of mlcache write functions, perhaps even going as far as removing them from the module when it is being loaded. Either that, or I would make sure to provide an extremely resilient IPC mechanism that can be attached to all mlcache instances instead of the default messaged-based shm mechanism, as documented in the mlcache docs.

Co-authored-by: Chrono <chrono_cpp@me.com>

Co-authored-by: Qi <add_sp@outlook.com>

Tieske · 2024-07-11T12:00:24Z

Introduced global caching for DNS records across workers, significantly reducing the query load on DNS servers.

This can be switched off, by reverting to the old client right?

Introduced observable statistics for the new DNS client, and a new Admin API /status/dns to retrieve them.

Nice

Deprecated the dns_no_sync option. Multiple DNS queries for the same name will always be synchronized (even across workers). This remains functional with the legacy DNS client library.

This was added as an explicit request from @subnetmarco in a very early version. But I haven't seen it used. Other than during some debugging sessions. So to me this seems ok to remove.

Deprecated the dns_not_found_ttl option. It uses the dns_error_ttl option for all error responses. This option remains functional with the legacy DNS client library.

I mentioned it before on confluence somewhere, but this is a bad idea imho. dns_error_ttl (when the server responds with some internal error) should be short (1-2 secs), whereas dns_not_found_ttl should be long (at least 30-60 seconds) because it requires user interaction to resolve.

This is mainly because the former is to be looked at by Kong admins, but the latter will mostly be generated by user errors. Scenario:

error ttl is 1 second
a user in a big enterprise somewhere pushes a bad hostname through a pipeline and it gets deployed
all Kong nodes in the cluster now start to throw errors in the logs every 1 second
nothing the Kong admin can do about it, he/she most likely doesn't even know who owns the bad config anyway.

Deprecated the dns_order option. By default, SRV, A, and AAAA are supported. Only names in the SRV format (_service._proto.name) enable resolving of DNS SRV records.

This seems like a valid change, but also a breaking change, yet I have no idea how much impact it is going to have. So also here, revert to the old client if it breaks right?

chobits · 2024-07-11T13:00:45Z

Hi @Tieske

...
This seems like a valid change, but also a breaking change, yet I have no idea how much impact it is going to have. So also here, revert to the old client if it breaks right?

In terms of compatibility, following discussions with Datong and Saju, this one is aimed at being simple and usable, not needing to be fully compatible with the original one. It will only be switched to when users encounter issues with the original DNS client now, and it won't be enabled by default until at least version 4.0.

To achieve the goal, it directly mirrors the logic of the nginx/openresty/Golang DNS resolver, only keeping a few necessary features, such as stale ttl, error ttl and resolv.conf/hosts parsing.

When issues arise, we can easily explain to customers that the implementation of other industry software is the same way. For known or potential issues, we don't plan on making further fixes. Instead, we'll let users who need those features/bugfixes switch directly to the new one.

dndx

Overall, the implementation is clear, logical and easy to follow. Good job @chobits .

There are some questions and suggestions however, but nothing too concerning from my eyes.

kong/dns/README.md

kong/dns/client.lua

kong/dns/utils.lua

kong/templates/nginx_kong.lua

chronolaw · 2024-07-16T07:35:16Z

kong/dns/README.md

+    * Record: ~80 bytes json string, e.g., `{address = "127.0.0.1", name = <domain>, ttl = 3600, class = 1, type = 1}`.
+  * Domain: ~36 bytes string, e.g., `example<n>.long.long.long.long.test`. Domain names with lengths between 10 and 36 bytes yield similar results.
+
+The results of ) are as follows:


What is )? A mistake?

Try to fix it in #13389

team-gateway-bot · 2024-07-16T21:24:29Z

Cherry-pick failed for master, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git remote add upstream https://github.com/kong/kong-ee
git fetch upstream master
git worktree add -d .worktree/cherry-pick-12305-to-master-to-upstream upstream/master
cd .worktree/cherry-pick-12305-to-master-to-upstream
git checkout -b cherry-pick-12305-to-master-to-upstream
ancref=$(git merge-base bc27ffd414fae5c18a84dbf3c53ee4bba2b4cc8a 1f0bc17dc388a8285cb6e4993e2d6ce0533dfe6f)
git cherry-pick -x $ancref..1f0bc17dc388a8285cb6e4993e2d6ce0533dfe6f

Try to improve #12305 KAG-5001

pull-request-size bot added the size/L label Jan 8, 2024

chobits marked this pull request as draft January 8, 2024 09:35

github-actions bot assigned chobits Jan 8, 2024

pull-request-size bot added size/XL and removed size/L labels Jan 11, 2024

chobits added the skip-changelog label Jan 11, 2024

chobits force-pushed the refactor/dns_client branch from 8943a03 to 6ef1040 Compare January 12, 2024 08:41

pull-request-size bot added size/XXL and removed size/XL labels Jan 17, 2024

chobits force-pushed the refactor/dns_client branch from 1e9d43b to 5505c6b Compare January 24, 2024 06:55