PYTHON-1419 Connection failure to SNI endpoint when first host is unavailable by lukasz-antoniak · Pull Request #1243 · datastax/python-driver · GitHub


PYTHON-1419 Connection failure to SNI endpoint when first host is unavailable #1243


Open
wants to merge 5 commits into base: master

Conversation

@lukasz-antoniak (Contributor) commented May 19, 2025

Fixes PYTHON-1419.


@absurdfarce (Collaborator) commented:

So the fix looks exactly like what I was expecting. I'm wondering if we can't come up with a more robust test to actually demonstrate the behaviour in question and confirm that this fix fixes it. Yes, we already know it does... but can we write a test to prove it?

I readily agree this is tricky; it might mean pulling out some of the relevant logic from cluster initialization to confirm that things are behaving as expected. It would certainly mean mocking DNS results. But a more complete test is probably worth it at the end of the day.
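A rough sketch of the kind of unit test this suggests, with DNS mocked at the socket level. This is illustrative only: the SniEndPointFactory constructor arguments, the test name, and the assumption that resolve() returns an (address, port) pair are mine, not taken from the driver or this PR.

import socket
from unittest.mock import patch

from cassandra.connection import SniEndPointFactory

# Two getaddrinfo-style records for the proxy hostname (values are made up).
MOCK_ADDRINFO = [
    (socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, '', ('100.101.102.103', 9042)),
    (socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, '', ('127.0.0.1', 9042)),
]

@patch('socket.getaddrinfo', return_value=MOCK_ADDRINFO)
def test_sni_endpoints_spread_across_proxy_ips(mock_getaddrinfo):
    factory = SniEndPointFactory("proxy.datastax.com", 9042)  # assumed constructor signature
    first = factory.create_from_sni("node-1")
    second = factory.create_from_sni("node-2")

    # With a per-endpoint starting index, two endpoints created from the same
    # factory should not both resolve to the same proxy IP on first use.
    first_ip, _ = first.resolve()
    second_ip, _ = second.resolve()
    assert first_ip != second_ip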

@absurdfarce (Collaborator) left a comment:


So after looking this over pretty extensively for a couple of days and testing a number of things locally, I do have a few concerns. Mostly they're around the test infrastructure; I really think a simple unit test would be just as effective and would avoid the extra machinery required here. But I'm also somewhat concerned about the actual ramifications of making this change; I'm not sure I fully understand yet how the corresponding failures to build a connection pool to the down ingress point will affect the rest of the driver's functionality.

@@ -0,0 +1,94 @@
# Copyright DataStax, Inc.
Collaborator:


This test is actually non-deterministic; it will fail or succeed based on the order in which discovered nodes are added to the LBP. LBPs in the Python driver listen for node up and node down events; that's how they're populated with host information. Newly discovered hosts are added after any existing hosts, and since the delivery of these node up/down events is non-deterministic, the ordering of nodes in an LBP is also non-deterministic.

There's an additional complication: the control connection. It also needs to call resolve() at creation time, so that in turn throws the count off somewhat. If the control connection happens to get the end point that is configured to resolve to 127.0.0.1 first, then the control connection will be established, but subsequent connections to both nodes will try to use the mock IP address and you'll get a test failure like the following:

DEBUG    cassandra.pool:pool.py:206 Host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:0 is now marked up
DEBUG    cassandra.pool:pool.py:206 Host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:1 is now marked up
DEBUG    cassandra.cluster:cluster.py:3596 [control connection] Opening new connection to proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:1
INFO     cassandra.connection:connection.py:278 Resolved address: 127.0.0.1
…
DEBUG    cassandra.cluster:cluster.py:1749 Control connection created
DEBUG    cassandra.pool:pool.py:405 Initializing connection for host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:0
DEBUG    cassandra.pool:pool.py:405 Initializing connection for host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:2
INFO     cassandra.connection:connection.py:278 Resolved address: 100.101.102.103
INFO     cassandra.connection:connection.py:278 Resolved address: 100.101.102.103
WARNING  cassandra.cluster:cluster.py:3244 Failed to create connection pool for new host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:3:
…
WARNING  cassandra.cluster:cluster.py:3244 Failed to create connection pool for new host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:1:
…
WARNING  cassandra.cluster:cluster.py:2001 Host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:3 has been marked down
2025-05-23 21:48:59,794 WARNING [cluster:2001]: Host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:1 has been marked down

If the order is reversed and the end point that's configured to use the mock IP address comes first, the test passes. I don't clearly understand the order of events in that case, but I think the control connection calls resolve(), gets the mock address, tries again (which calls resolve() again) and is able to connect... but I still can't explain how that enables connections to both nodes. Regardless, it clearly does pass in that case.

I saw this when running the test locally; about half the time it would pass (mock address endpoint first in the LBP) and half the time it failed with the stack trace above (localhost endpoint first in the LBP). I'm not sure that can be easily fixed.

Collaborator:


Oh, I missed that nuance before. If localhost is used for the control connection it does connect and then tries to open connection pools for both nodes (as referenced above). Both of those connection pools call resolve() which will return the invalid IP address for both endpoints. That means both connection pools will fail, which means both nodes are marked as down. But the control connection is connected to one of those nodes so it is also torn down... and a new control connection has to be made.

I'll readily admit I don't fully understand the sequence that causes this to lead to a NoHostAvailable... but it's doing so reliably in my testing. I think there are a couple of iterations of failing to establish connection pools, leading to down nodes, leading to new control connections, until we get to a point where both end points return the phantom IP addresses and therefore we can't connect to anybody in the query plan. But that's still just speculation on my part.

@lukasz-antoniak (Contributor, Author):


> If the order is reversed and the end point that's configured to use the mock IP address comes first, the test passes.

Correct, that is why I chose to mock the IP '100.101.102.103', which sorts before '127.0.0.1'.

I wanted to have as many integration tests as possible, but true, maybe a unit test will be better in this case.
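(As a quick illustration of the ordering being relied on here, assuming it is the resolved address strings that get ordered:

>>> sorted(["127.0.0.1", "100.101.102.103"])
['100.101.102.103', '127.0.0.1']

so the mock address does come first.)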

ssl_context.load_verify_locations(CLIENT_CA_CERTS)
ssl_context.verify_mode = ssl.CERT_REQUIRED
ssl_context.load_cert_chain(certfile=DRIVER_CERTFILE, keyfile=DRIVER_KEYFILE)
config.ssl_context = ssl_context
Collaborator:


This test brings along a lot of machinery that it doesn't really need because it's implemented as an integration test. Because we're using the Astra SNI integration via the config, we have to worry about establishing SSL connections, which means (a) we have to configure the C* cluster to support SSL and (b) we have to set up the SSL context here. This goes some way toward obscuring the actual intent of the test. I kinda feel like a simpler test, maybe even a unit test, might be preferred... that way we can avoid all the SSL machinery.

Collaborator:


There's also the problem that, with all of this SSL configuration, the test won't pass unless the SSL update process being discussed in PYTHON-1372 (the ticket to fix test_ssl.py) is brought in. Without that fix (or, more accurately, the update to the SSL material it entails) the test fails due to expired certificates.


    def create_from_sni(self, sni):
-       return SniEndPoint(self._proxy_address, sni, self._port)
+       self._init_index += 1
+       return SniEndPoint(self._proxy_address, sni, self._port, self._init_index)
Collaborator:


It's worthwhile being clear on the consequence of this change. If our proxy hostname resolves to N IP addresses, we're basically exchanging a 1-in-N chance of complete failure when the first node is unavailable (because all our endpoints will return a common IP address) for an essentially guaranteed failure of the connection pool for one of our nodes (since, with the code in this PR, at least one of our endpoints will return the failing IP address from resolve()).

I'm not saying it's a problem that we're making this exchange; it probably is better to have a failure with connections to one of our nodes rather than to fail completely at connect time. But it is worth pointing out that this isn't a zero-cost abstraction.
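To make the trade-off concrete, here is a purely illustrative sketch (not driver API; the IPs, SNI names, and resolution order are made up) of how a per-endpoint starting index changes which proxy IP each endpoint resolves to first:

proxy_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical A records; assume one is down
node_snis = ["node-a", "node-b", "node-c"]

# Before this PR: every endpoint starts at the same offset, so every node's
# connection pool initially resolves to the same proxy IP. If that one IP is
# the unavailable one (a 1-in-N chance), nothing connects at all.
before = {sni: proxy_ips[0] for sni in node_snis}

# With this PR: the factory hands each endpoint an incrementing starting index,
# so the endpoints are spread across the proxy IPs. At most one pool starts on
# the down IP and fails, while the others connect.
after = {sni: proxy_ips[i % len(proxy_ips)] for i, sni in enumerate(node_snis)}

print(before)  # {'node-a': '10.0.0.1', 'node-b': '10.0.0.1', 'node-c': '10.0.0.1'}
print(after)   # {'node-a': '10.0.0.1', 'node-b': '10.0.0.2', 'node-c': '10.0.0.3'}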
