PYTHON-1419 Connection failure to SNI endpoint when first host is unavailable by lukasz-antoniak · Pull Request #1243 · datastax/python-driver · GitHub


PYTHON-1419 Connection failure to SNI endpoint when first host is unavailable #1243


Open
wants to merge 5 commits into base: master

Conversation

@lukasz-antoniak (Contributor) commented May 19, 2025

Fixes PYTHON-1419.


@absurdfarce (Collaborator) commented:

So the fix looks exactly like what I was expecting. I'm wondering if we can't come up with a more robust test to actually demonstrate the behaviour in question and confirm that this fix fixes it. Yes, we already know it does... but can we write a test to prove it?

I readily agree this is tricky; it might mean pulling out some of the relevant logic from cluster initialization to confirm that things are behaving as expected. It would certainly mean mocking DNS results. But a more complete test is probably worth it at the end of the day.
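A rough sketch of the kind of unit test this suggests, with DNS mocked at the socket level. This is illustrative only: the SniEndPointFactory constructor arguments, the test name, and the assumption that resolve() returns an (address, port) pair are mine, not taken from the driver or this PR.

import socket
from unittest.mock import patch

from cassandra.connection import SniEndPointFactory

# Two getaddrinfo-style records for the proxy hostname (values are made up).
MOCK_ADDRINFO = [
    (socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, '', ('100.101.102.103', 9042)),
    (socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, '', ('127.0.0.1', 9042)),
]

@patch('socket.getaddrinfo', return_value=MOCK_ADDRINFO)
def test_sni_endpoints_spread_across_proxy_ips(mock_getaddrinfo):
    factory = SniEndPointFactory("proxy.datastax.com", 9042)  # assumed constructor signature
    first = factory.create_from_sni("node-1")
    second = factory.create_from_sni("node-2")

    # With a per-endpoint starting index, two endpoints created from the same
    # factory should not both resolve to the same proxy IP on first use.
    first_ip, _ = first.resolve()
    second_ip, _ = second.resolve()
    assert first_ip != second_ip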

@absurdfarce (Collaborator) left a comment:


So after looking this over pretty extensively for a couple of days and testing a number of things locally, I do have a few concerns. Mostly they're around the test infrastructure; I really think a simple unit test would be just as effective and would avoid the extra machinery required here. But I'm also somewhat concerned about the actual ramifications of making this change; I'm not sure I fully understand yet how the corresponding failures to build a connection pool to the down ingress point will affect the rest of the driver's functionality.

@@ -0,0 +1,94 @@
# Copyright DataStax, Inc.
Collaborator:


This test is actually non-deterministic; it will fail or succeed based on the order in which discovered nodes are added to the LBP. LBPs in the Python driver listen for node up and node down events; that's how they're populated with host information. Newly discovered hosts are added after any existing hosts, and since the delivery of these node up/down events is non-deterministic, the ordering of nodes in an LBP is also non-deterministic.

There's an additional complication: the control connection. It also needs to call resolve() at creation time, so that in turn throws the count off somewhat. If the control connection happens to get the end point that is configured to resolve to 127.0.0.1 first, then the control connection will be established, but subsequent connections to both nodes will try to use the mock IP address and you'll get a test failure like the following:

DEBUG    cassandra.pool:pool.py:206 Host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:0 is now marked up
DEBUG    cassandra.pool:pool.py:206 Host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:1 is now marked up
DEBUG    cassandra.cluster:cluster.py:3596 [control connection] Opening new connection to proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:1
INFO     cassandra.connection:connection.py:278 Resolved address: 127.0.0.1
…
DEBUG    cassandra.cluster:cluster.py:1749 Control connection created
DEBUG    cassandra.pool:pool.py:405 Initializing connection for host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:0
DEBUG    cassandra.pool:pool.py:405 Initializing connection for host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:2
INFO     cassandra.connection:connection.py:278 Resolved address: 100.101.102.103
INFO     cassandra.connection:connection.py:278 Resolved address: 100.101.102.103
WARNING  cassandra.cluster:cluster.py:3244 Failed to create connection pool for new host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:3:
…
WARNING  cassandra.cluster:cluster.py:3244 Failed to create connection pool for new host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:1:
…
WARNING  cassandra.cluster:cluster.py:2001 Host proxy.datastax.com:9042:2e25021d-8d72-41a7-a247-3da85c5d92d2:3 has been marked down
2025-05-23 21:48:59,794 WARNING [cluster:2001]: Host proxy.datastax.com:9042:8c4b6ed7-f505-4226-b7a4-41f322520c1f:1 has been marked down

If the order is reversed and the end point that's configured to use the mock IP address comes first, the test passes. I don't clearly understand the order of events in that case, but I think the control connection calls resolve(), gets the mock address, tries again (which calls resolve() again) and is able to connect... but I still can't explain how that enables connections to both nodes. Regardless, it clearly does pass in that case.

I saw this when running the test locally; about half the time it would pass (mock address endpoint first in the LBP) and half the time it failed with the stack trace above (localhost endpoint first in the LBP). I'm not sure that can be easily fixed.

Collaborator:


Oh, I missed that nuance before. If localhost is used for the control connection it does connect and then tries to open connection pools for both nodes (as referenced above). Both of those connection pools call resolve() which will return the invalid IP address for both endpoints. That means both connection pools will fail, which means both nodes are marked as down. But the control connection is connected to one of those nodes so it is also torn down... and a new control connection has to be made.

I'll readily admit I don't fully understand the sequence that causes this to lead to a NoHostAvailable... but it's doing so reliably in my testing. I think there are a couple of iterations of failing to establish connection pools, leading to down nodes, leading to new control connections, until we get to a point where both end points return the phantom IP addresses and therefore we can't connect to anybody in the query plan. But that's still just speculation on my part.

@lukasz-antoniak (Contributor, Author):


> If the order is reversed and the end point that's configured to use the mock IP address comes first, the test passes.

Correct, that is why I chose to mock the IP '100.101.102.103', which sorts before '127.0.0.1'.

I wanted to have as many integration tests as possible, but true, maybe a unit test will be better in this case.
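(As a quick illustration of the ordering being relied on here, assuming it is the resolved address strings that get ordered:

>>> sorted(["127.0.0.1", "100.101.102.103"])
['100.101.102.103', '127.0.0.1']

so the mock address does come first.)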

ssl_context.load_verify_locations(CLIENT_CA_CERTS)
ssl_context.verify_mode = ssl.CERT_REQUIRED
ssl_context.load_cert_chain(certfile=DRIVER_CERTFILE, keyfile=DRIVER_KEYFILE)
config.ssl_context = ssl_context
Collaborator:


This test brings along a lot of machinery that it doesn't really need because it's implemented as an integration test. Because we're using the Astra SNI integration via the config, we have to worry about establishing SSL connections, which means (a) we have to configure the C* cluster to support SSL and (b) we have to set up the SSL context here. This goes some way toward obscuring the actual intent of the test. I kinda feel like a simpler test, maybe even a unit test, might be preferred... that way we can avoid all the SSL machinery.

Collaborator:


There's also the problem that, with all of this SSL configuration, the test won't pass unless the SSL update process being discussed in PYTHON-1372 (the ticket to fix test_ssl.py) is brought in. Without that fix (or, more accurately, the update to the SSL material it entails) the test fails due to expired certificates.


    def create_from_sni(self, sni):
-       return SniEndPoint(self._proxy_address, sni, self._port)
+       self._init_index += 1
+       return SniEndPoint(self._proxy_address, sni, self._port, self._init_index)
Collaborator:


It's worthwhile being clear on the consequence of this change. If our proxy hostname resolves to N IP addresses, we're basically exchanging a 1-in-N chance of complete failure when the first node is unavailable (because all our endpoints will return a common IP address) for an essentially guaranteed failure of the connection pool for one of our nodes (since, with the code in this PR, at least one of our endpoints will return the failing IP address from resolve()).

I'm not saying it's a problem that we're making this exchange; it probably is better to have a failure with connections to one of our nodes rather than to fail completely at connect time. But it is worth pointing out that this isn't a zero-cost abstraction.
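To make the trade-off concrete, here is a purely illustrative sketch (not driver API; the IPs, SNI names, and resolution order are made up) of how a per-endpoint starting index changes which proxy IP each endpoint resolves to first:

proxy_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical A records; assume one is down
node_snis = ["node-a", "node-b", "node-c"]

# Before this PR: every endpoint starts at the same offset, so every node's
# connection pool initially resolves to the same proxy IP. If that one IP is
# the unavailable one (a 1-in-N chance), nothing connects at all.
before = {sni: proxy_ips[0] for sni in node_snis}

# With this PR: the factory hands each endpoint an incrementing starting index,
# so the endpoints are spread across the proxy IPs. At most one pool starts on
# the down IP and fails, while the others connect.
after = {sni: proxy_ips[i % len(proxy_ips)] for i, sni in enumerate(node_snis)}

print(before)  # {'node-a': '10.0.0.1', 'node-b': '10.0.0.1', 'node-c': '10.0.0.1'}
print(after)   # {'node-a': '10.0.0.1', 'node-b': '10.0.0.2', 'node-c': '10.0.0.3'}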
