8000 gh-116810: fix memory leak in ssl module by jeffvanvoorst · Pull Request #123249 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

gh-116810: fix memory leak in ssl module #123249

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Sep 30, 2024
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Resolved SSL session memory leak in :mod:`ssl`.
4 changes: 0 additions & 4 deletions Modules/_ssl.c
Original file line number Diff line number Diff line change
Expand Up @@ -2853,10 +2853,6 @@ PySSL_get_session(PySSLSocket *self, void *closure) {
if ((session = _ssl_session_dup(session)) == NULL) {
return NULL;
}
session = SSL_get1_session(self->ssl);
if (session == NULL) {
Py_RETURN_NONE;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Judging by no test failures, this probably does not have any coverage. Any chance of including the tests in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check the tests this week. I doubt I will be able to finish that work today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this change still results in returning a valid copy of the ssl session, I doubt tests would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreeing there. I think it would be difficult to test this. With that being said, are there any existing tests for checking that the session attribute is valid? If not, then this PR should add them (otherwise, I think this looks good).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some life issues come up. I will try to get back to this in the next few weeks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a cacophony of mistakes here on the CPython side. Let me try to correct things:

I don't really understand how this code was nor always a memory leak in the past.

The memory leak seems to have been introduced by 39258d3. _ssl_session_dup returns an SSL_SESSION that needs to be freed by someone and then you all just immediately clobber the variable with something else. But test_ssl.py clearly accesses session. Does CPython have no way to catch memory leaks in its tests? Might I suggest you all look into LeakSanitizer?

Anyway, the problem with that commit is that you all turned #if ... #else ... #endif into including both branches instead of just one.

That commit not only introduced a memory leak, but it also mostly broke your workaround. I haven't tried it, but I suspect you would see this if you uncleanly destroyed the other connection before calling set_session instead of after. Now, I see uncleanly because it seems CPython never followed up on openssl/openssl#1550 (comment) because you all have been misunderstanding the nature of the workaround. It was not working around an OpenSSL bug, but a CPython bug / broken expectation.

Based on my understanding of https://docs.openssl.org/master/man3/SSL_get_session/ I believe this will work, but it does raise the question of why we have the SSL_get0_session + _ssl_session_dup combo in the first place; with modern OpenSSL 3+ why would this SSL_get1_session not just do what we wanted?

That is not what is going on here. There were, as far as I know, no changes in OpenSSL 3+ relating to this.

SSL_get0_session + _ssl_session_dup and SSL_get1_session do not and have never same thing in any version of OpenSSL. SSL_get0_session and SSL_get1_session are the exact same accessor. Just one of them increments the refcount for you and the other doesn't. _ssl_session_dup is a (misguided) CPython function to make a deep copy of the SSL_SESSION object.

Although the code was, prior to the regression, gated on a OPENSSL_VERSION_1_1 ifdef, that does not imply was specific to OpenSSL 1.1.x. OPENSSL_VERSION_1_1 is a CPython-invented macro (that squats OpenSSL's namespace) which means > 1.1. That is, OpenSSL 3 was also OPENSSL_VERSION_1_1:

#if (OPENSSL_VERSION_NUMBER >= 0x10100000L) && !defined(LIBRESSL_VERSION_NUMBER)
#  define OPENSSL_VERSION_1_1 1
#  define PY_OPENSSL_1_1_API 1
#endif

While I have not tested it (have you all?), based on openssl/openssl#1550 (comment), I highly suspect that CPython "needs" the workaround in OpenSSL 3.x too. I say "needs" because...

I think the issue @.davidben was saying Python has in there was describing more the way Python is interacting with the available APIs (which at the time had to span a wide range of OpenSSL versions) and not using some other features instead.

No, that's not what I was saying in openssl/openssl#1550 (comment). See the immediately preceding comment:

FYI, I poked at this briefly and don't see any evidence that SSL_shutdown is being called by Python. It seems the pseudocode in the bug report is wrong and this is a Python bug.

The bug report was very confused because the minimal reproducer that CPython provided did not actually reflect what CPython does. With all the meandering with the bad reproducer, I suspect it was very confusing. But the CPython folks on the bug never followed up, so my past self assumed you all understood, were planning on fixing it, and I didn't have time to check on you all. 😞

What is actually going on here is that OpenSSL will invalidate sessions on connection destruction if you haven't called SSL_shutdown first. To be honest, I don't think this is good behavior, but it has been OpenSSL's intentional behavior since day 1. I believe this is descended from this guidance:

close_notify
This message notifies the recipient that the sender will not send
any more messages on this connection. The session becomes
unresumable if any connection is terminated without proper
close_notify messages with level equal to warning.

https://www.rfc-editor.org/rfc/rfc2246.html#section-7.2.1

Now, I personally think this behavior is unhelpful. If a connection just got closed, that doesn't make the old key material broken or anything. Also if we were getting any security properties out of this, it wouldn't work anyway. There are loads of cases where you cannot reliably knock out the other session.

  • Stateless resumption ("session tickets" in TLS 1.2) cannot possibly invalidate the session on the server
  • Maybe your application serializes the session, e.g., on disk, and then deserializes it later

Interestingly, OpenSSL doesn't actually look for a full bidirectional shutdown to preserve your session. It only looks for whether you attempted to send close_notify. Kind of pointless, to be honest, but IMO this whole thing is pointless.

TLS 1.1, incidentally, already softened this recommendation:

close_notify
This message notifies the recipient that the sender will not send
any more messages on this connection. Note that as of TLS 1.1,
failure to properly close a connection no longer requires that a
session not be resumed. This is a change from TLS 1.0 to conform
with widespread implementation practice.

https://www.rfc-editor.org/rfc/rfc4346.html#section-7.2.1

But OpenSSL retains this requirement, for better or worse. This requirement was not new in 1.1.x, nor was it removed in 3.x. The only reason you all needed to work around it later on is that, due to other quirks of how CPython was using the API (arguably incorrectly), as detailed in my comment on the OpenSSL issue, CPython was sneaking through a gap in this OpenSSL policy. OpenSSL fixed that gap, and now CPython was affected.

Now, contrary to the claim in openssl/openssl#1550 (comment), CPython did not call SSL_shutdown. You all have a separate method to exchange close_notify and weren't calling it in that test.

So, what to do given this? I think you have two options:

  1. Agree with OpenSSL's policy and pass it up to the callers. Without calling SSLSocket.shutdown(), sessions get invalidated. Callers have to call the shutdown method if they care. Fix your tests to reflect this.
  2. Disagree with OpenSSL's policy and effectively turn it off. To do that, call SSL_shutdown when destroying an SSL. Now, that will cause OpenSSL to do I/O, so you might want to SSL_set_quiet_shutdown. Alternatively, there is an SSL_set_shutdown API to just fake a shutdown state.

In either case, you can remove the workaround because this workaround has been self-inflicted from day one.

If you pick option 2, please write some clear comments in the code so the next time this happens, I don't have to write another essay for you all. :-P

The OpenSSL 1.1.1 fork's I do think we should care somewhat about are BoringSSL and it's AWS-LC derivative. @WillChilds-Klein and @davidben can represent what those might need for these specific APIs.

So, per the above, this question is based on several incorrect premises:

  • BoringSSL is not an OpenSSL 1.1.1 fork. We actually diverged from OpenSSL 1.0.2. We just report a version of 1.1.1 because that's what's most convenient for compatibility.
  • This issue has nothing to do with OpenSSL 1.1.1. I expect you all still "need" this workaround in OpenSSL 3.x too.
  • This issue has nothing to do with these specific APIs

Since this "workaround" is really a workaround for CPython's own mishaps, if you all do the thing I suggest above, it should work just fine in BoringSSL.

But also this is all moot in BoringSSL because BoringSSL does not invalidate sessions on SSL_free in the first place. We removed it because it's just kinda silly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where OpenSSL documents their behavior, by the way:

A session will be automatically removed from the session cache and marked as non-resumable if the connection is not closed down cleanly, e.g. if a fatal error occurs on the connection or SSL_shutdown(3) is not called prior to SSL_free(3).

https://docs.openssl.org/master/man3/SSL_get_session/#notes:~:text=A%20session%20will%20be%20automatically%20removed%20from%20the%20session%20cache%20and%20marked%20as%20non%2Dresumable%20if%20the%20connection%20is%20not%20closed%20down%20cleanly%2C%20e.g.%20if%20a%20fatal%20error%20occurs%20on%20the%20connection%20or%20SSL_shutdown(3)%20is%20not%20called%20prior%20to%20SSL_free(3).

For completeness, there were two different bits of spec text that led to this OpenSSL behavior, and I only cited one of them. The bits I cited were about unclean shutdown and removed very early. Fatal error lasted all the way to TLS 1.2:

Error handling in the TLS Handshake protocol is very simple. When an
error is detected, the detecting party sends a message to the other
party. Upon transmission or receipt of a fatal alert message, both
parties immediately close the connection. Servers and clients MUST
forget any session-identifiers, keys, and secrets associated with a
failed connection. Thus, any connection terminated with a fatal
alert MUST NOT be resumed.

https://www.rfc-editor.org/rfc/rfc5246.html#section-7.2.2

It wasn't removed until TLS 1.3, in tlswg/tls13-spec@dc668dc, for the same reasons. In practice, you cannot actually do this consistently, for the reasons I listed above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

the problem with that commit is that you all turned #if ... #else ... #endif into including both branches instead of just one.

Hah, that explains the code smell here. I'd only been glancing at current version of the code in git blame view to try and understand what this was even for without following the individual commits (git blame doing its job and not exposing history of merely deleted lines)... I agree completely on getting rid of that _ssl_session_dup hack. (and more, but we need to examine the SOTW and pick an amenable way out of the past API sins in here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely makes sense to just start with something minimal (probably this PR) to fix the immediate leak. I think probably the thing to do is:

  1. This PR
  2. Remove _ssl_session_dup and confirm tests indeed now fail in OpenSSL 3.x. My expectation is that they'll still fail, but I've no idea if there were other changes that further complicated this mess.
  3. No-op the goofy OpenSSL behavior. Probably something like SSL_set_quiet_shutdown(ssl); SSL_shutdown(ssl); SSL_free(ssl); would work? Confirm the tests now pass again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling SSL_set_shutdown in the destructor sounds fine to me. Is there a reason to believe that the SSL_set_quiet_shutdown(ssl); SSL_shutdown(ssl); dance would be better?

Oh and thanks for the detailed explanation @davidben .

pysess = PyObject_GC_New(PySSLSession, self->ctx->state->PySSLSession_Type);
if (pysess == NULL) {
SSL_SESSION_free(session);
Expand Down
Loading
0