Socket leak #434
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗 |
This sounds similar to #388. Would you mind adding your comment on that issue and closing this one to avoid duplicates? Thanks! |
@manics that looks like a memory leak, while this is a socket leak. Should be different, no? |
We just saw a node with a lot of proxy pods spew a lot of |
The `ws` release is unlikely to be relevant, since it's only used in tests. |
I think the relevant code is in configurable-http-proxy/lib/configproxy.js, lines 233 to 245, at commit cb03f77. |
I think this remains a problem in 4.6.1 using node 20 on GKE with linux kernel 5.15, based on info from @shaneknapp. |
CHP has multiple node HTTP servers working in parallel: one for its own REST API, one for proxying, and one for metrics. It would be good to determine whether the growing tcp memory / sockets etc. are associated with a specific one of these. |
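As an illustration of how one might narrow that down (a hedged sketch, not existing CHP code; the server variable names are placeholders), Node's `server.getConnections()` can report the number of open connections per server:

```js
// Hypothetical sketch: periodically log how many connections each of CHP's
// HTTP servers (REST API, proxy, metrics) currently holds, to see which one
// grows over time. apiServer/proxyServer/metricsServer are placeholder names,
// not actual CHP internals.
function logConnectionCounts(servers) {
  for (const [name, server] of Object.entries(servers)) {
    server.getConnections((err, count) => {
      if (!err) {
        console.log(`${new Date().toISOString()} ${name}: ${count} open connections`);
      }
    });
  }
}

// e.g. once a minute:
// setInterval(() => logConnectionCounts({ api: apiServer, proxy: proxyServer, metrics: metricsServer }), 60_000);
```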
Looked at one CHP process and saw for example...
This wasn't expected to have anything close to 500 open connections or similar, so I think it's very safe to say that this reproduces. This is from the latest CHP running with node 20 on linux kernel 5.15 nodes. |
I'm not sure when I expect a socket to be closed. When it times out based on a "timeout" option? I think the timeout option may be infinite. Is the issue that there is simply nothing that makes us destroy sockets once created, because we default to an infinite timeout? |
@minrk and others, is it safe to change a default for a timeout value here to something quite extreme, like 24 hours? I don't want us to disrupt users that are semi-active and run into issues at the 24th hour - but they wouldn't, as long as they are semi-active, right? We have two timeout args matching the node-http-proxy options at https://github.com/http-party/node-http-proxy?tab=readme-ov-file#options.
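For reference, a minimal sketch of those two node-http-proxy options (presumably `timeout` and `proxyTimeout` from the README linked above); the 24-hour value is only the hypothetical default being discussed, not a recommendation:

```js
const httpProxy = require("http-proxy");

const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888", // placeholder target
  // timeout: inactivity timeout (ms) for the incoming client socket
  timeout: 24 * 60 * 60 * 1000,
  // proxyTimeout: timeout (ms) for the outgoing socket to the proxied target
  proxyTimeout: 24 * 60 * 60 * 1000,
});
```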
There is also a related issue reported in node-http-proxy: http-party/node-http-proxy#1510. node-http-proxy was forked and had that issue fixed with a one-line commit at Jimbly/http-proxy-node16@56283e3.

Looking closer at that fork, there is also another memory leak fixed in another commit, according to the commit message: Jimbly/http-proxy-node16@ba0c414. This is detailed in a PR as well: http-party/node-http-proxy#1559. Those two memory-fix commits are the only actual fixes in the fork; the rest is just docs etc.

Maybe we should do a build of CHP based on the forked node-http-proxy project and push a tag that users can opt into? Like `4.6.1-fork`? |
I pushed a `4.6.1-fork` image. If someone wants to try whether this fork helps, just reference `4.6.1-fork` instead as the image tag. |
Nice research! I don't think we should have a separate `-fork` tag, though. If we think the fork is production ready then we should either switch CHP to use it if it's fully trusted, or vendor it if it's not (as previously discussed in #413 (comment)). |
i'll be testing out the fork. edit: here are my changes in case anyone wants to see them! |
If it works, 👍 to just switching the dependency and publishing a new release, without any separate `-fork` tag. I think the sustainable longer-term solution is to vendor http2-proxy, which I started here but haven't had time to finish. It would be great to have some actual tests to exercise these things, since it's been hard to control or verify. |
ok, i identified the most problematic hub... the chp pod has been getting OOMKilled and stack tracing at least every 2-3 days. i just deployed the test fork of the chp pod to it and will keep an eye on things over the rest of the week. in other news, this fix seems (read: seems) to use less tcp memory than before. it's tough to say for certain, but at the very least w/my latest deployment on the problematic hub i'll have something that's mildly on fire to watch, vs the others i've deployed on much less trafficked hubs. 🤞 |
womp womp. that pod has restarted three times in the past hour after deploying the -fork chp:
i'm also seeing a lot of these
and sometimes after it's killed (but not every time) we get the following in dmesg:
are we DOSing ourselves somehow? |
i killed that pod and the … so. confusing. |
nope. that pod's chp is still getting OOMKilled. time for a break. :) |
What is its normal memory use, and what is its k8s request/limit in memory? |
i've reverted … that node is still regularly hitting the max memory, but just not as quickly(?) as with …
i looked at chp pod restarts across our hubs. we're seeing this intermittently across our biggest deployments (both 4.6.1 and 4.6.1-fork... which all run on the same core node). the smaller-usage hubs' chps usually run between ~200-300MiB (~200 users max at any time, no massive spikes in logins etc). the larger hubs run ~400+ MiB, but depending on user count the chp pods eventually run out of memory and are OOMKilled.

today was spent looking into our largest class, but the other two had definitely been experiencing the same issue with slightly lesser impact. now that i know what to look for, i'll keep an eye on these individual pods' memory usage and know better what to look for in the logs. the biggest takeaway so far is that … |
i also suspect the … disclaimer: i have a few tabs open, and suspect that the former might be something that isn't cleaning up when something in the proxy stack goes away. the latter might just be a symptom of this, but i can neither confirm nor deny. https://en.wikipedia.org/wiki/Martian_packet so martians are involved somehow, causing network flooding and a DOS? |
btw i rolled the fork back out to a couple of smaller hubs, minus the timeout settings. everything seems cromulent, but the only way to really test this is to have a lot (200+) of people logging in within a short period of time. |
quick update here: this fix really does look promising. orphaned sockets seem to drop significantly, and memory usage doesn't explode wildly and cause users to receive 500s. |
huzzah! Thank you so much for testing and reporting, @shaneknapp. @consideRatio do you want to switch the dependency to the fork and make a real release of CHP? |
@minrk -- while i'm confident it helps, i'm also confident that it doesn't fix the problem outright. while we're not getting the spiky and constant OOMKills w/this test fork, there is still a pretty significant memory leak somewhere: i checked our javascript heap as well (this is in Mi):
as you can see we're bouncing against that pretty quickly (we had a 3-day weekend this week so the figures are a little smaller than usual): so there is still a significant memory leak. maybe we've exposed another bug during the testing... we're also seeing many 503s for ECONNREFUSED on our two biggest hubs. these pop up after the chp has been at the heap limit for a couple of hours, and it looks like people's proxy from the core node (w/the hub and chp pods) is disappearing. this is SUPER disruptive and is impacting coursework.
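(For anyone wanting to reproduce this kind of check, here is a minimal sketch of reading the V8 heap limit and current usage from inside a Node process; this is just an illustration, not necessarily how the figures above were collected.)

```js
const v8 = require("v8");

// heap_size_limit is the ceiling the process eventually runs out against;
// used_heap_size is the current allocation.
const { heap_size_limit, used_heap_size } = v8.getHeapStatistics();
console.log(
  `heap used: ${(used_heap_size / 1024 / 1024).toFixed(1)} MiB / ` +
    `limit: ${(heap_size_limit / 1024 / 1024).toFixed(1)} MiB`
);
```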
another quick update: the 503 errors we're getting are appearing on high-traffic hubs running both the vanilla and fork versions of the chp. they're appearing in multiples of 30 (30 or 60). i think this behavior might be related to, but not caused by, the chp. |
i just deployed the fork to prod for all of our hubs -- the fork seems to be holding up quite well on the high-traffic hubs (>1k users/day, high compute loads), so now we're rolling it out for the rest. if this continues to squelch the memory spikes/OOMKills for another week i'd feel comfortable giving my thumbs-up to roll the fork into a new release! |
alright, it's been a week and i feel very comfortable in saying that we should definitely roll these changes into a release branch asap. it doesn't fix the problem outright (our highest-traffic/load hubs still have one or two chp OOMKills per day w/250+ concurrent users), but it's a significant improvement over vanilla 4.6.1! i firmly believe that we should still investigate further, and even after deploying 4.6.1-fork, we had another OOMKill/chp outage on march 5th that impacted ~300 users.

yesterday, i sat down w/GCP and core node kernel logs, plus grafana/prom data, and put together a minute-by-minute timeline of how things went down. since i'll be on a plane for a few hours today, i'm hoping to get this transcribed from paper to an update on this github issue to help w/debugging. TL;DR: |
quick ping here... is the switch to the forked dependency going to make it into a release? |
@shaneknapp thanks for the ping. I opened #539, then we can make a release with it. |
@shaneknapp Since it improves but doesn't fully fix the issue, it might be worth also testing an older CHP image (or building your own) based on NodeJS <= 15.4. |
4.6.2 was released 2 months ago with a fix for the leaking sockets. Can we close this issue? |
Wieeee!! Thank you for following up @jyounggo!!! |
fwiw, ucb datahubs are still seeing this, albeit nowhere nearly as much as before.... i guess YMMV. :) |
- this is the exact same thing jupyterhub was hitting, with (for now) the same fix: jupyterhub/configurable-http-proxy#434
- that said, http-proxy (and http-proxy-node16) are unmaintained and have many deps with security vulnerabilities, known bugs with PRs open to fix them, etc. This really needs to change.
Hi JupyterHub Devs, I did a modern rewrite of http-proxy this week (it's not that much code!), because even with http-proxy-node16 I was seeing some major socket leaks on cocalc.com, where we use proxying very heavily, and it also seems really sad that http-proxy isn't maintained. My rewrite is at https://www.npmjs.com/package/http-proxy-3 (MIT licensed) in case anybody wants to test it out. It's a drop-in replacement for http-proxy, with no API changes. Another motivation is that "npm audit" showed a lot of issues, so I updated all dependencies to the latest versions; also, some code in the implementation used APIs from nodejs that were deprecated due to security concerns. -- William |
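Assuming http-proxy-3 exposes the same entry point as http-proxy (as the drop-in claim above suggests), switching should be just a dependency swap; a hedged sketch:

```js
// was: const httpProxy = require("http-proxy");
const httpProxy = require("http-proxy-3");

// Same http-proxy API: createProxyServer, proxy.web(), proxy.ws(), events, etc.
const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888", // placeholder target
  ws: true,                        // proxy websockets too
});

proxy.on("error", (err) => {
  console.error("proxy error:", err.message);
});
```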
@williamstein No api changes = No HTTP/2 support? Cross references: How does it compare to https://github.com/unjs/httpxy? |
Correct for now. And just to be clear: this is definitely not ready for production use yet. I'm hoping to start stirring up a little interest in there being a modern, maintained successor to http-proxy that merges PRs, etc. |
Yeah, I'd absolutely be interested. Maybe collaborating with httpxy? FWIW, my own effort on this was to try to vendor https://github.com/nxtedition/node-http2-proxy into this repo since it's also unmaintained, but if there's a maintained http-proxy fork, that would be easier to switch to. |
@minrk thanks! I looked into httpxy, and it has the same motivation as this http-proxy-3 I made. It seems like a big difference is that I'm writing a lot of jest unit tests for http-proxy-3, whereas httpxy has very few. This could make a difference longer-term regarding how easy it is to contribute to http-proxy-3. Today I hope to write tests that specifically address the socket leak issue discussed here. Anyway, if or when http-proxy-3 is "ready for production use" and has been tested for a while, I'll announce that again here, so you guys can consider using it as a drop-in replacement. |
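As an example of the kind of leak-focused test being described, here is my own hedged sketch (not taken from the http-proxy-3 test suite; it assumes http-proxy-3 keeps the http-proxy `createProxyServer` API and that jest globals are available):

```js
const http = require("http");
const httpProxy = require("http-proxy-3");

test("proxied requests do not leave sockets open", async () => {
  // Plain upstream target that just answers requests.
  const target = http.createServer((req, res) => res.end("ok"));
  await new Promise((resolve) => target.listen(0, resolve));

  // Front-end server that proxies everything to the target.
  const proxy = httpProxy.createProxyServer({
    target: `http://127.0.0.1:${target.address().port}`,
  });
  const front = http.createServer((req, res) => proxy.web(req, res));
  await new Promise((resolve) => front.listen(0, resolve));

  // One request through the proxy, with keep-alive disabled so the
  // client connection is expected to close afterwards.
  await new Promise((resolve, reject) => {
    http
      .get({ port: front.address().port, agent: false }, (res) =>
        res.resume().on("end", resolve)
      )
      .on("error", reject);
  });

  // Give the event loop a moment to tear the sockets down, then count
  // connections the front-end server still tracks.
  await new Promise((resolve) => setTimeout(resolve, 100));
  const open = await new Promise((resolve, reject) =>
    front.getConnections((err, n) => (err ? reject(err) : resolve(n)))
  );
  expect(open).toBe(0);

  proxy.close();
  front.close();
  target.close();
});
```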
That would be awesome, I'll be very happy to switch when you're ready |
I spent all week and modernized ALL the unit tests from the original http-proxy, and also made all of their examples into unit tests that are verified to work (they were not actually tested in http-proxy). I fixed various vulnerabilities and issues (some revealed by unit testing). I also added some tests involving websockets, along with instrumentation that counts how many sockets are open, to ensure leaks like the one reported here don't recur. It's running live now on http://cocalc.com. So now I think this is ready for general use: https://www.npmjs.com/package/http-proxy-3

When @shaneknapp above says "fwiw, ucb datahubs are still seeing this, albeit nowhere nearly as much as …" |
@williamstein you. are. the. best. i'm also glad that it wasn't just me going completely bonkers, and doubly glad you had the cycles to get this sorted and get to the bottom of it! fwiw we had a workaround by increasing the ephemeral port range on the hub pods. |
Nice! #572 switches to http-proxy-3. Only one test fails. |
Bug description
We are running z2jh (https://z2jh.jupyter.org/en/stable/) and found there is a socket leak in the `proxy` pod. The number of sockets is constantly increasing (over 60k), and after about a week the kernel generates an error: `kernel: TCP: out of memory -- consider tuning tcp_mem`. I have checked the number of sockets using lsof.
Your personal set up
This chart uses the proxy docker image jupyterhub/configurable-http-proxy:4.5.0.
The config.yaml related to proxy: