Socket leak #434
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗 |
This sounds similar to #388. Would you mind adding your comment on that issue and closing this one to avoid duplicates? Thanks! |
@manics that looks like a memory leak, while this is a socket leak. Should be different, no? |
We just saw a node with a lot of proxy pods spew a lot of |
The `ws` release is unlikely to be relevant, since it's only used in tests. |
I think the relevant code is in configurable-http-proxy/lib/configproxy.js, lines 233 to 245, at commit cb03f77. |
I think this remains a problem in 4.6.1 using node 20 on GKE with linux kernel 5.15, based on info from @shaneknapp. |
CHP has multiple node HTTP servers working in parallel: one for its own REST API, one for proxying, and one for metrics. It would be good to determine whether the growing tcp memory / sockets etc. are associated with a specific one of these. |
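As an illustration of how one might narrow that down (a hedged sketch, not existing CHP code; the server variable names are placeholders), Node's `server.getConnections()` can report the number of open connections per server:

```js
// Hypothetical sketch: periodically log how many connections each of CHP's
// HTTP servers (REST API, proxy, metrics) currently holds, to see which one
// grows over time. apiServer/proxyServer/metricsServer are placeholder names,
// not actual CHP internals.
function logConnectionCounts(servers) {
  for (const [name, server] of Object.entries(servers)) {
    server.getConnections((err, count) => {
      if (!err) {
        console.log(`${new Date().toISOString()} ${name}: ${count} open connections`);
      }
    });
  }
}

// e.g. once a minute:
// setInterval(() => logConnectionCounts({ api: apiServer, proxy: proxyServer, metrics: metricsServer }), 60_000);
```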
Looked at one CHP process and saw for example...
This wasn't expected to have anything close to 500 open connections or similar, so I think it's very safe to say that this reproduces. This is from the latest CHP running with node 20 on linux kernel 5.15 nodes. |
I'm not sure when I expect a socket to be closed. When it times out based on a "timeout" option? I think the timeout option may be infinite. Is the issue that there is simply nothing that makes us destroy sockets once created, because we default to an infinite timeout? |
@minrk and others, is it safe to change a default for a timeout value here to something quite extreme, like 24 hours? I don't want us to disrupt users that are semi-active and run into issues at the 24th hour - but they wouldn't, as long as they are semi-active, right? We have two timeout args matching the node-http-proxy options at https://github.com/http-party/node-http-proxy?tab=readme-ov-file#options.
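For reference, a minimal sketch of those two node-http-proxy options (presumably `timeout` and `proxyTimeout` from the README linked above); the 24-hour value is only the hypothetical default being discussed, not a recommendation:

```js
const httpProxy = require("http-proxy");

const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888", // placeholder target
  // timeout: inactivity timeout (ms) for the incoming client socket
  timeout: 24 * 60 * 60 * 1000,
  // proxyTimeout: timeout (ms) for the outgoing socket to the proxied target
  proxyTimeout: 24 * 60 * 60 * 1000,
});
```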
There is also a related issue reported in node-http-proxy: http-party/node-http-proxy#1510. node-http-proxy was forked and had that issue fixed with a one-line commit at Jimbly/http-proxy-node16@56283e3.

Looking closer at that fork, there is also another memory leak fixed in another commit, according to the commit message: Jimbly/http-proxy-node16@ba0c414. This is detailed in a PR as well: http-party/node-http-proxy#1559. Those two memory-fix commits are the only actual fixes in the fork; the rest is just docs etc.

Maybe we should do a build of CHP based on the forked node-http-proxy project and push a tag that users can opt into? Like `4.6.1-fork`? |
I pushed a `4.6.1-fork` image. If someone wants to try whether this fork helps, just reference `4.6.1-fork` instead as the image tag. |
Nice research! I don't think we should have a separate `-fork` tag, though. If we think the fork is production ready then we should either switch CHP to use it if it's fully trusted, or vendor it if it's not (as previously discussed in #413 (comment)). |
i'll be testing out the fork. edit: here are my changes in case anyone wants to see them! |
If it works, 👍 to just switching the dependency and publishing a new release, without any separate `-fork` tag. I think the sustainable longer-term solution is to vendor http2-proxy, which I started here but haven't had time to finish. It would be great to have some actual tests to exercise these things, since it's been hard to control or verify. |
ok, i identified the most problematic hub... the chp pod has been getting OOMKilled and stack tracing at least every 2-3 days. i just deployed the test fork of the chp pod to it and will keep an eye on things over the rest of the week. in other news, this fix seems (read: seems) to use less tcp memory than before. it's tough to say for certain, but at the very least w/my latest deployment on the problematic hub i'll have something that's mildly on fire to watch, vs the others i've deployed on much less trafficked hubs. 🤞 |
womp womp. that pod has restarted three times in the past hour after deploying the -fork chp:
i'm also seeing a lot of these
and sometimes after it's killed (but not every time) we get the following in dmesg:
are we DOSing ourselves somehow? |
i killed that pod and the … so. confusing. |
nope. that pod's chp is still getting OOMKilled. time for a break. :) |
What is its normal memory use, and what is its k8s request/limit in memory? |
i've reverted … that node is still regularly hitting the max memory, but just not as quickly(?) as with …
i looked at chp pod restarts across our hubs. we're seeing this intermittently across our biggest deployments (both 4.6.1 and 4.6.1-fork... which all run on the same core node). the smaller-usage hubs' chps usually run between ~200-300MiB (~200 users max at any time, no massive spikes in logins etc). the larger hubs run ~400+ MiB, but depending on user count the chp pods eventually run out of memory and are OOMKilled.

today was spent looking into our largest class, but the other two had definitely been experiencing the same issue with slightly lesser impact. now that i know what to look for, i'll keep an eye on these individual pods' memory usage and know better what to look for in the logs. the biggest takeaway so far is that … |
i also suspect the … disclaimer: i have a few tabs open, and suspect that the former might be something that isn't cleaning up when something in the proxy stack goes away. the latter might just be a symptom of this, but i can neither confirm nor deny. https://en.wikipedia.org/wiki/Martian_packet so martians are involved somehow, causing network flooding and a DOS? |
btw i rolled the fork back out to a couple of smaller hubs, minus the timeout settings. everything seems cromulent, but the only way to really test this is to have a lot (200+) of people logging in within a short period of time. |
quick update here: this fix really does look promising. orphaned sockets seem to drop significantly, and memory usage doesn't explode wildly and cause users to receive 500s. |
huzzah! Thank you so much for testing and reporting, @shaneknapp. @consideRatio do you want to switch the dependency to the fork and make a real release of CHP? |
@minrk -- while i'm confident it helps, i'm also confident that it doesn't fix the problem outright. while we're not getting the spiky and constant OOMKills w/this test fork, there is still a pretty significant memory leak somewhere: i checked our javascript heap as well (this is in Mi):
as you can see we're bouncing against that pretty quickly (we had a 3-day weekend this week so the figures are a little smaller than usual): so there is still a significant memory leak. maybe we've exposed another bug during the testing... we're also seeing many 503s for ECONNREFUSED on our two biggest hubs. these pop up after the chp has been at the heap limit for a couple of hours, and it looks like people's proxy from the core node (w/the hub and chp pods) is disappearing. this is SUPER disruptive and is impacting coursework.
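(For anyone wanting to reproduce this kind of check, here is a minimal sketch of reading the V8 heap limit and current usage from inside a Node process; this is just an illustration, not necessarily how the figures above were collected.)

```js
const v8 = require("v8");

// heap_size_limit is the ceiling the process eventually runs out against;
// used_heap_size is the current allocation.
const { heap_size_limit, used_heap_size } = v8.getHeapStatistics();
console.log(
  `heap used: ${(used_heap_size / 1024 / 1024).toFixed(1)} MiB / ` +
    `limit: ${(heap_size_limit / 1024 / 1024).toFixed(1)} MiB`
);
```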
another quick update: the 503 errors we're getting are appearing on high-traffic hubs running both the vanilla and fork versions of the chp. they're appearing in multiples of 30 (30 or 60). i think this behavior might be related to, but not caused by, the chp. |
i just deployed the fork to prod for all of our hubs -- the fork seems to be holding up quite well on the high-traffic hubs (>1k users/day, high compute loads), so now we're rolling it out for the rest. if this continues to squelch the memory spikes/OOMKills for another week i'd feel comfortable giving my thumbs-up to roll the fork into a new release! |
alright, it's been a week and i feel very comfortable in saying that we should definitely roll these changes into a release branch asap. it doesn't fix the problem outright (our highest-traffic/load hubs still have one or two chp OOMKills per day w/250+ concurrent users), but it's a significant improvement over vanilla 4.6.1! i firmly believe that we should still investigate further, and even after deploying 4.6.1-fork, we had another OOMKill/chp outage on march 5th that impacted ~300 users.

yesterday, i sat down w/GCP and core node kernel logs, plus grafana/prom data, and put together a minute-by-minute timeline of how things went down. since i'll be on a plane for a few hours today, i'm hoping to get this transcribed from paper to an update on this github issue to help w/debugging. TL;DR: |
quick ping here... is the switch to the forked dependency going to make it into a release? |
@shaneknapp thanks for the ping. I opened #539, then we can make a release with it. |
@shaneknapp Since it improves but doesn't fully fix the issue, it might be worth also testing an older CHP image (or building your own) based on NodeJS <= 15.4. |
4.6.2 was released 2 months ago with a fix for the leaking sockets. Can we close this issue? |
Wieeee!! Thank you for following up @jyounggo!!! |
fwiw, ucb datahubs are still seeing this, albeit nowhere nearly as much as before.... i guess YMMV. :) |
- this is the exact same thing jupyterhub was hitting, with (for now) the same fix: jupyterhub/configurable-http-proxy#434
- that said, http-proxy (and http-proxy-node16) are unmaintained and have many deps with security vulnerabilities, known bugs with PRs open to fix them, etc. This really needs to change.
Hi JupyterHub Devs, I did a modern rewrite of http-proxy this week (it's not that much code!), because even with http-proxy-node16 I was seeing some major socket leaks on cocalc.com, where we use proxying very heavily, and it also seems really sad that http-proxy isn't maintained. My rewrite is at https://www.npmjs.com/package/http-proxy-3 (MIT licensed) in case anybody wants to test it out. It's a drop-in replacement for http-proxy, with no API changes. Another motivation is that "npm audit" showed a lot of issues, so I updated all dependencies to the latest versions; also, some code in the implementation used APIs from nodejs that were deprecated due to security concerns. -- William |
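Assuming http-proxy-3 exposes the same entry point as http-proxy (as the drop-in claim above suggests), switching should be just a dependency swap; a hedged sketch:

```js
// was: const httpProxy = require("http-proxy");
const httpProxy = require("http-proxy-3");

// Same http-proxy API: createProxyServer, proxy.web(), proxy.ws(), events, etc.
const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888", // placeholder target
  ws: true,                        // proxy websockets too
});

proxy.on("error", (err) => {
  console.error("proxy error:", err.message);
});
```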
@williamstein No api changes = No HTTP/2 support? Cross references: How does it compare to https://github.com/unjs/httpxy? |
Correct for now. And just to be clear: this is definitely not ready for production use yet. I'm hoping to start stirring up a little interest in there being a modern, maintained successor to http-proxy that merges PRs, etc. |
Yeah, I'd absolutely be interested. Maybe collaborating with httpxy? FWIW, my own effort on this was to try to vendor https://github.com/nxtedition/node-http2-proxy into this repo since it's also unmaintained, but if there's a maintained http-proxy fork, that would be easier to switch to. |
@minrk thanks! I looked into httpxy, and it has the same motivation as this http-proxy-3 I made. It seems like a big difference is that I'm writing a lot of jest unit tests for http-proxy-3, whereas httpxy has very few. This could make a difference longer-term regarding how easy it is to contribute to http-proxy-3. Today I hope to write tests that specifically address the socket leak issue discussed here. Anyway, if or when http-proxy-3 is "ready for production use" and has been tested for a while, I'll announce that again here, so you guys can consider using it as a drop-in replacement. |
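As an example of the kind of leak-focused test being described, here is my own hedged sketch (not taken from the http-proxy-3 test suite; it assumes http-proxy-3 keeps the http-proxy `createProxyServer` API and that jest globals are available):

```js
const http = require("http");
const httpProxy = require("http-proxy-3");

test("proxied requests do not leave sockets open", async () => {
  // Plain upstream target that just answers requests.
  const target = http.createServer((req, res) => res.end("ok"));
  await new Promise((resolve) => target.listen(0, resolve));

  // Front-end server that proxies everything to the target.
  const proxy = httpProxy.createProxyServer({
    target: `http://127.0.0.1:${target.address().port}`,
  });
  const front = http.createServer((req, res) => proxy.web(req, res));
  await new Promise((resolve) => front.listen(0, resolve));

  // One request through the proxy, with keep-alive disabled so the
  // client connection is expected to close afterwards.
  await new Promise((resolve, reject) => {
    http
      .get({ port: front.address().port, agent: false }, (res) =>
        res.resume().on("end", resolve)
      )
      .on("error", reject);
  });

  // Give the event loop a moment to tear the sockets down, then count
  // connections the front-end server still tracks.
  await new Promise((resolve) => setTimeout(resolve, 100));
  const open = await new Promise((resolve, reject) =>
    front.getConnections((err, n) => (err ? reject(err) : resolve(n)))
  );
  expect(open).toBe(0);

  proxy.close();
  front.close();
  target.close();
});
```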
That would be awesome, I'll be very happy to switch when you're ready |
I spent all week and modernized ALL the unit tests from the original http-proxy, and also made all of their examples into unit tests that are verified to work (they were not actually tested in http-proxy). I fixed various vulnerabilities and issues (some revealed by unit testing). I also added some tests involving websockets, along with instrumentation that counts how many sockets are open, to ensure leaks like the one reported here don't recur. It's running live now on http://cocalc.com. So now I think this is ready for general use: https://www.npmjs.com/package/http-proxy-3

When @shaneknapp above says "fwiw, ucb datahubs are still seeing this, albeit nowhere nearly as much as …" |
@williamstein you. are. the. best. i'm also glad that it wasn't just me going completely bonkers, and doubly glad you had the cycles to get this sorted and get to the bottom of it! fwiw we had a workaround by increasing the ephemeral port range on the hub pods. |
Nice! #572 switches to http-proxy-3. Only one test fails. |
Bug description
We are running z2jh (https://z2jh.jupyter.org/en/stable/) and found there is a socket leak in the `proxy` pod. The number of sockets is constantly increasing (over 60k), and after about a week the kernel generates an error: `kernel: TCP: out of memory -- consider tuning tcp_mem`. I have checked the number of sockets using lsof.
Your personal set up
This chart uses the proxy docker image jupyterhub/configurable-http-proxy:4.5.0.
The config.yaml related to proxy: