WIP: add support for `upstream command "/path/to/command"` by tkluck · Pull Request #228 · tinyproxy/tinyproxy · GitHub

WIP: add support for upstream command "/path/to/command" #228


Open

tkluck wants to merge 2 commits into master

Conversation

@tkluck
tkluck commented Mar 1, 2019

This is a first stab at #224: a way to get arbitrarily complex upstream proxy logic by forking out to a user-defined command. As it stands, it contains many TODO comments; I'm opening this pull request now to get feedback as early as possible. @rofl0r, I would appreciate any insight / suggestions / corrections!

As described in that ticket, my idea is to keep a single subprocess (per worker) running. Messages are passed over pipes connected to its stdin and stdout. An easier alternative would be to fork off a new process every time we need it, but I'd really like to amortize any initialization cost for the subprocess. (In my target use case, the subprocess needs to initialize a javascript interpreter.)

After receiving the result from the subprocess, this commit modifies the fields of a struct upstream in-place. This would be a problem if tinyproxy handles several connections simultaneously in the same process, but I couldn't spot any evidence for that happening. Please correct me if I'm wrong.

@rofl0r
Contributor
rofl0r commented Mar 1, 2019

An easier alternative would be to fork off a new process every time we need it, but I'd really like to amortize any initialization cost for the subprocess.

that sounds like premature optimization

This would be a problem if tinyproxy handles several connections simultaneously in the same process

i'm actually working on / have finished a port to threads which i plan to make the default soon (check my threads branch #218), so there will soon be concurrent access.

i think we should keep it as simple as possible. for example the script could return a line containing a config directive like upstream http localhost:8080 .foo.com and we feed it to the same parsing function that the config parser uses. also we could simply say we don't handle errors, so we don't have the complexity of dealing with stderr, which means we can use the portable popen().
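
For illustration, a minimal sketch of that flow, assuming a hypothetical parse_upstream_directive() standing in for whatever function the config parser already uses (it is not an existing tinyproxy symbol):

```c
/* Sketch only: run the command via popen(), read a single "upstream ..."
 * line from its stdout and hand it to the config-line parser.
 * parse_upstream_directive() is a placeholder for that parser, not an
 * existing tinyproxy function. */
#include <stdio.h>
#include <string.h>

static int parse_upstream_directive (const char *line)
{
        /* placeholder: the real code would reuse tinyproxy's config parser */
        fprintf (stderr, "would parse: %s\n", line);
        return 0;
}

static int query_upstream_command (const char *command)
{
        char line[512];
        FILE *fp = popen (command, "r");

        if (!fp)
                return -1;

        if (!fgets (line, sizeof (line), fp)) {
                pclose (fp);
                return -1;
        }
        pclose (fp);

        line[strcspn (line, "\r\n")] = '\0';
        /* e.g. line is now: upstream http localhost:8080 ".foo.com" */
        return parse_upstream_directive (line);
}
```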

@tkluck
Author
tkluck commented Mar 2, 2019

An easier alternative would be to fork off a new process every time we need it, but I'd really like to amortize any initialization cost for the subprocess.

that sounds like premature optimization

Without a use case, I would agree with you. But initialization of the script I'm using currently takes hundreds of milliseconds, and message passing takes only hundreds of microseconds.

(My personal perception of what kinds of delays are acceptable is coloured by working with envoyproxy in a service mesh setting; there, I run http queries that should have a P99 of ~10ms. That's why I'm hoping to push any overhead into microsecond land.)

And while it does add some complication, some of that is unavoidable with a subprocess anyway. A subprocess can fail (or worse, hang) for many reasons regardless of whether it's expected to run briefly or continuously, and some accommodation needs to be made for that.

This would be a problem if tinyproxy handles several connections simultaneously in the same process

i'm actually working on / have finished a port to threads which i plan to make the default soon (check my threads branch #218), so there will soon be concurrent access.

Ah! This is exactly the kind of stuff for which I like getting early feedback :) I'll take a look and it may even be wise to rebase my work on top of that one from the get-go.

i think we should keep it as simple as possible. for example the script could return a line containing a config directive like upstream http localhost:8080 .foo.com and we feed it to the same parsing function that the config parser uses.

Yes I like that idea!

also we could simply say we don't handle error, so we don't have the complexity of dealing with stderr,

I'm not sure I agree with that. Processes error out in many different ways; they may even hang and that carries the risk of tinyproxy hanging on a read syscall. In my experience, this is bound to happen when you serve any meaningful rate of traffic, so we'd better be prepared.

which means we can use the portable popen().

This sounds like you have portability concerns with some of the other libc calls I'm making, do I interpret that correctly? If so, which ones? And what platforms does tinyproxy target for portability?

@tkluck
Author
tkluck commented Mar 2, 2019

Without a use case, I would agree with you. But initialization of the script I'm using currently takes hundreds of milliseconds, and message passing takes only hundreds of microseconds.

To put these numbers in context: even just starting a python process takes ~50ms; starting a bash process takes ~3ms. The former would be unacceptable, and even the latter would be on the high end of what I'd be willing to add to every http request tinyproxy serves. What's your take on this?

(My personal perception of what kinds of delays are acceptable is coloured by working with envoyproxy in a service mesh setting; there, I run http queries that should have a P99 of ~10ms. That's why I'm hoping to push any overhead into microsecond land.)

@rofl0r
Contributor
rofl0r commented Mar 2, 2019

Without a use case, I would agree with you. But initialization of the script I'm using currently takes hundreds of milliseconds, and message passing takes only hundreds of microseconds.

the message passing stuff you're talking about can be done in the script/program that's called. for example you could have some javascript interpreter service running which takes requests on a socket, and the spawned process just does something like echo "DEST:192.168.23.5" | netcat localhost 12345
i want to keep the mechanism offered by tinyproxy as simple as possible.
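
For what it's worth, the spawned helper itself can stay tiny even without netcat. A rough C equivalent of that one-liner might look like the sketch below; the port 12345 and the DEST: format are just the example values from the comment above, and the target host is taken from argv[1] purely for illustration (stdin would work just as well).

```c
/* Sketch of the spawned helper, as a C stand-in for the echo | netcat
 * one-liner above: forward the request to a long-running local service and
 * print its one-line answer on stdout, where tinyproxy would read it. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main (int argc, char **argv)
{
        struct sockaddr_in addr;
        char buf[512];
        ssize_t n;
        int fd;

        if (argc < 2)
                return 1;

        fd = socket (AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return 1;

        memset (&addr, 0, sizeof (addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons (12345);
        inet_pton (AF_INET, "127.0.0.1", &addr.sin_addr);

        if (connect (fd, (struct sockaddr *) &addr, sizeof (addr)) < 0)
                return 1;

        snprintf (buf, sizeof (buf), "DEST:%s\n", argv[1]);
        write (fd, buf, strlen (buf));

        /* relay the one-line answer to stdout */
        n = read (fd, buf, sizeof (buf) - 1);
        if (n <= 0)
                return 1;
        buf[n] = '\0';
        fputs (buf, stdout);

        close (fd);
        return 0;
}
```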

I'm not sure I agree with that. Processes error out in many different ways; they may even hang and that carries the risk of tinyproxy hanging on a read syscall. In my experience, this is bound to happen when you serve any meaningful rate of traffic, so we'd better be prepared.

this logic can all be stuffed in the called program too.
if the tiny delay caused by calling exec is already too much for you, maybe you should look into using musl libc which is a lot more responsive than glibc. :)

This sounds like you have portability concerns with some of the other libc calls I'm making, do I interpret that correctly? If so, which ones? And what platforms does tinyproxy target for portability?

you've added a configure check for fdopen, which (without knowing a lot about that API) suggests that it might be lacking - otherwise there would be no need to check for it, correct?
the baseline profile we're targeting is a system supporting C99 and POSIX 2003ish. so it should work on mac, bsd, linux and eventually mingw. adding hacks to make it work on ancient AIX is definitely not something i intend to do...

@tkluck
Author
tkluck commented Mar 2, 2019

this logic can all be stuffed in the called program too.
(...) the tiny delay caused by calling exec (...)

Fair enough. It's fine if we'd make different tradeoffs between code simplicity on the one hand and latency + defensiveness on the other. After all, it's your standards that count for tinyproxy.

I also like your suggestion for forking off netcat. I measured single-digit millisecond latency for that operation. As I said, that may be on the high end for other use cases that I've had in the past, but it's probably fine for what I'm doing right now.

So, thanks for helping me sort this out before I've gone down a rabbit hole of implementation.

I looked into popen; it can only connect to either stdin or stdout. I could change to passing host as a command line argument instead of on stdin, but I worry about ending up in shell escaping hell (or worse, shell injection over a network socket).

As for fdopen, it seems pretty portable (a random internet link says POSIX.1 and UNIX). The only reason I added it to configure.ac was basically as a reminder to check this. For comparison, I added it to a list that also contains strdup, and strdup is already used quite liberally (without any checks for HAVE_STRDUP, fwiw). So I'd like to stick with it.

In summary, here's what I'd propose:

  • I'll take your suggestion of synchronous communication with the command: spawning a new process every time, reading its output and waiting for its completion. That has the advantages that (1) it's simpler and (2) it's much easier in a future threaded world. (It has the downside of not being defensive against issues with the subcommand, but arguably that's the administrator's problem, not tinyproxy's.)
  • The output message is parsed according to the same format as upstream ... config lines (except, obviously, the upstream command ... one).
  • I'll use the fork, execl, fdopen, fprintf, fflush, fgets and waitpid calls for communicating with it (a rough sketch follows below). I don't see obvious compatibility issues with them. (For comparison, getline or getdelim would have been useful, but would restrict us to POSIX 200809L.)
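
For concreteness, here is roughly what that call sequence could look like; names are placeholders and error handling is kept minimal, so this is a sketch of the approach rather than the actual patch:

```c
/* Sketch of the proposed synchronous flow using exactly the calls listed
 * above; run_upstream_command() is a placeholder name, not the final patch. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static int run_upstream_command (const char *command, const char *host,
                                 char *line, size_t len)
{
        int to_child[2], from_child[2], status;
        pid_t pid;
        FILE *in, *out;

        if (pipe (to_child) < 0 || pipe (from_child) < 0)
                return -1;

        pid = fork ();
        if (pid < 0)
                return -1;

        if (pid == 0) {  /* child: wire the pipes to stdin/stdout, then exec */
                dup2 (to_child[0], STDIN_FILENO);
                dup2 (from_child[1], STDOUT_FILENO);
                close (to_child[1]);
                close (from_child[0]);
                execl (command, command, (char *) NULL);
                _exit (127);
        }

        close (to_child[0]);
        close (from_child[1]);
        out = fdopen (to_child[1], "w");
        in = fdopen (from_child[0], "r");
        if (!out || !in)
                return -1;

        fprintf (out, "%s\n", host);        /* send the request host          */
        fflush (out);
        fclose (out);                       /* EOF tells the child we're done */

        if (!fgets (line, (int) len, in))   /* read one directive line back   */
                line[0] = '\0';
        fclose (in);

        waitpid (pid, &status, 0);          /* reap the child; synchronous    */

        line[strcspn (line, "\r\n")] = '\0';
        return line[0] ? 0 : -1;
}
```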

@rofl0r
Contributor
rofl0r commented Mar 2, 2019

ah, popen() does indeed call the shell, rather than passing an argv array. that pretty much precludes its usage in a potentially hostile env.
anyway, you may want to take a look at what i wrote here: https://github.com/rofl0r/libulz/blob/master/src/proclib/process.c . it'd be no issue at all to relicense it for use in tinyproxy.
since it appears that the feature will probably take a lot more code than i initially intended, we might as well make it optional and have it automatically disabled if posix_spawn is not available.
as we have to circumvent the shell anyway, i don't really see an advantage of using stdin over passing the connection target via argv[1], as the receiving process sees it as a single string either way. i figure that by not having to fiddle around with stdin, the code can be kept a little shorter.
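
For reference, a rough posix_spawn() version of that idea, again with placeholder names rather than actual tinyproxy code: the target goes in as argv[1], so only the child's stdout needs a pipe.

```c
/* Sketch of the posix_spawn() variant: the connection target is passed as
 * argv[1]; only the child's stdout is wired to a pipe. */
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

extern char **environ;

static int run_upstream_command (const char *command, const char *host,
                                 char *line, size_t len)
{
        posix_spawn_file_actions_t fa;
        char *argv[] = { (char *) command, (char *) host, NULL };
        int out[2], status;
        pid_t pid;
        FILE *fp;

        if (pipe (out) < 0)
                return -1;

        /* in the child: stdout becomes the write end of the pipe */
        posix_spawn_file_actions_init (&fa);
        posix_spawn_file_actions_adddup2 (&fa, out[1], STDOUT_FILENO);
        posix_spawn_file_actions_addclose (&fa, out[0]);
        posix_spawn_file_actions_addclose (&fa, out[1]);

        if (posix_spawn (&pid, command, &fa, NULL, argv, environ) != 0) {
                posix_spawn_file_actions_destroy (&fa);
                close (out[0]);
                close (out[1]);
                return -1;
        }
        posix_spawn_file_actions_destroy (&fa);
        close (out[1]);

        fp = fdopen (out[0], "r");
        if (!fp || !fgets (line, (int) len, fp))
                line[0] = '\0';
        if (fp)
                fclose (fp);

        waitpid (pid, &status, 0);

        line[strcspn (line, "\r\n")] = '\0';
        return line[0] ? 0 : -1;
}
```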

some other thoughts:

  • we could eventually fit the parsing of the "upstream command" into the existing regex by using a grouping and an or
  • the receiving function would then call either a subfunc parsing the "command" directive, or another one parsing the existing config, which would then be reused by the code dealing with the stdout result of the script. i think if we do it that way, we don't have to shoehorn yet another parameter into the argument list of the upstream handler func. (a rough sketch of this dispatch follows below.)
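
As an illustration of that grouping-and-or idea: a single directive regex with an alternation and capture groups, dispatching on which branch matched. The pattern below is invented for the sketch and is not tinyproxy's actual config grammar.

```c
/* Illustration only: dispatch on which alternation branch of one regex
 * matched. The pattern is made up for this sketch. */
#include <regex.h>
#include <stdio.h>

static void print_match (const char *what, const char *line, regmatch_t *m)
{
        printf ("%s: %.*s\n", what, (int) (m->rm_eo - m->rm_so), line + m->rm_so);
}

int main (void)
{
        const char *pattern =
            "^upstream[ \t]+(command[ \t]+\"([^\"]+)\"|(http|socks4|socks5)[ \t]+(.+))$";
        const char *lines[] = {
                "upstream command \"/usr/local/bin/pick-proxy\"",
                "upstream http localhost:8080 \".foo.com\"",
        };
        regex_t re;
        regmatch_t m[5];
        size_t i;

        regcomp (&re, pattern, REG_EXTENDED);

        for (i = 0; i < sizeof (lines) / sizeof (lines[0]); i++) {
                if (regexec (&re, lines[i], 5, m, 0) != 0)
                        continue;
                if (m[2].rm_so != -1)        /* "command" branch matched */
                        print_match ("command", lines[i], &m[2]);
                else                         /* plain upstream branch    */
                        print_match ("proxy", lines[i], &m[1]);
        }
        regfree (&re);
        return 0;
}
```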

@tkluck
Author
tkluck commented Mar 3, 2019

Can I make an observation? I don't mind any of your suggestions, but I also don't always agree they are much better; they are just alternatives. That doesn't mean I'm unwilling to implement them, but it leaves me with the feeling that anything I may be about to write will go through five back-and-forths before it is exactly as you would have written it.

That's wasted effort on both your side and mine if the outcome is code exactly as you would have written it anyway. Isn't it just better if you implement this? That may even cost you less time than reviewing what I write.

@max19931
max19931 commented Oct 24, 2021

Why not just use a single hostname in the config and let DNS handle the discovery of the right proxy server? This is way simpler and doesn't bloat tinyproxy with additional logic and risks.

A or AAAA for direct mapping and CNAME as offload.
