8000 wait for balancer to have some address available on gateway by sp-joseluis-ledesma · Pull Request #76 · uswitch/kiam · GitHub
[go: up one dir, main page]

Skip to content
This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

wait for balancer to have some address available on gateway #76

Merged
merged 1 commit into from
Jun 1, 2018

Conversation

sp-joseluis-ledesma
Copy link
Contributor

When running the health command, more often than not, the output is:

WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO[0000] healthy: ok

running with GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 the problem is that the balancer has not been notifiied of the address before it tries to connect:

INFO: 2018/06/01 06:55:06 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc420205320
INFO: 2018/06/01 06:55:06 dialing to target with scheme: ""
INFO: 2018/06/01 06:55:06 could not get resolver for scheme: ""
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/01 06:55:06 balancerWrapper: is pickfirst: false
INFO: 2018/06/01 06:55:06 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 100.64.0.10:53: no such host.
[...]

This PR adds a "wait" for the balancer to have address on the NewGateway function, so when it returns the KiamGateway we ensure there is some address in the balancer

Copy link
Contributor
@pingles pingles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see the timeout/context specified as an arg to NewGateway rather than inside NewGateway- to allow whatever is consuming it to specify/propagate a timeout.

Then, we can place balancer.Get inside a retry loop that can just keep retrying until the balancer resolves- again allowing whatever is consuming NewGateway to determine how long they're happy to wait.

Thanks for the PR though! Be nice to have this integrated.

@sp-joseluis-ledesma
Copy link
8000 Contributor Author

I added a parameter to the NewGateway function, and passed it from the agent and health commands. Is it ok? or do you prefer another aproach?

Perhaps the NewGateway parameters could be converted to the "WithOption" pattern, but I leave this for another PR :)

Copy link
Contributor
@pingles pingles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks- I've made some more comments in-line to help.

Sorry for not explaining before but I'd like the context.Context to be constructed by the commands and passed to NewGateway rather than the wait duration (and add cli flags to set defaults).

Thanks again so much for adding this.

@@ -45,7 +45,7 @@ type GatewayConfig struct {
TLS *TLSConfig
}

func NewGateway(address string, refresh time.Duration, caFile, certificateFile, keyFile string) (*KiamGateway, error) {
func NewGateway(address string, refresh time.Duration, waitForBalancer time.Duration, caFile, certificateFile, keyFile string) (*KiamGateway, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pass the context.Context through as the first arg to NewGateway, rather than the wait duration.

@@ -80,6 +80,13 @@ func NewGateway(address string, refresh time.Duration, caFile, certificateFile,
return nil, err
}

ctx, cancel := context.WithTimeout(context.Background(), waitForBalancer)
defer cancel()
_, _, err = balancer.Get(ctx, grpc.BalancerGetOptions{BlockingWait: true})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass the context through then this doesn't need to create context.WithTimeout it can just cancel according to the context it's provided.

@@ -129,7 +129,7 @@ func main() {
config := http.NewConfig(opts.port)
config.AllowIPQuery = opts.allowIPQuery

gateway, err := kiamserver.NewGateway(opts.serverAddress, opts.serverAddressRefresh, opts.caPath, opts.certificatePath, opts.keyPath)
gateway, err := kiamserver.NewGateway(opts.serverAddress, opts.serverAddressRefresh, 50*time.Millisecond, opts.caPath, opts.certificatePath, opts.keyPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing the time we'll pass a context.Context through. We should also add a CLI flag to control what the wait duration is.

@@ -63,7 +63,7 @@ func main() {
log.SetLevel(log.ErrorLevel)
}

gateway, err := kiamserver.NewGateway(opts.serverAddress, opts.serverAddressRefresh, opts.caPath, opts.certificatePath, opts.keyPath)
gateway, err := kiamserver.NewGateway(opts.serverAddress, opts.serverAddressRefresh, 50*time.Millisecond, opts.caPath, opts.certificatePath, opts.keyPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the agent's main, we should construct a context and pass it to NewGateway and set the timeout on that context from a cli flag.

@sp-joseluis-ledesma
Copy link
Contributor Author

Ok, updated following your comments. Let me know if there is anything more to change.

@mikesplain
Copy link
Contributor

Thanks @sp-joseluis-ledesma! We run into this as well, this is great.

@pingles pingles merged commit 1d6d618 into uswitch:master Jun 1, 2018
@pingles
Copy link
Contributor
pingles commented Jun 1, 2018

Thanks @sp-joseluis-ledesma

@pingles
Copy link
Contributor
pingles commented Jun 1, 2018

I've tagged it as v2.8 and the image is now up on quay.io, I'll create a separate PR with the CHANGELOG updates but you can update now if you like!

@TattiQ
Copy link
TattiQ commented Aug 3, 2018

Hi guys, sorry for bothering you in this thread but my issues seems a bit connected to this one. I am using version 2.7 now with kops , it ran fine in dev and today I tried it out in prod (hoping to substitute kube2iam) and got an issue with some of the agents when they can not seem to connect to the server. The error suggests that there are no addresses to connect to and unlike for Jose Luis, I get a fatal error. and this is reproduced constantly on those agents even if the ping succeeds after all. I wonder if anyone has any thoughts on that.
screen shot 2018-08-03 at 10 33 09

then the agent reports in its logs - {"addr":"100.99.186.53:33766","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unavailable desc = there is no address available","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2018-08-03T07:35:45Z"} .

and the healthchecks on all servers run fine + nothing suspicious in servers logs.

@pingles
Copy link
Contributor
pingles commented Aug 3, 2018

@TattiQ only thing I can recommend for now is increasing the time limits on your health check and see if that helps- do liveness probes always fail or intermittently? Does it do the same if you specify the FQDN for the server?

Could you perhaps open as a new issue and then we can track it?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0