Feature (ECS): add support for managing failing deployments #6225

Open
iniinikoski opened this issue Dec 10, 2020 · 11 comments

Comments

@iniinikoski

Issue Summary:

Right now when Spinnaker deploys ECS services, a new service is always created. If the deployed Task on the new service continues to fail on startup, the deployment stage in Spinnaker just times out. In AWS, ECS continues to try to get the tasks running "forever".

Just recently, AWS introduced this new feature in Preview: https://aws.amazon.com/blogs/containers/announcing-amazon-ecs-deployment-circuit-breaker/ to mitigate the problem described above. Unfortunately, as Spinnaker spins up a new service for every deployment, this feature cannot be used.

We have tried to solve the issue at the Spinnaker stage level (by shrinking the new, failing service), but this did not look very bulletproof.

Unfortunately this affects every customer using Spinnaker and ECS - meaning, there is a definite risk of higher costs because of the container registry traffic (which can be quite enormous at times).

See this issue for some details: #4951 (comment)

Cloud Provider(s):

AWS / ECS

Environment:

All versions of Spinnaker

Feature Area:

Deployments

Description:

See above.

--

@lifeofguenter

I think this is the default behavior also for EC2. I see an option with EC2 that you can automatically rollback on failures, but I have never tried that out. Maybe something similar can be done with ECS where a deploy that fails will kill the cluster.

As an alternative for now, I think it should be possible to create a post-step that disables the cluster after a certain time (= timeout)? I assume that would also go in the direction of canary deployments.

Those built-in ECS strategies are really good, but I think they are better suited to using ECS natively than to using something like Spinnaker on top (but this is just my opinion).
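
A minimal sketch of the post-step idea above, assuming it is acceptable to scale the new ECS service down to zero once a deadline has passed without the desired tasks running. The function names and the 15-minute default are hypothetical, and `ecs_client` is assumed to be a boto3 ECS client (or anything exposing the same `describe_services` and `update_service` calls). This is not how Spinnaker's own disable logic works; it only illustrates the timeout check.

```python
from datetime import datetime, timedelta, timezone


def should_scale_down(service: dict, now: datetime, timeout: timedelta) -> bool:
    """Return True if the service is older than `timeout` and still short of tasks.

    `service` is one entry of the `services` list returned by ECS
    DescribeServices (fields used: createdAt, desiredCount, runningCount).
    """
    too_old = now - service["createdAt"] >= timeout
    unhealthy = service["runningCount"] < service["desiredCount"]
    return too_old and unhealthy


def scale_down_if_stuck(ecs_client, cluster: str, service_name: str,
                        timeout: timedelta = timedelta(minutes=15)) -> bool:
    """Hypothetical post-step: set desiredCount to 0 when the deploy looks stuck."""
    response = ecs_client.describe_services(cluster=cluster, services=[service_name])
    service = response["services"][0]
    if should_scale_down(service, datetime.now(timezone.utc), timeout):
        # Scaling to zero stops ECS from retrying (and re-pulling images) forever.
        ecs_client.update_service(cluster=cluster, service=service_name, desiredCount=0)
        return True
    return False
```

The check deliberately ignores services that are still within the timeout, so a slow but healthy rollout is not cut short.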

@piradeepk

Dependent on Task Sets being implemented: #4951

@iniinikoski
Author
iniinikoski commented Dec 11, 2020

We also still have some EC2-based deployments and only saw this option now - thanks @lifeofguenter. Yes, something similar would be needed here... Unfortunately we cannot do canaries for most of our applications... We previously tried a post-step in a pipeline as a separate stage, but we could not trust its logic completely.

There's actually also a bigger problem that arises from Spinnaker's way of doing deployments (= creating a new service in ECS on each deployment): https://spinnaker.io/setup/install/providers/aws/aws-ecs/#optional-service-auto-scaling - this, to my understanding, cannot be "codified" right now... Please correct me if I'm wrong :)

Part of me agrees with you @lifeofguenter on this - but we also discussed this internally - and one good point was that as AWS is constantly adding enhancements to ECS, it'd be a shame to reinvent the wheel in Spinnaker. Also.

@piradeepk
piradeepk commented Dec 11, 2020

@iniinikoski @lifeofguenter completely agree. The fact that Spinnaker creates a new service on every deployment (and doesn't function the way that AWS does deployments to ECS) prevents us from using a lot of the built-in support for use cases like updating a service (such as circuit breakers).

Updating deployment logic in Spinnaker to use Task Sets would definitely help to improve that experience. There's a roadmap item to do just that, but it requires a bit of investigating as previously Task Sets did not support everything that was available when creating a new service.

@iniinikoski
Author

@piradeepk thanks for your update. Completely understand this is definitely not something trivial :) But, I also understand that'd break the "Spinnaker deployment pattern"... Would it be possible to support both modes at the end (without crazy complexity) - meaning, the "Spinnaker mode" and... "native ECS"...?

@akshayabd added the "triaged" label (Triaged in a SIG meeting) on Dec 16, 2020

@HaroonSaid

We see this type of issue a lot, especially in our lower environments, where automation or developers keep deploying the same image over and over in the hope of a successful outcome, eventually exhausting all available resources and blocking other deployments.

@iniinikoski
Author
iniinikoski commented Dec 16, 2021

> @iniinikoski @lifeofguenter completely agree. The fact that Spinnaker creates a new service on every deployment (and doesn't function the way that AWS does deployments to ECS) prevents us from using a lot of the built-in support for use cases like updating a service (such as circuit breakers).
>
> Updating deployment logic in Spinnaker to use Task Sets would definitely help to improve that experience. There's a roadmap item to do just that, but it requires a bit of investigating as previously Task Sets did not support everything that was available when creating a new service.

We've learned one thing here @piradeepk & @lifeofguenter: setting the Circuit Breaker to "Enabled no rollback" would already stop ECS from continuously trying to redeploy a failing service and instead let it stay down (without exploding costs or anything); see the screenshots.

According to the linked ticket about Task Sets - implementing this would not help anyone with the Circuit Breaker feature.

How do you feel about this?

Screenshot 2021-12-16 at 16 58 42
Screenshot 2021-12-16 at 16 55 09
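
For readers without access to the screenshots, here is a rough sketch of what "Enabled no rollback" corresponds to in the ECS API, assuming the service is created through the `CreateService` call. The `deploymentCircuitBreaker` block with its `enable` and `rollback` fields is part of the ECS `deploymentConfiguration`; the helper names and the example service name are hypothetical and are not taken from Spinnaker.

```python
def circuit_breaker_config(enable: bool = True, rollback: bool = False) -> dict:
    """Build the deploymentConfiguration fragment for ECS CreateService/UpdateService."""
    return {"deploymentCircuitBreaker": {"enable": enable, "rollback": rollback}}


def create_service_request(cluster: str, service_name: str,
                           task_definition: str, desired_count: int) -> dict:
    """Assemble CreateService keyword arguments with the circuit breaker enabled.

    rollback=False matches the "Enabled no rollback" setting: ECS stops
    retrying a failing deployment but does not try to restore an older one.
    """
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        "deploymentConfiguration": circuit_breaker_config(enable=True, rollback=False),
    }
```

With boto3 installed and credentials configured, the request could be sent as `boto3.client("ecs").create_service(**create_service_request("my-cluster", "myapp-v001", "myapp:1", 2))`, where all four argument values are placeholders.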

@jgrumboe

@allisaurus
This would be a nice (intermediate) solution to mitigate ever-failing ECS deployments. For the purposes of this issue, it would be enough if we could select "Enable ECS CB with no rollback" as a UI option for the ECS Deployment Cluster in Spinnaker. With container health checks defined in the task definition and the CB (with no rollback) activated on each new server group/ECS service, there's a good chance of mitigating ever-failing ECS deployments.

Additionally, if the CB is enabled and was triggered, it would be cool if this could somehow be read/fetched by clouddriver, so the red/black strategy can react to the triggered CB and fail the stage immediately instead of timing out like it does currently.
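
One way such a check could look, sketched in Python purely for illustration (clouddriver itself is not written in Python, and this is not its actual code). It assumes the data comes from the ECS `DescribeServices` call, whose `deployments` entries carry a `rolloutState` of `COMPLETED`, `FAILED`, or `IN_PROGRESS` along with a `rolloutStateReason`. The function name is hypothetical.

```python
from typing import Optional


def circuit_breaker_failure(service: dict) -> Optional[str]:
    """Return a failure reason if any deployment of the service has FAILED.

    `service` is one entry of the `services` list from ECS DescribeServices.
    Returns None while the rollout is still in progress or has completed.
    """
    for deployment in service.get("deployments", []):
        if deployment.get("rolloutState") == "FAILED":
            return deployment.get("rolloutStateReason", "rollout failed")
    return None
```

A deploy stage polling the service could then fail as soon as this returns a reason, rather than waiting for its own timeout.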

@iandelahorne

Having the circuit breaker enabled on deployments would be amazing; it would clear up a lot of confusion around failed/stuck ECS deploys.

@allisaurus

Hi @jgrumboe - I'm no longer an active Spinnaker contributor, but CC'ing @paragbhingre & @uttarasridhar for awareness.

@jgrumboe
jgrumboe commented Jan 7, 2024

I've created the above linked PRs to support enabling ECS Deployment Circuit Breaker for the server group (=ECS service).
Looking forward to reading your comments and thoughts.


8 participants