Feature (ECS): add support for managing failing deployments #6225
Comments
I think this is also the default behavior for EC2. I see an option with EC2 to automatically roll back on failures, but I have never tried it out. Maybe something similar can be done with ECS, where a deploy that fails will kill the cluster. As an alternative for now, I think it should be possible to create a post-step that disables the cluster after a certain time (= timeout)? I assume that would also go in the direction of canary deployments. Those ECS built-in strategies are really good, but I think they are better suited to using ECS natively than to using something like Spinnaker on top (but this is just my opinion). |
Dependent on Task Sets being implemented: #4951 |
We also still have some EC2-based deployments and only saw this option now - thanks @lifeofguenter. Yes, something similar would be needed here... Unfortunately we cannot do canaries for most of our applications... We previously tried a post-step in a pipeline as a separate stage, but we could not trust its logic completely. There's actually also a bigger problem which arises from Spinnaker's way of doing deployments (= creating a new service in ECS on each deployment): https://spinnaker.io/setup/install/providers/aws/aws-ecs/#optional-service-auto-scaling - this, to my understanding, cannot be "codified" right now... Please correct me if I'm wrong :) Part of me agrees with you @lifeofguenter on this - but we also discussed this internally - and one good point was that as AWS is constantly adding enhancements to ECS, it'd be a shame to reinvent the wheel in Spinnaker. Also. |
@iniinikoski @lifeofguenter completely agree. The fact that Spinnaker creates a new service on every deployment (and doesn't function the way that AWS does deployments to ECS) prevents us from using a lot of the built-in support for use cases like updating a service (such as circuit breakers). Updating deployment logic in Spinnaker to use Task Sets would definitely help to improve that experience. There's a roadmap item to do just that, but it requires a bit of investigation, as previously Task Sets did not support everything that was available when creating a new service. |
@piradeepk thanks for your update. Completely understand this is definitely not something trivial :) But, I also understand that'd break the "Spinnaker deployment pattern"... Would it be possible to support both modes at the end (without crazy complexity) - meaning, the "Spinnaker mode" and... "native ECS"...? |
We see these types of issues a lot, especially in our lower environments, where automation or developers keep deploying the same image over and over in the hope of a successful outcome
We've learned one thing here @piradeepk & @lifeofguenter: setting the Circuit Breaker to "Enabled no rollback" would already stop ECS from continuously trying to redeploy a failing service, and instead let it stay down (without exploding costs or anything) (see the screenshots). According to the linked ticket about Task Sets, implementing that would not help anyone with the Circuit Breaker feature. How do you feel about this? |
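For readers unfamiliar with the setting, here is a minimal sketch of what "Enabled no rollback" corresponds to at the ECS API level, assuming the service is created through boto3's `create_service`. The helper name `build_create_service_request` and the cluster, service, and task definition values are hypothetical placeholders; only the `deploymentConfiguration.deploymentCircuitBreaker` shape comes from the ECS API. It is not a description of how clouddriver builds its request.

```python
def build_create_service_request(cluster, service_name, task_definition,
                                 desired_count, rollback=False):
    """Build keyword arguments for boto3's ecs.create_service.

    The circuit breaker is always enabled; rollback=False matches the
    "Enabled no rollback" setting discussed above.
    """
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        "deploymentConfiguration": {
            "deploymentCircuitBreaker": {
                "enable": True,
                "rollback": rollback,
            },
        },
    }


# Hypothetical values, for illustration only.
request = build_create_service_request(
    cluster="example-cluster",
    service_name="myapp-v002",
    task_definition="myapp:2",
    desired_count=2,
)
print(request["deploymentConfiguration"])

# With AWS credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ecs").create_service(**request)
```

Building the request as a plain dictionary keeps the sketch runnable without AWS access; the actual call is left as a comment.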
@allisaurus Additionally, if the CB is enabled and has been triggered, it would be cool if this could somehow be read/fetched by clouddriver, so the red/black strategy can react to the triggered CB and fail the stage immediately instead of timing out like it does currently. |
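A rough sketch of how a triggered circuit breaker could be detected, assuming the caller already has a `describe_services` response from the ECS API. For rolling-update services, ECS reports a `rolloutState` per deployment, and a deployment that fails to reach steady state with the circuit breaker on moves to `FAILED`. The function name `failed_rollouts` and the sample response (including its reason text) are made up for illustration; this is not clouddriver code.

```python
def failed_rollouts(describe_services_response):
    """Return (service name, reason) pairs for deployments marked FAILED.

    Expects the dict shape returned by boto3's ecs.describe_services.
    """
    failed = []
    for service in describe_services_response.get("services", []):
        for deployment in service.get("deployments", []):
            if deployment.get("rolloutState") == "FAILED":
                failed.append((
                    service.get("serviceName"),
                    deployment.get("rolloutStateReason", ""),
                ))
    return failed


# Made-up sample response, trimmed to the fields the function reads.
sample_response = {
    "services": [
        {
            "serviceName": "myapp-v001",
            "deployments": [{"status": "PRIMARY", "rolloutState": "COMPLETED"}],
        },
        {
            "serviceName": "myapp-v002",
            "deployments": [
                {
                    "status": "PRIMARY",
                    "rolloutState": "FAILED",
                    "rolloutStateReason": "illustrative reason text",
                }
            ],
        },
    ]
}

for name, reason in failed_rollouts(sample_response):
    print(f"{name}: deployment failed ({reason})")
```

A stage that polls for this state could fail as soon as the list is non-empty rather than waiting for its own timeout.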
Having the circuit breaker enabled on deployments would be amazing; it would clear up a lot of confusion around failed/stuck ECS deploys |
Hi @jgrumboe - I'm no longer an active Spinnaker contributor, but CC'ing @paragbhingre & @uttarasridhar for awareness. |
I've created the above linked PRs to support enabling ECS Deployment Circuit Breaker for the server group (=ECS service). |
Issue Summary:
Right now when Spinnaker deploys ECS services, a new service is always created. If the deployed Task on the new service continues to fail on startup, the deployment stage in Spinnaker just times out. In AWS, ECS continues to try to get the tasks running "forever".
Just recently, AWS introduced this new feature in Preview: https://aws.amazon.com/blogs/containers/announcing-amazon-ecs-deployment-circuit-breaker/ to mitigate the problem described above - but - unfortunately, as Spinnaker spins up a service for every new deployment, this feature cannot be used.
We have tried to solve the issue at the Spinnaker stage level (by shrinking a new, failing service), but this did not look very bulletproof.
Unfortunately this affects every customer using Spinnaker and ECS - meaning, there is a definite risk of higher costs because of the container registry traffic (which can be quite enormous at times).
See this issue for some details: #4951 (comment)
Cloud Provider(s):
AWS / ECS
Environment:
All versions of Spinnaker
Feature Area:
Deployments
Description:
See above.