Feature (ECS): add support for managing failing deployments #6225

iniinikoski · 2020-12-10T15:04:45Z

Issue Summary:

Right now when Spinnaker deploys ECS services, a new service is always created. If the deployed Task on the new service continues to fail on startup, the deployment stage in Spinnaker just times out. In AWS, ECS continues to try to get the tasks running "forever".

Just recently, AWS introduced this new feature in Preview: https://aws.amazon.com/blogs/containers/announcing-amazon-ecs-deployment-circuit-breaker/ to mitigate exactly this problem. Unfortunately, as Spinnaker spins up a new service for every deployment, this feature cannot be used.

We have tried to solve the issue at the Spinnaker stage level (by shrinking a new, failing service), but this did not look very bulletproof.

Unfortunately this affects every customer using Spinnaker and ECS: there is a definite risk of higher costs because of the container registry traffic (which can be quite enormous at times).

See this issue for some details: #4951 (comment)

Cloud Provider(s):

AWS / ECS

Environment:

All versions of Spinnaker

Feature Area:

Deployments

Description:

See above.

--

lifeofguenter · 2020-12-11T12:47:21Z

I think this is the default behavior also for EC2. I see an option with EC2 that you can automatically rollback on failures, but I have never tried that out. Maybe something similar can be done with ECS where a deploy that fails will kill the cluster.

As an alternative for now, I think it should be possible to create a post-step that disables the cluster after a certain time (= timeout)? I assume that would also go in the direction of canary deployments.

Those ECS-built-in-strategies are really good, but I think they are better purposed when using ECS natively vs. using something like Spinnaker on top (but this is just my opinion).

piradeepk · 2020-12-11T14:25:34Z

Dependent on Task Sets being implemented: #4951

iniinikoski · 2020-12-11T14:28:09Z

We also still have some EC2-based deployments and only saw this option now - thanks @lifeofguenter. Yes, something similar would be needed here... Unfortunately we cannot do canaries for most of our applications... We tried a post-step before, as a separate stage in a pipeline, but we could not trust its logic completely.

There's actually also a bigger problem which arises from Spinnaker's way of doing deployments (= creating a new service in ECS on each deployment): https://spinnaker.io/setup/install/providers/aws/aws-ecs/#optional-service-auto-scaling - this, to my understanding, cannot be "codified" right now... Please correct me if I'm wrong :)

Part of me agrees with you @lifeofguenter on this - but we also discussed this internally - and one good point was that as AWS is constantly adding enhancements to ECS, it'd be a shame to reinvent the wheel in Spinnaker. Also.

piradeepk · 2020-12-11T14:34:05Z

@iniinikoski @lifeofguenter completely agree. The fact that Spinnaker creates a new service on every deployment (and doesn't function the way that AWS does deployments to ECS) prevents us from using a lot of the built-in support for use cases like updating a service (such as circuit breakers).

Updating deployment logic in Spinnaker to use Task Sets would definitely help to improve that experience. There's a roadmap item to do just that, but it requires a bit of investigating as previously Task Sets did not support everything that was available when creating a new service.

iniinikoski · 2020-12-11T16:57:02Z

@piradeepk thanks for your update. Completely understand this is definitely not something trivial :) But I also understand that it would break the "Spinnaker deployment pattern"... Would it be possible to support both modes in the end (without crazy complexity) - meaning, the "Spinnaker mode" and... "native ECS"...?

HaroonSaid · 2021-04-27T12:53:27Z

We see this type of issue a lot, especially in our lower environments, where automation or developers keep deploying the same image over and over in the hope of a successful outcome, eventually exhausting all available resources and blocking other deployments.

iniinikoski · 2021-12-16T16:05:23Z

> @iniinikoski @lifeofguenter completely agree. The fact that Spinnaker creates a new service on every deployment (and doesn't function the way that AWS does deployments to ECS) prevents us from using a lot of the built-in support for use cases like updating a service (such as circuit breakers).
>
> Updating deployment logic in Spinnaker to use Task Sets would definitely help to improve that experience. There's a roadmap item to do just that, but it requires a bit of investigating as previously Task Sets did not support everything that was available when creating a new service.

We've learned one thing here @piradeepk & @lifeofguenter: setting the Circuit Breaker to "Enabled no rollback" would already stop ECS from continuously trying to redeploy a failing service, and instead let it stay down (without exploding costs or anything) (see the screenshots).
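For readers who want to try this outside the console: the "enabled, no rollback" setting maps to the `deploymentCircuitBreaker` block inside the ECS `deploymentConfiguration`. Below is a minimal Python sketch of the arguments one might pass to boto3's `create_service`. The helper name and the cluster, service, and task definition values are hypothetical placeholders, and this is not how Spinnaker's clouddriver builds the request.

```python
def circuit_breaker_service_args(cluster, service_name, task_definition,
                                 desired_count, rollback=False):
    """Build keyword arguments for an ECS CreateService call with the
    deployment circuit breaker enabled.

    Hypothetical helper for illustration only. rollback=False corresponds
    to the "enabled, no rollback" setting discussed in this thread.
    """
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        "deploymentConfiguration": {
            "deploymentCircuitBreaker": {
                "enable": True,
                "rollback": rollback,
            }
        },
    }


if __name__ == "__main__":
    # Placeholder names; substitute your own cluster/service/task definition.
    args = circuit_breaker_service_args(
        cluster="example-cluster",
        service_name="example-app-v001",
        task_definition="example-app:1",
        desired_count=2,
    )
    # With AWS credentials configured, these could be passed on as:
    #   import boto3
    #   boto3.client("ecs").create_service(**args)
    print(args["deploymentConfiguration"])
```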

According to the linked ticket about Task Sets - implementing this would not help anyone with the Circuit Breaker feature.

How do you feel about this?

jgrumboe · 2021-12-17T08:08:07Z

@allisaurus
This would be a nice (intermediate) solution to mitigate ever-failing ECS deployments. For the problem described in this issue, it would be good enough if we could select "Enable ECS CB with no rollback" as a UI option for the ECS Deployment Cluster in Spinnaker. With container health checks defined in the task definition and the CB (with no rollback) activated on each new server group/ECS service, there's a good chance of containing such deployments.

Additionally, if the CB is enabled and was triggered, it would be cool if this could somehow be read/fetched by clouddriver, so the red/black strategy can react to the triggered CB and fail the stage immediately instead of timing out like it does currently.
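As a rough illustration of what "reading the triggered CB" could look like: the ECS `DescribeServices` response carries a `rolloutState` (and `rolloutStateReason`) on each deployment. The Python sketch below assumes a tripped circuit breaker shows up as `rolloutState == "FAILED"` on the `PRIMARY` deployment. The function name and the sample response are made up for illustration; this is not clouddriver's actual logic.

```python
def failed_rollouts(describe_services_response):
    """Return (service name, reason) pairs for services whose PRIMARY
    deployment reports rolloutState FAILED.

    Expects a dict shaped like the ECS DescribeServices response.
    Assumption: a triggered circuit breaker is visible as a FAILED rollout.
    """
    failures = []
    for service in describe_services_response.get("services", []):
        for deployment in service.get("deployments", []):
            if (deployment.get("status") == "PRIMARY"
                    and deployment.get("rolloutState") == "FAILED"):
                failures.append((service.get("serviceName"),
                                 deployment.get("rolloutStateReason", "")))
    return failures


if __name__ == "__main__":
    # Hand-written sample response, not captured from a real cluster.
    sample = {
        "services": [
            {"serviceName": "example-app-v001",
             "deployments": [{"status": "PRIMARY",
                              "rolloutState": "COMPLETED"}]},
            {"serviceName": "example-app-v002",
             "deployments": [{"status": "PRIMARY",
                              "rolloutState": "FAILED",
                              "rolloutStateReason": "sample reason"}]},
        ]
    }
    for name, reason in failed_rollouts(sample):
        print(f"{name}: rollout failed ({reason})")
```

A deploy stage could poll a check like this and fail as soon as it returns a non-empty list, rather than waiting for its own timeout.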

iandelahorne · 2021-12-22T13:09:50Z

Having the circuit breaker enabled on deployments would be amazing; it would clear up a lot of confusion around failed/stuck ECS deploys.

allisaurus · 2021-12-23T17:15:21Z

Hi @jgrumboe - I'm no longer an active Spinnaker contributor, but CC'ing @paragbhingre & @uttarasridhar for awareness.

jgrumboe · 2024-01-07T21:38:13Z

I've created the linked PRs to support enabling the ECS Deployment Circuit Breaker for the server group (= ECS service).
Looking forward to reading your comments and thoughts.

allisaurus added no-lifecycle provider/ecs sig/aws labels Dec 10, 2020

akshayabd added the triaged Triaged in a SIG meeting label Dec 16, 2020

iniinikoski mentioned this issue Dec 16, 2021

ECS Proposal: Use new External Deployment Controller and Task Set concept as mapping for Server Group #4951 (Open)

This was referenced Jan 7, 2024

feat(ecs): Add support for ECS deployment circuit breaker spinnaker/clouddriver#6132 (Merged)

feat(ecs): Add support for ECS deployment circuit breaker spinnaker/deck#10076 (Merged)
