[PREEMPTIVE] Removal of `ephemeral` variants on `scale-config.yml` · Issue #153468 · pytorch/pytorch · GitHub

@jeanschmidt

Description

This is a preemptive ci:sev. We recently finished the experimentation with ephemeral variants and we're migrating all runners to be ephemeral by default.

The experiment has been disabled for more than 24 hours, so we believe no new jobs should be requesting runners on the ephemeral variant at this point.

For people re-running old jobs, the job might stay queued forever and never run. If your job requests a runner label that STARTS with ephemeral. or lf.ephemeral., you might see it queue indefinitely, since those variants have been removed.

To fix this, re-run the whole workflow instead of the single job.
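
If you want to check quickly whether a job's runner labels are affected, the prefix rule above can be sketched as follows. This is only an illustration: the helper names and the sample labels are hypothetical, and the only things taken from this notice are the two prefixes.

```python
# Prefixes named in this notice; a label that STARTS with either one is affected.
REMOVED_PREFIXES = ("ephemeral.", "lf.ephemeral.")


def is_removed_variant(label: str) -> bool:
    """Return True if the runner label starts with a removed ephemeral prefix."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return label.startswith(REMOVED_PREFIXES)


def affected_labels(labels: list[str]) -> list[str]:
    """Keep only the labels that may queue indefinitely after the removal."""
    return [label for label in labels if is_removed_variant(label)]


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    sample = [
        "ephemeral.linux.2xlarge",
        "lf.ephemeral.linux.2xlarge",
        "linux.2xlarge",
        "lf.linux.2xlarge",
    ]
    for label in affected_labels(sample):
        print(f"{label}: re-run the whole workflow, not the single job")
```

The check is a prefix match on purpose: a label that merely contains "ephemeral." somewhere in the middle is not covered by this notice.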

Current Status

  • PREEMPTIVE

Error looks like

  • A job whose runner label starts with ephemeral. or lf.ephemeral. stays queued and never actually starts

Incident timeline (all times pacific)

  • 7:35 PT - Merged PR removing variants

User impact

  • queued jobs

Root cause

NA

Mitigation

NA

Prevention/followups

NA
