8000 fix: make GetWorkspacesEligibleForTransition return less false-positives by DanielleMaywood · Pull Request #15429 · coder/coder · GitHub
[go: up one dir, main page]

Skip to content

fix: make GetWorkspacesEligibleForTransition return less false-positives #15429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 13, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
chore: comment query
  • Loading branch information
DanielleMaywood committed Nov 8, 2024
commit 18e9550d24f62f73ed9ea620e1c633ca61376a2b
32 changes: 26 additions & 6 deletions coderd/database/queries.sql.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

32 changes: 26 additions & 6 deletions coderd/database/queries/workspaces.sql
Original file line number Diff line number Diff line change
Expand Up @@ -580,7 +580,11 @@ WHERE
) AND

(
-- isEligibleForAutostop
-- A workspace may be eligible for autostop if the following are true:
-- * The provisioner job has not failed.
-- * The workspace is not dormant.
-- * The workspace build was a start transition.
-- * The workspace's owner is suspended OR the workspace build deadline has passed.
(
provisioner_jobs.job_status != 'failed'::provisioner_job_status AND
workspaces.dormant_at IS NULL AND
Expand All @@ -592,22 +596,33 @@ WHERE
)
) OR

-- isEligibleForAutostart
-- A workspace may be eligible for autostart if the following are true:
-- * The workspace's owner is active.
-- * The provisioner job did not failed.
-- * The workspace build was a stop transition.
-- * The workspace has an autostart schedule.
(
users.status = 'active'::user_status AND
provisioner_jobs.job_status != 'failed'::provisioner_job_status AND
workspace_builds.transition = 'stop'::workspace_transition AND
workspaces.autostart_schedule IS NOT NULL
) OR

-- isEligibleForDormantStop
-- A workspace may be eligible for dormant stop if the following are true:
-- * The workspace is not dormant.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who/when marks a workspace as dormant? Is there a chance the workspace can be marked dormant and we never trigger either autostop or dormant stop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Autobuild executor does it here:

// isEligibleForDormantStop returns true if the workspace should be dormant
// for breaching the inactivity threshold of the template.
func isEligibleForDormantStop(ws database.Workspace, templateSchedule schedule.TemplateScheduleOptions, currentTick time.Time) bool {
// Only attempt against workspaces not already dormant.
return !ws.DormantAt.Valid &&
// The template must specify an time_til_dormant value.
templateSchedule.TimeTilDormant > 0 &&
// The workspace must breach the time_til_dormant value.
currentTick.Sub(ws.LastUsedAt) > templateSchedule.TimeTilDormant
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that plus this seems relevant:

case isEligibleForDormantStop(ws, templateSchedule, currentTick):
// Only stop started workspaces.
if latestBuild.Transition == database.WorkspaceTransitionStart {
return database.WorkspaceTransitionStop, database.BuildReasonDormancy, nil
}
// We shouldn't transition the workspace but we sho 8000 uld still
// make it dormant.
return "", database.BuildReasonDormancy, nil

As does this:

// Transition the workspace to dormant if it has breached the template's
// threshold for inactivity.
if reason == database.BuildReasonDormancy {
wsOld := ws
wsNew, err := tx.UpdateWorkspaceDormantDeletingAt(e.ctx, database.UpdateWorkspaceDormantDeletingAtParams{
ID: ws.ID,
DormantAt: sql.NullTime{
Time: dbtime.Now(),
Valid: true,
},
})

So one thing that can happen at least in the code-version is that a workspace has remained in a failed stop state for a long time. Then it becomes eligible for dormant stop but since it wasn't a start, we don't try to stop it but mark it as dormant anyway.

Then we'll have a workspace marked dormant that we never attempted to stop.

-- * The template has set a time til dormant.
-- * The workspace has been unused for longer than the time til dormancy.
(
workspaces.dormant_at IS NULL AND
templates.time_til_dormant > 0 AND
(@now :: timestamptz) - workspaces.last_used_at > (INTERVAL '1 millisecond' * (templates.time_til_dormant / 1000000))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is probably already in use elsewhere but not going to lie, it's weird seeing nanos being converted into millis. I know ms is the smallest unit in pg but feels like this isn't super intuitive for humans to understand what's going on here.

Obviously storing ns in the db is the main problem (and we're not fixing that here). But I'd prefer to see the conversion into seconds instead of ms as that feels like a more intuitive unit.

Even the db column for time_til_dormant doesn't have a comment so this is pretty much hidden magic 😄.

Not a blocker, but we should probably do something about this at some point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every day I regret my previous decision 😓

) OR

-- isEligibleForDelete
-- A workspace may be eligible for deletion if the following are true:
-- * The workspace is dormant.
-- * The workspace is scheduled to be deleted.
-- * If there was a prior attempt to delete the workspace that failed:
-- * This attempt was at least 24 hours ago>
(
workspaces.dormant_at IS NOT NULL AND
workspaces.deleting_at IS NOT NULL AND
Expand All @@ -632,11 +647,16 @@ WHERE
END
) OR

-- isEligibleForFailedStop
-- A workspace may be eligible for failed stop if the following are true:
-- * The template has a failure ttl set.
-- * The workspace build was a start transition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we failed to autostop previously, then this limitation will prevent this case from triggering and I think we'll never retry it because of the other cases. Is that as intended?

Copy link
Member
@johnstcn johnstcn Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the same logic that's done later on? We're not changing any logic here, just front-loading it.

// isEligibleForFailedStop returns true if the workspace is eligible to be stopped
// due to a failed build.
func isEligibleForFailedStop(build database.WorkspaceBuild, job database.ProvisionerJob, templateSchedule schedule.TemplateScheduleOptions, currentTick time.Time) bool {
// If the template has specified a failure TLL.
return templateSchedule.FailureTTL > 0 &&
// And the job resulted in failure.
job.JobStatus == database.ProvisionerJobStatusFailed &&
build.Transition == database.WorkspaceTransitionStart &&
// And sufficient time has elapsed since the job has completed.
job.CompletedAt.Valid &&
currentTick.Sub(job.CompletedAt.Time) > templateSchedule.FailureTTL
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look at the original logic, but then I suppose I'm proposing that the original logic doesn't take this into account?

Here, unless the failed build is "start", we'll never try to stop it. The failed build could just as well be a cancelled (or failed) "stop" which has left resources behind.

Copy link
Member
@johnstcn johnstcn Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch. We might want to adjust this behaviour in a separate PR.
EDIT: filed #15477

-- * The provisioner job failed.
-- * The provisioner job had completed.
-- * The provisioner job has been completed for longer than the failure ttl.
(
templates.failure_ttl > 0 AND
provisioner_jobs.job_status = 'failed'::provisioner_job_status AND
workspace_builds.transition = 'start'::workspace_transition AND
provisioner_jobs.job_status = 'failed'::provisioner_job_status AND
provisioner_jobs.completed_at IS NOT NULL AND
(@now :: timestamptz) - provisioner_jobs.completed_at > (INTERVAL '1 millisecond' * (templates.failure_ttl / 1000000))
)
Expand Down
Loading
0