-
Notifications
You must be signed in to change notification settings - Fork 937
refactor: replace startup script logs EOF with starting/ready time #8082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: replace startup script logs EOF with starting/ready time #8082
Conversation
// Only the latest state is reported, intermediate states may be | ||
// lost if the agent can't communicate with the API. | ||
// reportLifecycleLoop reports the current lifecycle state once. All state | ||
// changes are reported in order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Originally, this was a weird/performance choice to only submit the latest status which simply resulted in needing to document the behavior increasing the complexity. We now change it so that the agent always reports all states (still non-blocking), we achieve this by adding the timestamp for the event in the payload.
a.lifecycleMu.Unlock() | ||
return | ||
} | ||
a.lifecycleState = state | ||
a.logger.Debug(ctx, "set lifecycle state", slog.F("state", state), slog.F("last", lastState)) | ||
a.lifecycleStates = append(a.lifecycleStates, report) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Guaranteed to not grow unboundedly (len(a.lifecycleStates) <= enum entries
).
} else { | ||
logger.Info(ctx, "script completed", slog.F("execution_time", execTime), slog.F("exit_code", exitCode)) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Small cleanup to unify logging between startup and shutdown scripts.
d8a51d1
to
cedf8a4
Compare
return err | ||
} | ||
|
||
workspace, err := q.db.GetWorkspaceByAgentID(ctx, agent.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Perhaps I shouldn't have made this simplification? I see mixed use in dbauthz not sure why we were doing both.
488b0ae
to
1334344
Compare
This commit reverts some of the changes in #8029 and implements an alternative method of keeping track of when the startup script has ended and there will be no more logs. This is achieved by adding new agent fields for tracking when the agent enters the "starting" and "ready"/"start_error" lifecycle states. The timestamps simplify logic since we don't need understand if the current state is before or after the state we're interested in. They can also be used to show data like how long the startup script took to execute. This also allowed us to remove the EOF field from the logs as the implementation was problematic when we returned the EOF log entry in the response since requesting _after_ that ID would give no logs and the API would thus lose track of EOF.
1334344
to
73f19d6
Compare
coderd/database/migrations/000128_drop_startup_logs_eof_and_add_completion.down.sql
Outdated
Show resolved
Hide resolved
@@ -461,15 +470,20 @@ func (a *agent) reportLifecycleLoop(ctx context.Context) { | |||
// setLifecycle sets the lifecycle state and notifies the lifecycle loop. | |||
// The state is only updated if it's a valid state transition. | |||
func (a *agent) setLifecycle(ctx context.Context, state codersdk.WorkspaceAgentLifecycle) { | |||
report := agentsdk.PostLifecycleRequest{ | |||
State: state, | |||
ChangedAt: database.Now(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
database.Now()
or time.Now()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentionally used database.Now()
here for consistency since it's a value that will be stored in the DB. Startup logs also use database.Now()
but we do seem to have some mixed use in the agent and some other places may be wrong?
Thoughts @kylecarbs? For all intents and purposes, this shouldn't matter since the DB fields are timestamptz
, what's the motivation for using database.Now()
that always uses UTC? Logging purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the motivation for using database.Now() that always uses UTC? Logging purposes?
- It might be a good candidate for the linter rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks for addressing all comments. If CI is happy to merge it, I'm cool with that too! You can work on improvements in follow-ups.
This change breaks envbuilder, logstream-kube, and anything that writes startup logs to the UI after the startup script might be complete. We should rename startup logs to something more generalized like agent logs to remove this confusion in the future. The way it worked before (I believe) was that the EOF was sent incorrectly, but we really shouldn't have been sending an EOF at all - there is no EOF, because infrastructure logs are never actually complete. A pod can always restart, or something can trigger a reboot of the agent. |
The only break is the EOF btw, which is like 6 lines of code ;p I'm just removing that for now after talking with Mathias, and we'll work on a better long-term fix soon. |
PR here: #8528 This just reverts the EOF, which should bring back the prior behavior. |
This commit reverts some of the changes in #8029 and implements an
alternative method of keeping track of when the startup script has ended
and there will be no more logs.
This is achieved by adding new agent fields for tracking when the agent
enters the "starting" and "ready"/"start_error" lifecycle states. The
timestamps simplify logic since we don't need understand if the current
state is before or after the state we're interested in. They can also be
used to show data like how long the startup script took to execute. This
also allowed us to remove the EOF field from the logs as the
implementation was problematic when we returned the EOF log entry in the
response since requesting after that ID would give no logs and the API
would thus lose track of EOF.