Commit graph

20 commits

Author SHA1 Message Date
hashicorp-copywrite[bot] 005636afa0 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Mahmood Ali 00be4fc63c
tests: deflake TestTaskRunner_StatsHook_Periodic (#9734)
This PR deflakes TestTaskRunner_StatsHook_Periodic tests and adds backoff when the driver closes the channel.

TestTaskRunner_StatsHook_Periodic is currently the most flaky test - failing ~4% of the time (20 out of 486 workflows). A sample failure: https://app.circleci.com/pipelines/github/hashicorp/nomad/14028/workflows/957b674f-cbcc-4228-96d9-1094fdee5b9c/jobs/128563 .

This change has two components:

First, it updates the StatsHook so that it backs off when stats channel is closed. In the context of the test where the mock driver emits a single stats update and closes the channel, the test may make tens of thousands update during the period. In real context, if a driver doesn't implement the stats handler properly or when a task finishes, we may generate way too many Stats queries in a tight loop. Here, the backoff reduces these queries. I've added a failing test that shows 154,458 stats updates within 500ms in https://app.circleci.com/pipelines/github/hashicorp/nomad/14092/workflows/50672445-392d-4661-b19e-e3561ed32746/jobs/129423 .

Second, the test ignores the first stats update after a task exit. Due to the asynchronicity of updates and channel/context use, it's possible that an update is enqueued while the test marks the task as exited, resulting into a spurious update.
2021-01-06 16:03:00 -05:00
Kris Hicks 0cf9cae656
Apply some suggested fixes from staticcheck (#9598) 2020-12-10 07:29:18 -08:00
Tim Gross 065fa7af8b
stats_hook: log normal shutdown condition as debug, not error (#8028)
The `stats_hook` writes an Error log every time an allocation becomes
terminal. This is a normal condition, not an error. A real error
condition like a failure to collect the stats is logged later. It just
creates log noise, and this is a particularly bad operator experience
for heavy batch workloads.
2020-05-20 10:28:30 -04:00
Mahmood Ali 8a82260319 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali 1a299c7b28 client/taskrunner: fix stats stats retry logic
Previously, if a channel is closed, we retry the Stats call.  But, if that call
fails, we go in a backoff loop without calling Stats ever again.

Here, we use a utility function for calling driverHandle.Stats call that retries
as one expects.

I aimed to preserve the logging formats but made small improvements as I saw fit.
2019-07-11 13:58:07 +08:00
Nick Ethier bcc3935228
tr: use context in as select statement 2019-01-22 20:11:39 -05:00
Michael Schurter 1ca858fa92
Update client/allocrunner/taskrunner/stats_hook.go
Co-Authored-By: nickethier <ncethier@gmail.com>
2019-01-14 12:31:27 -05:00
Nick Ethier fbd403df96
tr: stop stats collection on Exited hook 2019-01-14 12:30:14 -05:00
Nick Ethier 597b7b751d
tr: add retry /w backoff to stats_hook failure 2019-01-12 12:18:24 -05:00
Nick Ethier 7e306afde3
executor: fix failing stats related test 2019-01-12 12:18:23 -05:00
Nick Ethier 9fea54e0dc
executor: implement streaming stats API
plugins/driver: update driver interface to support streaming stats

client/tr: use streaming stats api

TODO:
 * how to handle errors and closed channel during stats streaming
 * prevent tight loop if Stats(ctx) returns an error

drivers: update drivers TaskStats RPC to handle streaming results

executor: better error handling in stats rpc

docker: better control and error handling of stats rpc

driver: allow stats to return a recoverable error
2019-01-12 12:18:22 -05:00
Alex Dadgar 0106f23aaa Review comments 2019-01-07 14:50:28 -08:00
Alex Dadgar 3f24e4d6ca comments 2019-01-07 14:49:40 -08:00
Alex Dadgar 44dca19012 Fix hooks 2019-01-07 14:49:40 -08:00
Alex Dadgar c9825a9c36 recover 2019-01-07 14:49:40 -08:00
Danielle Tomlinson f3a77b8084 client: Merge driver/shared/structs and client/structs 2018-11-30 10:56:45 +01:00
Danielle Tomlinson 04c8851b4c client: Migrate DriverStats optout to drivers/shared/structs 2018-11-30 10:46:13 +01:00
Michael Schurter 5bd744ac3d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
2018-11-19 16:39:30 -08:00
Alex Dadgar 45e41cca03 allocrunnerv2 -> allocrunner 2018-10-16 16:56:56 -07:00
Renamed from client/allocrunnerv2/taskrunner/stats_hook.go (Browse further)