This code chooses to be conservative as opposed to optimal: when failing
to reattach to logmon, simply return a recoverable error instead of
immediately trying to restart logmon.
The recoverable error will cause the task's restart policy to be
applied and a new logmon will be launched upon restart.
The optimal approach of simply starting a new logmon would require error
string comparison, and it should be tested against a task that is
actively logging to verify the behavior (are writes blocked or dropped?).
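For illustration, a minimal self-contained sketch of that conservative
path (the type and function names are stand-ins, not Nomad's actual
logmon hook API):

```go
package logmonhook

// RecoverableError is a stand-in for Nomad's recoverable error wrapper,
// used here only to illustrate the approach.
type RecoverableError struct {
	Err         error
	Recoverable bool
}

func (r *RecoverableError) Error() string { return r.Err.Error() }

// reattachOrFail models the conservative choice: if reattaching to the
// existing logmon fails, do not start a new logmon in place; return a
// recoverable error so the task's restart policy runs and a fresh
// logmon is launched on restart.
func reattachOrFail(reattach func() error) error {
	if err := reattach(); err != nil {
		return &RecoverableError{Err: err, Recoverable: true}
	}
	return nil
}
```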
This commit causes the docker driver to report an undetected health
status until it first establishes a connection to the docker daemon.
This fixes a bug where hosts without docker installed would be reported
as unhealthy rather than undetected.
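A rough, self-contained sketch of the intended behavior (the health
constants here are stand-ins for the driver plugin's own, and
`everConnected` is an assumed flag tracking whether the daemon was ever
reachable):

```go
package docker

// HealthState is a stand-in for the driver fingerprint health values.
type HealthState string

const (
	HealthUndetected HealthState = "undetected"
	HealthUnhealthy  HealthState = "unhealthy"
	HealthHealthy    HealthState = "healthy"
)

// fingerprintHealth reports "undetected" until a docker daemon
// connection has been established at least once; only after that do
// connection failures count as "unhealthy".
func fingerprintHealth(everConnected bool, connectErr error) HealthState {
	switch {
	case connectErr == nil:
		return HealthHealthy
	case everConnected:
		return HealthUnhealthy
	default:
		// Docker has never been reachable on this host (likely not
		// installed), so report undetected rather than unhealthy.
		return HealthUndetected
	}
}
```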
There were multiple bugs here:
1. Reattach unmarshalling always returned an error because you can't
unmarshal into a nil pointer (see the sketch after this list).
2. The hook data wasn't being saved because it was put on the request
struct, not the response struct.
3. The plugin configuration should only have reattach *or* a command
set. Not both.
4. Setting Done=true meant the hook was never re-run on agent restart so
reattaching was never attempted.
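To make bug 1 concrete, a small standalone example of the nil-pointer
unmarshal failure and its fix (`ReattachConfig` here is a simplified
stand-in, not the real plugin type):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReattachConfig is a simplified stand-in for the plugin reattach config.
type ReattachConfig struct {
	Pid  int
	Addr string
}

func main() {
	raw := []byte(`{"Pid": 42, "Addr": "/tmp/plugin.sock"}`)

	// Bug 1: unmarshalling into a nil pointer always fails.
	var nilCfg *ReattachConfig
	fmt.Println(json.Unmarshal(raw, nilCfg))
	// json: Unmarshal(nil *main.ReattachConfig)

	// Fix: allocate the value before unmarshalling.
	cfg := &ReattachConfig{}
	fmt.Println(json.Unmarshal(raw, cfg), *cfg)
	// <nil> {42 /tmp/plugin.sock}
}
```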
As far as I can tell this is the most straightforward and resilient way
to skip error logging on context cancellation with grpc streams. You
cannot compare the error against context.Canceled directly as it is of
type `*status.statusError`. The next best solution I found was:
```go
resp, err := stream.Recv()
// codes is "google.golang.org/grpc/codes"
if coded, ok := err.(interface{ Code() codes.Code }); ok {
	if coded.Code() == codes.Canceled {
		return
	}
}
```
However, I think checking ctx.Err() directly makes the code much easier
to read and is resilient against grpc API changes.
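For comparison, a minimal sketch of that ctx-based check (the stream,
logger, and handle names are placeholders from the surrounding receive
loop, not exact code):

```go
resp, err := stream.Recv()
if err != nil {
	// The context was canceled deliberately (e.g. plugin shutdown),
	// so skip error logging entirely.
	if ctx.Err() == context.Canceled {
		return
	}
	logger.Error("error receiving stream message", "error", err)
	return
}
handle(resp)
```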
Currently, if a docker_logger cannot be reattached to, we will leak the
container that was being used. This is problematic when, e.g., using
static ports, since it means you can never recover your task, or when a
service is expensive to run and will then be running without supervision.
Track the download status of each artifact independently so that if only
one of many artifacts fails to download, completed artifacts aren't
downloaded again.
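A self-contained sketch of the idea (the string keys and fetch function
are placeholders; Nomad's hook persists this state differently):

```go
// downloadAll fetches each artifact that is not already marked done,
// recording per-artifact completion so a retry after a partial failure
// skips artifacts that already succeeded.
func downloadAll(artifacts []string, done map[string]bool, fetch func(string) error) error {
	for _, a := range artifacts {
		if done[a] {
			continue // completed on a previous attempt
		}
		if err := fetch(a); err != nil {
			return err // a retry only re-fetches the remaining artifacts
		}
		done[a] = true
	}
	return nil
}
```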
Sometimes the Nomad docker_logger may be killed by a service manager
when restarting the client for upgrades or reliability reasons.
Currently, if this happens, we leak the user's container and try to
reschedule over it.
This commit adds a new step to the recovery process that spawns a new
docker logger process which fetches logs from _the current timestamp_.
This avoids restarting users' tasks just because our logging sidecar
has failed.
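A rough sketch of that recovery step, assuming placeholder reattach and
start functions rather than the real docker logger API:

```go
package docklog

import "time"

// recoverLogger illustrates the new recovery step: if the old
// docker_logger cannot be reattached to, spawn a replacement that
// streams logs starting from the current timestamp, leaving the user's
// task running untouched.
func recoverLogger(reattach func() error, start func(since time.Time) error) error {
	if err := reattach(); err == nil {
		return nil
	}
	// Reattach failed (e.g. the service manager killed the old logger);
	// start a new one from "now" instead of failing or restarting the task.
	return start(time.Now())
}
```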
0.9.0beta2 contains a regression where artifact download errors would
not cause a task restart but would instead immediately fail the task.
This restores the pre-0.9 behavior of retrying all artifact errors and
adds missing tests.