Commit graph

17 commits

Author SHA1 Message Date
Mahmood Ali 46bc3b57e6 address review comments 2019-12-13 11:21:00 -05:00
Mahmood Ali 0b7085ba3a driver: allow disabling log collection
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs.  Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.

Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.

The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.

This is meant as a stop gap measure.  Ideally, we'd follow up with at
least two changes:

First, we should optimize behavior when we can such that operators don't
need to disable docker log collection.  Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.

Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand.  This ensures that the cost of log collection
is paid sparingly.
2019-12-08 14:15:03 -05:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Danielle Lancashire c31966fc71 loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Nick Ethier dc18b8928a
logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
Nick Ethier ac7fbee1b8
logmon:add static check for logmon exited hook 2019-03-18 15:59:43 -04:00
Nick Ethier 7dc3d83634
client/logmon: restart log collection correctly when a task is restarted 2019-03-15 23:59:18 -04:00
Michael Schurter 64e145ebdb logmon: drop reattach log level as its expected
Logged once per terminal task on agent restart.
2019-03-04 13:26:01 -08:00
Michael Schurter ef8d284352 client: ensure task is cleaned up when terminal
This commit is a significant change. TR.Run is now always executed, even
for terminal allocations. This was changed to allow TR.Run to cleanup
(run stop hooks) if a handle was recovered.

This is intended to handle the case of Nomad receiving a
DesiredStatus=Stop allocation update, persisting it, but crashing before
stopping AR/TR.

The commit also renames task runner hook data as it was very easy to
accidently set state on Requests instead of Responses using the old
field names.
2019-03-01 14:00:23 -08:00
Michael Schurter 3b2a592e93 client: restart task on logmon failures
This code chooses to be conservative as opposed to optimal: when failing
to reattach to logmon simply return a recoverable error instead of
immediately trying to restart logmon.

The recoverable error will cause the task's restart policy to be
applied and a new logmon will be launched upon restart.

Trying to do the optimal approach of simply starting a new logmon
requires error string comparison and should be tested against a task
actively logging to assert the behavior (are writes blocked? dropped?).
2019-02-25 15:42:45 -08:00
Michael Schurter a2e3ea6dc9 logmon: fix reattach configuration
There were multiple bugs here:

1. Reattach unmarshalling always returned an error because you can't
   unmarshal into a nil pointer.
2. The hook data wasn't being saved because it was put on the request
   struct, not the response struct.
3. The plugin configuration should only have reattach *or* a command
   set. Not both.
4. Setting Done=true meant the hook was never re-run on agent restart so
   reattaching was never attempted.
2019-02-21 15:32:18 -08:00
Alex Dadgar 72a5691897 Driver tests do not use hcl2/hcl, hclspec, or hclutils 2019-01-22 15:43:34 -08:00
Danielle Tomlinson 99da4c780d logmon: Reattach to existing loggers
This commit prevents us from creating duplicate logmon hooks when
restoring allocations by persisting the logmon reattach config using
HookData.
2019-01-16 14:56:10 +01:00
Alex Dadgar 45e41cca03 allocrunnerv2 -> allocrunner 2018-10-16 16:56:56 -07:00
Renamed from client/allocrunnerv2/taskrunner/logmon_hook.go (Browse further)