Examples for HTTP based task-group service healthchecks are
covered by the `countdash` demo, but gRPC checks currently
have no runnable examples.
This PR adds a trivial gRPC enabled application that provides
a Service implementing the standard gRPC healthcheck interface.
Running `make dev` runs `hclfmt`, but this isn't checked as part of
CI. That makes it possible to merge un-formatted HCL and Nomad
jobspecs that later will make for dirty git staging areas when
developers pull master.
This changeset adds HCL linting to the `make check` target.
Adds a `CSIVolumeClaim` type to be tracked as current and past claims
on a volume. Allows for a client RPC failure during node or controller
detachment without having to keep the allocation around after the
first garbage collection eval.
This changeset lays groundwork for moving the actual detachment RPCs
into a volume watching loop outside the GC eval.
Failed requests due to API client errors are to be marked as DEBUG.
The Error log level should be reserved to signal problems with the
cluster and are actionable for nomad system operators. Logs due to
misbehaving API clients don't represent a system level problem and seem
spurius to nomad maintainers at best. These log messages can also be
attack vectors for deniel of service attacks by filling servers disk
space with spurious log messages.
Pipe http server log to hclog, so that it uses the same logging format
as rest of nomad logs. Also, supports emitting them as json logs, when
json formatting is set.
The http server logs are emitted as Trace level, as they are typically
repsent HTTP client errors (e.g. failed tls handshakes, invalid headers,
etc).
Though, Panic logs represent server errors and are relayed as Error
level.
Protect against a panic when we attempt to start a container with a name
that conflicts with an existing one. If the existing one is being
deleted while nomad first attempts to create the container, the
createContainer will fail with `container already exists`, but we get
nil container reference from the `containerByName` lookup, and cause a
crash.
I'm not certain how we get into the state, except for being very
unlucky. I suspect that this case may be the result of a concurrent
restart or the docker engine API not being fully consistent (e.g. an
earlier call purged the container, but docker didn't free up resources
yet to create a new container with the same name immediately yet).
If that's the case, then re-attempting creation will hopefully succeed,
or we'd at least fail enough times for the alloc to be rescheduled to
another node.