This PR fixes various instances of plugins being launched without using
the parent loggers. This meant that plugin logs would not all go to the
same output, would break formatting, etc.
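A minimal sketch of the pattern, assuming go-plugin and hclog; the helper name is made up and the real call sites differ:
```
package pluginutil

import (
	"os/exec"

	hclog "github.com/hashicorp/go-hclog"
	plugin "github.com/hashicorp/go-plugin"
)

// launchPlugin is a hypothetical helper: the key point is that the parent
// hclog.Logger is passed to go-plugin via ClientConfig.Logger instead of
// letting go-plugin construct its own default logger, so plugin output uses
// the same sink and formatting as the client.
func launchPlugin(parent hclog.Logger, cmd *exec.Cmd, handshake plugin.HandshakeConfig,
	plugins map[string]plugin.Plugin) *plugin.Client {

	return plugin.NewClient(&plugin.ClientConfig{
		HandshakeConfig: handshake,
		Plugins:         plugins,
		Cmd:             cmd,
		Logger:          parent.Named("plugin"), // reuse (and sub-scope) the parent logger
	})
}
```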
This PR improves how killing a task is handled. Before, the kill function
directly orchestrated the killing and was only valid while the task was
running. The new behavior is to mark the desired state and wait for the
task runner to converge to that state.
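Roughly the shape of the new flow (illustrative names only, not the actual task runner API):
```
package taskrunner

import "context"

// A rough, self-contained sketch of the idea (not Nomad's actual TaskRunner):
// Kill only records the desired state and waits for the run loop to converge,
// so it is safe to call regardless of what state the task is currently in.
type runner struct {
	killCh chan struct{} // closed once to signal the desired "dead" state
	waitCh chan struct{} // closed by the run loop when the task is fully dead
}

// Kill marks the desired state and blocks until the run loop reports that it
// has converged (or the caller's context is cancelled). The run loop is the
// only place that actually signals the task and emits the kill events.
func (r *runner) Kill(ctx context.Context) error {
	close(r.killCh) // a real implementation would guard against double-close

	select {
	case <-r.waitCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```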
We were just emitting Killed/Terminated events before. In v0.8 we
emitted Killing/Killed, but lacked Terminated when explicitly stopping
a task. This change makes it so Terminated is always included, whether
the task is explicitly stopped or exits on its own.
New output:
2019-01-04T14:58:51-08:00 Killed Task successfully killed
2019-01-04T14:58:51-08:00 Terminated Exit Code: 130, Signal: 2
2019-01-04T14:58:51-08:00 Killing Sent interrupt
2019-01-04T14:58:51-08:00 Leader Task Dead Leader Task in Group dead
2019-01-04T14:58:49-08:00 Started Task started by client
2019-01-04T14:58:49-08:00 Task Setup Building Task Directory
2019-01-04T14:58:49-08:00 Received Task received by client
Old (v0.8.6) output:
2019-01-04T22:14:54Z Killed Task successfully killed
2019-01-04T22:14:54Z Killing Sent interrupt. Waiting 5s before force killing
2019-01-04T22:14:54Z Leader Task Dead Leader Task in Group dead
2019-01-04T22:14:53Z Started Task started by client
2019-01-04T22:14:53Z Task Setup Building Task Directory
2019-01-04T22:14:53Z Received Task received by client
Simplify the allocDir.Build() function to avoid depending on client/structs,
and remove a parameter that's always set to `false`.
The motivation here is to avoid a dependency cycle between
drivers/cstructs and alloc_dir.
**The Bug:**
You may have seen log lines like this when running 0.9.0-dev:
```
... client.alloc_runner.task_runner: some environment variables not available for rendering: ... keys="attr.driver.docker.volumes.enabled, attr.driver.docker.version, attr.driver.docker.bridge_ip, attr.driver.qemu.version"
```
Not only should we not be erroring on builtin driver attributes, but the
results were nondeterministic due to map iteration order!
The root cause is that we have an old root attribute for all drivers
like:
```
attr.driver.docker = "1"
```
When attributes were opaque variable names it was fine to also have
"nested" attributes like:
```
attr.driver.docker.version = "1.2.3"
```
However in the HCLv2 world the variable names are no longer opaque: they
form an object tree. The `docker` object can no longer both hold a value
(`"1"`) *and* nested attributes (`version = "1.2.3"`).
**The Fix:**
Since the old `attr.driver.<name> = "1"` attributes are useless for task
config interpolation, create a new precedence rule for creating the task
config evaluation context:
*Maps take precedence over primitives.*
This means `attr.driver.docker.version` will always take precedence over
`attr.driver.docker`. The results are deterministic and give users access
to the more useful metadata.
I made this a general precedence rule instead of special-casing driver
attrs because it seemed like better default behavior than spamming
WARNings to logs that users likely could not act on.
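A small sketch of what the precedence rule amounts to when flattened attribute keys collide (illustrative helper, not the actual taskenv code):
```
package taskenv

import "strings"

// nestedWins reports which flattened attribute keys should be dropped because
// a longer dotted key exists under the same prefix, e.g.
// "attr.driver.docker.version" shadows "attr.driver.docker". The result does
// not depend on map iteration order, so the evaluation context is
// deterministic.
func nestedWins(attrs map[string]string) map[string]bool {
	drop := make(map[string]bool)
	for key := range attrs {
		for other := range attrs {
			if other != key && strings.HasPrefix(other, key+".") {
				drop[key] = true
				break
			}
		}
	}
	return drop
}
```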
Unfortunately I don't know how to test these errors. As far as I can
tell they should only happen if there was a programming error in the
upgrade code or the underlying boltdb was corrupted somehow.
* Prefix task bucket with task- to prevent name conflicts
* Shorten device manager bucket name
* Remove commented out outdated var
* Update layout comment
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incoming fingerprint and task events from each driver. The manager
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.
Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint manager and has been removed.
The RestartCount is not really suitable for use as a source of
uniqueness within task invocations: it is not monotonic, and it interacts
with the restart stanza in a user's config, so it conflates restarts due to
task failures with restarts due to environmental changes, such as Consul
Template or Vault secrets changing.
Here we instead use a substring of a UUID, which is more random than
we strictly need, but is nicer than rolling our own random string
generator.
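Something along these lines, assuming the hashicorp/go-uuid package (the helper name is made up):
```
package taskrunner

import uuid "github.com/hashicorp/go-uuid"

// invocationID returns a short slice of a freshly generated UUID, giving each
// task invocation a unique-enough identifier without depending on the restart
// counter or writing a custom random-string generator.
func invocationID() (string, error) {
	id, err := uuid.GenerateUUID()
	if err != nil {
		return "", err
	}
	return id[:8], nil
}
```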
This creates a new buffered channel and goroutine on the allocrunner for
serializing updates to allocations. This allows us to take updates off
the routine that is used for processing updates from the server,
without having complicated machinery for tracking update lifetimes, or
other external synchronization.
This results in a nice performance improvement and significantly better
throughput on batch changes such as preempting a large number of jobs
for a larger placement.
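A rough sketch of the pattern (illustrative types, not the actual allocrunner code):
```
package allocrunner

// allocUpdate stands in for an updated allocation from the server.
type allocUpdate struct{}

type runner struct {
	updateCh chan *allocUpdate
}

func newRunner() *runner {
	r := &runner{updateCh: make(chan *allocUpdate, 16)} // small buffer absorbs bursts
	go r.handleUpdates()
	return r
}

// Update is called from the routine that processes server updates; it only
// enqueues the update and returns, so that routine is never blocked on the
// allocrunner applying the change.
func (r *runner) Update(u *allocUpdate) {
	r.updateCh <- u
}

// handleUpdates applies updates one at a time on a dedicated goroutine,
// which serializes them without any extra locking or lifetime tracking.
func (r *runner) handleUpdates() {
	for u := range r.updateCh {
		r.apply(u)
	}
}

func (r *runner) apply(u *allocUpdate) {
	_ = u // push the new allocation down to the task runners (elided)
}
```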
This commit reduces the locking required to shutdown or destroy
allocrunners, and allows parallel shutdown and destroy of allocrunners during
shutdown.
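Sketch of the shutdown shape this enables (pared-down types for illustration only):
```
package client

import "sync"

// allocRunner is a pared-down interface for this illustration only.
type allocRunner interface {
	Shutdown()                   // ask the runner to stop and release resources
	ShutdownCh() <-chan struct{} // closed once shutdown has completed
}

type clientState struct {
	allocLock sync.RWMutex
	allocs    map[string]allocRunner
}

// shutdownAllocRunners snapshots the runner map under a short read lock and
// then shuts the runners down concurrently, so one slow runner never blocks
// the others and the map lock is not held during shutdown.
func (c *clientState) shutdownAllocRunners() {
	c.allocLock.RLock()
	runners := make([]allocRunner, 0, len(c.allocs))
	for _, ar := range c.allocs {
		runners = append(runners, ar)
	}
	c.allocLock.RUnlock()

	var wg sync.WaitGroup
	for _, ar := range runners {
		wg.Add(1)
		go func(ar allocRunner) {
			defer wg.Done()
			ar.Shutdown()
			<-ar.ShutdownCh()
		}(ar)
	}
	wg.Wait()
}
```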
The assertion here is causing many spurious failures that aren't
actually relevant to the test itself.
We are tracking the cause for this failure independently, and it would
make more sense to have a dedicated test for clean shutdown.
Currently, there is a race condition between creating a taskrunner and
updating node attributes via fingerprinting.
This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.
This fixes that by providing a copy of the clientconfig to the
allocrunner inside the Read lock during config creation.
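A pared-down sketch of the idea; the real client config has much more in it:
```
package client

import "sync"

// nodeConfig is a stand-in for the client config used only for this
// illustration; the real config carries the Node and much more.
type nodeConfig struct {
	Attributes map[string]string
}

// copy returns a deep copy so readers never share the Attributes map with
// the fingerprinting process.
func (c *nodeConfig) copy() *nodeConfig {
	attrs := make(map[string]string, len(c.Attributes))
	for k, v := range c.Attributes {
		attrs[k] = v
	}
	return &nodeConfig{Attributes: attrs}
}

type configHolder struct {
	configLock sync.RWMutex
	config     *nodeConfig
}

// configCopy snapshots the config while holding the read lock; the
// allocrunner (and its taskenv builder) only ever iterates this snapshot,
// so concurrent fingerprint updates can no longer cause a panic.
func (h *configHolder) configCopy() *nodeConfig {
	h.configLock.RLock()
	defer h.configLock.RUnlock()
	return h.config.copy()
}
```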
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.
This fixes a bug where any client allocation API would block during the
shutdown or updating of an allocrunner and its child taskrunners.
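Roughly what the lookup now looks like (illustrative types, not the actual client code):
```
package client

import "sync"

// updatable is a pared-down allocrunner interface for this illustration.
type updatable interface {
	Update(allocID string)
}

type allocMap struct {
	allocLock sync.RWMutex
	allocs    map[string]updatable
}

// updateAlloc holds the map lock only long enough to find the runner; the
// runner applies the update after the lock is released, because it is
// responsible for its own internal consistency. Client allocation APIs that
// read the map are therefore never blocked behind a slow update or shutdown.
func (m *allocMap) updateAlloc(allocID string) {
	m.allocLock.RLock()
	ar, ok := m.allocs[allocID]
	m.allocLock.RUnlock()
	if !ok {
		return
	}
	ar.Update(allocID)
}
```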