The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.
Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
The RestartCount is not really suitable for use as a source of
uniqueness within task invocations as it is not monotonic, and interacts
with the restart stanza in a users config, so conflates restarts due to
task failures, with restarts due to enviromental changes, such as consul
template or vault secrets changing.
Here we instead use a substring from a uuid, which is more random than
we strictly need, but is nicer than rolling our own random string
generator here.
This creates a new buffered channel and goroutine on the allocrunner for
serializing updates to allocations. This allows us to take updates off
the routine that is used from processing updates from the server,
without having complicated machinery for tracking update lifetimes, or
other external synchronization.
This results in a nice performance improvement and signficantly better
throughput on batch changes such as preempting a large number of jobs
for a larger placement.
This commit reduces the locking required to shutdown or destroy
allocrunners, and allows parallel shutdown and destroy of allocrunners during
shutdown.
The assertion here is causing many spurious failures that aren't
actually relevant to the test itself.
We are tracking the cause for this failure independently, and it would
make more sense to have a dedicated test for clean shutdown.
Currently, there is a race condition between creating a taskrunner, and
updating node attributes via fingerprinting.
This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.
This fixes that by providing a copy of the clientconfg to the
allocrunner inside the Read lock during config creation.
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.
This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
Fixes a bug where a driver health and attributes are never updated from
their initial status. If a driver started unhealthy, it may never go
into a healthy status.
Noticed few places where tests seem to block indefinitely and panic
after the test run reaches the test package timeout.
I intend to follow up with the proper fix later, but timing out is much
better than indefinitely blocking.
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher, and then wait for the allocations to
terminate in the allocation PreRun hooks.
If there's no preempted allocations, then we simply provide a
NoopAllocWatcher.