Commit graph

570 commits

Author SHA1 Message Date
Lang Martin 371014b781
Merge pull request #5553 from hashicorp/b-fingerprinter-manual-config
client fingerprinter doesn't overwrite manual configuration
2019-04-26 12:55:34 -04:00
Danielle Lancashire 3409e0be89 allocs: Add nomad alloc signal command
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.

Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
2019-04-25 12:43:32 +02:00
Chris Baker 0b1a4dd206 client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy 2019-04-22 18:31:45 +00:00
Lang Martin eba4e29440 client fingerprinter doesn't overwrite manual configuration
Revert "Revert accidental merge of pr #5482"
This reverts commit c45652ab8c113487b9d4fbfb107782cbcf8a85b0.
2019-04-19 15:23:48 -04:00
Mahmood Ali 902eed4bf9 clarify cryptic log line 2019-04-19 09:31:43 -04:00
Mahmood Ali 6bdc9860b7 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()` almost immediately after
`registerAndHeartbeat()` calls `retryRegisterNode()`, well after 5
seconds.

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register node for new node events, not for
initial registration.
2019-04-19 09:12:50 -04:00
Mahmood Ali f82ea8824f client: wait for batched driver updated
Here we retain 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout.  This is needed for system jobs,
as system job scheduling for node occur at node registration, and the
race might mean that a system job may not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but raising it to 1 minute as it's
closer to indefinitely blocked than 1 second.  We need to keep the value
high enough to capture as much drivers/devices, but low enough that
doesn't risk blocking too long due to misbehaving plugin.

Fixes https://github.com/hashicorp/nomad/issues/5579
2019-04-19 09:00:24 -04:00
Lang Martin a2a1e7829d Revert accidental merge of pr #5482
Revert "fingerprint Constraints and Affinities have Equals, as set"
This reverts commit 596f16fb5f1a4a6766a57b3311af806d22382609.

Revert "client tests assert the independent handling of interface and speed"
This reverts commit 7857ac5993a578474d0570819f99b7b6e027de40.

Revert "structs missed applying a style change from the review"
This reverts commit 658916e3274efa438beadc2535f47109d0c2f0f2.

Revert "client, structs comments"
This reverts commit be2838d6baa9d382a5013fa80ea016856f28ade2.

Revert "client fingerprint updateNetworks preserves the network configuration"
This reverts commit fc309cb430e62d8e66267a724f006ae9abe1c63c.

Revert "client_test cleanup comments from review"
This reverts commit bc0bf4efb9114e699bc662f50c8f12319b6b3445.

Revert "client Networks Equals is set equality"
This reverts commit f8d432345b54b1953a4a4c719b9269f845e3e573.

Revert "struct cleanup indentation in RequestedDevice Equals"
This reverts commit f4746411cab328215def6508955b160a53452da3.

Revert "struct Equals checks for identity before value checking"
This reverts commit 0767a4665ed30ab8d9586a59a74db75d51fd9226.

Revert "fix client-test, avoid hardwired platform dependecy on lo0"
This reverts commit e89dbb2ab182b6368507dbcd33c3342223eb0ae7.

Revert "refactor error in client fingerprint to include the offending data"
This reverts commit a7fed726c6e0264d42a58410d840adde780a30f5.

Revert "add client updateNodeResources to merge but preserve manual config"
This reverts commit 84bd433c7e1d030193e054ec23474380ff3b9032.

Revert "refactor struts.RequestedDevice to have its own Equals"
This reverts commit 689782524090e51183474516715aa2f34908b8e6.

Revert "refactor structs.Resource.Networks to have its own Equals"
This reverts commit 49e2e6c77bb3eaa4577772b36c62205061c92fa1.

Revert "refactor structs.Resource.Devices to have its own Equals"
This reverts commit 4ede9226bb971ae42cc203560ed0029897aec2c9.

Revert "add COMPAT(0.10): Remove in 0.10 notes to impl for structs.Resources"
This reverts commit 49fbaace5298d5ccf031eb7ebec93906e1d468b5.

Revert "add structs.Resources Equals"
This reverts commit 8528a2a2a6450e4462a1d02741571b5efcb45f0b.

Revert "test that fingerprint resources are updated, net not clobbered"
This reverts commit 8ee02ddd23bafc87b9fce52b60c6026335bb722d.
2019-04-11 10:29:40 -04:00
Lang Martin 7258a13c72 client, structs comments 2019-04-11 09:56:22 -04:00
Lang Martin 22d87e4538 client fingerprint updateNetworks preserves the network configuration 2019-04-11 09:56:22 -04:00
Lang Martin f211500cea add client updateNodeResources to merge but preserve manual config 2019-04-11 09:56:22 -04:00
Danielle Lancashire e135876493 allocs: Add nomad alloc restart
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
2019-04-11 14:25:49 +02:00
Michael Schurter fec2752fb2 client: log when allocs have been processed
Will hopefully help us catch deadlocks/livelocks/slowdowns in the
add/remove allocs pipeline which should be fast.
2019-02-04 11:07:57 -08:00
Preetha Appan e7b59ac08c
Only set deployment health if not already set 2019-01-12 10:38:20 -06:00
Michael Schurter dbf4c3a3c8
Apply suggestions from code review
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-01-12 10:38:20 -06:00
Preetha Appan 7bd1440710
REfactor statedb factory config to set it directly in client config 2019-01-12 10:38:20 -06:00
Preetha Appan e237f19b38
Remove invalid allocs 2019-01-12 10:38:20 -06:00
Preetha Appan f059ef8a47
Modified destroy failure handling to rely on allocrunner's destroy method
Added a unit test with custom statedb implementation that errors, to
use to verify destroy errors
2019-01-12 10:37:12 -06:00
Preetha Appan 6c95da8f67
Add back code to mark alloc as failed when restore fails
Also modify restore such that any handled errors don't propagate
back to the client
2019-01-12 10:37:12 -06:00
Preetha Appan 5fde0b0f5c
Revert code that made an alloc update when restore fails
Restore currently shuts down the client so the alloc update cant
always make it to the server
2019-01-12 10:37:12 -06:00
Preetha Appan 41bfdd764b
Handle client initialization errors when adding allocs or restoring allocs
We mark the alloc as failed and track failed allocs so that we don't send
updates after the first time
2019-01-12 10:37:12 -06:00
Danielle Tomlinson 3e586e93da client: Cleanup allocrunner access 2019-01-11 18:39:18 +01:00
Alex Dadgar c9825a9c36 recover 2019-01-07 14:49:40 -08:00
Nick Ethier a96afb6c91
fix tests that fail as a result of async client startup 2018-12-20 00:53:44 -05:00
Michael Schurter d9ea8252a7 client/state: support upgrading from 0.8->0.9
Also persist and load DeploymentStatus to avoid rechecking health after
client restarts.
2018-12-19 10:39:27 -08:00
Nick Ethier a02308ee6a
drivermanager: attempt to reattach and shutdown driver plugin if blocked by allow/block lists 2018-12-18 23:01:57 -05:00
Nick Ethier ce1a5cba0e
drivermanager: use allocID and task name to route task events 2018-12-18 23:01:51 -05:00
Nick Ethier d8a0265e68
client: batch initial fingerprinting in plugin manangers
drivermanager: fix pr comments/feedback
2018-12-18 22:56:19 -05:00
Nick Ethier 7d23cbf448
client/drivermananger: fixup issues from rebase and address PR comments 2018-12-18 22:55:38 -05:00
Nick Ethier 82175d1328
client/drivermananger: add driver manager
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.

Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
2018-12-18 22:55:18 -05:00
Danielle Tomlinson cb78a90f40 client: Async API for shutdown/destroy allocrunners 2018-12-18 23:38:33 +01:00
Danielle Tomlinson d9174d8dcf
Merge pull request #4989 from hashicorp/dani/b-client-update-race-condition
client: Give a copy of clientconfig to allocrunner
2018-12-17 10:49:46 +01:00
Danielle Tomlinson 8b06e8d297
Merge pull request #4990 from hashicorp/dani/b-alloc-lock
client: updateAlloc release lock after read
2018-12-13 12:43:59 +01:00
Danielle Tomlinson 3823599da9 client: Give a copy of clientconfig to allocrunner
Currently, there is a race condition between creating a taskrunner, and
updating node attributes via fingerprinting.

This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.

This fixes that by providing a copy of the clientconfg to the
allocrunner inside the Read lock during config creation.
2018-12-13 12:42:15 +01:00
Danielle Tomlinson 4184eadaf4 client: updateAlloc release lock after read
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.

This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
2018-12-12 16:30:01 +01:00
Mahmood Ali 3d166e6e9c
Merge pull request #4984 from hashicorp/b-client-update-driver
client: update driver info on new driver fingerprint
2018-12-11 18:01:03 -05:00
Alex Dadgar 1531b6d534
Merge pull request #4970 from hashicorp/f-no-iops
Deprecate IOPS
2018-12-11 12:51:22 -08:00
Mahmood Ali ba515947c2 client: update driver info on new fingerprint
Fixes a bug where a driver health and attributes are never updated from
their initial status.  If a driver started unhealthy, it may never go
into a healthy status.
2018-12-11 14:25:10 -05:00
Danielle Tomlinson 805669ead4 client: Correctly pass a noop PrevAllocMigrator when restoring 2018-12-11 15:46:58 +01:00
Danielle Tomlinson 83720575de client: Unify handling of previous and preempted allocs 2018-12-11 13:12:35 +01:00
Danielle Tomlinson dff7093243 client: Wait for preempted allocs to terminate
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher, and then wait for the allocations to
terminate in the allocation PreRun hooks.

If there's no preempted allocations, then we simply provide a
NoopAllocWatcher.
2018-12-11 00:59:18 +01:00
Alex Dadgar 1e3c3cb287 Deprecate IOPS
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
2018-12-06 15:09:26 -08:00
Danielle Tomlinson 66c521ca17 client: Move fingerprint structs to pkg
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.

It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
2018-12-01 17:10:39 +01:00
Alex Dadgar 4ee603c382 Device hook and devices affect computed node class
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.

TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
2018-11-27 17:25:33 -08:00
Michael Schurter 1e4ef139dd
Merge pull request #4883 from hashicorp/f-graceful-shutdown
Support graceful shutdowns in agent
2018-11-27 15:55:15 -06:00
Michael Schurter 4f7e6f9464 client: fix races in use of goroutine group
The group utility struct does not support asynchronously launched
goroutines (goroutines-inside-of-goroutines), so switch those uses to a
normal go call.

This means watchNodeUpdates and watchNodeEvents may not be shutdown when
Shutdown() exits. During nomad agent shutdown this does not matter.

During tests this means a test may leak those goroutines or be unable to
know when those goroutines have exited.

Since there's no runtime impact and these goroutines do not affect alloc
state syncing it seems ok to risk leaking them.
2018-11-26 12:52:55 -08:00
Michael Schurter 9f43fb6d29 client: reuse group instead of diy'ing it 2018-11-26 12:52:31 -08:00
Michael Schurter 5bd744ac3d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
2018-11-19 16:39:30 -08:00
Mahmood Ali f139234372 address review comments 2018-11-16 17:13:01 -05:00
Mahmood Ali f72e599ee7 Populate alloc stats API with device stats
This change makes few compromises:

* Looks up the devices associated with tasks at look up time.  Given
that `nomad alloc status` is called rarely generally (compared to stats
telemetry and general job reporting), it seems fine.  However, the
lookup overhead grows bounded by number of `tasks x total-host-devices`,
which can be significant.

* `client.Client` performs the task devices->statistics lookup.  It
passes self to alloc/task runners so they can look up the device statistics
allocated to them.
  * Currently alloc/task runners are responsible for constructing the
entire RPC response for stats
  * The alternatives for making task runners device statistics aware
don't seem appealing (e.g. having task runners contain reference to hostStats)

* On the alloc aggregation resource usage, I did a naive merging of task device statistics.
  * Personally, I question the value of such aggregation, compared to
costs of struct duplication and bloating the response - but opted to be
consistent in the API.
  * With naive concatination, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.
2018-11-16 10:26:32 -05:00