Commit graph

3707 commits

Author SHA1 Message Date
Mahmood Ali ab2cae0625 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Mahmood Ali 446f06721d aux: helper method that returns token as well as ACL policy
This helper returns the token as well as the ACL policy, to be used in a later
commit for logging the token info associated with nomad exec invocation.
2019-04-30 10:23:56 -04:00
Lang Martin 371014b781
Merge pull request #5553 from hashicorp/b-fingerprinter-manual-config
client fingerprinter doesn't overwrite manual configuration
2019-04-26 12:55:34 -04:00
Danielle 79515496cb
Merge pull request #5515 from hashicorp/dani/f-alloc-signal
allocs: Add nomad alloc signal command
2019-04-26 14:21:05 +02:00
Danielle Lancashire a8880f9643 alloc_signal: Add autcompletion and cmd tests 2019-04-26 12:47:53 +02:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali 81841e8528 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali e6ca8641a8 try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali ff3a095015 add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Mahmood Ali b51f00a7f3 logmon client to handle grpc closing errors 2019-04-25 14:32:24 -04:00
Danielle Lancashire 3409e0be89 allocs: Add nomad alloc signal command
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.

Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
2019-04-25 12:43:32 +02:00
Chris Baker 91c4e1eabb
Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics
client/metrics: fixed stale metrics
2019-04-22 15:07:30 -04:00
Mahmood Ali f515b93b5e
Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable
logging: Attempt to recover logmon failures
2019-04-22 14:40:24 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Chris Baker 0b1a4dd206 client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy 2019-04-22 18:31:45 +00:00
Lang Martin eba4e29440 client fingerprinter doesn't overwrite manual configuration
Revert "Revert accidental merge of pr #5482"
This reverts commit c45652ab8c113487b9d4fbfb107782cbcf8a85b0.
2019-04-19 15:23:48 -04:00
Michael Schurter 26f3bdbf8f
Merge pull request #5583 from ygersie/fingerprint_nilpointer
fix nil pointer in fingerprinting AWS env leading to crash
2019-04-19 08:08:59 -07:00
Mahmood Ali 902eed4bf9 clarify cryptic log line 2019-04-19 09:31:43 -04:00
Mahmood Ali f74d60439f client: log detected driver health state
Noticed that `detected drivers` log line was misleading - when a driver
doesn't fingerprint before timeout, their health status is empty string
`""` which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
2019-04-19 09:15:25 -04:00
Mahmood Ali 6bdc9860b7 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()` almost immediately after
`registerAndHeartbeat()` calls `retryRegisterNode()`, well after 5
seconds.

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register node for new node events, not for
initial registration.
2019-04-19 09:12:50 -04:00
Mahmood Ali f82ea8824f client: wait for batched driver updated
Here we retain 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout.  This is needed for system jobs,
as system job scheduling for node occur at node registration, and the
race might mean that a system job may not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but raising it to 1 minute as it's
closer to indefinitely blocked than 1 second.  We need to keep the value
high enough to capture as much drivers/devices, but low enough that
doesn't risk blocking too long due to misbehaving plugin.

Fixes https://github.com/hashicorp/nomad/issues/5579
2019-04-19 09:00:24 -04:00
Yorick Gersie 95f81f3eeb fix nil pointer in fingerprinting AWS env leading to crash
HTTP Client returns a nil response if an error has occured. We first
  need to check for an error before being able to check the HTTP response
  code.
2019-04-19 11:07:13 +02:00
Danielle Lancashire c31966fc71 loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Michael Schurter a85e7b7cc9 vault: fix data races 2019-04-16 11:22:44 -07:00
Michael Schurter 0aeb3dbd86 vault: fix renewal time
Renewal time was being calculated as 10s+Intn(lease-10s), so the renewal
time could be very rapid or within 1s of the deadline: [10s, lease)

This commit fixes the renewal time by calculating it as:

	(lease/2) +/- 10s

For a lease of 60s this means the renewal will occur in [20s, 40s).
2019-04-16 11:22:44 -07:00
Michael Schurter f7a7acc345
Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker 6848591914 vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Lang Martin a2a1e7829d Revert accidental merge of pr #5482
Revert "fingerprint Constraints and Affinities have Equals, as set"
This reverts commit 596f16fb5f1a4a6766a57b3311af806d22382609.

Revert "client tests assert the independent handling of interface and speed"
This reverts commit 7857ac5993a578474d0570819f99b7b6e027de40.

Revert "structs missed applying a style change from the review"
This reverts commit 658916e3274efa438beadc2535f47109d0c2f0f2.

Revert "client, structs comments"
This reverts commit be2838d6baa9d382a5013fa80ea016856f28ade2.

Revert "client fingerprint updateNetworks preserves the network configuration"
This reverts commit fc309cb430e62d8e66267a724f006ae9abe1c63c.

Revert "client_test cleanup comments from review"
This reverts commit bc0bf4efb9114e699bc662f50c8f12319b6b3445.

Revert "client Networks Equals is set equality"
This reverts commit f8d432345b54b1953a4a4c719b9269f845e3e573.

Revert "struct cleanup indentation in RequestedDevice Equals"
This reverts commit f4746411cab328215def6508955b160a53452da3.

Revert "struct Equals checks for identity before value checking"
This reverts commit 0767a4665ed30ab8d9586a59a74db75d51fd9226.

Revert "fix client-test, avoid hardwired platform dependecy on lo0"
This reverts commit e89dbb2ab182b6368507dbcd33c3342223eb0ae7.

Revert "refactor error in client fingerprint to include the offending data"
This reverts commit a7fed726c6e0264d42a58410d840adde780a30f5.

Revert "add client updateNodeResources to merge but preserve manual config"
This reverts commit 84bd433c7e1d030193e054ec23474380ff3b9032.

Revert "refactor struts.RequestedDevice to have its own Equals"
This reverts commit 689782524090e51183474516715aa2f34908b8e6.

Revert "refactor structs.Resource.Networks to have its own Equals"
This reverts commit 49e2e6c77bb3eaa4577772b36c62205061c92fa1.

Revert "refactor structs.Resource.Devices to have its own Equals"
This reverts commit 4ede9226bb971ae42cc203560ed0029897aec2c9.

Revert "add COMPAT(0.10): Remove in 0.10 notes to impl for structs.Resources"
This reverts commit 49fbaace5298d5ccf031eb7ebec93906e1d468b5.

Revert "add structs.Resources Equals"
This reverts commit 8528a2a2a6450e4462a1d02741571b5efcb45f0b.

Revert "test that fingerprint resources are updated, net not clobbered"
This reverts commit 8ee02ddd23bafc87b9fce52b60c6026335bb722d.
2019-04-11 10:29:40 -04:00
Lang Martin 5d3596eb7e client tests assert the independent handling of interface and speed 2019-04-11 09:56:22 -04:00
Lang Martin 7258a13c72 client, structs comments 2019-04-11 09:56:22 -04:00
Lang Martin 22d87e4538 client fingerprint updateNetworks preserves the network configuration 2019-04-11 09:56:22 -04:00
Lang Martin 8fe9699e51 client_test cleanup comments from review 2019-04-11 09:56:22 -04:00
Lang Martin 63c993c8ae fix client-test, avoid hardwired platform dependecy on lo0 2019-04-11 09:56:22 -04:00
Lang Martin a9db848974 refactor error in client fingerprint to include the offending data 2019-04-11 09:56:22 -04:00
Lang Martin f211500cea add client updateNodeResources to merge but preserve manual config 2019-04-11 09:56:22 -04:00
Lang Martin a4b59130d2 test that fingerprint resources are updated, net not clobbered 2019-04-11 09:56:21 -04:00
Danielle Lancashire e135876493 allocs: Add nomad alloc restart
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
2019-04-11 14:25:49 +02:00
Chris Baker 829a972693
vault client test: minor formatting
vendor: using upstream circonus-gometrics
2019-04-10 10:34:10 -05:00
Chris Baker c0a7aee610
vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker f0c184fc29
taskrunner: removed some unecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker a26d4fe1e5
docs: -vault-namespace, VAULT_NAMESPACE, and config
agent: added VAULT_NAMESPACE env-based configuration
2019-04-10 10:34:10 -05:00
Chris Baker 170f5239c8
client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker a1d7971b2e
taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Chris Baker 0eaeef872f
config/docs: added namespace to vault config
server/client: process `namespace` config, setting on the instantiated vault client
2019-04-10 10:34:10 -05:00
Michael Schurter 45b4827ad7 Bump to 0.9.1-dev 2019-04-09 09:01:48 -07:00
Nomad Release bot e307734e4a Generate files for 0.9.0 release 2019-04-09 01:56:00 +00:00
Michael Schurter f7d4428855 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter 3af602b633 Remove 0.9.0-rc2 generated files 2019-04-03 07:41:09 -07:00