Commit Graph

14895 Commits

Author SHA1 Message Date
Nick Ethier c9bbdbf208
website: fixes a few errors in new plugin docs 2019-04-23 11:15:26 -04:00
Mahmood Ali 8b778f832d
Merge pull request #5598 from hashicorp/b-dont-forward-logs
fix crash when executor parent nomad process dies
2019-04-23 10:15:30 -04:00
Mahmood Ali 60ee243149 fix crash when executor parent nomad process dies
Fixes https://github.com/hashicorp/nomad/issues/5593

The executor seems to die unexpectedly after the nomad agent dies or is
restarted.  The crash seems to occur at the first log message after
the nomad agent dies.

To ease debugging we forward executor log messages to executor.log as
well as to Stderr.  `go-plugin` sets up plugins with Stderr pointing to
a pipe being read by plugin client, the nomad agent in our case[1].
When the nomad agent dies, the pipe is closed, and any subsequent
executor logs fail with ErrClosedPipe and a SIGPIPE signal.  SIGPIPE
results in the executor process dying.

I considered adding a handler to ignore SIGPIPE, but the hc-log library
currently panics when a logging write operation fails[2].

Thus we opt to revert to the v0.8 behavior of exclusively writing logs to
executor.log while we investigate alternative options.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323
2019-04-23 09:52:46 -04:00
Danielle Lancashire fde36992f1 api docs: Add allocation stop 2019-04-23 13:28:21 +02:00
Danielle Lancashire 5fe3b99ec8 docs: Add documentation for 2019-04-23 13:22:27 +02:00
Danielle Lancashire 3b6bda04e2 changelog: Update for GH-5512 and GH-5577 2019-04-23 13:12:08 +02:00
Danielle 198a838b61
Merge pull request #5512 from hashicorp/dani/f-alloc-stop
alloc-lifecycle: nomad alloc stop
2019-04-23 13:05:08 +02:00
Danielle Lancashire 832f607433 allocs: Add nomad alloc stop
This adds a `nomad alloc stop` command that can be used to stop and
force migrate an allocation to a different node.

This is built on top of the AllocUpdateDesiredTransitionRequest and
explicitly limits the scope of access to that transition to expose it
under the alloc-lifecycle ACL.

The API returns the follow up eval that can be used as part of
monitoring in the CLI or parsed and used in an external tool.
2019-04-23 12:50:23 +02:00
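The commit notes that the stop API returns a follow-up eval that an external tool can parse. A minimal sketch of that parsing step follows, assuming the response body is JSON with `EvalID` and `Index` fields; those field names are an assumption to be checked against the Nomad API documentation, and allocStopResponse and followUpEval are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// allocStopResponse is an ASSUMED shape for the stop response body;
// verify the field names against the Nomad HTTP API docs.
type allocStopResponse struct {
	EvalID string // follow-up evaluation created by the stop
	Index  uint64 // index of the write
}

// followUpEval extracts the evaluation ID an external tool would monitor.
func followUpEval(body []byte) (string, error) {
	var resp allocStopResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.EvalID == "" {
		return "", errors.New("response has no EvalID")
	}
	return resp.EvalID, nil
}

func main() {
	// Illustrative body, not captured from a real cluster.
	body := []byte(`{"EvalID":"example-eval-id","Index":42}`)
	id, err := followUpEval(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("monitor evaluation", id)
}
```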
Michael Lange f530c2f5c1 Updated serializer unit tests 2019-04-22 17:20:52 -07:00
Michael Lange 35e34fea8b Test coverage for preemption on the client detail page 2019-04-22 16:40:10 -07:00
Michael Lange b7860a9bca Test coverage for preemption on the allocation detail page 2019-04-22 16:40:09 -07:00
Michael Lange 29ccd8bcc5 Preemption modeling as page objects 2019-04-22 16:40:08 -07:00
Michael Lange 5124dfe30f Integration test for the alloc row icon 2019-04-22 16:40:07 -07:00
Michael Lange 000bfce30f Add preemption properties to Mirage allocation factory 2019-04-22 16:40:07 -07:00
Michael Lange 4c7e350e84 Show which allocations an allocation preempted on the alloc page 2019-04-22 16:40:06 -07:00
Michael Lange 42a4793d9d Show which alloc, if any, preempted an alloc on the alloc detail page 2019-04-22 16:40:05 -07:00
Michael Lange a5a659a98a Preemptions count and filtering on client detail page
Show the count in the allocations table next to the existing total alloc
count badge. Clicking either will filter by all or by preemptions.
2019-04-22 16:40:04 -07:00
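The UI itself is written in JavaScript (Ember), so the following is only a language-neutral sketch in Go of the counting and filtering this commit describes. The alloc type, its PreemptedByAllocation field, and the filterAllocs helper are hypothetical, chosen to mirror the wasPreempted flag added in a later commit in this log.

```go
package main

import "fmt"

// alloc is a hypothetical, trimmed-down allocation record.
type alloc struct {
	ID                    string
	PreemptedByAllocation string // ID of the alloc that preempted this one, if any
}

// wasPreempted is true when some other allocation preempted this one.
func wasPreempted(a alloc) bool { return a.PreemptedByAllocation != "" }

// filterAllocs returns every allocation for the "all" badge and only the
// preempted ones for the "preemptions" badge.
func filterAllocs(allocs []alloc, preemptionsOnly bool) []alloc {
	if !preemptionsOnly {
		return allocs
	}
	var out []alloc
	for _, a := range allocs {
		if wasPreempted(a) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	allocs := []alloc{{ID: "a1"}, {ID: "a2", PreemptedByAllocation: "a9"}, {ID: "a3"}}
	fmt.Printf("%d allocations, %d preemptions\n",
		len(filterAllocs(allocs, false)), len(filterAllocs(allocs, true)))
}
```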
Michael Lange 1266567098 Add preempted icon to alloc row 2019-04-22 16:40:04 -07:00
Michael Lange e35139e453 Make sure tooltips show up over the top of the side bar 2019-04-22 16:40:03 -07:00
Michael Lange d12d5f9163 Add wasPreempted bool to allocs 2019-04-22 16:40:02 -07:00
Michael Lange dcc219fe73 Show preemptions on the job plan phase of job submission 2019-04-22 16:40:01 -07:00
Michael Lange cb11f46ecf Data modeling for preemptions 2019-04-22 16:40:00 -07:00
Chris Baker 812abe153f
Merge pull request #5591 from hashicorp/cgbaker/changelog
changelog: added entry for #5540 fix
2019-04-22 15:31:22 -04:00
Michael Schurter 12ccadcbd0
Merge pull request #5586 from hashicorp/docs-deploy-ver
docs: bump deployment guide to 0.9.0
2019-04-22 12:29:22 -07:00
Chris Baker 0baf547059 changelog: added entry for #5540 fix 2019-04-22 19:27:40 +00:00
Chris Baker 91c4e1eabb
Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics
client/metrics: fixed stale metrics
2019-04-22 15:07:30 -04:00
Mahmood Ali f515b93b5e
Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable
logging: Attempt to recover logmon failures
2019-04-22 14:40:24 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Chris Baker 0b1a4dd206 client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy 2019-04-22 18:31:45 +00:00
Lang Martin 8aa97cff13 tests over setwise equality of fingerprinted parts 2019-04-19 15:49:24 -04:00
Michael Schurter 6e43f72a12 docs: bump deployment guide to 0.9.0 2019-04-19 12:39:38 -07:00
Lang Martin 7de6e28ddc structs need to keep assert Equal interface implementation for tests 2019-04-19 15:23:49 -04:00
Lang Martin 977d33970b structs equals use labeled continue for clarity 2019-04-19 15:23:48 -04:00
Lang Martin 7b99488afa struct equals use a working pattern for setwise comparison 2019-04-19 15:23:48 -04:00
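The labeled-continue pattern for setwise comparison named in these commit titles can be sketched as follows. This is a minimal illustration over string slices, assuming the real code compares richer struct values; setEqual is a hypothetical name. Each element of a is matched to a distinct, not-yet-used element of b, so duplicates are compared one-to-one and order is ignored.

```go
package main

import "fmt"

// setEqual reports whether a and b hold the same elements regardless of order.
func setEqual(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	used := make([]bool, len(b)) // marks elements of b already matched
OUTER:
	for _, x := range a {
		for j, y := range b {
			if !used[j] && x == y {
				used[j] = true
				continue OUTER // x is matched; move on to the next element of a
			}
		}
		return false // no unmatched element of b equals x
	}
	return true
}

func main() {
	fmt.Println(setEqual([]string{"gpu0", "gpu1"}, []string{"gpu1", "gpu0"}))
	fmt.Println(setEqual([]string{"gpu0", "gpu0", "gpu1"}, []string{"gpu0", "gpu1", "gpu1"}))
}
```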
Lang Martin eba4e29440 client fingerprinter doesn't overwrite manual configuration
Revert "Revert accidental merge of pr #5482"
This reverts commit c45652ab8c113487b9d4fbfb107782cbcf8a85b0.
2019-04-19 15:23:48 -04:00
Michael Schurter 26f3bdbf8f
Merge pull request #5583 from ygersie/fingerprint_nilpointer
fix nil pointer in fingerprinting AWS env leading to crash
2019-04-19 08:08:59 -07:00
Mahmood Ali 6b8f855c14
Merge pull request #5437 from hashicorp/r-upstream-libcontainer-plain
Use upstream libcontainer package
2019-04-19 10:15:13 -04:00
Mahmood Ali 6014a884be comment on using init() for libcontainer handling 2019-04-19 09:49:04 -04:00
Mahmood Ali 4322055301 comment what refer to 2019-04-19 09:49:04 -04:00
Mahmood Ali 18993421f2 Move libcontainer helper to executor package 2019-04-19 09:49:04 -04:00
Mahmood Ali e0c7063697 vendor upstream opencontainers/runc 2019-04-19 09:49:04 -04:00
Mahmood Ali 97aba5ad20
Merge pull request #5585 from hashicorp/b-drivers-node-registration
client: wait for batched driver updates before registering nodes
2019-04-19 09:47:21 -04:00
Mahmood Ali 902eed4bf9 clarify cryptic log line 2019-04-19 09:31:43 -04:00
Mahmood Ali f74d60439f client: log detected driver health state
Noticed that the `detected drivers` log line was misleading: when a driver
doesn't fingerprint before the timeout, its health status is the empty
string `""`, which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
2019-04-19 09:15:25 -04:00
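Logging every driver with its state, rather than a list of "detected" drivers, might look like the sketch below. It assumes driver health is available as a map from driver name to state string; driverStates is a hypothetical helper, and rendering the empty state as "undetected" is an illustrative choice. Names are sorted so the log line is stable between runs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// driverStates formats each driver with its health state. An empty state
// (the driver has not fingerprinted yet) is shown as "undetected" instead of
// being silently counted as detected.
func driverStates(health map[string]string) string {
	names := make([]string, 0, len(health))
	for name := range health {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		state := health[name]
		if state == "" {
			state = "undetected"
		}
		parts = append(parts, name+"="+state)
	}
	return strings.Join(parts, " ")
}

func main() {
	health := map[string]string{"docker": "healthy", "exec": "healthy", "qemu": ""}
	fmt.Println("driver states:", driverStates(health))
}
```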
Mahmood Ali 6bdc9860b7 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()` calls `retryRegisterNode()` almost
immediately after `registerAndHeartbeat()` (well, after 5 seconds).

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register the node for new node events, not for
its initial registration.
2019-04-19 09:12:50 -04:00
Preetha a9327e58fb
Update CHANGELOG.md 2019-04-19 08:02:48 -05:00
Mahmood Ali f82ea8824f client: wait for batched driver updates
Here we retain the 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout.  This is needed for system jobs,
as system job scheduling for a node occurs at node registration, and the
race might mean that a system job does not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but we raise it to 1 minute, as that
is closer to blocking indefinitely than 1 second is.  We need to keep the
value high enough to capture as many drivers/devices as possible, but low
enough that it doesn't risk blocking too long due to a misbehaving plugin.

Fixes https://github.com/hashicorp/nomad/issues/5579
2019-04-19 09:00:24 -04:00
Yorick Gersie 95f81f3eeb fix nil pointer in fingerprinting AWS env leading to crash
The HTTP client returns a nil response if an error has occurred. We first
need to check for an error before being able to check the HTTP response
code.
2019-04-19 11:07:13 +02:00
Preetha 4fdd82c601
Merge pull request #5580 from hashicorp/f-api-preemption-info
Add preemption related fields to AllocationListStub
2019-04-18 18:38:25 -07:00
Preetha Appan 22109d1e20
Add preemption related fields to AllocationListStub 2019-04-18 10:36:44 -05:00