14666 Commits

Author SHA1 Message Date
Michael Lange dcc219fe73 Show preemptions on the job plan phase of job submission 2019-04-22 16:40:01 -07:00
Michael Lange cb11f46ecf Data modeling for preemptions 2019-04-22 16:40:00 -07:00
Chris Baker 812abe153f
Merge pull request #5591 from hashicorp/cgbaker/changelog
changelog: added entry for #5540 fix
2019-04-22 15:31:22 -04:00
Michael Schurter 12ccadcbd0
Merge pull request #5586 from hashicorp/docs-deploy-ver
docs: bump deployment guide to 0.9.0
2019-04-22 12:29:22 -07:00
Chris Baker 0baf547059 changelog: added entry for #5540 fix 2019-04-22 19:27:40 +00:00
Chris Baker 91c4e1eabb
Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics
client/metrics: fixed stale metrics
2019-04-22 15:07:30 -04:00
Mahmood Ali f515b93b5e
Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable
logging: Attempt to recover logmon failures
2019-04-22 14:40:24 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Chris Baker 0b1a4dd206 client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy 2019-04-22 18:31:45 +00:00
Michael Schurter 6e43f72a12 docs: bump deployment guide to 0.9.0 2019-04-19 12:39:38 -07:00
Michael Schurter 26f3bdbf8f
Merge pull request #5583 from ygersie/fingerprint_nilpointer
fix nil pointer in fingerprinting AWS env leading to crash
2019-04-19 08:08:59 -07:00
Mahmood Ali 6b8f855c14
Merge pull request #5437 from hashicorp/r-upstream-libcontainer-plain
Use upstream libcontainer package
2019-04-19 10:15:13 -04:00
Mahmood Ali 6014a884be comment on using init() for libcontainer handling 2019-04-19 09:49:04 -04:00
Mahmood Ali 4322055301 comment what refer to 2019-04-19 09:49:04 -04:00
Mahmood Ali 18993421f2 Move libcontainer helper to executor package 2019-04-19 09:49:04 -04:00
Mahmood Ali e0c7063697 vendor upstream opencontainers/runc 2019-04-19 09:49:04 -04:00
Mahmood Ali 97aba5ad20
Merge pull request #5585 from hashicorp/b-drivers-node-registration
client: wait for batched driver updates before registering nodes
2019-04-19 09:47:21 -04:00
Mahmood Ali 902eed4bf9 clarify cryptic log line 2019-04-19 09:31:43 -04:00
Mahmood Ali f74d60439f client: log detected driver health state
Noticed that the `detected drivers` log line was misleading: when a driver
doesn't fingerprint before the timeout, its health status is the empty
string `""`, which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
2019-04-19 09:15:25 -04:00
Mahmood Ali 6bdc9860b7 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()`, almost immediately after
`registerAndHeartbeat()`, calls `retryRegisterNode()` (well, after 5
seconds).

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register node for new node events, not for
initial registration.
2019-04-19 09:12:50 -04:00
Preetha a9327e58fb
Update CHANGELOG.md 2019-04-19 08:02:48 -05:00
Mahmood Ali f82ea8824f client: wait for batched driver updates
Here we retain 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout.  This is needed for system jobs,
as system job scheduling for a node occurs at node registration, and the
race might mean that a system job may not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but this raises it to 1 minute, as
that is closer to blocking indefinitely than 1 second is.  We need to keep
the value high enough to capture as many drivers/devices as possible, but
low enough that it doesn't risk blocking too long due to a misbehaving plugin.

Fixes https://github.com/hashicorp/nomad/issues/5579
2019-04-19 09:00:24 -04:00
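The wait-with-timeout behaviour described in the commit above can be sketched roughly as follows. This is an illustration only, not Nomad's code: `waitForFingerprints`, `registrationTimeout`, and `shortTimeout` are hypothetical names, and only the 1 minute value comes from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// Timeouts used for illustration. The 1 minute value comes from the commit
// message; shortTimeout is a hypothetical value that is handy for tests.
const (
	registrationTimeout = time.Minute
	shortTimeout        = 50 * time.Millisecond
)

// waitForFingerprints blocks until ready is closed or the timeout elapses.
// It reports whether the fingerprints arrived in time.
func waitForFingerprints(ready <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-ready:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	ready := make(chan struct{})
	// Simulate the batched driver update arriving shortly after startup.
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(ready)
	}()
	if waitForFingerprints(ready, registrationTimeout) {
		fmt.Println("drivers fingerprinted; registering node")
	} else {
		fmt.Println("timed out; registering node with the drivers seen so far")
	}
}
```

Either way the node still registers; the timeout only bounds how long a misbehaving plugin can delay registration.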
Yorick Gersie 95f81f3eeb fix nil pointer in fingerprinting AWS env leading to crash
The HTTP client returns a nil response if an error has occurred. We first
need to check for an error before being able to check the HTTP response
code.
2019-04-19 11:07:13 +02:00
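The ordering described in the commit above (check the error first, then the status code) can be sketched as below. This is a minimal illustration, not the fingerprinter's actual code: `getter`, `metadataAvailable`, `failingGetter`, and `okGetter` are hypothetical names, and the URL is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// getter is a hypothetical stand-in for the part of *http.Client used here.
type getter interface {
	Get(url string) (*http.Response, error)
}

// metadataAvailable reports whether the endpoint answered with 200 OK.
// The error is checked first, because resp is nil whenever err is non-nil.
func metadataAvailable(c getter, url string) (bool, error) {
	resp, err := c.Get(url)
	if err != nil {
		return false, err // do not touch resp here; it is nil
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// failingGetter simulates a transport failure: nil response, non-nil error.
type failingGetter struct{}

func (failingGetter) Get(string) (*http.Response, error) {
	return nil, errors.New("connection refused")
}

// okGetter simulates a successful request with an empty body.
type okGetter struct{}

func (okGetter) Get(string) (*http.Response, error) {
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody}, nil
}

func main() {
	ok, err := metadataAvailable(failingGetter{}, "http://example.invalid/")
	fmt.Println(ok, err)
	ok, err = metadataAvailable(okGetter{}, "http://example.invalid/")
	fmt.Println(ok, err)
}
```

Reading `resp.StatusCode` before the error check would dereference a nil pointer in the failing case, which is the crash the commit describes.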
Preetha 4fdd82c601
Merge pull request #5580 from hashicorp/f-api-preemption-info
Add preemption related fields to AllocationListStub
2019-04-18 18:38:25 -07:00
Preetha Appan 22109d1e20
Add preemption related fields to AllocationListStub 2019-04-18 10:36:44 -05:00
Danielle 72862db778
Merge pull request #5572 from hashicorp/dani/b-docker-volumes
Switch to pre-0.9 behaviour for handling volumes
2019-04-18 15:48:23 +02:00
Danielle be7daaaf15
Merge pull request #5573 from hashicorp/dani/update-vol-docs
docs: Clarify docker volume behaviour
2019-04-18 14:30:16 +02:00
Danielle Lancashire a096a7f112 Switch to pre-0.9 behaviour for handling volumes
In Nomad 0.9, we made volume driver handling the same for `""` and
`"local"` volumes. Prior to Nomad 0.9, however, these had slightly different
behaviour for relative paths and named volumes.

Prior to 0.9 the empty string would expand relative paths within the task
dir, and `"local"` volumes that are not absolute paths would be treated
as docker named volumes.

This commit reverts to the previous behaviour as follows:

| Nomad Version | Driver  |   Volume Spec    | Behaviour                 |
|---------------|---------|------------------|---------------------------|
| all           | ""      | testing:/testing | allocdir/testing          |
| 0.8.7         | "local" | testing:/testing | "testing" as named volume |
| 0.9.0         | "local" | testing:/testing | allocdir/testing          |
| 0.9.1         | "local" | testing:/testing | "testing" as named volume |
2019-04-18 14:28:45 +02:00
Danielle Lancashire c31966fc71 logging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment separately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same way (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Chris Baker 338d4e989d
Merge pull request #5559 from ArangoGutierrez/website_docs_singularity
list singularity as a community driver
2019-04-17 12:42:29 -04:00
Charlie Voiselle 7f01244ece
fixed header level 2019-04-17 10:12:43 -04:00
Danielle Lancashire 1e0d3ffe24 docs: Clarify docker volume behaviour 2019-04-17 11:31:55 +02:00
Mahmood Ali 12a9896a7e
Merge pull request #5568 from hashicorp/b-nomad-logger-restart
Fixes #5566.

Fix a case where the docker logging process may lock up a nomad agent restart.

Looks like we have a case where the docker logger is started even though logmon isn't. In such a case, the fifo writer blocks indefinitely, and because the open operation happens in the main goroutine, the nomad agent blocks indefinitely.

This fixes the issue by performing the fifo open operation in a goroutine instead of the main goroutine.

We should follow up independently to ensure logmon <-> dockerlogger ordering and consider having task recovery happen in a non-main goroutine with some sensible timeouts.
2019-04-16 19:34:37 -04:00
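The idea in the merge message above, moving a potentially blocking open off the calling goroutine, can be sketched as follows. This is a generic illustration rather than the docker logger's code: `openInBackground`, `openResult`, `blockingOpen`, and `discardCloser` are hypothetical, and a channel stands in for the fifo reader that may never show up.

```go
package main

import (
	"fmt"
	"io"
)

// openResult carries the outcome of an open that may block for a long time.
type openResult struct {
	w   io.WriteCloser
	err error
}

// openInBackground runs open on its own goroutine and returns at once.
// The buffered channel lets the goroutine finish even if nobody receives.
func openInBackground(open func() (io.WriteCloser, error)) <-chan openResult {
	ch := make(chan openResult, 1)
	go func() {
		w, err := open()
		ch <- openResult{w: w, err: err}
	}()
	return ch
}

// discardCloser is a trivial writer used in place of a real fifo.
type discardCloser struct{}

func (discardCloser) Write(p []byte) (int, error) { return len(p), nil }
func (discardCloser) Close() error                { return nil }

// blockingOpen mimics opening a fifo for writing: it does not return until
// a reader appears, which is modelled here by closing readerReady.
func blockingOpen(readerReady <-chan struct{}) func() (io.WriteCloser, error) {
	return func() (io.WriteCloser, error) {
		<-readerReady
		return discardCloser{}, nil
	}
}

func main() {
	readerReady := make(chan struct{})
	pending := openInBackground(blockingOpen(readerReady))
	fmt.Println("caller keeps running while the open is pending")

	close(readerReady) // the reader finally attaches
	res := <-pending
	fmt.Println("open finished, err:", res.err)
}
```

The caller can also combine the returned channel with a timeout in a `select`, which is in the spirit of the follow-up suggested in the message.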
Eduardo Arango 40d0af5422
resolve merge conflicts
Signed-off-by: Eduardo Arango <eduardo@sylabs.io>
2019-04-16 17:01:22 -05:00
Eduardo Arango 6934b98313
address @cgbaker comments
Signed-off-by: Eduardo Arango <eduardo@sylabs.io>
2019-04-16 16:59:59 -05:00
Michael Schurter 3ba39e7c76
Merge pull request #5479 from hashicorp/b-vault-renewal
vault: fix renewal time
2019-04-16 12:20:26 -07:00
Michael Schurter 6421c55384 changelog: add #5479 2019-04-16 11:23:28 -07:00
Michael Schurter a85e7b7cc9 vault: fix data races 2019-04-16 11:22:44 -07:00
Michael Schurter 0aeb3dbd86 vault: fix renewal time
Renewal time was being calculated as 10s+Intn(lease-10s), so the renewal
time could be very rapid or within 1s of the deadline: [10s, lease)

This commit fixes the renewal time by calculating it as:

	(lease/2) +/- 10s

For a lease of 60s this means the renewal will occur in [20s, 40s).
2019-04-16 11:22:44 -07:00
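The two calculations in the commit above can be compared with a short sketch. It assumes a lease comfortably longer than 20s and uses hypothetical helper names (`seconds`, `oldRenewal`, `newRenewal`); it is not the Vault client code itself, and it uses `rand.Int63n` where the message writes `Intn`. For a 60s lease the old formula gives 10s + [0s, 50s) = [10s, 60s), while the new one gives 30s + [-10s, 10s) = [20s, 40s).

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// seconds converts a whole number of seconds to a time.Duration.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }

// oldRenewal follows the description 10s+Intn(lease-10s): a value in
// [10s, lease). Assumes lease > 10s.
func oldRenewal(lease time.Duration) time.Duration {
	return seconds(10) + time.Duration(rand.Int63n(int64(lease-seconds(10))))
}

// newRenewal follows the description (lease/2) +/- 10s: a value in
// [lease/2 - 10s, lease/2 + 10s). Assumes lease is well above 20s.
func newRenewal(lease time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(seconds(20)))) - seconds(10)
	return lease/2 + jitter
}

func main() {
	lease := seconds(60)
	for i := 0; i < 3; i++ {
		fmt.Printf("old: %-14v new: %v\n", oldRenewal(lease), newRenewal(lease))
	}
}
```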
Mahmood Ali 01a13a0947 locking and opening streams in goroutine comment 2019-04-16 11:02:19 -04:00
Mahmood Ali 357b86adc3 open fifo on background goroutine 2019-04-15 21:20:09 -04:00
Michael Schurter f7a7acc345
Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Michael Schurter 373748a327
Merge pull request #5486 from hashicorp/b-validate-migrate
api: fix migrate stanza initialization
2019-04-15 09:44:59 -07:00
Danielle a34b950a89
Merge pull request #5565 from hashicorp/dani/alloc-restart-docs
docs: Add docs for nomad-alloc-restart
2019-04-15 17:26:28 +02:00
Danielle Lancashire 3aef4343ae docs: Add docs for nomad-alloc-restart 2019-04-15 17:21:06 +02:00
Chris Baker a73d7e797b
Update singularity.html.md 2019-04-15 09:49:30 -04:00
Chris Baker 5b66a00689
Merge pull request #5560 from hashicorp/f-3251-cli-force-periodic
cli: add support for periodic force evaluation
2019-04-15 09:40:35 -04:00
Danielle Lancashire 60d7fc4bf5 Update CHANGELOG
Add `nomad alloc restart` and `nomad status -verbose`
2019-04-15 11:14:51 +02:00
Eduardo Arango c9bae637f2
Merge branch 'website_docs_singularity' of github.com:ArangoGutierrez/nomad into website_docs_singularity 2019-04-12 16:27:33 -05:00
Eduardo Arango 7ada6a2c4c
address requested changes, iteration 1
Signed-off-by: Eduardo Arango <eduardo@sylabs.io>
2019-04-12 16:26:52 -05:00