Commit graph

Tim Gross 00c9bd7ff0
reorder volume claim batch request raft message (#7871)
For backwards compatibility during upgrades, new raft message types
need to come at the end of the enum.
2020-05-06 08:57:51 -04:00
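
As a sketch of why the ordering matters: raft message types are numeric constants persisted in the raft log, so inserting a new value anywhere but the end renumbers every later type and breaks decoding for servers still running the previous version during an upgrade. A minimal Go illustration (the constants and values shown are illustrative, not Nomad's actual list):

```go
package structs

// MessageType identifies the kind of raft log entry. Values are persisted,
// so they must stay stable across releases: new types are appended at the
// end of the enum, never inserted in the middle.
type MessageType uint8

const (
	NodeRegisterRequestType MessageType = iota // 0
	JobRegisterRequestType                     // 1
	AllocUpdateRequestType                     // 2
	// ...existing types keep their historical values...

	// Appended last, so the values above are unchanged for servers that
	// have not yet been upgraded.
	CSIVolumeClaimBatchRequestType
)
```
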
Mahmood Ali 24e0c7f081 ui: only count running allocations in client view
In the client list view, only show the count of running allocations for
each client, rather than including already completed tasks.

This is done for two reasons:

First, consistency with the CLI: `nomad node status --allocs` only
shows running allocs.

Second, and more importantly, the count is useful to estimate how loaded
the clients are.  Allocs that have completed (but not GCed yet) have
very little value to operators.
2020-05-05 21:31:58 -04:00
Tim Gross ce86a594a6
csi: fix plugin counts on node update (#7844)
In this changeset:

* If a Nomad client node is running both a controller and a node
  plugin (which is a common case) and only the controller or the node
  plugin was removed, the plugin was not being updated with the
  correct counts.
* The existing test for plugin cleanup didn't go back to the state
  store, which normally is ok but is complicated in this case by
  denormalization which changes the behavior. This commit makes the
  test more comprehensive.
* Set "controller required" when plugin has `PUBLISH_READONLY`. All
  known controllers that support `PUBLISH_READONLY` also support
  `PUBLISH_UNPUBLISH_VOLUME` but we shouldn't assume this.
* Only create plugins when the allocs for those plugins are
  healthy. If we allow a plugin to be created for the first time when
  the alloc is not healthy, then we'll recreate deleted plugins when
  the job's allocs all get marked terminal.
* Terminal plugin alloc updates should cleanup the plugin. The client
  fingerprint can't tell if the plugin is unhealthy intentionally (for
  the case of updates or job stop). Allocations that are
  server-terminal should delete themselves from the plugin and trigger
  a plugin self-GC, the same as an unused node.
2020-05-05 15:39:57 -04:00
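
A hedged sketch of the plugin lifecycle rules described above, using simplified stand-in types rather than Nomad's actual state-store structs:

```go
package state

// Simplified stand-ins; the real Nomad structs and indexes differ.
type allocInfo struct {
	PluginID       string
	Healthy        bool
	ServerTerminal bool
}

type plugin struct {
	ID     string
	Allocs map[string]bool // IDs of allocs currently backing this plugin
}

// updatePluginForAlloc applies the rules from the commit message: only a
// healthy alloc may create or join a plugin, and a server-terminal alloc
// removes itself, triggering plugin self-GC once nothing is left.
func updatePluginForAlloc(plugins map[string]*plugin, allocID string, a allocInfo) {
	switch {
	case a.ServerTerminal:
		if p, ok := plugins[a.PluginID]; ok {
			delete(p.Allocs, allocID)
			if len(p.Allocs) == 0 {
				delete(plugins, a.PluginID) // plugin self-GC
			}
		}
	case a.Healthy:
		p, ok := plugins[a.PluginID]
		if !ok {
			// Create the plugin only for a healthy alloc, so a deleted
			// plugin isn't recreated when the job's allocs go terminal.
			p = &plugin{ID: a.PluginID, Allocs: map[string]bool{}}
			plugins[a.PluginID] = p
		}
		p.Allocs[allocID] = true
	}
	// Unhealthy, non-terminal allocs neither create nor remove plugins.
}
```
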
Tim Gross 3cca738478
csi: fix mount validation (#7869)
Several of the CSI `VolumeCapability` methods return pointers, which
we were then comparing to pointers in the request rather than
dereferencing them and comparing their contents.

This changeset does a more fine-grained comparison of the request vs
the capabilities, and adds better error messaging.
2020-05-05 15:13:07 -04:00
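
The pointer-comparison pitfall is easy to reproduce; a minimal Go sketch with stand-in types, not the actual CSI structs:

```go
package main

import "fmt"

// Stand-in for a capability value that the CSI methods return by pointer.
type accessMode struct{ Mode string }

func main() {
	requested := &accessMode{Mode: "single-node-writer"}
	supported := &accessMode{Mode: "single-node-writer"}

	// Buggy: compares pointer identity, which is never equal for two
	// separately allocated values, even when their contents match.
	fmt.Println(requested == supported) // false

	// Fixed: dereference and compare the contents field by field.
	fmt.Println(requested.Mode == supported.Mode) // true
}
```
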
Drew Bailey 13beb103a4
Merge pull request #7867 from hashicorp/license-command-updates
update license command output to reflect api changes
2020-05-05 14:10:22 -04:00
Tim Gross 22e3815e8c
docstring improvements and typo fixes (#7862) 2020-05-05 10:30:50 -04:00
Drew Bailey 48c451709e
update license command output to reflect api changes 2020-05-05 10:28:58 -04:00
Juan Larriba a0df437c62
Run Linux Images (LCOW) and Windows Containers side by side (#7850)
Makes it possible to run Linux Containers on Windows (LCOW) with Nomad alongside Windows Containers. Previously, the fingerprint only allowed Nomad on Windows 10 to run Linux Containers.
2020-05-04 13:08:47 -04:00
Tim Gross 1c6dcab56b
volumewatcher: remove spurious nil-check (#7858)
The nil-check here is left over from an earlier approach that didn't
get merged. It doesn't do anything for us now: we can't ever pass it
`nil`, and if we leave it in, the `getVolume` call it guards will
panic anyway.
2020-05-04 12:28:32 -04:00
Mahmood Ali 0d730326c1
Merge pull request #7856 from Renerick/patch-1
Fix URL schema in `drain` documentation
2020-05-03 13:51:05 -04:00
Denis Palashevskiy 4095860116
Fix URL schema in drain documentation 2020-05-03 20:50:40 +04:00
Michael Schurter 1d7f8391ee
Merge pull request #7846 from hashicorp/changli0617-patch-1
Update _app.js for Nomad Virtual Day
2020-05-01 14:33:23 -07:00
Michael Lange 260da00852 Add embedded task group to allocation to reference when allocation is historical 2020-05-01 14:30:02 -07:00
Michael Lange 91b97e0170 Stabilize job and allocation job versions in fixtures 2020-05-01 14:29:24 -07:00
Michael Lange 9a857a7042 Comment why the allocation has to be reloaded 2020-05-01 14:27:53 -07:00
Mahmood Ali bc25da9dac
Merge pull request #7851 from hashicorp/spread-configuration-followup
Follow up fix for spread
2020-05-01 13:48:31 -04:00
Mahmood Ali 759eade78b missed fixing one invocation 2020-05-01 13:38:46 -04:00
Tim Gross 139c65c436
e2e: csi test can purge target job (#7823) 2020-05-01 13:25:50 -04:00
Mahmood Ali 78ae7b885a
Merge pull request #7810 from hashicorp/spread-configuration
spread scheduling algorithm
2020-05-01 13:15:19 -04:00
Mahmood Ali 3da74068dd changelog and fix typo 2020-05-01 13:14:20 -04:00
Mahmood Ali b9e3cde865 tests and some clean up 2020-05-01 13:13:30 -04:00
Charlie Voiselle d8e5e02398 Wiring algorithm to scheduler calls 2020-05-01 13:13:29 -04:00
Charlie Voiselle 663fb677cf Add SchedulerAlgorithm to SchedulerConfig 2020-05-01 13:13:29 -04:00
Lang Martin ad2fb4b297 client/heartbeatstop: don't store client state, use timeout
In order to minimize this change while keeping a simple version of the
behavior, we set `lastOk` to the current time less the initial server
connection timeout. If the client starts and never contacts the
server, it will stop all configured tasks after the initial server
connection grace period, on the assumption that we've been out of
touch longer than any configured `stop_after_client_disconnect`.

The more complex state behavior might be justified later, but we
should learn about failure modes first.
2020-05-01 12:35:49 -04:00
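
A minimal Go sketch of the simplified behavior described above; the field and parameter names are illustrative, not the actual client implementation:

```go
package heartbeatstop

import "time"

type stopper struct {
	lastOk time.Time // client-local time of the last successful heartbeat
}

// newStopper backdates lastOk by the initial server connection timeout, so a
// client that starts and never reaches a server behaves as if it has been out
// of touch longer than any configured stop_after_client_disconnect interval.
func newStopper(now time.Time, initialConnTimeout time.Duration) *stopper {
	return &stopper{lastOk: now.Add(-initialConnTimeout)}
}

// heartbeat records a successful heartbeat round trip.
func (s *stopper) heartbeat(now time.Time) { s.lastOk = now }

// shouldStop reports whether an alloc configured with
// stop_after_client_disconnect should now be destroyed.
func (s *stopper) shouldStop(now time.Time, stopAfter time.Duration) bool {
	return now.Sub(s.lastOk) > stopAfter
}
```
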
Lang Martin 28bac139cb client/heartbeatstop: destroy allocs when disconnected from servers
- track lastHeartbeat, the client local time of the last successful
  heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
  servers are still partitioned
2020-05-01 12:35:49 -04:00
Drew Bailey fdd61e4a63
Merge pull request #7847 from hashicorp/ent-license-404
temporarily test for 404 until endpoint is ready
2020-05-01 11:31:10 -04:00
Drew Bailey 581ad558a8
temporarily test for 404 until endpoint is ready 2020-05-01 11:24:37 -04:00
Drew Bailey 2b8fc650c9
Merge pull request #7778 from hashicorp/license-cli
License cli
2020-05-01 08:51:40 -04:00
changli0617 a37176ab44
Update _app.js 2020-04-30 15:25:43 -07:00
Michael Schurter cebc6b939e
Merge pull request #7730 from hashicorp/b-reserved-scoring
core: fix node reservation scoring
2020-04-30 14:48:36 -07:00
Michael Schurter c901d0e7dd
Merge branch 'master' into b-reserved-scoring 2020-04-30 14:48:14 -07:00
Michael Schurter 439a9f7301
Update website/pages/docs/upgrade/upgrade-specific.mdx
Co-authored-by: Alex Dadgar <alex@hashicorp.com>
2020-04-30 14:47:12 -07:00
Tim Gross cbae10333c
csi: check returned volume capability validation (#7831)
This changeset corrects handling of the `ValidateVolumeCapabilities`
response:

* The CSI spec for the `ValidateVolumeCapabilities` endpoint requires that
  plugins only set the `Confirmed` field if they've validated all
  capabilities. The Nomad client improperly assumes that the lack of a
  `Confirmed` field should be treated as a failure. This breaks the
  Azure and Linode block storage plugins, which don't set this
  optional field.

* The CSI spec also requires that the orchestrator check the validation
  responses to guard against older versions of a plugin reporting
  "valid" for newer fields it doesn't understand.
2020-04-30 17:12:32 -04:00
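
Sketched in Go against hypothetical response types (not the actual CSI protobuf messages), the corrected handling looks roughly like this:

```go
package csi

import "fmt"

type capability struct{ AccessMode string }

type validationResponse struct {
	// Confirmed is optional: a plugin sets it only if it validated all
	// capabilities, so its absence alone is not a failure.
	Confirmed []capability
}

// checkValidation treats a missing Confirmed list as success, but when it is
// present, verifies every requested capability was confirmed, guarding
// against older plugins reporting "valid" for fields they don't understand.
func checkValidation(requested []capability, resp *validationResponse) error {
	if resp.Confirmed == nil {
		return nil // optional field; e.g. the Azure and Linode plugins omit it
	}
	for _, want := range requested {
		confirmed := false
		for _, got := range resp.Confirmed {
			if got.AccessMode == want.AccessMode {
				confirmed = true
				break
			}
		}
		if !confirmed {
			return fmt.Errorf("capability %q was not confirmed by the plugin", want.AccessMode)
		}
	}
	return nil
}
```
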
Tim Gross cc7dbad1c7
csi: restore long timeout for controller plugins (#7840)
During MVP development, we reduced the timeout for controller plugins
to avoid long hangs in GC workers. But now that this work has been
moved to the volume watcher, we can restore the original timeout which
is better suited for the characteristic timescales of some cloud
provider APIs and better matches the behavior of k8s.
2020-04-30 17:12:05 -04:00
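
A hedged sketch of applying a per-RPC timeout to controller calls in Go; the constant value and function shape are illustrative, not Nomad's:

```go
package csi

import (
	"context"
	"time"
)

// Controller RPCs can take minutes against some cloud provider APIs, so a
// long timeout is appropriate now that this work runs in the volume watcher
// rather than holding up a GC worker.
const controllerTimeout = 5 * time.Minute // illustrative value

// withControllerTimeout wraps a controller RPC with the longer deadline.
func withControllerTimeout(ctx context.Context, rpc func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, controllerTimeout)
	defer cancel()
	return rpc(ctx)
}
```
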
Tim Gross 52e805a6a6
csi: ensure Read/WriteAllocs aren't released early (#7841)
We should only remove the `ReadAllocs`/`WriteAllocs` values for a
volume after the claim has entered the "ready to free"
state. The volume will eventually be released as expected. But
querying the volume API will show the volume is released before the
controller unpublish has finished and this can cause a race with
starting new jobs.

Test updates are to cover cases where we're dropping claims but not
running through the whole reaping process.
2020-04-30 17:11:31 -04:00
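
A hedged Go sketch of the ordering constraint; the claim states and fields are simplified, not the real state-store schema:

```go
package volumes

type claimState int

const (
	claimActive       claimState = iota
	claimUnpublishing // controller unpublish still in flight
	claimReadyToFree  // all unpublish steps have completed
)

type volume struct {
	ReadAllocs  map[string]bool
	WriteAllocs map[string]bool
}

// releaseClaim drops the alloc from ReadAllocs/WriteAllocs only once the
// claim is ready to free; releasing earlier makes the volume API report the
// volume as free while controller unpublish is still running, racing with
// new jobs that try to claim it.
func releaseClaim(v *volume, allocID string, state claimState) {
	if state != claimReadyToFree {
		return
	}
	delete(v.ReadAllocs, allocID)
	delete(v.WriteAllocs, allocID)
}
```
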
Drew Bailey 41c7d49eb7
properly format license output 2020-04-30 14:46:26 -04:00
Drew Bailey 42075ef30e
allow test to check if server is enterprise 2020-04-30 14:46:21 -04:00
Drew Bailey acacecc67b
add license reset command to commands
help text formatting

remove reset

no signed option
2020-04-30 14:46:20 -04:00
Drew Bailey a266284f60
test all commands oss err 2020-04-30 14:46:19 -04:00
Drew Bailey 59b76f90e8
hcl fmt from editor
license cli formatting, license endpoints ent only

test oss error

type assertions
2020-04-30 14:46:18 -04:00
Drew Bailey 74abe6ef48
license cli commands
cli changes, formatting
2020-04-30 14:46:17 -04:00
Jasmine Dahilig a9004faa11
UI: Add representations for task lifecycles (#7659)
This adds details about task lifecycles to allocations, task groups,
and tasks. It includes a live-updating timeline-like chart on allocations.
2020-04-30 08:15:19 -05:00
Tim Gross a7a64443e1
csi: move volume claim release into volumewatcher (#7794)
This changeset adds a subsystem to run on the leader, similar to the
deployment watcher or node drainer. The `Watcher` performs a blocking
query on updates to the `CSIVolumes` table and triggers reaping of
volume claims.

This avoids tying up scheduling workers: volume claim workloads are
sent immediately into their own loop, rather than blocking the
scheduling workers in the core GC job with work like talking to CSI
controllers.

The volume watcher is enabled on leader step-up and disabled on leader
step-down.

The volume claim GC mechanism now makes an empty claim RPC for the
volume to trigger an index bump. That in turn unblocks the blocking
query in the volume watcher so it can assess which claims can be
released for a volume.
2020-04-30 09:13:00 -04:00
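
A rough Go sketch of the watcher loop described above; the query helper and types are hypothetical stand-ins, not the actual implementation:

```go
package volumewatcher

import "context"

type volume struct{ ID string }

// blockingQuery returns volumes from the CSIVolumes table once the table's
// index advances past minIndex, along with the new index.
type blockingQuery func(ctx context.Context, minIndex uint64) ([]*volume, uint64, error)

// watch runs on the leader. An "empty" claim RPC made by the volume claim GC
// job bumps the table index, which unblocks the query so the watcher can
// assess which claims on the volume can be released.
func watch(ctx context.Context, query blockingQuery, release func(*volume)) error {
	var minIndex uint64
	for ctx.Err() == nil {
		vols, index, err := query(ctx, minIndex)
		if err != nil {
			return err
		}
		for _, v := range vols {
			release(v) // release any claims that are ready
		}
		minIndex = index
	}
	return ctx.Err()
}
```
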
Michael Lange c3085f04b6
Merge pull request #7820 from hashicorp/b-ui/ui-log-races
UI: Log streaming bug fix medley
2020-04-29 18:06:47 -07:00
Michael Lange 21ef3633be Make the no connection error on the logs page dismissable 2020-04-29 17:36:17 -07:00
Michael Lange e74cd16252 Fix race condition where stdout and stderr requests can cause a no connection error
The no connection error is shown after the second request fails, on the assumption that the
second request went to a server node. However, if a user clicks stderr fast enough, the first
and second requests both go to the client node. This changes the logic to check whether the
failed request was actually to a server before deeming log streaming a total failure.
2020-04-29 17:36:17 -07:00
Michael Lange aafbeaba75 Clicking stdout/stderr when already on that tab is now a noop 2020-04-29 17:36:16 -07:00
Michael Lange 7452a9a57d Abort log fetch request when failing over from client to server
Typically a failover means that the client can't be reached. However, if
the client does eventually return after the timeout period, the log will
stream indefinitely. This fixes that using an API that wasn't broadly
available at the time this was first written.
2020-04-29 17:34:49 -07:00
Michael Lange 9ba563c48e Always pass credentials in fetch requests, but also treat options reasonably
Now options can be provided without also having to remember to pass
credentials. This is convenient for abort controller signals.
2020-04-29 17:34:49 -07:00
Seth Hoenig dee7f3ea11
Merge pull request #7828 from hashicorp/b-ec2-speeds
env_aws: use best-effort lookup table for CPU performance in EC2
2020-04-29 11:25:54 -06:00