Commit Graph

2700 Commits

Author SHA1 Message Date
Tim Gross 3aa761b151
Periodic GC for volume claims (#7881)
This changeset implements a periodic garbage collection of CSI volumes
with missing allocations. This can happen in a scenario where a node
update fails partially and the allocation updates are written to raft
but the evaluations to GC the volumes are dropped. This feature will
cover this edge case and ensure that upgrades from 0.11.0 and 0.11.1
get any stray claims cleaned up.
2020-05-11 08:20:50 -04:00
Mahmood Ali 2c963885b0 handle upgrade path and defaults
Ensure that `""` Scheduler Algorithm gets explicitly set to binpack on
upgrades or on API handling when user misses the value.

The scheduler already treats `""` value as binpack.  This PR merely
ensures that the operator API returns the effective value.
2020-05-09 12:34:08 -04:00
Drew Bailey fde40046a1
update license output 2020-05-07 12:14:15 -04:00
Tim Gross 801ebcfe8d
periodic GC for CSI plugins (#7878)
This changeset implements a periodic garbage collection of unused CSI
plugins. Plugins are self-cleaning when the last allocation for a
plugin is stopped, but this feature will cover any missing edge cases
and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins
cleaned up.
2020-05-06 16:49:12 -04:00
Drew Bailey 48c451709e
update license command output to reflect api changes 2020-05-05 10:28:58 -04:00
Mahmood Ali 78ae7b885a
Merge pull request #7810 from hashicorp/spread-configuration
spread scheduling algorithm
2020-05-01 13:15:19 -04:00
Mahmood Ali b9e3cde865 tests and some clean up 2020-05-01 13:13:30 -04:00
Charlie Voiselle 663fb677cf Add SchedulerAlgorithm to SchedulerConfig 2020-05-01 13:13:29 -04:00
Drew Bailey 581ad558a8
temporarily test for 404 until endpoint is ready 2020-05-01 11:24:37 -04:00
Drew Bailey 41c7d49eb7
properly format license output 2020-04-30 14:46:26 -04:00
Drew Bailey 42075ef30e
allow test to check if server is enterprise 2020-04-30 14:46:21 -04:00
Drew Bailey acacecc67b
add license reset command to commands
help text formatting

remove reset

no signed option
2020-04-30 14:46:20 -04:00
Drew Bailey a266284f60
test all commands oss err 2020-04-30 14:46:19 -04:00
Drew Bailey 59b76f90e8
hcl fmt from editor
license cli formatting, license endpoints ent only

test oss error

type assertions
2020-04-30 14:46:18 -04:00
Drew Bailey 74abe6ef48
license cli commands
cli changes, formatting
2020-04-30 14:46:17 -04:00
Lang Martin e32b5b12dd
command: deployment status without a prefix lists deployments (#7821) 2020-04-28 15:11:32 -04:00
Mahmood Ali b8fb32f5d2 http: adjust log level for request failure
Failed requests due to API client errors are to be marked as DEBUG.

The Error log level should be reserved to signal problems with the
cluster and are actionable for nomad system operators.  Logs due to
misbehaving API clients don't represent a system level problem and seem
spurius to nomad maintainers at best.  These log messages can also be
attack vectors for deniel of service attacks by filling servers disk
space with spurious log messages.
2020-04-22 16:19:59 -04:00
Mahmood Ali 5b42796f1e
Merge pull request #7704 from hashicorp/b-agent-shutdown-order
agent: shutdown agent http server last
2020-04-20 10:37:26 -04:00
Mahmood Ali 4e1366f285 agent: route http logs through hclog
Pipe http server log to hclog, so that it uses the same logging format
as rest of nomad logs.  Also, supports emitting them as json logs, when
json formatting is set.

The http server logs are emitted as Trace level, as they are typically
repsent HTTP client errors (e.g. failed tls handshakes, invalid headers,
etc).

Though, Panic logs represent server errors and are relayed as Error
level.
2020-04-20 10:33:40 -04:00
Jeffrey 'jf' Lim eab600d3e1
Fix/improve "job plan" messaging (#7580) 2020-04-17 15:53:16 -04:00
Mahmood Ali b78680eee7 agent: shutdown agent http server last
Shutdown http server last, after nomad client/server components
terminate.

Before this change, if the agent is taking an unexpectedly long time to
shutdown, the operator cannot query the http server directly: they
cannot access agent specific http endpoints and need to query another
agent about the troublesome agent.

Unexpectedly long shutdown can happen in normal cases, e.g. a client
might hung is if one of the allocs it is running has a long
shutdown_delay.

Here, we switch to ensuring that the http server is shutdown last.

I believe this doesn't require extra care in agent shutting down logic
while operators may be able to submit write http requests.  We already
need to cope with operators submiting these http requests to another
agent or by servers updating the client allocations.
2020-04-13 10:50:07 -04:00
Mahmood Ali 14d6fec05a tests: deflake some SetServer related tests
Some tests assert on numbers on numbers of servers, e.g.
TestHTTP_AgentSetServers and TestHTTP_AgentListServers_ACL . Though, in dev and
test modes, the agent starts with servers having duplicate entries for
advertised and normalized RPC values, then settles with one unique value after
Raft/Serf re-sets servers with one single unique value.

This leads to flakiness, as the test will fail if assertion runs before Serf
update takes effect.

Here, we update the inital dev handling so it only adds a unique value if the
advertised and normalized values are the same.

Sample log lines illustrating the problem:

```
=== CONT  TestHTTP_AgentSetServers
    TestHTTP_AgentSetServers: testlog.go:34: 2020-04-06T21:47:51.016Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:127.0.0.1:9008 Address:127.0.0.1:9008}]"
    TestHTTP_AgentSetServers: testlog.go:34: 2020-04-06T21:47:51.016Z [INFO]  nomad: serf: EventMemberJoin: TestHTTP_AgentSetServers.global 127.0.0.1
    TestHTTP_AgentSetServers: testlog.go:34: 2020-04-06T21:47:51.035Z [DEBUG] client.server_mgr: new server list: new_servers=[127.0.0.1:9008, 127.0.0.1:9008] old_servers=[]
...
    TestHTTP_AgentSetServers: agent_endpoint_test.go:759:
                Error Trace:    agent_endpoint_test.go:759
                                                        http_test.go:1089
                                                        agent_endpoint_test.go:705
                Error:          "[127.0.0.1:9008 127.0.0.1:9008]" should have 1 item(s), but has 2
                Test:           TestHTTP_AgentSetServers
```
2020-04-07 09:27:48 -04:00
Mahmood Ali ed4c4d13a4 fixup! backend: support WS authentication handshake in alloc/exec 2020-04-03 14:20:31 -04:00
Mahmood Ali e63e096136 backend: support WS authentication handshake in alloc/exec
The javascript Websocket API doesn't support setting custom headers
(e.g. `X-Nomad-Token`).  This change adds support for having an
authentication handshake message: clients can set `ws_handshake` URL
query parameter to true and send a single handshake message with auth
token first before any other mssage.

This is a backward compatible change: it does not affect nomad CLI path, as it
doesn't set `ws_handshake` parameter.
2020-04-03 11:18:54 -04:00
Mahmood Ali 990cfb6fef agent config parsing tests for scheduler config 2020-04-03 07:54:32 -04:00
Chris Baker 277d29c6e7
Merge pull request #7572 from hashicorp/f-7422-scaling-events
finalizing scaling API work
2020-04-01 13:49:22 -05:00
Seth Hoenig 9aa9721143 connect: fix bug where absent connect.proxy stanza needs default config
In some refactoring, a bug was introduced where if the connect.proxy
stanza in a submitted job was nil, the default proxy configuration
would not be initialized with default values, effectively breaking
Connect.

      connect {
        sidecar_service {} # should work
      }

In contrast, by setting an empty proxy stanza, the config values would
be inserted correctly.

      connect {
        sidecar_service {
	  proxy {} # workaround
	}
      }

This commit restores the original behavior, where having a proxy
stanza present is not required.

The unit test for this case has also been corrected.
2020-04-01 11:19:32 -06:00
Chris Baker 40d6b3bbd1 adding raft and state_store support to track job scaling events
updated ScalingEvent API to record "message string,error bool" instead
of confusing "reason,error *string"
2020-04-01 16:15:14 +00:00
Seth Hoenig 14c7cebdea connect: enable automatic expose paths for individual group service checks
Part of #6120

Building on the support for enabling connect proxy paths in #7323, this change
adds the ability to configure the 'service.check.expose' flag on group-level
service check definitions for services that are connect-enabled. This is a slight
deviation from the "magic" that Consul provides. With Consul, the 'expose' flag
exists on the connect.proxy stanza, which will then auto-generate expose paths
for every HTTP and gRPC service check associated with that connect-enabled
service.

A first attempt at providing similar magic for Nomad's Consul Connect integration
followed that pattern exactly, as seen in #7396. However, on reviewing the PR
we realized having the `expose` flag on the proxy stanza inseperably ties together
the automatic path generation with every HTTP/gRPC defined on the service. This
makes sense in Consul's context, because a service definition is reasonably
associated with a single "task". With Nomad's group level service definitions
however, there is a reasonable expectation that a service definition is more
abstractly representative of multiple services within the task group. In this
case, one would want to define checks of that service which concretely make HTTP
or gRPC requests to different underlying tasks. Such a model is not possible
with the course `proxy.expose` flag.

Instead, we now have the flag made available within the check definitions themselves.
By making the expose feature resolute to each check, it is possible to have
some HTTP/gRPC checks which make use of the envoy exposed paths, as well as
some HTTP/gRPC checks which make use of some orthongonal port-mapping to do
checks on some other task (or even some other bound port of the same task)
within the task group.

Given this example,

group "server-group" {
  network {
    mode = "bridge"
    port "forchecks" {
      to = -1
    }
  }

  service {
    name = "myserver"
    port = 2000

    connect {
      sidecar_service {
      }
    }

    check {
      name     = "mycheck-myserver"
      type     = "http"
      port     = "forchecks"
      interval = "3s"
      timeout  = "2s"
      method   = "GET"
      path     = "/classic/responder/health"
      expose   = true
    }
  }
}

Nomad will automatically inject (via job endpoint mutator) the
extrapolated expose path configuration, i.e.

expose {
  path {
    path            = "/classic/responder/health"
    protocol        = "http"
    local_path_port = 2000
    listener_port   = "forchecks"
  }
}

Documentation is coming in #7440 (needs updating, doing next)

Modifications to the `countdash` examples in https://github.com/hashicorp/demo-consul-101/pull/6
which will make the examples in the documentation actually runnable.

Will add some e2e tests based on the above when it becomes available.
2020-03-31 17:15:50 -06:00
Seth Hoenig 41244c5857 jobspec: parse multi expose.path instead of explicit slice 2020-03-31 17:15:27 -06:00
Seth Hoenig 0266f056b8 connect: enable proxy.passthrough configuration
Enable configuration of HTTP and gRPC endpoints which should be exposed by
the Connect sidecar proxy. This changeset is the first "non-magical" pass
that lays the groundwork for enabling Consul service checks for tasks
running in a network namespace because they are Connect-enabled. The changes
here provide for full configuration of the

  connect {
    sidecar_service {
      proxy {
        expose {
          paths = [{
		path = <exposed endpoint>
                protocol = <http or grpc>
                local_path_port = <local endpoint port>
                listener_port = <inbound mesh port>
	  }, ... ]
       }
    }
  }

stanza. Everything from `expose` and below is new, and partially implements
the precedent set by Consul:
  https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference

Combined with a task-group level network port-mapping in the form:

  port "exposeExample" { to = -1 }

it is now possible to "punch a hole" through the network namespace
to a specific HTTP or gRPC path, with the anticipated use case of creating
Consul checks on Connect enabled services.

A future PR may introduce more automagic behavior, where we can do things like

1) auto-fill the 'expose.path.local_path_port' with the default value of the
   'service.port' value for task-group level connect-enabled services.

2) automatically generate a port-mapping

3) enable an 'expose.checks' flag which automatically creates exposed endpoints
   for every compatible consul service check (http/grpc checks on connect
   enabled services).
2020-03-31 17:15:27 -06:00
Seth Hoenig 1ce4eb17fa client: use consistent name for struct receiver parameter
This helps reduce the number of squiggly lines in Goland.
2020-03-31 17:15:27 -06:00
Lang Martin 8d4f39fba1
csi: add node events to report progress mounting and unmounting volumes (#7547)
* nomad/structs/structs: new NodeEventSubsystemCSI

* client/client: pass triggerNodeEvent in the CSIConfig

* client/pluginmanager/csimanager/instance: add eventer to instanceManager

* client/pluginmanager/csimanager/manager: pass triggerNodeEvent

* client/pluginmanager/csimanager/volume: node event on [un]mount

* nomad/structs/structs: use storage, not CSI

* client/pluginmanager/csimanager/volume: use storage, not CSI

* client/pluginmanager/csimanager/volume_test: eventer

* client/pluginmanager/csimanager/volume: event on error

* client/pluginmanager/csimanager/volume_test: check event on error

* command/node_status: remove an extra space in event detail format

* client/pluginmanager/csimanager/volume: use snake_case for details

* client/pluginmanager/csimanager/volume_test: snake_case details
2020-03-31 17:13:52 -04:00
Yoan Blanc 225c9c1215 fixup! vendor: explicit use of hashicorp/go-msgpack
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-31 09:48:07 -04:00
Yoan Blanc 761d014071 vendor: explicit use of hashicorp/go-msgpack
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-31 09:45:21 -04:00
Seth Hoenig b3664c628c
Merge pull request #7524 from hashicorp/docs-consul-acl-minimums
consul: annotate Consul interfaces with ACLs
2020-03-30 13:27:27 -06:00
Mahmood Ali 7df337e4c4
Merge pull request #7534 from hashicorp/b-windows-dev-network
windows: support -dev mode
2020-03-30 14:35:28 -04:00
Seth Hoenig 0a812ab689 consul: annotate Consul interfaces with ACLs 2020-03-30 10:17:28 -06:00
Drew Bailey a98dc8c768
update audit examples to an endpoint that is audited 2020-03-30 10:03:11 -04:00
Mahmood Ali dedf1cd3d7 tests: remove TestHTTP_NodeDrain_Compat
Nomad 0.11 servers no longer support having pre-0.8 clients.
2020-03-30 07:06:52 -04:00
Mahmood Ali 8b2b3f99d3 tests: deflake TestHTTP_NodeDrain
A node may be recognized as not running any allocs and have its drain
flag reset before the test queries it.
2020-03-30 07:06:52 -04:00
Mahmood Ali b0cc23ae63 tests: deflake TestConsul_PeriodicSync 2020-03-30 07:06:47 -04:00
Mahmood Ali ec6afa5795 windows: support -dev mode
Support running `nomad agent -dev` in Windows, by setting proper network
interface.

Prior to this change, `nomad` uses `lo` interface but Windows uses
"Loopback Pseudo-Interface 1" to refer to loopback device interface:
https://github.com/golang/go/blob/go1.14.1/src/net/net_windows_test.go#L304-L318
.
2020-03-28 12:01:51 -04:00
Drew Bailey a66b4be0f3
remove auditing for /ui/ 2020-03-27 10:12:42 -04:00
Drew Bailey de687edb2e
wrap http.Handlers
better comments
2020-03-27 09:35:10 -04:00
Lang Martin 50ff9ccd44
csi: plugin deregistration on plugin job GC (#7502)
* nomad/structs/csi: delete just one plugin type from a node

* nomad/structs/csi: add DeleteAlloc

* nomad/state/state_store: add deleteJobFromPlugin

* nomad/state/state_store: use DeleteAlloc not DeleteNodeType

* move CreateTestCSIPlugin to state to avoid an import cycle

* nomad/state/state_store_test: delete a plugin by deleting its jobs

* nomad/*_test: move CreateTestCSIPlugin to state

* nomad/state/state_store: update one plugin per transaction

* command/plugin_status_test: move CreateTestCSIPlugin

* nomad: csi: handle nils CSIPlugin methods, clarity
2020-03-26 17:07:18 -04:00
Drew Bailey b96a4da6fc
sync changes made to oss files from ent 2020-03-25 10:57:44 -04:00
Drew Bailey 218bfff6dd
add in change missed from ent 2020-03-25 10:53:38 -04:00
Drew Bailey 97cc19276d
add auditor 2020-03-25 10:48:23 -04:00
Drew Bailey 7329a88758
allow all build contexts to use noOpAuditor 2020-03-25 10:38:40 -04:00
Mahmood Ali 1c1186b344
Merge pull request #7487 from hashicorp/b-xss-oss
agent: prevent XSS by controlling Content-Type
2020-03-25 09:56:11 -04:00
Michael Schurter 29622013fa remove double negative from comment
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2020-03-25 09:45:43 -04:00
Michael Schurter 1a27b8a07d test: assert monitor endpoint sets proper headers 2020-03-25 09:45:43 -04:00
Michael Schurter d6d44a8214 test: assert fs endpoints are xss safe 2020-03-25 09:45:43 -04:00
Michael Schurter 5ff458e840 agent: prevent XSS by controlling Content-Type 2020-03-25 09:45:43 -04:00
Mahmood Ali c7cf60c837 tests: test agent to use a noop auditor 2020-03-25 08:45:44 -04:00
Mahmood Ali ceed57b48f per-task restart policy 2020-03-24 17:00:41 -04:00
Lang Martin 8bd0405f33
csi: return an empty result list from plugins & volumes without `type`, not an error (#7471) 2020-03-24 14:28:28 -04:00
Chris Baker bc13bfb433 bad conversion between api.ScalingPolicy and structs.ScalingPolicy meant
that we were throwing away .Min if provided
2020-03-24 14:39:06 +00:00
Chris Baker f6ec5f9624 made count optional during job scaling actions
added ACL protection in Job.Scale
in Job.Scale, only perform a Job.Register if the Count was non-nil
2020-03-24 14:39:05 +00:00
Chris Baker 233db5258a changes to Canonicalize, Validate, and api->struct conversion so that tg.Count, tg.Scaling.Min/Max are well-defined with reasonable defaults.
- tg.Count defaults to tg.Scaling.Min if present (falls back on previous default of 1 if Scaling is absent)
- Validate() enforces tg.Scaling.Min <= tg.Count <= tg.Scaling.Max

modification in ApiScalingPolicyToStructs, api.TaskGroup.Validate so that defaults are handled for TaskGroup.Count and
2020-03-24 13:57:17 +00:00
Chris Baker 00092a6c29 fixed http endpoints for job.register and job.scalestatus 2020-03-24 13:57:16 +00:00
Chris Baker 925b59e1d2 wip: scaling status return, almost done 2020-03-24 13:57:15 +00:00
Chris Baker 42270d862c wip: some tests still failing
updating job scaling endpoints to match RFC, cleaning up the API object as well
2020-03-24 13:57:14 +00:00
Chris Baker abc7a52f56 finished refactoring state store, schema, etc 2020-03-24 13:57:14 +00:00
Chris Baker 3d54f1feba wip: added Enabled to ScalingPolicyListStub, removed JobID from body of scaling request 2020-03-24 13:57:12 +00:00
Chris Baker 024d203267 wip: added tests for client methods around group scaling 2020-03-24 13:57:11 +00:00
Chris Baker 1c5c2eb71b wip: add GET endpoint for job group scaling target 2020-03-24 13:57:10 +00:00
Chris Baker 179ab68258 wip: added job.scale rpc endpoint, needs explicit test (tested via http now) 2020-03-24 13:57:09 +00:00
Chris Baker 8453e667c2 wip: working on job group scaling endpoint 2020-03-24 13:55:20 +00:00
Chris Baker 6665d0bfb0 wip: added policy get endpoint, added UUID to policy 2020-03-24 13:55:20 +00:00
Chris Baker 9c2560ceeb wip: upsert/delete scaling policies on job upsert/delete 2020-03-24 13:55:18 +00:00
Chris Baker 65d92f1fbf WIP: adding ScalingPolicy to api/structs and state store 2020-03-24 13:55:18 +00:00
Drew Bailey 10f3b6899b
rename struct field to auditor 2020-03-23 20:09:01 -04:00
Drew Bailey cf5fcf3748
make auditor interface more explicit 2020-03-23 19:32:58 -04:00
Mahmood Ali 61c42034d5 cli: show lifecycle info in alloc status
Display task lifecycle info in `nomad alloc status <alloc_id>` output.
I chose to embed it in the Task header and only add it for tasks with
lifecycle info.

Also, I chose to order the tasks in the following order:

1. prestart non-sidecar tasks
2. prestart sidecar tasks
3. main tasks

The tasks are sorted lexicographically within each tier.

Sample output:

```
$ nomad alloc status 6ec0eb52
ID                  = 6ec0eb52-e6c8-665c-169c-113d6081309b
Eval ID             = fb0caa98
Name                = lifecycle.cache[0]
[...]

Task "init" (prestart) is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/500 MHz  0 B/256 MiB  300 MiB
[...]

Task "some-sidecar" (prestart sidecar) is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  68 KiB/256 MiB  300 MiB
[...]

Task "redis" is "running"
Task Resources
CPU         Memory           Disk     Addresses
10/500 MHz  984 KiB/256 MiB  300 MiB
[...]
```
2020-03-23 15:57:24 -04:00
Drew Bailey d0d32d8f06
fix compilation with correct func 2020-03-23 14:32:11 -04:00
Tim Gross 076fbbf08f
Merge pull request #7012 from hashicorp/f-csi-volumes
Container Storage Interface Support
2020-03-23 14:19:46 -04:00
Lang Martin e100444740 csi: add mount_options to volumes and volume requests (#7398)
Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host.

Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context.

closes #7007

* nomad/structs/volumes: add MountOptions to volume request

* jobspec/test-fixtures/basic.hcl: add mount_options to volume block

* jobspec/parse_test: add expected MountOptions

* api/tasks: add mount_options

* jobspec/parse_group: use hcl decode not mapstructure, mount_options

* client/allocrunner/csi_hook: pass MountOptions through

client/allocrunner/csi_hook: add a VolumeMountOptions

client/allocrunner/csi_hook: drop Options

client/allocrunner/csi_hook: use the structs options

* client/pluginmanager/csimanager/interface: UsageOptions.MountOptions

* client/pluginmanager/csimanager/volume: pass MountOptions in capabilities

* plugins/csi/plugin: remove todo 7007 comment

* nomad/structs/csi: MountOptions

* api/csi: add options to the api for parsing, match structs

* plugins/csi/plugin: move VolumeMountOptions to structs

* api/csi: use specific type for mount_options

* client/allocrunner/csi_hook: merge MountOptions here

* rename CSIOptions to CSIMountOptions

* client/allocrunner/csi_hook

* client/pluginmanager/csimanager/volume

* nomad/structs/csi

* plugins/csi/fake/client: add PrevVolumeCapability

* plugins/csi/plugin

* client/pluginmanager/csimanager/volume_test: remove debugging

* client/pluginmanager/csimanager/volume: fix odd merging logic

* api: rename CSIOptions -> CSIMountOptions

* nomad/csi_endpoint: remove a 7007 comment

* command/alloc_status: show mount options in the volume list

* nomad/structs/csi: include MountOptions in the volume stub

* api/csi: add MountOptions to stub

* command/volume_status_csi: clean up csiVolMountOption, add it

* command/alloc_status: csiVolMountOption lives in volume_csi_status

* command/node_status: display mount flags

* nomad/structs/volumes: npe

* plugins/csi/plugin: npe in ToCSIRepresentation

* jobspec/parse_test: expand volume parse test cases

* command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions

* command/volume_status_csi: copy paste error

* jobspec/test-fixtures/basic: hclfmt

* command/volume_status_csi: clean up csiVolMountOption
2020-03-23 13:59:25 -04:00
Lang Martin 3621df1dbf csi: volume ids are only unique per namespace (#7358)
* nomad/state/schema: use the namespace compound index

* scheduler/scheduler: CSIVolumeByID interface signature namespace

* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace

* scheduler/feasible: pass the captured namespace to CSIVolumeByID

* nomad/state/state_store: use namespace in csi_volume index

* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim

* nomad/core_sched: pass the namespace in volumeClaimReap

* nomad/node_endpoint_test: namespaces in Claim testing

* nomad/csi_endpoint: pass RequestNamespace to state.*

* nomad/csi_endpoint_test: appropriately failed test

* command/alloc_status_test: appropriately failed test

* node_endpoint_test: avoid notTheNamespace for the job

* scheduler/feasible_test: call SetJob to capture the namespace

* nomad/csi_endpoint: ACL check the req namespace, query by namespace

* nomad/state/state_store: remove deregister namespace check

* nomad/state/state_store: remove unused CSIVolumes

* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace

* nomad/csi_endpoint: ACL check

* nomad/state/state_store_test: remove call to state.CSIVolumes

* nomad/core_sched_test: job namespace match so claim gc works
2020-03-23 13:59:25 -04:00
Lang Martin 99841222ed csi: change the API paths to match CLI command layout (#7325)
* command/agent/csi_endpoint: support type filter in volumes & plugins

* command/agent/http: use /v1/volume/csi & /v1/plugin/csi

* api/csi: use /v1/volume/csi & /v1/plugin/csi

* api/nodes: use /v1/volume/csi & /v1/plugin/csi

* api/nodes: not /volumes/csi, just /volumes

* command/agent/csi_endpoint: fix ot parameter parsing
2020-03-23 13:58:30 -04:00
Lang Martin 80619137ab csi: volumes listed in `nomad node status` (#7318)
* api/allocations: GetTaskGroup finds the taskgroup struct

* command/node_status: display CSI volume names

* nomad/state/state_store: new CSIVolumesByNodeID

* nomad/state/iterator: new SliceIterator type implements memdb.ResultIterator

* nomad/csi_endpoint: deal with a slice of volumes

* nomad/state/state_store: CSIVolumesByNodeID return a SliceIterator

* nomad/structs/csi: CSIVolumeListRequest takes a NodeID

* nomad/csi_endpoint: use the return iterator

* command/agent/csi_endpoint: parse query params for CSIVolumes.List

* api/nodes: new CSIVolumes to list volumes by node

* command/node_status: use the new list endpoint to print volumes

* nomad/state/state_store: error messages consider the operator

* command/node_status: include the Provider
2020-03-23 13:58:30 -04:00
Tim Gross de4ad6ca38 csi: add Provider field to CSI CLIs and APIs (#7285)
Derive a provider name and version for plugins (and the volumes that
use them) from the CSI identity API `GetPluginInfo`. Expose the vendor
name as `Provider` in the API and CLI commands.
2020-03-23 13:58:30 -04:00
Lang Martin 887e1f28c9 csi: CLI for volume status, registration/deregistration and plugin status (#7193)
* command/csi: csi, csi_plugin, csi_volume

* helper/funcs: move ExtraKeys from parse_config to UnusedKeys

* command/agent/config_parse: use helper.UnusedKeys

* api/csi: annotate CSIVolumes with hcl fields

* command/csi_plugin: add Synopsis

* command/csi_volume_register: use hcl.Decode style parsing

* command/csi_volume_list

* command/csi_volume_status: list format, cleanup

* command/csi_plugin_list

* command/csi_plugin_status

* command/csi_volume_deregister

* command/csi_volume: add Synopsis

* api/contexts/contexts: add csi search contexts to the constants

* command/commands: register csi commands

* api/csi: fix struct tag for linter

* command/csi_plugin_list: unused struct vars

* command/csi_plugin_status: unused struct vars

* command/csi_volume_list: unused struct vars

* api/csi: add allocs to CSIPlugin

* command/csi_plugin_status: format the allocs

* api/allocations: copy Allocation.Stub in from structs

* nomad/client_rpc: add some error context with Errorf

* api/csi: collapse read & write alloc maps to a stub list

* command/csi_volume_status: cleanup allocation display

* command/csi_volume_list: use Schedulable instead of Healthy

* command/csi_volume_status: use Schedulable instead of Healthy

* command/csi_volume_list: sprintf string

* command/csi: delete csi.go, csi_plugin.go

* command/plugin: refactor csi components to sub-command plugin status

* command/plugin: remove csi

* command/plugin_status: remove csi

* command/volume: remove csi

* command/volume_status: split out csi specific

* helper/funcs: add RemoveEqualFold

* command/agent/config_parse: use helper.RemoveEqualFold

* api/csi: do ,unusedKeys right

* command/volume: refactor csi components to `nomad volume`

* command/volume_register: split out csi specific

* command/commands: use the new top level commands

* command/volume_deregister: hardwired type csi for now

* command/volume_status: csiFormatVolumes rescued from volume_list

* command/plugin_status: avoid a panic on no args

* command/volume_status: avoid a panic on no args

* command/plugin_status: predictVolumeType

* command/volume_status: predictVolumeType

* nomad/csi_endpoint_test: move CreateTestPlugin to testing

* command/plugin_status_test: use CreateTestCSIPlugin

* nomad/structs/structs: add CSIPlugins and CSIVolumes search consts

* nomad/state/state_store: add CSIPlugins and CSIVolumesByIDPrefix

* nomad/search_endpoint: add CSIPlugins and CSIVolumes

* command/plugin_status: move the header to the csi specific

* command/volume_status: move the header to the csi specific

* nomad/state/state_store: CSIPluginByID prefix

* command/status: rename the search context to just Plugins/Volumes

* command/plugin,volume_status: test return ids now

* command/status: rename the search context to just Plugins/Volumes

* command/plugin_status: support -json and -t

* command/volume_status: support -json and -t

* command/plugin_status_csi: comments

* command/*_status: clean up text

* api/csi: fix stale comments

* command/volume: make deregister sound less fearsome

* command/plugin_status: set the id length

* command/plugin_status_csi: more compact plugin health

* command/volume: better error message, comment
2020-03-23 13:58:30 -04:00
Tim Gross 016281135c storage: add volumes to 'nomad alloc status' CLI (#7256)
Adds a stanza for both Host Volumes and CSI Volumes to the the CLI
output for `nomad alloc status`. Mostly relies on information already
in the API structs, but in the case where there are CSI Volumes we
need to make extra API calls to get the volume status. To reduce
overhead, these extra calls are hidden behind the `-verbose` flag.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 15c6c05ccf api: Parse CSI Volumes
Previously when deserializing volumes we skipped over volumes that were
not of type `host`. This commit ensures that we parse both host and csi
volumes correctly.
2020-03-23 13:58:30 -04:00
Lang Martin 88316208a0 csi: server-side plugin state tracking and api (#6966)
* structs: CSIPlugin indexes jobs acting as plugins and node updates

* schema: csi_plugins table for CSIPlugin

* nomad: csi_endpoint use vol.Denormalize, plugin requests

* nomad: csi_volume_endpoint: rename to csi_endpoint

* agent: add CSI plugin endpoints

* state_store_test: use generated ids to avoid t.Parallel conflicts

* contributing: add note about registering new RPC structs

* command: agent http register plugin lists

* api: CSI plugin queries, ControllerHealthy -> ControllersHealthy

* state_store: copy on write for volumes and plugins

* structs: copy on write for volumes and plugins

* state_store: CSIVolumeByID returns an unhealthy volume, denormalize

* nomad: csi_endpoint use CSIVolumeDenormalizePlugins

* structs: remove struct errors for missing objects

* nomad: csi_endpoint return nil for missing objects, not errors

* api: return meta from Register to avoid EOF error

* state_store: CSIVolumeDenormalize keep allocs in their own maps

* state_store: CSIVolumeDeregister error on missing volume

* state_store: CSIVolumeRegister set indexes

* nomad: csi_endpoint use CSIVolumeDenormalizePlugins tests
2020-03-23 13:58:29 -04:00
Lang Martin 2f646fa5e9 agent: csi endpoint 2020-03-23 13:58:29 -04:00
Danielle Lancashire 8fb312e48e node_status: Add CSI Summary info
This commit introduces two new fields to the basic output of `nomad
node status <node-id>`.

1) "CSI Controllers", which displays the names of registered controller
plugins.

2) "CSI Drivers", which displays the names of registered CSI Node
plugins.

However, it does not implement support for verbose output, such as
including health status or other fingerprinted data.
2020-03-23 13:58:29 -04:00
Danielle Lancashire 426c26d7c0 CSI Plugin Registration (#6555)
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:

* A `csi_plugin` stanza as part of a Nomad task configuration, to
  allow a task to expose that it is a plugin.

* A new task runner hook: `csi_plugin_supervisor`. This hook does two
  things. When the `csi_plugin` stanza is detected, it will
  automatically configure the plugin task to receive bidirectional
  mounts to the CSI intermediary directory. At runtime, it will then
  perform an initial heartbeat of the plugin and handle submitting it to
  the new `dynamicplugins.Registry` for further use by the client, and
  then run a lightweight heartbeat loop that will emit task events
  when health changes.

* The `dynamicplugins.Registry` for handling plugins that run
  as Nomad tasks, in contrast to the existing catalog that requires
  `go-plugin` type plugins and to know the plugin configuration in
  advance.

* The `csimanager` which fingerprints CSI plugins, in a similar way to
  `drivermanager` and `devicemanager`. It currently only fingerprints
  the NodeID from the plugin, and assumes that all plugins are
  monolithic.

Missing features

* We do not use the live updates of the `dynamicplugin` registry in
  the `csimanager` yet.

* We do not deregister the plugins from the client when they shutdown
  yet, they just become indefinitely marked as unhealthy. This is
  deliberate until we figure out how we should manage deploying new
  versions of plugins/transitioning them.
2020-03-23 13:58:28 -04:00
Drew Bailey b09abef332
Audit config, seams for enterprise audit features
allow oss to parse sink duration

clean up audit sink parsing

ent eventer config reload

fix typo

SetEnabled to eventer interface

client acl test

rm dead code

fix failing test
2020-03-23 13:47:42 -04:00
Jasmine Dahilig 73a64e4397 change jobspec lifecycle stanza to use sidecar attribute instead of
block_until status
2020-03-21 17:52:57 -04:00
Jasmine Dahilig 1485b342e2 remove deadline code for now 2020-03-21 17:52:56 -04:00
Jasmine Dahilig 7b3f3497ed mock task hook coordinator in consul integration test 2020-03-21 17:52:55 -04:00
Jasmine Dahilig 34f8055f39 remove logging debug line from cli 2020-03-21 17:52:49 -04:00
Mahmood Ali c0f59ea06e minor improvement 2020-03-21 17:52:44 -04:00
Jasmine Dahilig bc78d6b64d add lifecycle info to alloc status short 2020-03-21 17:52:42 -04:00
Jasmine Dahilig fc13fa9739 change TaskLifecycle RunLevel to Hook and add Deadline time duration 2020-03-21 17:52:37 -04:00
Mahmood Ali 4ebeac721a update structs with lifecycle 2020-03-21 17:52:36 -04:00
Mahmood Ali 3b5786ddb3 add lifecycle to api and parser 2020-03-21 17:52:36 -04:00
James Rasell ef469e1a6e
Merge pull request #7379 from hashicorp/b-fix-agent-cmd--dev-connect-help
cli: fix indentation issue with -dev-connect agent help output.
2020-03-19 08:34:45 +01:00
James Rasell e3d14cc634
cli: fix indentation issue with -dev-connect agent help output. 2020-03-18 12:25:20 +01:00
Derek Strickland b1490fe2dd update log output to clarify that nodes were filtered out rather than down 2020-03-17 14:45:11 -04:00
Michael Schurter b72b3e765c
Merge pull request #7170 from fredrikhgrelland/consul_template_upgrade
Update consul-template to v0.24.1 and remove deprecated vault grace
2020-03-10 14:15:47 -07:00
Mahmood Ali 19f25f588f
Merge pull request #7252 from hashicorp/b-test-cluster-forming
Simplify Bootstrap logic in tests
2020-03-03 16:56:08 -05:00
Mahmood Ali acbfeb5815 Simplify Bootstrap logic in tests
This change updates tests to honor `BootstrapExpect` exclusively when
forming test clusters and removes test only knobs, e.g.
`config.DevDisableBootstrap`.

Background:

Test cluster creation is fragile.  Test servers don't follow the
BootstapExpected route like production clusters.  Instead they start as
single node clusters and then get rejoin and may risk causing brain
split or other test flakiness.

The test framework expose few knobs to control those (e.g.
`config.DevDisableBootstrap` and `config.Bootstrap`) that control
whether a server should bootstrap the cluster.  These flags are
confusing and it's unclear when to use: their usage in multi-node
cluster isn't properly documented.  Furthermore, they have some bad
side-effects as they don't control Raft library: If
`config.DevDisableBootstrap` is true, the test server may not
immediately attempt to bootstrap a cluster, but after an election
timeout (~50ms), Raft may force a leadership election and win it (with
only one vote) and cause a split brain.

The knobs are also confusing as Bootstrap is an overloaded term.  In
BootstrapExpect, we refer to bootstrapping the cluster only after N
servers are connected.  But in tests and the knobs above, it refers to
whether the server is a single node cluster and shouldn't wait for any
other server.

Changes:

This commit makes two changes:

First, it relies on `BootstrapExpected` instead of `Bootstrap` and/or
`DevMode` flags.  This change is relatively trivial.

Introduce a `Bootstrapped` flag to track if the cluster is bootstrapped.
This allows us to keep `BootstrapExpected` immutable.  Previously, the
flag was a config value but it gets set to 0 after cluster bootstrap
completes.
2020-03-02 13:47:43 -05:00
Mahmood Ali 386f20099b Honor CNI and bridge related fields
Nomad agent may silently ignore cni_path and bridge setting, when it
merges configs from multiple files (or against default/dev config).

This PR ensures that the values are merged properly.
2020-02-28 14:23:13 -05:00
Mahmood Ali 437d03779c tests: add tests for parsing cni fields 2020-02-28 14:18:45 -05:00
Fredrik Hoem Grelland edb3bd0f3f Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170) 2020-02-23 16:24:53 +01:00
Seth Hoenig 0f99cdd0d9
Merge pull request #7192 from hashicorp/b-connect-stanza-ignore
consul/connect: in-place update sidecar service registrations on changes
2020-02-21 09:24:53 -06:00
Seth Hoenig 54b5173eca consul/connect: in-place update sidecar service registrations on changes
Fix a bug where consul service definitions would not be updated if changes
were made to the service in the Nomad job. Currently this only fixes the
bug for cases where the fix is a matter of updating consul agent's service
registration. There is related bug where destructive changes are required
(see #6877) which will be fixed in another PR.

The enable_tag_override configuration setting for the parent service is
applied to the sidecar service.

Fixes #6459
2020-02-19 13:07:04 -06:00
Mahmood Ali f4d8e1296f
Merge pull request #7171 from hashicorp/update-autopilot-20200214
Update consul vendor and add MinQuorum flag
2020-02-19 10:45:20 -06:00
Drew Bailey 3c0719274c
inlude pro in http_oss.go 2020-02-18 10:29:28 -05:00
Mahmood Ali 98ad59b1de update rest of consul packages 2020-02-16 16:25:04 -06:00
Mahmood Ali f492ab6d9e implement MinQuorum 2020-02-16 16:04:59 -06:00
Mahmood Ali fd51982018 tests: Avoid StartAsLeader raft config flag
It's being deprecated
2020-02-13 18:56:53 -05:00
Seth Hoenig 543354aabe
Merge pull request #7106 from hashicorp/f-ctag-override
client: enable configuring enable_tag_override for services
2020-02-13 12:34:48 -06:00
Michael Schurter 8c332a3757
Merge pull request #7102 from hashicorp/test-limits
Fix some race conditions and flaky tests
2020-02-13 10:19:11 -08:00
Seth Hoenig 7f33b92e0b command: use consistent CONSUL_HTTP_TOKEN name
Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.
2020-02-12 10:42:33 -06:00
Seth Hoenig 0e44094d1a client: enable configuring enable_tag_override for services
Consul provides a feature of Service Definitions where the tags
associated with a service can be modified through the Catalog API,
overriding the value(s) configured in the agent's service configuration.

To enable this feature, the flag enable_tag_override must be configured
in the service definition.

Previously, Nomad did not allow configuring this flag, and thus the default
value of false was used. Now, it is configurable.

Because Nomad itself acts as a state machine around the the service definitions
of the tasks it manages, it's worth describing what happens when this feature
is enabled and why.

Consider the basic case where there is no Nomad, and your service is provided
to consul as a boring JSON file. The ultimate source of truth for the definition
of that service is the file, and is stored in the agent. Later, Consul performs
"anti-entropy" which synchronizes the Catalog (stored only the leaders). Then
with enable_tag_override=true, the tags field is available for "external"
modification through the Catalog API (rather than directly configuring the
service definition file, or using the Agent API). The important observation
is that if the service definition ever changes (i.e. the file is changed &
config reloaded OR the Agent API is used to modify the service), those
"external" tag values are thrown away, and the new service definition is
once again the source of truth.

In the Nomad case, Nomad itself is the source of truth over the Agent in
the same way the JSON file was the source of truth in the example above.
That means any time Nomad sets a new service definition, any externally
configured tags are going to be replaced. When does this happen? Only on
major lifecycle events, for example when a task is modified because of an
updated job spec from the 'nomad job run <existing>' command. Otherwise,
Nomad's periodic re-sync's with Consul will now no longer try to restore
the externally modified tag values (as long as enable_tag_override=true).

Fixes #2057
2020-02-10 08:00:55 -06:00
Michael Schurter 65d38d9255 test: fix flaky TestHTTP_FreshClientAllocMetrics 2020-02-07 15:50:53 -08:00
Michael Schurter 9d3093fa31 test: fix missing agent shutdowns 2020-02-07 15:50:53 -08:00
Michael Schurter d96ceee8c5 testagent: fix case where agent would retry forever 2020-02-07 15:50:53 -08:00
Michael Schurter e903501e65 test: improve error messages when failing 2020-02-07 15:50:53 -08:00
Michael Schurter 63032917fc test: allow goroutine to exit even if test blocks 2020-02-07 15:50:53 -08:00
Michael Schurter 9905dec6a3 test: workaround limits race 2020-02-07 15:50:53 -08:00
Michael Schurter 19a1932bbb test: wait longer than timeout
The 1s timeout raced with the 1s deadline it was trying to detect.
2020-02-07 15:50:53 -08:00
Michael Schurter fd81208db7 test: fix flaky health test
Test set Agent.client=nil which prevented the client from being
shutdown. This leaked goroutines and could cause panics due to the
leaked client goroutines logging after their parent test had finished.

Removed ACLs from the server test because I couldn't get it to work with
the test agent, and it tested very little.
2020-02-07 15:50:53 -08:00
Michael Schurter 2896f78f77 client: fix race accessing Node.status
* Call Node.Canonicalize once when Node is created.
 * Lock when accessing fields mutated by node update goroutine
2020-02-07 15:50:47 -08:00
Drew Bailey d830998572
agent Profile req nil check s.agent.Server()
clean up logic and tests
2020-02-03 13:20:05 -05:00
Drew Bailey c4f45f9bde
Fix panic when monitoring a local client node
Fixes a panic when accessing a.agent.Server() when agent is a client
instead. This pr removes a redundant ACL check since ACLs are validated
at the RPC layer. It also nil checks the agent server and uses Client()
when appropriate.
2020-02-03 13:20:04 -05:00
Seth Hoenig 78a7d1e426 comments: cleanup some leftover debug comments and such 2020-01-31 19:04:35 -06:00
Seth Hoenig 076cb4754e agent: re-enable the server in dev mode 2020-01-31 19:04:19 -06:00
Seth Hoenig 8219c78667 nomad: handle SI token revocations concurrently
Be able to revoke SI token accessors concurrently, and also
ratelimit the requests being made to Consul for the various
ACL API uses.
2020-01-31 19:04:14 -06:00
Seth Hoenig 2c7ac9a80d nomad: fixup token policy validation 2020-01-31 19:04:08 -06:00
Seth Hoenig 9df33f622f nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Seth Hoenig f030a22c7c command, docs: create and document consul token configuration for connect acls (gh-6716)
This change provides an initial pass at setting up the configuration necessary to
enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul
Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert`
commands (similar to Vault tokens).

These values are not actually used yet in this changeset.
2020-01-31 19:02:53 -06:00
Michael Schurter c82b14b0c4 core: add limits to unauthorized connections
Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:

 * `{https,rpc}_handshake_timeout`
 * `{http,rpc}_max_conns_per_client`

The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to first byte being read so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.

The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.

All limits are configurable and may be disabled by setting them to `0`.

This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
2020-01-30 10:38:25 -08:00
Mahmood Ali 9611324654
Merge pull request #6922 from hashicorp/b-alloc-canoncalize
Handle Upgrades and Alloc.TaskResources modification
2020-01-28 15:12:41 -05:00
Mahmood Ali 90cae566e5
Merge pull request #6935 from hashicorp/b-default-preemption-flag
scheduler: allow configuring default preemption for system scheduler
2020-01-28 15:11:06 -05:00
Mahmood Ali af17b4afc7 Support customizing full scheduler config 2020-01-28 14:51:42 -05:00
Nick Ethier 5636203d4e consul: fix var name from rebase 2020-01-27 14:00:19 -05:00
Nick Ethier 0ae99b3c9c consul: fix var name from rebase 2020-01-27 12:55:52 -05:00
Nick Ethier 5cbb94e16e consul: add support for canary meta 2020-01-27 09:53:30 -05:00
Danielle 5fd52171aa
cli: add system command and subcmds to interact with system API. (#6924)
cli: add system command and subcmds to interact with system API.
2020-01-13 16:16:08 +01:00
Mahmood Ali 1ab682f622 scheduler: allow configuring default preemption for system scheduler
Some operators want a greater control over when preemption is enabled,
especially during an upgrade to limit potential side-effects.
2020-01-13 08:30:49 -05:00
James Rasell 4e48217a4e
cli: add system command and subcmds to interact with system API.
The system command includes gc and reconcile-summaries subcommands
which covers all currently available system API calls. The help
information is largely pulled from the current Nomad website API
documentation.
2020-01-13 11:34:46 +01:00
Drew Bailey f97d2e96c1
refactor api profile methods
comment why we ignore errors parsing params
2020-01-09 15:15:12 -05:00
Drew Bailey b702dede49
adds qc param, address pr feedback 2020-01-09 15:15:11 -05:00
Drew Bailey 085659f6ff
condense table test 2020-01-09 15:15:10 -05:00