open-nomad

Commit Graph

Author	SHA1	Message	Date
Mahmood Ali	e76ff9f679	Merge pull request #7543 from hashicorp/test-flakiness-20200330_1 Test flakiness fixes - 2020-03-30 Edition	2020-03-30 09:26:26 -04:00
Mahmood Ali	57bebfdb5c	tests: avoid logging after test completion	2020-03-30 09:08:34 -04:00
Mahmood Ali	13381448e0	avoid logging in draining job watcher In tests where the logger is a test logger, emitting a trace log in a background thread while it's shutting down may trigger a panic. Thus avoid logging Trace if err != nil. Note that we already log an error when err isn't a trace. This fixes cases where tests panic with a trace like: ``` panic: Log in goroutine after TestAllocGarbageCollector_MakeRoomFor_MaxAllocs has completed goroutine 30 [running]: testing.(common).logDepth(0xc000aa9e60, 0xc000c4a000, 0xab, 0x3) /usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:680 +0x4d3 testing.(common).log(...) /usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:662 testing.(common).Logf(0xc000aa9e60, 0x690b941, 0x4, 0xc001366c00, 0x2, 0x2) /usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:701 +0x7e github.com/hashicorp/nomad/helper/testlog.(writer).Write(0xc000a82a60, 0xc0000b48c0, 0xab, 0x13f, 0x0, 0x0, 0x0) /Users/notnoop/go/src/github.com/hashicorp/nomad/helper/testlog/testlog.go:34 +0x106 github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(writer).Flush(0xc000a80900, 0xbf9870f000000001, 0x20a87556e, 0x8b12bc0) /Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/writer.go:29 +0x14f github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(intLogger).log(0xc000e2c180, 0xc0003b6880, 0x17, 0x1, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6) /Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:139 +0x15d github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(intLogger).Trace(0xc000e2c180, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6) /Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:446 +0x7a github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(interceptLogger).Trace(0xc0002f1ad0, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6) 
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/interceptlogger.go:48 +0x9c github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc0002f2380) /Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:147 +0x1125 created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher /Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:89 +0x1e3 FAIL github.com/hashicorp/nomad/client 10.605s FAIL ```	2020-03-30 07:06:53 -04:00
Mahmood Ali	36ad8ee2e0	tests: add debugging for TestAutopilot_RollingUpdate	2020-03-30 07:06:53 -04:00
Chris Baker	d6287c43b9	clean up some tests	2020-03-29 23:38:36 +00:00
Chris Baker	5e3c38be2f	state_store: * added method to retrieve all scaling policies for use in snapshotting, plus test * better testing for ScalingPoliciesByNamespace * added scaling policy snapshot persist and restore (and test of restore) manually tested snapshot restore. resolves #7539	2020-03-29 13:32:44 +00:00
Lang Martin	50ff9ccd44	csi: plugin deregistration on plugin job GC (#7502 ) * nomad/structs/csi: delete just one plugin type from a node * nomad/structs/csi: add DeleteAlloc * nomad/state/state_store: add deleteJobFromPlugin * nomad/state/state_store: use DeleteAlloc not DeleteNodeType * move CreateTestCSIPlugin to state to avoid an import cycle * nomad/state/state_store_test: delete a plugin by deleting its jobs * nomad/_test: move CreateTestCSIPlugin to state nomad/state/state_store: update one plugin per transaction * command/plugin_status_test: move CreateTestCSIPlugin * nomad: csi: handle nils CSIPlugin methods, clarity	2020-03-26 17:07:18 -04:00
Lang Martin	3375c92aa0	csi: make volume registration idempotent (#7490) If not in use and not changing external ids, it should not be an error to register a volume again. * nomad/state/state_store: make volume registration idempotent	2020-03-26 12:27:19 -04:00
Lang Martin	ea80330aaa	csi: nomad/structs: test volume denormalize without plugin (#7472)	2020-03-26 09:43:59 -04:00
Mahmood Ali	b33dbe539b	tests: TestCSIPluginEndpoint_ACLNamespaceAlloc is ent TestCSIPluginEndpoint_ACLNamespaceAlloc uses namespace features not present in OSS.	2020-03-25 08:45:44 -04:00
Mahmood Ali	281fc9837c	tests: relax index checks TestStateStore_Indexes specifically tests for the `nodes` index, but asserts on the exact number of indexes present in the state. This is fragile and will break almost every time we add a state index.	2020-03-25 08:45:38 -04:00
Mahmood Ali	ceed57b48f	per-task restart policy	2020-03-24 17:00:41 -04:00
Chris Baker	ffd79583f6	Merge pull request #7474 from hashicorp/f-scaling-changes-from-review more testing for scaling API	2020-03-24 15:32:10 -05:00
Chris Baker	c638c2c352	update RPC scaling endpoint tests to use renamed 'scale' policy disposition	2020-03-24 20:18:12 +00:00
Chris Baker	5979d6a81e	more testing for ScalingPolicy, mainly around parsing and canonicalization for Min/Max	2020-03-24 19:43:50 +00:00
Chris Baker	aa5beafe64	Job.Scale should not result in job update or eval create if args.Count == nil plus tests	2020-03-24 17:36:06 +00:00
Tim Gross	913da68296	csi: remove client from plugin on client node update (#7462) Plugins track the client nodes where they are placed. On client updates, remove the client from the plugin tracking if the client is no longer running an instance of that controller/node plugin. Extends the state store tests to ensure deregistration works as expected and that controllers and nodes are being tracked independently.	2020-03-24 13:26:31 -04:00
Chris Baker	9e530e167d	Merge pull request #7409 from hashicorp/scaling-api Scaling API changes	2020-03-24 11:02:09 -05:00
Chris Baker	606c79b320	add acl validation to Scaling.ListPolicies and Scaling.GetPolicy	2020-03-24 14:39:05 +00:00
Chris Baker	f6ec5f9624	made count optional during job scaling actions; added ACL protection in Job.Scale; in Job.Scale, only perform a Job.Register if the Count was non-nil	2020-03-24 14:39:05 +00:00
Chris Baker	41b002eecc	wip: ACL checking for RPC Job.ScaleStatus	2020-03-24 14:39:05 +00:00
Lang Martin	bd22afd003	csi: volume deregister fails for volumes actively in use (#7445) * nomad/structs/csi: add InUse to CSIVolume * nomad/state/state_store: block volume deregistration for in use vols	2020-03-24 10:10:44 -04:00
Chris Baker	233db5258a	changes to Canonicalize, Validate, and api->struct conversion so that tg.Count, tg.Scaling.Min/Max are well-defined with reasonable defaults. - tg.Count defaults to tg.Scaling.Min if present (falls back on previous default of 1 if Scaling is absent) - Validate() enforces tg.Scaling.Min <= tg.Count <= tg.Scaling.Max modification in ApiScalingPolicyToStructs, api.TaskGroup.Validate so that defaults are handled for TaskGroup.Count and	2020-03-24 13:57:17 +00:00
Chris Baker	f9876a487e	finished Job.ScaleStatus RPC, need to work on http endpoint	2020-03-24 13:57:16 +00:00
Chris Baker	925b59e1d2	wip: scaling status return, almost done	2020-03-24 13:57:15 +00:00
James Rasell	f125b5fb2d	scaling: ensure min and max int64s are in toplevel of block.	2020-03-24 13:57:15 +00:00
Chris Baker	42270d862c	wip: some tests still failing updating job scaling endpoints to match RFC, cleaning up the API object as well	2020-03-24 13:57:14 +00:00
Chris Baker	abc7a52f56	finished refactoring state store, schema, etc	2020-03-24 13:57:14 +00:00
Chris Baker	116aa98ed7	wip: removed some commented junk from scaling poc	2020-03-24 13:57:13 +00:00
Chris Baker	3d54f1feba	wip: added Enabled to ScalingPolicyListStub, removed JobID from body of scaling request	2020-03-24 13:57:12 +00:00
Chris Baker	024d203267	wip: added tests for client methods around group scaling	2020-03-24 13:57:11 +00:00
Chris Baker	179ab68258	wip: added job.scale rpc endpoint, needs explicit test (tested via http now)	2020-03-24 13:57:09 +00:00
Chris Baker	8453e667c2	wip: working on job group scaling endpoint	2020-03-24 13:55:20 +00:00
Chris Baker	6665d0bfb0	wip: added policy get endpoint, added UUID to policy	2020-03-24 13:55:20 +00:00
Chris Baker	9c2560ceeb	wip: upsert/delete scaling policies on job upsert/delete	2020-03-24 13:55:18 +00:00
Chris Baker	65d92f1fbf	WIP: adding ScalingPolicy to api/structs and state store	2020-03-24 13:55:18 +00:00
Tim Gross	fa01a6ea59	csi: fix missing health count from volume list stub	2020-03-24 09:42:59 -04:00
Lang Martin	0847cb513c	csi: volume/plugin list should return an empty array, not nil (#7443) * nomad/csi_endpoint: return an empty list, not nil * nomad/csi_endpoint_test: volume list returns non-nil	2020-03-23 21:21:40 -04:00
Lang Martin	d994990ef0	csi: the scheduler allows a job with a volume write claim to be updated (#7438 ) * nomad/structs/csi: split CanWrite into health, in use * scheduler/scheduler: expose AllocByID in the state interface * nomad/state/state_store_test * scheduler/stack: SetJobID on the matcher * scheduler/feasible: when a volume writer is in use, check if it's us * scheduler/feasible: remove SetJob * nomad/state/state_store: denormalize allocs before Claim * nomad/structs/csi: return errors on claim, with context * nomad/csi_endpoint_test: new alloc doesn't look like an update * nomad/state/state_store_test: change test reference to CanWrite	2020-03-23 21:21:04 -04:00
Tim Gross	076fbbf08f	Merge pull request #7012 from hashicorp/f-csi-volumes Container Storage Interface Support	2020-03-23 14:19:46 -04:00
Lang Martin	e100444740	csi: add mount_options to volumes and volume requests (#7398 ) Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host. Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context. closes #7007 * nomad/structs/volumes: add MountOptions to volume request * jobspec/test-fixtures/basic.hcl: add mount_options to volume block * jobspec/parse_test: add expected MountOptions * api/tasks: add mount_options * jobspec/parse_group: use hcl decode not mapstructure, mount_options * client/allocrunner/csi_hook: pass MountOptions through client/allocrunner/csi_hook: add a VolumeMountOptions client/allocrunner/csi_hook: drop Options client/allocrunner/csi_hook: use the structs options * client/pluginmanager/csimanager/interface: UsageOptions.MountOptions * client/pluginmanager/csimanager/volume: pass MountOptions in capabilities * plugins/csi/plugin: remove todo 7007 comment * nomad/structs/csi: MountOptions * api/csi: add options to the api for parsing, match structs * plugins/csi/plugin: move VolumeMountOptions to structs * api/csi: use specific type for mount_options * client/allocrunner/csi_hook: merge MountOptions here * rename CSIOptions to CSIMountOptions * client/allocrunner/csi_hook * client/pluginmanager/csimanager/volume * nomad/structs/csi * plugins/csi/fake/client: add PrevVolumeCapability * plugins/csi/plugin * client/pluginmanager/csimanager/volume_test: remove debugging * client/pluginmanager/csimanager/volume: fix odd merging logic * api: rename CSIOptions -> CSIMountOptions * nomad/csi_endpoint: remove a 7007 comment * 
command/alloc_status: show mount options in the volume list * nomad/structs/csi: include MountOptions in the volume stub * api/csi: add MountOptions to stub * command/volume_status_csi: clean up csiVolMountOption, add it * command/alloc_status: csiVolMountOption lives in volume_csi_status * command/node_status: display mount flags * nomad/structs/volumes: npe * plugins/csi/plugin: npe in ToCSIRepresentation * jobspec/parse_test: expand volume parse test cases * command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions * command/volume_status_csi: copy paste error * jobspec/test-fixtures/basic: hclfmt * command/volume_status_csi: clean up csiVolMountOption	2020-03-23 13:59:25 -04:00
Lang Martin	6b6ae6c2bd	csi: ACLs for plugin endpoints (#7380 ) * acl/policy: add PolicyList for global ACLs * acl/acl: plugin policy * acl/acl: maxPrivilege is required to allow "list" * nomad/csi_endpoint: enforce plugin access with PolicyPlugin * nomad/csi_endpoint: check job ACL swapped params * nomad/csi_endpoint_test: test alloc filtering * acl/policy: add namespace csi-register-plugin * nomad/job_endpoint: check csi-register-plugin ACL on registration * nomad/job_endpoint_test: add plugin job cases	2020-03-23 13:59:25 -04:00
Lang Martin	b596e67f47	csi: implement volume ACLs (#7339 ) * acl/policy: add the volume ACL policies * nomad/csi_endpoint: enforce ACLs for volume access * nomad/search_endpoint_oss: volume acls * acl/acl: add plugin read as a global policy * acl/policy: add PluginPolicy global cap type * nomad/csi_endpoint: check the global plugin ACL policy * nomad/mock/acl: PluginPolicy * nomad/csi_endpoint: fix list rebase * nomad/core_sched_test: new test since #7358 * nomad/csi_endpoint_test: use correct permissions for list * nomad/csi_endpoint: allowCSIMount keeps ACL checks together * nomad/job_endpoint: check mount permission for jobs * nomad/job_endpoint_test: need plugin read, too	2020-03-23 13:59:25 -04:00
Lang Martin	3621df1dbf	csi: volume ids are only unique per namespace (#7358 ) * nomad/state/schema: use the namespace compound index * scheduler/scheduler: CSIVolumeByID interface signature namespace * scheduler/stack: SetJob on CSIVolumeChecker to capture namespace * scheduler/feasible: pass the captured namespace to CSIVolumeByID * nomad/state/state_store: use namespace in csi_volume index * nomad/fsm: pass namespace to CSIVolumeDeregister & Claim * nomad/core_sched: pass the namespace in volumeClaimReap * nomad/node_endpoint_test: namespaces in Claim testing * nomad/csi_endpoint: pass RequestNamespace to state.* * nomad/csi_endpoint_test: appropriately failed test * command/alloc_status_test: appropriately failed test * node_endpoint_test: avoid notTheNamespace for the job * scheduler/feasible_test: call SetJob to capture the namespace * nomad/csi_endpoint: ACL check the req namespace, query by namespace * nomad/state/state_store: remove deregister namespace check * nomad/state/state_store: remove unused CSIVolumes * scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace * nomad/csi_endpoint: ACL check * nomad/state/state_store_test: remove call to state.CSIVolumes * nomad/core_sched_test: job namespace match so claim gc works	2020-03-23 13:59:25 -04:00
Tim Gross	22e9f679c3	csi: implement controller detach RPCs (#7356 ) This changeset implements the remaining controller detach RPCs: server-to-client and client-to-controller. The tests also uncovered a bug in our RPC for claims which is fixed here; the volume claim RPC is used for both claiming and releasing a claim on a volume. We should only submit a controller publish RPC when the claim is new and not when it's being released.	2020-03-23 13:59:25 -04:00
Tim Gross	0cd2d3cc29	csi: make claims on volumes idempotent for the same alloc (#7328) Nomad clients will push node updates during client restart which can cause an extra claim for a volume by the same alloc. If an alloc already claims a volume, we can allow it to be treated as a valid claim and continue.	2020-03-23 13:58:30 -04:00
Lang Martin	6750c262a4	csi: use `ExternalID`, when set, to identify volumes for outside RPC calls (#7326) * nomad/structs/csi: new RemoteID() uses the ExternalID if set * nomad/csi_endpoint: pass RemoteID to volume request types * client/pluginmanager/csimanager/volume: pass RemoteID to NodePublishVolume	2020-03-23 13:58:30 -04:00
Lang Martin	80619137ab	csi: volumes listed in `nomad node status` (#7318 ) * api/allocations: GetTaskGroup finds the taskgroup struct * command/node_status: display CSI volume names * nomad/state/state_store: new CSIVolumesByNodeID * nomad/state/iterator: new SliceIterator type implements memdb.ResultIterator * nomad/csi_endpoint: deal with a slice of volumes * nomad/state/state_store: CSIVolumesByNodeID return a SliceIterator * nomad/structs/csi: CSIVolumeListRequest takes a NodeID * nomad/csi_endpoint: use the return iterator * command/agent/csi_endpoint: parse query params for CSIVolumes.List * api/nodes: new CSIVolumes to list volumes by node * command/node_status: use the new list endpoint to print volumes * nomad/state/state_store: error messages consider the operator * command/node_status: include the Provider	2020-03-23 13:58:30 -04:00
Lang Martin	de25fc6cf4	csi: csi-hostpath plugin unimplemented error on controller publish (#7299 ) * client/allocrunner/csi_hook: tag errors * nomad/client_csi_endpoint: tag errors * nomad/client_rpc: remove an unnecessary error tag * nomad/state/state_store: ControllerRequired fix intent We use ControllerRequired to indicate that a volume should use the publish/unpublish workflow, rather than that it has a controller. We need to check both RequiresControllerPlugin and SupportsAttachDetach from the fingerprint to check that. * nomad/csi_endpoint: tag errors * nomad/csi_endpoint_test: longer error messages, mock fingerprints	2020-03-23 13:58:30 -04:00
Tim Gross	b04d23dae0	csi: ensure volume query is idempotent (#7303) We denormalize the `CSIVolume` struct when we query it from the state store by getting the plugin and its health. But unless we copy the volume, this denormalization gets synced back to the state store without passing through the fsm (which is invalid).	2020-03-23 13:58:30 -04:00
Tim Gross	b57df162ce	csi: ensure GET for plugin is idempotent (#7298) We denormalize the `CSIPlugin` struct when we query it from the state store by getting the current set of allocations that provide the plugin. But unless we copy the plugin, this denormalization gets synced back to the state store and each time we query we'll add another copy of the current allocations.	2020-03-23 13:58:30 -04:00
Tim Gross	de4ad6ca38	csi: add Provider field to CSI CLIs and APIs (#7285) Derive a provider name and version for plugins (and the volumes that use them) from the CSI identity API `GetPluginInfo`. Expose the vendor name as `Provider` in the API and CLI commands.	2020-03-23 13:58:30 -04:00
Lang Martin	887e1f28c9	csi: CLI for volume status, registration/deregistration and plugin status (#7193 ) * command/csi: csi, csi_plugin, csi_volume * helper/funcs: move ExtraKeys from parse_config to UnusedKeys * command/agent/config_parse: use helper.UnusedKeys * api/csi: annotate CSIVolumes with hcl fields * command/csi_plugin: add Synopsis * command/csi_volume_register: use hcl.Decode style parsing * command/csi_volume_list * command/csi_volume_status: list format, cleanup * command/csi_plugin_list * command/csi_plugin_status * command/csi_volume_deregister * command/csi_volume: add Synopsis * api/contexts/contexts: add csi search contexts to the constants * command/commands: register csi commands * api/csi: fix struct tag for linter * command/csi_plugin_list: unused struct vars * command/csi_plugin_status: unused struct vars * command/csi_volume_list: unused struct vars * api/csi: add allocs to CSIPlugin * command/csi_plugin_status: format the allocs * api/allocations: copy Allocation.Stub in from structs * nomad/client_rpc: add some error context with Errorf * api/csi: collapse read & write alloc maps to a stub list * command/csi_volume_status: cleanup allocation display * command/csi_volume_list: use Schedulable instead of Healthy * command/csi_volume_status: use Schedulable instead of Healthy * command/csi_volume_list: sprintf string * command/csi: delete csi.go, csi_plugin.go * command/plugin: refactor csi components to sub-command plugin status * command/plugin: remove csi * command/plugin_status: remove csi * command/volume: remove csi * command/volume_status: split out csi specific * helper/funcs: add RemoveEqualFold * command/agent/config_parse: use helper.RemoveEqualFold * api/csi: do ,unusedKeys right * command/volume: refactor csi components to `nomad volume` * command/volume_register: split out csi specific * command/commands: use the new top level commands * command/volume_deregister: hardwired type csi for now * command/volume_status: 
csiFormatVolumes rescued from volume_list * command/plugin_status: avoid a panic on no args * command/volume_status: avoid a panic on no args * command/plugin_status: predictVolumeType * command/volume_status: predictVolumeType * nomad/csi_endpoint_test: move CreateTestPlugin to testing * command/plugin_status_test: use CreateTestCSIPlugin * nomad/structs/structs: add CSIPlugins and CSIVolumes search consts * nomad/state/state_store: add CSIPlugins and CSIVolumesByIDPrefix * nomad/search_endpoint: add CSIPlugins and CSIVolumes * command/plugin_status: move the header to the csi specific * command/volume_status: move the header to the csi specific * nomad/state/state_store: CSIPluginByID prefix * command/status: rename the search context to just Plugins/Volumes * command/plugin,volume_status: test return ids now * command/status: rename the search context to just Plugins/Volumes * command/plugin_status: support -json and -t * command/volume_status: support -json and -t * command/plugin_status_csi: comments * command/_status: clean up text api/csi: fix stale comments * command/volume: make deregister sound less fearsome * command/plugin_status: set the id length * command/plugin_status_csi: more compact plugin health * command/volume: better error message, comment	2020-03-23 13:58:30 -04:00
Tim Gross	b3bf64485e	csi: remove DevDisableBootstrap flag from tests (#7267) In #7252 we removed the `DevDisableBootstrap` flag to require tests to honor only `BootstrapExpect`, in order to reduce a source of test flakiness. This changeset applies the same fix to the CSI tests.	2020-03-23 13:58:30 -04:00
Lang Martin	369b0e54b9	csi: volumes use `Schedulable` rather than `Healthy` (#7250 ) * structs: add ControllerRequired, volume.Name, no plug.Type * structs: Healthy -> Schedulable * state_store: Healthy -> Schedulable * api: add ControllerRequired to api data types * api: copy csi structs changes * nomad/structs/csi: include name and external id * api/csi: include Name and ExternalID * nomad/structs/csi: comments for the 3 ids	2020-03-23 13:58:30 -04:00
Lang Martin	a4784ef258	csi add allocation context to fingerprinting results (#7133 ) * structs: CSIInfo include AllocID, CSIPlugins no Jobs * state_store: eliminate plugin Jobs, delete an empty plugin * nomad/structs/csi: detect empty plugins correctly * client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID * client/pluginmanager/csimanager/instance: allocID * client/pluginmanager/csimanager/fingerprint: set AllocID * client/node_updater: split controller and node plugins * api/csi: remove Jobs The CSI Plugin API will map plugins to allocations, which allows plugins to be defined by jobs in many configurations. In particular, multiple plugins can be defined in the same job, and multiple jobs can be used to define a single plugin. Because we now map the allocation context directly from the node, it's no longer necessary to track the jobs associated with a plugin directly. * nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint * client/dynamicplugins: lift AllocID into the struct from Options * api/csi_test: remove Jobs test * nomad/structs/csi: CSIPlugins has an array of allocs * nomad/state/state_store: implement CSIPluginDenormalize * nomad/state/state_store: CSIPluginDenormalize npe on missing alloc * nomad/csi_endpoint_test: defer deleteNodes for clarity * api/csi_test: disable this test awaiting mocks: https://github.com/hashicorp/nomad/issues/7123	2020-03-23 13:58:30 -04:00
Danielle Lancashire	e75f057df3	csi: Fix Controller RPCs Currently the handling of CSINode RPCs does not correctly handle forwarding RPCs to Nodes. This commit fixes this by introducing a shim RPC (nomad/client_csi_endpoint) that will correctly forward the request to the owning node, or submit the RPC to the client. In the process it also cleans up handling a little bit by adding the `CSIControllerQuery` embedded struct for required forwarding state. The CSIControllerQuery embedding the requirement of a `PluginID` also means we could move node targeting into the shim RPC if wanted in the future.	2020-03-23 13:58:30 -04:00
Tim Gross	8bc5641438	csi: volume claim garbage collection (#7125 ) When an alloc is marked terminal (and after node unstage/unpublish have been called), the client syncs the terminal alloc state with the server via `Node.UpdateAlloc RPC`. For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC handler at the server will emit an eval for a new core job to garbage collect CSI volume claims. When this eval is handled on the core scheduler, it will call a `volumeReap` method to release the claims for all terminal allocs on the job. The volume reap will issue a `ControllerUnpublishVolume` RPC for any node that has no alloc claiming the volume. Once this returns (or is skipped), the volume reap will send a new `CSIVolume.Claim` RPC that releases the volume claim for that allocation in the state store, making it available for scheduling again. This same `volumeReap` method will be called from the core job GC, which gives us a second chance to reclaim volumes during GC if there were controller RPC failures.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	9d4307a3ef	csi_endpoint: Provide AllocID in req, and return Volume Currently, the client has to ship an entire allocation to the server as part of performing a VolumeClaim. This has a few problems: First, it means the client is sending significantly more data than is required (an allocation contains the entire contents of a Nomad job, alongside other irrelevant state), which has a non-zero (de)serialization cost. Second, because the allocation was never re-fetched from the state store, we were potentially open to issues caused by stale state on a misbehaving or malicious client. The change removes both of those issues at the cost of a couple more state store lookups, but they should be relatively cheap. We also now provide the CSIVolume in the response for a claim, so the client can perform a Claim without first going ahead and fetching all of the volumes.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	c3b1154703	csi: Validate Volumes during registration This PR implements some initial support for doing deeper validation of a volume during its registration with the server. This allows us to validate the capabilities before users attempt to use the volumes in most cases, and also prevents registering volumes without first setting up a plugin, which should help to catch typos and the like during registration. This does have the downside of requiring users to wait for one instance of a plugin to be running in their cluster before they can register volumes.	2020-03-23 13:58:30 -04:00
Tim Gross	b03b78b212	csi: server-to-controller publish/unpublish RPCs (#7124) Nomad servers need to make requests to CSI controller plugins running on a client for publish/unpublish. The RPC needs to look up the client node based on the plugin, load balancing across controllers, and then perform the required client RPC to that node (via server forwarding if necessary).	2020-03-23 13:58:30 -04:00
Tim Gross	b9b315f8d1	csi: stub methods for server-to-controller RPC calls (#7117)	2020-03-23 13:58:30 -04:00
Danielle Lancashire	77bcaa8183	csi_endpoint: Support No ACLs and restrict Nodes This commit refactors the ACL code for the CSI endpoint to support environments that run without ACLs enabled (e.g. developer environments) and also provides an easy way to restrict which endpoints may be accessed with a client's SecretID to limit the blast radius of a malicious client on the state of the environment.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	22e8317a53	csi: Disable validation of volume topology	2020-03-23 13:58:30 -04:00
Tim Gross	01c704ab9d	csi: add PublishContext to CSIVolumeClaimResponse (#7113) The `ControllerPublishVolumeResponse` CSI RPC includes the publish context intended to be passed by the orchestrator as an opaque value to the node plugins. This changeset adds it to our response to a volume claim request to proxy the controller's response back to the client node.	2020-03-23 13:58:29 -04:00
Tim Gross	fb1aad66ee	csi: implement releasing volume claims for terminal allocs (#7076) When an alloc is marked terminal, and after node unstage/unpublish have been called, the client will sync the terminal alloc state with the server via `Node.UpdateAlloc` RPC. This changeset implements releasing the volume claim for each volume associated with the terminal alloc. It doesn't yet implement the RPC call we need to make to the `ControllerUnpublishVolume` CSI RPC.	2020-03-23 13:58:29 -04:00
Tim Gross	d4cd272de3	csi: implement VolumeClaimRPC (#7048) When the client receives an allocation which includes a CSI volume, the alloc runner will block its main `Run` loop. The alloc runner will issue a `VolumeClaim` RPC to the Nomad servers. This changeset implements the portions of the `VolumeClaim` RPC endpoint that have not been previously completed.	2020-03-23 13:58:29 -04:00
Lang Martin	421d7ed2e4	nomad: csi_endpoint send register & deregister requests to raft (#7059)	2020-03-23 13:58:29 -04:00
Lang Martin	7b675f89ac	csi: fix index maintenance for CSIVolume and CSIPlugin tables (#7049 ) * state_store: csi volumes/plugins store the index in the txn * nomad: csi_endpoint_test require index checks need uint64() * nomad: other tests using int 0 not uint64(0) * structs: pass index into New, but not other struct methods * state_store: csi plugin indexes, use new struct interface * nomad: csi_endpoint_test check index/query meta (on explicit 0) * structs: NewCSIVolume takes an index arg now * scheduler/test: NewCSIVolume takes an index arg now	2020-03-23 13:58:29 -04:00
Lang Martin	a0a6766740	CSI: Scheduler knows about CSI constraints and availability (#6995 ) * structs: piggyback csi volumes on host volumes for job specs * state_store: CSIVolumeByID always includes plugins, matches usecase * scheduler/feasible: csi volume checker * scheduler/stack: add csi volumes * contributing: update rpc checklist * scheduler: add volumes to State interface * scheduler/feasible: introduce new checker collection tgAvailable * scheduler/stack: taskGroupCSIVolumes checker is transient * state_store CSIVolumeDenormalizePlugins comment clarity * structs: remote TODO comment in TaskGroup Validate * scheduler/feasible: CSIVolumeChecker hasPlugins improve comment * scheduler/feasible_test: set t.Parallel * Update nomad/state/state_store.go Co-Authored-By: Danielle <dani@hashicorp.com> * Update scheduler/feasible.go Co-Authored-By: Danielle <dani@hashicorp.com> * structs: lift ControllerRequired to each volume * state_store: store plug.ControllerRequired, use it for volume health * feasible: csi match fast path remove stale host volume copied logic * scheduler/feasible: improve comments Co-authored-by: Danielle <dani@builds.terrible.systems>	2020-03-23 13:58:29 -04:00
Tim Gross	8673ea5cba	csi: add empty CSI volume publication GC to scheduled core jobs (#7014) This changeset adds a new core job `CoreJobCSIVolumePublicationGC` to the leader's loop for scheduling core job evals. Right now this is an empty method body without even a config file stanza. Later changesets will implement the logic of volume publication GC.	2020-03-23 13:58:29 -04:00
Lang Martin	88316208a0	csi: server-side plugin state tracking and api (#6966 ) * structs: CSIPlugin indexes jobs acting as plugins and node updates * schema: csi_plugins table for CSIPlugin * nomad: csi_endpoint use vol.Denormalize, plugin requests * nomad: csi_volume_endpoint: rename to csi_endpoint * agent: add CSI plugin endpoints * state_store_test: use generated ids to avoid t.Parallel conflicts * contributing: add note about registering new RPC structs * command: agent http register plugin lists * api: CSI plugin queries, ControllerHealthy -> ControllersHealthy * state_store: copy on write for volumes and plugins * structs: copy on write for volumes and plugins * state_store: CSIVolumeByID returns an unhealthy volume, denormalize * nomad: csi_endpoint use CSIVolumeDenormalizePlugins * structs: remove struct errors for missing objects * nomad: csi_endpoint return nil for missing objects, not errors * api: return meta from Register to avoid EOF error * state_store: CSIVolumeDenormalize keep allocs in their own maps * state_store: CSIVolumeDeregister error on missing volume * state_store: CSIVolumeRegister set indexes * nomad: csi_endpoint use CSIVolumeDenormalizePlugins tests	2020-03-23 13:58:29 -04:00
Lang Martin	61cfc806ad	csi_volume_endpoint_test: gen uuids to avoid t.Parallel conflicts	2020-03-23 13:58:29 -04:00
Lang Martin	334979a754	nomad/rpc: indicate missing region in error message	2020-03-23 13:58:29 -04:00
Lang Martin	5b31b140c3	csi: do not use namespace specific identifiers	2020-03-23 13:58:29 -04:00
Lang Martin	e922531aaf	structs: move the content of csi_volumes into csi	2020-03-23 13:58:29 -04:00
Lang Martin	04b6e7c7fb	server: rpc register CSIVolume	2020-03-23 13:58:29 -04:00
Lang Martin	8f33fb9a6d	csi volume endpoint: new RPC endpoint for CSI volumes	2020-03-23 13:58:29 -04:00
Lang Martin	4bb4dd98eb	state_store: CSIVolume insert, get, delete, claim state_store: change claim counts state_store: get volumes by all, by driver state_store: process volume claims state_store: csi volume register error on update	2020-03-23 13:58:29 -04:00
Lang Martin	0422b967db	schema: csi_volumes schema	2020-03-23 13:58:29 -04:00
Lang Martin	857cd37ab5	fsm: dispatch CSIVolume register, deregister, claim	2020-03-23 13:58:29 -04:00
Lang Martin	f9d9faf673	structs: eliminate MaxReaders & MaxWriters	2020-03-23 13:58:29 -04:00
Lang Martin	3a7e1b6d14	client structs: move CSIVolumeAttachmentMode and CSIVolumeAccessMode	2020-03-23 13:58:29 -04:00
Lang Martin	637ce9dfad	structs: new CSIVolume, request types	2020-03-23 13:58:29 -04:00
Danielle Lancashire	57ae1d2cd6	csimanager: Fingerprint Node Service capabilities	2020-03-23 13:58:29 -04:00
Danielle Lancashire	564f5cec93	csimanager: Fingerprint controller capabilities	2020-03-23 13:58:29 -04:00
Danielle Lancashire	426c26d7c0	CSI Plugin Registration (#6555 ) This changeset implements the initial registration and fingerprinting of CSI Plugins as part of #5378. At a high level, it introduces the following: * A `csi_plugin` stanza as part of a Nomad task configuration, to allow a task to expose that it is a plugin. * A new task runner hook: `csi_plugin_supervisor`. This hook does two things. When the `csi_plugin` stanza is detected, it will automatically configure the plugin task to receive bidirectional mounts to the CSI intermediary directory. At runtime, it will then perform an initial heartbeat of the plugin and handle submitting it to the new `dynamicplugins.Registry` for further use by the client, and then run a lightweight heartbeat loop that will emit task events when health changes. * The `dynamicplugins.Registry` for handling plugins that run as Nomad tasks, in contrast to the existing catalog that requires `go-plugin` type plugins and to know the plugin configuration in advance. * The `csimanager` which fingerprints CSI plugins, in a similar way to `drivermanager` and `devicemanager`. It currently only fingerprints the NodeID from the plugin, and assumes that all plugins are monolithic. Missing features * We do not use the live updates of the `dynamicplugin` registry in the `csimanager` yet. * We do not deregister the plugins from the client when they shutdown yet, they just become indefinitely marked as unhealthy. This is deliberate until we figure out how we should manage deploying new versions of plugins/transitioning them.	2020-03-23 13:58:28 -04:00
Drew Bailey	b09abef332	Audit config, seams for enterprise audit features allow oss to parse sink duration clean up audit sink parsing ent eventer config reload fix typo SetEnabled to eventer interface client acl test rm dead code fix failing test	2020-03-23 13:47:42 -04:00
Jasmine Dahilig	73a64e4397	change jobspec lifecycle stanza to use sidecar attribute instead of block_until status	2020-03-21 17:52:57 -04:00
Jasmine Dahilig	1485b342e2	remove deadline code for now	2020-03-21 17:52:56 -04:00
Jasmine Dahilig	d54a83afee	fix linting errors	2020-03-21 17:52:53 -04:00
Jasmine Dahilig	a0fe570317	clean up restore test	2020-03-21 17:52:52 -04:00
Jasmine Dahilig	7ed08eb75a	partial test for restore functionality	2020-03-21 17:52:52 -04:00
Jasmine Dahilig	81d051d7e8	fix bug in lifecycle scheduler test mocks	2020-03-21 17:52:51 -04:00
Jasmine Dahilig	b7f08c9d13	add appropriate lifecycle deadline default of 120s	2020-03-21 17:52:48 -04:00
Jasmine Dahilig	0cc9212a54	add test cases for scheduler alloc placement with lifecycle resources	2020-03-21 17:52:47 -04:00
Jasmine Dahilig	0d2988652c	add lifecycle job mock	2020-03-21 17:52:47 -04:00
Jasmine Dahilig	c27223207c	update task hook coordinator tests	2020-03-21 17:52:46 -04:00
Mahmood Ali	b880607bad	update scheduler to account for hooks	2020-03-21 17:52:45 -04:00
Jasmine Dahilig	12393f90e7	add test for lifecycle coordinator	2020-03-21 17:52:42 -04:00
Jasmine Dahilig	f6e58d6dad	add canonicalize in the right place	2020-03-21 17:52:41 -04:00
Jasmine Dahilig	4498c8c24f	add canonicalization	2020-03-21 17:52:39 -04:00
Jasmine Dahilig	67262d841b	add validation tests and more validation	2020-03-21 17:52:39 -04:00
Mahmood Ali	214d128bd9	it's running now	2020-03-21 17:52:37 -04:00
Jasmine Dahilig	fc13fa9739	change TaskLifecycle RunLevel to Hook and add Deadline time duration	2020-03-21 17:52:37 -04:00
Mahmood Ali	4ebeac721a	update structs with lifecycle	2020-03-21 17:52:36 -04:00
Yoan Blanc	67692789b7	vendor: vault api and sdk Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-21 17:57:48 +01:00
Mahmood Ali	53e20e5cc2	Deflake TestRPC_Limits_Streaming test The test starts enough connections to hit the limit, then closes the connection and immediately starts one expecting the new one to succeed. We must wait until the server side recognizes the connection closing and free up a limits slot. The current test attempts to achieve that by waiting to get an error on conn.Read, however, this error is returned from local client without waiting for server update. As such, I change the logic so it retries on connection rejection but force the first non-EOF failure to be a deadline error.	2020-03-20 17:21:43 -04:00
Mahmood Ali	0da7130a1a	Protect against args being modified	2020-03-18 08:11:16 -04:00
Mahmood Ali	52fd31af80	server: node connections must not be forwarded This fixes a bug where a forwarded node update request may be assumed to be the actual direct client connection if the server just lost leadership. When a nomad non-leader server receives a Node.UpdateStatus request, it forwards the RPC request to the leader, and holds on the request Yamux connection in a cache to allow for server<->client forwarding. When the leader handles the request, it must differentiate between a forwarded connection vs the actual connection. This is done in https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L412 Now, consider if the non-leader server forwards to the connection to a recently deposed nomad leader, which in turn forwards the RPC request to the new leader. Without this change, the deposed leader will mistake the forwarded connection for the actual client connection and cache it mapped to the client ID. If the server attempts to connect to that client, it will attempt to start a connection/session to the other server instead and the call will hang forever. This change ensures that we only add node connection mapping if the request is not a forwarded request, regardless of circumstances.	2020-03-17 16:39:01 -04:00
Mahmood Ali	9d88f1d568	tests: deflake deploymentwatcher package This deflakes the tests in the deploymentwatcher package. The package uses a mock deployment watcher backend, where the watcher in a background goroutine calls UpdateDeploymentStatus. If the mock isn't configured to expect the call, the background goroutine will fail. One UpdateDeploymentStatus call is made at the end of the background goroutine, which may occur after the test completes, thus explaining the flakiness.	2020-03-12 15:42:01 -04:00
Michael Schurter	2dcc85bed1	jobspec: fixup vault_grace deprecation Followup to #7170 - Moved canonicalization of VaultGrace back into `api/` package. - Fixed tests. - Made docs styling consistent.	2020-03-10 14:58:49 -07:00
Michael Schurter	b72b3e765c	Merge pull request #7170 from fredrikhgrelland/consul_template_upgrade Update consul-template to v0.24.1 and remove deprecated vault grace	2020-03-10 14:15:47 -07:00
Mahmood Ali	005bd37758	tests: deflake TestServer_ReconcileMember TestServer_ReconcileMember assumes that S3 isn't the leader: `reconcileMembers` call would fail when attempting to remove itself!	2020-03-06 14:14:41 -05:00
Mahmood Ali	17ee94b52b	fix typo	2020-03-03 16:55:54 -05:00
Mahmood Ali	acbfeb5815	Simplify Bootstrap logic in tests This change updates tests to honor `BootstrapExpect` exclusively when forming test clusters and removes test only knobs, e.g. `config.DevDisableBootstrap`. Background: Test cluster creation is fragile. Test servers don't follow the BootstapExpected route like production clusters. Instead they start as single node clusters and then get rejoin and may risk causing brain split or other test flakiness. The test framework expose few knobs to control those (e.g. `config.DevDisableBootstrap` and `config.Bootstrap`) that control whether a server should bootstrap the cluster. These flags are confusing and it's unclear when to use: their usage in multi-node cluster isn't properly documented. Furthermore, they have some bad side-effects as they don't control Raft library: If `config.DevDisableBootstrap` is true, the test server may not immediately attempt to bootstrap a cluster, but after an election timeout (~50ms), Raft may force a leadership election and win it (with only one vote) and cause a split brain. The knobs are also confusing as Bootstrap is an overloaded term. In BootstrapExpect, we refer to bootstrapping the cluster only after N servers are connected. But in tests and the knobs above, it refers to whether the server is a single node cluster and shouldn't wait for any other server. Changes: This commit makes two changes: First, it relies on `BootstrapExpected` instead of `Bootstrap` and/or `DevMode` flags. This change is relatively trivial. Introduce a `Bootstrapped` flag to track if the cluster is bootstrapped. This allows us to keep `BootstrapExpected` immutable. Previously, the flag was a config value but it gets set to 0 after cluster bootstrap completes.	2020-03-02 13:47:43 -05:00
Fredrik Hoem Grelland	edb3bd0f3f	Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170 )	2020-02-23 16:24:53 +01:00
Seth Hoenig	0f99cdd0d9	Merge pull request #7192 from hashicorp/b-connect-stanza-ignore consul/connect: in-place update sidecar service registrations on changes	2020-02-21 09:24:53 -06:00
Seth Hoenig	07b9b24ceb	nomad: note why AddressMode is not part of CSD hash Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2020-02-21 09:24:42 -06:00
Seth Hoenig	54b5173eca	consul/connect: in-place update sidecar service registrations on changes Fix a bug where consul service definitions would not be updated if changes were made to the service in the Nomad job. Currently this only fixes the bug for cases where the fix is a matter of updating consul agent's service registration. There is related bug where destructive changes are required (see #6877) which will be fixed in another PR. The enable_tag_override configuration setting for the parent service is applied to the sidecar service. Fixes #6459	2020-02-19 13:07:04 -06:00
Mahmood Ali	98ad59b1de	update rest of consul packages	2020-02-16 16:25:04 -06:00
Mahmood Ali	f492ab6d9e	implement MinQuorum	2020-02-16 16:04:59 -06:00
Mahmood Ali	3dcc65d58d	Update consul autopilot dependency	2020-02-16 15:41:43 -06:00
Mahmood Ali	cf53ee57cd	remove unused dropButLastChannel	2020-02-13 18:56:53 -05:00
Mahmood Ali	fd51982018	tests: Avoid StartAsLeader raft config flag It's being deprecated	2020-02-13 18:56:53 -05:00
Mahmood Ali	367133a399	Use latest raft patterns	2020-02-13 18:56:52 -05:00
Seth Hoenig	543354aabe	Merge pull request #7106 from hashicorp/f-ctag-override client: enable configuring enable_tag_override for services	2020-02-13 12:34:48 -06:00
Michael Schurter	8c332a3757	Merge pull request #7102 from hashicorp/test-limits Fix some race conditions and flaky tests	2020-02-13 10:19:11 -08:00
Mahmood Ali	bc70beeb4a	Merge pull request #7044 from hashicorp/f-use-multiplexv2 rpc: Use MultiplexV2 for connections	2020-02-13 12:07:20 -05:00
Drew Bailey	24a5d36fcf	Merge pull request #7112 from hashicorp/f-include-pro-tag include pro tag in several oss.go files	2020-02-13 11:26:41 -05:00
Seth Hoenig	2829b4cd23	Merge pull request #7129 from hashicorp/b-consistent-ct-name command: use consistent CONSUL_HTTP_TOKEN name	2020-02-12 12:27:46 -06:00
Seth Hoenig	7f33b92e0b	command: use consistent CONSUL_HTTP_TOKEN name Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same. Note that consul-template uses CONSUL_TOKEN, which Nomad also uses, so be careful to preserve any reference to that in the consul-template context.	2020-02-12 10:42:33 -06:00
Seth Hoenig	ce50345b7a	nomad: assert consul token is unset on job register in tests	2020-02-12 10:17:42 -06:00
Seth Hoenig	02151dee45	nomad: unset consul token on job register	2020-02-12 09:58:51 -06:00
Drew Bailey	6bd6c6638c	include pro tag in several oss.go files	2020-02-10 15:56:14 -05:00
Seth Hoenig	0e44094d1a	client: enable configuring enable_tag_override for services Consul provides a feature of Service Definitions where the tags associated with a service can be modified through the Catalog API, overriding the value(s) configured in the agent's service configuration. To enable this feature, the flag enable_tag_override must be configured in the service definition. Previously, Nomad did not allow configuring this flag, and thus the default value of false was used. Now, it is configurable. Because Nomad itself acts as a state machine around the service definitions of the tasks it manages, it's worth describing what happens when this feature is enabled and why. Consider the basic case where there is no Nomad, and your service is provided to consul as a boring JSON file. The ultimate source of truth for the definition of that service is the file, and is stored in the agent. Later, Consul performs "anti-entropy" which synchronizes the Catalog (stored only on the leaders). Then with enable_tag_override=true, the tags field is available for "external" modification through the Catalog API (rather than directly configuring the service definition file, or using the Agent API). The important observation is that if the service definition ever changes (i.e. the file is changed & config reloaded OR the Agent API is used to modify the service), those "external" tag values are thrown away, and the new service definition is once again the source of truth. In the Nomad case, Nomad itself is the source of truth over the Agent in the same way the JSON file was the source of truth in the example above. That means any time Nomad sets a new service definition, any externally configured tags are going to be replaced. When does this happen? Only on major lifecycle events, for example when a task is modified because of an updated job spec from the 'nomad job run <existing>' command. Otherwise, Nomad's periodic re-syncs with Consul will now no longer try to restore the externally modified tag values (as long as enable_tag_override=true). Fixes #2057	2020-02-10 08:00:55 -06:00
Michael Schurter	c5073f61a7	test: add timeout to ease debugging	2020-02-07 15:50:53 -08:00
Michael Schurter	9905dec6a3	test: workaround limits race	2020-02-07 15:50:53 -08:00
Michael Schurter	14c5ef3a8d	test: fix race around reused default rpc addr The default RPC addr was a global which is fine for normal runtime use when it only has a single user. However many tests modify it and cause races. Follow our convention of returning defaults from funcs instead of using globals.	2020-02-07 15:50:53 -08:00
Mahmood Ali	e106d373b2	rpc: Use MultiplexV2 for connections MultiplexV2 is a new connection multiplex header that supports multiplex both RPC and streaming requests over the same Yamux connection. MultiplexV2 was added in 0.8.0 as part of https://github.com/hashicorp/nomad/pull/3892 . So Nomad 0.11 can expect it to be supported. Though, some more rigorous testing is required before merging this. I want to call out some implementation details: First, the current connection pool reuses the Yamux stream for multiple RPC calls, and doesn't close them until an error is encountered. This commit doesn't change it, and sets the `RpcNomad` byte only at stream creation. Second, the StreamingRPC session gets closed by callers and cannot be reused. Every StreamingRPC opens a new Yamux session.	2020-02-03 19:31:39 -05:00
Drew Bailey	9a65556211	add state store test to ensure PlacedCanaries is updated	2020-02-03 13:58:01 -05:00
Drew Bailey	f51a3d1f37	nomad state store must be modified through raft, rm local state change	2020-02-03 13:57:34 -05:00
Drew Bailey	74779f23e6	keep placed canaries aligned with alloc status	2020-02-03 13:57:33 -05:00
Michael Schurter	9bedd0202e	sentinel: copy jobs to prevent mutation It's unclear whether Sentinel code can mutate values passed to the eval, so ensure it cannot by copying the job.	2020-02-03 08:48:51 -05:00
Seth Hoenig	6bfa50acdc	nomad: remove unused default scheduler variable This is from a merge conflict resolution that went the wrong direction. I assumed the block had been added, but really it had been removed. Now, it is removed once again.	2020-01-31 19:06:37 -06:00
Seth Hoenig	d3cd6afd7e	nomad: min cluster version for connect ACLs is now v0.10.4	2020-01-31 19:06:19 -06:00
Seth Hoenig	587a5d4a8d	nomad: make TaskGroup.UsesConnect helper a public helper	2020-01-31 19:05:11 -06:00
Seth Hoenig	ee89a754f1	nomad: fix leftover missed refactoring in consul policy checking	2020-01-31 19:05:06 -06:00
Seth Hoenig	4ee55fcd6c	nomad,client: apply more comment/style PR tweaks	2020-01-31 19:04:52 -06:00
Seth Hoenig	be7c671919	nomad,client: apply smaller PR suggestions Apply smaller suggestions like doc strings, variable names, etc. Co-Authored-By: Nick Ethier <nethier@hashicorp.com> Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2020-01-31 19:04:40 -06:00
Seth Hoenig	78a7d1e426	comments: cleanup some leftover debug comments and such	2020-01-31 19:04:35 -06:00
Seth Hoenig	8219c78667	nomad: handle SI token revocations concurrently Be able to revoke SI token accessors concurrently, and also ratelimit the requests being made to Consul for the various ACL API uses.	2020-01-31 19:04:14 -06:00
Seth Hoenig	2c7ac9a80d	nomad: fixup token policy validation	2020-01-31 19:04:08 -06:00
Seth Hoenig	9df33f622f	nomad: proxy requests for Service Identity tokens between Clients and Consul Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run & connect, etc. This changeset enables the Nomad Server to receive RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect using Tasks. Those tokens are then relayed back to the requesting Client, which then injects the tokens in the secrets directory of the Task.	2020-01-31 19:03:53 -06:00
Seth Hoenig	93cf770edb	client: enable nomad client to request and set SI tokens for tasks When a job is configured with Consul Connect aware tasks (i.e. sidecar), the Nomad Client should be able to request from Consul (through Nomad Server) Service Identity tokens specific to those tasks.	2020-01-31 19:03:38 -06:00
Seth Hoenig	2b66ce93bb	nomad: ensure a unique ClusterID exists when leader (gh-6702) Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.	2020-01-31 19:03:26 -06:00
Seth Hoenig	f030a22c7c	command, docs: create and document consul token configuration for connect acls (gh-6716) This change provides an initial pass at setting up the configuration necessary to enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert` commands (similar to Vault tokens). These values are not actually used yet in this changeset.	2020-01-31 19:02:53 -06:00
Michael Schurter	dd7712795d	Merge branch 'master' into b-tls-validation	2020-01-30 11:05:15 -08:00
Mahmood Ali	a9f551542d	Merge pull request #160 from hashicorp/b-mtls-hostname server: validate role and region for RPC w/ mTLS	2020-01-30 12:59:17 -06:00
Michael Schurter	c82b14b0c4	core: add limits to unauthorized connections Introduce limits to prevent unauthorized users from exhausting all ephemeral ports on agents: * `{https,rpc}_handshake_timeout` * `{http,rpc}_max_conns_per_client` The handshake timeout closes connections that have not completed the TLS handshake by the deadline (5s by default). For RPC connections this timeout also separately applies to first byte being read so RPC connections with TLS enabled have `rpc_handshake_time * 2` as their deadline. The connection limit per client prevents a single remote TCP peer from exhausting all ephemeral ports. The default is 100, but can be lowered to a minimum of 26. Since streaming RPC connections create a new TCP connection (until MultiplexV2 is used), 20 connections are reserved for Raft and non-streaming RPCs to prevent connection exhaustion due to streaming RPCs. All limits are configurable and may be disabled by setting them to `0`. This also includes a fix that closes connections that attempt to create TLS RPC connections recursively. While only users with valid mTLS certificates could perform such an operation, it was added as a safeguard to prevent programming errors before they could cause resource exhaustion.	2020-01-30 10:38:25 -08:00
Drew Bailey	da4af9bef3	fix tests, update changelog	2020-01-29 13:55:39 -05:00
Drew Bailey	a61bf32314	Allow nomad monitor command to lookup server UUID Allows addressing servers with nomad monitor using the servers name or ID. Also unifies logic for addressing servers for client_agent_endpoint commands and makes addressing logic region aware. rpc getServer test	2020-01-29 13:55:29 -05:00
Mahmood Ali	9611324654	Merge pull request #6922 from hashicorp/b-alloc-canoncalize Handle Upgrades and Alloc.TaskResources modification	2020-01-28 15:12:41 -05:00
Mahmood Ali	90cae566e5	Merge pull request #6935 from hashicorp/b-default-preemption-flag scheduler: allow configuring default preemption for system scheduler	2020-01-28 15:11:06 -05:00
Mahmood Ali	af17b4afc7	Support customizing full scheduler config	2020-01-28 14:51:42 -05:00
Mahmood Ali	f7a51a14c6	Merge pull request #6977 from hashicorp/b-leadership-flapping-2 Handle Nomad leadership flapping (attempt 2)	2020-01-28 11:40:41 -05:00
Mahmood Ali	687d2b7054	tests: defer closing shutdownCh	2020-01-28 09:53:48 -05:00
Mahmood Ali	ded4233c27	tweak leadership flapping log messages	2020-01-28 09:49:36 -05:00
Mahmood Ali	79823ae07d	handle channel close signal Always deliver last value then send close signal.	2020-01-28 09:44:34 -05:00
Mahmood Ali	d202924a93	include test and address review comments	2020-01-28 09:06:52 -05:00
Nick Ethier	5cbb94e16e	consul: add support for canary meta	2020-01-27 09:53:30 -05:00
Mahmood Ali	e436d2701a	Handle Nomad leadership flapping Fixes a deadlock in leadership handling if leadership flapped. Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[1] and does not move on to executing follower related logic (in `raft.runFollower`). While Raft `runLeader` defer function blocks, raft cannot process any other raft operations. For example, `run{Leader\|Follower}` methods consume `raft.applyCh`, and while runLeader defer is blocked, all raft log applications or config lookup will block indefinitely. Sadly, `leaderLoop` and `establishLeader` makes few Raft calls! `establishLeader` attempts to auto-create autopilot/scheduler config [3]; and `leaderLoop` attempts to check raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is invoked and hit any of these Raft calls, Raft handler _deadlock_ forever. Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get in the following case: * Agent metrics/stats http and RPC calls hang as they check raft.Configurations * raft.State remains in Leader state, and server attempts to handle RPC calls (e.g. node/alloc updates) and these hang as well As we create goroutines per RPC call, the number of goroutines grow over time and may trigger a out of memory errors in addition to missed updates. [1] `d90d6d6bda/config.go (L190-L193)` [2] `d90d6d6bda/raft.go (L425-L436)` [3] `2a89e47746/nomad/leader.go (L198-L202)` [4] `2a89e47746/nomad/leader.go (L877)`	2020-01-22 13:08:34 -05:00
Mahmood Ali	129c884105	extract leader step function	2020-01-22 10:55:48 -05:00
Mahmood Ali	f36cc54efd	actually always canonicalize alloc.Job alloc.Job may be stale as well and need to migrate it. It does cost extra cycles but should be negligible.	2020-01-15 09:02:48 -05:00
Mahmood Ali	b1b714691c	address review comments	2020-01-15 08:57:05 -05:00
Mahmood Ali	1ab682f622	scheduler: allow configuring default preemption for system scheduler Some operators want a greater control over when preemption is enabled, especially during an upgrade to limit potential side-effects.	2020-01-13 08:30:49 -05:00
Drew Bailey	ff4bfb8809	Merge pull request #6841 from hashicorp/f-agent-pprof-acl Remote agent pprof endpoints	2020-01-10 14:52:39 -05:00
Mahmood Ali	bfa33cf471	canonicalize allocs from plan results too	2020-01-10 10:41:12 -05:00
Nick Ethier	1f28633954	Merge pull request #6816 from hashicorp/b-multiple-envoy connect: configure envoy to support multiple sidecars in the same alloc	2020-01-09 23:25:39 -05:00
Drew Bailey	b702dede49	adds gc param, address pr feedback	2020-01-09 15:15:11 -05:00
Drew Bailey	45210ed901	Rename profile package to pprof Address pr feedback, rename profile package to pprof to more accurately describe its purpose. Adds gc param for heap lookup profiles.	2020-01-09 15:15:10 -05:00
Drew Bailey	1b8af920f3	address pr feedback	2020-01-09 15:15:09 -05:00
Drew Bailey	4ced73875b	leave acl checking to rpc endpoints fix test expectation test wrapNonJSON	2020-01-09 15:15:08 -05:00
Drew Bailey	279512c7f8	provide helpful error, cleanup logic	2020-01-09 15:15:08 -05:00
Drew Bailey	7bbba613a5	prevent doubly wrapping with rpc error	2020-01-09 15:15:07 -05:00
Drew Bailey	fd42020ad6	RPC server EnableDebug option Passes in agent enable_debug config to nomad server and client configs. This allows for rpc endpoints to have more granular control if they should be enabled or not in combination with ACLs. enable debug on client test	2020-01-09 15:15:07 -05:00
Drew Bailey	31c0aca10a	rename forward func, add comment for why we forward	2020-01-09 15:15:06 -05:00
Drew Bailey	9a80938fb1	region forwarding; prevent recursive forwards for impossible requests prevent region forwarding loop, backfill tests fix failing test	2020-01-09 15:15:06 -05:00
Drew Bailey	46121fe3fd	move shared structs out of client and into nomad	2020-01-09 15:15:05 -05:00
Drew Bailey	aec81a0b99	api agent endpoints helper func to return serverPart based off of serverID	2020-01-09 15:15:05 -05:00
Drew Bailey	3672414888	test pprof headers and profile methods tidy up, add comments clean up seconds param assignment	2020-01-09 15:15:04 -05:00
Drew Bailey	fc37448683	warn when enabled debug is on when registering m -> a receiver name return codederrors, fix query	2020-01-09 15:15:04 -05:00
Drew Bailey	50288461c9	Server request forwarding for Agent.Profile Return rpc errors for profile requests, set up remote forwarding to target leader or server id for profile requests. server forwarding, endpoint tests	2020-01-09 15:15:03 -05:00
Mahmood Ali	d740d347ce	Migrate old alloc structs on read This commit ensures that Alloc.AllocatedResources is properly populated when read from persistence stores (namely Raft and client state store). The alloc struct may have been written previously by an arbitrary old version that may only populate Alloc.TaskResources.	2020-01-09 08:46:50 -05:00
James Rasell	df2dc48790	Fix error parsing config when setting consul.timeout. (#6907 ) When parsing a config file which had the consul.timeout param set, Nomad was reporting an error causing startup to fail. This seems to be caused by the HCL decoder interpreting the timeout type as an int rather than a string. This is caused by the struct TimeoutHCL param having a hcl key of timeout alongside a Timeout struct param of type time.Duration (int). Ensuring the decoder ignores the Timeout struct param ensure the decoder runs correctly.	2020-01-07 13:40:55 -05:00
Nick Ethier	677e9cdc16	connect: configure envoy such that multiple sidecars can run in the same alloc	2020-01-06 11:26:27 -05:00
Michael Schurter	92cdc9de01	nomad/state: remove dead upgrade path code It is uncalled so there should be no runtime changes.	2019-12-20 11:10:22 -08:00
Drew Bailey	d9e41d2880	docs for shutdown delay update docs, address pr comments ensure pointer is not nil use pointer for diff tests, set vs unset	2019-12-16 11:38:35 -05:00
Drew Bailey	ae145c9a37	allow only positive shutdown delay more explicit test case, remove select statement	2019-12-16 11:38:30 -05:00
Drew Bailey	24929776a2	shutdown delay for task groups copy struct values ensure groupserviceHook implements RunnerPreKillhook run deregister first test that shutdown times are delayed move magic number into variable	2019-12-16 11:38:16 -05:00
Seth Hoenig	270233e23d	tests: remove trace statements from nodeDrainWatcher.watch Avoid logging in the `watch` function as much as possible, since it is not waited on during a server shutdown. When the logger logs after a test passes, it may or may not cause the testing framework to panic. More info in: https://github.com/golang/go/issues/29388#issuecomment-453648436	2019-12-16 07:08:11 -06:00
Michael Schurter	95fd2643d7	connect: canonicalize before adding sidecar Fixes #6853 Canonicalize jobs first before adding any sidecars. This fixes a bug where sidecar tasks were added without interpolated names and broke validation. Sidecar tasks must be canonicalized independently. Also adds a group network to the mock connect job because it wasn't a valid connect job before!	2019-12-12 20:55:56 -08:00
Seth Hoenig	d45dec1ca8	tests: parallelize state store tests It has been decided we're going to live in a many core world. Let's take advantage of that and parallelize these state store tests which all run in memory and are largely CPU bound. An unscientific benchmark demonstrating the improvement: [mp state (master)] $ go test PASS ok github.com/hashicorp/nomad/nomad/state 5.162s [mp state (f-parallelize-state-store-tests)] $ go test PASS ok github.com/hashicorp/nomad/nomad/state 1.527s	2019-12-11 09:36:37 -06:00
Seth Hoenig	f0c3dca49c	tests: swap lib/freeport for tweaked helper/freeport Copy the updated version of freeport (sdk/freeport), and tweak it for use in Nomad tests. This means staying below port 10000 to avoid conflicts with the lib/freeport that is still transitively used by the old version of consul that we vendor. Also provide implementations to find ephemeral ports of macOS and Windows environments. Ports acquired through freeport are supposed to be returned to freeport, which this change now also introduces. Many tests are modified to include calls to a cleanup function for Server objects. This should help quite a bit with some flakey tests, but not all of them. Our port problems will not go away completely until we upgrade our vendor version of consul. With Go modules, we'll probably do a 'replace' to swap out other copies of freeport with the one now in 'nomad/helper/freeport'.	2019-12-09 08:37:32 -06:00
Danielle Lancashire	d2075ebae9	spellcheck: Fix spelling of retrieve	2019-12-05 18:59:47 -06:00
Mahmood Ali	b3e557cae3	address feedback review apply `s/requestAuthToken/requestACLToken/g`	2019-11-26 08:39:04 -05:00
Mahmood Ali	02e20c720b	acl_endpoint: permission denied for unauthenticated requests If ACL Request is unauthenticated, we should honor the anonymous token. This PR makes a few changes: * `GetPolicy` endpoints may return policy if anonymous policy allows it, or return permission denied otherwise. * `ListPolicies` returns an empty policy list, or one with anonymous policy if one exists. Without this PR, we return an incomprehensible error. Before: ``` $ curl http://localhost:4646/v1/acl/policy/doesntexist; echo acl token lookup failed: index error: UUID must be 36 characters $ curl http://localhost:4646/v1/acl/policies; echo acl token lookup failed: index error: UUID must be 36 characters ``` After: ``` $ curl http://localhost:4646/v1/acl/policy/doesntexist; echo Permission denied $ curl http://localhost:4646/v1/acl/policies; echo [] ```	2019-11-22 08:43:09 -05:00
Michael Schurter	4b6762511d	Merge pull request #6021 from hashicorp/f-anonymous-policy-access api: Update policy endpoint to permit anonymous access	2019-11-20 15:33:45 -08:00
Buck Doyle	5fcc00d0f9	Add gofmt changes	2019-11-20 12:47:01 -06:00
Buck Doyle	dc9c0d5ead	Add explanatory comment	2019-11-20 11:45:44 -06:00
Buck Doyle	bea9837510	Remove extraneous else block	2019-11-20 11:37:45 -06:00
Buck Doyle	d6a3e571bd	Remove extraneous whitespace	2019-11-20 11:37:01 -06:00
Buck Doyle	db77a24ed3	Merge branch 'master' into f-policy-json	2019-11-20 11:20:07 -06:00
Mahmood Ali	7fb4c35831	comments and casing	2019-11-19 16:03:55 -05:00
Mahmood Ali	97d0fd009d	404 if token isn't found	2019-11-19 15:52:53 -05:00
Mahmood Ali	6f8bb5e90b	api: acl bootstrap errors aren't 500 Noticed that ACL endpoints return 500 status code for user errors. This is confusing and can lead to false monitoring alerts. Here, I introduce a concept of RPCCoded errors to be returned by RPC that signal a code in addition to error message. Codes for now match HTTP codes to ease reasoning. ``` $ nomad acl bootstrap Error bootstrapping: Unexpected response code: 500 (ACL bootstrap already done (reset index: 9)) $ nomad acl bootstrap Error bootstrapping: Unexpected response code: 400 (ACL bootstrap already done (reset index: 9)) ```	2019-11-19 15:51:57 -05:00
Michael Schurter	796758b8a5	core: add semver constraint The existing version constraint uses logic optimized for package managers, not schedulers, when checking prereleases: - 1.3.0-beta1 will not satisfy ">= 0.6.1" - 1.7.0-rc1 will not satisfy ">= 1.6.0-beta1" This is due to package managers wishing to favor final releases over prereleases. In a scheduler versions more often represent the earliest release all required features/APIs are available in a system. Whether the constraint or the version being evaluated are prereleases has no impact on ordering. This commit adds a new constraint - `semver` - which will use Semver v2.0 ordering when evaluating constraints. Given the above examples: - 1.3.0-beta1 satisfies ">= 0.6.1" using `semver` - 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver` Since existing jobspecs may rely on the old behavior, a new constraint was added and the implicit Consul Connect and Vault constraints were updated to use it.	2019-11-19 08:40:19 -08:00
Nick Ethier	bd454a4c6f	client: improve group service stanza interpolation and check_re… (#6586 ) * client: improve group service stanza interpolation and check_restart support Interpolation can now be done on group service stanzas. Note that some task runtime specific information that was previously available when the service was registered poststart of a task is no longer available. The check_restart stanza for checks defined on group services will now properly restart the allocation upon check failures if configured.	2019-11-18 13:04:01 -05:00
Luiz Aoqui	e499c5bddc	Merge pull request #6698 from hashicorp/f-add-drain-start-time api: add `StartedAt` in `Node.DrainStrategy`	2019-11-15 15:38:38 -05:00
Luiz Aoqui	e862b61daa	api: use the same initial time for all drain properties	2019-11-14 16:06:09 -05:00
Drew Bailey	9b63828658	serverID to target remote leader or server handle the case where we request a server-id which is this current server update docs, error on node and server id params more accurate names for tests use shared no leader err, formatting rm bad comment remove redundant variable	2019-11-14 10:07:35 -05:00
Drew Bailey	b644e1f47d	add server-id to monitor specific server	2019-11-14 09:53:41 -05:00
Drew Bailey	2185c1a89e	Allows monitor to target leader server Allows user to pass in node-id=leader to forward the monitor request to a remote leader.	2019-11-14 09:53:40 -05:00
Luiz Aoqui	5bd7cdd5c3	api: add `StartedAt` in `Node.DrainStrategy`	2019-11-13 17:54:40 -05:00
Lars Lehtonen	22a3c21dd0	nomad: fix dropped test error	2019-11-13 12:49:41 -08:00
Michael Schurter	08afb7d605	vault: allow overriding implicit vault constraint There's a bug in version parsing that breaks this constraint when using a prerelease enterprise version of Vault (eg 1.3.0-beta1+ent). While this does not fix the underlying bug it does provide a workaround for future issues related to the implicit constraint. Like the implicit Connect constraint: all implicit constraints should be overridable to allow users to workaround bugs or other factors should the need arise.	2019-11-12 12:26:36 -08:00
Mahmood Ali	c4c37cb42e	vault: check token_explicit_max_ttl as well Vault 1.2.0 deprecated `explicit_max_ttl` in favor of `token_explicit_max_ttl`.	2019-11-12 08:47:23 -05:00
Lars Lehtonen	adbab29228	nomad: TestEvalBroker_Dequeue_Empty_Timeout() proper goroutine error handling (#6657 )	2019-11-08 14:35:06 -05:00
Drew Bailey	7420446458	Merge pull request #6639 from hashicorp/return-after-forward return after request has been forwarded	2019-11-08 09:48:35 -05:00
Lars Lehtonen	39b68e0b88	TestEvalBroker_Dequeue_Blocked() proper goroutine error handling (#6651 ) TestEvalBroker_Dequeue_Blocked() improve test readability	2019-11-08 08:52:23 -05:00
Nick Ethier	e947aaed4f	nomad: fix bug that didn't allow for multiple connect services in same tg	2019-11-08 04:33:39 -05:00
Lars Lehtonen	6deae70e35	TestEvalBroker_PauseResumeNackTimeout() proper goroutine error handling (#6649 ) TestEvalBroker_PauseResumeNackTimeout() improve test readability	2019-11-07 16:04:59 -05:00
Lars Lehtonen	2638cbb31d	nomad: TestEvalBroker_EnqueueAll_Dequeue_Fair() proper goroutine error handling (#6636 ) nomad: TestEvalBroker_EnqueueAll_Dequeue_Fair() improve test readability	2019-11-07 10:39:29 -05:00
Drew Bailey	a5e2e1805f	return after request has been forwarded	2019-11-07 08:33:53 -05:00
Lars Lehtonen	e64f98837c	nomad: fix dropped error in TestJobEndpoint_Deregister_ACL (#6602 )	2019-11-06 16:40:45 -05:00
Drew Bailey	f4a7e3dc75	coordinate closing of doneCh, use interface to simplify callers comments	2019-11-05 11:44:26 -05:00
Drew Bailey	fe542680dc	log-json -> json fix typo command/agent/monitor/monitor.go Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com> Update command/agent/monitor/monitor.go Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com> address feedback, lock to prevent send on closed channel fix lock/unlock for dropped messages	2019-11-05 09:51:59 -05:00
Drew Bailey	298b8358a9	move forwarded monitor request into helper	2019-11-05 09:51:56 -05:00
Drew Bailey	8726b685de	address feedback	2019-11-05 09:51:56 -05:00
Drew Bailey	0e759c401c	moving endpoints over to frames	2019-11-05 09:51:54 -05:00
Drew Bailey	17d876d5ef	rename function, initialize log level better underscores instead of dashes for query params	2019-11-05 09:51:53 -05:00
Drew Bailey	8178beecf0	address feedback, use agent_endpoint instead of monitor	2019-11-05 09:51:53 -05:00
Drew Bailey	db65b1f4a5	agent:read acl policy for monitor	2019-11-05 09:51:52 -05:00
Drew Bailey	2533617888	rpc acl tests for both monitor endpoints	2019-11-05 09:51:51 -05:00
Drew Bailey	3c33747e1f	client monitor endpoint tests	2019-11-05 09:51:50 -05:00
Drew Bailey	4bc68855d0	use intercepting loggers for rpchandlers	2019-11-05 09:51:50 -05:00
Drew Bailey	3b9c33a5f0	new hclog with standardlogger intercept	2019-11-05 09:51:49 -05:00
Drew Bailey	a45ae1cd58	enable json formatting, use queryoptions	2019-11-05 09:51:49 -05:00
Drew Bailey	786989dbe3	New monitor pkg for shared monitor functionality Adds new package that can be used by client and server RPC endpoints to facilitate monitoring based off of a logger clean up old code small comment about write rm old comment about minsize rename to Monitor Removes connection logic from monitor command Keep connection logic in endpoints, use a channel to send results from monitoring use new multisink logger and interfaces small test for dropped messages update go-hclogger and update sink/intercept logger interfaces	2019-11-05 09:51:49 -05:00
Lars Lehtonen	0a4542fadc	nomad: fix test goroutine (#6593 )	2019-10-31 08:23:32 -04:00
Seth Hoenig	98592113a3	Merge pull request #6582 from hashicorp/b-vault-createToken-log-msg nomad: fix vault.CreateToken log message printing wrong error	2019-10-29 17:35:05 -05:00
Mahmood Ali	7f2e4dc5d8	Merge pull request #6574 from hashicorp/b-gh-6570-vault-role-validation vault: honor new `token_period` in vault token role	2019-10-29 10:18:59 -04:00
Seth Hoenig	838c6e3329	nomad: fix vault.CreateToken log message printing wrong error Fixes typo in word "failed". Fixes bug where incorrect error is printed. The old code would only ever print a nil error, instead of the validationErr which is being created.	2019-10-28 23:05:32 -05:00
Mahmood Ali	c5d8d66787	Fix admissionValidators `admissionValidators` doesn't aggregate errors correctly, as it aggregates errors in `errs` reference yet it always returns the nil `err`. Here, we avoid shadowing `err`, and move variable declarations to where they are used.	2019-10-28 10:52:53 -04:00
Mahmood Ali	abb930249a	consul connect: do basic validation before mutating job `groupConnectHook` assumes that Networks is a non-empty slice, but TG hasn't been validated yet and validation may depend on mutation results. As such, we do basic check here before dereferencing network slice elements.	2019-10-28 10:49:02 -04:00
Mahmood Ali	bb45a7a776	add tests for consul connect validation	2019-10-28 10:41:51 -04:00
Mahmood Ali	4c64658397	vault: Support new role field `token_role` Vault 1.2.0 deprecated `period` field in favor of `token_period` in auth role: > * Token store roles use new, common token fields for the values > that overlap with other auth backends. `period`, `explicit_max_ttl`, and > `bound_cidrs` will continue to work, with priority being given to the > `token_` prefixed versions of those parameters. They will also be returned > when doing a read on the role if they were used to provide values initially; > however, in Vault 1.4 if `period` or `explicit_max_ttl` is zero they will no > longer be returned. (`explicit_max_ttl` was already not returned if empty.) https://github.com/hashicorp/vault/blob/master/CHANGELOG.md#120-july-30th-2019	2019-10-28 09:33:26 -04:00
Seth Hoenig	8b03477f46	Merge pull request #6448 from hashicorp/f-set-connect-sidecar-tags connect: enable setting tags on consul connect sidecar service in job…	2019-10-17 15:14:09 -05:00
Seth Hoenig	039fbd3f3b	connect: enable setting tags on consul connect sidecar service in jobspec (#6415 )	2019-10-17 19:25:20 +00:00
Mahmood Ali	4e4a9b252c	Merge pull request #6290 from hashicorp/r-generated-code-refactor dev: avoid codecgen code in downstream projects	2019-10-15 08:22:31 -04:00
Danielle	fee482ae6c	Merge pull request #6331 from hashicorp/dani/f-volume-mount-propagation volumes: Add support for mount propagation	2019-10-14 14:29:40 +02:00
Danielle Lancashire	4fbcc668d0	volumes: Add support for mount propagation This commit introduces support for configuring mount propagation when mounting volumes with the `volume_mount` stanza on Linux targets. Similar to Kubernetes, we expose 3 options for configuring mount propagation: - private, which is equivalent to `rprivate` on Linux, which does not allow the container to see any new nested mounts after the chroot was created. - host-to-task, which is equivalent to `rslave` on Linux, which allows new mounts that have been created _outside of the container_ to be visible inside the container after the chroot is created. - bidirectional, which is equivalent to `rshared` on Linux, which allows the container to see new mounts created on the host, but importantly also _allows the container to create mounts that are visible in other containers and on the host_ private and host-to-task are safe, but bidirectional mounts can be dangerous: if the code inside a container creates a mount and does not clean it up before tearing down the container, it can cause bad things to happen inside the kernel. To add a layer of safety here, we require that the user has ReadWrite permissions on the volume before allowing bidirectional mounts, as a defense in depth / validation case, although creating mounts should also require a privileged execution environment inside the container.	2019-10-14 14:09:58 +02:00
Mahmood Ali	4b2ba62e35	acl: check ACL against object namespace Fix a bug where a malicious user can access or manipulate an alloc in a namespace they don't have access to. The allocation endpoints perform ACL checks against the request namespace, not the allocation namespace, and perform the allocation lookup independently from namespaces. Here, we check that the requester can access the alloc namespace regardless of the declared request namespace. Ideally, we'd enforce that the declared request namespace matches the actual allocation namespace. Unfortunately, we haven't documented alloc endpoints as namespaced functions; we suspect starting to enforce this will be very disruptive and inappropriate for a nomad point release. As such, we maintain current behavior that doesn't require passing the proper namespace in request. A future major release may start enforcing checking declared namespace.	2019-10-08 12:59:22 -04:00
Mahmood Ali	674a457865	use RequestNamespace(), the canonical way to get namespace	2019-09-27 07:40:58 -04:00
Mahmood Ali	e29ee4c400	nomad: defensive check for namespaces in job registration call In a job registration request, ensure that the request namespace "header" and job namespace field match. This should be the case already in prod, as HTTP handlers ensure that the values match [1]. This mitigates bugs where we may check one value but act on another, resulting in a bypass of the ACL system. [1] https://github.com/hashicorp/nomad/blob/v0.9.5/command/agent/job_endpoint.go#L415-L418	2019-09-26 17:02:47 -04:00
Lang Martin	fb41dd86ba	default raft protocol v2	2019-09-24 14:37:55 -04:00
Lang Martin	31d7f116dd	nomad/server comments	2019-09-24 14:36:18 -04:00
Tim Gross	cd9c23617f	client/connect: ConsulProxy LocalServicePort/Address (#6358 ) Without a `LocalServicePort`, Connect services will try to use the mapped port even when delivering traffic locally. A user can override this behavior by pinning the port value in the `service` stanza but this prevents us from using the Consul service name to reach the service. This commits configures the Consul proxy with its `LocalServicePort` and `LocalServiceAddress` fields.	2019-09-23 14:30:48 -04:00
Danielle Lancashire	78b61de45f	config: Hoist volume.config.source into volume Currently, using a Volume in a job uses the following configuration: ``` volume "alias-name" { type = "volume-type" read_only = true config { source = "host_volume_name" } } ``` This commit migrates to the following: ``` volume "alias-name" { type = "volume-type" source = "host_volume_name" read_only = true } ``` The original design stemmed from uncertainty about the future of storage plugins, and aimed to allow maximum flexibility. However, this causes a few issues, namely: - We frequently need to parse this configuration during submission, scheduling, and mounting - It complicates the configuration from an end user's perspective - It complicates the ability to do validation As we understand the problem space of CSI a little more, it has become clear that we won't need the `source` to be in config, as it will be used in the majority of cases: - Host Volumes: Always need a source - Preallocated CSI Volumes: Always needs a source from a volume or claim name - Dynamic Persistent CSI Volumes: Always needs a source to attach the volumes to for managing upgrades and to avoid dangling. - Dynamic Ephemeral CSI Volumes: Less thought out, but `source` will probably point to the plugin name, and a `config` block will allow you to pass meta to the plugin. Or will point to a pre-configured ephemeral config. *If implemented The new design simplifies this by merging the source into the volume stanza to solve the above issues with usability, performance, and error handling.	2019-09-13 04:37:59 +02:00
Mahmood Ali	4b8280e51d	remove generated code	2019-09-06 19:24:15 +00:00
Nomad Release bot	dc7d728a82	Generate files for 0.10.0-beta1 release	2019-09-06 18:47:09 +00:00
Mahmood Ali	01f42053e4	dev: avoid codecgen code in downstream projects This is an attempt to ease dependency management for external driver plugins, by avoiding requiring them to compile ugorji/go generated files. Plugin developers reported some pain with the brittleness of ugorji/go dependency in particular, especially when using go mod, the default go mod manager in golang 1.13. Context -------- Nomad uses msgpack to persist and serialize internal structs, using ugorji/go library. As an optimization, we use ugorji/go code generation to speed up the process and avoid the reflection-based slow path. We commit these generated files in repository when we cut and tag the release to ease reproducibility and debugging old releases. Thus, downstream projects that depend on release tag, indirectly depends on ugorji/go generated code. Sadly, the generated code is brittle and specific to the version of ugorji/go being used. When go mod picks another version of ugorji/go than nomad (go mod by default uses release according to semver), downstream projects face compilation errors. Interestingly, downstream projects don't commonly serialize nomad internal structs. Drivers and device plugins use grpc instead of msgpack for the most part. In the few cases where they use msgpack (e.g. decoding task config), they do without codegen path as they run on driver specific structs not the nomad internal structs. Also, the ugorji/go serialization through reflection is generally backward compatible (mod some ugorji/go regression bugs that get introduced every now and then :( ). Proposal --------- The proposal here is to keep committing ugorji/go codec generated files for releases but to use a go tag for them. All nomad development through the makefile, including releasing, CI and dev flow, has the tag enabled. Downstream plugin projects, by default, will skip these files and life proceeds as normal for them. 
The downside is that nomad developers who use generated code but avoid using make must start passing additional go tag argument. Though this is not a blessed configuration.	2019-09-06 09:22:00 -04:00
Mahmood Ali	6d73ca0cfb	Merge pull request #6250 from hashicorp/f-raft-protocol-v3 Update default raft protocol to version 3	2019-09-04 09:34:41 -04:00
Mahmood Ali	c94a5ef1f8	tests: give up on TestAutopilot_CleanupStaleRaftServer for now	2019-09-04 09:10:53 -04:00
Nick Ethier	6a90a9f505	structs: canonicalize tg Services and Networks (#6257 )	2019-09-04 08:55:47 -04:00
Mahmood Ali	6cefd8f97e	tests: attempt to fix TestAutopilot_CleanupStaleRaftServer Also add a utility function for waiting for stable leadership	2019-09-04 08:49:33 -04:00
Mahmood Ali	035a7a94d9	tests: update time sensitive tests Fix tests whose messages seem timing dependent.	2019-09-04 08:45:25 -04:00
Mahmood Ali	0beb757b6f	tests: disable server auto join by default Tests typically call join cluster directly rather than rely on consul discovery. Worse, consul discovery seems to cause additional leadership transitions when a server is shutdown in tests than tests expect.	2019-09-04 07:54:54 -04:00
Mahmood Ali	3e2ab6e2a3	address review feedback	2019-09-03 21:44:39 -04:00
Mahmood Ali	0a6d73020c	use current nomad version in testing	2019-09-03 21:42:41 -04:00
Mahmood Ali	9bd56587cd	Fix raft tests Wait until leadership stabilizes and all non-voters get promoted before killing leader	2019-09-03 14:53:29 -04:00
Michael Schurter	5957030d18	connect: add unix socket to proxy grpc for envoy (#6232 ) * connect: add unix socket to proxy grpc for envoy Fixes #6124 Implement a L4 proxy from a unix socket inside a network namespace to Consul's gRPC endpoint on the host. This allows Envoy to connect to Consul's xDS configuration API. * connect: pointer receiver on structs with mutexes * connect: warn on all proxy errors	2019-09-03 08:43:38 -07:00
Buck Doyle	21ec6a237c	Merge branch 'master' into f-policy-json # Conflicts: # CHANGELOG.md	2019-09-03 09:56:25 -05:00
Jasmine Dahilig	4edebe389a	add default update stanza and max_parallel=0 disables deployments (#6191 )	2019-09-02 10:30:09 -07:00
Buck Doyle	ab96785fc9	Change test to use valid HCL for rules	2019-08-29 16:09:02 -05:00
Buck Doyle	4a159f5dcf	Change parsing error to set rules to nil	2019-08-29 15:50:34 -05:00
Buck Doyle	5495a7e689	Add standard error-handling for parse failure	2019-08-29 11:12:02 -05:00
Buck Doyle	8b06712d21	Merge branch 'master' into f-policy-json	2019-08-29 11:11:21 -05:00
Mahmood Ali	3da10b5cb3	scheduler: tests for multiple drivers in TG	2019-08-29 09:03:31 -04:00
Mahmood Ali	a67f5f0565	update tests to run with v2	2019-08-28 16:42:08 -04:00
Mahmood Ali	6eabf53b91	Default raft protocol to version 3	2019-08-28 15:56:59 -04:00
Michael Schurter	f5792635ca	Merge pull request #6218 from hashicorp/f-consul-defaults consul: use Consul's defaults and env vars	2019-08-28 11:54:44 -07:00
Nick Ethier	9e96971a75	cli: display group ports and address in alloc status command output (#6189 ) * cli: display group ports and address in alloc status command output * add assertions for port.To = -1 case and convert assertions to testify	2019-08-27 23:59:36 -04:00
Nick Ethier	cbb27e74bc	Add environment variables for connect upstreams (#6171 ) * taskenv: add connect upstream env vars + test * set taskenv upstreams instead of appending * Update client/taskenv/env.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-27 23:41:38 -04:00
Michael Schurter	3b0e1d8ef7	consul: use Consul's defaults and env vars Use Consul's API package defaults and env vars as Nomad's defaults.	2019-08-27 14:56:52 -07:00
Mahmood Ali	3791a70aa9	Merge pull request #5676 from hashicorp/f-b-upgrade-ugorji-dep-20190508 Update ugorji/go to latest	2019-08-23 18:29:49 -04:00
Jerome Gravel-Niquet	cbdc1978bf	Consul service meta (#6193 ) * adds meta object to service in job spec, sends it to consul * adds tests for service meta * fix tests * adds docs * better hashing for service meta, use helper for copying meta when registering service * tried to be DRY, but looks like it would be more work to use the helper function	2019-08-23 12:49:02 -04:00
Michael Schurter	95b8048553	Merge pull request #6121 from hashicorp/f-connect-bootstrap connect: task hook for bootstrapping envoy sidecar	2019-08-22 10:58:31 -07:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, bootstrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Danielle Lancashire	2e5f28029f	remove hidden field from host volumes We're not shipping support for "hidden" volumes in 0.10 any more, I'll convert this to an issue+mini RFC for future enhancement.	2019-08-22 08:48:05 +02:00
Danielle	0428284aee	Merge pull request #6180 from hashicorp/dani/readonly-acl Fine grained ACLs for Host Volumes	2019-08-21 22:22:14 +02:00
Danielle Lancashire	91bb67f713	acls: Break mount acl into mount-rw and mount-ro	2019-08-21 21:17:30 +02:00
Nick Ethier	c8556daf37	structs: validate no tcp checks for connect services (#6169 )	2019-08-21 12:42:53 -04:00
Michael Schurter	050cc32fde	Merge pull request #6157 from hashicorp/f-connect-register Register connect enabled group services with Consul	2019-08-20 14:45:38 -07:00
Tim Gross	7dc6ee2d27	structs: add taskgroup networks and services to plan diffs Adds a check for differences in `job.Diff` so that task group networks and services, including new Consul connect stanzas, show up in the job plan outputs.	2019-08-20 16:18:30 -04:00
Michael Schurter	b008fd1724	connect: register group services with Consul Fixes #6042 Add new task group service hook for registering group services like Connect-enabled services. Does not yet support checks.	2019-08-20 12:25:10 -07:00
Tim Gross	a0e923f46c	add optional task field to group service checks	2019-08-20 09:35:31 -04:00
Mahmood Ali	d699a70875	Merge pull request #5911 from hashicorp/b-rpc-consistent-reads Block rpc handling until state store is caught up	2019-08-20 09:29:37 -04:00
Nick Ethier	24f5a4c276	sidecar_task override in connect admission controller (#6140 ) * structs: use separate SidecarTask struct for sidecar_task stanza and add merge * nomad: merge SidecarTask into proxy task during connect Mutate hook	2019-08-20 01:22:46 -04:00
Nick Ethier	965f00b2fc	Builtin Admission Controller Framework (#6116 ) * nomad: add admission controller framework * nomad: add admission controller framework and Consul Connect hooks * run admission controllers before checking permissions * client: add default node meta for connect configurables * nomad: remove validateJob func since it has been moved to admission controller * nomad: use new TaskKind type * client: use consts for connect sidecar image and log level * Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad: add job register test with connect sidecar * Update nomad/job_endpoint_hooks.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-15 11:22:37 -04:00
Preetha Appan	72e45dd01e	More code review feedback	2019-08-12 17:41:40 -05:00
Preetha	76c8a11b31	Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-12 17:03:30 -05:00
Preetha Appan	219dc05541	Fix type for kind	2019-08-12 14:39:50 -05:00
Preetha Appan	35506c516d	Improve validation logic and add table driven tests	2019-08-12 14:39:50 -05:00
Preetha Appan	d324a9864e	Add validation for kind field if it is a consul connect proxy	2019-08-12 14:39:50 -05:00
Danielle Lancashire	b38c1d810e	job_endpoint: Validate volume permissions	2019-08-12 15:39:09 +02:00
Danielle Lancashire	33db40d4e6	structs: Document VolumeMount	2019-08-12 15:39:08 +02:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	e132a30899	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle Lancashire	6d7b417e54	structs: Add declarations of basic structs for volume support	2019-08-12 15:39:08 +02:00
Nick Ethier	1871c1edbc	Add sidecar_task stanza parsing (#6104 ) * jobspec: breakup parse.go into smaller files * add sidecar_task parsing to jobspec and api * jobspec: combine service parsing logic for task and group service stanzas * api: use slice of ConsulUpstream values instead of pointers	2019-08-09 15:18:53 -04:00
Preetha Appan	a393ea79e8	Add field "kind" to task for use in connect tasks	2019-08-07 18:43:36 -05:00
Jasmine Dahilig	8d980edd2e	add create and modify timestamps to evaluations (#5881 )	2019-08-07 09:50:35 -07:00
Michael Schurter	3e4796799a	Merge pull request #6003 from pete-woods/add-job-status-metrics nomad: add job status metrics	2019-08-07 08:02:16 -07:00
Michael Schurter	d2862b33e6	Merge pull request #6045 from hashicorp/f-connect-groupservice consul: add Connect structs	2019-08-06 15:43:38 -07:00
Michael Schurter	ef9d100d2f	Merge pull request #6082 from hashicorp/b-vault-deadlock vault: fix deadlock in SetConfig	2019-08-06 15:30:17 -07:00
Michael Schurter	ecb1a65bb9	Merge pull request #6077 from hashicorp/b-vault-revlock vault: fix race in accessor revocations	2019-08-06 14:28:47 -07:00
Michael Schurter	b8e127b3c0	vault: ensure SetConfig calls are serialized This is a defensive measure as SetConfig should only be called serially.	2019-08-06 11:17:10 -07:00
Michael Schurter	5022341b27	vault: fix deadlock in SetConfig This seems to be the minimum viable patch for fixing a deadlock between establishConnection and SetConfig. SetConfig calls tomb.Kill+tomb.Wait while holding v.lock. establishConnection needs to acquire v.lock to exit but SetConfig is holding v.lock until tomb.Wait exits. tomb.Wait can't exit until establishConnect does! ``` SetConfig -> tomb.Wait ^ \| \| v v.lock <- establishConnection ```	2019-08-06 10:40:14 -07:00
Michael Schurter	17fd82d6ad	consul: add Connect structs Refactor all Consul structs into {api,structs}/services.go because api/tasks.go didn't make sense anymore and structs/structs.go is gigantic.	2019-08-06 08:15:07 -07:00
Michael Schurter	d0a83eb818	vault: fix race in accessor revocations	2019-08-05 15:08:04 -07:00
Preetha Appan	8b298621ef	Add more comments to clarify job.Stable field	2019-08-05 15:00:53 -05:00
Preetha Appan	e6a496bac0	Code review feedback	2019-07-31 01:04:08 -04:00
Preetha Appan	99eca85206	Scheduler changes to support network at task group level Also includes unit tests for binpacker and preemption. The tests verify that network resources specified at the task group level are properly accounted for	2019-07-31 01:04:08 -04:00
Michael Schurter	4501fe3c4d	structs: deepcopy shared alloc resources Also DRY up Networks code by using Networks.Copy	2019-07-31 01:04:06 -04:00
Michael Schurter	fb487358fb	connect: add group.service stanza support	2019-07-31 01:04:05 -04:00
Nick Ethier	a03f6a95a2	structs: refactor network validation to separate fn	2019-07-31 01:03:16 -04:00
Danielle	1e7571eb85	fix structs comment Co-Authored-By: nickethier <ncethier@gmail.com>	2019-07-31 01:03:16 -04:00
Nick Ethier	aa7c08679e	structs: Add validations for task group networks	2019-07-31 01:03:16 -04:00
Nick Ethier	6c160df689	fix tests from introducing new struct fields	2019-07-31 01:03:16 -04:00
Nick Ethier	8650429e38	Add network stanza to group Adds a network stanza and additional options to the task group level in prep for allowing shared networking between tasks of an alloc.	2019-07-31 01:03:12 -04:00
Preetha Appan	d048029b5a	remove generated code and change version to 0.10.0	2019-07-30 15:56:05 -05:00
Nomad Release bot	e39fb11531	Generate files for 0.9.4 release	2019-07-30 19:05:18 +00:00
Buck Doyle	0a1a0419cb	Combine conditionals	2019-07-29 10:38:07 -05:00
Buck Doyle	0a082c1e5e	Update assertion to use better failure-reporting	2019-07-29 10:35:07 -05:00
Buck Doyle	c3deb7703d	Update policy endpoint to permit anonymous access	2019-07-26 13:07:42 -05:00
Pete Woods	9096aa3d23	Add job status metrics This avoids having to write services to repeatedly hit the jobs API	2019-07-26 10:12:49 +01:00
Buck Doyle	77f5a38c8f	Add parsed rules to policy response	2019-07-25 10:43:57 -05:00
Preetha Appan	6b4c40f5a8	remove generated code	2019-07-23 12:07:49 -05:00
Nomad Release bot	04187c8b86	Generate files for 0.9.4-rc1 release	2019-07-22 21:42:36 +00:00
Jasmine Dahilig	2157f6ddf1	add formatting for hcl parsing error messages (#5972 )	2019-07-19 10:04:39 -07:00
Lang Martin	f282da4ced	blocked_evals_test disable calls Flush	2019-07-18 10:32:13 -04:00
Lang Martin	8f7a20839e	worker comment system -> core	2019-07-18 10:32:13 -04:00
Lang Martin	83d20169f6	blocked_evals reset system evals on Flush	2019-07-18 10:32:13 -04:00
Lang Martin	6e3425babf	blocked_evals_test Test_UnblockNode	2019-07-18 10:32:12 -04:00
Lang Martin	ea275d5ce7	fsm attach UnblockNode on node updates	2019-07-18 10:32:12 -04:00
Lang Martin	3bf618f217	blocked_evals system evals indexed by job and node	2019-07-18 10:32:12 -04:00
Michael Schurter	81b4b6f19b	Merge pull request #5791 from hashicorp/b-plan-snapshotindex nomad: include snapshot index when submitting plans	2019-07-17 09:25:00 -07:00
Mahmood Ali	ad39bcef60	rpc: use tls wrapped connection for streaming rpc This ensures that server-to-server streaming RPC calls use the tls wrapped connections. Prior to this, `streamingRpcImpl` function uses tls for setting header and invoking the rpc method, but returns unwrapped tls connection. Thus, streaming writes fail with tls errors. This tls streaming bug existed since 0.8.0[1], but PR #5654[2] exacerbated it in 0.9.2. Prior to PR #5654, nomad client used to shuffle servers at every heartbeat -- `servers.Manager.setServers`[3] always shuffled servers and was called by heartbeat code[4]. Shuffling servers meant that a nomad client would heartbeat and establish a connection against all nomad servers eventually. When handling streaming RPC calls, nomad servers used these local connection to communicate directly to the client. The server-to-server forwarding logic was left mostly unexercised. PR #5654 means that a nomad client may connect to a single server only and caused the server-to-server forward streaming RPC code to get exercised more and unearthed the problem. [1] https://github.com/hashicorp/nomad/blob/v0.8.0/nomad/rpc.go#L501-L515 [2] https://github.com/hashicorp/nomad/pull/5654 [3] https://github.com/hashicorp/nomad/blob/v0.9.1/client/servers/manager.go#L198-L216 [4] https://github.com/hashicorp/nomad/blob/v0.9.1/client/client.go#L1603	2019-07-12 14:41:44 +08:00
Mahmood Ali	9c9bec62fd	rpc: add positive tests for server streaming RPC	2019-07-12 14:32:52 +08:00
Lang Martin	0b97175a16	node_endpoint preserve both messages as rpcs and in raft	2019-07-10 13:56:20 -04:00
Lang Martin	ee4848167c	core_sched add compat comment for later removal	2019-07-10 13:56:20 -04:00
Lang Martin	c13c97c6c2	structs drop deprecation warning, revert unnecessary comment change	2019-07-10 13:56:20 -04:00
Lang Martin	a95225d754	NodeDeregisterBatch -> NodeBatchDeregister match JobBatch pattern	2019-07-10 13:56:20 -04:00
Lang Martin	a8e72a5b68	state_store error if called without node_ids	2019-07-10 13:56:20 -04:00
Lang Martin	44cbca9b98	fsm new NodeDeregisterBatchRequestType sorted at the end of the case	2019-07-10 13:56:20 -04:00
Lang Martin	91e139dcb5	structs NodeDeregisterBatchRequestType must go at the end	2019-07-10 13:56:20 -04:00
Lang Martin	1cc6b4062c	fsm label batch_deregister_node metrics explicitly Co-Authored-By: Mahmood Ali <mahmood@notnoop.com>	2019-07-10 13:56:20 -04:00
Lang Martin	ad3549f906	core_sched use the new rpc names	2019-07-10 13:56:20 -04:00
Lang Martin	ce0f03651a	fsm support new NodeDeregisterBatchRequest	2019-07-10 13:56:20 -04:00
Lang Martin	fa5649998e	node endpoint support new NodeDeregisterBatchRequest	2019-07-10 13:56:19 -04:00
Lang Martin	683ab8d1d2	structs add NodeDeregisterBatchRequest	2019-07-10 13:56:19 -04:00
Lang Martin	82349aba5d	node_endpoint argument setup	2019-07-10 13:56:19 -04:00
Lang Martin	6dbf5d7d13	fsm return an error on both NodeDeregisterRequest fields set	2019-07-10 13:56:19 -04:00
Lang Martin	fbc78ba96c	fsm variable names for consistency	2019-07-10 13:56:19 -04:00
Lang Martin	09fd05bd8f	node_endpoint raft store then shutdown, test deprecation	2019-07-10 13:56:19 -04:00
Lang Martin	4610c70777	util simplify partitionAll	2019-07-10 13:56:19 -04:00
Lang Martin	d22d9fb5b2	core_sched check ServersMeetMinimumVersion	2019-07-10 13:56:19 -04:00
Lang Martin	3bf41211fb	fsm honor new and old style NodeDeregisterRequests	2019-07-10 13:56:19 -04:00
Lang Martin	3fb82e83a5	structs add back NodeDeregisterRequest.NodeID, compatibility	2019-07-10 13:56:19 -04:00
Lang Martin	a4472e3d34	core_sched check ServersMeetMinimumVersion, send old node deregister	2019-07-10 13:56:19 -04:00
Lang Martin	8e53c105fc	state_store just one index update, test deletion	2019-07-10 13:56:19 -04:00
Lang Martin	3e2d1f0338	node_endpoint improve error messages	2019-07-10 13:56:19 -04:00
Lang Martin	5a6a947e98	state_store improve error messages	2019-07-10 13:56:19 -04:00
Lang Martin	fd14cedf95	drainer watch_nodes_test batch of 1	2019-07-10 13:56:19 -04:00
Lang Martin	b176066d42	node_endpoint deregister the batch of nodes	2019-07-10 13:56:19 -04:00
Lang Martin	a97407e030	fsm NodeDeregisterRequest is now a batch	2019-07-10 13:56:19 -04:00
Lang Martin	d5ff2834ca	core_sched batch node deregistration requests	2019-07-10 13:56:19 -04:00
Lang Martin	10848841be	util partitionAll for paging	2019-07-10 13:56:19 -04:00
Lang Martin	be2d6853cb	state_store DeleteNode operates on a batch of ids	2019-07-10 13:56:19 -04:00
Lang Martin	77cf037bff	struct NodeDeregisterRequest has a batch of NodeIDs	2019-07-10 13:56:19 -04:00
Mahmood Ali	ea3a98357f	Block rpc handling until state store is caught up Here, we ensure that when leader only responds to RPC calls when state store is up to date. At leadership transition or launch with restored state, the server local store might not be caught up with latest raft logs and may return a stale read. The solution here is to have an RPC consistency read gate, enabled when `establishLeadership` completes before we respond to RPC calls. `establishLeadership` is gated by a `raft.Barrier` which ensures that all prior raft logs have been applied. Conversely, the gate is disabled when leadership is lost. This is very much inspired by https://github.com/hashicorp/consul/pull/3154/files	2019-07-02 16:07:37 +08:00
Preetha Appan	3cb798235d	Missed one revert of backwards compatibility for node drain	2019-07-01 16:46:05 -05:00
Preetha Appan	aa2b4b4e00	Undo removal of node drain compat changes Decided to remove that in 0.10	2019-07-01 15:12:01 -05:00
Preetha Appan	3484f18984	Fix more tests	2019-06-26 16:30:53 -05:00
Preetha Appan	ff1b80dba6	Fix node drain test	2019-06-26 16:12:07 -05:00
Preetha Appan	23319e04d6	Restore accidentally deleted block	2019-06-26 13:59:14 -05:00
Michael Schurter	69ba495f0c	nomad: expand comments on subtle plan apply behaviors	2019-06-26 08:49:24 -07:00
Preetha Appan	66fa6a67ec	newline	2019-06-25 19:41:09 -05:00
Preetha Appan	10e7d6df6d	Remove compat code associated with many previous versions of nomad This removes compat code for namespaces (0.7), Drain(0.8) and other older features from releases older than Nomad 0.7	2019-06-25 19:05:25 -05:00
Michael Schurter	e4bc943a68	nomad: SnapshotAfter -> SnapshotMinIndex Rename SnapshotAfter to SnapshotMinIndex. The old name was not technically accurate. SnapshotAtOrAfter is more accurate, but wordy and still lacks context about what precisely it is at or after (the index). SnapshotMinIndex was chosen as it describes the action (snapshot), a constraint (minimum), and the object of the constraint (index).	2019-06-24 12:16:46 -07:00
Michael Schurter	0f8164b2f1	nomad: evaluate plans after previous plan index The previous commit prevented evaluating plans against a state snapshot which is older than the snapshot at which the plan was created. This is correct and prevents failures trying to retrieve referenced objects that may not exist until the plan's snapshot. However, this is insufficient to guarantee consistency if the following events occur: 1. P1, P2, and P3 are enqueued with snapshot @ 100 2. Leader evaluates and applies Plan P1 with snapshot @ 100 3. Leader evaluates Plan P2 with snapshot+P1 @ 100 4. P1 commits @ 101 4. Leader evaluates applies Plan P3 with snapshot+P2 @ 100 Since only the previous plan is optimistically applied to the state store, the snapshot used to evaluate a plan may not contain the N-2 plan! To ensure plans are evaluated and applied serially we must consider all previous plan's committed indexes when evaluating further plans. Therefore combined with the last PR, the minimum index at which to evaluate a plan is: min(previousPlanResultIndex, plan.SnapshotIndex)	2019-06-24 12:16:46 -07:00
Michael Schurter	e10fea1d7a	nomad: include snapshot index when submitting plans Plan application should use a state snapshot at or after the Raft index at which the plan was created otherwise it risks being rejected based on stale data. This commit adds a Plan.SnapshotIndex which is set by workers when submitting plan. SnapshotIndex is set to the Raft index of the snapshot the worker used to generate the plan. Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex. While RefreshIndex informs workers their StateStore is behind the leader's, SnapshotIndex is a way to prevent the leader from using a StateStore behind the worker's. Plan.SnapshotIndex should be considered the lower bound index for consistently handling plan application. Plans must also be committed serially, so Plan N+1 should use a state snapshot containing Plan N. This is guaranteed for plans after the first plan after a leader election. The Raft barrier on leader election ensures the leader's statestore has caught up to the log index at which it was elected. This guarantees its StateStore is at an index > lastPlanIndex.	2019-06-24 12:16:46 -07:00
Chris Baker	59fac48d92	alloc lifecycle: 404 when attempting to stop non-existent allocation	2019-06-20 21:27:22 +00:00
Preetha	586e50d1a4	Merge pull request #5841 from hashicorp/f-raft-snapshot-metrics Raft and state store indexes as metrics	2019-06-19 12:01:03 -05:00
Preetha Appan	dc0ac81609	Change interval of raft stats collection to 10s	2019-06-19 11:58:46 -05:00
Preetha Appan	104d66f10c	Changed name of metric	2019-06-17 15:51:31 -05:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Preetha Appan	c54b4a5b17	Emit metrics with raft commit and apply index and statestore latest index	2019-06-14 16:30:27 -05:00
Jasmine Dahilig	ed9740db10	Merge pull request #5664 from hashicorp/f-http-hcl-region backfill region from hcl for jobUpdate and jobPlan	2019-06-13 12:25:01 -07:00
Jasmine Dahilig	51e141be7a	backfill region from job hcl in jobUpdate and jobPlan endpoints - updated region in job metadata that gets persisted to nomad datastore - fixed many unrelated unit tests that used an invalid region value (they previously passed because hcl wasn't getting picked up and the job would default to global region)	2019-06-13 08:03:16 -07:00
Nick Ethier	1b7fa4fe29	Optional Consul service tags for nomad server and agent services (#5706) Optional Consul service tags for nomad server and agent services	2019-06-13 09:00:35 -04:00
Mahmood Ali	e31159bf1f	Prepare for 0.9.4 dev cycle	2019-06-12 18:47:50 +00:00
Nomad Release bot	4803215109	Generate files for 0.9.3 release	2019-06-12 16:11:16 +00:00
Mahmood Ali	07f2c77c44	comment DenormalizeAllocationDiffSlice applies to terminal allocs only	2019-06-12 08:28:43 -04:00
Lang Martin	fe8a4781d8	config merge maintains *HCL string fields used for duration conversion	2019-06-11 16:34:04 -04:00
Mahmood Ali	392f5bac44	Stop updating allocs.Job on stopping or preemption	2019-06-10 18:30:20 -04:00
Mahmood Ali	6c8e329819	test that stopped alloc jobs aren't modified When an alloc is stopped, test that we don't update the job found in alloc with new job that is no longer relevant for this alloc.	2019-06-10 17:14:26 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	87173111de	Merge pull request #5746 from hashicorp/b-no-updating-inmem-node set node.StatusUpdatedAt in raft	2019-06-05 19:05:21 -04:00
Mahmood Ali	97957fbf75	Prepare for 0.9.3 dev cycle	2019-06-05 14:54:00 +00:00
Nomad Release bot	43bfbf3fcc	Generate files for 0.9.2 release	2019-06-05 11:59:27 +00:00
Michael Schurter	073893f529	nomad: disable service+batch preemption by default Enterprise only. Disable preemption for service and batch jobs by default. Maintain backward compatibility in a x.y.Z release. Consider switching the default for new clusters in the future.	2019-06-04 15:54:50 -07:00
Michael Schurter	a8fc50cc1b	nomad: revert use of SnapshotAfter in planApply Revert plan_apply.go changes from #5411 Since non-Command Raft messages do not update the StateStore index, SnapshotAfter may unnecessarily block and needlessly fail in idle clusters where the last Raft message is a non-Command message. This is trivially reproducible with the dev agent and a job that has 2 tasks, 1 of which fails. The correct logic would be to SnapshotAfter the previous plan's index to ensure consistency. New clusters or newly elected leaders will not have a previous plan, so the index the leader was elected should be used instead.	2019-06-03 15:34:21 -07:00
Mahmood Ali	a4ead8ff79	remove 0.9.2-rc1 generated code	2019-05-23 11:14:24 -04:00
Nomad Release bot	6d6bc59732	Generate files for 0.9.2-rc1 release	2019-05-22 19:29:30 +00:00
Lang Martin	d46613ff44	structs check TaskGroup.Update for nil	2019-05-22 12:34:57 -04:00
Lang Martin	10a3fd61b0	comment replace COMPAT 0.7.0 for job.Update with more current info	2019-05-22 12:34:57 -04:00
Lang Martin	67ebcc47dd	structs comment todo DeploymentStatus & DeploymentStatusDescription	2019-05-22 12:34:57 -04:00
Lang Martin	21bf9fdf90	structs job warnings for taskgroup with mixed auto_promote settings	2019-05-22 12:34:57 -04:00
Lang Martin	0f6f543a5f	deployment_watcher auto promote iff every task group is auto promotable	2019-05-22 12:34:57 -04:00
Lang Martin	d27d6f8ede	structs validate requires Canary for AutoPromote	2019-05-22 12:32:08 -04:00
Lang Martin	0c668ecc7a	log error on autoPromoteDeployment failure	2019-05-22 12:32:08 -04:00
Lang Martin	f23f9fd99e	describe a pending deployment without auto_promote more explicitly	2019-05-22 12:32:08 -04:00
Lang Martin	34230577df	describe a pending deployment with auto_promote accurately	2019-05-22 12:32:08 -04:00
Lang Martin	b5fd735960	add update AutoPromote bool	2019-05-22 12:32:08 -04:00
Lang Martin	3c5a9fed22	deployments_watcher_test new TestWatcher_AutoPromoteDeployment	2019-05-22 12:32:08 -04:00
Lang Martin	0bebf5d7f8	deployment_watcher when it's ok to autopromote, do so	2019-05-22 12:32:08 -04:00
Lang Martin	0cf4168ed9	deployments_watcher comments	2019-05-22 12:32:08 -04:00
Lang Martin	0c403eafde	state_store typo in a comment	2019-05-22 12:32:08 -04:00
Lang Martin	e1e28307be	new deploymentwatcher/doc.go for package level documentation	2019-05-22 12:32:08 -04:00
Mahmood Ali	9ff5f163b5	update callers in tests	2019-05-21 21:10:17 -04:00
Mahmood Ali	6bdbeed319	set node.StatusUpdatedAt in raft Fix a case where `node.StatusUpdatedAt` was manipulated directly in memory. This ensures that StatusUpdatedAt is set in raft layer, and ensures that the field is updated when node drain/eligibility is updated too.	2019-05-21 16:13:32 -04:00
Mahmood Ali	2159d0f3ac	tests: fix some nomad/drainer test data races	2019-05-21 14:40:58 -04:00
Mahmood Ali	3b0152d778	tests: fix deploymentwatcher tests data races	2019-05-21 14:29:45 -04:00
Michael Schurter	689794e08d	nomad: fix deadlock in UnblockClassAndQuota Previous commit could introduce a deadlock if the capacityChangeCh was full and the receiving side exited before freeing a slot for the sending side could send. Flush would then block forever waiting to acquire the lock just to throw the pending update away. The race is around getting/setting the chan field, not chan operations, so only lock around getting the chan field.	2019-05-20 15:41:52 -07:00
Michael Schurter	8c99214f69	nomad: fix race in BlockedEvals I assume the mutex was being released before sending on capacityChangeCh to avoid blocking in the critical section, but: 1. This is race. 2. capacityChangeCh has a huge buffer (8096). If it's full things already seem Very Bad, and a little backpressure seems appropriate.	2019-05-20 15:26:20 -07:00
Michael Schurter	05a9c6aedb	Merge pull request #5411 from hashicorp/b-snapshotafter Block plan application until state store has caught up to raft	2019-05-20 14:03:10 -07:00
Mahmood Ali	cd64ada95d	Run TestClientAllocations_Restart_ACL test	2019-05-17 20:30:23 -04:00
Michael Schurter	0e39927782	nomad: emit more detailed error Avoid returning context.DeadlineExceeded as it lacks helpful information and is often ignored or handled specially by callers.	2019-05-17 14:37:42 -07:00

3588 Commits