open-nomad

Commit Graph

Author	SHA1	Message	Date
Danielle Lancashire	da4f6b60a2	csi: Pass through usage options to the csimanager The CSI Spec requires us to attach and stage volumes based on different types of usage information when it may effect how they are bound. Here we pass through some basic usage options in the CSI Hook (specifically the volume aliases ReadOnly field), and the attachment/access mode from the volume. We pass the attachment/access mode seperately from the volume as it simplifies some handling and doesn't necessarily force every attachment to use the same mode should more be supported (I.e if we let each `volume "foo" {}` specify an override in the future).	2020-03-23 13:58:30 -04:00
Danielle Lancashire	a62a90e03c	csi: Unpublish volumes during ar.Postrun This commit introduces initial support for unmounting csi volumes. It takes a relatively simplistic approach to performing NodeUnpublishVolume calls, optimising for cleaning up any leftover state rather than terminating early in the case of errors. This is because it happens during an allocation's shutdown flow and may not always have a corresponding call to `NodePublishVolume` that succeeded.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6665bdec2e	taskrunner/volume_hook: Cleanup arg order of prepareHostVolumes	2020-03-23 13:58:30 -04:00
Danielle Lancashire	8692ca86bb	taskrunner/volume_hook: Mounts for CSI Volumes This commit implements support for creating driver mounts for CSI Volumes. It works by fetching the created mounts from the allocation resources and then iterates through the volume requests, creating driver mount configs as required. It's a little bit messy primarily because there's _so_ much terminology overlap and it's a bit difficult to follow.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	7a33864edf	volume_hook: Loosen validation in host volume prep	2020-03-23 13:58:30 -04:00
Danielle Lancashire	d8334cf884	allocrunner: Push state from hooks to taskrunners This commit is an initial (read: janky) approach to forwarding state from an allocrunner hook to a taskrunner using a similar `hookResources` approach that tr's use internally. It should eventually probably be replaced with something a little bit more message based, but for things that only come from pre-run hooks, and don't change, it's probably fine for now.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	3ef41fbb86	csi_hook: Stage/Mount volumes as required This commit introduces the first stage of volume mounting for an allocation. The csimanager.VolumeMounter interface manages the blocking and actual minutia of the CSI implementation allowing this hook to do the minimal work of volume retrieval and creating mount info. In the future the `CSIVolume.Get` request should be replaced by `CSIVolume.Claim(Batch?)` to minimize the number of RPCs and to handle external triggering of a ControllerPublishVolume request as required. We also need to ensure that if pre-run hooks fail, we still get a full unwinding of any publish and staged volumes to ensure that there are no hanging references to volumes. That is not handled in this commit.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	4a2492ecb1	client: Pass an RPC Client to AllocRunners As part of introducing support for CSI, AllocRunner hooks need to be able to communicate with Nomad Servers for validation of and interaction with storage volumes. Here we create a small RPCer interface and pass the client (rpc client) to the AR in preparation for making these RPCs.	2020-03-23 13:58:30 -04:00
Tim Gross	60901fa764	csi: implement CSI controller detach request/response (#7107 ) This changeset implements the minimal structs on the client-side we need to compile the work-in-progress implementation of the server-to-controller RPCs. It doesn't include implementing the `ClientCSI.DettachVolume` RPC on the client.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	f77d3813d1	csi: Fix broken call to newVolumeManager	2020-03-23 13:58:29 -04:00
Danielle Lancashire	3bff9fefae	csi: Provide plugin-scoped paths during RPCs When providing paths to plugins, the path needs to be in the scope of the plugins container, rather than that of the host. Here we enable that by providing the mount point through the plugin registration and then use it when constructing request target paths.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	94e87fbe9c	csimanager: Cleanup volumemanager setup	2020-03-23 13:58:29 -04:00
Danielle Lancashire	ee85c468c0	csimanager: Instantiate fingerprint manager's csiclient	2020-03-23 13:58:29 -04:00
Danielle Lancashire	bbf6a9c14b	volume_manager: cleanup of mount detection No functional changes, but makes ensure.*Dir follow a nicer return style.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	80b7aa0a31	volume_manager: Add support for publishing volumes	2020-03-23 13:58:29 -04:00
Danielle Lancashire	e619ae5a42	volume_manager: Initial support for unstaging volumes	2020-03-23 13:58:29 -04:00
Danielle Lancashire	6e71baa77d	volume_manager: NodeStageVolume Support This commit introduces support for staging volumes when a plugin implements the STAGE_UNSTAGE_VOLUME capability. See the following for further reference material: `4731db0e0b/spec.md (nodestagevolume)`	2020-03-23 13:58:29 -04:00
Danielle Lancashire	f1ab38e845	volume_manager: Introduce helpers for staging This commit adds helpers that create and validate the staging directory for a given volume. It is currently missing usage options as the interfaces are not yet in place for those. The staging directory is only required when a volume has the STAGE_UNSTAGE Volume capability and has to live within the plugin root as the plugin needs to be able to create mounts inside it from within the container.	2020-03-23 13:58:29 -04:00
Lang Martin	33c55e609b	csi: pluginmanager use PluginID instead of Driver	2020-03-23 13:58:29 -04:00
Danielle Lancashire	1a10433b97	csi: Add VolumeManager (#6920 ) This changeset is some pre-requisite boilerplate that is required for introducing CSI volume management for client nodes. It extracts out fingerprinting logic from the csi instance manager. This change is to facilitate reusing the csimanager to also manage the node-local CSI functionality, as it is the easiest place for us to guaruntee health checking and to provide additional visibility into the running operations through the fingerprinter mechanism and goroutine. It also introduces the VolumeMounter interface that will be used to manage staging/publishing unstaging/unpublishing of volumes on the host.	2020-03-23 13:58:29 -04:00
Lang Martin	41cbd55657	client structs: use nstructs rather than s for nomad/structs	2020-03-23 13:58:29 -04:00
Lang Martin	3a7e1b6d14	client structs: move CSIVolumeAttachmentMode and CSIVolumeAccessMode	2020-03-23 13:58:29 -04:00
Danielle Lancashire	de5d373001	csi: Setup gRPC Clients with a logger	2020-03-23 13:58:29 -04:00
Danielle Lancashire	57ae1d2cd6	csimanager: Fingerprint Node Service capabilities	2020-03-23 13:58:29 -04:00
Danielle Lancashire	564f5cec93	csimanager: Fingerprint controller capabilities	2020-03-23 13:58:29 -04:00
Danielle Lancashire	9a23e27439	client_csi: Validate Access/Attachment modes	2020-03-23 13:58:28 -04:00
Danielle Lancashire	2fc65371a8	csi: ClientCSIControllerPublish* -> ClientCSIControllerAttach*	2020-03-23 13:58:28 -04:00
Danielle Lancashire	259852b05f	csi: Model Attachment and Access modes	2020-03-23 13:58:28 -04:00
Danielle Lancashire	2c29b1c53d	client: Setup CSI RPC Endpoint This commit introduces a new set of endpoints to a Nomad Client: ClientCSI. ClientCSI is responsible for mediating requests from a Nomad Server to a CSI Plugin running on a Nomad Client. It should only really be used to make controller RPCs.	2020-03-23 13:58:28 -04:00
Danielle Lancashire	426c26d7c0	CSI Plugin Registration (#6555 ) This changeset implements the initial registration and fingerprinting of CSI Plugins as part of #5378. At a high level, it introduces the following: * A `csi_plugin` stanza as part of a Nomad task configuration, to allow a task to expose that it is a plugin. * A new task runner hook: `csi_plugin_supervisor`. This hook does two things. When the `csi_plugin` stanza is detected, it will automatically configure the plugin task to receive bidirectional mounts to the CSI intermediary directory. At runtime, it will then perform an initial heartbeat of the plugin and handle submitting it to the new `dynamicplugins.Registry` for further use by the client, and then run a lightweight heartbeat loop that will emit task events when health changes. * The `dynamicplugins.Registry` for handling plugins that run as Nomad tasks, in contrast to the existing catalog that requires `go-plugin` type plugins and to know the plugin configuration in advance. * The `csimanager` which fingerprints CSI plugins, in a similar way to `drivermanager` and `devicemanager`. It currently only fingerprints the NodeID from the plugin, and assumes that all plugins are monolithic. Missing features * We do not use the live updates of the `dynamicplugin` registry in the `csimanager` yet. * We do not deregister the plugins from the client when they shutdown yet, they just become indefinitely marked as unhealthy. This is deliberate until we figure out how we should manage deploying new versions of plugins/transitioning them.	2020-03-23 13:58:28 -04:00
Drew Bailey	b09abef332	Audit config, seams for enterprise audit features allow oss to parse sink duration clean up audit sink parsing ent eventer config reload fix typo SetEnabled to eventer interface client acl test rm dead code fix failing test	2020-03-23 13:47:42 -04:00
Mahmood Ali	fa1244f8c5	health tracker: account for group service checks	2020-03-22 12:38:37 -04:00
Mahmood Ali	d61140dcac	health check account for task lifecycle In service jobs, lifecycles non-sidecar task tweak health logic a bit: they may terminate successfully without impacting alloc health, but fail the alloc if they fail. Sidecars should be treated just like a normal task.	2020-03-22 12:37:40 -04:00
Mahmood Ali	07a30580ac	health: fail health if any task is pending Fixes a bug where an allocation is considered healthy if some of the tasks are being restarted and as such, their checks aren't tracked by consul agent client. Here, we fix the immediate case by ensuring that an alloc is healthy only if tasks are running and the registered checks at the time are healthy. Previously, health tracker tracked task "health" independently from checks and leads to problems when a task restarts. Consider the following series of events: 1. all tasks start running -> `tracker.tasksHealthy` is true 2. one task has unhealthy checks and get restarted 3. remaining checks are healthy -> `tracker.checksHealthy` is true 4. propagate health status now that `tracker.tasksHealthy` and `tracker.checksHealthy`. This change ensures that we accurately use the latest status of tasks and checks regardless of their status changes. Also, ensures that we only consider check health after tasks are considered healthy, otherwise we risk trusting incomplete checks. This approach accomodates task dependencies well. Service jobs can have prestart short-lived tasks that will terminate before main process runs. These dead tasks that complete successfully will not negate health status.	2020-03-22 11:13:41 -04:00
Mahmood Ali	b0a7e4381b	tests: add a check for failing service checks Add tests to check for failing or missing service checks in consul update.	2020-03-22 11:13:40 -04:00
Mahmood Ali	5801039214	address review feedback	2020-03-21 17:52:58 -04:00
Mahmood Ali	e1f53347e9	tr: proceed to mark other tasks as dead if alloc fails	2020-03-21 17:52:58 -04:00
Mahmood Ali	e30d26b404	fix test	2020-03-21 17:52:57 -04:00
Jasmine Dahilig	73a64e4397	change jobspec lifecycle stanza to use sidecar attribute instead of block_until status	2020-03-21 17:52:57 -04:00
Jasmine Dahilig	89778bc88d	fix restart policy for system jobs with no lifecycle	2020-03-21 17:52:56 -04:00
Jasmine Dahilig	56e0b8e933	refactor TaskHookCoordinator tests to use mock package and add failed init and sidecar test cases	2020-03-21 17:52:56 -04:00
Jasmine Dahilig	2a8dac077c	remove debugging test code from TestAllocRunner_TaskLeader_StopRestoredTG	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	deb26aefab	fix bug in lifecycle restore tests after refactor	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	2e93d7a875	fix failing ci test: TestTaskRunner_UnregisterConsul_Retries	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	d54a83afee	fix linting errors	2020-03-21 17:52:53 -04:00
Jasmine Dahilig	3d1ffb9337	add task hook coordinator many init tasks test case	2020-03-21 17:52:53 -04:00
Jasmine Dahilig	80f0256cb4	refactor task hook coordinator helper method and tests	2020-03-21 17:52:53 -04:00
Jasmine Dahilig	a0fe570317	clean up restore test	2020-03-21 17:52:52 -04:00
Jasmine Dahilig	7ed08eb75a	partial test for restore functionality	2020-03-21 17:52:52 -04:00
Jasmine Dahilig	0c44d0017d	account for client restarts in task lifecycle hooks	2020-03-21 17:52:51 -04:00
Jasmine Dahilig	4ab39318cc	clean up restart conditions and restart tests for task lifecycle	2020-03-21 17:52:50 -04:00
Jasmine Dahilig	7064deaafb	put lifecycle nil and empty checks in api Canonicalize	2020-03-21 17:52:50 -04:00
Jasmine Dahilig	c27223207c	update task hook coordinator tests	2020-03-21 17:52:46 -04:00
Jasmine Dahilig	12393f90e7	add test for lifecycle coordinator	2020-03-21 17:52:42 -04:00
Jasmine Dahilig	b9a258ed7b	incorporate lifecycle into restart tracker	2020-03-21 17:52:40 -04:00
Mahmood Ali	d7354b8920	Add a coordinator for alloc runners	2020-03-21 17:52:38 -04:00
Yoan Blanc	67692789b7	vendor: vault api and sdk Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-21 17:57:48 +01:00
Mahmood Ali	92712c48eb	Merge pull request #7236 from hashicorp/b-remove-rkt Remove rkt as a built-in driver	2020-03-17 09:07:35 -04:00
Mahmood Ali	d59f149597	Update gopsutil code Latest gosutil includes two backward incompatible changes: First, it removed unused Stolen field in `cae8efcffa (diff-d9747e2da342bdb995f6389533ad1a3d)` . Second, it updated the Windows cpu stats calculation to be inline with other platforms, where it returns absolate stats rather than percentages. See https://github.com/shirou/gopsutil/pull/611.	2020-03-15 09:37:05 +01:00
Yoan Blanc	f85cbddaf1	gopsutils: v2.20.2 Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-15 09:36:59 +01:00
Michael Schurter	b72b3e765c	Merge pull request #7170 from fredrikhgrelland/consul_template_upgrade Update consul-template to v0.24.1 and remove deprecated vault grace	2020-03-10 14:15:47 -07:00
Mahmood Ali	21e19ef40d	Merge pull request #7255 from hashicorp/vendor-update-grpc-20200302 update grpc	2020-03-04 09:32:16 -05:00
Mahmood Ali	88cfe504a0	update grpc Upgrade grpc to v1.27.1 and protobuf plugins to v1.3.4.	2020-03-03 08:39:54 -05:00
Mahmood Ali	acbfeb5815	Simplify Bootstrap logic in tests This change updates tests to honor `BootstrapExpect` exclusively when forming test clusters and removes test only knobs, e.g. `config.DevDisableBootstrap`. Background: Test cluster creation is fragile. Test servers don't follow the BootstapExpected route like production clusters. Instead they start as single node clusters and then get rejoin and may risk causing brain split or other test flakiness. The test framework expose few knobs to control those (e.g. `config.DevDisableBootstrap` and `config.Bootstrap`) that control whether a server should bootstrap the cluster. These flags are confusing and it's unclear when to use: their usage in multi-node cluster isn't properly documented. Furthermore, they have some bad side-effects as they don't control Raft library: If `config.DevDisableBootstrap` is true, the test server may not immediately attempt to bootstrap a cluster, but after an election timeout (~50ms), Raft may force a leadership election and win it (with only one vote) and cause a split brain. The knobs are also confusing as Bootstrap is an overloaded term. In BootstrapExpect, we refer to bootstrapping the cluster only after N servers are connected. But in tests and the knobs above, it refers to whether the server is a single node cluster and shouldn't wait for any other server. Changes: This commit makes two changes: First, it relies on `BootstrapExpected` instead of `Bootstrap` and/or `DevMode` flags. This change is relatively trivial. Introduce a `Bootstrapped` flag to track if the cluster is bootstrapped. This allows us to keep `BootstrapExpected` immutable. Previously, the flag was a config value but it gets set to 0 after cluster bootstrap completes.	2020-03-02 13:47:43 -05:00
Mahmood Ali	a8d6950007	Remove rkt as a built-in driver Rkt has been archived and is no longer an active project: * https://github.com/rkt/rkt * https://github.com/rkt/rkt/issues/4024 The rkt driver will continue to live as an external plugin.	2020-02-26 22:16:41 -05:00
Fredrik Hoem Grelland	edb3bd0f3f	Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170 )	2020-02-23 16:24:53 +01:00
Nick Ethier	eb9c8593ba	Merge pull request #7163 from hashicorp/b-driver-plugin-recovery drivermanager: attempt dispense on reattachment failure	2020-02-21 10:33:20 -05:00
Mahmood Ali	98ad59b1de	update rest of consul packages	2020-02-16 16:25:04 -06:00
Nick Ethier	d8eed3119d	drivermanager: attempt dispense on reattachment failure	2020-02-15 00:50:06 -05:00
Seth Hoenig	543354aabe	Merge pull request #7106 from hashicorp/f-ctag-override client: enable configuring enable_tag_override for services	2020-02-13 12:34:48 -06:00
Michael Schurter	8c332a3757	Merge pull request #7102 from hashicorp/test-limits Fix some race conditions and flaky tests	2020-02-13 10:19:11 -08:00
Seth Hoenig	7f33b92e0b	command: use consistent CONSUL_HTTP_TOKEN name Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same. Note that consul-template uses CONSUL_TOKEN, which Nomad also uses, so be careful to preserve any reference to that in the consul-template context.	2020-02-12 10:42:33 -06:00
Seth Hoenig	0e44094d1a	client: enable configuring enable_tag_override for services Consul provides a feature of Service Definitions where the tags associated with a service can be modified through the Catalog API, overriding the value(s) configured in the agent's service configuration. To enable this feature, the flag enable_tag_override must be configured in the service definition. Previously, Nomad did not allow configuring this flag, and thus the default value of false was used. Now, it is configurable. Because Nomad itself acts as a state machine around the the service definitions of the tasks it manages, it's worth describing what happens when this feature is enabled and why. Consider the basic case where there is no Nomad, and your service is provided to consul as a boring JSON file. The ultimate source of truth for the definition of that service is the file, and is stored in the agent. Later, Consul performs "anti-entropy" which synchronizes the Catalog (stored only the leaders). Then with enable_tag_override=true, the tags field is available for "external" modification through the Catalog API (rather than directly configuring the service definition file, or using the Agent API). The important observation is that if the service definition ever changes (i.e. the file is changed & config reloaded OR the Agent API is used to modify the service), those "external" tag values are thrown away, and the new service definition is once again the source of truth. In the Nomad case, Nomad itself is the source of truth over the Agent in the same way the JSON file was the source of truth in the example above. That means any time Nomad sets a new service definition, any externally configured tags are going to be replaced. When does this happen? Only on major lifecycle events, for example when a task is modified because of an updated job spec from the 'nomad job run <existing>' command. Otherwise, Nomad's periodic re-sync's with Consul will now no longer try to restore the externally modified tag values (as long as enable_tag_override=true). Fixes #2057	2020-02-10 08:00:55 -06:00
Michael Schurter	2896f78f77	client: fix race accessing Node.status * Call Node.Canonicalize once when Node is created. * Lock when accessing fields mutated by node update goroutine	2020-02-07 15:50:47 -08:00
Seth Hoenig	db7bcba027	tests: set consul token for nomad client for testing SIDS TR hook	2020-01-31 19:06:15 -06:00
Seth Hoenig	9b20ca5b25	e2e: setup consul ACLs a little more correctly	2020-01-31 19:06:11 -06:00
Seth Hoenig	4152254c3a	tests: skip some SIDS hook tests if running tests as root	2020-01-31 19:05:32 -06:00
Seth Hoenig	441e8c7db7	client: additional test cases around failures in SIDS hook	2020-01-31 19:05:27 -06:00
Seth Hoenig	c281b05fc0	client: PR cleanup - improved logging around kill task in SIDS hook	2020-01-31 19:05:23 -06:00
Seth Hoenig	03a4af9563	client: PR cleanup - shadow context variable	2020-01-31 19:05:19 -06:00
Seth Hoenig	587a5d4a8d	nomad: make TaskGroup.UsesConnect helper a public helper	2020-01-31 19:05:11 -06:00
Seth Hoenig	057f117592	client: manage TR kill from parent on SI token derivation failure Re-orient the management of the tr.kill to happen in the parent of the spawned goroutine that is doing the actual token derivation. This makes the code a little more straightforward, making it easier to reason about not leaking the worker goroutine.	2020-01-31 19:05:02 -06:00
Seth Hoenig	c8761a3f11	client: set context timeout around SI token derivation The derivation of an SI token needs to be safegaurded by a context timeout, otherwise an unresponsive Consul could cause the siHook to block forever on Prestart.	2020-01-31 19:04:56 -06:00
Seth Hoenig	4ee55fcd6c	nomad,client: apply more comment/style PR tweaks	2020-01-31 19:04:52 -06:00
Seth Hoenig	be7c671919	nomad,client: apply smaller PR suggestions Apply smaller suggestions like doc strings, variable names, etc. Co-Authored-By: Nick Ethier <nethier@hashicorp.com> Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2020-01-31 19:04:40 -06:00
Seth Hoenig	78a7d1e426	comments: cleanup some leftover debug comments and such	2020-01-31 19:04:35 -06:00
Seth Hoenig	5c5da95f34	client: skip task SI token file load failure if testing as root The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when running as a non-priveleged user, since it deliberately tries to read an un-readable file to simulate a failure loading the SI token file.	2020-01-31 19:04:30 -06:00
Seth Hoenig	ab7ae8bbb4	client: remove unused indirection for referencing consul executable Was thinking about using the testing pattern where you create executable shell scripts as test resources which "mock" the process a bit of code is meant to fork+exec. Turns out that wasn't really necessary in this case.	2020-01-31 19:04:25 -06:00
Seth Hoenig	2c7ac9a80d	nomad: fixup token policy validation	2020-01-31 19:04:08 -06:00
Seth Hoenig	d204f2f4f0	client: enable envoy bootstrap hook to set SI token When creating the envoy bootstrap configuration, we should append the "-token=<token>" argument in the case where the sidsHook placed the token in the secrets directory.	2020-01-31 19:04:01 -06:00
Seth Hoenig	9df33f622f	nomad: proxy requests for Service Identity tokens between Clients and Consul Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run & connect, etc. This changeset enables the Nomad Server to recieve RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect using Tasks. Those tokens are then relayed back to the requesting Client, which then injects the tokens in the secrets directory of the Task.	2020-01-31 19:03:53 -06:00
Seth Hoenig	93cf770edb	client: enable nomad client to request and set SI tokens for tasks When a job is configured with Consul Connect aware tasks (i.e. sidecar), the Nomad Client should be able to request from Consul (through Nomad Server) Service Identity tokens specific to those tasks.	2020-01-31 19:03:38 -06:00
Mahmood Ali	9611324654	Merge pull request #6922 from hashicorp/b-alloc-canoncalize Handle Upgrades and Alloc.TaskResources modification	2020-01-28 15:12:41 -05:00
Mahmood Ali	bc183a3654	tests: run_for is already a string	2020-01-28 14:58:57 -05:00
Mahmood Ali	a0340016b9	client: canonicalize alloc.Job on restore There is a case for always canonicalizing alloc.Job field when canonicalizing the alloc. I'm less certain of implications though, and the job canonicalize hasn't changed for a long time. Here, we special case client restore from database as it's probably the most relevant part. When receiving an alloc from RPC, the data should be fresh enough.	2020-01-28 09:59:05 -05:00
Mahmood Ali	f36cc54efd	actually always canonicalize alloc.Job alloc.Job may be stale as well and need to migrate it. It does cost extra cycles but should be negligible.	2020-01-15 09:02:48 -05:00
Mahmood Ali	b1b714691c	address review comments	2020-01-15 08:57:05 -05:00
Drew Bailey	ff4bfb8809	Merge pull request #6841 from hashicorp/f-agent-pprof-acl Remote agent pprof endpoints	2020-01-10 14:52:39 -05:00
Nick Ethier	1f28633954	Merge pull request #6816 from hashicorp/b-multiple-envoy connect: configure envoy to support multiple sidecars in the same alloc	2020-01-09 23:25:39 -05:00
Drew Bailey	45210ed901	Rename profile package to pprof Address pr feedback, rename profile package to pprof to more accurately describe its purpose. Adds gc param for heap lookup profiles.	2020-01-09 15:15:10 -05:00
Drew Bailey	1b8af920f3	address pr feedback	2020-01-09 15:15:09 -05:00
Drew Bailey	279512c7f8	provide helpful error, cleanup logic	2020-01-09 15:15:08 -05:00
Drew Bailey	fd42020ad6	RPC server EnableDebug option Passes in agent enable_debug config to nomad server and client configs. This allows for rpc endpoints to have more granular control if they should be enabled or not in combination with ACLs. enable debug on client test	2020-01-09 15:15:07 -05:00
Drew Bailey	9a80938fb1	region forwarding; prevent recursive forwards for impossible requests prevent region forwarding loop, backfill tests fix failing test	2020-01-09 15:15:06 -05:00
Drew Bailey	46121fe3fd	move shared structs out of client and into nomad	2020-01-09 15:15:05 -05:00
Drew Bailey	3672414888	test pprof headers and profile methods tidy up, add comments clean up seconds param assignment	2020-01-09 15:15:04 -05:00
Drew Bailey	fc37448683	warn when enabled debug is on when registering m -> a receiver name return codederrors, fix query	2020-01-09 15:15:04 -05:00
Drew Bailey	50288461c9	Server request forwarding for Agent.Profile Return rpc errors for profile requests, set up remote forwarding to target leader or server id for profile requests. server forwarding, endpoint tests	2020-01-09 15:15:03 -05:00
Drew Bailey	49ad5fbc85	agent pprof endpoints wip, agent endpoint and client endpoint for pprof profiles agent endpoint test	2020-01-09 15:15:02 -05:00
Mahmood Ali	4e5d867644	client: stop using alloc.TaskResources Now that alloc.Canonicalize() is called in all alloc sources in the client (i.e. on state restore and RPC fetching), we no longer need to check alloc.TaskResources. alloc.AllocatedResources is always non-nil through alloc runner. Though, early on, we check for alloc validity, so NewTaskRunner and TaskEnv must still check. `TestClient_AddAllocError` test validates that behavior.	2020-01-09 09:25:07 -05:00
Mahmood Ali	7c153e1a64	client: canonicalize alloc runner on RPC	2020-01-09 08:46:50 -05:00
Mahmood Ali	d740d347ce	Migrate old alloc structs on read This commit ensures that Alloc.AllocatedResources is properly populated when read from persistence stores (namely Raft and client state store). The alloc struct may have been written previously by an arbitrary old version that may only populate Alloc.TaskResources.	2020-01-09 08:46:50 -05:00
Tim Gross	fa4da93578	interpolate environment for services in script checks (#6916 ) In 0.10.2 (specifically 387b016) we added interpolation to group service blocks and centralized the logic for task environment interpolation. This wasn't also added to script checks, which caused a regression where the IDs for script checks for services w/ interpolated fields (ex. the service name) didn't match the service ID that was registered with Consul. This changeset calls the same taskenv interpolation logic during `script_check` configuration, and adds tests to reduce the risk of future regressions by comparing the IDs of service hook and the check hook.	2020-01-09 08:12:54 -05:00
Nick Ethier	9c3cc63cd1	tr: initialize envoybootstrap prestart hook response.Env field	2020-01-08 13:41:38 -05:00
Nick Ethier	105cbf6df9	tr: expose envoy sidecar admin port as environment variable	2020-01-06 21:53:45 -05:00
Nick Ethier	677e9cdc16	connect: configure envoy such that multiple sidecars can run in the same alloc	2020-01-06 11:26:27 -05:00
Tim Gross	e9bac50c76	client: fix trace log message in alloc hook update (#6881 )	2019-12-19 16:44:04 -05:00
Drew Bailey	d9e41d2880	docs for shutdown delay update docs, address pr comments ensure pointer is not nil use pointer for diff tests, set vs unset	2019-12-16 11:38:35 -05:00
Drew Bailey	ae145c9a37	allow only positive shutdown delay more explicit test case, remove select statement	2019-12-16 11:38:30 -05:00
Drew Bailey	24929776a2	shutdown delay for task groups copy struct values ensure groupserviceHook implements RunnerPreKillhook run deregister first test that shutdown times are delayed move magic number into variable	2019-12-16 11:38:16 -05:00
Danielle	b006be623d	Update client/fingerprint/env_aws.go Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2019-12-16 14:48:52 +01:00
Danielle Lancashire	5a87b3ab4b	env_aws: Disable Retries and set Session cfg Previously, Nomad used hand rolled HTTP requests to interact with the EC2 metadata API. Recently however, we switched to using the AWS SDK for this fingerprinting. The default behaviour of the AWS SDK is to perform retries with exponential backoff when a request fails. This is problematic for Nomad, because interacting with the EC2 API is in our client start path. Here we revert to our pre-existing behaviour of not performing retries in the fast path, as if the metadata service is unavailable, it's likely that nomad is not running in AWS.	2019-12-16 10:56:32 +01:00
Mahmood Ali	4a1cc67f58	Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob driver: allow disabling log collection	2019-12-13 11:41:20 -05:00
Mahmood Ali	a7361612b6	Merge pull request #6556 from hashicorp/c-vendor-multierror-20191025 Update go-multierror library	2019-12-13 11:32:42 -05:00
Mahmood Ali	46bc3b57e6	address review comments	2019-12-13 11:21:00 -05:00
Mahmood Ali	b3a1e571e5	tests: fix error format assertion multierror library changed formatting slightly.	2019-12-13 11:01:20 -05:00
Chris Dickson	4d8ba272d1	client: expose allocated CPU per task (#6784 )	2019-12-09 15:40:22 -05:00
Seth Hoenig	f0c3dca49c	tests: swap lib/freeport for tweaked helper/freeport Copy the updated version of freeport (sdk/freeport), and tweak it for use in Nomad tests. This means staying below port 10000 to avoid conflicts with the lib/freeport that is still transitively used by the old version of consul that we vendor. Also provide implementations to find ephemeral ports of macOS and Windows environments. Ports acquired through freeport are supposed to be returned to freeport, which this change now also introduces. Many tests are modified to include calls to a cleanup function for Server objects. This should help quite a bit with some flakey tests, but not all of them. Our port problems will not go away completely until we upgrade our vendor version of consul. With Go modules, we'll probably do a 'replace' to swap out other copies of freeport with the one now in 'nomad/helper/freeport'.	2019-12-09 08:37:32 -06:00
Mahmood Ali	0b7085ba3a	driver: allow disabling log collection Operators commonly have docker logs aggregated using various tools and don't need nomad to manage their docker logs. Worse, Nomad uses a somewhat heavy docker api call to collect them and it seems to cause problems when a client runs hundreds of log collections. Here we add a knob to disable log aggregation completely for nomad. When log collection is disabled, we avoid running logmon and docker_logger for the docker tasks in this implementation. The downside here is once disabled, `nomad logs ...` commands and API no longer return logs and operators must corrolate alloc-ids with their aggregated log info. This is meant as a stop gap measure. Ideally, we'd follow up with at least two changes: First, we should optimize behavior when we can such that operators don't need to disable docker log collection. Potentially by reverting to using pre-0.9 syslog aggregation in linux environments, though with different trade-offs. Second, when/if logs are disabled, nomad logs endpoints should lookup docker logs api on demand. This ensures that the cost of log collection is paid sparingly.	2019-12-08 14:15:03 -05:00
Mahmood Ali	ded2a725db	Merge pull request #6788 from hashicorp/b-timeout-logmon-stop logmon: add timeout to RPC operations	2019-12-06 19:12:06 -05:00
Danielle Lancashire	d2075ebae9	spellcheck: Fix spelling of retrieve	2019-12-05 18:59:47 -06:00
Mahmood Ali	b2ae27863e	Merge pull request #6779 from hashicorp/r-aws-fingerprint-via-library Use AWS SDK to access EC2 Metadata	2019-12-02 13:30:51 -05:00
Mahmood Ali	83089feff5	logmon: add timeout to RPC operations Add an RPC timeout for logmon. In https://github.com/hashicorp/nomad/issues/6461#issuecomment-559747758 , `logmonClient.Stop` locked up and indefinitely blocked the task runner destroy operation. This is an incremental improvement. We still need to follow up to understand how we got to that state, and the full impact of locked-up Stop and its link to pending allocations on restart.	2019-12-02 10:33:05 -05:00
Mahmood Ali	293276a457	fingerprint code refactor Some code cleanup: * Use a field for setting EC2 metadata instead of env-vars in testing; but keep environment variables for backward compatibility reasons * Update tests to use testify	2019-11-26 10:51:28 -05:00
Mahmood Ali	1e48f8e20d	fingerprint: avoid api query if config overrides it	2019-11-26 10:51:28 -05:00
Mahmood Ali	5bb9089431	fingerprint: use ec2metadata package	2019-11-26 10:51:27 -05:00
Lars Lehtonen	0d344e8578	client: fix use of T.Fatal inside TestFS_logsImpl_NoFollow() goroutine.	2019-11-25 23:51:28 -08:00
Mahmood Ali	e89108fb01	fixup! tests: don't assume eth0 network is available	2019-11-21 08:28:20 -05:00
Mahmood Ali	443804b5c7	tests: don't assume eth0 network is available TestClient_UpdateNodeFromFingerprintKeepsConfig checks a test node network interface, which is hardcoded to `eth0` and is updated asynchronously. This causes flakiness when eth0 isn't available. Here, we hardcode the value to an arbitrary network interface.	2019-11-20 20:37:30 -05:00
Mahmood Ali	ed3f1957e7	tests: run TestClient_WatchAllocs in non-linux environments	2019-11-20 20:37:29 -05:00
Mahmood Ali	521f51a929	testS: fix TestClient_RestoreError When spinning a second client, ensure that it uses new driver instances, rather than reuse the already shutdown unhealthy drivers from first instance. This speeds up tests significantly, but cutting ~50 seconds or so, the timeout in NewClient until drivers fingerprints. They never do because drivers were shutdown already.	2019-11-20 20:37:28 -05:00
Mahmood Ali	4efb71cf0c	tests: remove TestClient_RestoreError test TestClient_RestoreError is very slow, taking ~81 seconds. It has few problematic patterns. It's unclear what it tests, it simulates a failure condition where all state db lookup fails and asserts that alloc fails. Though starting from https://github.com/hashicorp/nomad/pull/6216 , we don't fail allocs in that condition but rather restart them. Also, the drivers used in second client `c2` are the same singleton instances used in `c1` and already shutdown. We ought to start healthy new driver instances.	2019-11-20 20:37:27 -05:00
Preetha	be4a51d5b8	Merge pull request #6349 from hashicorp/b-host-stats client: Return empty values when host stats fail	2019-11-20 10:13:02 -06:00
Lang Martin	aa985ebe21	getter: allow the gcs download scheme (#6692 )	2019-11-19 09:10:56 -05:00
Nick Ethier	bd454a4c6f	client: improve group service stanza interpolation and check_re… (#6586 ) * client: improve group service stanza interpolation and check_restart support Interpolation can now be done on group service stanzas. Note that some task runtime specific information that was previously available when the service was registered poststart of a task is no longer available. The check_restart stanza for checks defined on group services will now properly restart the allocation upon check failures if configured.	2019-11-18 13:04:01 -05:00
Drew Bailey	b644e1f47d	add server-id to monitor specific server	2019-11-14 09:53:41 -05:00
Drew Bailey	f4a7e3dc75	coordinate closing of doneCh, use interface to simplify callers comments	2019-11-05 11:44:26 -05:00
Drew Bailey	fe542680dc	log-json -> json fix typo command/agent/monitor/monitor.go Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com> Update command/agent/monitor/monitor.go Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com> address feedback, lock to prevent send on closed channel fix lock/unlock for dropped messages	2019-11-05 09:51:59 -05:00
Drew Bailey	ddfa20b993	address feedback, fix gauge metric name	2019-11-05 09:51:57 -05:00
Drew Bailey	298b8358a9	move forwarded monitor request into helper	2019-11-05 09:51:56 -05:00
Drew Bailey	318b6c91bf	monitor command takes no args rm extra new line fix lint errors return after close fix, simplify test	2019-11-05 09:51:55 -05:00
Drew Bailey	c7b633b6c1	lock in sub select rm redundant lock wip to use framing wip switch to stream frames	2019-11-05 09:51:54 -05:00
Drew Bailey	17d876d5ef	rename function, initialize log level better underscores instead of dashes for query params	2019-11-05 09:51:53 -05:00
Drew Bailey	8178beecf0	address feedback, use agent_endpoint instead of monitor	2019-11-05 09:51:53 -05:00
Drew Bailey	db65b1f4a5	agent:read acl policy for monitor	2019-11-05 09:51:52 -05:00
Drew Bailey	2533617888	rpc acl tests for both monitor endpoints	2019-11-05 09:51:51 -05:00
Drew Bailey	a45ae1cd58	enable json formatting, use queryoptions	2019-11-05 09:51:49 -05:00
Drew Bailey	786989dbe3	New monitor pkg for shared monitor functionality Adds new package that can be used by client and server RPC endpoints to facilitate monitoring based off of a logger clean up old code small comment about write rm old comment about minsize rename to Monitor Removes connection logic from monitor command Keep connection logic in endpoints, use a channel to send results from monitoring use new multisink logger and interfaces small test for dropped messages update go-hclogger and update sink/intercept logger interfaces	2019-11-05 09:51:49 -05:00
Drew Bailey	e076204820	get local rpc endpoint working	2019-11-05 09:51:48 -05:00
Drew Bailey	976c43157c	remove log_writer prefix output with proper spacing update gzip handler, adjust first byte flow to allow gzip handler bypass wip, first stab at wiring up rpc endpoint	2019-11-05 09:51:48 -05:00
Michael Schurter	9fed8d1bed	client: fix panic from 0.8 -> 0.10 upgrade makeAllocTaskServices did not do a nil check on AllocatedResources which causes a panic when upgrading directly from 0.8 to 0.10. While skipping 0.9 is not supported we intend to fix serious crashers caused by such upgrades to prevent cluster outages. I did a quick audit of the client package and everywhere else that accesses AllocatedResources appears to be properly guarded by a nil check.	2019-11-01 07:47:03 -07:00
Lars Lehtonen	4ed9427c77	client/allocwatcher: fix dropped test error (#6592 )	2019-10-31 08:29:25 -04:00
Michael Schurter	eba4d4cd6f	vault: remove dead lease code	2019-10-25 15:08:35 -07:00
Michael Schurter	39437a5c5b	Merge branch 'master' into release-0100	2019-10-22 08:17:57 -07:00
Michael Schurter	b6bb561854	cleanup post 0.10.0 release	2019-10-22 07:48:09 -07:00
Nomad Release bot	3e6c9dd40e	Generate files for 0.10.0 release	2019-10-22 12:34:56 +00:00
Mahmood Ali	262dcb0842	Revert "lint: ignore generated windows syscall wrappers" This reverts commit 482862e6ab0f8db748367bb1eefc2efd11fbe11a.	2019-10-22 08:23:44 -04:00
Michael Schurter	460bd63db0	client: expose group network ports in env vars Fixes #6375 Intentionally omitted IPs prior to 0.10.0 release to minimize changes and risk.	2019-10-21 13:28:35 -07:00
Michael Schurter	8634533e82	client: expose group network ports in env vars Fixes #6375 Intentionally omitted IPs prior to 0.10.0 release to minimize changes and risk.	2019-10-21 12:31:13 -07:00
Mahmood Ali	e1b3e208d1	client: don't retry fingerprinting on shutdown At shutdown, driver manager context expires and the fingerprinting channel closes. Thus it is undeterministic which clause of The select statement gets executed, and we may keep retrying until the `i.ctx.Done()` block is executed. Here, we check always check ctx expiration before retrying again.	2019-10-21 08:54:11 -04:00
Michael Schurter	bb82f365ff	connect: upgrade to envoy 1.11.2 and add sha Append the Docker image sha to the Envoy image to ensure users default to using the version that Nomad was tested against.	2019-10-18 10:16:58 -07:00
Michael Schurter	ee5ea3ecc7	connect: upgrade to envoy 1.11.2 and add sha Append the Docker image sha to the Envoy image to ensure users default to using the version that Nomad was tested against.	2019-10-18 07:46:53 -07:00
Mahmood Ali	4e4a9b252c	Merge pull request #6290 from hashicorp/r-generated-code-refactor dev: avoid codecgen code in downstream projects	2019-10-15 08:22:31 -04:00
Michael Schurter	2992cb80b0	Remove 0.10.0-rc1 generated files	2019-10-10 13:31:42 -07:00
Nomad Release bot	3007f1662e	Generate files for 0.10.0-rc1 release	2019-10-10 19:08:23 +00:00
Michael Schurter	f54f1cb321	Revert "Revert "Use joint context to cancel prestart hooks""	2019-10-08 11:34:09 -07:00
Michael Schurter	81a30ae106	Revert "Use joint context to cancel prestart hooks"	2019-10-08 11:27:08 -07:00
Mahmood Ali	4b2ba62e35	acl: check ACL against object namespace Fix a bug where a millicious user can access or manipulate an alloc in a namespace they don't have access to. The allocation endpoints perform ACL checks against the request namespace, not the allocation namespace, and performs the allocation lookup independently from namespaces. Here, we check that the requested can access the alloc namespace regardless of the declared request namespace. Ideally, we'd enforce that the declared request namespace matches the actual allocation namespace. Unfortunately, we haven't documented alloc endpoints as namespaced functions; we suspect starting to enforce this will be very disruptive and inappropriate for a nomad point release. As such, we maintain current behavior that doesn't require passing the proper namespace in request. A future major release may start enforcing checking declared namespace.	2019-10-08 12:59:22 -04:00
Drew Bailey	69eebcd241	simplify logic to check for vault read event defer shutdown to cleanup after failed run Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> update comment to include ctx note for shutdown	2019-09-30 11:02:14 -07:00
Drew Bailey	7565b8a8d9	Use joint context to cancel prestart hooks fixes https://github.com/hashicorp/nomad/issues/6382 The prestart hook for templates blocks while it resolves vault secrets. If the secret is not found it continues to retry. If a task is shutdown during this time, the prestart hook currently does not receive shutdownCtxCancel, causing it to hang. This PR joins the two contexts so either killCtx or shutdownCtx cancel and stop the task.	2019-09-30 10:48:01 -07:00
Mahmood Ali	e29ee4c400	nomad: defensive check for namespaces in job registration call In a job registration request, ensure that the request namespace "header" and job namespace field match. This should be the case already in prod, as http handlers ensures that the values match [1]. This mitigates bugs that exploit bugs where we may check a value but act on another, resulting into bypassing ACL system. [1] https://github.com/hashicorp/nomad/blob/v0.9.5/command/agent/job_endpoint.go#L415-L418	2019-09-26 17:02:47 -04:00
Tim Gross	a6aadb3714	connect: remove proxy socket for restarted client	2019-09-25 14:58:17 -04:00
Tim Gross	e43d33aa50	client: don't run alloc postrun during shutdown	2019-09-25 14:58:17 -04:00
Tim Gross	d965a15490	driver/networking: don't recreate existing network namespaces	2019-09-25 14:58:17 -04:00
Jasmine Dahilig	0780adfa7f	timeout after 5 seconds when client opens a data directory (#6348 )	2019-09-24 16:28:21 -07:00
Danielle Lancashire	c9bcb7e76a	client_stats: Always emit client stats	2019-09-19 01:22:08 +02:00
Danielle Lancashire	4f2343e1c0	client: Return empty values when host stats fail Currently, there is an issue when running on Windows whereby under some circumstances the Windows stats API's will begin to return errors (such as internal timeouts) when a client is under high load, and potentially other forms of resource contention / system states (and other unknown cases). When an error occurs during this collection, we then short circuit further metrics emission from the client until the next interval. This can be problematic if it happens for a sustained number of intervals, as our metrics aggregator will begin to age out older metrics, and we will eventually stop emitting various types of metrics including `nomad.client.unallocated.*` metrics. However, when metrics collection fails on Linux, gopsutil will in many cases (e.g cpu.Times) silently return 0 values, rather than an error. Here, we switch to returning empty metrics in these failures, and logging the error at the source. This brings the behaviour into line with Linux/Unix platforms, and although making aggregation a little sadder on intermittent failures, will result in more desireable overall behaviour of keeping metrics available for further investigation if things look unusual.	2019-09-19 01:22:07 +02:00
Danielle	f8b64ee1ab	Merge pull request #6330 from hashicorp/f-host-vols-fail-startup client: Fail startup if host volumes do not exist	2019-09-17 10:55:30 -07:00
Danielle	ec3ecdecfc	Merge pull request #6321 from hashicorp/dani/remove-config Hoist Volume.Config.Source into Volume.Source	2019-09-16 10:12:58 -07:00
Danielle Lancashire	e3796e9d60	client: Fail startup if host volumes do not exist Some drivers will automatically create directories when trying to mount a path into a container, but some will not. To unify this behaviour, this commit requires that host volumes already exist, and can be stat'd by the Nomad Agent during client startup.	2019-09-13 23:28:10 +02:00
Ruslan Usifov	b3c72d1729	close file handle when FileRotator object will closed. Fixes https://github.com/hashicorp/nomad/issues/6309 (#6323 )	2019-09-13 10:31:13 -04:00
Tim Gross	a6ef8c5d42	client/networking: wrap error message from CNI plugin (#6316 )	2019-09-13 08:20:05 -04:00
Danielle Lancashire	78b61de45f	config: Hoist volume.config.source into volume Currently, using a Volume in a job uses the following configuration: ``` volume "alias-name" { type = "volume-type" read_only = true config { source = "host_volume_name" } } ``` This commit migrates to the following: ``` volume "alias-name" { type = "volume-type" source = "host_volume_name" read_only = true } ``` The original design was based due to being uncertain about the future of storage plugins, and to allow maxium flexibility. However, this causes a few issues, namely: - We frequently need to parse this configuration during submission, scheduling, and mounting - It complicates the configuration from and end users perspective - It complicates the ability to do validation As we understand the problem space of CSI a little more, it has become clear that we won't need the `source` to be in config, as it will be used in the majority of cases: - Host Volumes: Always need a source - Preallocated CSI Volumes: Always needs a source from a volume or claim name - Dynamic Persistent CSI Volumes: Always needs a source to attach the volumes to for managing upgrades and to avoid dangling. - Dynamic Ephemeral CSI Volumes: Less thought out, but `source` will probably point to the plugin name, and a `config` block will allow you to pass meta to the plugin. Or will point to a pre-configured ephemeral config. *If implemented The new design simplifies this by merging the source into the volume stanza to solve the above issues with usability, performance, and error handling.	2019-09-13 04:37:59 +02:00
Mahmood Ali	78f62d3670	Merge pull request #6080 from lchayoun/bug-6079 Allow dash in environment variable names	2019-09-11 11:17:24 -07:00
Mahmood Ali	b6bf7f9a6c	Merge pull request #6260 from hashicorp/c-circleci-tweak-20190903 ci: ignore nested pkgs in GOTEST_PKGS_EXCLUDE	2019-09-11 11:17:10 -07:00
Tim Gross	b05cd4c430	test: expand symlink for temp dir for macOS compatibility (#6303 ) On macOS, `os.TempDir` returns a symlinked path under `/var` which is outside of the directories shared into the VM used for Docker, and that fails tests using Docker that need that mount. If we expand the symlink to get the real path in `/private`, we're in the shared folders and can safely mount them.	2019-09-10 12:20:09 -04:00
Mahmood Ali	4b8280e51d	remove generated code	2019-09-06 19:24:15 +00:00
Nomad Release bot	dc7d728a82	Generate files for 0.10.0-beta1 release	2019-09-06 18:47:09 +00:00
Tim Gross	3fa4bca4a0	script checks: Update needs to update Alloc as well (#6291 )	2019-09-06 11:18:00 -04:00
Mahmood Ali	01f42053e4	dev: avoid codecgen code in downstream projects This is an attempt to ease dependency management for external driver plugins, by avoiding requiring them to compile ugorji/go generated files. Plugin developers reported some pain with the brittleness of ugorji/go dependency in particular, specially when using go mod, the default go mod manager in golang 1.13. Context -------- Nomad uses msgpack to persist and serialize internal structs, using ugorji/go library. As an optimization, we use ugorji/go code generation to speedup process and aovid the relection-based slow path. We commit these generated files in repository when we cut and tag the release to ease reproducability and debugging old releases. Thus, downstream projects that depend on release tag, indirectly depends on ugorji/go generated code. Sadly, the generated code is brittle and specific to the version of ugorji/go being used. When go mod picks another version of ugorji/go then nomad (go mod by default uses release according to semver), downstream projects face compilation errors. Interestingly, downstream projects don't commonly serialize nomad internal structs. Drivers and device plugins use grpc instead of msgpack for the most part. In the few cases where they use msgpag (e.g. decoding task config), they do without codegen path as they run on driver specific structs not the nomad internal structs. Also, the ugorji/go serialization through reflection is generally backward compatible (mod some ugorji/go regression bugs that get introduced every now and then :( ). Proposal --------- The proposal here is to keep committing ugorji/go codec generated files for releases but to use a go tag for them. All nomad development through the makefile, including releasing, CI and dev flow, has the tag enabled. Downstream plugin projects, by default, will skip these files and life proceed as normal for them. The downside is that nomad developers who use generated code but avoid using make must start passing additional go tag argument. Though this is not a blessed configuration.	2019-09-06 09:22:00 -04:00
Tim Gross	8ce201854a	client: recreate script checks on Update (#6265 ) Splitting the immutable and mutable components of the scriptCheck led to a bug where the environment interpolation wasn't being incorporated into the check's ID, which caused the UpdateTTL to update for a check ID that Consul didn't have (because our Consul client creates the ID from the structs.ServiceCheck each time we update). Task group services don't have access to a task environment at creation, so their checks get registered before the check can be interpolated. Use the original check ID so they can be updated.	2019-09-05 11:43:23 -04:00
Michael Schurter	ee06c36345	Merge pull request #6254 from hashicorp/test-connect-e2e-demo e2e: test demo job for connect	2019-09-04 14:33:26 -07:00
Nick Ethier	e440ba80f1	ar: refactor network bridge config to use go-cni lib (#6255 ) * ar: refactor network bridge config to use go-cni lib * ar: use eth as the iface prefix for bridged network namespaces * vendor: update containerd/go-cni package * ar: update network hook to use TODO contexts when calling configurator * unnecessary conversion	2019-09-04 16:33:25 -04:00
Michael Schurter	93b47f4ddc	client: reword error message	2019-09-04 12:40:09 -07:00
Mahmood Ali	b76de72943	Merge pull request #6251 from hashicorp/b-port-map-regression NOMAD_PORT_<label> regression	2019-09-04 11:54:09 -04:00
Danielle	2564a1dabd	Merge pull request #6239 from hashicorp/b-32bitmem Fix memory fingerprinting on 32bit	2019-09-04 17:39:07 +02:00
Danielle	aa5605fce1	fingerprint: Add backwards compatibility comment Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-09-04 17:38:35 +02:00
Mahmood Ali	87f0457973	fix qemu and update docker with tests	2019-09-04 11:27:51 -04:00
Jasmine Dahilig	5b6e39b37c	fix portmap envvars in docker driver	2019-09-04 11:26:13 -04:00
Danielle Lancashire	67715d846e	fingerprint: Restore support for legacy memory fingerprint	2019-09-04 17:10:28 +02:00
Tim Gross	0f29dcc935	support script checks for task group services (#6197 ) In Nomad prior to Consul Connect, all Consul checks work the same except for Script checks. Because the Task being checked is running in its own container namespaces, the check is executed by Nomad in the Task's context. If the Script check passes, Nomad uses the TTL check feature of Consul to update the check status. This means in order to run a Script check, we need to know what Task to execute it in. To support Consul Connect, we need Group Services, and these need to be registered in Consul along with their checks. We could push the Service down into the Task, but this doesn't work if someone wants to associate a service with a task's ports, but do script checks in another task in the allocation. Because Nomad is handling the Script check and not Consul anyways, this moves the script check handling into the task runner so that the task runner can own the script check's configuration and lifecycle. This will allow us to pass the group service check configuration down into a task without associating the service itself with the task. When tasks are checked for script checks, we walk back through their task group to see if there are script checks associated with the task. If so, we'll spin off script check tasklets for them. The group-level service and any restart behaviors it needs are entirely encapsulated within the group service hook.	2019-09-03 15:09:04 -04:00
Pete Woods	49b7d23cea	Add node "status" and "scheduling eligibility" tags to client metrics (#6130 ) When summing up the capability of your Nomad fleet for scaling purposes, it's important to exclude draining nodes, as they won't accept new jobs.	2019-09-03 12:11:11 -04:00
Michael Schurter	5957030d18	connect: add unix socket to proxy grpc for envoy (#6232 ) * connect: add unix socket to proxy grpc for envoy Fixes #6124 Implement a L4 proxy from a unix socket inside a network namespace to Consul's gRPC endpoint on the host. This allows Envoy to connect to Consul's xDS configuration API. * connect: pointer receiver on structs with mutexes * connect: warn on all proxy errors	2019-09-03 08:43:38 -07:00
Mahmood Ali	0f401e6b8b	lint: ignore generated windows syscall wrappers	2019-09-03 10:59:58 -04:00
Jasmine Dahilig	4edebe389a	add default update stanza and max_parallel=0 disables deployments (#6191 )	2019-09-02 10:30:09 -07:00
Evan Ercolano	fcf66918d0	Remove unused canary param from MakeTaskServiceID	2019-08-31 16:53:23 -04:00
Danielle Lancashire	4fcb7394e9	client: Fix memory fingerprinting on 32bit Also introduce regression ci for 32 bit fingerprinting	2019-08-31 18:33:59 +02:00
Mahmood Ali	0bd2eee87f	Merge pull request #6216 from hashicorp/b-recognize-pending-allocs alloc_runner: wait when starting suspicious allocs	2019-08-28 14:46:09 -04:00
Mahmood Ali	e0da3c5d0e	rename to hasLocalState, and ignore clientstate The ClientState being pending isn't a good criteria; as an alloc may have been updated in-place before it was completed. Also, updated the logic so we only check for task states. If an alloc has deployment state but no persisted tasks at all, restore will still fail.	2019-08-28 11:44:48 -04:00
Nick Ethier	cf014c7fd5	ar: ensure network forwarding is allowed for bridged allocs (#6196 ) * ar: ensure network forwarding is allowed in iptables for bridged allocs * ensure filter rule exists at setup time	2019-08-28 10:51:34 -04:00
Nick Ethier	cbb27e74bc	Add environment variables for connect upstreams (#6171 ) * taskenv: add connect upstream env vars + test * set taskenv upstreams instead of appending * Update client/taskenv/env.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-27 23:41:38 -04:00
Mahmood Ali	90c5eefbab	Alternative approach: avoid restoring This uses an alternative approach where we avoid restoring the alloc runner in the first place, if we suspect that the alloc may have been completed already.	2019-08-27 17:30:55 -04:00
Jasmine Dahilig	4078393bb6	expose nomad namespace as environment variable in allocation #5692 (#6192 )	2019-08-27 08:38:07 -07:00
Mahmood Ali	647c1457cb	alloc_runner: wait when starting suspicious allocs This commit aims to help users running with clients suseptible to the destroyed alloc being restrarted bug upgrade to latest. Without this, such users will have their tasks run unexpectedly on upgrade and only see the bug resolved after subsequent restart. If, on restore, the client sees a pending alloc without any other persisted info, then err on the side that it's an corrupt persisted state of an alloc instead of the client happening to be killed right when alloc is assigned to client. Few reasons motivate this behavior: Statistically speaking, corruption being the cause is more likely. A long running client will have higher chance of having allocs persisted incorrectly with pending state. Being killed right when an alloc is about to start is relatively unlikely. Also, delaying starting an alloc that hasn't started (by hopefully seconds) is not as severe as launching too many allocs that may bring client down. More importantly, this helps customers upgrade their clients without risking taking their clients down and destablizing their cluster. We don't want existing users to force triggering the bug while they upgrade and restart cluster.	2019-08-26 22:05:31 -04:00
Mahmood Ali	dfdf0edd3b	Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun Don't persist allocs of destroyed alloc runners	2019-08-26 17:26:18 -04:00
Mahmood Ali	cc460d4804	Write to client store while holding lock Protect against a race where destroying and persist state goroutines race. The downside is that the database io operation will run while holding the lock and may run indefinitely. The risk of lock being long held is slow destruction, but slow io has bigger problems.	2019-08-26 13:45:58 -04:00
Mahmood Ali	1851820f20	logmon: log stat error to help debugging	2019-08-26 10:10:20 -04:00
Mahmood Ali	c132623ffc	Don't persist allocs of destroyed alloc runners This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information in non-transaction non-atomic concurrently while alloc runner is running and potentially changing state is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890	2019-08-25 11:21:28 -04:00
Mahmood Ali	6301725002	logmon: revert workaround for Windows go1.11 bug Revert e0126123ab1ba848f72458538bc6118c978245e6 now that we are running with Golang 1.12, and https://github.com/golang/go/issues/29119 is no longer relevant.	2019-08-24 08:19:44 -04:00
Mahmood Ali	df1f3eb9ee	Merge pull request #6201 from hashicorp/b-device-stats-interval initialize device manager stats interval	2019-08-24 08:16:03 -04:00
Mahmood Ali	07b5f4c530	Merge pull request #6146 from hashicorp/b-config-template-copy clientConfig.Copy() to copy template config too	2019-08-23 19:00:57 -04:00
Mahmood Ali	b98568774b	clientConfig.Copy() to copy template config too	2019-08-23 18:43:22 -04:00
Lang Martin	4f6493a301	taskrunner getter set Umask for go-getter, setuid test	2019-08-23 15:59:03 -04:00
Mahmood Ali	3890619100	initialize device manager stats interval Fixes a bug where we cpu is pigged at 100% due to collecting devices statistics. The passed stats interval was ignored, and the default zero value causes a very tight loop of stats collection. FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a `g2.2xlarge` ec2 instance. The stats interval defaults to 1 second and is user configurable. I believe this is too frequent as a default, and I may advocate for reducing it to a value closer to 5s or 10s, but keeping it as is for now. Fixes https://github.com/hashicorp/nomad/issues/6057 .	2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet	cbdc1978bf	Consul service meta (#6193 ) * adds meta object to service in job spec, sends it to consul * adds tests for service meta * fix tests * adds docs * better hashing for service meta, use helper for copying meta when registering service * tried to be DRY, but looks like it would be more work to use the helper function	2019-08-23 12:49:02 -04:00
Nick Ethier	96d379071d	ar: fix bridge networking port mapping when port.To is unset (#6190 )	2019-08-22 21:53:52 -04:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, boostrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Michael Schurter	b008fd1724	connect: register group services with Consul Fixes #6042 Add new task group service hook for registering group services like Connect-enabled services. Does not yet support checks.	2019-08-20 12:25:10 -07:00
lchayoun	2307c9d1d2	allow dash in non generated environment variable names - should only clean generate environment variables	2019-08-16 11:11:47 +03:00
Nick Ethier	965f00b2fc	Builtin Admission Controller Framework (#6116 ) * nomad: add admission controller framework * nomad: add admission controller framework and Consul Connect hooks * run admission controllers before checking permissions * client: add default node meta for connect configurables * nomad: remove validateJob func since it has been moved to admission controller * nomad: use new TaskKind type * client: use consts for connect sidecar image and log level * Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad: add job register test with connect sidecar * Update nomad/job_endpoint_hooks.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-15 11:22:37 -04:00
lchayoun	c5a38a045a	allow dash in non generated environment variable names - should only clean generate environment variables	2019-08-13 19:23:13 +03:00
Tim Gross	03433f35d4	client/template: configuration for function blacklist and sandboxing When rendering a task template, the `plugin` function is no longer permitted by default and will raise an error. An operator can opt-in to permitting this function with the new `template.function_blacklist` field in the client configuration. When rendering a task template, path parameters for the `file` function will be treated as relative to the task directory by default. Relative paths or symlinks that point outside the task directory will raise an error. An operator can opt-out of this protection with the new `template.disable_file_sandbox` field in the client configuration.	2019-08-12 16:34:48 -04:00
Danielle Lancashire	7e6c8e5ac1	Copy documentation to api/tasks	2019-08-12 16:22:27 +02:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	e132a30899	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle Lancashire	6ef8d5233e	client: Add volume_hook for mounting volumes	2019-08-12 15:39:08 +02:00
Danielle Lancashire	063e4240c1	client: Add parsing and registration of HostVolume configuration	2019-08-12 15:39:08 +02:00
lchayoun	ca892163b2	allow dash in non generated environment variable names	2019-08-11 12:51:42 +03:00
Nick Ethier	7806f4c597	Revert "client: add autofetch for CNI plugins" This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.	2019-08-08 15:10:19 -04:00
Nick Ethier	7d28ece8de	Revert "client: remove debugging lines" This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.	2019-08-08 14:52:52 -04:00

... 3 4 5 6 7 ...

4274 Commits