open-nomad

Commit Graph

Author	SHA1	Message	Date
Jasmine Dahilig	71a694f39c	Merge pull request #8390 from hashicorp/lifecycle-poststart-hook task lifecycle poststart hook	2020-08-31 13:53:24 -07:00
Jasmine Dahilig	fbe0c89ab1	task lifecycle poststart: code review fixes	2020-08-31 13:22:41 -07:00
Seth Hoenig	9f1f2a5673	Merge branch 'master' into f-cc-ingress	2020-08-26 15:31:05 -05:00
Seth Hoenig	dfe179abc5	consul/connect: fixup some comments and context timeout	2020-08-26 13:17:16 -05:00
Mahmood Ali	10954bf717	close file when done reading	2020-08-24 20:22:42 -04:00
Mahmood Ali	0be632debf	don't lock if ref is nil Ensure that d.mu is only dereferenced if d is not-nil, to avoid a null dereference panic.	2020-08-24 20:19:40 -04:00
Seth Hoenig	26e77623e5	consul/connect: fixup tests to use new consul sdk	2020-08-24 12:02:41 -05:00
Seth Hoenig	a09d1746bf	Merge branch 'master' into consul-v1.7.7	2020-08-24 10:43:00 -05:00
Yoan Blanc	327d17e0dc	fixup! vendor: consul/api, consul/sdk v1.6.0 Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-08-24 08:59:03 +02:00
Mark Lee	cd23fd7ca2	refactor lookup code	2020-08-24 12:24:16 +09:00
Mark Lee	cd7aabca72	lookup kernel builtin modules too	2020-08-24 11:09:13 +09:00
Seth Hoenig	5b072029f2	consul/connect: add initial support for ingress gateways This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza. ```hcl service { connect { gateway { proxy { // envoy proxy configuration } ingress { // ingress-gateway configuration entry } } } } ``` A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value. Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver. Aims to address #8294 and tangentially #8647	2020-08-21 16:21:54 -05:00
Shengjing Zhu	7a4f48795d	Adjust cgroup change in libcontainer	2020-08-20 00:31:07 +08:00
Michael Schurter	de08ae8083	test: add allocrunner test for poststart hooks	2020-08-12 09:54:14 -07:00
Nick Ethier	e39574be59	docker: support group allocated ports and host_networks (#8623) * docker: support group allocated ports * docker: add new ports driver config to specify which group ports are mapped * docker: update port mapping docs	2020-08-11 18:30:22 -04:00
Lang Martin	a27913e699	CSI RPC Token (#8626) * client/allocrunner/csi_hook: use the Node SecretID * client/allocrunner/csi_hook: include the namespace for Claim	2020-08-11 13:08:39 -04:00
Michael Schurter	e1946b66ce	client: remove shortcircuit preventing poststart hooks from running	2020-08-11 09:48:24 -07:00
Michael Schurter	04a135b57d	client: don't restart poststart sidecars on success	2020-08-11 09:47:18 -07:00
Tim Gross	7d53ed88d6	csi: client RPCs should return wrapped errors for checking (#8605) When the client-side actions of a CSI client RPC succeed but we get disconnected during the RPC or we fail to checkpoint the claim state, we want to be able to retry the client RPC without getting blocked by the client-side state (ex. mount points) already having been cleaned up in previous calls.	2020-08-07 11:01:36 -04:00
Tim Gross	2854298089	csi: release claims via csi_hook postrun unpublish RPC (#8580) Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This may forward client RPCs to the node plugins or to the controller plugins, depending on whether other allocations on this node have claims on this volume. By making clients responsible for running the `CSIVolume.Unpublish` RPC (and making the RPC available to a `nomad volume detach` command), the volumewatcher becomes only used by the core GC job and we no longer need async volume GC from job deregister and node update.	2020-08-06 14:51:46 -04:00
Jasmine Dahilig	e8ed6851e2	lifecycle: add allocrunner and task hook coordinator unit tests	2020-07-29 12:39:42 -07:00
Seth Hoenig	a392b19b6a	consul/connect: fixup some spelling, comments, consts	2020-07-29 09:26:01 -05:00
Seth Hoenig	04bb6c416f	consul/connect: organize lock & fields in http/grpc socket hooks	2020-07-29 09:26:01 -05:00
Seth Hoenig	dbee956c05	consul/connect: optimze grpc socket hook check for bridge network first	2020-07-29 09:26:01 -05:00
Seth Hoenig	2511f48351	consul/connect: add support for bridge networks with connect native tasks Before, Connect Native Tasks needed one of these to work: - To be run in host networking mode - To have the Consul agent configured to listen to a unix socket - To have the Consul agent configured to listen to a public interface None of these are a great experience, though running in host networking is still the best solution for non-Linux hosts. This PR establishes a connection proxy between the Consul HTTP listener and a unix socket inside the alloc fs, bypassing the network namespace for any Connect Native task. Similar to and re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies. Proxy is established only if the alloc is configured for bridge networking and there is at least one Connect Native task in the Task Group. Fixes #8290	2020-07-29 09:26:01 -05:00
Drew Bailey	bd421b6197	Merge pull request #8453 from hashicorp/oss-multi-vault-ns oss compoments for multi-vault namespaces	2020-07-27 08:45:22 -04:00
Mahmood Ali	2d0b80a0ed	Merge pull request #6517 from hashicorp/b-fingerprint-shutdown-race client: don't retry fingerprinting on shutdown	2020-07-24 11:56:32 -04:00
Drew Bailey	b296558b8e	oss compoments for multi-vault namespaces adds in oss components to support enterprise multi-vault namespace feature upgrade specific doc on vault multi-namespaces vault docs update test to reflect new error	2020-07-24 10:14:59 -04:00
Michael Schurter	1400e0480d	Merge pull request #8521 from hashicorp/docs-hearbeat docs: s/hearbeat/heartbeat and fix link	2020-07-23 14:07:24 -07:00
Tim Gross	56c6dacd38	csi: NodePublish should not create target_path, only its parent dir (#8505) The NodePublish workflow currently creates the target path and its parent directory. However, the CSI specification says that the CO shall ensure the parent directory of the target path exists, and that the SP shall place the block device or mounted directory at the target path. Much of our testing has been with CSI plugins that are more forgiving, but our behavior breaks spec-compliant CSI plugins. This changeset ensures we only create the parent directory.	2020-07-23 15:52:22 -04:00
Michael Schurter	8340ad4da8	docs: s/hearbeat/heartbeat and fix link Also fixed the same typo in a test. Fixing the typo fixes the link, but the link was still broken when running the website locally due to the trailing slash. It would have worked in prod thanks to redirects, but using the canonical URL seems ideal.	2020-07-23 11:33:34 -07:00
Mahmood Ali	91ba9ccefe	honor config.NetworkInterface in NodeNetworks	2020-07-21 15:43:45 -04:00
Mahmood Ali	c2d3c3e431	nvidia: support disabling the nvidia plugin (#8353)	2020-07-21 10:11:16 -04:00
Jasmine Dahilig	44c21bd3c7	fix panic, but poststart is still stalled	2020-07-10 09:03:10 -07:00
Mahmood Ali	e9bf3a42f5	Merge pull request #8333 from hashicorp/b-test-tweak-20200701 tests: avoid os.Exit in a test	2020-07-10 11:18:28 -04:00
Jasmine Dahilig	9e27231953	add poststart hook to task hook coordinator & structs	2020-07-08 11:01:35 -07:00
Nick Ethier	e0fb634309	ar: support opting into binding host ports to default network IP (#8321) * ar: support opting into binding host ports to default network IP * fix config plumbing * plumb node address into network resource * struct: only handle network resource upgrade path once	2020-07-06 18:51:46 -04:00
Lang Martin	6c22cd587d	api: `nomad debug` new /agent/host (#8325) * command/agent/host: collect host data, multi platform * nomad/structs/structs: new HostDataRequest/Response * client/agent_endpoint: add RPC endpoint * command/agent/agent_endpoint: add Host * api/agent: add the Host endpoint * nomad/client_agent_endpoint: add Agent Host with forwarding * nomad/client_agent_endpoint: use findClientConn This changes forwardMonitorClient and forwardProfileClient to use findClientConn, which was cribbed from the common parts of those funcs. * command/debug: call agent hosts * command/agent/host: eliminate calling external programs	2020-07-02 09:51:25 -04:00
Mahmood Ali	026d8c6eed	tests: avoid os.Exit in a test	2020-07-01 15:25:13 -04:00
Mahmood Ali	7f460d2706	allocrunner: terminate sidecars in the end This fixes a bug where a batch allocation fails to complete if it has sidecars. If the only remaining running tasks in an allocations are sidecars - we must kill them and mark the allocation as complete.	2020-06-29 15:12:15 -04:00
Seth Hoenig	011c6b027f	connect/native: doc and comment tweaks from PR	2020-06-24 10:13:22 -05:00
Seth Hoenig	03a5706919	connect/native: check for pre-existing consul token	2020-06-24 09:16:28 -05:00
Seth Hoenig	6154181a64	connect/native: update connect native hook tests	2020-06-23 12:07:35 -05:00
Seth Hoenig	c5d3f58bee	connect/native: give tls files an extension	2020-06-23 12:06:28 -05:00
Seth Hoenig	4d71f22a11	consul/connect: add support for running connect native tasks This PR adds the capability of running Connect Native Tasks on Nomad, particularly when TLS and ACLs are enabled on Consul. The `connect` stanza now includes a `native` parameter, which can be set to the name of task that backs the Connect Native Consul service. There is a new Client configuration parameter for the `consul` stanza called `share_ssl`. Like `allow_unauthenticated` the default value is true, but recommended to be disabled in production environments. When enabled, the Nomad Client's Consul TLS information is shared with Connect Native tasks through the normal Consul environment variables. This does NOT include auth or token information. If Consul ACLs are enabled, Service Identity Tokens are automatically and injected into the Connect Native task through the CONSUL_HTTP_TOKEN environment variable. Any of the automatically set environment variables can be overridden by the Connect Native task using the `env` stanza. Fixes #6083	2020-06-22 14:07:44 -05:00
Tim Gross	3d38592fbb	csi: add VolumeContext to NodeStage/Publish RPCs (#8239) In #7957 we added support for passing a volume context to the controller RPCs. This is an opaque map that's created by `CreateVolume` or, in Nomad's case, in the volume registration spec. However, we missed passing this field to the `NodeStage` and `NodePublish` RPC, which prevents certain plugins (such as MooseFS) from making node RPCs.	2020-06-22 13:54:32 -04:00
Michael Schurter	562704124d	Merge pull request #8208 from hashicorp/f-multi-network multi-interface network support	2020-06-19 15:46:48 -07:00
Mahmood Ali	3824e0362c	Revert "client: defensive against getting stale alloc updates"	2020-06-19 15:39:44 -04:00
Nick Ethier	a87e91e971	test: fix up testing around host networks	2020-06-19 13:53:31 -04:00
Nick Ethier	f0ac1f027a	lint: spelling	2020-06-19 11:29:41 -04:00
Nick Ethier	0374ad3e6c	taskenv: populate NOMAD_IP|PORT|ADDR env from allocated ports	2020-06-19 10:51:32 -04:00
Nick Ethier	f0559a8162	multi-interface network support	2020-06-19 09:42:10 -04:00
Nick Ethier	4a44deaa5c	CNI Implementation (#7518)	2020-06-18 11:05:29 -07:00
Nick Ethier	0bc0403cc3	Task DNS Options (#7661) Co-Authored-By: Tim Gross <tgross@hashicorp.com> Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>	2020-06-18 11:01:31 -07:00
Drew Bailey	84afc28ceb	only report tasklogger is running if both stdout and stderr are still running (#8155) * only report tasklogger is running if both stdout and stderr are still running * changelog	2020-06-12 09:17:35 -04:00
Lang Martin	ac7c39d3d3	Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099) * client/heartbeatstop: reversed time condition for startup grace * scheduler/generic_sched: use `delayInstead` to avoid a loop Without protecting the loop that creates followUpEvals, a delayed eval is allowed to create an immediate subsequent delayed eval. For both `stop_after_client_disconnect` and the `reschedule` block, a delayed eval should always produce some immediate result (running or blocked) and then only after the outcome of that eval produce a second delayed eval. * scheduler/reconcile: lostLater are different than delayedReschedules Just slightly. `lostLater` allocs should be used to create batched evaluations, but `handleDelayedReschedules` assumes that the allocations are in the untainted set. When it creates the in-place updates to those allocations at the end, it causes the allocation to be treated as running over in the planner, which causes the initial `stop_after_client_disconnect` evaluation to be retried by the worker.	2020-06-03 09:48:38 -04:00
Mahmood Ali	5703c0db80	tests: Run a task long enough to be restartable	2020-05-31 10:33:03 -04:00
Drew Bailey	59ca304fce	give enterpriseclient a logger (#8072)	2020-05-28 15:43:16 -04:00
Drew Bailey	34871f89be	Oss license support for ent builds (#8054) * changes necessary to support oss licesning shims revert nomad fmt changes update test to work with enterprise changes update tests to work with new ent enforcements make check update cas test to use scheduler algorithm back out preemption changes add comments * remove unused method	2020-05-27 13:46:52 -04:00
Mahmood Ali	2588b3bc98	cleanup driver eventor goroutines This fixes few cases where driver eventor goroutines are leaked during normal operations, but especially so in tests. This change makes few modifications: First, it switches drivers to use `Context`s to manage shutdown events. Previously, it relied on callers invoking `.Shutdown()` function that is specific to internal drivers only and require casting. Using `Contexts` provide a consistent idiomatic way to manage lifecycle for both internal and external drivers. Also, I discovered few places where we don't clean up a temporary driver instance in the plugin catalog code, where we dispense a driver to inspect and validate the schema config without properly cleaning it up.	2020-05-26 11:04:04 -04:00
Tim Gross	ba11aef5d9	csi: skip unit tests on unsupported platforms (#8033) Some of the unit tests for CSI require platform-specific APIs that aren't available on macOS. We can safely skip these tests.	2020-05-21 13:56:50 -04:00
Tim Gross	aa8927abb4	volumes: return better error messages for unsupported task drivers (#8030) When an allocation runs for a task driver that can't support volume mounts, the mounting will fail in a way that can be hard to understand. With host volumes this usually means failing silently, whereas with CSI the operator gets inscrutable internals exposed in the `nomad alloc status`. This changeset adds a MountConfig field to the task driver Capabilities response. We validate this when the `csi_hook` or `volume_hook` fires and return a user-friendly error. Note that we don't currently have a way to get driver capabilities up to the server, except through attributes. Validating this when the user initially submits the jobspec would be even better than what we're doing here (and could be useful for all our other capabilities), but that's out of scope for this changeset. Also note that the MountConfig enum starts with "supports all" in order to support community plugins in a backwards compatible way, rather than cutting them off from volume mounting unexpectedly.	2020-05-21 09:18:02 -04:00
Tim Gross	065fa7af8b	stats_hook: log normal shutdown condition as debug, not error (#8028) The `stats_hook` writes an Error log every time an allocation becomes terminal. This is a normal condition, not an error. A real error condition like a failure to collect the stats is logged later. It just creates log noise, and this is a particularly bad operator experience for heavy batch workloads.	2020-05-20 10:28:30 -04:00
Mahmood Ali	751f337f1c	Update hcl2 vendoring The hcl2 library has moved from http://github.com/hashicorp/hcl2 to https://github.com/hashicorp/hcl/tree/hcl2. This updates Nomad's vendoring to start using hcl2 library. Also updates some related libraries (e.g. `github.com/zclconf/go-cty/cty` and `github.com/apparentlymart/go-textseg`).	2020-05-19 15:00:03 -04:00
Tim Gross	6a463dc13a	csi: use a blocking initial connection with timeout (#7965) The plugin supervisor lazily connects to plugins, but this means we only get "Unavailable" back from the gRPC call in cases where the plugin can never be reached (for example, if the Nomad client has the wrong permissions for the socket). This changeset improves the operator experience by switching to a blocking `DialWithContext`. It eagerly connects so that we can validate the connection is real and get a "failed to open" error in case where Nomad can't establish the initial connection.	2020-05-15 08:17:11 -04:00
Tim Gross	2082cf738a	csi: support for VolumeContext and VolumeParameters (#7957) The MVP for CSI in the 0.11.0 release of Nomad did not include support for opaque volume parameters or volume context. This changeset adds support for both. This also moves args for ControllerValidateCapabilities into a struct. The CSI plugin `ControllerValidateCapabilities` struct that we turn into a CSI RPC is accumulating arguments, so moving it into a request struct will reduce the churn of this internal API, make the plugin code more readable, and make this method consistent with the other plugin methods in that package.	2020-05-15 08:16:01 -04:00
Tim Gross	24aa32c503	csi: use a blocking initial connection with timeout The plugin supervisor lazily connects to plugins, but this means we only get "Unavailable" back from the gRPC call in cases where the plugin can never be reached (for example, if the Nomad client has the wrong permissions for the socket). This changeset improves the operator experience by switching to a blocking `DialWithContext`. It eagerly connects so that we can validate the connection is real and get a "failed to open" error in case where Nomad can't establish the initial connection.	2020-05-14 15:59:19 -04:00
Tim Gross	4f54a633a2	csi: refactor internal client field name to ExternalID (#7958) The CSI plugins RPCs require the use of the storage provider's volume ID, rather than the user-defined volume ID. Although changing the RPCs to use the field name `ExternalID` risks breaking backwards compatibility, we can use the `ExternalID` name internally for the client and only use `VolumeID` at the RPC boundaries.	2020-05-14 11:56:07 -04:00
Lang Martin	d3c4700cd3	server: stop after client disconnect (#7939) * jobspec, api: add stop_after_client_disconnect * nomad/state/state_store: error message typo * structs: alloc methods to support stop_after_client_disconnect 1. a global AllocStates to track status changes with timestamps. We need this to track the time at which the alloc became lost originally. 2. ShouldClientStop() and WaitClientStop() to actually do the math * scheduler/reconcile_util: delayByStopAfterClientDisconnect * scheduler/reconcile: use delayByStopAfterClientDisconnect * scheduler/util: updateNonTerminalAllocsToLost comments This was setup to only update allocs to lost if the DesiredStatus had already been set by the scheduler. It seems like the intention was to update the status from any non-terminal state, and not all lost allocs have been marked stop or evict by now * scheduler/testing: AssertEvalStatus just use require * scheduler/generic_sched: don't create a blocked eval if delayed * scheduler/generic_sched_test: several scheduling cases	2020-05-13 16:39:04 -04:00
Mahmood Ali	0ece631e60	allochealth: Fix when check health preceeds task health Fix a bug where if the alloc check becomes healthy before the task health, the alloc may never be considered healthy.	2020-05-13 07:44:39 -04:00
Mahmood Ali	934c5e8ff0	tests: tests for health check sequencing Add a failing tests to show that if an alloc checks is marked healthy before the alloc tasks start up, the alloc may be forever considered unhealthy.	2020-05-13 07:43:00 -04:00
Tim Gross	4374c1a837	csi: support Secrets parameter in CSI RPCs (#7923) CSI plugins can require credentials for some publishing and unpublishing workflow RPCs. Secrets are configured at the time of volume registration, stored in the volume struct, and then passed around as an opaque map by Nomad to the plugins.	2020-05-11 17:12:51 -04:00
Mahmood Ali	938e916d9c	When serializing msgpack, only consider codec tag When serializing structs with msgpack, only consider type tags of `codec`. Hashicorp/go-msgpack (based on ugorji/go) defaults to interpretting `codec` tag if it's available, but falls to using `json` if `codec` isn't present. This behavior is surprising in cases where we want to serialize json differently from msgpack, e.g. serializing `ConsulExposeConfig`.	2020-05-11 14:14:10 -04:00
Mahmood Ali	543f08c1ae	Deflake TestTaskTemplateManager_BlockedEvents test This change deflakes TestTaskTemplateManager_BlockedEvents test, because it is expecting a number of events without accounting for transitional state. The test TestTaskTemplateManager_BlockedEvents attempts to ensure that a template rendering emits blocked events for missing template ksys. It works by setting a template that requires keys 0,1,2,3,4 and then eventually sets keys 0,1,2,3 and ensures that we get a final event indicating that keys 3 and 4 are still missing. The test waits to get a blocked event for the final state, but it can fail if receives a blocked event for a transitional state (e.g. one reporting 2,3,4,5 are missing). This fixes the test by ensuring that it waits until the final message before assertion. Also, it clarifies the intent of the test with stricter assertions and additional comments.	2020-05-09 14:09:39 -04:00
Juan Larriba	a0df437c62	Run Linux Images (LCOW) and Windows Containers side by side (#7850) Makes it possible to run Linux Containers On Windows with Nomad alongside Windows Containers. Fingerprint prevents only to run Nomad in Windows 10 with Linux Containers	2020-05-04 13:08:47 -04:00
Lang Martin	ad2fb4b297	client/heartbeatstop: don't store client state, use timeout In order to minimize this change while keeping a simple version of the behavior, we set `lastOk` to the current time less the intial server connection timeout. If the client starts and never contacts the server, it will stop all configured tasks after the initial server connection grace period, on the assumption that we've been out of touch longer than any configured `stop_after_client_disconnect`. The more complex state behavior might be justified later, but we should learn about failure modes first.	2020-05-01 12:35:49 -04:00
Lang Martin	28bac139cb	client/heartbeatstop: destroy allocs when disconnected from servers - track lastHeartbeat, the client local time of the last successful heartbeat round trip - track allocations with `stop_after_client_disconnect` configured - trigger allocation destroy (which handles cleanup) - restore heartbeat/killable allocs tracking when allocs are recovered from disk - on client restart, stop those allocs after a grace period if the servers are still partioned	2020-05-01 12:35:49 -04:00
Tim Gross	cc7dbad1c7	csi: restore long timeout for controller plugins (#7840) During MVP development, we reduced the timeout for controller plugins to avoid long hangs in GC workers. But now that this work has been moved to the volume watcher, we can restore the original timeout which is better suited for the characteristic timescales of some cloud provider APIs and better matches the behavior of k8s.	2020-04-30 17:12:05 -04:00
Seth Hoenig	880c4e23d3	env_aws: combine 3 log lines into 1	2020-04-29 10:47:36 -06:00
Seth Hoenig	67303b666c	env_aws: downgrade log line Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2020-04-29 10:34:26 -06:00
Seth Hoenig	5ddc607701	env_aws: fixup log line Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2020-04-29 10:33:53 -06:00
Seth Hoenig	f8596a3602	env_aws: use best-effort lookup table for CPU performance in EC2 Fixes #7681 The current behavior of the CPU fingerprinter in AWS is that it reads the current speed from `/proc/cpuinfo` (`CPU MHz` field). This is because the max CPU frequency is not available by reading anything on the EC2 instance itself. Normally on Linux one would look at e.g. `sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq` or perhaps parse the values from the `CPU max MHz` field in `/proc/cpuinfo`, but those values are not available. Furthermore, no metadata about the CPU is made available in the EC2 metadata service. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html Since `go-psutil` cannot determine the max CPU speed it defaults to the current CPU speed, which could be basically any number between 0 and the true max. This is particularly bad on large, powerful reserved instances which often idle at ~800 MHz while Nomad does its fingerprinting (typically IO bound), which Nomad then uses as the max, which results in severe loss of available resources. Since the CPU specification is unavailable programmatically (at least not without sudo) use a best-effort lookup table. This table was generated by going through every instance type in AWS documentation and copy-pasting the numbers. https://aws.amazon.com/ec2/instance-types/ This approach obviously is not ideal as future instance types will need to be added as they are introduced to AWS. However, using the table should only be an improvement over the status quo since right now Nomad miscalculates available CPU resources on all instance types.	2020-04-28 19:01:33 -06:00
Mahmood Ali	18dba6fdad	Harmonize go-msgpack/codec/codecgen Use v1.1.5 of go-msgpack/codec/codecgen, so go-msgpack codecgen matches the library version. We branched off earlier to pick up `f51b518921` , but apparently that's not needed as we could customize the package via `-c` argument.	2020-04-28 17:12:31 -04:00
Tim Gross	083b35d651	csi: checkpoint volume claim garbage collection (#7782) Adds a `CSIVolumeClaim` type to be tracked as current and past claims on a volume. Allows for a client RPC failure during node or controller detachment without having to keep the allocation around after the first garbage collection eval. This changeset lays groundwork for moving the actual detachment RPCs into a volume watching loop outside the GC eval.	2020-04-23 11:06:23 -04:00
Charlie Voiselle	c68c19f3cf	Use ExternalID in NodeStageVolume RPC (#7754)	2020-04-20 17:13:46 -04:00
Anthony Scalisi	9664c6b270	fix spelling errors (#6985)	2020-04-20 09:28:19 -04:00
Drew Bailey	8bfee62b70	Run task shutdown_delay regardless of service registration task shutdown_delay will currently only run if there are registered services for the task. This implementation detail isn't explicity stated anywhere and is defined outside of the service stanza. This change moves shutdown_delay to be evaluated after prekill hooks are run, outside of any task runner hooks. just use time.sleep	2020-04-10 11:06:26 -04:00
Nick Ethier	44ad5d96d8	ar/bridge: use cni.IsCNINotInitialized helper	2020-04-06 21:44:01 -04:00
Nick Ethier	58fe326090	ar/bridge: better cni status err handling	2020-04-06 21:21:42 -04:00
Nick Ethier	6a286777c7	ar/bridge: ensure cni configuration is always loaded	2020-04-06 21:02:26 -04:00
Nick Ethier	5166806993	Merge pull request #7600 from hashicorp/b-5767 tr/service_hook: prevent Update from running before Poststart finish	2020-04-06 16:52:42 -04:00
Nick Ethier	567609e101	tr/service_hook: reset initialized flag during deregister	2020-04-06 16:05:36 -04:00
Drew Bailey	4ab7c03641	Merge pull request #7618 from hashicorp/b-shutdown-delay-updates Fixes bug that prevented group shutdown_delay updates	2020-04-06 13:05:20 -04:00
Drew Bailey	0d550049e9	ensure shutdown delay can be removed	2020-04-06 11:33:04 -04:00
Drew Bailey	9874e7b21d	Group shutdown delay fixes Group shutdown delay updates were not properly handled in Update hook. This commit also ensures that plan output is displayed.	2020-04-06 11:29:12 -04:00
Tim Gross	027277a0d9	csi: make volume GC in job deregister safely async The `Job.Deregister` call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the `Job.Deregister`. This allows `nomad job stop` to return immediately. In order to make this work, this changeset changes the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. We smuggle the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.	2020-04-06 10:15:55 -04:00
Tim Gross	5a3b45864d	csi: fix unpublish workflow ID mismatches The CSI plugins uses the external volume ID for all operations, but the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the mount paths. Pass the External ID as an arg in the RPC call so that the unpublish workflows have it without calling back to the server to find the external ID. The controller CSI plugins need the CSI node ID (or in other words, the storage provider's view of node ID like the EC2 instance ID), not the Nomad node ID, to determine how to detach the external volume.	2020-04-06 10:15:55 -04:00
Seth Hoenig	60c9b73eba	Merge pull request #7602 from hashicorp/b-connect-bootstrap-tls-config connect: set consul TLS options on envoy bootstrap	2020-04-03 08:50:36 -06:00
Tim Gross	f6b3d38eb8	CSI: move node unmount to server-driven RPCs (#7596) If a volume-claiming alloc stops and the CSI Node plugin that serves that alloc's volumes is missing, there's no way for the allocrunner hook to send the `NodeUnpublish` and `NodeUnstage` RPCs. This changeset addresses this issue with a redesign of the client-side for CSI. Rather than unmounting in the alloc runner hook, the alloc runner hook will simply exit. When the server gets the `Node.UpdateAlloc` for the terminal allocation that had a volume claim, it creates a volume claim GC job. This job will made client RPCs to a new node plugin RPC endpoint, and only once that succeeds, move on to making the client RPCs to the controller plugin. If the node plugin is unavailable, the GC job will fail and be requeued.	2020-04-02 16:04:56 -04:00
Nick Ethier	3b5d2f8eb8	tr/service_hook: update hook fields during update when poststart hasn't finished	2020-04-02 12:48:19 -04:00


4274 Commits