open-nomad

Author	SHA1	Message	Date
Jerome Gravel-Niquet	d1f1dbd203	Don't ignore nil devices in plugin fingerprint Even if a plugin sends back an empty `[]device.DeviceGroup`, it's transformed to `nil` during the RPC. Our custom device plugin is returning empty `FingerprintResponse.Devices` very often. Our temporary fix is to send a dummy `DeviceGroup` if the slice is empty. This has the effect of never triggering the "first fingerprint" and therefore timing out after 50s. In turn, this made our node exceed its hearbeat grace period when restarting it, revoking all vault tokens for its allocations, causing a restart of all our allocations because the token couldn't be renewed. Removing the logic for `f.Devices == nil` does not appear to affect the functionality of the function.	2020-11-10 16:04:22 -05:00
Seth Hoenig	9960f96446	client/fingerprint: detect unloaded dynamic bridge kernel module In Nomad v0.12.0, the client added additional fingerprinting around the presense of the bridge kernel module. The fingerprinter only checked in `/proc/modules` which is a list of loaded modules. In some cases, the bridge kernel module is builtin rather than dynamically loaded. The fix for that case is in #8721. However we were still missing the case where the bridge module is dynamically loaded, but not yet loaded during the startup of the Nomad agent. In this case the fingerprinter would believe the bridge module was unavailable when really it gets loaded on demand. This PR now has the fingerprinter scan the kernel module dependency file, which will contain an entry for the bridge module even if it is not yet loaded. In summary, the client now looks for the bridge kernel module in - /proc/modules - /lib/modules/<kernel>/modules.builtin - /lib/modules/<kernel>/modules.dep Closes #8423	2020-11-09 13:56:14 -06:00
Nick Ethier	04f5c4ee5f	ar/groupservice: remove drivernetwork (#9233 ) * ar/groupservice: remove drivernetwork * consul: allow host address_mode to accept raw port numbers * consul: fix logic for blank address	2020-11-05 15:00:22 -05:00
Stefan Richter	484ef8a1e8	Add NOMAD_JOB_ID and NOMAD_JOB_PAERENT_ID env variables (#8967 ) Beforehand tasks and field replacements did not have access to the unique ID of their job or its parent. This adds this information as new environment variables.	2020-10-23 10:49:58 -04:00
Tim Gross	1fb1c9c5d4	artifact/template: make destination path absolute inside taskdir (#9149 ) Prior to Nomad 0.12.5, you could use `${NOMAD_SECRETS_DIR}/mysecret.txt` as the `artifact.destination` and `template.destination` because we would always append the destination to the task working directory. In the recent security patch we treated the `destination` absolute path as valid if it didn't escape the working directory, but this breaks backwards compatibility and interpolation of `destination` fields. This changeset partially reverts the behavior so that we always append the destination, but we also perform the escape check on that new destination after interpolation so the security hole is closed. Also, ConsulTemplate test should exercise interpolation	2020-10-22 15:47:49 -04:00
Tim Gross	6df36e4cdb	artifact/template: prevent file sandbox escapes Ensure that the client honors the client configuration for the `template.disable_file_sandbox` field when validating the jobspec's `template.source` parameter, and not just with consul-template's own `file` function. Prevent interpolated `template.source`, `template.destination`, and `artifact.destination` fields from escaping file sandbox.	2020-10-21 14:34:12 -04:00
Alexander Shtuchkin	90fd8bb85f	Implement 'batch mode' for persisting allocations on the client. (#9093 ) Fixes #9047, see problem details there. As a solution, we use BoltDB's 'Batch' mode that combines multiple parallel writes into small number of transactions. See https://github.com/boltdb/bolt#batch-read-write-transactions for more information.	2020-10-20 16:15:37 -04:00
Seth Hoenig	9cdb98f0e4	client: add tests around meta and canarymeta interpolation Expanding on #9096, add tests for making sure service.Meta and service.CanaryMeta are interpolated from environment variables.	2020-10-20 12:50:29 -05:00
Jorge Marey	8a0ef606a3	Add interpolation on service canarymeta	2020-10-20 12:45:36 -05:00
Drew Bailey	6c788fdccd	Events/msgtype cleanup (#9117 ) * use msgtype in upsert node adds message type to signature for upsert node, update tests, remove placeholder method * UpsertAllocs msg type test setup * use upsertallocs with msg type in signature update test usage of delete node delete placeholder msgtype method * add msgtype to upsert evals signature, update test call sites with test setup msg type handle snapshot upsert eval outside of FSM and ignore eval event remove placeholder upsertevalsmsgtype handle job plan rpc and prevent event creation for plan msgtype cleanup upsertnodeevents updatenodedrain msgtype msg type 0 is a node registration event, so set the default to the ignore type * fix named import * fix signature ordering on upsertnode to match	2020-10-19 09:30:15 -04:00
Nick Ethier	4903e5b114	Consul with CNI and host_network addresses (#9095 ) * consul: advertise cni and multi host interface addresses * structs: add service/check address_mode validation * ar/groupservices: fetch networkstatus at hook runtime * ar/groupservice: nil check network status getter before calling * consul: comment network status can be nil	2020-10-15 15:32:21 -04:00
Michael Schurter	9c3972937b	s/0.13/1.0/g 1.0 here we come!	2020-10-14 15:17:47 -07:00
Chris Baker	1d35578bed	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Seth Hoenig	ed13e5723f	consul/connect: dynamically select envoy sidecar at runtime As newer versions of Consul are released, the minimum version of Envoy it supports as a sidecar proxy also gets bumped. Starting with the upcoming Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the default implementation of Connect sidecar proxy. This PR introduces a change such that each Nomad Client will query its local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545) and then launch the Connect sidecar proxy task using the latest supported version of Envoy. If the `SupportedProxies` API component is not available from Consul, Nomad will fallback to the old version of Envoy supported by old versions of Consul. Setting the meta configuration option `meta.connect.sidecar_image` or setting the `connect.sidecar_task` stanza will take precedence as is the current behavior for sidecar proxies. Setting the meta configuration option `meta.connect.gateway_image` will take precedence as is the current behavior for connect gateways. `meta.connect.sidecar_image` and `meta.connect.gateway_image` may make use of the special `${NOMAD_envoy_version}` variable interpolation, which resolves to the newest version of Envoy supported by the Consul agent. Addresses #8585 #7665	2020-10-13 09:14:12 -05:00
Seth Hoenig	5a3748ca82	Merge pull request #9038 from hashicorp/f-ec2-table env_aws: get ec2 cpu perf data from AWS API	2020-10-12 18:55:33 -05:00
Nick Ethier	d45be0b5a6	client: add NetworkStatus to Allocation (#8657 )	2020-10-12 13:43:04 -04:00
Yoan Blanc	891accb89a	use allow/deny instead of the colored alternatives (#9019 ) Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-10-12 08:47:05 -04:00
Tim Gross	b5abf4ec9d	csi: fix incorrect comment on csi_hook context lifetime	2020-10-09 11:03:51 -04:00
Seth Hoenig	9b555fe6d5	env_aws: fixup test case node attr detection	2020-10-08 12:59:07 -05:00
Seth Hoenig	e693d15a5b	env_aws: get ec2 cpu perf data from AWS API Previously, Nomad was using a hand-made lookup table for looking up EC2 CPU performance characteristics (core count + speed = ticks). This data was incomplete and incorrect depending on region. The AWS API has the correct data but requires API keys to use (i.e. should not be queried directly from Nomad). This change introduces a lookup table generated by a small command line tool in Nomad's tools module which uses the Amazon AWS API. Running the tool requires AWS_* environment variables set. $ # in nomad/tools/cpuinfo $ go run . Going forward, Nomad can incorporate regeneration of the lookup table somewhere in the CI pipeline so that we remain up-to-date on the latest offerings from EC2. Fixes #7830	2020-10-08 12:01:09 -05:00
Landan Cheruka	023a2d36b7	fingerprint: changed unique.platform.azure.hostname to unique.platform.azure.name (#9016 )	2020-10-02 16:50:12 -04:00
Javier Heredia	103ac0a37f	Add consul segment fingerprint (#7214 )	2020-10-02 15:15:59 -04:00
Fredrik Hoem Grelland	a015c52846	configure nomad cluster to use a Consul Namespace [Consul Enterprise] (#8849 )	2020-10-02 14:46:36 -04:00
Fredrik Hoem Grelland	953d4de8dd	update consul-template to v0.25.1 (#8988 )	2020-10-01 14:08:49 -04:00
Landan Cheruka	3df1802119	client: added azure fingerprinting support (#8979 )	2020-10-01 09:10:27 -04:00
Lars Lehtonen	03abe3c890	client: fix test umask (#8987 )	2020-09-30 08:09:41 -04:00
Mahmood Ali	2e9e8ccc24	Merge pull request #8982 from hashicorp/b-exec-dns-resolv drivers/exec: fix DNS resolution in systemd hosts	2020-09-29 11:39:43 -05:00
Mahmood Ali	7ddf4b2902	drivers/exec: fix DNS resolution in systemd hosts Host with systemd-resolved have `/etc/resolv.conf` is a symlink to `/run/systemd/resolve/stub-resolv.conf`. By bind-mounting /etc/resolv.conf only, the exec container DNS resolution fail very badly. This change fixes DNS resolution by binding /run/systemd/resolve as well. Note that this assumes that the systemd resolver (default to 127.0.0.53) is accessible within the container. This is the case here because exec containers share the same network namespace by default. Jobs with custom network dns configurations are not affected, and Nomad will continue to use the job dns settings rather than host one.	2020-09-29 11:33:51 -04:00
Seth Hoenig	af9543c997	consul: fix validation of task in group-level script-checks When defining a script-check in a group-level service, Nomad needs to know which task is associated with the check so that it can use the correct task driver to execute the check. This PR fixes two bugs: 1) validate service.task or service.check.task is configured 2) make service.check.task inherit service.task if it is itself unset Fixes #8952	2020-09-28 15:02:59 -05:00
Pete Woods	81fa2a01fc	Add node "status", "scheduling eligibility" to all client metrics (#8925 ) - We previously added these to the client host metrics, but it's useful to have them on all client metrics. - e.g. so you can exclude draining nodes from charts showing your fleet size.	2020-09-22 13:53:50 -04:00
Pierre Cauchois	e4b739cafd	RPC Timeout/Retries account for blocking requests (#8921 ) The current implementation measures RPC request timeout only against config.RPCHoldTimeout, which is fine for non-blocking requests but will almost surely be exceeded by long-poll requests that block for minutes at a time. This adds an HasTimedOut method on the RPCInfo interface that takes into account whether the request is blocking, its maximum wait time, and the RPCHoldTimeout.	2020-09-18 08:58:41 -04:00
Joel May	2adc5bdec7	fingerprinting: add AWS MAC and public-ipv6 (#8887 )	2020-09-17 09:03:01 -04:00
Lars Lehtonen	55f0302c46	client/allocrunner/taskrunner: client.Close after err check (#8825 )	2020-09-04 08:12:08 -04:00
Tim Gross	8ad90b4253	fix params for Agent.Host client RPC (#8795 ) The parameters for the receiving side of the Agent.Host client RPC did not take the arguments serialized at the server side. This results in a panic.	2020-08-31 17:14:26 -04:00
Jasmine Dahilig	71a694f39c	Merge pull request #8390 from hashicorp/lifecycle-poststart-hook task lifecycle poststart hook	2020-08-31 13:53:24 -07:00
Jasmine Dahilig	fbe0c89ab1	task lifecycle poststart: code review fixes	2020-08-31 13:22:41 -07:00
Seth Hoenig	9f1f2a5673	Merge branch 'master' into f-cc-ingress	2020-08-26 15:31:05 -05:00
Seth Hoenig	dfe179abc5	consul/connect: fixup some comments and context timeout	2020-08-26 13:17:16 -05:00
Mahmood Ali	10954bf717	close file when done reading	2020-08-24 20:22:42 -04:00
Mahmood Ali	0be632debf	don't lock if ref is nil Ensure that d.mu is only dereferenced if d is not-nil, to avoid a null dereference panic.	2020-08-24 20:19:40 -04:00
Seth Hoenig	26e77623e5	consul/connect: fixup tests to use new consul sdk	2020-08-24 12:02:41 -05:00
Seth Hoenig	a09d1746bf	Merge branch 'master' into consul-v1.7.7	2020-08-24 10:43:00 -05:00
Yoan Blanc	327d17e0dc	fixup! vendor: consul/api, consul/sdk v1.6.0 Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-08-24 08:59:03 +02:00
Mark Lee	cd23fd7ca2	refactor lookup code	2020-08-24 12:24:16 +09:00
Mark Lee	cd7aabca72	lookup kernel builtin modules too	2020-08-24 11:09:13 +09:00
Seth Hoenig	5b072029f2	consul/connect: add initial support for ingress gateways This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza. ```hcl service { connect { gateway { proxy { // envoy proxy configuration } ingress { // ingress-gateway configuration entry } } } } ``` A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value. Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver. Aims to address #8294 and tangentially #8647	2020-08-21 16:21:54 -05:00
Shengjing Zhu	7a4f48795d	Adjust cgroup change in libcontainer	2020-08-20 00:31:07 +08:00
Michael Schurter	de08ae8083	test: add allocrunner test for poststart hooks	2020-08-12 09:54:14 -07:00
Nick Ethier	e39574be59	docker: support group allocated ports and host_networks (#8623 ) * docker: support group allocated ports * docker: add new ports driver config to specify which group ports are mapped * docker: update port mapping docs	2020-08-11 18:30:22 -04:00
Lang Martin	a27913e699	CSI RPC Token (#8626 ) * client/allocrunner/csi_hook: use the Node SecretID * client/allocrunner/csi_hook: include the namespace for Claim	2020-08-11 13:08:39 -04:00
Michael Schurter	e1946b66ce	client: remove shortcircuit preventing poststart hooks from running	2020-08-11 09:48:24 -07:00
Michael Schurter	04a135b57d	client: don't restart poststart sidecars on success	2020-08-11 09:47:18 -07:00
Tim Gross	7d53ed88d6	csi: client RPCs should return wrapped errors for checking (#8605 ) When the client-side actions of a CSI client RPC succeed but we get disconnected during the RPC or we fail to checkpoint the claim state, we want to be able to retry the client RPC without getting blocked by the client-side state (ex. mount points) already having been cleaned up in previous calls.	2020-08-07 11:01:36 -04:00
Tim Gross	2854298089	csi: release claims via csi_hook postrun unpublish RPC (#8580 ) Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This may forward client RPCs to the node plugins or to the controller plugins, depending on whether other allocations on this node have claims on this volume. By making clients responsible for running the `CSIVolume.Unpublish` RPC (and making the RPC available to a `nomad volume detach` command), the volumewatcher becomes only used by the core GC job and we no longer need async volume GC from job deregister and node update.	2020-08-06 14:51:46 -04:00
Jasmine Dahilig	e8ed6851e2	lifecycle: add allocrunner and task hook coordinator unit tests	2020-07-29 12:39:42 -07:00
Seth Hoenig	a392b19b6a	consul/connect: fixup some spelling, comments, consts	2020-07-29 09:26:01 -05:00
Seth Hoenig	04bb6c416f	consul/connect: organize lock & fields in http/grpc socket hooks	2020-07-29 09:26:01 -05:00
Seth Hoenig	dbee956c05	consul/connect: optimze grpc socket hook check for bridge network first	2020-07-29 09:26:01 -05:00
Seth Hoenig	2511f48351	consul/connect: add support for bridge networks with connect native tasks Before, Connect Native Tasks needed one of these to work: - To be run in host networking mode - To have the Consul agent configured to listen to a unix socket - To have the Consul agent configured to listen to a public interface None of these are a great experience, though running in host networking is still the best solution for non-Linux hosts. This PR establishes a connection proxy between the Consul HTTP listener and a unix socket inside the alloc fs, bypassing the network namespace for any Connect Native task. Similar to and re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies. Proxy is established only if the alloc is configured for bridge networking and there is at least one Connect Native task in the Task Group. Fixes #8290	2020-07-29 09:26:01 -05:00
Drew Bailey	bd421b6197	Merge pull request #8453 from hashicorp/oss-multi-vault-ns oss compoments for multi-vault namespaces	2020-07-27 08:45:22 -04:00
Mahmood Ali	2d0b80a0ed	Merge pull request #6517 from hashicorp/b-fingerprint-shutdown-race client: don't retry fingerprinting on shutdown	2020-07-24 11:56:32 -04:00
Drew Bailey	b296558b8e	oss compoments for multi-vault namespaces adds in oss components to support enterprise multi-vault namespace feature upgrade specific doc on vault multi-namespaces vault docs update test to reflect new error	2020-07-24 10:14:59 -04:00
Michael Schurter	1400e0480d	Merge pull request #8521 from hashicorp/docs-hearbeat docs: s/hearbeat/heartbeat and fix link	2020-07-23 14:07:24 -07:00
Tim Gross	56c6dacd38	csi: NodePublish should not create target_path, only its parent dir (#8505 ) The NodePublish workflow currently creates the target path and its parent directory. However, the CSI specification says that the CO shall ensure the parent directory of the target path exists, and that the SP shall place the block device or mounted directory at the target path. Much of our testing has been with CSI plugins that are more forgiving, but our behavior breaks spec-compliant CSI plugins. This changeset ensures we only create the parent directory.	2020-07-23 15:52:22 -04:00
Michael Schurter	8340ad4da8	docs: s/hearbeat/heartbeat and fix link Also fixed the same typo in a test. Fixing the typo fixes the link, but the link was still broken when running the website locally due to the trailing slash. It would have worked in prod thanks to redirects, but using the canonical URL seems ideal.	2020-07-23 11:33:34 -07:00
Mahmood Ali	91ba9ccefe	honor config.NetworkInterface in NodeNetworks	2020-07-21 15:43:45 -04:00
Mahmood Ali	c2d3c3e431	nvidia: support disabling the nvidia plugin (#8353 )	2020-07-21 10:11:16 -04:00
Jasmine Dahilig	44c21bd3c7	fix panic, but poststart is still stalled	2020-07-10 09:03:10 -07:00
Mahmood Ali	e9bf3a42f5	Merge pull request #8333 from hashicorp/b-test-tweak-20200701 tests: avoid os.Exit in a test	2020-07-10 11:18:28 -04:00
Jasmine Dahilig	9e27231953	add poststart hook to task hook coordinator & structs	2020-07-08 11:01:35 -07:00
Nick Ethier	e0fb634309	ar: support opting into binding host ports to default network IP (#8321 ) * ar: support opting into binding host ports to default network IP * fix config plumbing * plumb node address into network resource * struct: only handle network resource upgrade path once	2020-07-06 18:51:46 -04:00
Lang Martin	6c22cd587d	api: `nomad debug` new /agent/host (#8325 ) * command/agent/host: collect host data, multi platform * nomad/structs/structs: new HostDataRequest/Response * client/agent_endpoint: add RPC endpoint * command/agent/agent_endpoint: add Host * api/agent: add the Host endpoint * nomad/client_agent_endpoint: add Agent Host with forwarding * nomad/client_agent_endpoint: use findClientConn This changes forwardMonitorClient and forwardProfileClient to use findClientConn, which was cribbed from the common parts of those funcs. * command/debug: call agent hosts * command/agent/host: eliminate calling external programs	2020-07-02 09:51:25 -04:00
Mahmood Ali	026d8c6eed	tests: avoid os.Exit in a test	2020-07-01 15:25:13 -04:00
Mahmood Ali	7f460d2706	allocrunner: terminate sidecars in the end This fixes a bug where a batch allocation fails to complete if it has sidecars. If the only remaining running tasks in an allocations are sidecars - we must kill them and mark the allocation as complete.	2020-06-29 15:12:15 -04:00
Seth Hoenig	011c6b027f	connect/native: doc and comment tweaks from PR	2020-06-24 10:13:22 -05:00
Seth Hoenig	03a5706919	connect/native: check for pre-existing consul token	2020-06-24 09:16:28 -05:00
Seth Hoenig	6154181a64	connect/native: update connect native hook tests	2020-06-23 12:07:35 -05:00
Seth Hoenig	c5d3f58bee	connect/native: give tls files an extension	2020-06-23 12:06:28 -05:00
Seth Hoenig	4d71f22a11	consul/connect: add support for running connect native tasks This PR adds the capability of running Connect Native Tasks on Nomad, particularly when TLS and ACLs are enabled on Consul. The `connect` stanza now includes a `native` parameter, which can be set to the name of task that backs the Connect Native Consul service. There is a new Client configuration parameter for the `consul` stanza called `share_ssl`. Like `allow_unauthenticated` the default value is true, but recommended to be disabled in production environments. When enabled, the Nomad Client's Consul TLS information is shared with Connect Native tasks through the normal Consul environment variables. This does NOT include auth or token information. If Consul ACLs are enabled, Service Identity Tokens are automatically and injected into the Connect Native task through the CONSUL_HTTP_TOKEN environment variable. Any of the automatically set environment variables can be overridden by the Connect Native task using the `env` stanza. Fixes #6083	2020-06-22 14:07:44 -05:00
Tim Gross	3d38592fbb	csi: add VolumeContext to NodeStage/Publish RPCs (#8239 ) In #7957 we added support for passing a volume context to the controller RPCs. This is an opaque map that's created by `CreateVolume` or, in Nomad's case, in the volume registration spec. However, we missed passing this field to the `NodeStage` and `NodePublish` RPC, which prevents certain plugins (such as MooseFS) from making node RPCs.	2020-06-22 13:54:32 -04:00
Michael Schurter	562704124d	Merge pull request #8208 from hashicorp/f-multi-network multi-interface network support	2020-06-19 15:46:48 -07:00
Mahmood Ali	3824e0362c	Revert "client: defensive against getting stale alloc updates"	2020-06-19 15:39:44 -04:00
Nick Ethier	a87e91e971	test: fix up testing around host networks	2020-06-19 13:53:31 -04:00
Nick Ethier	f0ac1f027a	lint: spelling	2020-06-19 11:29:41 -04:00
Nick Ethier	0374ad3e6c	taskenv: populate NOMAD_IP\|PORT\|ADDR env from allocated ports	2020-06-19 10:51:32 -04:00
Nick Ethier	f0559a8162	multi-interface network support	2020-06-19 09:42:10 -04:00
Nick Ethier	4a44deaa5c	CNI Implementation (#7518 )	2020-06-18 11:05:29 -07:00
Nick Ethier	0bc0403cc3	Task DNS Options (#7661 ) Co-Authored-By: Tim Gross <tgross@hashicorp.com> Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>	2020-06-18 11:01:31 -07:00
Drew Bailey	84afc28ceb	only report tasklogger is running if both stdout and stderr are still running (#8155 ) * only report tasklogger is running if both stdout and stderr are still running * changelog	2020-06-12 09:17:35 -04:00
Lang Martin	ac7c39d3d3	Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099 ) * client/heartbeatstop: reversed time condition for startup grace * scheduler/generic_sched: use `delayInstead` to avoid a loop Without protecting the loop that creates followUpEvals, a delayed eval is allowed to create an immediate subsequent delayed eval. For both `stop_after_client_disconnect` and the `reschedule` block, a delayed eval should always produce some immediate result (running or blocked) and then only after the outcome of that eval produce a second delayed eval. * scheduler/reconcile: lostLater are different than delayedReschedules Just slightly. `lostLater` allocs should be used to create batched evaluations, but `handleDelayedReschedules` assumes that the allocations are in the untainted set. When it creates the in-place updates to those allocations at the end, it causes the allocation to be treated as running over in the planner, which causes the initial `stop_after_client_disconnect` evaluation to be retried by the worker.	2020-06-03 09:48:38 -04:00
Mahmood Ali	5703c0db80	tests: Run a task long enough to be restartable	2020-05-31 10:33:03 -04:00
Drew Bailey	59ca304fce	give enterpriseclient a logger (#8072 )	2020-05-28 15:43:16 -04:00
Drew Bailey	34871f89be	Oss license support for ent builds (#8054 ) * changes necessary to support oss licesning shims revert nomad fmt changes update test to work with enterprise changes update tests to work with new ent enforcements make check update cas test to use scheduler algorithm back out preemption changes add comments * remove unused method	2020-05-27 13:46:52 -04:00
Mahmood Ali	2588b3bc98	cleanup driver eventor goroutines This fixes few cases where driver eventor goroutines are leaked during normal operations, but especially so in tests. This change makes few modifications: First, it switches drivers to use `Context`s to manage shutdown events. Previously, it relied on callers invoking `.Shutdown()` function that is specific to internal drivers only and require casting. Using `Contexts` provide a consistent idiomatic way to manage lifecycle for both internal and external drivers. Also, I discovered few places where we don't clean up a temporary driver instance in the plugin catalog code, where we dispense a driver to inspect and validate the schema config without properly cleaning it up.	2020-05-26 11:04:04 -04:00
Tim Gross	ba11aef5d9	csi: skip unit tests on unsupported platforms (#8033 ) Some of the unit tests for CSI require platform-specific APIs that aren't available on macOS. We can safely skip these tests.	2020-05-21 13:56:50 -04:00
Tim Gross	aa8927abb4	volumes: return better error messages for unsupported task drivers (#8030 ) When an allocation runs for a task driver that can't support volume mounts, the mounting will fail in a way that can be hard to understand. With host volumes this usually means failing silently, whereas with CSI the operator gets inscrutable internals exposed in the `nomad alloc status`. This changeset adds a MountConfig field to the task driver Capabilities response. We validate this when the `csi_hook` or `volume_hook` fires and return a user-friendly error. Note that we don't currently have a way to get driver capabilities up to the server, except through attributes. Validating this when the user initially submits the jobspec would be even better than what we're doing here (and could be useful for all our other capabilities), but that's out of scope for this changeset. Also note that the MountConfig enum starts with "supports all" in order to support community plugins in a backwards compatible way, rather than cutting them off from volume mounting unexpectedly.	2020-05-21 09:18:02 -04:00
Tim Gross	065fa7af8b	stats_hook: log normal shutdown condition as debug, not error (#8028 ) The `stats_hook` writes an Error log every time an allocation becomes terminal. This is a normal condition, not an error. A real error condition like a failure to collect the stats is logged later. It just creates log noise, and this is a particularly bad operator experience for heavy batch workloads.	2020-05-20 10:28:30 -04:00
Mahmood Ali	751f337f1c	Update hcl2 vendoring The hcl2 library has moved from http://github.com/hashicorp/hcl2 to https://github.com/hashicorp/hcl/tree/hcl2. This updates Nomad's vendoring to start using hcl2 library. Also updates some related libraries (e.g. `github.com/zclconf/go-cty/cty` and `github.com/apparentlymart/go-textseg`).	2020-05-19 15:00:03 -04:00
Tim Gross	6a463dc13a	csi: use a blocking initial connection with timeout (#7965 ) The plugin supervisor lazily connects to plugins, but this means we only get "Unavailable" back from the gRPC call in cases where the plugin can never be reached (for example, if the Nomad client has the wrong permissions for the socket). This changeset improves the operator experience by switching to a blocking `DialWithContext`. It eagerly connects so that we can validate the connection is real and get a "failed to open" error in case where Nomad can't establish the initial connection.	2020-05-15 08:17:11 -04:00
Tim Gross	2082cf738a	csi: support for VolumeContext and VolumeParameters (#7957 ) The MVP for CSI in the 0.11.0 release of Nomad did not include support for opaque volume parameters or volume context. This changeset adds support for both. This also moves args for ControllerValidateCapabilities into a struct. The CSI plugin `ControllerValidateCapabilities` struct that we turn into a CSI RPC is accumulating arguments, so moving it into a request struct will reduce the churn of this internal API, make the plugin code more readable, and make this method consistent with the other plugin methods in that package.	2020-05-15 08:16:01 -04:00

1 2 3 4 5 ...

4308 commits