Use MemoryMaxMB as the LinuxResources memory limit. This is intended to ease
driver implementation and adoption of the feature: drivers that use
`resources.LinuxResources.MemoryLimitBytes` don't need to be updated.
Drivers that use NomadResources will need to be updated to track the new
field value. Given that tasks aren't guaranteed to use up the excess
memory limit, this is a reasonable compromise.
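A rough sketch of that split, using stand-in types rather than Nomad's actual driver and resource structs:

```go
package driver

// Stand-in types for illustration; the real fields live in Nomad's driver
// and resource structs.
type LinuxResources struct{ MemoryLimitBytes int64 }
type MemoryResources struct{ MemoryMB, MemoryMaxMB int64 }

// memoryLimitBytes picks the enforceable memory limit for a task. Drivers
// that already honor LinuxResources.MemoryLimitBytes get the MemoryMaxMB
// derived limit for free; drivers reading the Nomad-level resources have to
// consult the new field themselves.
func memoryLimitBytes(linux *LinuxResources, mem MemoryResources) int64 {
	if linux != nil && linux.MemoryLimitBytes > 0 {
		return linux.MemoryLimitBytes
	}
	if mem.MemoryMaxMB > 0 {
		return mem.MemoryMaxMB * 1024 * 1024
	}
	return mem.MemoryMB * 1024 * 1024
}
```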
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.
Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
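A minimal sketch of the client-side claim logic, with illustrative names rather than Nomad's actual ones:

```go
package volumes

import "fmt"

// volumeSource returns the source ID the client should claim. With PerAlloc
// set, the allocation index is appended as a suffix: "data" -> "data[0]".
func volumeSource(source string, perAlloc bool, allocIndex int) string {
	if !perAlloc {
		return source
	}
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}
```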
Allow for readiness-type checks by configuring Nomad to ignore warnings
or errors reported by a service check. This allows the deployment to
progress while Consul handles introducing the service into a
resource pool once the check passes.
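A sketch of the behavior this enables, using stand-in names for the configuration values (Nomad's actual attribute and values may differ):

```go
package checks

// Stand-in constants for the check's configured update behavior.
const (
	OnUpdateRequireHealthy = "require_healthy"
	OnUpdateIgnoreWarnings = "ignore_warnings"
	OnUpdateIgnore         = "ignore"
)

// checkHealthy decides whether a Consul check status should block a
// deployment, given the check's configured update behavior.
func checkHealthy(status, onUpdate string) bool {
	switch onUpdate {
	case OnUpdateIgnore:
		return true // readiness only; never blocks the deployment
	case OnUpdateIgnoreWarnings:
		return status == "passing" || status == "warning"
	default: // require healthy
		return status == "passing"
	}
}
```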
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.
https://www.consul.io/docs/connect/gateways/terminating-gateway
These gateways are declared as part of a task group level service
definition within the connect stanza.
service {
  connect {
    gateway {
      proxy {
        // envoy proxy configuration
      }
      terminating {
        // terminating-gateway configuration entry
      }
    }
  }
}
Currently Envoy is the only supported gateway implementation in
Consul. The gateway task can be customized by configuring the
connect.sidecar_task block.
When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.
Gateways require Consul 1.8.0+, checked by a node constraint.
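For reference, the write Nomad performs on job submission is roughly equivalent to setting a terminating-gateway config entry through the Consul API; the sketch below uses made-up service names and is not Nomad's actual code path:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Roughly what Nomad derives from the connect.gateway.terminating block.
	entry := &api.TerminatingGatewayConfigEntry{
		Kind: api.TerminatingGateway,
		Name: "api-gateway", // the gateway service's name (example)
		Services: []api.LinkedService{
			{Name: "legacy-billing"}, // a service outside the mesh (example)
		},
	}

	// Config entries are global in scope, so the last writer wins across clusters.
	if _, _, err := client.ConfigEntries().Set(entry, nil); err != nil {
		log.Fatal(err)
	}
}
```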
Closes #9445
Most allocation hooks don't need to know when a single task within the
allocation is restarted. The check watcher for group services triggers the
alloc runner to restart all tasks, but the alloc runner's `Restart` method
doesn't trigger any of the alloc hooks, including the group service hook. The
result is that after the first time a check triggers a restart, we'll never
restart the tasks of an allocation again.
This commit adds a `RunnerTaskRestartHook` interface so that alloc runner
hooks can act if a task within the alloc is restarted. The only implementation
is in the group service hook, which will force a re-registration of the
alloc's services and fix check restarts.
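The interface is small; its approximate shape (the real definition lives in client/allocrunner/interfaces, and the exact method name may differ):

```go
package interfaces

// RunnerHook is the base allocation runner hook interface (stub for this
// sketch; the real one lives in client/allocrunner/interfaces).
type RunnerHook interface {
	Name() string
}

// RunnerTaskRestartHook lets an allocation runner hook react when any task
// in the allocation is restarted.
type RunnerTaskRestartHook interface {
	RunnerHook

	// PreTaskRestart runs before a task within the allocation restarts,
	// giving hooks like the group service hook a chance to re-register the
	// allocation's services and checks.
	PreTaskRestart() error
}
```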
Connect ingress gateway services were being registered into Consul without
an explicit deterministic service ID. Consul would generate one automatically,
but then Nomad would have no way to register a second gateway on the same agent
as it would not supply 'proxy-id' during envoy bootstrap.
Set the ServiceID for gateways, and supply 'proxy-id' when doing envoy bootstrap.
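A sketch of both pieces, with a hypothetical ID format (not necessarily the one Nomad uses): register the gateway service with an explicit deterministic ID, then hand the same ID to envoy bootstrap via -proxy-id.

```go
package gateway

import (
	"fmt"
	"os/exec"
)

// gatewayServiceID builds a deterministic per-allocation service ID so two
// gateways on the same Consul agent cannot collide. The format here is
// illustrative only.
func gatewayServiceID(allocID, serviceName string) string {
	return fmt.Sprintf("_nomad-task-%s-group-%s", allocID, serviceName)
}

// bootstrapCmd shows the same ID being supplied to envoy bootstrap.
func bootstrapCmd(serviceID string) *exec.Cmd {
	return exec.Command("consul", "connect", "envoy",
		"-gateway=ingress",
		"-proxy-id="+serviceID,
		"-bootstrap",
	)
}
```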
Fixes #9834
* Throw away result of multierror.Append
When given a *multierror.Error, it is mutated in place, so the return
value is not needed (see the sketch after this list).
* Simplify MergeMultierrorWarnings, use StringBuilder
* Hash.Write() never returns an error
* Remove error that was always nil
* Remove error from Resources.Add signature
When this was originally written it could return an error, but that was
refactored away, and callers of it as of today never handle the error.
* Throw away results of io.Copy during Bridge
* Handle errors when computing node class in test
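A small example of the first bullet above: when the first argument is already a non-nil *multierror.Error, Append writes into it and returns the same pointer, so the result can be discarded.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/go-multierror"
)

func main() {
	mErr := &multierror.Error{}

	// Append mutates mErr.Errors and returns the same pointer, so the result
	// does not need to be captured when mErr is already non-nil.
	_ = multierror.Append(mErr, fmt.Errorf("first failure"))
	_ = multierror.Append(mErr, fmt.Errorf("second failure"))

	fmt.Println(len(mErr.Errors)) // 2
	fmt.Println(mErr.ErrorOrNil())
}
```

(If the receiver could be a nil *multierror.Error, the return value should still be kept, since Append allocates a new value in that case.)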
In 492d62d we prevented poststop tasks from contributing to allocation health
status, which fixed a bug where poststop tasks would prevent a deployment from
ever being marked successful. The patch introduced a regression where prestart
tasks that complete are causing the allocation to be marked unhealthy. This
changeset restores the previous behavior for prestart tasks.
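A sketch of the restored behavior, using stand-in types rather than the health tracker's real ones: tasks whose lifecycle allows them to exit should not count against allocation health when they complete.

```go
package health

// Stand-in lifecycle description for illustration.
type Lifecycle struct {
	Hook    string // "prestart", "poststart", or "poststop"
	Sidecar bool
}

// mayExitCleanly reports whether a task is expected to finish on its own and
// therefore should not mark the allocation unhealthy when it completes.
func mayExitCleanly(lc *Lifecycle) bool {
	if lc == nil {
		return false // main tasks must keep running
	}
	if lc.Hook == "poststop" {
		return true // runs after the main tasks; ignored for health
	}
	// Non-sidecar prestart tasks are expected to run to completion before
	// the main tasks start.
	return lc.Hook == "prestart" && !lc.Sidecar
}
```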
* investigating where to ignore poststop task in alloc health tracker
* ignore poststop when setting latest start time for allocation
* clean up logic
* lifecycle: isolate mocks for poststop deployment test
* lifecycle: update comments in tracker
Co-authored-by: Jasmine Dahilig <jasmine@dahilig.com>
When a client restarts, the network_hook's prerun will call
`CreateNetwork`. Drivers that don't implement their own network manager will
fall back to the default network manager, which doesn't safely handle the case
where the network namespace is being recreated. This results in an error and
in the task being restarted for `exec` tasks with `network` blocks (this also
impacts the community `containerd` driver and probably other community task drivers).
If we get an error when attempting to create the namespace and that error is
because the file already exists and is locked by its process, then we'll
return a `nil` error with the `created` flag set to false, just as we do with
the `docker` driver.
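A simplified sketch of that tolerated-error path (the create function and the error check are stand-ins for the default network manager's internals):

```go
package network

import (
	"errors"
	"os"
)

// createNetworkNamespace is a stand-in for the default network manager's
// namespace creation. The second return value reports whether a new
// namespace was actually created.
func createNetworkNamespace(path string, create func(string) error) (bool, error) {
	if err := create(path); err != nil {
		// The namespace survived a client restart: treat "already exists"
		// as success, but report created=false so callers don't tear down
		// state that is still in use, mirroring the docker driver.
		if errors.Is(err, os.ErrExist) {
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```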
When upgrading from Nomad v0.12.x to v1.0.x, the Nomad client will panic on
startup if the node is running Connect-enabled jobs. This is caused by
a missing piece of plumbing of the Consul Proxies API interface during the
client restore process.
Fixes #9738
This PR deflakes TestTaskRunner_StatsHook_Periodic tests and adds backoff when the driver closes the channel.
TestTaskRunner_StatsHook_Periodic is currently the most flaky test - failing ~4% of the time (20 out of 486 workflows). A sample failure: https://app.circleci.com/pipelines/github/hashicorp/nomad/14028/workflows/957b674f-cbcc-4228-96d9-1094fdee5b9c/jobs/128563 .
This change has two components:
First, it updates the StatsHook so that it backs off when the stats channel is closed. In the context of the test, where the mock driver emits a single stats update and closes the channel, the test may make tens of thousands of updates during the period. In a real context, if a driver doesn't implement the stats handler properly, or when a task finishes, we may generate far too many stats queries in a tight loop. Here, the backoff reduces these queries. I've added a failing test that shows 154,458 stats updates within 500ms in https://app.circleci.com/pipelines/github/hashicorp/nomad/14092/workflows/50672445-392d-4661-b19e-e3561ed32746/jobs/129423 .
Second, the test ignores the first stats update after a task exit. Due to the asynchronicity of updates and channel/context use, it's possible that an update is enqueued while the test marks the task as exited, resulting in a spurious update.
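The backoff itself is straightforward; a simplified sketch of the collection loop (function and type names here are illustrative, not the hook's actual ones):

```go
package stats

import (
	"context"
	"time"
)

// TaskResourceUsage stands in for the driver's stats payload.
type TaskResourceUsage struct{}

// collect drains the driver's stats channel, and backs off whenever the
// channel is closed instead of immediately re-requesting stats in a tight
// loop (as happens when a task has exited or a driver has no stats handler).
func collect(ctx context.Context,
	open func() (<-chan *TaskResourceUsage, error),
	handle func(*TaskResourceUsage)) {

	const backoff = time.Second
	for ctx.Err() == nil {
		ch, err := open()
		if err == nil {
			for ru := range ch {
				handle(ru)
			}
		}
		// Channel closed (or stats unavailable): wait before retrying.
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
	}
}
```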
Previously, Nomad would optimize out the services task runner
hook for tasks which were initially submitted with no services
defined. This caused a problem when the job was later updated to
include service(s) on that task: nothing would happen, because the hook
was not present to handle the service registration in `Update()`.
Instead, always enable the services hook. The group services
alloc runner hook is already always enabled.
Fixes #9707
The client allocation GC API returns a misleading error message when the
allocation exists but is not yet eligible for GC. Make this clear in the error
response.
Note in the docs that the allocation will still appear in server responses.
When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the `change_mode`.
This changeset uses the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template `change_mode` if the template has changed, which will only be the
case for dynamic secrets.
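A sketch of the detection, with simplified names:

```go
package template

// Stand-in request carrying what the hook needs for this decision.
type prestartRequest struct {
	HasTaskHandle bool // a previously running task was restored
}

// shouldRunChangeMode reports whether a rendered template change ought to
// trigger change_mode. On first start the initial render is expected and
// must not fire it; after a restore, a re-rendered template (e.g. a new
// dynamic secret lease) should.
func shouldRunChangeMode(req prestartRequest, templateChanged bool) bool {
	return req.HasTaskHandle && templateChanged
}
```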
When we iterate over the interfaces returned from CNI setup, we filter for one
with the `Sandbox` field set. Ensure that we still return an available
interface even if none of the interfaces has that field set.
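A sketch of the selection with a stand-in interface type (the real code walks the CNI result returned by libcni): prefer the sandbox interface, but fall back to whatever is available.

```go
package cni

// Interface stands in for the interface entries in a CNI result.
type Interface struct {
	Name    string
	Sandbox string // set for the interface inside the network namespace
}

// pickInterface prefers the sandbox interface but still returns something
// usable when no interface has Sandbox set (as some plugins report).
func pickInterface(ifaces []*Interface) *Interface {
	var fallback *Interface
	for _, iface := range ifaces {
		if iface == nil {
			continue
		}
		if iface.Sandbox != "" {
			return iface
		}
		if fallback == nil {
			fallback = iface
		}
	}
	return fallback
}
```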
CNI network configuration is currently only supported on Linux.
For now, add the linux build tag so that the deadcode linter does
not trip over unused CNI stuff on macOS.
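Concretely, the CNI-specific files get a build constraint like the following at the top of the file (newer Go versions also accept the `//go:build linux` form); the package name here is illustrative:

```go
// +build linux

package cni
```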
Nomad v1.0.0 introduced a regression where the client configurations
for `connect.sidecar_image` and `connect.gateway_image` would be
ignored despite being set. This PR restores that functionality.
There was a missing layer of interpolation that needs to occur for
these parameters. Since Nomad 1.0 now supports dynamic envoy versioning
through the ${NOMAD_envoy_version} pseudo variable, we first need
to interpolate
${connect.sidecar_image} => envoyproxy/envoy:v${NOMAD_envoy_version}
then use Consul at runtime to resolve to a real image, e.g.
envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.16.0
Of course, if the version of Consul is too old to provide an envoy
version preference, we then need to fall back to the old
version of envoy that we used before.
envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09
Beyond that, we also need to continue to support jobs that set the
sidecar task themselves, e.g.
sidecar_task { config { image = "custom/envoy" } }
which itself could include the pseudo envoy version variable.
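A simplified sketch of the second interpolation step (helper names and the plain-version fallback are illustrative; the real fallback pins a specific image digest, as shown above):

```go
package connect

import "strings"

const versionVar = "${NOMAD_envoy_version}"

// Fallback used when Consul is too old to report a preferred envoy version.
const fallbackEnvoyVersion = "1.11.2"

// resolveEnvoyImage substitutes the pseudo variable in the configured image
// (from client config or sidecar_task) with the version Consul advertises at
// runtime, falling back when Consul reports nothing.
func resolveEnvoyImage(configuredImage, consulEnvoyVersion string) string {
	version := consulEnvoyVersion
	if version == "" {
		version = fallbackEnvoyVersion
	}
	return strings.ReplaceAll(configuredImage, versionVar, version)
}
```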
Previously, Nomad would fail to startup if the CPU fingerprinter could
not detect the cpu total compute (i.e. cores * MHz). This is common on
some EC2 instance types (graviton class), where the env_aws fingerprinter
will override the detected CPU performance with a more accurate value
anyway.
Instead of crashing on startup, have Nomad use a low default for available
cpu performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad
to get past the useless cpu fingerprinting on those EC2 instances. The
crashing error message is now a log statement suggesting that
cpu_total_compute be set in the client config.
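A sketch of the fallback, with illustrative constant and function names:

```go
package fingerprint

import "log"

// defaultCPUTicks approximates 1 core at 1 GHz, enough to get past
// fingerprinting on instances where total compute cannot be detected.
const defaultCPUTicks = 1000

// totalCompute returns the detected cpu total compute, or a low default with
// a hint to configure it explicitly, instead of failing startup.
func totalCompute(detected int, logger *log.Logger) int {
	if detected > 0 {
		return detected
	}
	logger.Printf("failed to detect CPU total compute; defaulting to %d ticks, set cpu_total_compute in the client config to override", defaultCPUTicks)
	return defaultCPUTicks
}
```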
Fixes #7989