open-nomad

Commit Graph

Author	SHA1	Message	Date
Michael Schurter	35d65c7c7e	Dynamic Node Metadata (#15844 ) Fixes #14617 Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted. Includes new UI for editing metadata as well. --------- Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>	2023-02-07 14:42:25 -08:00
Charlie Voiselle	31a289891d	Add sprig for command templates (#9053 ) Adds the sprig functions to the template funcmap prepended with `sprig_` to match the behavior in consul-template	2023-02-07 14:07:20 -05:00
Seth Hoenig	590ae08752	main: remove deprecated uses of rand.Seed (#16074 ) * main: remove deprecated uses of rand.Seed go1.20 deprecates rand.Seed, and seeds the rand package automatically. Remove cases where we seed the random package, and cleanup the one case where we intentionally create a known random source. * cl: update cl * mod: update go mod	2023-02-07 09:19:38 -06:00
Tim Gross	8a7d6b0cde	cli: remove deprecated `keyring` and `keygen` commands (#16068 ) These commands were marked as deprecated in 1.4.0 with intent to remove in 1.5.0. Remove them and clean up the docs.	2023-02-07 09:49:52 -05:00
Dao Thanh Tung	ae720fe28d	Add `-json` and `-t` flag for `nomad acl token create` command (#16055 ) Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>	2023-02-07 12:05:41 +01:00
Seth Hoenig	68894bdc62	docker: disable driver when running as non-root on cgroups v2 hosts (#16063 ) * docker: disable driver when running as non-root on cgroups v2 hosts This PR modifies the docker driver to behave like exec when being run as a non-root user on a host machine with cgroups v2 enabled. Because of how cpu resources are managed by the Nomad client, the nomad agent must be run as root to manage docker-created cgroups. * cl: update cl	2023-02-06 14:09:19 -06:00
Michael Schurter	0a496c845e	Task API via Unix Domain Socket (#15864 ) This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled. This PR contains the core feature and tests but followup work is required for the following TODO items: - Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink - Unit tests for auth middleware - Caching for auth middleware - Rate limiting on negative lookups for auth middleware --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-02-06 11:31:22 -08:00
Seth Hoenig	911700ffea	build: update to go1.20 (#16029 ) * build: update to go1.20 * build: use stringy go1.20 in circle yaml * tests: handle new x509 certificate error structure in go1.20 * cl: add cl entry	2023-02-03 08:14:53 -06:00
Phil Renaud	d3c351d2d2	Label for the Web UI (#16006 ) * Demoable state * Demo mirage color * Label as a block with foreground and background colours * Test mock updates * Go test updated * Documentation update for label support	2023-02-02 16:29:04 -05:00
Tim Gross	19a2c065f4	System and sysbatch jobs always have zero index (#16030 ) Service jobs should have unique allocation Names, derived from the Job.ID. System jobs do not have unique allocation Names because the index is intended to indicate the instance out of a desired count size. Because system jobs do not have an explicit count but the results are based on the targeted nodes, the index is less informative and this was intentionally omitted from the original design. Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for system/sysbatch jobs. Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs. System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So interpolation via `per_alloc` will not work as soon as there's more than one allocation placed. Validate against this on job submission.	2023-02-02 16:18:01 -05:00
Daniel Bennett	335f0a5371	docs: how to troubleshoot consul connect envoy (#15908 ) * largely a doc-ification of this commit message: d47678074bf8ae9ff2da3c91d0729bf03aee8446 this doesn't spell out all the possible failure modes, but should be a good starting point for folks. * connect: add doc link to envoy bootstrap error * add Unwrap() to RecoverableError mainly for easier testing	2023-02-02 14:20:26 -06:00
Charlie Voiselle	cc6f4719f1	Add option to expose workload token to task (#15755 ) Add `identity` jobspec block to expose workload identity tokens to tasks. --------- Co-authored-by: Anders <mail@anars.dk> Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2023-02-02 10:59:14 -08:00
Daniel Bennett	dc9c8d4e47	Change `job init` default to example`.nomad.hcl` and recommend in docs (#15997 ) recommend .nomad.hcl for job files instead of .nomad (without .hcl) * nomad job init -> example.nomad.hcl * update docs	2023-02-02 11:47:47 -06:00
Tim Gross	971a286ea3	cli: Fix a panic in `deployment status` when scheduling is slow (#16011 ) If a deployment fails, the `deployment status` command can get a nil deployment when it checks for a rollback deployment if there isn't one (or at least not one at the time of the query). Fix the panic.	2023-02-02 12:34:44 -05:00
Phil Renaud	3db9f11c37	[feat] Nomad Job Templates (#15746 ) * Extend variables under the nomad path prefix to allow for job-templates (#15570) * Extend variables under the nomad path prefix to allow for job-templates * Add job-templates to error message hinting * RadioCard component for Job Templates (#15582) * chore: add * test: component API * ui: component template * refact: remove bc naming collission * styles: remove SASS var causing conflicts * Disallow specific variable at nomad/job-templates (#15681) * Disallows variables at exactly nomad/job-templates * idiomatic refactor * Expanding nomad job init to accept a template flag (#15571) * Adding a string flag for templates on job init * data-down actions-up version of a custom template editor within variable * Dont force grid on job template editor * list-templates flag started * Correctly slice from end of path name * Pre-review cleanup * Variable form acceptance test for job template editing * Some review cleanup * List Job templates test * Example from template test * Using must.assertions instead of require etc * ui: add choose template button (#15596) * ui: add new routes * chore: update file directory * ui: add choose template button * test: button and page navigation * refact: update var name * ui: use `Button` component from `HDS` (#15607) * ui: integrate buttons * refact: remove helper * ui: remove icons on non-tertiary buttons * refact: update normalize method for key/value pairs (#15612) * `revert`: `onCancel` for `JobDefinition` The `onCancel` method isn't included in the component API for `JobEditor` and the primary cancel behavior exists outside of the component. With the exception of the `JobDefinition` page where we include this button in the top right of the component instead of next to the `Plan` button. 
* style: increase button size * style: keep lime green * ui: select template (#15613) * ui: deprecate unused component * ui: deprecate tests * ui: jobs.run.templates.index * ui: update logic to handle templates * refact: revert key/value changes * style: padding for cards + buttons * temp: fixtures for mirage testing * Revert "refact: revert key/value changes" This reverts commit 124e95d12140be38fc921f7e15243034092c4063. * ui: guard template for unsaved job * ui: handle reading template variable * Revert "refact: update normalize method for key/value pairs (#15612)" This reverts commit 6f5ffc9b610702aee7c47fbff742cc81f819ab74. * revert: remove test fixtures * revert: prettier problems * refact: test doesnt need filter expression * styling: button sizes and responsive cards * refact: remove route guarding * ui: update variable adapter * refact: remove model editing behavior * refact: model should query variables to populate editor * ui: clear qp on exit * refact: cleanup deprecated API * refact: query all namespaces * refact: deprecate action * ui: rely on collection * refact: patch deprecate transition API * refact: patch test to expect namespace qp * styling: padding, conditionals * ui: flashMessage on 404 * test: update for o(n+1) query * ui: create new job template (#15744) * refact: remove unused code * refact: add type safety * test: select template flow * test: add data-test attrs * chore: remove dead code * test: create new job flow * ui: add create button * ui: create job template * refact: no need for wildcard * refact: record instead of delete * styling: spacing * ui: add error handling and form validation to job create template (#15767) * ui: handle server side errors * ui: show error to prevent duplicate * refact: conditional namespace * ui: save as template flow (#15787) * bug: patches failing tests associated with `pretender` (#15812) * refact: update assertion * refact: test set-up * ui: job templates manager view (#15815) * ui: manager list view * 
test: edit flow * refact: deprecate column-helper * ui: template edit and delete flow (#15823) * ui: manager list view * refact: update title * refact: update permissions * ui: template edit page * bug: typo * refact: update toast messages * bug: clear selections on exit (#15827) * bug: clear controllers on exit * test: mirage config changes (#15828) * refact: deprecate column-helper * style: update z-index for HDS * Revert "style: update z-index for HDS" This reverts commit d3d87ceab6d083f7164941587448607838944fc1. * refact: update delete button * refact: edit redirect * refact: patch reactivity issues * styling: fixed width * refact: override defaults * styling: edit text causing overflow * styling: add inline text Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com> * bug: edit `text` to `template` Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com> Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com> * test: delete flow job templates (#15896) * refact: edit names * bug: set correct ref to store * chore: trim whitespace: * test: delete flow * bug: reactively update view (#15904) * Initialized default jobs (#15856) * Initialized default jobs * More jobs scaffolded * Better commenting on a couple example job specs * Adapter doing the work * fall back to epic config * Label format helper and custom serialization logic * Test updates to account for a never-empty state * Test suite uses settled and maintain RecordArray in adapter return * Updates to hello-world and variables example jobspecs * Parameterized job gets optional payload output * Formatting changes for param and service discovery job templates * Multi-group service discovery job * Basic test for default templates (#15965) * Basic test for default templates * Percy snapshot for manage page * Some late-breaking design changes * Some copy edits to the header paragraphs for job templates (#15967) * Added some init options for job templates (#15994) * Async method for populating default job templates 
from the variable adapter --------- Co-authored-by: Jai <41024828+ChaiWithJai@users.noreply.github.com>	2023-02-02 10:37:40 -05:00
Charlie Voiselle	4caac1a92f	client: Add option to enable hairpinMode on Nomad bridge (#15961 ) * Add `bridge_network_hairpin_mode` client config setting * Add node attribute: `nomad.bridge.hairpin_mode` * Changed format string to use `%q` to escape user provided data * Add test to validate template JSON for developer safety Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2023-02-02 10:12:15 -05:00
jmwilkinson	37834dffda	Allow wildcard datacenters to be specified in job file (#11170 ) Also allows for default value of `datacenters = ["*"]`	2023-02-02 09:57:45 -05:00
Luiz Aoqui	7c47b576cd	changelog: fix entries for #15522 and #15819 (#15998 )	2023-02-01 18:03:39 -05:00
Tim Gross	0abf0b948b	job parsing: fix panic when variable validation is missing condition (#16018 )	2023-02-01 16:41:03 -05:00
Tristan Pemble	5440965260	fix(#13844 ): canonicalize job to avoid nil pointer dereference (#13845 )	2023-02-01 16:01:28 -05:00
Seth Hoenig	ca7ead191e	consul: restore consul token when reverting a job (#15996 ) * consul: reset consul token on job during registration of a reversion * e2e: add test for reverting a job with a consul service * cl: fixup cl entry	2023-02-01 14:02:45 -06:00
James Rasell	9e8325d63c	acl: fix a bug in token creation when parsing expiration TTLs. (#15999 ) The ACL token decoding was not correctly handling time duration syntax such as "1h" which forced people to use the nanosecond representation via the HTTP API. The change adds an unmarshal function which allows this syntax to be used, along with other styles correctly.	2023-02-01 17:43:41 +01:00
James Rasell	67acfd9f6b	acl: return 400 not 404 code when creating an invalid policy. (#16000 )	2023-02-01 17:40:15 +01:00
Mike Nomitch	80848b202e	Increases max variable size to 64KiB from 16KiB (#15983 )	2023-01-31 13:32:36 -05:00
stswidwinski	16eefbbf4d	GC: ensure no leakage of evaluations for batch jobs. (#15097 ) Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory. Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.	2023-01-31 13:32:14 -05:00
Seth Hoenig	139f2c0b0f	docker: set force=true on remove image to handle images referenced by multiple tags (#15962 ) * docker: set force=true on remove image to handle images referenced by multiple tags This PR changes our call of docker client RemoveImage() to RemoveImageExtended with the Force=true option set. This fixes a bug where an image referenced by more than one tag could never be garbage collected by Nomad. The Force option only applies to stopped containers; it does not affect running workloads. * docker: add note about image_delay and multiple tags	2023-01-31 07:53:18 -06:00
Yorick Gersie	d94f22bee2	Ensure infra_image gets proper label used for reconciliation (#15898 ) * Ensure infra_image gets proper label used for reconciliation Currently infra containers are not cleaned up as part of the dangling container cleanup routine. The reason is that Nomad checks if a container is a Nomad owned container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label. Ensure we set this label on the infra container as well. * fix unit test * changelog: add entry --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-01-30 09:46:45 -06:00
Jorge Marey	d1c9aad762	Rename fields on proxyConfig (#15541 ) * Change api Fields for expose and paths * Add changelog entry * changelog: add deprecation notes about connect fields * api: minor style tweaks --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-01-30 09:31:16 -06:00
dependabot[bot]	bb79824a20	build(deps): bump github.com/docker/docker from 20.10.21+incompatible to 20.10.23+incompatible (#15848 ) * build(deps): bump github.com/docker/docker Bumps [github.com/docker/docker](https://github.com/docker/docker) from 20.10.21+incompatible to 20.10.23+incompatible. - [Release notes](https://github.com/docker/docker/releases) - [Commits](https://github.com/docker/docker/compare/v20.10.21...v20.10.23) --- updated-dependencies: - dependency-name: github.com/docker/docker dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * changelog: add entry for docker/docker --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-01-30 09:10:06 -06:00
舍我其谁	3abb453bd0	volume: Add the missing option propagation_mode (#15626 )	2023-01-30 09:32:07 -05:00
Seth Hoenig	074b76e3bf	consul: check for acceptable service identity on consul tokens (#15928 ) When registering a job with a service and 'consul.allow_unauthenticated=false', we scan the given Consul token for an acceptable policy or role with an acceptable policy, but did not scan for an acceptable service identity (which is backed by an acceptable virtual policy). This PR updates our consul token validation to also accept a matching service identity when registering a service into Consul. Fixes #15902	2023-01-27 18:15:51 -06:00
Seth Hoenig	0fac4e19b3	client: always run alloc cleanup hooks on final update (#15855 ) * client: run alloc pre-kill hooks on last pass despite no live tasks This PR fixes a bug where alloc pre-kill hooks were not run in the edge case where there are no live tasks remaining, but it is also the final update to process for the (terminal) allocation. We need to run cleanup hooks here, otherwise they will not run until the allocation gets garbage collected (i.e. via Destroy()), possibly at a distant time in the future. Fixes #15477 * client: do not run ar cleanup hooks if client is shutting down	2023-01-27 09:59:31 -06:00
Luiz Aoqui	de87cdc816	template: restore driver handle on update (#15915 ) When the template hook Update() method is called it may recreate the template manager if the Nomad or Vault token has been updated. This left the new template manager without a driver handle, because the handle was only being set in the Poststart hook, which is not called for in-place updates.	2023-01-27 10:55:59 -05:00
Luiz Aoqui	09fc054c82	ui: fix alloc memory stats to match CLI output (#15909 )	2023-01-26 17:08:13 -05:00
Luiz Aoqui	bb323ef3de	ui: fix navigation for namespaced jobs in search and job version (#15906 )	2023-01-26 16:03:07 -05:00
Seth Hoenig	7375fd40fc	nsd: block on removal of services (#15862 ) * nsd: block on removal of services This PR uses a WaitGroup to ensure workload removals are complete before returning from ServiceRegistrationHandler.RemoveWorkload of the nomad service provider. The de-registration of individual services still occurs asynchronously, but we must block on the parent removal call so that we do not race with further operations on the same set of services - e.g. in the case of a task restart where we de-register and then re-register the services in quick succession. Fixes #15032 * nsd: add e2e test for initial failing check and restart	2023-01-26 08:17:57 -06:00
Yorick Gersie	2a5c423ae0	Allow per_alloc to be used with host volumes (#15780 ) Disallowing per_alloc for host volumes in some cases makes life of a nomad user much harder. When we rely on the NOMAD_ALLOC_INDEX for any configuration that needs to be re-used across restarts we need to make sure allocation placement is consistent. With CSI volumes we can use the `per_alloc` feature but for some reason this is explicitly disabled for host volumes. Ensure host volumes understand the concept of per_alloc	2023-01-26 09:14:47 -05:00
Tim Gross	6677a103c2	metrics: measure rate of RPC requests that serve API (#15876 ) This changeset configures the RPC rate metrics that were added in #15515 to all the RPCs that support authenticated HTTP API requests. These endpoints already configured with pre-forwarding authentication in #15870, and a handful of others were done already as part of the proof-of-concept work. So this changeset is entirely copy-and-pasting one method call into a whole mess of handlers. Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining set of RPCs that have no API consumers or aren't authenticated, in smaller chunks that can be more thoughtfully reviewed.	2023-01-25 16:37:24 -05:00
Luiz Aoqui	3479e2231f	core: enforce strict steps for clients reconnect (#15808 ) When a Nomad client that is running an allocation with `max_client_disconnect` set misses a heartbeat the Nomad server will update its status to `disconnected`. Upon reconnecting, the client will make three main RPC calls: - `Node.UpdateStatus` is used to set the client status to `ready`. - `Node.UpdateAlloc` is used to update the client-side information about allocations, such as their `ClientStatus`, task states etc. - `Node.Register` is used to upsert the entire node information, including its status. These calls are made concurrently and are also running in parallel with the scheduler. Depending on the order they run the scheduler may end up with incomplete data when reconciling allocations. For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources. When this client comes back the order of events may be: 1. Client calls `Node.UpdateStatus` and is now `ready`. 2. Scheduler reconciles allocations and places the replacement alloc to the client. The client is now assigned two allocations: the original alloc that is still `unknown` and the replacement that is `pending`. 3. Client calls `Node.UpdateAlloc` and updates the original alloc to `running`. 4. Scheduler notices too many allocs and stops the replacement. This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled. To avoid problems like this clients must update _all_ of its relevant information before they can be considered `ready` and available for scheduling. To achieve this goal the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting: - `Node.Register` does not set the client status anymore. 
- `Node.UpdateStatus` sets the reconnecting client to the `initializing` status until it successfully calls `Node.UpdateAlloc`. These changes are done server-side to avoid the need for additional coordination between clients and servers. Clients are kept oblivious of these changes and will keep making these calls as they normally would. The verification of whether allocations have been updated is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.	2023-01-25 15:53:59 -05:00
Tim Gross	f3f64af821	WI: allow workloads to use RPCs associated with HTTP API (#15870 ) This changeset allows Workload Identities to authenticate to all the RPCs that support HTTP API endpoints, for use with PR #15864. * Extends the work done for pre-forwarding authentication to all RPCs that support a HTTP API endpoint. * Consolidates the auth helpers used by the CSI, Service Registration, and Node endpoints that are currently used to support both tokens and client secrets. Intentionally excluded from this changeset: * The Variables endpoint still has custom handling because of the implicit policies. Ideally we'll figure out an efficient way to resolve those into real policies and then we can get rid of that custom handling. * The RPCs that don't currently support auth tokens (i.e. those that don't support HTTP endpoints) have not been updated with the new pre-forwarding auth We'll be doing this under a separate PR to support RPC rate metrics.	2023-01-25 14:33:06 -05:00
Nick Wales	825af1f62a	docker: add option for Windows isolation modes (#15819 )	2023-01-24 16:31:48 -05:00
Karl Johann Schubert	b773a1b77f	client: add disk_total_mb and disk_free_mb config options (#15852 )	2023-01-24 09:14:22 -05:00
Michael Schurter	92c7d96e0a	Add INFO task event log line and make logmon less noisy (#15842 ) * client: log task events at INFO level Fixes #15840 Example INFO level client logs with this enabled: ``` [INFO] client: node registration complete [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Received msg="Task received by client" failed=false [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false [WARN] client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stdout.fifo @module=logmon timestamp=2023-01-20T11:19:34.275-0800 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy @module=logmon path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stderr.fifo timestamp=2023-01-20T11:19:34.275-0800 [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}" [WARN] client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Started msg="Task started by client" failed=false [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killing msg="Sent interrupt. 
Waiting 5s before force killing" failed=false [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy path=/home/schmichael/go/bin/nomad pid=27668 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Terminated msg="Exit Code: 130, Signal: 2" failed=false [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killed msg="Task successfully killed" failed=false [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/home/schmichael/go/bin/nomad pid=27653 [INFO] client.gc: marking allocation for GC: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 ``` So task events will approximately double the number of per-task log lines, but I think they add a lot of value. * client: drop logmon 'opening' from debug->info Cannot imagine why users care and removes 2 log lines per task invocation. 
``` [INFO] client: node registration complete [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Received msg="Task received by client" failed=false [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false <<< 2 "opening fifo" lines elided here >>> [WARN] client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}" [WARN] client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 driver=raw_exec task_name=sleepy [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Started msg="Task started by client" failed=false ``` * docs: add changelog for #15842	2023-01-20 14:35:00 -08:00
Tim Gross	a51149736d	Rename `nomad.broker.total_blocked` metric (#15835 ) This changeset fixes a long-standing point of confusion in metrics emitted by the eval broker. The eval broker has a queue of "blocked" evals that are waiting for an in-flight ("unacked") eval of the same job to be completed. But this "blocked" state is not the same as the `blocked` status that we write to raft and expose in the Nomad API to end users. There's a second metric `nomad.blocked_eval.total_blocked` that refers to evaluations in that state. This has caused ongoing confusion in major customer incidents and even in our own documentation! (Fixed in this PR.) There's little functional change in this PR aside from the name of the metric emitted, but there's a bit of refactoring to clean up the names in `eval_broker.go` so that there aren't name collisions and multiple names for the same state. Changes included are: * Everything that was previously called "pending" referred to entities that were associated with the "ready" metric. These are all now called "ready" to match the metric. * Everything named "blocked" in `eval_broker.go` is now named "pending", except for a couple of comments that actually refer to blocked RPCs. * Added a note to the upgrade guide docs for 1.5.0. * Fixed the scheduling performance metrics docs because the description for `nomad.broker.total_blocked` was actually the description for `nomad.blocked_eval.total_blocked`.	2023-01-20 14:23:56 -05:00
Charlie Voiselle	5ea1d8a970	Add raft snapshot configuration options (#15522 ) * Add config elements * Wire in snapshot configuration to raft * Add hot reload of raft config * Add documentation for new raft settings * Add changelog	2023-01-20 14:21:51 -05:00
Seth Hoenig	d2d8ebbeba	consul: correctly interpret missing consul checks as unhealthy (#15822 ) * consul: correctly understand missing consul checks as unhealthy This PR fixes a bug where Nomad assumed any registered Checks would exist in the service registration coming back from Consul. In some cases, Consul may be slow in processing the check registration, and the response object would not contain checks. Nomad would then scan the empty response looking for Checks with failing health status, find none, and then mark a task/alloc as healthy. In reality, we must always use Nomad's view of what checks should exist as the source of truth, and compare that with the response Consul gives us, making sure they match, before scanning the Consul response for failing check statuses. Fixes #15536 * consul: minor CR refactor using maps not sets * consul: observe transition from healthy to unhealthy checks * consul: spell healthy correctly	2023-01-19 14:01:12 -06:00
James Rasell	94aba987c6	changelog: add feature entry for SSO OIDC (#15821 )	2023-01-19 16:48:04 +01:00
Dao Thanh Tung	e2ae6d62e1	fix bug in nomad fmt -check does not return error code (#15797 )	2023-01-17 09:15:34 -05:00
Benjamin Buzbee	13cc30ebeb	Return buffered text from log endpoint if decoding fails (#15558 ) To see why I think this is a good change, let's look at why I am making it. My disk was full, which means GC was happening aggressively. So by the time I called the logging endpoint from the SDK, the logs were GC'd. The error I was getting before was: ``` invalid character 'i' in literal false (expecting 'l') ``` Now the error I get is: ``` failed to decode log endpoint response as JSON: "failed to list entries: open /tmp/nomad.data.4219353875/alloc/f11fee50-2b66-a7a2-d3ec-8442cb3d557a/alloc/logs: no such file or directory" ``` Still not super descriptive but much more debuggable.	2023-01-16 10:39:56 +01:00
Phil Renaud	d588aabca6	[ui] Fixes logger height issue when sidebar has events (#15759 ) * Fixes logger height issue when sidebar has events * Much simpler grid method for height calc	2023-01-13 12:16:02 -05:00
Seth Hoenig	8cd77c14a2	env/aws: update ec2 cpu info data (#15770 )	2023-01-13 09:58:23 -06:00
Seth Hoenig	a8d40ce26b	build: update to go 1.19.5 (#15769 )	2023-01-13 09:57:32 -06:00
dependabot[bot]	094caaabdf	build(deps): bump github.com/containerd/containerd from 1.6.6 to 1.6.12 (#15726 ) * build(deps): bump github.com/containerd/containerd from 1.6.6 to 1.6.12 Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.6.6 to 1.6.12. - [Release notes](https://github.com/containerd/containerd/releases) - [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md) - [Commits](https://github.com/containerd/containerd/compare/v1.6.6...v1.6.12) --- updated-dependencies: - dependency-name: github.com/containerd/containerd dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> * cl: add cl for containerd/containerd Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-01-13 09:22:41 -06:00
Seth Hoenig	fe7795ce16	consul/connect: support for proxy upstreams opaque config (#15761 ) This PR adds support for configuring `proxy.upstreams[].config` for Consul Connect upstreams. This is an opaque config value to Nomad - the data is passed directly to Consul and is unknown to Nomad.	2023-01-12 08:20:54 -06:00
Anthony Davis	1c32471805	Fix rejoin_after_leave behavior (#15552 )	2023-01-11 16:39:24 -05:00
Daniel Bennett	7d1059b5ae	connect: ingress gateway validation for http hosts and wildcards (#15749 ) * connect: fix non-"tcp" ingress gateway validation changes apply to http, http2, and grpc: * if "hosts" is excluded, consul will use its default domain e.g. <service-name>.ingress.dc1.consul * can't set hosts with "" service name test http2 and grpc too	2023-01-11 11:52:32 -06:00
Seth Hoenig	719eee8112	consul: add client configuration for grpc_ca_file (#15701 ) * [no ci] first pass at plumbing grpc_ca_file * consul: add support for grpc_ca_file for tls grpc connections in consul 1.14+ This PR adds client config to Nomad for specifying consul.grpc_ca_file These changes combined with https://github.com/hashicorp/consul/pull/15913 should finally enable Nomad users to upgrade to Consul 1.14+ and use tls grpc connections. * consul: add cl entry for grpc_ca_file * docs: mention grpc_tls changes due to Consul 1.14	2023-01-11 09:34:28 -06:00
Dao Thanh Tung	09b25d71b8	cli: Add a nomad operator client state command (#15469 ) Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>	2023-01-11 10:03:31 -05:00
Phil Renaud	76bed82192	[ui] Show task events in the sidebar (#15733 ) * Add task events to task logs sidebar * Max-heighting inner table when present for nice looking borders	2023-01-10 17:02:21 -05:00
Phil Renaud	4e16ccc5fa	Basic sidebar expander (#15735 )	2023-01-10 16:35:53 -05:00
Luiz Aoqui	ed5fccc183	scheduler: allow using device ID as attribute (#15455 ) Devices are fingerprinted as groups of similar devices. This prevented specifying specific devices by their ID in constraint and affinity rules. This commit introduces the `${device.ids}` attribute that returns a comma separated list of IDs that are part of the device group. Users can then use the set operators to write rules.	2023-01-10 14:28:23 -05:00
Seth Hoenig	83450c8762	vault: configure user agent on Nomad vault clients (#15745 ) * vault: configure user agent on Nomad vault clients This PR attempts to set the User-Agent header on each Vault API client created by Nomad. Still need to figure a way to set User-Agent on the Vault client created internally by consul-template. * vault: fixup find-and-replace gone awry	2023-01-10 10:39:45 -06:00
Seth Hoenig	2868a45982	docker: configure restart policy for networking pause container (#15732 ) This PR modifies the configuration of the networking pause container to include the "unless-stopped" restart policy. The pause container should always be restored into a running state until Nomad itself issues a stop command for the container. This is not a _perfect_ fix for #12216 but it should cover the 99% use case - where a pause container gets accidentally stopped / killed for some reason. There is still a possibility where the pause container and main task container are stopped and started in the order where the bad behavior persists, but this is fundamentally unavoidable due to how docker itself abstracts and manages the underlying network namespace referenced by the containers. Closes #12216	2023-01-10 07:50:09 -06:00
Dao Thanh Tung	ca2f509e82	agent: Make agent syslog log level inherit from Nomad agent log (#15625 )	2023-01-04 09:38:06 -05:00
Tim Gross	8859e1bff1	csi: Fix parsing of '=' in secrets at command line and HTTP (#15670 ) The command line flag parsing and the HTTP header parsing for CSI secrets incorrectly split at more than one '=' rune, making it impossible to use secrets that included that rune.	2023-01-03 16:28:38 -05:00
Dao Thanh Tung	53cd1b4871	fix: `stale` querystring parameter value as boolean (#15605 ) * Add changes to make stale querystring param boolean Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * Make error message more consistent Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * Changes from code review + Adding CHANGELOG file Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * Changes from code review to use github.com/shoenig/test package Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * Change must.Nil() to must.NoError() Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * Minor fix on the import order Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * Fix existing code format too Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * Minor changes addressing code review feedbacks Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> * swap must.EqOp() order of param provided Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg> Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>	2023-01-01 13:04:14 -06:00
Danish Prakash	dc81568f93	command/job_stop: accept multiple jobs, stop concurrently (#12582 ) * command/job_stop: accept multiple jobs, stop concurrently Signed-off-by: danishprakash <grafitykoncept@gmail.com> * command/job_stop_test: add test for multiple job stops Signed-off-by: danishprakash <grafitykoncept@gmail.com> * improve output, add changelog and docs Signed-off-by: danishprakash <grafitykoncept@gmail.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-12-16 15:46:58 -08:00
Phil Renaud	dce8717866	[ui] Token management interface on policy pages (#15435 ) * basic-functionality demo for token CRUD * Styling for tokens crud * Tokens crud styles * Expires, not expiry * Mobile styles etc * Refresh and redirect rules for policy save and token creation * Delete method and associated serializer change * Ability-checking for tokens * Update policies acceptance tests to reflect new redirect rules * Token ability unit tests * Mirage config methods for token crud * Token CRUD acceptance tests * A couple visual diff snapshots * Add and Delete abilities referenced for token operations * Changing timeouts and adding a copy to clipboard action * replaced accessor with secret when copying to clipboard * PR comments addressed * Simplified error passing for policy editor	2022-12-15 13:11:28 -05:00
Tim Gross	989d7d9fcf	csi: avoid a nil pointer when handling plugin events (#15518 ) If a plugin crashes quickly enough, we can get into a situation where the deregister function is called before it's ever registered. Safely handle the resulting nil pointer in the dynamic registry by not emitting a plugin event, but also update the plugin event handler to tolerate nil pointers in case we wire it up elsewhere in the future.	2022-12-12 08:42:57 -05:00
Seth Hoenig	51a2212d3d	client: sandbox go-getter subprocess with landlock (#15328 ) * client: sandbox go-getter subprocess with landlock This PR re-implements the getter package for artifact downloads as a subprocess. Key changes include On all platforms, run getter as a child process of the Nomad agent. On Linux platforms running as root, run the child process as the nobody user. On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock). On all platforms, restrict environment variables of the child process to a static set. notably TMP/TEMP now points within the allocation's task directory kernel.landlock attribute is fingerprinted (version number or unavailable) These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector. Adds new e2e/artifact suite for ensuring artifact downloading works. TODO: Windows git test (need to modify the image, etc... followup PR) * landlock: fixup items from cr * cr: fixup tests and go.mod file	2022-12-07 16:02:25 -06:00
Phil Renaud	ce0ffdd077	[ui] Policies UI (#13976 ) Co-authored-by: Mike Nomitch <mail@mikenomitch.com>	2022-12-06 12:45:36 -05:00
Seth Hoenig	3ed37b0b1d	fingerprint: add fingerprinting for CNI plugins presence and version (#15452 ) This PR adds a fingerprinter to set the attribute "plugins.cni.version.<name>" => "<version>" for each CNI plugin in <client>.cni_path (/opt/cni/bin by default).	2022-12-05 14:22:47 -06:00
Phil Renaud	541ca94576	[ui] Adding canary_tags to the web UI (#15458 ) * Adding canary_tags to anyplace we show service tags * CSS moved and tabs to spaces	2022-12-05 14:50:17 -05:00
Phil Renaud	df749ff54a	Add namespaces to exec window (#15454 )	2022-12-02 15:38:01 -05:00
Seth Hoenig	119f7b1cd1	consul: fixup expected consul tagged_addresses when using ipv6 (#15411 ) This PR is a continuation of #14917, where we missed the ipv6 cases. Consul auto-inserts tagged_addresses for keys - lan_ipv4 - wan_ipv4 - lan_ipv6 - wan_ipv6 even though the service registration coming from Nomad does not contain such elements. When doing the differential between services Nomad expects to be registered vs. the services actually registered into Consul, we must first purge these automatically inserted tagged_addresses if they do not exist in the Nomad view of the Consul service.	2022-12-01 07:38:30 -06:00
dependabot[bot]	944a7dbb70	build(deps): bump google.golang.org/grpc from 1.50.1 to 1.51.0 (#15402 ) * build(deps): bump google.golang.org/grpc from 1.50.1 to 1.51.0 Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.50.1 to 1.51.0. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](https://github.com/grpc/grpc-go/compare/v1.50.1...v1.51.0) --- updated-dependencies: - dependency-name: google.golang.org/grpc dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * changelog: add entry for #15402 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-11-29 14:55:17 -05:00
Seth Hoenig	a65fbeb3b3	client: manually cleanup leaked iptables rules (#15407 ) This PR adds a secondary path for cleaning up iptables created for an allocation when the normal CNI library fails to do so. This typically happens when the state of the pause container is unexpected - e.g. deleted out of band from Nomad. Before, the iptables rules would be leaked which could lead to unexpected nat routing behavior later on (in addition to leaked resources). With this change, we scan for the rules created on behalf of the allocation being GC'd and delete them. Fixes #6385	2022-11-28 11:32:16 -06:00
Phil Renaud	ffd16dfec6	[ui, epic] SSO and Auth improvements (#15110 ) * Top nav auth dropdown (#15055) * Basic dropdown styles * Some cleanup * delog * Default nomad hover state styles * Component separation-of-concerns and acceptance tests for auth dropdown * lintfix * [ui, sso] Handle token expiry 500s (#15073) * Handle error states generally * Dont direct, just redirect * no longer need explicit error on controller * Redirect on token-doesnt-exist * Forgot to import our time lib * Linting on _blank * Redirect tests * changelog * [ui, sso] warn user about pending token expiry (#15091) * Handle error states generally * Dont direct, just redirect * no longer need explicit error on controller * Linting on _blank * Custom notification actions and shift the template to within an else block * Lintfix * Make the closeAction optional * changelog * Add a mirage token that will always expire in 11 minutes * Test for token expiry with ember concurrency waiters * concurrency handling for earlier test, and button redirect test * [ui] if ACLs are disabled, remove the Sign In link from the top of the UI (#15114) * Remove top nav link if ACLs disabled * Change to an enabled-by-default model since you get no agent config when ACLs are disabled but you lack a token * PR feedback addressed; down with double negative conditionals * lintfix * ember getter instead of ?.prop * [SSO] Auth Methods and Mock OIDC Flow (#15155) * Big ol first pass at a redirect sign in flow * dont recursively add queryparams on redirect * Passing state and code qps * In which I go off the deep end and embed a faux provider page in the nomad ui * Buggy but self-contained flow * Flow auto-delay added and a little more polish to resetting token * secret passing turned to accessor passing * Handle SSO Failure * General cleanup and test fix * Lintfix * SSO flow acceptance tests * Percy snapshots added * Explicitly note the OIDC test route is mirage only * Handling failure case for complete-auth * Leentfeex * 
Tokens page styles (#15273) * styling and moving columns around * autofocus and enter press handling * Styles refined * Split up manager and regular tests * Standardizing to a binary status state * Serialize auth-methods response to use "name" as primary key (#15380) * Serializer for unique-by-name * Use @classic because of class extension	2022-11-28 10:44:52 -05:00
Luiz Aoqui	8f91be26ab	scheduler: create placements for non-register MRD (#15325 ) * scheduler: create placements for non-register MRD For multiregion jobs, the scheduler does not create placements on registration because the deployment must wait for the other regions. One of these regions will then trigger the deployment to run. Currently, this is done in the scheduler by considering any eval for a multiregion job as "paused" since it's expected that another region will eventually unpause it. This becomes a problem where evals not triggered by a job registration happen, such as on a node update. These types of regional changes do not have other regions waiting to progress the deployment, and so they were never resulting in placements. The fix is to create a deployment at job registration time. This additional piece of state allows the scheduler to differentiate between a multiregion change, where there are other regions engaged in the deployment so no placements are required, from a regional change, where the scheduler does need to create placements. This deployment starts in the new "initializing" status to signal to the scheduler that it needs to compute the initial deployment state. The multiregion deployment will wait until this deployment state is persisted and its status is set to "pending". Without this state transition it's possible to hit a race condition where the plan applier and the deployment watcher may step on each other and overwrite their changes. * changelog: add entry for #15325	2022-11-25 12:45:34 -05:00
Piotr Kazmierczak	9c85315bd2	bugfix: typos in acl role commands (#15382 ) Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2022-11-25 10:28:33 +01:00
Tim Gross	8657695322	scheduler: set job on system stack for CSI feasibility check (#15372 ) When the scheduler checks feasibility of each node, it creates a "stack" which carries attributes of the job and task group it needs to check feasibility for. The `system` and `sysbatch` scheduler use a different stack than `service` and `batch` jobs. This stack was missing the call to set the job ID and namespace for the CSI check. This prevents CSI volumes from being scheduled for system jobs whenever the volume is in a non-default namespace. Set the job ID and namespace to match the generic scheduler.	2022-11-23 16:47:35 -05:00
Jack	62f7de7ed5	cli: `wait` flag for use with `deployment status -monitor` (#15262 )	2022-11-23 16:36:13 -05:00
Sam	4689822628	Fix missing host header in http check (#15337 )	2022-11-23 08:58:13 -05:00
Phil Renaud	3189826a5b	Task sub row alignment changes (#15363 )	2022-11-22 15:49:50 -05:00
Lance Haig	0263e7af34	Add command "nomad tls" (#14296 )	2022-11-22 14:12:07 -05:00
James Rasell	e2a2ea68fc	client: accommodate Consul 1.14.0 gRPC and agent self changes. (#15309 ) * client: accommodate Consul 1.14.0 gRPC and agent self changes. Consul 1.14.0 changed the way in which gRPC listeners are configured, particularly when using TLS. Prior to the change, a single listener was responsible for handling plain-text and encrypted gRPC requests. In 1.14.0 and beyond, separate listeners will be used for each, defaulting to 8502 and 8503 for plain-text and TLS respectively. The change means that Nomad’s Consul Connect integration would not work when integrated with Consul clusters using TLS and running 1.14.0 or greater. The Nomad Consul fingerprinter identifies the gRPC port Consul has exposed using the "DebugConfig.GRPCPort" value from Consul’s “/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only represents the plain-text gRPC port which is likely to be disabled in clusters running TLS. In order to fix this issue, Nomad now takes into account the Consul version and configured scheme to optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent self return. The “consul_grpc_socket” allocrunner hook has also been updated so that the fingerprinted gRPC port attribute is passed in. This provides a better fallback method, when the operator does not configure the “consul.grpc_address” option. * docs: modify Consul Connect entries to detail 1.14.0 changes. * changelog: add entry for #15309 * fixup: tidy tests and clean version match from review feedback. * fixup: use strings tolower func.	2022-11-21 09:19:09 -06:00
Seth Hoenig	bf4b5f9a8d	consul: add trace logging around service registrations (#15311 ) This PR adds trace logging around the differential done between a Nomad service registration and its corresponding Consul service registration, in an effort to shed light on why a service registration request is being made.	2022-11-21 08:03:56 -06:00
Phil Renaud	11dc19b307	[ui] Show Consul Connect upstreams / on update info in sidebar (#15324 ) * Added consul connect icon and sidebar info * Show icon to the right of name	2022-11-18 22:49:10 -05:00
James Rasell	3225cf77b6	api: ensure all request body decode errors return a 400 status code. (#15252 )	2022-11-18 17:04:33 +01:00
stswidwinski	7b6e856a29	Add mount propagation to protobuf definition of mounts (#15096 ) * Add mount propagation to protobuf definition of mounts * Fix formatting * Add mount propagation to the simple roundtrip test. * changelog: add entry for #15096 Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-11-17 18:14:59 -05:00
Tim Gross	d0f9e887f7	autopilot: include only servers from the same region (#15290 ) When we migrated to the updated autopilot library in Nomad 1.4.0, the interface for finding servers changed. Previously autopilot would get the serf members and call `IsServer` on each of them, leaving it up to the implementor to filter out clients (and in Nomad's case, other regions). But in the "new" autopilot library, the equivalent interface is `KnownServers` for which we did not filter by region. This causes spurious attempts for the cross-region stats fetching, which results in TLS errors and a lot of log noise. Filter the member set by region to fix the regression.	2022-11-17 12:09:36 -05:00
stswidwinski	75f80e2fdd	Fix goroutine leakage (#15180 ) * Fix goroutine leakage * cl: add cl entry Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-11-17 09:47:11 -06:00
Tim Gross	dd3a07302e	keyring: update handle to state inside replication loop (#15227 ) * keyring: update handle to state inside replication loop When keyring replication starts, we take a handle to the state store. But whenever a snapshot is restored, this handle is invalidated and no longer points to a state store that is receiving new keys. This leaks a bunch of memory too! In addition to operator-initiated restores, when fresh servers are added to existing clusters with large-enough state, the keyring replication can get started quickly enough that it's running before the snapshot from the existing clusters has been restored. Fix this by updating the handle to the state store on each pass.	2022-11-17 08:40:12 -05:00
Tim Gross	6415fb4284	eval broker: shed all but one blocked eval per job after ack (#14621 ) When an evaluation is acknowledged by a scheduler, the resulting plan is guaranteed to cover up to the `waitIndex` set by the worker based on the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index. Move all but the highest priority / highest `ModifyIndex` blocked eval into a canceled set. When the `Eval.Ack` RPC returns from the eval broker it will signal a reap of a batch of cancelable evals to write to raft. This paces the cancelations limited by how frequently the schedulers are acknowledging evals; this should reduce the risk of cancelations from overwhelming raft relative to scheduler progress. In order to avoid straggling batches when the cluster is quiet, we also include a periodic sweep through the cancelable list.	2022-11-16 16:10:11 -05:00
Tim Gross	37134a4a37	eval delete: move batching of deletes into RPC handler and state (#15117 ) During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending large batches of IDs. Although the command's batch size was carefully tuned, we still need to JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied. To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look at the failed options first: * A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight). * A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete. Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only about 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes. 
Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.	2022-11-14 14:08:13 -05:00
Charlie Voiselle	c73fb51d3a	[bug] Return a spec on reconnect (#15214 ) client: fixed a bug where non-`docker` tasks with network isolation would leak network namespaces and iptables rules if the client was restarted while they were running	2022-11-11 13:27:36 -05:00
Seth Hoenig	21237d8337	client: avoid unconsumed channel in timer construction (#15215 ) * client: avoid unconsumed channel in timer construction This PR fixes a bug introduced in #11983 where a Timer initialized with 0 duration causes an immediate tick, even if Reset is called before reading the channel. The fix is to avoid doing that, instead creating a Timer with a non-zero initial wait time, and then immediately calling Stop. * pr: remove redundant stop	2022-11-11 09:31:34 -06:00
Tim Gross	eabbcebdd4	exec: allow running commands from host volume (#14851 ) The exec driver and other drivers derived from the shared executor check the path of the command before handing off to libcontainer to ensure that the command doesn't escape the sandbox. But we don't check any host volume mounts, which should be safe to use as a source for executables if we're letting the user mount them to the container in the first place. Check the mount config to verify the executable lives in the mount's host path, but then return an absolute path within the mount's task path so that we can hand that off to libcontainer to run. Includes a good bit of refactoring here because the anchoring of the final task path has different code paths for inside the task dir vs inside a mount. But I've fleshed out the test coverage of this a good bit to ensure we haven't created any regressions in the process.	2022-11-11 09:51:15 -05:00
Piotr Kazmierczak	4851f9e68a	acl: sso auth method schema and store functions (#15191 ) This PR implements ACLAuthMethod type, acl_auth_methods table schema and crud state store methods. It also updates nomadSnapshot.Persist and nomadSnapshot.Restore methods in order for them to work with the new table, and adds two new Raft messages: ACLAuthMethodsUpsertRequestType and ACLAuthMethodsDeleteRequestType This PR is part of the SSO work captured under ☂️ ticket #13120.	2022-11-10 19:42:41 +01:00
Seth Hoenig	6e3309ebc6	template: protect use of template manager with a lock (#15192 ) This PR protects access to `templateHook.templateManager` with its lock. So far we have not been able to reproduce the panic - but it seems either Poststart is running without a Prestart being run first (should be impossible), or the Update hook is running concurrently with Poststart, nil-ing out the templateManager in a race with Poststart. Fixes #15189	2022-11-10 08:30:27 -06:00
Derek Strickland	80b6f27efd	api: remove `mapstructure` tags from `Port` struct (#12916 ) This PR solves a defect in the deserialization of api.Port structs when returning structs from the EventStream. Previously, the api.Port struct's fields were decorated with both mapstructure and hcl tags to support the network.port stanza's use of the keyword static when posting a static port value. This works fine when posting a job and when retrieving any struct that has an embedded api.Port instance as long as the value is deserialized using JSON decoding. The EventStream, however, uses mapstructure to decode event payloads in the api package. mapstructure expects an underlying field named static which does not exist. The result was that the Port.Value field would always be set to 0. Upon further inspection, a few things became apparent. The struct already has hcl tags that support the indirection during job submission. Serialization/deserialization with both the json and hcl packages produce the desired result. The use of the mapstructure tags provided no value as the Port struct contains only fields with primitive types. This PR: Removes the mapstructure tags from the api.Port structs Updates the job parsing logic to use hcl instead of mapstructure when decoding Port instances. Closes #11044 Co-authored-by: DerekStrickland <dstrickland@hashicorp.com> Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>	2022-11-08 11:26:28 +01:00
Drew Gonzales	aac9404ee5	server: add git revision to serf tags (#9159 )	2022-11-07 10:34:33 -05:00
Phil Renaud	85521c49c4	[ui] Remove animation from task logs sidebar (#15146 ) * Remove animation from task logs sidebar * changelog	2022-11-07 10:11:18 -05:00
Tim Gross	9e1c0b46d8	API for `Eval.Count` (#15147 ) Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is designed to support interactive use in the `nomad eval delete` command to get a count of evals expected to be deleted before doing so. The state store operations to do this sort of thing are somewhat expensive, but it's cheaper than serializing a big list of evals to JSON. Note that although it seems like this could be done as an extra parameter and response field on `Eval.List`, having it as its own endpoint avoids having to change the response body shape and lets us avoid handling the legacy filter params supported by `Eval.List`.	2022-11-07 08:53:19 -05:00
Luiz Aoqui	e4c8b59919	Update alloc after reconnect and enforce client heartbeat order (#15068 ) * scheduler: allow updates after alloc reconnects When an allocation reconnects to a cluster the scheduler needs to run special logic to handle the reconnection, check if a replacement was created and stop one of them. If the allocation kept running while the node was disconnected, it will be reconnected with `ClientStatus: running` and the node will have `Status: ready`. This combination is the same as the normal steady state of allocation, where everything is running as expected. In order to differentiate between the two states (an allocation that is reconnecting and one that is just running) the scheduler needs an extra piece of state. The current implementation uses the presence of a `TaskClientReconnected` task event to detect when the allocation has reconnected and thus must go through the reconnection process. But this event remains even after the allocation is reconnected, causing all future evals to consider the allocation as still reconnecting. This commit changes the reconnect logic to use an `AllocState` to register when the allocation was reconnected. This provides the following benefits: - Only a limited number of task states are kept, and they are used for many other events. It's possible that, upon reconnecting, several actions are triggered that could cause the `TaskClientReconnected` event to be dropped. - Task events are set by clients and so their timestamps are subject to time skew from servers. This prevents using time to determine if an allocation reconnected after a disconnect event. - Disconnect events are already stored as `AllocState` and so storing reconnects there as well makes it the only source of information required. With the new logic, the reconnection logic is only triggered if the last `AllocState` is a disconnect event, meaning that the allocation has not been reconnected yet. 
After the reconnection is handled, the new `ClientStatus` is stored in `AllocState` allowing future evals to skip the reconnection logic. * scheduler: prevent spurious placement on reconnect When a client reconnects it makes two independent RPC calls: - `Node.UpdateStatus` to heartbeat and set its status as `ready`. - `Node.UpdateAlloc` to update the status of its allocations. These two calls can happen in any order, and in case the allocations are updated before a heartbeat it causes the state to be the same as a node being disconnected: the node status will still be `disconnected` while the allocation `ClientStatus` is set to `running`. The current implementation did not handle this order of events properly, and the scheduler would create an unnecessary placement since it considered the allocation was being disconnected. This extra allocation would then be quickly stopped by the heartbeat eval. This commit adds a new code path to handle this order of events. If the node is `disconnected` and the allocation `ClientStatus` is `running` the scheduler will check if the allocation is actually reconnecting using its `AllocState` events. * rpc: only allow alloc updates from `ready` nodes Clients interact with servers using three main RPC methods: - `Node.GetAllocs` reads allocation data from the server and writes it to the client. - `Node.UpdateAlloc` reads allocations from the client and writes them to the server. - `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism. These three methods are called periodically by the clients and are done so independently from each other, meaning that there can't be any assumptions in their ordering. This can generate scenarios that are hard to reason about and to code for. For example, when a client misses too many heartbeats it will be considered `down` or `disconnected` and the allocations it was running are set to `lost` or `unknown`. 
When connectivity is restored to the rest of the cluster, the natural mental model is to think that the client will heartbeat first and then update its allocations status into the servers. But since there's no inherent order in these calls the reverse is just as possible: the client updates the alloc status and then heartbeats. This results in a state where allocs are, for example, `running` while the client is still `disconnected`. This commit adds a new verification to the `Node.UpdateAlloc` method to reject updates from nodes that are not `ready`, forcing clients to heartbeat first. Since this check is done server-side there is no need to coordinate operations client-side: they can continue sending these requests independently and alloc updates will succeed after the heartbeat is done. * changelog: add entry for #15068 * code review * client: skip terminal allocations on reconnect When the client reconnects with the server it synchronizes the state of its allocations by sending data using the `Node.UpdateAlloc` RPC and fetching data using the `Node.GetClientAllocs` RPC. If the data fetch happens before the data write, `unknown` allocations will still be in this state and would trigger the `allocRunner.Reconnect` flow. But when the server `DesiredStatus` for the allocation is `stop` the client should not reconnect the allocation. * apply more code review changes * scheduler: persist changes to reconnected allocs Reconnected allocs have a new AllocState entry that must be persisted by the plan applier. * rpc: read node ID from allocs in UpdateAlloc The AllocUpdateRequest struct is used in three disjoint use cases: 1. Stripped allocs from clients Node.UpdateAlloc RPC using the Allocs, and WriteRequest fields 2. Raft log message using the Allocs, Evals, and WriteRequest fields 3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields Adding a new field that would only be used in one of these cases (1) made things more confusing and error prone. 
While in theory an AllocUpdateRequest could send allocations from different nodes, in practice this never actually happens since only clients call this method with their own allocations. * scheduler: remove logic to handle exceptional case This condition could only be hit if, somehow, the allocation status was set to "running" while the client was "unknown". This was addressed by enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC calls, so this scenario is not expected to happen. Adding unnecessary code to the scheduler makes it harder to read and reason about it. * more code review * remove another unused test	2022-11-04 16:25:11 -04:00
Luiz Aoqui	1b87d292a3	client: retry RPC call when no server is available (#15140 ) When a Nomad service starts it tries to establish a connection with servers, but it also runs alloc runners to manage whatever allocations it needs to run. The alloc runner will invoke several hooks to perform actions, with some of them requiring access to the Nomad servers, such as Native Service Discovery Registration. If the alloc runner starts before a connection is established the alloc runner will fail, causing the allocation to be shut down. This is particularly problematic for disconnected allocations that are reconnecting, as they may fail as soon as the client reconnects. This commit changes the RPC request logic to retry it, using the existing retry mechanism, if there are no servers available.	2022-11-04 14:09:39 -04:00
Charlie Voiselle	79c4478f5b	template: error on missing key (#15141 ) * Support error_on_missing_value for templates * Update docs for template stanza	2022-11-04 13:23:01 -04:00
Charlie Voiselle	83e43e01c1	Add missing timer reset (#15134 )	2022-11-03 18:57:57 -04:00
Ethan	654ae1d591	fix: batchFirstFingerprints does not update device on node after v1.3.5 (#15125 ) * fix: update device in batch first footprint * cl: add cl note Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-11-03 16:31:39 -05:00
Tim Gross	672fb46d12	WI: set identity to client secret if missing (#15121 ) Allocations created before 1.4.0 will not have a workload identity token. When the client running these allocs is upgraded to 1.4.x, the identity hook will run and replace the node secret ID token used previously with an empty string. This causes service discovery queries to fail. Fallback to the node's secret ID when the allocation doesn't have a signed identity. Note that pre-1.4.0 allocations won't have templates that read Variables, so there's no threat that this new node ID secret will be able to read data that the allocation shouldn't have access to.	2022-11-03 11:10:11 -04:00
Phil Renaud	ffb4c63af7	[ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833 ) * Adds meta to job list stub and displays a pack logo on the jobs index * Changelog * Modifying struct for optional meta param * Explicitly ask for meta anytime I look up a job from index or job page * Test case for the endpoint * adding meta field to API struct and omitting from response if empty * passthru method added to api/jobs.list * Meta param listed in docs for jobs list * Update api/jobs.go Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-11-02 16:58:24 -04:00
Phil Renaud	6d5fe56fa1	Job spec upload (#14747 ) * Job spec upload by click or drag * pseudo-restrict formats * Changelog * Tweak to job spec upload to be above editor layer * Within the job-editor again tho * Beginning testcase cleanup * Test progression * refact: update codemirror fillin logic Co-authored-by: Jai Bhagat <jaybhagat841@gmail.com>	2022-11-02 10:34:10 -04:00
Tim Gross	4d7a4171cd	volumewatcher: prevent panic on nil volume (#15101 ) If a GC claim is written and then volume is deleted before the `volumewatcher` enters its run loop, we panic on the nil-pointer access. Simply doing a nil-check at the top of the loop reveals a race condition around shutting down the loop just as a new update is coming in. Have the parent `volumeswatcher` send an initial update on the channel before returning, so that we're still holding the lock. Update the watcher's `Stop` method to set the running state, which lets us avoid having a second context and makes stopping synchronous. This reduces the cases we have to handle in the run loop. Updated the tests now that we'll safely return from the goroutine and stop the runner in a larger set of cases. Ran the tests with the `-race` detection flag and fixed up any problems found here as well.	2022-11-01 16:53:10 -04:00
Tim Gross	38542f256e	variables: limit rekey eval to half the nack timeout (#15102 ) In order to limit how much the rekey job can monopolize a scheduler worker, we limit how long it can run to 1min before stopping work and emitting a new eval. But this exactly matches the default nack timeout, so it'll fail the eval rather than getting a chance to emit a new one. Set the timeout for the rekey eval to half the configured nack timeout.	2022-11-01 16:50:50 -04:00
Tim Gross	903b5baaa4	keyring: safely handle missing keys and restore GC (#15092 ) When replication of a single key fails, the replication loop breaks early and therefore keys that fall later in the sorting order will never get replicated. This is particularly a problem for clusters impacted by the bug that caused #14981 and that were later upgraded; the keys that were never replicated can now never be replicated, and so we need to handle them safely. Included in the replication fix: * Refactor the replication loop so that each key is replicated in a function call that returns an error, to make the workflow more clear and reduce nesting. Log the error and continue. * Improve stability of keyring replication tests. We no longer block leadership on initializing the keyring, so there's a race condition in the keyring tests where we can test for the existence of the root key before the keyring has been initialized. Change this to an "eventually" test. But these fixes aren't enough to fix #14981 because they'll end up seeing an error once a second complaining about the missing key, so we also need to fix keyring GC so the keys can be removed from the state store. Now we'll store the key ID used to sign a workload identity in the Allocation, and we'll index the Allocation table on that so we can track whether any live Allocation was signed with a particular key ID.	2022-11-01 15:00:50 -04:00
dependabot[bot]	acc94d523f	build(deps): bump github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible (#15078 ) * build(deps): bump github.com/docker/cli Bumps [github.com/docker/cli](https://github.com/docker/cli) from 20.10.18+incompatible to 20.10.21+incompatible. - [Release notes](https://github.com/docker/cli/releases) - [Commits](https://github.com/docker/cli/compare/v20.10.18...v20.10.21) --- updated-dependencies: - dependency-name: github.com/docker/cli dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * deps: updated github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-31 08:50:33 -05:00
dependabot[bot]	369e4da4ad	build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 (#15081 ) * build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.44.84 to 1.44.126. - [Release notes](https://github.com/aws/aws-sdk-go/releases) - [Commits](https://github.com/aws/aws-sdk-go/compare/v1.44.84...v1.44.126) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * deps: update github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-31 08:47:48 -05:00
Tim Gross	2ce1728fa6	Merge release 1.4.2 files Changelog updates for 1.4.2 and backports.	2022-10-27 13:31:29 -04:00
Tim Gross	9d906d4632	variables: fix filter on List RPC The List RPC correctly authorized against the prefix argument. But when filtering results underneath the prefix, it only checked authorization for standard ACL tokens and not Workload Identity. This results in WI tokens being able to read List results (metadata only: variable paths and timestamps) for variables under the `nomad/` prefix that belong to other jobs in the same namespace. Fixes the filtering and split the `handleMixedAuthEndpoint` function into separate authentication and authorization steps so that we don't need to re-verify the claim token on each filtered object. Also includes: * update semgrep rule for mixed auth endpoints * variables: List returns empty set when all results are filtered	2022-10-27 13:08:05 -04:00
James Rasell	da5069bded	event stream: ensure token expiry is correctly checked for subs. This change ensures that a token's expiry is checked before every event is sent to the caller. Previously, a token could still be used to listen for events after it had expired, as long as the subscription was made while it was unexpired. This would last until the token was garbage collected from state. The check occurs within the RPC as there is currently no state update when a token expires.	2022-10-27 13:08:05 -04:00
dependabot[bot]	07796965b1	build(deps): bump google.golang.org/grpc from 1.48.0 to 1.50.1 (#14897 ) * build(deps): bump google.golang.org/grpc from 1.48.0 to 1.50.1 Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.48.0 to 1.50.1. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](https://github.com/grpc/grpc-go/compare/v1.48.0...v1.50.1) --- updated-dependencies: - dependency-name: google.golang.org/grpc dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * cl: add changelog entry for grpc Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-27 11:32:48 -05:00
dependabot[bot]	eb210f2af7	build(deps): bump github.com/fsouza/go-dockerclient from 1.8.2 to 1.9.0 (#14898 ) * build(deps): bump github.com/fsouza/go-dockerclient from 1.8.2 to 1.9.0 Bumps [github.com/fsouza/go-dockerclient](https://github.com/fsouza/go-dockerclient) from 1.8.2 to 1.9.0. - [Release notes](https://github.com/fsouza/go-dockerclient/releases) - [Changelog](https://github.com/fsouza/go-dockerclient/blob/main/container_changes_test.go) - [Commits](https://github.com/fsouza/go-dockerclient/compare/v1.8.2...v1.9.0) --- updated-dependencies: - dependency-name: github.com/fsouza/go-dockerclient dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * cl: add changelog entry Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-27 11:05:45 -05:00
Charlie Voiselle	28cd831085	Update consul-template dep (#15045 )	2022-10-26 11:51:45 -04:00
Tim Gross	aca95c0bc6	keyring: remove root key GC (#15034 )	2022-10-25 17:06:18 -04:00
Seth Hoenig	d69556fb35	client: ensure minimal cgroup controllers enabled (#15027 ) * client: ensure minimal cgroup controllers enabled This PR fixes a bug where Nomad could not operate properly on operating systems that set the root cgroup.subtree_control to a set of controllers that do not include the minimal set of controllers needed by Nomad. Nomad needs these controllers enabled to operate: - cpuset - cpu - io - memory - pids Now, Nomad will ensure these controllers are enabled during Client initialization, adding them to cgroup.subtree_control as necessary. This should be particularly helpful on the RHEL/CentOS/Fedora family of systems. Ubuntu systems should be unaffected as they enable all controllers by default. Fixes: https://github.com/hashicorp/nomad/issues/14494 * docs: cleanup doc string * client: cleanup controller writes, enhance log messages	2022-10-24 16:08:54 -05:00
Seth Hoenig	32744a3548	deps: update hashicorp/raft to v1.3.11 (#15021 ) * deps: update hashicorp/raft to v1.3.11 Includes part of the fix for https://github.com/hashicorp/raft/issues/524 * cl: add changelog entry	2022-10-24 12:10:24 -05:00
Tim Gross	b9922631bd	keyring: fix missing GC config, don't rotate on manual GC (#15009 ) The configuration knobs for root keyring garbage collection are present in the consumer and present in the user-facing config, but we missed the spot where we copy from one to the other. Fix this so that users can set their own thresholds. The root key is automatically rotated every ~30d, but the function that does both rotation and key GC was wired up such that `nomad system gc` caused an unexpected key rotation. Split this into two functions so that `nomad system gc` cleans up old keys without forcing a rotation, which will be done periodically or by the `nomad operator root keyring rotate` command.	2022-10-24 08:43:42 -04:00
James Rasell	206fb04dc1	acl: allow tokens to read policies linked via roles to the token. (#14982 ) ACL tokens are granted permissions either by direct policy links or via ACL role links. Callers should therefore be able to read policies directly assigned to the caller token or indirectly by ACL role links.	2022-10-21 09:05:17 +02:00
Luiz Aoqui	593e48e826	cli: prevent panic on `operator debug` (#14992 ) If the API returns an error during debug bundle collection the CLI was expanding the wrong error object, resulting in a panic since `err` is `nil`.	2022-10-20 15:53:58 -04:00
Jai	08fde3a4ff	refact: upgrade Promise.then to async/await (#14798 ) * refact: upgrade Promise.then to async/await * naive solution (#14800) * refact: use id instead of model * chore: add changelog entry * refact: add conditional safety around alloc	2022-10-20 14:25:41 -04:00
Seth Hoenig	6e9c8a9955	deps: update go-memdb for goroutine leak fix (#14983 ) * deps: update go-memdb for goroutine leak fix * cl: update for goroutine leak go-memdb	2022-10-20 10:34:52 -05:00
James Rasell	215b4e7e36	acl: add ACL roles to event stream topic and resolve policies. (#14923 ) This change adds ACL role creation and deletion to the event stream. It is exposed as a single topic with two types; the filter is primarily the role ID but also includes the role name. While conducting this work it was also discovered that the events stream has its own ACL resolution logic. This did not account for ACL tokens which included role links, or tokens with expiry times. ACL role links are now resolved to their policies and tokens are checked for expiry correctly.	2022-10-20 09:43:35 +02:00
James Rasell	d7b311ce55	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Phil Renaud	54eeb6ebe8	Adds searching and filtering for nodes on topology view (#14913 ) * Adds searching and filtering for nodes on topology view * Lintfix and changelog * Acceptance tests for topology search and filter * Search terms also apply to class and dc on topo page * Initialize queryparam values so as to not break history state	2022-10-19 15:00:35 -04:00
Seth Hoenig	57375566d4	consul: register checks along with service on initial registration (#14944 ) * consul: register checks along with service on initial registration This PR updates Nomad's Consul service client to include checks in an initial service registration, so that the checks associated with the service are registered "atomically" with the service. Before, we would only register the checks after the service registration, which causes problems where the service is deemed healthy, even if one or more checks are unhealthy - especially problematic in the case where SuccessBeforePassing is configured. Fixes #3935 * cr: followup to fix cause of extra consul logging * cr: fix another bug * cr: fixup changelog	2022-10-19 12:40:56 -05:00
James Rasell	8e25048f3d	acl: gate ACL role write and delete RPC usage on v1.4.0 or greater. (#14908 )	2022-10-18 16:46:11 +02:00
James Rasell	9923f9e6f3	nnsd: gate registration write & delete RPC use on v1.3.0 or greater. (#14924 )	2022-10-18 15:30:28 +02:00
Seth Hoenig	f1b902beac	consul: do not re-register already registered services (#14917 ) This PR updates Nomad's Consul service client to do map comparisons using maps.Equal instead of reflect.DeepEqual. The bug fix is in how DeepEqual treats nil slices different from empty slices, when actually they should be treated the same.	2022-10-18 08:10:59 -05:00
Tim Gross	3c78980b78	make version checks specific to region (1.4.x) (#14912 ) * One-time tokens are not replicated between regions, so we don't want to enforce that the version check across all of serf, just members in the same region. * Scheduler: Disconnected clients handling is specific to a single region, so we don't want to enforce that the version check across all of serf, just members in the same region. * Variables: enforce version check in Apply RPC * Cleans up a bunch of legacy checks. This changeset is specific to 1.4.x and the changes for previous versions of Nomad will be manually backported in a separate PR.	2022-10-17 16:23:51 -04:00
Tim Gross	c721ce618e	keyring: filter by region before checking version (#14901 ) In #14821 we fixed a panic that can happen if a leadership election happens in the middle of an upgrade. That fix checks that all servers are at the minimum version before initializing the keyring (which blocks evaluation processing during the upgrade). But the check we implemented is over the serf membership, which includes servers in any federated regions, which don't necessarily have the same upgrade cycle. Filter the version check by the leader's region. Also bump up log levels of major keyring operations	2022-10-17 13:21:16 -04:00
Tim Gross	bcd26f8815	docker_logger: reorder imports to save memory (#14875 ) Nomad runs one logmon process and also one docker_logger process for each running allocation. A naive look at memory usage shows 10-30 MB of RSS, but a closer look shows that most of this memory (ex. all but ~2MB for logmon) is shared (`Shared_Clean` in Linux pmap). But a heap dump of docker_logger shows that it currently has an extra ~2500 KiB of heap (anonymously-mapped unshared memory) used for init blocks coming from the agent code (ex. mostly regexes from go-version, structs, and the Consul SDK). The packages for running logmon, docker_logger, and executor have an init block that parses `os.Args` to drop into their own logic, which prevents them from loading all the rest of the agent code and saves on memory, so this was unexpected. It looks like we accidentally reordered the imports in main to undo some of the work originally done in 404d2d4c98f1df930be1ae9852fe6e6ae8c1517e. This changeset restores the ordering. A follow-up heap dump shows this saves ~2MB of unshared RSS per docker_logger process.	2022-10-11 13:23:03 -04:00
Seth Hoenig	1593963cd1	servicedisco: implicit constraint for nomad v1.4 when using nsd checks (#14868 ) This PR adds a jobspec mutator to constrain jobs making use of checks in the nomad service provider to nomad clients of at least v1.4.0. Before, in a mixed client version cluster it was possible to submit an NSD job making use of checks and for that job to land on an older, incompatible client node. Closes #14862	2022-10-11 08:21:42 -05:00
Seth Hoenig	69ced2a2bd	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Phil Renaud	e771b94164	[ui] Makes service tags wrap and look like tag items (#14834 ) * Makes service tags wrap and look like tag items * Add a little vertical spacing and changelog * Put client before tags * Force tags list to new line	2022-10-07 09:23:52 -04:00
Damian Czaja	95f969c4bf	cli: add `nomad fmt` (#14779 )	2022-10-06 17:00:29 -04:00
Phil Renaud	4b93a30225	[ui] Line charts: explicitly update X-axis whenever xScale changes (#14814 ) * Explicitly update X-axis whenever xScale changes * Changelog	2022-10-06 16:59:16 -04:00
Hemanth Krishna	e516fc266f	enhancement: UpdateTask when Task is waiting for ShutdownDelay (#14775 ) Signed-off-by: Hemanth Krishna <hkpdev008@gmail.com>	2022-10-06 16:33:28 -04:00
Will Jordan	8ae13208c9	Allow jobs not requiring any network resources (#14300 ) Jobs not requiring any network resources should be allowed even when the network fingerprinter is disabled.	2022-10-06 16:25:41 -04:00
Gabriel Villalonga Simon	b974c32ba6	Check that JobPlanResponse Diff Type is None before checking for changes on getExitCode (#14492 )	2022-10-06 16:23:22 -04:00
Pablo Ruiz García	40416be7b1	Invoke FingerprintManager's Reload() func during agent's SIGHUP (#14615 ) Fixes #14614	2022-10-06 16:22:59 -04:00
Giovani Avelar	a625de2062	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Tim Gross	80ec5e1346	fix panic from keyring raft entries being written during upgrade (#14821 ) During an upgrade to Nomad 1.4.0, if a server running 1.4.0 becomes the leader before one of the 1.3.x servers, the old server will crash because the keyring is initialized and writes a raft entry. Wait until all members are on a version that supports the keyring before initializing it.	2022-10-06 12:47:02 -04:00
Luiz Aoqui	b924802958	template: apply splay value on change_mode script (#14749 ) Previously, the splay timeout was only applied if a template re-render caused a restart or a signal action. The `change_mode = "script"` was running after the `if restart || len(signals) != 0` check, so it was invoked at all times. This change refactors the logic so it's easier to notice that new `change_mode` options should start only after `splay` is applied.	2022-09-30 12:04:22 -04:00
Seth Hoenig	c68ed3b4c8	client: protect user lookups with global lock (#14742 ) * client: protect user lookups with global lock This PR updates Nomad client to always do user lookups while holding a global process lock. This is to prevent concurrency unsafe implementations of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo). * cl: add cl	2022-09-29 09:30:13 -05:00
Derek Strickland	4c73a3b1dc	Remove changelog entry for test update PR	2022-09-27 18:17:49 -04:00
Derek Strickland	52e4997ace	Add enterprise tag	2022-09-27 17:50:25 -04:00
Derek Strickland	ef0f8c5b81	Add enterprise tag	2022-09-27 17:49:27 -04:00
Derek Strickland	6738684167	Delete 14665.txt	2022-09-27 17:47:35 -04:00
Derek Strickland	87bdb74221	Remove bug fix changelog files	2022-09-27 17:46:32 -04:00
Derek Strickland	cacf4bb8e1	Fix changelog entry type	2022-09-27 14:33:39 -04:00
Jim Razmus II	7da3fd050b	jobspec: allow artifact headers in HCLv1 (#14637 ) * jobspec: allow artifact headers in HCLv1 Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-27 12:18:49 -04:00
Seth Hoenig	5df5e70542	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in scheduler/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Tim Gross	87681fca68	CSI: ensure initial unpublish state is checkpointed (#14675 ) A test flake revealed a bug in the CSI unpublish workflow, where an unpublish that comes from a client that's successfully done the node-unpublish step will not have the claim checkpointed if the controller-unpublish step fails. This will result in a delay in releasing the volume claim until the next GC. This changeset also ensures we're using a new snapshot after each write to raft, and fixes two timing issues in test where either the volume watcher can unpublish before the unpublish RPC is sent or we don't wait long enough in resource-restricted environments like GHA.	2022-09-27 08:43:45 -04:00
Michael Schurter	e6af1c0a14	fingerprint: add node attr for reservable cores (#14694 ) * fingerprint: add node attr for reservable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
Luiz Aoqui	5c100c0d3d	client: recover from getter panics (#14696 ) The artifact getter uses the go-getter library to fetch files from different sources. Any bug in this library that results in a panic can cause the entire Nomad client to crash due to a single file download attempt. This change aims to guard against these types of crashes by recovering from panics when the getter attempts to download an artifact. The resulting panic is converted to an error that is stored as a task event for operator visibility and the panic stack trace is logged to the client's log.	2022-09-26 15:16:26 -04:00
Luiz Aoqui	f7c6534a79	cli: set content length on `operator api` requests (#14634 ) http.NewRequestWithContext will only set the right value for Content-Length if the input is *bytes.Buffer, *bytes.Reader, or *strings.Reader [0]. Since os.Stdin is an *os.File, POST requests made with the `nomad operator api` command would always have Content-Length set to -1, which is interpreted as an unknown length by web servers. [0]: https://pkg.go.dev/net/http#NewRequestWithContext	2022-09-26 14:21:40 -04:00
Phil Renaud	497bd02169	[ui] Warn users when they leave an edited but unsaved variable page (#14665 ) * Warning on attempt to leave * Lintfix * Only router.off once * Dont warn on transition when only updating queryparams * Remove double-push and queryparam-only issues, thanks @lgfa29 * Acceptance tests * Changelog	2022-09-23 16:53:40 -04:00
Phil Renaud	a28e1bcc1e	[ui] Service Healthchecks: styles for pseudo-timestamp axis (#14677 ) * Styles for pseudo-timestamp axis * Changelog	2022-09-23 16:53:28 -04:00
Tim Gross	17aee4d69c	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Derek Strickland	6874997f91	scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments (#14659 ) * scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-22 14:31:27 -04:00
Jorge Marey	92158a1c62	connect: add nomad env to envoy bootstrap (#12959 ) * Add nomad env to envoy bootstrap * Add changelog file	2022-09-22 13:18:18 -05:00
Phil Renaud	eca0e7bf56	[ui] task logs in sidebar (#14612 ) * button styles * Further styles including global toggle adjustment * sidebar funcs and header * Functioning task logs in high-level sidebars * same-lineify the show tasks toggle * Changelog * Full-height sidebar calc in css, plz drop soon container queries * Active status and query params for allocations page * Reactive shouldShowLogs getter and added to client and task group pages * Higher order func passing, thanks @DingoEatingFuzz * Non-service job types get allocation params passed * Keyframe animation for task log sidebar * Acceptance test * A few more sub-row tests * Lintfix	2022-09-22 10:58:52 -04:00
Tim Gross	c29c4bd66c	cli: remove deprecated `eval status -json` list behavior (#14651 ) In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and deprecated the usage of `eval status` without an evaluation ID with an upgrade note that it would be removed in Nomad 1.4.0. This changeset completes that work.	2022-09-22 10:56:32 -04:00
Jorge Marey	584ddfe859	Add Namespace, Job and Group to envoy stats (#14311 )	2022-09-22 10:38:21 -04:00
Tim Gross	d327a68696	operator debug: write NDJSON for large collections (#14610 ) The `operator debug` command writes JSON files from API responses as a single line containing an array of JSON objects. But some of these files can be extremely large (GB's) for large production clusters, which makes it difficult to parse them using typical line-oriented Unix command line tools that can stream their inputs without consuming a lot of memory. For collections that are typically large, instead emit newline-delimited JSON. This changeset includes some first-pass refactoring of this command. It breaks up monolithic methods that validate a path, create a file, serialize objects, and write them to disk into smaller functions, some of which can now be standalone to take advantage of generics.	2022-09-22 10:02:00 -04:00
James Rasell	a25028c412	cli: fix a bug in operator API when setting HTTPS via address. (#14635 ) Operators may have a setup whereby the TLS config comes from a source other than setting Nomad specific env vars. In this case, we should attempt to identify the scheme using the config setting as a fallback.	2022-09-22 15:43:58 +02:00
Luiz Aoqui	ad48401219	chore: move changelog file to the right folder (#14639 )	2022-09-21 13:50:22 -04:00
Tim Gross	38a6e7e343	remove 1.4.0 changelog entry that refers to bugfix on new code (#14611 ) Bug fixes on new features in Nomad 1.4.0 don't need or want changelog entries in the same changelog the feature appeared, so remove this one.	2022-09-16 16:14:02 -04:00
Phil Renaud	d6c9676252	Added task links to various alloc tables (#14592 ) * Added task links to various alloc tables * Lintfix * Border collapse and added to task group page * Logs icon temporarily removed and localStorage added * Mock task added to test * Delog * Two asserts in new test * Remove commented-out code * Changelog * Removing args.allocation deps	2022-09-16 15:58:22 -04:00
Phil Renaud	cebfbb0c28	Stabilizing percy snapshots with faker (#14551 ) * First attempt at stabilizing percy snapshots with faker * Tokens seed moved to before management token generation * Faker seed only in token test * moving seed after storage clear * And again, but back to no faker seeding * Isolated seed and temporary log * Setting seed(1) wherever we're snapshotting, or before establishing cluster scenarios * Deliberate noop to see if percy is stable * Changelog entry	2022-09-14 11:27:48 -04:00
Mahmood Ali	a9d5e4c510	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save a few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Tim Gross	eb757606f3	changelog entry for variables (#14509 )	2022-09-13 10:25:26 -04:00
Derek Strickland	5ca934015b	job_endpoint: check spec for all regions (#14519 ) * job_endpoint: check spec for all regions	2022-09-12 09:24:26 -04:00
James Rasell	009948186b	changelog: add entry for #14320 (#14518 )	2022-09-09 17:25:50 +02:00
James Rasell	f51a8c73e6	deps: update armon/go-metrics to v0.4.1 (#14493 )	2022-09-09 09:20:55 +02:00
Charlie Voiselle	e58998e218	Add client scheduling eligibility to heartbeat (#14483 )	2022-09-08 14:31:36 -04:00
Tim Gross	3fc7482ecd	CSI: failed allocation should not block its own controller unpublish (#14484 ) A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC. The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check has a bug where the allocation belonging to the claim being freed was included in the check incorrectly. During a normal allocation stop for job stop or a new version of the job, the allocation is terminal. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server. For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. Include a test case covering this path. Also includes two minor extra bugfixes: * Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps. * The list stub object for volumes in `nomad/structs` did not match the stub object in `api`. The `api` package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.	2022-09-08 13:30:05 -04:00
Seth Hoenig	a608e7950e	helper: guard against negative inputs into random stagger This PR modifies RandomStagger to protect against negative input values. If the given interval is negative, the value returned will be somewhere in the stratosphere. Instead, treat negative inputs like zero, returning zero.	2022-09-08 09:17:48 -05:00
Michael Schurter	7ff0290f8b	docs: add quota panic fix changelog entry (#14485 ) See https://github.com/hashicorp/nomad-enterprise/pull/839 for original (Enterprise only)	2022-09-07 17:04:46 -07:00
Phil Renaud	52bb5de25a	Changelog added and unused tests removed	2022-09-07 10:31:39 -04:00
Luiz Aoqui	358ba279d0	ui: remove extra space in menu footer (#14457 )	2022-09-06 16:53:17 -04:00
James Rasell	813c5daa96	hcl2: add strlen function and update docs. (#14463 )	2022-09-06 18:42:40 +02:00
Tim Gross	6ff59e71a5	cli: remove network from `quota status` output (#14468 ) Network quotas were removed in Nomad 1.0.4. Remove the fields no longer in use from the `quota status` output.	2022-09-06 09:37:16 -04:00
Kellen Fox	5086368a1e	Add a log line to help track node eligibility (#14125 ) Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-09-06 14:03:33 +02:00
Yan	6e927fa125	warn destructive update only when count > 1 (#13103 )	2022-09-02 15:30:06 -04:00
Giovani Avelar	b5cf358212	[ui] Show a different message when there are no tasks in a job (#14071 ) Different message when there are no tasks in a job	2022-09-02 15:20:45 -04:00
Tiernan	98022376be	Fix error handling in Client consulDiscoveryImpl (#14431 ) Added a missing `continue` on non-nil error to avoid accidentally using a bad peer.	2022-09-02 15:13:03 -04:00
Luiz Aoqui	1ae26981a0	connect: interpolate task env in config values (#14445 ) When configuring Consul Service Mesh, it's sometimes necessary to provide dynamic values that are only known to Nomad at runtime. By interpolating configuration values (in addition to configuration keys), users are able to pass these dynamic values to Consul from their Nomad jobs.	2022-09-02 15:00:28 -04:00
Tim Gross	7921f044e5	migrate autopilot implementation to raft-autopilot (#14441 ) Nomad's original autopilot was importing from a private package in Consul. It has been moved out to a shared library. Switch Nomad to use this library so that we can eliminate the import of Consul, which is necessary to build Nomad ENT with the current version of the Consul SDK. This also will let us pick up autopilot improvements shared with Consul more easily.	2022-09-01 14:27:10 -04:00
Luiz Aoqui	94d7dddccd	cli: set -hcl2-strict to false if -hcl1 is defined (#14426 ) These options are mutually exclusive but, since `-hcl2-strict` defaults to `true`, users had to explicitly set it to `false` when using `-hcl1`. Also return `255` when job plan fails validation as this is the expected code in this situation.	2022-09-01 10:42:08 -04:00

835 Commits