open-nomad

Author	SHA1	Message	Date
Tim Gross	c721ce618e	keyring: filter by region before checking version (#14901 ) In #14821 we fixed a panic that can happen if a leadership election happens in the middle of an upgrade. That fix checks that all servers are at the minimum version before initializing the keyring (which blocks evaluation processing during the upgrade). But the check we implemented is over the serf membership, which includes servers in any federated regions, which don't necessarily have the same upgrade cycle. Filter the version check by the leader's region. Also bump up log levels of major keyring operations	2022-10-17 13:21:16 -04:00
Tim Gross	bcd26f8815	docker_logger: reorder imports to save memory (#14875 ) Nomad runs one logmon process and also one docker_logger process for each running allocation. A naive look at memory usage shows 10-30 MB of RSS, but a closer look shows that most of this memory (ex. all but ~2MB for logmon) is shared (`Shared_Clean` in Linux pmap). But a heap dump of docker_logger shows that it currently has an extra ~2500 KiB of heap (anonymously-mapped unshared memory) used for init blocks coming from the agent code (ex. mostly regexes from go-version, structs, and the Consul SDK). The packages for running logmon, docker_logger, and executor have an init block that parses `os.Args` to drop into their own logic, which prevents them from loading all the rest of the agent code and saves on memory, so this was unexpected. It looks like we accidentally reordered the imports in main to undo some of the work originally done in 404d2d4c98f1df930be1ae9852fe6e6ae8c1517e. This changeset restores the ordering. A follow-up heap dump shows this saves ~2MB of unshared RSS per docker_logger process.	2022-10-11 13:23:03 -04:00
Seth Hoenig	1593963cd1	servicedisco: implicit constraint for nomad v1.4 when using nsd checks (#14868 ) This PR adds a jobspec mutator to constrain jobs making use of checks in the nomad service provider to nomad clients of at least v1.4.0. Before, in a mixed client version cluster it was possible to submit an NSD job making use of checks and for that job to land on an older, incompatible client node. Closes #14862	2022-10-11 08:21:42 -05:00
Seth Hoenig	69ced2a2bd	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Phil Renaud	e771b94164	[ui] Makes service tags wrap and look like tag items (#14834 ) * Makes service tags wrap and look like tag items * Add a little vertical spacing and changelog * Put client before tags * Force tags list to new line	2022-10-07 09:23:52 -04:00
Damian Czaja	95f969c4bf	cli: add `nomad fmt` (#14779 )	2022-10-06 17:00:29 -04:00
Phil Renaud	4b93a30225	[ui] Line charts: explicitly update X-axis whenever xScale changes (#14814 ) * Explicitly update X-axis whenever xScale changes * Changelog	2022-10-06 16:59:16 -04:00
Hemanth Krishna	e516fc266f	enhancement: UpdateTask when Task is waiting for ShutdownDelay (#14775 ) Signed-off-by: Hemanth Krishna <hkpdev008@gmail.com>	2022-10-06 16:33:28 -04:00
Will Jordan	8ae13208c9	Allow jobs not requiring any network resources (#14300 ) Jobs not requiring any network resources should be allowed even when the network fingerprinter is disabled.	2022-10-06 16:25:41 -04:00
Gabriel Villalonga Simon	b974c32ba6	Check that JobPlanResponse Diff Type is None before checking for changes on getExitCode (#14492 )	2022-10-06 16:23:22 -04:00
Pablo Ruiz García	40416be7b1	Invoke FingerprintManager's Reload() func during agent's SIGHUP (#14615 ) Fixes #14614	2022-10-06 16:22:59 -04:00
Giovani Avelar	a625de2062	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Tim Gross	80ec5e1346	fix panic from keyring raft entries being written during upgrade (#14821 ) During an upgrade to Nomad 1.4.0, if a server running 1.4.0 becomes the leader before one of the 1.3.x servers, the old server will crash because the keyring is initialized and writes a raft entry. Wait until all members are on a version that supports the keyring before initializing it.	2022-10-06 12:47:02 -04:00
Luiz Aoqui	b924802958	template: apply splay value on change_mode script (#14749 ) Previously, the splay timeout was only applied if a template re-render caused a restart or a signal action. The `change_mode = "script"` was running after the `if restart || len(signals) != 0` check, so it was invoked at all times. This change refactors the logic so it's easier to notice that new `change_mode` options should start only after `splay` is applied.	2022-09-30 12:04:22 -04:00
Seth Hoenig	c68ed3b4c8	client: protect user lookups with global lock (#14742 ) * client: protect user lookups with global lock This PR updates Nomad client to always do user lookups while holding a global process lock. This is to prevent concurrency unsafe implementations of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo). * cl: add cl	2022-09-29 09:30:13 -05:00
Derek Strickland	4c73a3b1dc	Remove changelog entry for test update PR	2022-09-27 18:17:49 -04:00
Derek Strickland	52e4997ace	Add enterprise tag	2022-09-27 17:50:25 -04:00
Derek Strickland	ef0f8c5b81	Add enterprise tag	2022-09-27 17:49:27 -04:00
Derek Strickland	6738684167	Delete 14665.txt	2022-09-27 17:47:35 -04:00
Derek Strickland	87bdb74221	Remove bug fix changelog files	2022-09-27 17:46:32 -04:00
Derek Strickland	cacf4bb8e1	Fix changelog entry type	2022-09-27 14:33:39 -04:00
Jim Razmus II	7da3fd050b	jobspec: allow artifact headers in HCLv1 (#14637 ) * jobspec: allow artifact headers in HCLv1 Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-27 12:18:49 -04:00
Seth Hoenig	5df5e70542	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in scheduler/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Tim Gross	87681fca68	CSI: ensure initial unpublish state is checkpointed (#14675 ) A test flake revealed a bug in the CSI unpublish workflow, where an unpublish that comes from a client that's successfully done the node-unpublish step will not have the claim checkpointed if the controller-unpublish step fails. This will result in a delay in releasing the volume claim until the next GC. This changeset also ensures we're using a new snapshot after each write to raft, and fixes two timing issues in test where either the volume watcher can unpublish before the unpublish RPC is sent or we don't wait long enough in resource-restricted environments like GHA.	2022-09-27 08:43:45 -04:00
Michael Schurter	e6af1c0a14	fingerprint: add node attr for reservable cores (#14694 ) * fingerprint: add node attr for reservable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
Luiz Aoqui	5c100c0d3d	client: recover from getter panics (#14696 ) The artifact getter uses the go-getter library to fetch files from different sources. Any bug in this library that results in a panic can cause the entire Nomad client to crash due to a single file download attempt. This change aims to guard against these types of crashes by recovering from panics when the getter attempts to download an artifact. The resulting panic is converted to an error that is stored as a task event for operator visibility and the panic stack trace is logged to the client's log.	2022-09-26 15:16:26 -04:00
Luiz Aoqui	f7c6534a79	cli: set content length on `operator api` requests (#14634 ) http.NewRequestWithContext will only set the right value for Content-Length if the input is bytes.Buffer, bytes.Reader, or *strings.Reader [0]. Since os.Stdin is an os.File, POST requests made with the `nomad operator api` command would always have Content-Length set to -1, which is interpreted as an unknown length by web servers. [0]: https://pkg.go.dev/net/http#NewRequestWithContext	2022-09-26 14:21:40 -04:00
Phil Renaud	497bd02169	[ui] Warn users when they leave an edited but unsaved variable page (#14665 ) * Warning on attempt to leave * Lintfix * Only router.off once * Dont warn on transition when only updating queryparams * Remove double-push and queryparam-only issues, thanks @lgfa29 * Acceptance tests * Changelog	2022-09-23 16:53:40 -04:00
Phil Renaud	a28e1bcc1e	[ui] Service Healthchecks: styles for pseudo-timestamp axis (#14677 ) * Styles for pseudo-timestamp axis * Changelog	2022-09-23 16:53:28 -04:00
Tim Gross	17aee4d69c	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Derek Strickland	6874997f91	scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments (#14659 ) * scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-22 14:31:27 -04:00
Jorge Marey	92158a1c62	connect: add nomad env to envoy bootstrap (#12959 ) * Add nomad env to envoy bootstrap * Add changelog file	2022-09-22 13:18:18 -05:00
Phil Renaud	eca0e7bf56	[ui] task logs in sidebar (#14612 ) * button styles * Further styles including global toggle adjustment * sidebar funcs and header * Functioning task logs in high-level sidebars * same-lineify the show tasks toggle * Changelog * Full-height sidebar calc in css, plz drop soon container queries * Active status and query params for allocations page * Reactive shouldShowLogs getter and added to client and task group pages * Higher order func passing, thanks @DingoEatingFuzz * Non-service job types get allocation params passed * Keyframe animation for task log sidebar * Acceptance test * A few more sub-row tests * Lintfix	2022-09-22 10:58:52 -04:00
Tim Gross	c29c4bd66c	cli: remove deprecated `eval status -json` list behavior (#14651 ) In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and deprecated the usage of `eval status` without an evaluation ID with an upgrade note that it would be removed in Nomad 1.4.0. This changeset completes that work.	2022-09-22 10:56:32 -04:00
Jorge Marey	584ddfe859	Add Namespace, Job and Group to envoy stats (#14311 )	2022-09-22 10:38:21 -04:00
Tim Gross	d327a68696	operator debug: write NDJSON for large collections (#14610 ) The `operator debug` command writes JSON files from API responses as a single line containing an array of JSON objects. But some of these files can be extremely large (GB's) for large production clusters, which makes it difficult to parse them using typical line-oriented Unix command line tools that can stream their inputs without consuming a lot of memory. For collections that are typically large, instead emit newline-delimited JSON. This changeset includes some first-pass refactoring of this command. It breaks up monolithic methods that validate a path, create a file, serialize objects, and write them to disk into smaller functions, some of which can now be standalone to take advantage of generics.	2022-09-22 10:02:00 -04:00
James Rasell	a25028c412	cli: fix a bug in operator API when setting HTTPS via address. (#14635 ) Operators may have a setup whereby the TLS config comes from a source other than setting Nomad specific env vars. In this case, we should attempt to identify the scheme using the config setting as a fallback.	2022-09-22 15:43:58 +02:00
Luiz Aoqui	ad48401219	chore: move changelog file to the right folder (#14639 )	2022-09-21 13:50:22 -04:00
Tim Gross	38a6e7e343	remove 1.4.0 changelog entry that refers to bugfix on new code (#14611 ) Bug fixes on new features in Nomad 1.4.0 don't need or want changelog entries in the same changelog the feature appeared, so remove this one.	2022-09-16 16:14:02 -04:00
Phil Renaud	d6c9676252	Added task links to various alloc tables (#14592 ) * Added task links to various alloc tables * Lintfix * Border collapse and added to task group page * Logs icon temporarily removed and localStorage added * Mock task added to test * Delog * Two asserts in new test * Remove commented-out code * Changelog * Removing args.allocation deps	2022-09-16 15:58:22 -04:00
Phil Renaud	cebfbb0c28	Stabilizing percy snapshots with faker (#14551 ) * First attempt at stabilizing percy snapshots with faker * Tokens seed moved to before management token generation * Faker seed only in token test * moving seed after storage clear * And again, but back to no faker seeding * Isolated seed and temporary log * Setting seed(1) wherever we're snapshotting, or before establishing cluster scenarios * Deliberate noop to see if percy is stable * Changelog entry	2022-09-14 11:27:48 -04:00
Mahmood Ali	a9d5e4c510	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Tim Gross	eb757606f3	changelog entry for variables (#14509 )	2022-09-13 10:25:26 -04:00
Derek Strickland	5ca934015b	job_endpoint: check spec for all regions (#14519 ) * job_endpoint: check spec for all regions	2022-09-12 09:24:26 -04:00
James Rasell	009948186b	changelog: add entry for #14320 (#14518 )	2022-09-09 17:25:50 +02:00
James Rasell	f51a8c73e6	deps: update armon/go-metrics to v0.4.1 (#14493 )	2022-09-09 09:20:55 +02:00
Charlie Voiselle	e58998e218	Add client scheduling eligibility to heartbeat (#14483 )	2022-09-08 14:31:36 -04:00
Tim Gross	3fc7482ecd	CSI: failed allocation should not block its own controller unpublish (#14484 ) A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC. The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check has a bug where the allocation belonging to the claim being freed was included in the check incorrectly. During a normal allocation stop for job stop or a new version of the job, the allocation is terminal. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server. For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. Include a test case covering this path. Also includes two minor extra bugfixes: * Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps. * The list stub object for volumes in `nomad/structs` did not match the stub object in `api`. The `api` package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.	2022-09-08 13:30:05 -04:00
Seth Hoenig	a608e7950e	helper: guard against negative inputs into random stagger This PR modifies RandomStagger to protect against negative input values. If the given interval is negative, the value returned will be somewhere in the stratosphere. Instead, treat negative inputs like zero, returning zero.	2022-09-08 09:17:48 -05:00
Michael Schurter	7ff0290f8b	docs: add quota panic fix changelog entry (#14485 ) See https://github.com/hashicorp/nomad-enterprise/pull/839 for original (Enterprise only)	2022-09-07 17:04:46 -07:00
Phil Renaud	52bb5de25a	Changelog added and unused tests removed	2022-09-07 10:31:39 -04:00
Luiz Aoqui	358ba279d0	ui: remove extra space in menu footer (#14457 )	2022-09-06 16:53:17 -04:00
James Rasell	813c5daa96	hcl2: add strlen function and update docs. (#14463 )	2022-09-06 18:42:40 +02:00
Tim Gross	6ff59e71a5	cli: remove network from `quota status` output (#14468 ) Network quotas were removed in Nomad 1.0.4. Remove the fields no longer in use from the `quota status` output.	2022-09-06 09:37:16 -04:00
Kellen Fox	5086368a1e	Add a log line to help track node eligibility (#14125 ) Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-09-06 14:03:33 +02:00
Yan	6e927fa125	warn destructive update only when count > 1 (#13103 )	2022-09-02 15:30:06 -04:00
Giovani Avelar	b5cf358212	[ui] Show a different message when there are no tasks in a job (#14071 ) Different message when there are no tasks in a job	2022-09-02 15:20:45 -04:00
Tiernan	98022376be	Fix error handling in Client consulDiscoveryImpl (#14431 ) Added a missing `continue` on non-nil error to avoid accidentally using a bad peer.	2022-09-02 15:13:03 -04:00
Luiz Aoqui	1ae26981a0	connect: interpolate task env in config values (#14445 ) When configuring Consul Service Mesh, it's sometimes necessary to provide dynamic values that are only known to Nomad at runtime. By interpolating configuration values (in addition to configuration keys), users are able to pass these dynamic values to Consul from their Nomad jobs.	2022-09-02 15:00:28 -04:00
Tim Gross	7921f044e5	migrate autopilot implementation to raft-autopilot (#14441 ) Nomad's original autopilot was importing from a private package in Consul. It has been moved out to a shared library. Switch Nomad to use this library so that we can eliminate the import of Consul, which is necessary to build Nomad ENT with the current version of the Consul SDK. This also will let us pick up autopilot improvements shared with Consul more easily.	2022-09-01 14:27:10 -04:00
Luiz Aoqui	94d7dddccd	cli: set -hcl2-strict to false if -hcl1 is defined (#14426 ) These options are mutually exclusive but, since `-hcl2-strict` defaults to `true`, users had to explicitly set it to `false` when using `-hcl1`. Also return `255` when job plan fails validation as this is the expected code in this situation.	2022-09-01 10:42:08 -04:00
Luiz Aoqui	19de803503	cli: ignore VaultToken when generating job diff (#14424 )	2022-09-01 10:01:53 -04:00
dependabot[bot]	9f8a3824c4	build(deps): bump github.com/hashicorp/go-version from 1.4.0 to 1.6.0 (#14364 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-09-01 11:55:42 +02:00
Luiz Aoqui	6f5d3e724f	changelog: add entry for #14374 (#14419 )	2022-08-31 10:59:19 -04:00
Luiz Aoqui	27b253bc6e	changelog: add entry for #14381 (#14416 )	2022-08-31 10:41:48 -04:00
Seth Hoenig	5d5c8af930	cgroups: refactor v2 kill path to use cgroups.kill interface file This PR refactors the cgroups v2 group kill code path to use the cgroups.kill interface file for destroying the cgroup. Previously we copied the freeze + sigkill + unfreeze pattern from the v1 code, but v2 provides a more efficient and more race-free way to handle this. Closes #14371	2022-08-29 14:55:13 -05:00
Michael Schurter	dbffe22465	consul: allow stale namespace results (#12953 ) Nomad reconciles services it expects to be registered in Consul with what is actually registered in the local Consul agent. This is necessary to prevent leaking service registrations if Nomad crashes at certain points (or if there are bugs). When Consul has namespaces enabled, we must iterate over each available namespace to be sure no services were leaked into non-default namespaces. Since this reconciliation happens often, there's no need to require results from the Consul leader server. In large clusters this creates far more load than the "freshness" of the response is worth. Therefore this patch switches the request to AllowStale=true	2022-08-26 16:05:12 -07:00
Vladimir Sokolov	b646810401	cli: force periodic job if its id equals search prefix	2022-08-26 10:54:37 -04:00
dependabot[bot]	6d3389653b	build(deps): bump github.com/shirou/gopsutil/v3 from 3.21.12 to 3.22.7 (#14209 ) * build(deps): bump github.com/shirou/gopsutil/v3 from 3.21.12 to 3.22.7 Bumps [github.com/shirou/gopsutil/v3](https://github.com/shirou/gopsutil) from 3.21.12 to 3.22.7. - [Release notes](https://github.com/shirou/gopsutil/releases) - [Commits](https://github.com/shirou/gopsutil/compare/v3.21.12...v3.22.7) --- updated-dependencies: - dependency-name: github.com/shirou/gopsutil/v3 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * changelog entry Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-08-25 14:15:41 -04:00
Seth Hoenig	51384dd63f	client: refactor cpuset manager initialization This PR refactors the code path in Client startup for setting up the cpuset cgroup manager (non-linux systems not affected). Before, there was a logic bug where we would try to read the cpuset.cpus.effective cgroup interface file before ensuring nomad's parent cgroup existed. Therefore that file would not exist, and the list of usable cpus would be empty. Tasks started thereafter would not have a value set for their cpuset.cpus. The refactoring fixes some less than ideal coding style. Instead we now bootstrap each cpuset manager type (v1/v2) within its own constructor. If something goes awry during bootstrap (e.g. cgroups not enabled), the constructor returns the noop implementation and logs a warning. Fixes #14229	2022-08-25 11:18:43 -05:00
Luiz Aoqui	31ab7964bd	ui: task lifecycle restart all tasks (#14223 ) Now that tasks that have finished running can be restarted, the UI needs to use the actual task state to determine which CSS class to use when rendering the task lifecycle chart element.	2022-08-24 18:43:44 -04:00
Luiz Aoqui	e012d9411e	Task lifecycle restart (#14127 ) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpart, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. 
* taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be cleared cleanly using `clearDriverHandle`; otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loop starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting a taskRunner into two methods: - `Restart` will retain the current behaviour and will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Tim Gross	c732b215f0	vault: detect namespace change in config reload (#14298 ) The `namespace` field was not included in the equality check between old and new Vault configurations, which meant that a Vault config change that only changed the namespace would not be detected as a change and the clients would not be reloaded. Also, the comparison for boolean fields such as `enabled` and `allow_unauthenticated` was on the pointer and not the value of that pointer, which results in spurious reloads in real config reload that is easily missed in typical test scenarios. Includes a minor refactor of the order of fields for `Copy` and `Merge` to match the struct fields in hopes it makes it harder to make this mistake in the future, as well as additional test coverage.	2022-08-24 17:03:29 -04:00
Seth Hoenig	423ea1a5c4	client/logmon: acquire executable in init block This PR causes the logmon task runner to acquire the binary of the Nomad executable in an 'init' block, so as to almost certainly get the name while the nomad file still exists. This is an attempt at fixing the case where a deleted Nomad file (e.g. during upgrade) may be getting renamed with a mysterious suffix first. If this doesn't work, as a last resort we can literally just trim the mystery string. Fixes: #14079	2022-08-24 13:17:20 -05:00
Piotr Kazmierczak	7077d1f9aa	template: custom change_mode scripts (#13972 ) This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707	2022-08-24 17:43:01 +02:00
Luiz Aoqui	848f2dcc22	changelog: update #14212 to breaking-change (#14292 )	2022-08-24 11:36:53 -04:00
Piotr Kazmierczak	077b6e7098	docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212 ) This PR documents a change made in the enterprise version of nomad that addresses the following issue: When a user tries to filter audit logs, they do so with a stanza that looks like the following: audit { enabled = true filter "remove deletes" { type = "HTTPEvent" endpoints = ["*"] stages = ["OperationComplete"] operations = ["DELETE"] } } When specifying both an "endpoint" and a "stage", the events with both a matching "endpoint" AND a matching "stage" will be filtered. When specifying both an "endpoint" and an "operation", the events with both a matching "endpoint" AND a matching "operation" will be filtered. When specifying both a "stage" and an "operation", the events with a matching "stage" OR a matching "operation" will be filtered. The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".	2022-08-24 16:31:49 +02:00
Seth Hoenig	cfe9db0f66	build: set osusergo build tag by default (#14248 ) This PR activates the osusergo build tag in GNUMakefile. This forces the os/user package to be compiled without CGO. Doing so seems to resolve a race condition in getpwnam_r that causes alloc creation to hang or panic on `user.Lookup("nobody")`.	2022-08-24 08:11:56 -05:00
Luiz Aoqui	af5c01a070	ui: use task state to determine if task is active (#14224 ) The current implementation uses the task's finishedAt field to determine if a task is active or not, but this check is not accurate. A task in the "pending" state will not have a finishedAt value but it's also not active. This discrepancy results in some components, like the inline stats chart of the task row component, being displayed even when they shouldn't.	2022-08-23 15:50:40 -04:00
Tim Gross	bf57d76ec7	allow ACL policies to be associated with workload identity (#14140 ) The original design for workload identities and ACLs allows for operators to extend the automatic capabilities of a workload by using a specially-named policy. This has been shown to be potentially unsafe because of naming collisions, so instead we'll allow operators to explicitly attach a policy to a workload identity. This changeset adds workload identity fields to ACL policy objects and threads that all the way down to the command line. It also adds a new secondary index to the ACL policy table on namespace and job so that claim resolution can efficiently query for related policies.	2022-08-22 16:41:21 -04:00
Luiz Aoqui	dbffdca92e	template: use pointer values for gid and uid (#14203 ) When a Nomad agent starts and loads jobs that already existed in the cluster, the default template uid and gid were being set to 0, since this is the zero value for int. This caused these jobs to fail in environments where it was not possible to use 0, such as in Windows clients. In order to differentiate between an explicit 0 and a template where these properties were not set we need to use a pointer.	2022-08-22 16:25:49 -04:00
Phil Renaud	fcf2c40c60	[ui] Allocation route services table: show task-level services (#14199 ) Adds service fragments to allocations and union taskGroup and task services	2022-08-22 11:45:12 -04:00
Derek Strickland	8dba52cee2	sentinel: add support for Nomad ACL Token and Namespace (#14171 ) * sentinel: add ability to reference Nomad ACL Token and Namespace in Sentinel policies	2022-08-18 16:33:00 -04:00
Phil Renaud	cbd4deedf8	[ui] general keyboard navigation: 1.3.4 release (#14138 ) * Initialized keyboard service Neat but funky: dynamic subnav traversal 👻 generalized traverseSubnav method Shift as special modifier key Nice little demo panel Keyboard shortcuts keycard Some animation styles on keyboard shortcuts Handle situations where a link is deeply nested from its parent menu item Keyboard service cleanup helper-based initializer and teardown for new contextual commands Keyboard shortcuts modal component added and demo-ghost removed Removed j and k from subnav traversal Register and unregister methods for subnav plus new subnavs for volumes and volume register main nav method Generalizing the register nav method 12762 table keynav (#12975) * Experimental feature: shortcut visual hints * Long way around to a custom modifier for keyboard shortcuts * dynamic table and list iterative shortcuts * Progress with regular old tether * Delogging * Table Keynav tether fix, server and client navs, and fix to shiftless on modified arrow keys Go to Optimize keyboard link and storage key changed to g r parameterized jobs keyboard nav Dynamic numeric keynav for multiple tables (#13482) * Multiple tables init * URL-bind enumerable keyboard commands and add to more taskRow and allocationRows * Type safety and lint fixes * Consolidated push to keyCommands * Default value when removing keyCommands * Remove the URL-based removal method and perform a recompute on any add Get tests passing in Keynav: remove math helpers and a few other defensive moves (#13761) * Remove ember math helpers * Test fixes for jobparts/body * Kill an unneeded integration helper test * delog * Trying if disabling percy lets this finish * Okay so its not percy; try parallelism in circle * Percyless yet again * Trying a different angle to not have percy * Upgrade percy to 1.6.1 [ui] Keyboard nav: "u" key to go up a level (#13754) * U to go up a level * Mislabelled my conditional * Custom lint ignore rule * 
Custom lint ignore rule, this time with commas * Since we're getting rid of ember math helpers elsewhere, do the math ourselves here Replace ArrowLeft etc. with an ascii arrow (#13776) * Replace ArrowLeft etc. with an ascii arrow * non-mutative helper cleanup Keyboard Nav: let users rebind their shortcuts (#13781) * click-outside and shortcuts enabled/disabled toggle * Trap focus when modal open * Enabled/disabled saved to localStorage * Autofocus edit button on variable index * Modal overflow styles * Functional rebind * Saving rebinds to localStorage for all majors * Started on defaultCommandBindings * Modal header style and cancel rebind on escape * keyboardable keybindings w buttons instead of spans * recording and defaultvalues * Enter short-circuits rebind * Only some commands are rebindable, and dont show dupes * No unused get import * More visually distinct header on modal * Disallowed keys for rebind, showing buffer as you type, and moving dedupe to modal logic willDestroy hook to prevent tests from doubling/tripling up addEventListener on kb events remove unused tests Keyboard Navigation acceptance tests (#13893) * Acceptance tests for keyboard modal * a11y audit fix and localStorage clear * Bind/rebind/localStorage tests * Keyboard tests for dynamic nav and tables * Rebinder and assert expectation * Second percy snapshot showing hints no longer relevant Weird issue where linktos with query props specifically from the task-groups page would fail to route / hit undefined.shouldSuperCede errors Adds the concept of exclusivity to a keycommand, removing peers that also share its label Lintfix Changelog and PR feedback Changelog and PR feedback Fix to rebinding in firefox by blurring the now-disabled button on rebind (#14053) * Secure Variables shortcuts removed * Variable index route autofocus removed * Updated changelog entry * Updated changelog entry * Keynav docs (#14148) * Section added to the API Docs UI page * Added a note about disabling * Prev and 
Next order * Remove dev log and unneeded comments	2022-08-17 12:59:33 -04:00
James Rasell	fbc9f8b66c	changelog: add missing entry for #13539 (#14129 )	2022-08-17 09:26:45 +02:00
Seth Hoenig	0a6497ee1f	api: trim space of error response output	2022-08-16 15:00:38 -05:00
Seth Hoenig	91e32eec9b	build: update to go1.19	2022-08-16 08:40:57 -05:00
Jai	81cac313c5	refact: add parent check to boolean (#14115 ) * refact: add parent check to boolean * chore: add changelog entry	2022-08-15 13:42:08 -04:00
Seth Hoenig	4f72b0ed72	deps: add cl for fsouza/go-dockerclient	2022-08-15 09:59:49 -05:00
Seth Hoenig	077f46c74a	Merge pull request #14025 from hashicorp/dependabot/go_modules/go.etcd.io/bbolt-1.3.6 build(deps): bump go.etcd.io/bbolt from 1.3.5 to 1.3.6	2022-08-15 09:13:48 -05:00
Seth Hoenig	8a377ece7e	deps: update cl for go.etcd.io/bbolt	2022-08-15 09:13:16 -05:00
Seth Hoenig	30d0e55ebb	deps: update cl for grpc	2022-08-15 09:10:13 -05:00
Seth Hoenig	394aebfbd9	Merge pull request #14088 from hashicorp/b-plan-vault-token cli: support vault token in plan command	2022-08-12 09:05:34 -05:00
Seth Hoenig	a939245a27	docs: tweak changelog Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2022-08-12 08:59:58 -05:00
Seth Hoenig	dc761aa7ec	docker: create a docker task config setting for disable built-in healthcheck This PR adds a docker driver task configuration setting for turning off built-in HEALTHCHECK of a container. References) https://docs.docker.com/engine/reference/builder/#healthcheck https://github.com/docker/engine-api/blob/master/types/container/config.go#L16 Closes #5310 Closes #14068	2022-08-11 10:33:48 -05:00
Seth Hoenig	ba5c45ab93	cli: respect vault token in plan command This PR fixes a regression where the 'job plan' command would not respect a Vault token if set via --vault-token or $VAULT_TOKEN. Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062 Fixes https://github.com/hashicorp/nomad/issues/13939	2022-08-11 08:54:08 -05:00
Seth Hoenig	1901cfaba8	Merge pull request #14069 from brian-athinkingape/cli-fix-memstats-cgroupsv2 cli: for systems with cgroups v2, fix allocation resource utilization showing 0 memory used	2022-08-11 07:27:48 -05:00
Seth Hoenig	3aaaedf52e	cli: forward request for job validation to nomad leader This PR changes the behavior of 'nomad job validate' to forward the request to the nomad leader, rather than responding from any server. This is because we need the leader when validating Vault tokens, since the leader is the only server with an active vault client.	2022-08-10 14:34:04 -05:00
Brian Chau	63b60ced2a	Add changelog 14069	2022-08-09 14:16:34 -07:00
Luiz Aoqui	9affe31a0f	qemu: reduce monitor socket path (#13971 ) The QEMU driver can take an optional `graceful_shutdown` configuration which will create a Unix socket to send ACPI shutdown signal to the VM. Unix sockets have a hard length limit and the driver implementation assumed that QEMU versions 2.10.1 were able to handle longer paths. This is not correct: the linked QEMU fix only changed the behaviour from silently truncating longer socket paths to throwing an error. By validating the socket path before starting the QEMU machine we can provide users a more actionable and meaningful error message, and by using a shorter socket file name we leave a bit more room for user-defined values in the path, such as the task name. The maximum length allowed is also platform-dependent, so validation needs to be different for each OS.	2022-08-04 12:10:35 -04:00
Charles Z	7a8ec90fbe	allow unhealthy canaries without blocking autopromote (#14001 )	2022-08-04 11:53:50 -04:00
Luiz Aoqui	2c0fea64e9	qemu: restore monitor socket path (#14000 ) When a QEMU task is recovered the monitor socket path was not being restored into the task handler, so the `graceful_shutdown` configuration was effectively ignored if the client restarted.	2022-08-04 10:44:08 -04:00
Derek Strickland	77df9c133b	Add Nomad RetryConfig to agent template config (#13907 ) * add Nomad RetryConfig to agent template config	2022-08-03 16:56:30 -04:00
Phil Renaud	e58a95ed2f	New variable creation adds the first namespace in your available list at variable creation time (#13991 ) * New variable creation adds the first namespace in your available list at variable creation time * Changelog	2022-08-03 15:09:25 -04:00
Seth Hoenig	e2309754de	cl: update cl for 13670	2022-08-03 13:18:09 -05:00
Piotr Kazmierczak	530280505f	client: enable specifying user/group permissions in the template stanza (#13755 ) * Adds Uid/Gid parameters to template. * Updated diff_test * fixed order * update jobspec and api * removed obsolete code * helper functions for jobspec parse test * updated documentation * adjusted API jobs test. * propagate uid/gid setting to job_endpoint * adjusted job_endpoint tests * making uid/gid into pointers * refactor * updated documentation * updated documentation * Update client/allocrunner/taskrunner/template/template_test.go Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * Update website/content/api-docs/json-jobs.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * propagating documentation change from Luiz * formatting * changelog entry * changed changelog entry Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-08-02 22:15:38 +02:00
Eric Weber	cbce13c1ac	Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919 ) * Allow specification of CSI staging and publishing directory path * Add website documentation for stage_publish_dir * Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir * Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir	2022-08-02 09:42:44 -04:00
Luiz Aoqui	6c31a51919	changelog: add entry for #13865 and #13866 (#13901 )	2022-07-22 15:19:33 -04:00
Seth Hoenig	2f20a75d38	cl: add cl about removing lib/darwin library	2022-07-22 14:02:58 -05:00
Tim Gross	c7a11a86c6	block deleting namespaces if the namespace contains a volume (#13880 ) When we delete a namespace, we check to ensure that there are no non-terminal jobs, which effectively covers evals, allocs, etc. CSI volumes are also namespaced, so extend this check to cover CSI volumes.	2022-07-21 16:13:52 -04:00
Seth Hoenig	c61e779b48	Merge pull request #13715 from hashicorp/dev-nsd-checks client: add support for checks in nomad services	2022-07-21 10:22:57 -05:00
Seth Hoenig	606e3ebdd4	client: updates from pr feedback	2022-07-21 09:54:27 -05:00
Seth Hoenig	8e6eeaa37e	Merge pull request #13869 from hashicorp/b-uniq-services-2 servicedisco: ensure service uniqueness in job validation	2022-07-21 08:24:24 -05:00
Will Jordan	5354409b1a	Return 429 response on HTTP max connection limit (#13621 ) Return 429 response on HTTP max connection limit. Instead of silently closing the connection, return a `429 Too Many Requests` HTTP response with a helpful error message to aid debugging when the connection limit is unintentionally reached. Set a 10-millisecond write timeout and rate limiter for connection-limit 429 response to prevent writing the HTTP response from consuming too many server resources. Add `nomad.agent.http.exceeded` metric counting the number of HTTP connections exceeding concurrency limit.	2022-07-20 14:12:21 -04:00
Seth Hoenig	e5978a9cbf	jobspec: ensure service uniqueness in job validation	2022-07-20 12:38:08 -05:00
Tim Gross	cfa2cb140e	fsm: one-time token expiration should be deterministic (#13737 ) When applying a raft log to expire ACL tokens, we need to use a timestamp provided by the leader so that the result is deterministic across servers. Use leader's timestamp from RPC call	2022-07-18 14:19:29 -04:00
Seth Hoenig	c23da281a1	metrics: even classless blocked evals get metrics This PR fixes a bug where blocked evaluations with no class set would not have metrics exported at the dc:class scope. Fixes #13759	2022-07-15 14:12:44 -05:00
Luiz Aoqui	b656981cf0	Track plan rejection history and automatically mark clients as ineligible (#13421 ) Plan rejections occur when the scheduler worker and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. As the final plan reaches the leader plan applier, it may no longer be valid due to a concurrent scheduling taking up intended resources. In these situations the plan applier will notify the worker that the plan was rejected and that they should refresh their state before trying again. In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even if they are always rejected. While the root cause is still unknown this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration. To avoid marking several nodes as ineligible at once, the operation is rate limited to 5 nodes every 30min, with an initial burst of 10 operations.	2022-07-12 18:40:20 -04:00
Michael Schurter	3e50f72fad	core: merge reserved_ports into host_networks (#13651 ) Fixes #13505 This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports). As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility: Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly. Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me. So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.	2022-07-12 14:40:25 -07:00
Phil Renaud	59c12fc758	Remove namespace cache (#13679 )	2022-07-11 18:06:18 -04:00
Phil Renaud	e9219a1ae0	Allow wildcard for Evaluations API (#13530 ) * Failing test and TODO for wildcard * Alias the namespace query parameter for Evals * eval: fix list when using ACLs and * namespace Apply the same verification process as in job, allocs and scaling policy list endpoints to handle the eval list when using an ACL token with limited namespace support but querying using the `*` wildcard namespace. changelog: add entry for #13530 * ui: set namespace when querying eval Evals have a unique UUID as ID, but when querying them the Nomad API still expects a namespace query param, otherwise it assumes `default`. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-07-11 16:42:17 -04:00
Luiz Aoqui	674c0ae08b	changelog: add entry for #13659 (#13691 )	2022-07-11 16:07:33 -04:00
Tim Gross	b6dd1191b2	snapshot restore-from-archive streaming and filtering (#13658 ) Stream snapshot to FSM when restoring from archive The `RestoreFromArchive` helper decompresses the snapshot archive to a temporary file before reading it into the FSM. For large snapshots this performs a lot of disk IO. Stream decompress the snapshot as we read it, without first writing to a temporary file. Add bexpr filters to the `RestoreFromArchive` helper. The operator can pass these as `-filter` arguments to `nomad operator snapshot state` (and other commands in the future) to include only desired data when reading the snapshot.	2022-07-11 10:48:00 -04:00
James Rasell	9eb63c9e03	cli: ensure node status and drain use correct cmd name. (#13656 )	2022-07-11 09:50:42 +02:00
Seth Hoenig	239eaf9a29	Merge pull request #13626 from hashicorp/b-client-max-kill-timeout client: enforce max_kill_timeout client configuration	2022-07-07 13:44:39 -05:00
Luiz Aoqui	85908415f9	state: fix eval list by prefix with * namespace (#13551 )	2022-07-07 14:21:51 -04:00
Luiz Aoqui	03433dd8af	cli: improve output of eval commands (#13581 ) Use the same output format when listing multiple evals in the `eval list` command and when `eval status <prefix>` matches more than one eval. Include the eval namespace in all output formats and always include the job ID in `eval status` since even `node-update` evals are related to a job. Add Node ID to the evals table output to help differentiate `node-update` evals. Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-07-07 13:13:34 -04:00
Ted Behling	6a032a54d2	driver/docker: Don't pull InfraImage if it exists (#13265 ) Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-07-07 17:44:06 +02:00
Michael Schurter	f21272065d	core: emit node evals only for sys jobs in dc (#12955 ) Whenever a node joins the cluster, either for the first time or after being `down`, we emit an evaluation for every system job to ensure all applicable system jobs are running on the node. This patch adds an optimization to skip creating evaluations for system jobs not in the current node's DC. While the scheduler performs the same feasibility check, skipping the creation of the evaluation altogether saves disk, network, and memory.	2022-07-06 14:35:18 -07:00
Seth Hoenig	5dd8aa3e27	client: enforce max_kill_timeout client configuration This PR fixes a bug where client configuration max_kill_timeout was not being enforced. The feature was introduced in 9f44780 but seems to have been removed during the major drivers refactoring. We can make sure the value is enforced by plumbing it through the DriverHandler, which now uses the lesser of the task.killTimeout or client.maxKillTimeout. Also updates Event.SetKillTimeout to require both the task.killTimeout and client.maxKillTimeout so that we don't make the mistake of using the wrong value - as it was being given only the task.killTimeout before.	2022-07-06 15:29:38 -05:00
Luiz Aoqui	a9a66ad018	api: apply new ACL check for wildcard namespace (#13608 ) api: apply new ACL check for wildcard namespace In #13606 the ACL check was refactored to better support the all namespaces wildcard (`*`). This commit applies the changes to the jobs and alloc list endpoints.	2022-07-06 16:17:16 -04:00
Tim Gross	1fc8995590	query for leader in `operator debug` command (#13472 ) The `operator debug` command doesn't output the leader anywhere in the output, which adds extra burden to offline debugging (away from an ongoing incident where you can simply check manually). Query the `/v1/status/leader` API but degrade gracefully.	2022-07-06 10:57:44 -04:00
James Rasell	0c0b028a59	core: allow deleting of evaluations (#13492 ) * core: add eval delete RPC and core functionality. * agent: add eval delete HTTP endpoint. * api: add eval delete API functionality. * cli: add eval delete command. * docs: add eval delete website documentation.	2022-07-06 16:30:11 +02:00
James Rasell	181b247384	core: allow pausing and un-pausing of leader broker routine (#13045 ) * core: allow pause/un-pause of eval broker on region leader. * agent: add ability to pause eval broker via scheduler config. * cli: add operator scheduler commands to interact with config. * api: add ability to pause eval broker via scheduler config * e2e: add operator scheduler test for eval broker pause. * docs: include new operator scheduler CLI and pause eval API info.	2022-07-06 16:13:48 +02:00
Phil Renaud	84a59ff059	[ui] Fix a bug where redirects after planning/editing a job didn't include namespace (#13588 ) * Job editing and planning handles namespace as part of ID instead of queryParam * Changelog added * Tests updated to reflect new namespace redirects	2022-07-05 15:58:56 -04:00
Seth Hoenig	97726c2fd8	Merge pull request #12862 from hashicorp/f-choose-services api: enable selecting subset of services using rendezvous hashing	2022-06-30 15:17:40 -05:00
Seth Hoenig	0048c59f1a	cl: fixup changelog comment Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2022-06-30 15:10:47 -05:00
James Rasell	3ecffaf36b	deps: update `github.com/hashicorp/go-discover` to latest. (#13491 )	2022-06-28 10:28:32 +02:00
James Rasell	d080eed9ae	client: fixed a problem calculating a service namespace. (#13493 ) When calculating a service's namespace for registration, the code assumed the first task within the task array would include a service block. This is incorrect as it is possible only a later task within the array contains a service definition. This change fixes the logic, so we correctly search for a service definition before identifying the namespace.	2022-06-28 09:47:28 +02:00
Seth Hoenig	9467bc9eb3	api: enable selecting subset of services using rendezvous hashing This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint. The value of 'choose' is in the form '<number>|<key>', number is the number of desired services and key is a value unique but consistent to the requester (e.g. allocID). Folks aren't really expected to use this API directly, but rather through consul-template which will soon be getting a new helper function making use of this query parameter. Example, curl 'localhost:4646/v1/service/redis?choose=2|abc123' Note: consul-template v0.29.1 includes the necessary nomadServices functionality.	2022-06-25 10:37:37 -05:00
Phil Renaud	2e6e95e78c	[ui] Reinstate Meta and Payload sections to Parameterized Child Jobs (#13473 ) * Shift meta off job.definition and decodedPayload alias to passed arg * Changelog	2022-06-24 15:03:08 -04:00
Seth Hoenig	b7a8318eac	Merge pull request #13467 from hashicorp/f-purge-raft-v2 core: remove support for raft protocol version 2	2022-06-24 10:10:26 -05:00
Tim Gross	4368dcc02f	fix deadlock in plan_apply (#13407 ) The plan applier has to get a snapshot with a minimum index for the plan it's working on in order to ensure consistency. Under heavy raft loads, we can exceed the timeout. When this happens, we hit a bug where the plan applier blocks waiting on the `indexCh` forever, and all schedulers will block in `Plan.Submit`. Closing the `indexCh` when the `asyncPlanWait` is done with it will prevent the deadlock without impacting correctness of the previous snapshot index. This changeset includes a PoC failing test that works by injecting a large timeout into the state store. We need to turn this into a test we can run normally without breaking the state store before we can merge this PR. Increase `snapshotMinIndex` timeout to 10s. This timeout creates backpressure where any concurrent `Plan.Submit` RPCs will block waiting for results. This sheds load across all servers and gives raft some CPU to catch up, because schedulers won't dequeue more work while waiting. Increase it to 10s based on observations of large production clusters.	2022-06-23 12:06:27 -04:00
Seth Hoenig	91e08d5e23	core: remove support for raft protocol version 2 This PR checks server config for raft_protocol, which must now be set to 3 or unset (0). When unset, version 3 is used as the default.	2022-06-23 14:37:50 +00:00
Derek Strickland	7d6a3df197	csi_hook: valid if any driver supports csi (#13446 ) * csi_hook: valid if any driver supports csi volumes	2022-06-22 10:43:43 -04:00
Derek Strickland	9de4d7367c	cli: fix detach handling (#13405 ) Fix detach handling for: - `deployment fail` - `deployment promote` - `deployment resume` - `deployment unblock` - `job promote`	2022-06-21 06:01:23 -04:00
Jeffrey Clark	a97699221c	cni: add loopback to linux bridge (#13428 ) CNI changed how to bring up the interface in v0.2.0. Support was moved to a new loopback plugin. https://github.com/containernetworking/cni/pull/121 Fixes #10014	2022-06-20 11:22:53 -04:00
James Rasell	f1f7c5040b	api: added sysbatch job type constant to match other schedulers. (#13359 )	2022-06-16 11:53:04 +02:00
Joseph Martin	4aa96d5bfc	Return evalID if `-detach` flag is passed to job revert (#13364 ) * Return evalID if `-detach` flag is passed to job revert	2022-06-15 14:20:29 -04:00
Tim Gross	12d87c040c	fixup changelog entry for backported regression fix (#13370 ) The changelog entry for #13340 indicated it was an improvement. But on discussion, it was determined that this was a workaround for a regression. Update the changelog to make this clear.	2022-06-14 14:33:39 -04:00
Grant Griffiths	99896da443	CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340 ) Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>	2022-06-14 10:04:16 -04:00
Daniel Rossbach	8c52c03c8c	qemu driver: Add option to configure drive_interface (#11864 )	2022-06-10 10:03:51 -04:00
Luiz Aoqui	e8b788b372	changelog: add entry for #12961 (#13318 )	2022-06-10 09:04:00 -04:00
Tim Gross	9d5523a72d	CSI: skip node unpublish on GC'd or down nodes (#13301 ) If the node has been GC'd or is down, we can't send it a node unpublish. The CSI spec requires that we don't send the controller unpublish before the node unpublish, but in the case where a node is gone we can't know the final fate of the node unpublish step. The `csi_hook` on the client will unpublish if the allocation has stopped and if the host is terminated there's no mount for the volume anyways. So we'll now assume that the node has unpublished at its end. If it hasn't, any controller unpublish will potentially hang or error and need to be retried.	2022-06-09 11:33:22 -04:00
phreakocious	94a78597d2	Add `guest_agent` config option for QEMU driver (#12800 ) Add boolean 'guest_agent' config option for QEMU driver, which will create the socket file for the QEMU Guest Agent in the task dir when enabled.	2022-06-09 09:21:38 -04:00
Derek Strickland	13ea5ae87a	consul-template: Add fault tolerant defaults (#13041 ) consul-template: Add fault tolerant defaults Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-06-08 14:08:25 -04:00
Luiz Aoqui	2e0bffba90	changelog: add entry for #12925 (#13250 )	2022-06-08 10:14:33 -04:00
Tim Gross	8ff5ea1bee	CSI: no early return when feasibility check fails on eligible nodes (#13274 ) As a performance optimization in the scheduler, feasibility checks that apply to an entire class are only checked once for all nodes of that class. Other feasibility checks are "available" checks because they rely on more ephemeral characteristics and don't contribute to the hash for the node class. This currently includes only CSI. We have a separate fast path for "available" checks when the node has already been marked eligible on the basis of class. This fast path has a bug where it returns early rather than continuing the loop. This causes the entire task group to be rejected. Fix the bug by not returning early in the fast path and instead jump to the top of the loop like all the other code paths in this method. Includes a new test exercising topology at whole-scheduler level and a fix for an existing test that should've caught this previously.	2022-06-07 13:31:10 -04:00
Derek Strickland	12f3ee46ea	alloc_runner: stop sidecar tasks last (#13055 ) alloc_runner: stop sidecar tasks last	2022-06-07 11:35:19 -04:00
Tim Gross	81c70f4973	changelog entry for #12534 (#13260 )	2022-06-06 16:19:17 -04:00
Conor Evans	86116a7607	add filebase64 function (#11791 ) Signed-off-by: Conor Evans <coevans@tcd.ie>	2022-06-06 11:58:17 -04:00
Lance Haig	4bf27d743d	Allow Operator Generated bootstrap token (#12520 )	2022-06-03 07:37:24 -04:00
Huan Wang	7d15157635	adding support for customized ingress tls (#13184 )	2022-06-02 18:43:58 -04:00
Seth Hoenig	45e8748658	Merge pull request #13205 from hashicorp/b-batch-preempt2 core: reschedule evicted batch job when resources become available	2022-06-02 16:32:01 -05:00
Shantanu Gadgil	6cb8c95534	fingerprint kernel architecture name (#13182 )	2022-06-02 15:51:00 -04:00
Seth Hoenig	0692190e12	core: reschedule evicted batch job when resources become available This PR fixes a bug where an evicted batch job would not be rescheduled once resources become available. Closes #9890	2022-06-02 14:04:13 -05:00
Seth Hoenig	54efec5dfe	docs: add docs and tests for tagged_addresses	2022-05-31 13:02:48 -05:00
Seth Hoenig	4631045d83	connect: enable setting connect upstream destination namespace	2022-05-26 09:39:36 -05:00
Seth Hoenig	f7c0e078a9	build: update golang version to 1.18.2 This PR updates to Go 1.18.2. Also update the versions of hclfmt and go-hclogfmt which includes newer dependencies necessary for dealing with go1.18. The hcl v2 branch is now 'nomad-v2.9.1+tweaks2', to include a fix for newer macOS versions: `8927e75e82`	2022-05-25 10:04:04 -05:00
Luiz Aoqui	769ff1dcc3	Merge pull request #13109 from hashicorp/merge-release-1.3.1-branch Merge release 1.3.1 branch	2022-05-25 10:45:09 -04:00
Seth Hoenig	20b6bf3c22	Merge pull request #13104 from hashicorp/b-blocked-eval-math core: fix blocked eval math	2022-05-24 16:23:06 -05:00
Michael Schurter	2965dc6a1a	artifact: fix numerous go-getter security issues Fix numerous go-getter security issues: - Add timeouts to http, git, and hg operations to prevent DoS - Add size limit to http to prevent resource exhaustion - Disable following symlinks in both artifacts and `job run` - Stop performing initial HEAD request to avoid file corruption on retries and DoS opportunities. Approach Since Nomad has no ability to differentiate a DoS-via-large-artifact vs a legitimate workload, all of the new limits are configurable at the client agent level. The max size of HTTP downloads is also exposed as a node attribute so that if some workloads have large artifacts they can specify a high limit in their jobspecs. In the future all of this plumbing could be extended to enable/disable specific getters or artifact downloading entirely on a per-node basis.	2022-05-24 16:29:39 -04:00
Seth Hoenig	83bab8ed64	Merge pull request #13058 from hashicorp/b-cgroupsv1-docker-cgparent drivers/docker: do not set cgroup parent in v1 mode	2022-05-24 14:07:40 -05:00
Seth Hoenig	c6c3ae020d	drivers/docker: do not set cgroup parent in v1 mode This PR fixes a bug where the CgroupParent on the docker HostConfig struct was accidentally being set when running in cgroups v1 mode.	2022-05-24 11:22:50 -05:00
Seth Hoenig	27d0c0dc9f	docs: add changelog	2022-05-24 09:13:15 -05:00
Will Jordan	d515e5c3b0	Don't buffer json logs on agent startup (#13076 ) There's no reason to buffer json logs on agent startup since logs in this format already aren't reordered.	2022-05-19 15:40:30 -04:00
Seth Hoenig	fc58f4972c	cli: correctly use and validate job with vault token set This PR fixes `job validate` to respect '-vault-token', '$VAULT_TOKEN', '-vault-namespace' if set.	2022-05-19 12:13:34 -05:00
Tim Gross	b72ff42ada	api: include Consul token in job revert API (#13065 )	2022-05-19 11:30:29 -04:00
Seth Hoenig	29d3da6dfd	cl: update changelog	2022-05-17 10:35:08 -05:00
Seth Hoenig	26b5c01431	Merge pull request #12817 from twunderlich-grapl/fix-network-interpolation Fix network.dns interpolation	2022-05-17 09:31:32 -05:00
Seth Hoenig	08becb117c	cl: add changelog note for network interpolation	2022-05-17 09:14:55 -05:00
Phil Renaud	45dc1cfd58	12986 UI fails to load job when there is an "@" in job name in nomad 130 (#13012 ) * LastIndexOf and always append a namespace on job links * Confirmed the volume equivalent and simplified idWIthNamespace logic * Changelog added * PR comments addressed * Drop the redirect for the time being * Tests updated to reflect namespace on links * Task detail test default namespace link for test	2022-05-13 17:01:27 -04:00
Tim Gross	faeb3fcd44	scheduler: volume updates should always be destructive (#13008 )	2022-05-13 11:34:04 -04:00
James Rasell	636b647a30	agent: fix panic when logging about protocol version config use. (#12962 ) The log line comes before the agent logger has been set up; therefore we need to use the UI logging to avoid a panic.	2022-05-13 09:28:43 +02:00
Phil Renaud	dd824ac3f8	Changelog for visual diff tests (#12909 )	2022-05-06 11:29:10 -04:00
Phil Renaud	6a8f98723e	Chronological most-recent evals by default (#12847 ) * Chronological most-recent evals by default * Adding reverse: true to the list of expected queryparams in test * changelog	2022-05-05 16:11:27 -04:00
Jai	316daf581e	fix broken link to `task-group` in `Recent Allocation` table in `jobs.job.index` (#12765 ) * chore: run prettier on hbs files * ui: ensure to pass a real job object to task-group link * chore: add changelog entry * chore: prettify template * ui: template helper for formatting jobId in LinkTo component * ui: handle async relationship * ui: pass in job id to model arg instead of job model * update test for serialized namespace * ui: defend against null in tests * ui: prettified template added whitespace * ui: rollback ember-data to 3.24 because watcher return undefined on abort * ui: use format-job-helper instead of job model via alloc * ui: fix whitespace in template caused by prettier using template helper * ui: update test for new namespace * ui: revert prettier change Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-04-28 14:02:15 -04:00
Dave May	97cf204c00	debug: add version constraint to avoid pprof panic (#12807 )	2022-04-28 13:18:55 -04:00
Tim Gross	c763c4cb96	remove pre-0.9 driver code and related E2E test (#12791 ) This test exercises upgrades between 0.8 and Nomad versions greater than 0.9. We have not supported 0.8.x in a very long time and in any case the test has been marked to skip because the downloader doesn't work.	2022-04-27 09:53:37 -04:00
Michael Schurter	e2544dd089	client: fix waiting on preempted alloc (#12779 ) Fixes #10200 The bug A user reported receiving the following error when an alloc was placed that needed to preempt existing allocs: ``` [ERROR] client.alloc_watcher: error querying previous alloc: alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup failed: index error: UUID must be 36 characters" ``` The previous alloc (8e) was already complete on the client. This is possible if an alloc stops after the scheduling decision was made to preempt it, but before the node running both allocations was able to pull and start the preemptor. While that is hopefully a narrow window of time, you can expect it to occur in high throughput batch scheduling heavy systems. However the RPC error made no sense! `previous_alloc` in the logs was a valid 36 character UUID! The fix The fix is: ``` - prevAllocID: c.Alloc.PreviousAllocation, + prevAllocID: watchedAllocID, ``` The alloc watcher new func used for preemption improperly referenced Alloc.PreviousAllocation instead of the passed in watchedAllocID. When multiple allocs are preempted, a watcher is created for each with watchedAllocID set properly by the caller. In this case Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters` error was coming from! Sadly we were properly referencing watchedAllocID in the log, so it made the error make no sense! The repro I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl) and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl) for ease of repro. First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad), then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad) that evicts 2 low priority jobs. 
Everything worked as expected. However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147), it reproduces the bug because the watcher references PreviousAllocation instead of watchedAllocID.	2022-04-26 13:14:43 -07:00
Michael Schurter	6449ba8d41	api: add ParseHCLOpts helper method (#12777 ) The existing ParseHCL func didn't allow setting HCLv1=true.	2022-04-25 11:51:52 -07:00
Tim Gross	b2e4841747	CSI: plugin config updates should always be destructive (#12774 )	2022-04-25 12:59:25 -04:00
Tim Gross	766025cde7	CSI: plugin supervisor prestart should not mark itself done (#12752 ) The task runner hook `Prestart` response object includes a `Done` field that's intended to tell the client not to run the hook again. The plugin supervisor creates mount points for the task during prestart and saves these mounts in the hook resources. But if a client restarts the hook resources will not be populated. If the plugin task restarts at any time after the client restarts, it will fail to have the correct mounts and crash loop until restart attempts run out. Fix this by not returning `Done` in the response, just as we do for the `volume_mount_hook`.	2022-04-22 13:07:47 -04:00
James Rasell	24b499791d	deps: update consul-template to v0.29.0 (#12747 ) * deps: update consul-template to v0.29.0 * changelog: add entry for #12747	2022-04-22 09:58:54 -07:00
Phil Renaud	ab557b15e0	Adding changelog note (#12753 )	2022-04-22 12:38:49 -04:00
Luiz Aoqui	a8cc633156	vault: revert support for entity aliases (#12723 ) After a more detailed analysis of this feature, the approach taken in PR #12449 was found to be not ideal due to poor UX (users are responsible for setting the entity alias they would like to use) and issues around jobs potentially masquerading as another Vault entity.	2022-04-22 10:46:34 -04:00
Seth Hoenig	3fcac242c6	services: enable setting arbitrary address value in service registrations This PR introduces the `address` field in the `service` block so that Nomad or Consul services can be registered with a custom `.Address.` to advertise. The address can be an IP address or domain name. If the `address` field is set, the `service.address_mode` must be set in `auto` mode.	2022-04-22 09:14:29 -05:00
Michael Schurter	5db3a671db	cli: add -json flag to support job commands (#12591 ) * cli: add -json flag to support job commands While the CLI has always supported running JSON jobs, its support has been via HCLv2's JSON parsing. I have no idea what format it expects the job to be in, but it's absolutely not in the same format as the API expects. So I ignored that and added a new -json flag to explicitly support API style JSON jobspecs. The jobspecs can even have the wrapping {"Job": {...}} envelope or not! * docs: fix example for `nomad job validate` We haven't been able to validate inside driver config stanzas ever since the move to task driver plugins. 😭	2022-04-21 13:20:36 -07:00
Phil Renaud	a5bef3ce72	[ui, bugfix] Link fix for volumes where per_alloc=true (#12713 ) * Allocation page linkfix * fix added to task page and computed prop moved to allocation model * Fallback query added to task group when specific volume isnt knowable * Delog * link text reflects alloc suffix * Helper instead of in-template conditionals * formatVolumeName unit test * Removing unused helper import	2022-04-21 13:57:18 -04:00
James Rasell	716b8e658b	api: Add support for filtering and pagination to the node list endpoint (#12727 )	2022-04-21 17:04:33 +02:00

696 commits