open-nomad

Author	SHA1	Message	Date
Tim Gross	3a811ac5e7	keyring: fixes for keyring replication on cluster join (#14987 ) * keyring: don't unblock early if rate limit burst exceeded The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the burst limit is Inf). Ensure we're not unblocking early, otherwise we'll only slow down the cases where we're already pausing to make external RPC requests. * keyring: set MinQueryIndex on stale queries When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID. Ensure that we're setting the correct reply index in the blocking query. Note that the "not found" case does not return an error, just an empty key. So as a belt-and-suspenders, update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key. * test for adding new servers to keyring * leader: initialize keyring after we have consistent reads Wait until we're sure the FSM is current before we try to initialize the keyring. Also, if a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans	2022-10-21 12:33:16 -04:00
Michael Schurter	9cac60dbed	test: use port collision instead of cpu exhaustion (#14994 ) Originally this test relied on Job 1 blocking Job 2 until Job 1 had a terminal ClientStatus. Job 2 ensured it would get blocked using 2 mechanisms: 1. A constraint requiring it is placed on the same node as Job 1. 2. Job 2 would require all unreserved CPU on the node to ensure it would be blocked until Job 1's resources were free. That 2nd assertion breaks if any previous job is still running on the target node! That seems very likely to happen in the flaky world of our e2e tests. In fact there may be some jobs we intentionally want running throughout; in hindsight it was never safe to assume my test would be the only thing scheduled when it ran. Ports to the rescue! Reserving a static port means that Job 2 will now block on Job 1 being terminal. It will only conflict with other tests if those tests use that port on every node. I ensured no existing tests were using the port I chose. Other changes: - Gave job a bit more breathing room resource-wise. - Tightened timings a bit since previous failure ran into the `go test` time limit. - Cleaned up the DumpEvals output. It's quite nice and handy now!	2022-10-21 07:53:26 -07:00
Luiz Aoqui	8b8d85bce7	docs: use of `node_class` when autoscaling (#14950 ) Document how the value of `node_class` is used during cluster scaling. https://github.com/hashicorp/nomad-autoscaler/issues/255	2022-10-21 10:35:45 -04:00
Seth Hoenig	1f1b662e73	ci: use gotestsum for CI tests (#14995 ) Use gotestsum in both GHA and Circle with retries enabled.	2022-10-21 08:45:24 -05:00
James Rasell	206fb04dc1	acl: allow tokens to read policies linked via roles to the token. (#14982 ) ACL tokens are granted permissions either by direct policy links or via ACL role links. Callers should therefore be able to read policies directly assigned to the caller token or indirectly by ACL role links.	2022-10-21 09:05:17 +02:00
Luiz Aoqui	593e48e826	cli: prevent panic on `operator debug` (#14992 ) If the API returns an error during debug bundle collection the CLI was expanding the wrong error object, resulting in a panic since `err` is `nil`.	2022-10-20 15:53:58 -04:00
Jai	08fde3a4ff	refact: upgrade Promise.then to async/await (#14798 ) * refact: upgrade Promise.then to async/await * naive solution (#14800) * refact: use id instead of model * chore: add changelog entry * refact: add conditional safety around alloc	2022-10-20 14:25:41 -04:00
Luiz Aoqui	0fddb4d7e8	Post 1.4.1 release (#14988 ) * Generate files for 1.4.1 release * Prepare for next release Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>	2022-10-20 13:09:41 -04:00
Seth Hoenig	6e9c8a9955	deps: update go-memdb for goroutine leak fix (#14983 ) * deps: update go-memdb for goroutine leak fix * cl: update for goroutine leak go-memdb	2022-10-20 10:34:52 -05:00
Seth Hoenig	756b71b7d2	deps: bump shoenig for str func bugfixes (#14974 ) And fix the one place we use them.	2022-10-20 08:11:43 -05:00
James Rasell	215b4e7e36	acl: add ACL roles to event stream topic and resolve policies. (#14923 ) This change adds ACL role creation and deletion to the event stream. It is exposed as a single topic with two types; the filter is primarily the role ID but also includes the role name. While conducting this work it was also discovered that the events stream has its own ACL resolution logic. This did not account for ACL tokens which included role links, or tokens with expiry times. ACL role links are now resolved to their policies and tokens are checked for expiry correctly.	2022-10-20 09:43:35 +02:00
James Rasell	d7b311ce55	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Luiz Aoqui	75830a7161	docs: expand Autoscaling documentation (#14937 ) Rename `Internals` section to `Concepts` to match core docs structure and expand on how policies are evaluated. Also include missing documentation for check grouping and fix examples to use the new feature.	2022-10-19 17:57:08 -04:00
Phil Renaud	54eeb6ebe8	Adds searching and filtering for nodes on topology view (#14913 ) * Adds searching and filtering for nodes on topology view * Lintfix and changelog * Acceptance tests for topology search and filter * Search terms also apply to class and dc on topo page * Initialize queryparam values so as to not break history state	2022-10-19 15:00:35 -04:00
Luiz Aoqui	bb00f3d713	docs: add autoscaling debug (#14941 )	2022-10-19 14:17:41 -04:00
Luiz Aoqui	9f51e7ee40	docs: move autoscaling `source` agent config (#14947 ) Move the Autoscaler agent configuration `source` to the `policy` page since they are very closely related. Also update all headers in this section so they follow the proper `h1 > h2 > h3 > ...` hierarchy.	2022-10-19 14:17:09 -04:00
Luiz Aoqui	150b69daaf	docs: explain autoscaler target-value strategy (#14951 ) Provide more technical details about how the `target-value` strategy calculates new scaling actions.	2022-10-19 14:16:17 -04:00
Zach Shilton	fedeb84500	website: fix broken links (#14946 ) * fix: nomad license put link * fix: redirected URL * fix: avoid auto-formatting changes	2022-10-19 14:07:48 -04:00
Seth Hoenig	57375566d4	consul: register checks along with service on initial registration (#14944 ) * consul: register checks along with service on initial registration This PR updates Nomad's Consul service client to include checks in an initial service registration, so that the checks associated with the service are registered "atomically" with the service. Before, we would only register the checks after the service registration, which causes problems where the service is deemed healthy, even if one or more checks are unhealthy - especially problematic in the case where SuccessBeforePassing is configured. Fixes #3935 * cr: followup to fix cause of extra consul logging * cr: fix another bug * cr: fixup changelog	2022-10-19 12:40:56 -05:00
Michael Schurter	611abdf2cc	build: add ability to specify release targets (#14957 ) My make knowledge is very very limited, so if there's a better way to do this please let me know! This seems to work and lets me cut one-off builds easily.	2022-10-19 10:27:47 -07:00
James Rasell	d95f27501b	deps: update consul-template to `61e288a` (#14955 )	2022-10-19 16:27:14 +02:00
Anthony	eb3515c8f5	Updated datacenter block description (#14953 ) * Updated datacenter block description * Replacing accidentally removed title * docs: add closing period Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-19 08:44:52 -05:00
Seth Hoenig	e66c9ede24	e2e: convert flaky exec download in chroot unit test into e2e test (#14949 ) Similar to https://github.com/hashicorp/nomad/pull/14710, convert flaky test into e2e test.	2022-10-19 08:22:32 -05:00
James Rasell	2db8c67a6d	api: add convenience string func to Topic type. (#14843 )	2022-10-19 14:12:23 +02:00
HashiBot	976e4870ec	chore: Update Digital Team Files (#14945 ) * Update generated scripts (website-start.sh) * Update generated scripts (should-build.sh) * Update generated scripts (website-build.sh) * Update generated website Makefile	2022-10-18 17:43:31 -04:00
Michael Schurter	01d90d18f6	test: expand timing and debugging for overlap test (#14920 ) attempt #9000	2022-10-18 13:02:18 -07:00
HashiBot	848158786e	chore: Update Digital Team Files (#14940 ) * Update generated scripts (should-build.sh) * Update generated scripts (website-build.sh) * Update generated scripts (website-start.sh) * Update generated website Makefile	2022-10-18 12:36:24 -04:00
Zach Shilton	217f27c677	website: redirects to empty array (#14921 )	2022-10-18 11:57:36 -04:00
Bryce Kalow	94ff129167	website: fixes redirected links (#14918 )	2022-10-18 10:31:52 -05:00
Seth Hoenig	c571db34e7	deps: bump shoenig/test to 0.4.1 (#14931 ) bugfix for SliceContainsAll and adding SliceContainsSubset	2022-10-18 09:46:25 -05:00
James Rasell	8e25048f3d	acl: gate ACL role write and delete RPC usage on v1.4.0 or greater. (#14908 )	2022-10-18 16:46:11 +02:00
James Rasell	9923f9e6f3	nnsd: gate registration write & delete RPC use on v1.3.0 or greater. (#14924 )	2022-10-18 15:30:28 +02:00
Seth Hoenig	f1b902beac	consul: do not re-register already registered services (#14917 ) This PR updates Nomad's Consul service client to do map comparisons using maps.Equal instead of reflect.DeepEqual. The bug fix is in how DeepEqual treats nil slices differently from empty slices, when actually they should be treated the same.	2022-10-18 08:10:59 -05:00
Tim Gross	3c78980b78	make version checks specific to region (1.4.x) (#14912 ) * One-time tokens are not replicated between regions, so we don't want to enforce that the version check across all of serf, just members in the same region. * Scheduler: Disconnected clients handling is specific to a single region, so we don't want to enforce that the version check across all of serf, just members in the same region. * Variables: enforce version check in Apply RPC * Cleans up a bunch of legacy checks. This changeset is specific to 1.4.x and the changes for previous versions of Nomad will be manually backported in a separate PR.	2022-10-17 16:23:51 -04:00
Seth Hoenig	306b4dd38e	cleanup: remove another string-set helper function (#14902 )	2022-10-17 14:14:52 -05:00
Tim Gross	c721ce618e	keyring: filter by region before checking version (#14901 ) In #14821 we fixed a panic that can happen if a leadership election happens in the middle of an upgrade. That fix checks that all servers are at the minimum version before initializing the keyring (which blocks evaluation processing during the upgrade). But the check we implemented is over the serf membership, which includes servers in any federated regions, which don't necessarily have the same upgrade cycle. Filter the version check by the leader's region. Also bump up log levels of major keyring operations	2022-10-17 13:21:16 -04:00
Kevin Wang	d66b2eba43	fix: website broken links (#14904 ) * fix: website broken links * fix up keyring-rotate link Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-17 11:32:10 -04:00
Michael Schurter	21eced0a4e	test: extend timing and output of overlap e2e test (#14894 ) Keeps failing in the nightly e2e test with unhelpful output like: ``` Failed === RUN TestOverlap overlap_test.go:92: Followup job overlap93ee1d2b blocked. Sleeping for the rest of overlap48c26c39's shutdown_delay (9.2/10s) overlap_test.go:105: 1500/2000 retries reached for github.com/hashicorp/nomad/e2e/overlap.TestOverlap (err=timed out before an allocation was found for overlap93ee1d2b) overlap_test.go:105: timeout: timed out before an allocation was found for overlap93ee1d2b --- FAIL: TestOverlap (38.96s) ``` I have not been able to replicate it in my own e2e cluster, so I added the EvalDump helper to add detailed eval information like: ``` === RUN TestOverlap 1/1 Job overlap7b0e90ec Eval c38c9919-a4f0-5baf-45f7-0702383c682a Type: service TriggeredBy: job-register Deployment: Status: pending () NextEval: PrevEval: BlockedEval: -- No placement failures -- QueuedAllocs: SnapshotIdx: 0 CreateIndex: 96 ModifyIndex: 96 ... ``` Hopefully helpful when debugging other tests as well!	2022-10-14 14:15:07 -07:00
Mike Nomitch	91d32bb8df	Merge pull request #14879 from hashicorp/mnomitch/job-purge-ui Adds purge job button to UI	2022-10-14 12:46:20 -07:00
hashicorp-copywrite[bot]	2df28b0d7e	[COMPLIANCE] Update MPL 2.0 LICENSE (#14884 ) Co-authored-by: hashicorp-copywrite[bot] <noreply@hashicorp.com>	2022-10-13 08:43:12 -04:00
Michael Schurter	bdb639b3e2	test: simplify overlap job placement logic (#14811 ) * test: simplify overlap job placement logic Trying to fix #14806 Both the previous approach as well as this one worked on e2e clusters I spun up. * simplify code flow	2022-10-12 11:21:28 -07:00
Mike Nomitch	c4ec506009	Adds purge job button to UI when job stopped	2022-10-12 08:14:48 -07:00
Tim Gross	bcd26f8815	docker_logger: reorder imports to save memory (#14875 ) Nomad runs one logmon process and also one docker_logger process for each running allocation. A naive look at memory usage shows 10-30 MB of RSS, but a closer look shows that most of this memory (ex. all but ~2MB for logmon) is shared (`Shared_Clean` in Linux pmap). But a heap dump of docker_logger shows that it currently has an extra ~2500 KiB of heap (anonymously-mapped unshared memory) used for init blocks coming from the agent code (ex. mostly regexes from go-version, structs, and the Consul SDK). The packages for running logmon, docker_logger, and executor have an init block that parses `os.Args` to drop into their own logic, which prevents them from loading all the rest of the agent code and saves on memory, so this was unexpected. It looks like we accidentally reordered the imports in main to undo some of the work originally done in 404d2d4c98f1df930be1ae9852fe6e6ae8c1517e. This changeset restores the ordering. A follow-up heap dump shows this saves ~2MB of unshared RSS per docker_logger process.	2022-10-11 13:23:03 -04:00
Michael Schurter	45ce8c13cf	client: remove unused LogOutput and LogLevel (#14867 ) * client: remove unused LogOutput * client: remove unused config.LogLevel	2022-10-11 09:24:40 -07:00
Seth Hoenig	ba1e337f8b	helpers: lockfree lookup of nobody user on unix systems (#14866 ) * helpers: lockfree lookup of nobody user on linux and darwin This PR continues the nobody user lookup saga, by making the nobody user lookup lock-free on linux and darwin. By doing the lookup in an init block this originally broke on Windows, where we must avoid doing the lookup at all. We can get around that breakage by only doing the lookup on linux/darwin where the nobody user is going to exist. Also return the nobody user by value so that a copy is created that cannot be modified by callers of Nobody(). * helper: move nobody code into unix file	2022-10-11 08:38:05 -05:00
Seth Hoenig	1593963cd1	servicedisco: implicit constraint for nomad v1.4 when using nsd checks (#14868 ) This PR adds a jobspec mutator to constrain jobs making use of checks in the nomad service provider to nomad clients of at least v1.4.0. Before, in a mixed client version cluster it was possible to submit an NSD job making use of checks and for that job to land on an older, incompatible client node. Closes #14862	2022-10-11 08:21:42 -05:00
Seth Hoenig	69ced2a2bd	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Seth Hoenig	5e38a0e82c	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
Seth Hoenig	0e702aec00	build: move imports into the transitive require block (#14863 )	2022-10-10 09:27:55 -05:00
Phil Renaud	e771b94164	[ui] Makes service tags wrap and look like tag items (#14834 ) * Makes service tags wrap and look like tag items * Add a little vertical spacing and changelog * Put client before tags * Force tags list to new line	2022-10-07 09:23:52 -04:00


23889 commits