Commit Graph

24117 Commits

Author SHA1 Message Date
Ayrat Badykov c94c231c08
fix create snapshot request docs (#15242) 2022-11-17 08:43:40 +01:00
Tim Gross 6415fb4284
eval broker: shed all but one blocked eval per job after ack (#14621)
When an evaluation is acknowledged by a scheduler, the resulting plan is
guaranteed to cover up to the `waitIndex` set by the worker based on the most
recent evaluation for that job in the state store. At that point, we no longer
need to retain blocked evaluations in the broker that are older than that index.

Move all but the highest priority / highest `ModifyIndex` blocked eval into a
canceled set. When the `Eval.Ack` RPC returns from the eval broker it will
signal a reap of a batch of cancelable evals to write to raft. This paces the
cancelations by how frequently the schedulers acknowledge evals, which should
reduce the risk of cancelations overwhelming raft relative to scheduler
progress. To avoid straggling batches when the cluster is
quiet, we also include a periodic sweep through the cancelable list.
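
As a rough sketch of the shedding step, with simplified types (the real broker also weighs eval priority and tracks the cancelable set for reaping after `Eval.Ack`):

```go
package broker

import "sort"

// Eval is a trimmed-down stand-in for the real evaluation struct.
type Eval struct {
	ID          string
	JobID       string
	ModifyIndex uint64
}

// shedBlocked keeps only the most recently modified blocked eval for a job and
// returns the rest, which the caller moves into the cancelable set to be
// reaped in batches as schedulers ack evals.
func shedBlocked(blocked []*Eval) (keep *Eval, cancelable []*Eval) {
	if len(blocked) == 0 {
		return nil, nil
	}
	sort.Slice(blocked, func(i, j int) bool {
		return blocked[i].ModifyIndex > blocked[j].ModifyIndex
	})
	return blocked[0], blocked[1:]
}
```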
2022-11-16 16:10:11 -05:00
Seth Hoenig 45ff0765c7
e2e: swap bionic image for jammy (#15220) 2022-11-16 10:37:18 -06:00
Tim Gross e8c83c3ecc
test: ensure leader is still valid in reelection test (#15267)
The `TestLeader_Reelection` test waits for a leader to be elected and then makes
some other assertions. But it implicitly assumes that there's no failure of
leadership before shutting down the leader, which can lead to a panic in the
tests. Assert there's still a leader before the shutdown.
2022-11-16 11:04:02 -05:00
Jai f6155f2734
feat: add tooltip to storage volumes (#15245)
* feat: add tooltip to storage volumes

* chore: move Tooltip into td to preserve style

* styling: add overflow-x to section (#15246)

* styling: add overflow-x to section

* refact: use media query with display block
2022-11-15 14:13:57 -05:00
Jai 8bb7a9e577
refact: remove unused API (#15244) 2022-11-15 14:13:14 -05:00
James Rasell 2e19e9639e
agent: ensure all HTTP Server methods are pointer receivers. (#15250) 2022-11-15 16:31:44 +01:00
Nikita Beletskii 550f715ecd
Fix variable create API example in docs (#15248) 2022-11-15 16:04:11 +01:00
Tim Gross 37134a4a37
eval delete: move batching of deletes into RPC handler and state (#15117)
During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending large batches of IDs. Although the command's batch size
was carefully tuned, we still need to JSON-deserialize those IDs, re-serialize them to
MessagePack, send the log entries through raft, and apply them in the FSM.

To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look at the failed options first:

* A naive solution here would be to just send the filter as the raft request and
  let the FSM apply delete the whole set in a single operation. Benchmarking with
  1M evals on a 3 node cluster demonstrated this can block the FSM apply for
  several minutes, which puts the cluster at risk if there's a leadership
  failover (the barrier write can't be made while this apply is in-flight).

* A less naive but still bad solution would be to have the RPC handler filter
  and paginate, and then hand a list of IDs to the existing raft log
  entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
  took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only about 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completes in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.
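
To make the two-phase shape concrete, here's a toy sketch with stand-in types (not Nomad's actual state store or FSM code): the "RPC handler" walks the data read-only to find a page boundary, and the "FSM" re-derives the same page from the filter and token before deleting it.

```go
package main

import "fmt"

// eval is a stand-in for a stored evaluation.
type eval struct {
	ID       string
	Terminal bool
}

// deleteRequest is a hypothetical raft entry: a filter plus a page token,
// rather than a large list of eval IDs.
type deleteRequest struct {
	filterTerminal bool
	lastID         string // page token: delete matching evals with ID <= lastID
}

// planBatch plays the RPC handler's role: paginate read-only, stop after
// pageSize matches, and record the boundary as the page token.
func planBatch(evals []eval, pageSize int) deleteRequest {
	req := deleteRequest{filterTerminal: true}
	n := 0
	for _, e := range evals {
		if e.Terminal {
			req.lastID = e.ID
			n++
			if n == pageSize {
				break
			}
		}
	}
	return req
}

// applyBatch plays the FSM's role: rebuild roughly the same page from the
// filter and token, then delete it in one short pass.
func applyBatch(evals []eval, req deleteRequest) []eval {
	var kept []eval
	for _, e := range evals {
		if req.filterTerminal && e.Terminal && e.ID <= req.lastID {
			continue // deleted
		}
		kept = append(kept, e)
	}
	return kept
}

func main() {
	evals := []eval{{"a", true}, {"b", false}, {"c", true}, {"d", true}}
	req := planBatch(evals, 2)
	fmt.Println(applyBatch(evals, req)) // [{b false} {d true}]
}
```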
2022-11-14 14:08:13 -05:00
Douglas Jose 345ef0bbec
Fix wrong reference to `vault` (#15228) 2022-11-14 10:49:09 +01:00
Kyle Root 99d5e7efb3
Fix broken URL to nvidia device plugin (#15234) 2022-11-14 10:37:06 +01:00
Charlie Voiselle c73fb51d3a
[bug] Return a spec on reconnect (#15214)
client: fixed a bug where non-`docker` tasks with network isolation would leak network namespaces and iptables rules if the client was restarted while they were running
2022-11-11 13:27:36 -05:00
Seth Hoenig 21237d8337
client: avoid unconsumed channel in timer construction (#15215)
* client: avoid unconsumed channel in timer construction

This PR fixes a bug introduced in #11983 where a Timer initialized with 0
duration causes an immediate tick, even if Reset is called before reading the
channel. The fix is to avoid doing that, instead creating a Timer with a non-zero
initial wait time, and then immediately calling Stop.

* pr: remove redundant stop
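
The fix pattern described above, as a minimal standalone sketch (not the helper's actual code): construct the timer with a comfortably large duration and stop it, so no phantom tick is sitting in the channel when it's later Reset.

```go
package main

import "time"

// newStoppedTimer returns a timer that has not fired and is safe to Reset
// later. time.NewTimer(0) would instead queue an immediate tick that a later
// Reset does not clear, which is the bug described above.
func newStoppedTimer() *time.Timer {
	t := time.NewTimer(24 * time.Hour) // any comfortably large duration
	t.Stop()
	return t
}

func main() {
	t := newStoppedTimer()
	t.Reset(50 * time.Millisecond)
	<-t.C // fires once, ~50ms after the Reset, with no spurious earlier tick
}
```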
2022-11-11 09:31:34 -06:00
Tim Gross eabbcebdd4
exec: allow running commands from host volume (#14851)
The exec driver and other drivers derived from the shared executor check the
path of the command before handing off to libcontainer to ensure that the
command doesn't escape the sandbox. But we don't check any host volume mounts,
which should be safe to use as a source for executables if we're letting the
user mount them to the container in the first place.

Check the mount config to verify the executable lives in the mount's host path,
but then return an absolute path within the mount's task path so that we can hand
that off to libcontainer to run.

Includes a good bit of refactoring here because the anchoring of the final task
path has different code paths for inside the task dir vs inside a mount. But
I've fleshed out the test coverage of this a good bit to ensure we haven't
created any regressions in the process.
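
A hedged sketch of the check (illustrative names, not the executor's real code): confirm the command's host path is inside the mount's host path, then translate it to the equivalent path under the mount's destination in the task.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mount is a simplified stand-in for a driver mount config.
type mount struct {
	HostPath string // source on the host, e.g. a host volume
	TaskPath string // destination inside the task
}

// resolveInMount verifies that cmd (a host path) lives under m.HostPath and,
// if so, returns the absolute task-side path to hand to libcontainer.
func resolveInMount(cmd string, m mount) (string, error) {
	rel, err := filepath.Rel(m.HostPath, cmd)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%q escapes mount %q", cmd, m.HostPath)
	}
	return filepath.Join(m.TaskPath, rel), nil
}

func main() {
	m := mount{HostPath: "/opt/tools", TaskPath: "/usr/local/tools"}
	fmt.Println(resolveInMount("/opt/tools/bin/run.sh", m)) // /usr/local/tools/bin/run.sh <nil>
	fmt.Println(resolveInMount("/etc/passwd", m))           // rejected: escapes the mount
}
```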
2022-11-11 09:51:15 -05:00
Seth Hoenig 01a3a29e51
docs: clarify how to access task meta values in templates (#15212)
This PR updates template and meta docs pages to give examples of accessing
meta values in templates. To do so one must use the environment variable form
of the meta key name, which isn't obvious and wasn't yet documented.
2022-11-10 16:11:53 -06:00
Luiz Aoqui d3714b68e5
ci: notify on backport-assistant errors (#15203) 2022-11-10 16:11:26 -05:00
Luiz Aoqui d23203b7e4
ci: re-enable tests on main (#15204)
Now that the tests are grouped more tightly we don't use as many runners
as before, so we can re-enable these without clogging the queue.
2022-11-10 13:51:37 -05:00
Piotr Kazmierczak 4851f9e68a
acl: sso auth method schema and store functions (#15191)
This PR implements ACLAuthMethod type, acl_auth_methods table schema and crud state store methods. It also updates nomadSnapshot.Persist and nomadSnapshot.Restore methods in order for them to work with the new table, and adds two new Raft messages: ACLAuthMethodsUpsertRequestType and ACLAuthMethodsDeleteRequestType

This PR is part of the SSO work captured under ☂️ ticket #13120.
2022-11-10 19:42:41 +01:00
Seth Hoenig 6e3309ebc6
template: protect use of template manager with a lock (#15192)
This PR protects access to `templateHook.templateManager` with its lock. So
far we have not been able to reproduce the panic - but it seems either Poststart
is running without a Prestart being run first (should be impossible), or the
Update hook is running concurrently with Poststart, nil-ing out the templateManager
in a race with Poststart.

Fixes #15189
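
The locking pattern in miniature (field and method names here are illustrative, not the hook's real ones):

```go
package main

import "sync"

type templateManager struct{ /* ... */ }

type templateHook struct {
	mu      sync.Mutex
	manager *templateManager
}

// Update can run concurrently with Poststart, so both take the lock before
// touching the manager, and Poststart tolerates a nil manager instead of
// panicking on the race described above.
func (h *templateHook) Update(m *templateManager) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.manager = m
}

func (h *templateHook) Poststart() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.manager == nil {
		return // nothing to signal yet
	}
	// ... use h.manager while still holding the lock ...
}

func main() {
	h := &templateHook{}
	go h.Update(&templateManager{})
	h.Poststart()
}
```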
2022-11-10 08:30:27 -06:00
Seth Hoenig 87a34102f5
make: add target cl for create changelog entry (#15186)
* make: add target cl for create changelog entry

This PR adds `tools/cl-entry` and the `make cl` Makefile target for
conveniently creating correctly formatted Changelog entries.

* Update tools/cl-entry/main.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* Update tools/cl-entry/main.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-11-08 09:43:32 -06:00
Derek Strickland 80b6f27efd
api: remove `mapstructure` tags from`Port` struct (#12916)
This PR solves a defect in the deserialization of api.Port structs when returning structs from the EventStream.

Previously, the api.Port struct's fields were decorated with both mapstructure and hcl tags to support the network.port stanza's use of the keyword static when posting a static port value. This works fine when posting a job and when retrieving any struct that has an embedded api.Port instance as long as the value is deserialized using JSON decoding. The EventStream, however, uses mapstructure to decode event payloads in the api package. mapstructure expects an underlying field named static which does not exist. The result was that the Port.Value field would always be set to 0.

Upon further inspection, a few things became apparent:

* The struct already has hcl tags that support the indirection during job submission.
* Serialization/deserialization with both the json and hcl packages produces the desired result.
* The use of the mapstructure tags provided no value, as the Port struct contains only fields with primitive types.

This PR:

* Removes the mapstructure tags from the api.Port structs
* Updates the job parsing logic to use hcl instead of mapstructure when decoding Port instances.

Closes #11044
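
As a sketch of the resulting struct shape (abbreviated; the real api.Port field set and tags may differ slightly):

```go
package api

// Port keeps only hcl tags after this change. The hcl tag alone carries the
// "static" keyword indirection for job submission; the former
// `mapstructure:"static"` tag on Value pointed mapstructure at a field name
// that doesn't exist in EventStream payloads, zeroing the port value.
type Port struct {
	Label       string `hcl:",label"`
	Value       int    `hcl:"static,optional"` // previously also tagged mapstructure:"static"
	To          int    `hcl:"to,optional"`
	HostNetwork string `hcl:"host_network,optional"`
}
```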

Co-authored-by: DerekStrickland <dstrickland@hashicorp.com>
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2022-11-08 11:26:28 +01:00
twunderlich-grapl 1859559134
Fix s3 example URLs in the artifacts docs (#15123)
* Fix s3 URLs so that they work

Unfortunately, S3 URLs prefixed with https:// do NOT work with the underlying go-getter library. As such, this fixes the examples so that they work and won't cause problems for people reading the docs.
See discussion in https://github.com/hashicorp/nomad/issues/1113 circa 2016.

* Use s3:// protocol schema for artifact examples

Per the discussion in https://github.com/hashicorp/nomad/pull/15123,
we're going to use the explicit s3 protocol in the examples since that
is the likeliest to work in all scenarios
2022-11-07 14:14:57 -05:00
Drew Gonzales aac9404ee5
server: add git revision to serf tags (#9159) 2022-11-07 10:34:33 -05:00
Phil Renaud 85521c49c4
[ui] Remove animation from task logs sidebar (#15146)
* Remove animation from task logs sidebar

* changelog
2022-11-07 10:11:18 -05:00
Tim Gross 9e1c0b46d8
API for `Eval.Count` (#15147)
Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is
designed to support interactive use in the `nomad eval delete` command to get a
count of evals expected to be deleted before doing so.

The state store operations to do this sort of thing are somewhat expensive, but
it's cheaper than serializing a big list of evals to JSON. Note that although it
seems like this could be done as an extra parameter and response field on
`Eval.List`, having it as its own endpoint avoids having to change the response
body shape and lets us avoid handling the legacy filter params supported by
`Eval.List`.
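
A hedged example of consuming such a count endpoint over HTTP; the path and response shape below are assumptions drawn from this description, not confirmed API details.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Assumed endpoint shape: GET /v1/evaluations/count with the usual
	// filter expression, returning a JSON object containing a Count field.
	q := url.Values{"filter": {`Status == "pending"`}}
	resp, err := http.Get("http://localhost:4646/v1/evaluations/count?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct{ Count int }
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("evals matching filter:", out.Count)
}
```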
2022-11-07 08:53:19 -05:00
dependabot[bot] 7dc1a66630
build(deps): bump github.com/zclconf/go-cty-yaml from 1.0.2 to 1.0.3 (#15165)
Bumps [github.com/zclconf/go-cty-yaml](https://github.com/zclconf/go-cty-yaml) from 1.0.2 to 1.0.3.
- [Release notes](https://github.com/zclconf/go-cty-yaml/releases)
- [Changelog](https://github.com/zclconf/go-cty-yaml/blob/master/CHANGELOG.md)
- [Commits](https://github.com/zclconf/go-cty-yaml/compare/v1.0.2...v1.0.3)

---
updated-dependencies:
- dependency-name: github.com/zclconf/go-cty-yaml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-07 06:57:23 -06:00
dependabot[bot] b04c0d082d
build(deps): bump github.com/hashicorp/go-plugin from 1.4.3 to 1.4.5 (#15166)
Bumps [github.com/hashicorp/go-plugin](https://github.com/hashicorp/go-plugin) from 1.4.3 to 1.4.5.
- [Release notes](https://github.com/hashicorp/go-plugin/releases)
- [Changelog](https://github.com/hashicorp/go-plugin/blob/master/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/go-plugin/compare/v1.4.3...v1.4.5)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/go-plugin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-07 06:39:41 -06:00
dependabot[bot] 3f2d853f63
build(deps): bump github.com/shirou/gopsutil/v3 from 3.22.9 to 3.22.10 (#15162)
Bumps [github.com/shirou/gopsutil/v3](https://github.com/shirou/gopsutil) from 3.22.9 to 3.22.10.
- [Release notes](https://github.com/shirou/gopsutil/releases)
- [Commits](https://github.com/shirou/gopsutil/compare/v3.22.9...v3.22.10)

---
updated-dependencies:
- dependency-name: github.com/shirou/gopsutil/v3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-07 06:37:42 -06:00
dependabot[bot] 406af7fd8c
build(deps): bump github.com/zclconf/go-cty from 1.11.0 to 1.12.0 (#15161)
Bumps [github.com/zclconf/go-cty](https://github.com/zclconf/go-cty) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/zclconf/go-cty/releases)
- [Changelog](https://github.com/zclconf/go-cty/blob/main/CHANGELOG.md)
- [Commits](https://github.com/zclconf/go-cty/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: github.com/zclconf/go-cty
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-07 06:36:44 -06:00
dependabot[bot] d848c28066
build(deps): bump github.com/shoenig/test from 0.4.3 to 0.4.4 in /api (#15163)
* build(deps): bump github.com/shoenig/test from 0.4.3 to 0.4.4 in /api

Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.4.3 to 0.4.4.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](https://github.com/shoenig/test/compare/v0.4.3...v0.4.4)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps: also update root go mod

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-11-06 08:06:01 -06:00
Luiz Aoqui e4c8b59919
Update alloc after reconnect and enforce client heartbeat order (#15068)
* scheduler: allow updates after alloc reconnects

When an allocation reconnects to a cluster the scheduler needs to run
special logic to handle the reconnection, check if a replacement was
created, and stop one of them.

If the allocation kept running while the node was disconnected, it will
be reconnected with `ClientStatus: running` and the node will have
`Status: ready`. This combination is the same as the normal steady state
of an allocation, where everything is running as expected.

In order to differentiate between the two states (an allocation that is
reconnecting and one that is just running) the scheduler needs an extra
piece of state.

The current implementation uses the presence of a
`TaskClientReconnected` task event to detect when the allocation has
reconnected and thus must go through the reconnection process. But this
event remains even after the allocation is reconnected, causing all
future evals to consider the allocation as still reconnecting.

This commit changes the reconnect logic to use an `AllocState` to
register when the allocation was reconnected. This provides the
following benefits:

  - Only a limited number of task states are kept, and they are used for
    many other events. It's possible that, upon reconnecting, several
    actions are triggered that could cause the `TaskClientReconnected`
    event to be dropped.
  - Task events are set by clients and so their timestamps are subject
    to time skew from servers. This prevents using time to determine if
    an allocation reconnected after a disconnect event.
  - Disconnect events are already stored as `AllocState` and so storing
    reconnects there as well makes it the only source of information
    required.

With the new logic, the reconnection logic is only triggered if the
last `AllocState` is a disconnect event, meaning that the allocation has
not been reconnected yet. After the reconnection is handled, the new
`ClientStatus` is stored in `AllocState`, allowing future evals to skip
the reconnection logic.
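
A sketch of the "last `AllocState` event decides" rule with simplified types (the real structs live in nomad/structs and carry more fields):

```go
package main

import "fmt"

type allocStateEvent string

const (
	eventDisconnect allocStateEvent = "disconnect"
	eventReconnect  allocStateEvent = "reconnect"
)

type alloc struct {
	ClientStatus string
	AllocStates  []allocStateEvent // appended in order by the servers
}

// needsReconnectProcessing is true only while the most recent AllocState
// event is a disconnect, i.e. the alloc has not been reconnected yet. A task
// event would instead keep flagging the alloc on every future eval.
func needsReconnectProcessing(a alloc) bool {
	if len(a.AllocStates) == 0 {
		return false
	}
	return a.AllocStates[len(a.AllocStates)-1] == eventDisconnect
}

func main() {
	a := alloc{ClientStatus: "running", AllocStates: []allocStateEvent{eventDisconnect}}
	fmt.Println(needsReconnectProcessing(a)) // true: run the reconnect logic

	a.AllocStates = append(a.AllocStates, eventReconnect)
	fmt.Println(needsReconnectProcessing(a)) // false: already handled
}
```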

* scheduler: prevent spurious placement on reconnect

When a client reconnects it makes two independent RPC calls:

  - `Node.UpdateStatus` to heartbeat and set its status as `ready`.
  - `Node.UpdateAlloc` to update the status of its allocations.

These two calls can happen in any order, and if the allocations are
updated before a heartbeat, the state looks the same as a node
being disconnected: the node status will still be `disconnected` while
the allocation `ClientStatus` is set to `running`.

The current implementation did not handle this order of events properly,
and the scheduler would create an unnecessary placement since it
considered the allocation was being disconnected. This extra allocation
would then be quickly stopped by the heartbeat eval.

This commit adds a new code path to handle this order of events. If the
node is `disconnected` and the allocation `ClientStatus` is `running`
the scheduler will check if the allocation is actually reconnecting
using its `AllocState` events.

* rpc: only allow alloc updates from `ready` nodes

Clients interact with servers using three main RPC methods:

  - `Node.GetAllocs` reads allocation data from the server and writes it
    to the client.
  - `Node.UpdateAlloc` reads allocations from the client and writes
    them to the server.
  - `Node.UpdateStatus` writes the client status to the server and is
    used as the heartbeat mechanism.

These three methods are called periodically by the clients and are done
so independently from each other, meaning that there can't be any
assumptions about their ordering.

This can generate scenarios that are hard to reason about and to code
for. For example, when a client misses too many heartbeats it will be
considered `down` or `disconnected` and the allocations it was running
are set to `lost` or `unknown`.

When connectivity to the rest of the cluster is restored, the natural
mental model is to think that the client will heartbeat first and then
update its allocation statuses on the servers.

But since there's no inherent order in these calls the reverse is just as
possible: the client updates the alloc status and then heartbeats. This
results in a state where allocs are, for example, `running` while the
client is still `disconnected`.

This commit adds a new verification to the `Node.UpdateAlloc` method to
reject updates from nodes that are not `ready`, forcing clients to
heartbeat first. Since this check is done server-side there is no need
to coordinate operations client-side: they can continue sending these
requests independently, and the alloc update will succeed after the heartbeat
is done.
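
In miniature, the server-side guard looks something like this (a sketch; the real check lives in the Node.UpdateAlloc RPC handler with Nomad's own structs and errors):

```go
package main

import (
	"errors"
	"fmt"
)

type node struct {
	ID     string
	Status string // e.g. "initializing", "ready", "down", "disconnected"
}

// allowAllocUpdate rejects alloc updates from nodes that have not heartbeated
// back to ready, which forces Node.UpdateStatus to land before Node.UpdateAlloc.
func allowAllocUpdate(n *node) error {
	if n == nil {
		return errors.New("node not found")
	}
	if n.Status != "ready" {
		return fmt.Errorf("node %s is %s, not ready; rejecting alloc update", n.ID, n.Status)
	}
	return nil
}

func main() {
	fmt.Println(allowAllocUpdate(&node{ID: "n1", Status: "disconnected"})) // rejected
	fmt.Println(allowAllocUpdate(&node{ID: "n1", Status: "ready"}))        // <nil>
}
```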

* changelog: add entry for #15068

* code review

* client: skip terminal allocations on reconnect

When the client reconnects with the server it synchronizes the state of
its allocations by sending data using the `Node.UpdateAlloc` RPC and
fetching data using the `Node.GetClientAllocs` RPC.

If the data fetch happens before the data write, `unknown` allocations
will still be in this state and would trigger the
`allocRunner.Reconnect` flow.

But when the server `DesiredStatus` for the allocation is `stop` the
client should not reconnect the allocation.

* apply more code review changes

* scheduler: persist changes to reconnected allocs

Reconnected allocs have a new AllocState entry that must be persisted by
the plan applier.

* rpc: read node ID from allocs in UpdateAlloc

The AllocUpdateRequest struct is used in three disjoint use cases:

1. Stripped allocs from clients Node.UpdateAlloc RPC using the Allocs,
   and WriteRequest fields
2. Raft log message using the Allocs, Evals, and WriteRequest fields
3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields

Adding a new field that would only be used in one of these cases (1) made
things more confusing and error prone. While in theory an
AllocUpdateRequest could send allocations from different nodes, in
practice this never actually happens since only clients call this method
with their own allocations.

* scheduler: remove logic to handle exceptional case

This condition could only be hit if, somehow, the allocation status was
set to "running" while the client was "unknown". This was addressed by
enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC
calls, so this scenario is not expected to happen.

Adding unnecessary code to the scheduler makes it harder to read and
reason about.

* more code review

* remove another unused test
2022-11-04 16:25:11 -04:00
Luiz Aoqui 1b87d292a3
client: retry RPC call when no server is available (#15140)
When a Nomad service starts it tries to establish a connection with
servers, but it also runs alloc runners to manage whatever allocations
it needs to run.

The alloc runner will invoke several hooks to perform actions, with some
of them requiring access to the Nomad servers, such as Native Service
Discovery Registration.

If the alloc runner starts before a connection is established the alloc
runner will fail, causing the allocation to be shutdown. This is
particularly problematic for disconnected allocations that are
reconnecting, as they may fail as soon as the client reconnects.

This commit changes the RPC request logic to retry it, using the
existing retry mechanism, if there are no servers available.
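
A sketch of the retry behavior (placeholder names; the client reuses its existing retry/backoff machinery rather than anything this literal):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNoServers = errors.New("no servers available")

// rpcWithRetry retries only while the failure is "no servers yet", so hooks
// that run right after client startup or a reconnect don't fail their
// allocation before the first server connection is established.
func rpcWithRetry(call func() error, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for {
		err := call()
		if err == nil || !errors.Is(err, errNoServers) {
			return err
		}
		if time.Now().After(deadline) {
			return err
		}
		time.Sleep(time.Second) // stand-in for the real backoff
	}
}

func main() {
	attempts := 0
	err := rpcWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errNoServers
		}
		return nil
	}, 10*time.Second)
	fmt.Println(attempts, err) // 3 <nil>
}
```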
2022-11-04 14:09:39 -04:00
Charlie Voiselle 79c4478f5b
template: error on missing key (#15141)
* Support error_on_missing_value for templates
* Update docs for template stanza
2022-11-04 13:23:01 -04:00
Seth Hoenig d7aa37a5c9
e2e: explicitly wait on task status in chroot download exec test (#15145)
Also add some debug log lines for this test, because it doesn't make sense
for the allocation to be complete while a task in the allocation has not
started yet, which is what the test failures are implying.
2022-11-04 09:50:11 -05:00
Charlie Voiselle 83e43e01c1
Add missing timer reset (#15134) 2022-11-03 18:57:57 -04:00
Ethan 654ae1d591
fix: batchFirstFingerprints does not update device on node after v1.3.5 (#15125)
* fix: update device in batch first fingerprint

* cl: add cl note

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-11-03 16:31:39 -05:00
Phil Renaud 147df77e62
Ember patched for security release (#15126) 2022-11-03 16:29:23 -04:00
Tim Gross 672fb46d12
WI: set identity to client secret if missing (#15121)
Allocations created before 1.4.0 will not have a workload identity token. When
the client running these allocs is upgraded to 1.4.x, the identity hook will run
and replace the node secret ID token used previously with an empty string. This
causes service discovery queries to fail.

Fall back to the node's secret ID when the allocation doesn't have a signed
identity. Note that pre-1.4.0 allocations won't have templates that read
Variables, so there's no threat that this new node ID secret will be able to
read data that the allocation shouldn't have access to.
2022-11-03 11:10:11 -04:00
Phil Renaud ab5bfa8149
Accidentally trailed off on a docs paragraph (#15118) 2022-11-02 23:33:41 -04:00
Phil Renaud ffb4c63af7
[ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833)
* Adds meta to job list stub and displays a pack logo on the jobs index

* Changelog

* Modifying struct for optional meta param

* Explicitly ask for meta anytime I look up a job from index or job page

* Test case for the endpoint

* adding meta field to API struct and omitting from response if empty

* passthru method added to api/jobs.list

* Meta param listed in docs for jobs list

* Update api/jobs.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-11-02 16:58:24 -04:00
Phil Renaud 6d5fe56fa1
Job spec upload (#14747)
* Job spec upload by click or drag

* pseudo-restrict formats

* Changelog

* Tweak to job spec upload to be above editor layer

* Within the job-editor again tho

* Beginning testcase cleanup

* Test progression

* refact: update codemirror fillin logic

Co-authored-by: Jai Bhagat <jaybhagat841@gmail.com>
2022-11-02 10:34:10 -04:00
Seth Hoenig a0bdc67d6a
build: update to go1.19.3 (#15099) 2022-11-01 15:54:49 -05:00
Tim Gross 4d7a4171cd
volumewatcher: prevent panic on nil volume (#15101)
If a GC claim is written and then the volume is deleted before the `volumewatcher`
enters its run loop, we panic on the nil-pointer access. Simply doing a
nil-check at the top of the loop reveals a race condition around shutting down
the loop just as a new update is coming in.

Have the parent `volumeswatcher` send an initial update on the channel before
returning, so that we're still holding the lock. Update the watcher's `Stop`
method to set the running state, which lets us avoid having a second context and
makes stopping synchronous. This reduces the cases we have to handle in the run
loop.

Updated the tests now that we'll safely return from the goroutine and stop the
runner in a larger set of cases. Ran the tests with the `-race` detection flag
and fixed up any problems found here as well.
2022-11-01 16:53:10 -04:00
Tim Gross 38542f256e
variables: limit rekey eval to half the nack timeout (#15102)
In order to limit how much the rekey job can monopolize a scheduler worker, we
limit how long it can run to 1min before stopping work and emitting a new
eval. But this exactly matches the default nack timeout, so it'll fail the eval
rather than getting a chance to emit a new one.

Set the timeout for the rekey eval to half the configured nack timeout.
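
In code terms the change is small; roughly (a sketch, using the default 1min nack timeout mentioned above):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Cap the rekey work at half the eval nack timeout so a follow-up eval
	// can be emitted before the broker would nack the current one.
	evalNackTimeout := time.Minute // the configured default noted above
	ctx, cancel := context.WithTimeout(context.Background(), evalNackTimeout/2)
	defer cancel()

	fmt.Println(ctx.Err()) // <nil>: the 30s budget hasn't elapsed yet
}
```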
2022-11-01 16:50:50 -04:00
Tim Gross 903b5baaa4
keyring: safely handle missing keys and restore GC (#15092)
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.

Included in the replication fix:
* Refactor the replication loop so that each key is replicated in a function
  call that returns an error (see the sketch below), to make the workflow more
  clear and reduce nesting. Log
  the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
  on initializing the keyring, so there's a race condition in the keyring tests
  where we can test for the existence of the root key before the keyring has
  been initialized. Change this to an "eventually" test.

But these fixes aren't enough to fix #14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
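
A sketch of the per-key refactor from the first bullet above (placeholder types): one failed key is logged and skipped rather than breaking the whole replication pass.

```go
package main

import (
	"fmt"
	"log"
)

type keyMeta struct{ ID string }

// replicateKey fetches and installs a single key; returning an error lets the
// caller log it and continue instead of breaking out of the loop early.
func replicateKey(k keyMeta) error {
	if k.ID == "missing" {
		return fmt.Errorf("key %s not found on any peer", k.ID)
	}
	return nil
}

func replicateAll(keys []keyMeta) {
	for _, k := range keys {
		if err := replicateKey(k); err != nil {
			log.Printf("failed to replicate key %s: %v", k.ID, err)
			continue // keys later in the sort order still get replicated
		}
	}
}

func main() {
	replicateAll([]keyMeta{{"a"}, {"missing"}, {"z"}})
}
```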
2022-11-01 15:00:50 -04:00
dependabot[bot] acc94d523f
build(deps): bump github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible (#15078)
* build(deps): bump github.com/docker/cli

Bumps [github.com/docker/cli](https://github.com/docker/cli) from 20.10.18+incompatible to 20.10.21+incompatible.
- [Release notes](https://github.com/docker/cli/releases)
- [Commits](https://github.com/docker/cli/compare/v20.10.18...v20.10.21)

---
updated-dependencies:
- dependency-name: github.com/docker/cli
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps: updated github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-31 08:50:33 -05:00
dependabot[bot] c0fc174e0d
build(deps): bump github.com/hashicorp/serf from 0.10.0 to 0.10.1 (#15077)
Bumps [github.com/hashicorp/serf](https://github.com/hashicorp/serf) from 0.10.0 to 0.10.1.
- [Release notes](https://github.com/hashicorp/serf/releases)
- [Changelog](https://github.com/hashicorp/serf/blob/master/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/serf/compare/v0.10.0...v0.10.1)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/serf
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-10-31 08:48:34 -05:00
dependabot[bot] 369e4da4ad
build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 (#15081)
* build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126

Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.44.84 to 1.44.126.
- [Release notes](https://github.com/aws/aws-sdk-go/releases)
- [Commits](https://github.com/aws/aws-sdk-go/compare/v1.44.84...v1.44.126)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps: update github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-31 08:47:48 -05:00
dependabot[bot] 0ae4c4a241
build(deps): bump github.com/stretchr/testify in /api (#15082)
Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.8.0 to 1.8.1.
- [Release notes](https://github.com/stretchr/testify/releases)
- [Commits](https://github.com/stretchr/testify/compare/v1.8.0...v1.8.1)

---
updated-dependencies:
- dependency-name: github.com/stretchr/testify
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-10-31 08:45:04 -05:00
dependabot[bot] fc9a731813
build(deps): bump github.com/hashicorp/memberlist from 0.4.0 to 0.5.0 (#15080)
Bumps [github.com/hashicorp/memberlist](https://github.com/hashicorp/memberlist) from 0.4.0 to 0.5.0.
- [Release notes](https://github.com/hashicorp/memberlist/releases)
- [Commits](https://github.com/hashicorp/memberlist/compare/v0.4.0...v0.5.0)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/memberlist
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-10-31 08:44:27 -05:00