Commit graph

23626 commits

Author SHA1 Message Date
Seth Hoenig fd9744b9eb
Merge pull request #14301 from hashicorp/b-fix-check-status-test-racey
testing: fix flakey check status test
2022-08-25 08:30:46 -05:00
James Rasell 601588df6b
Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-25 12:14:29 +01:00
James Rasell 2293dcf35f
Merge pull request #14291 from hashicorp/f-gh-13120-sso-various-small-fixes
acl: three small fixes for CLI and state consistency
2022-08-25 12:05:32 +02:00
James Rasell 7a0798663d
acl: fix a bug where roles could be duplicated by name.
An ACL roles name must be unique, however, a bug meant multiple
roles of the same same could be created. This fixes that problem
with checks in the RPC handler and state store.
2022-08-25 09:20:43 +01:00
Luiz Aoqui 31ab7964bd
ui: task lifecycle restart all tasks (#14223)
Now that tasks that have finished running can be restarted, the UI needs
to use the actual task state to determine which CSS class to use when
rendering the task lifecycle chart element.
2022-08-24 18:43:44 -04:00
Luiz Aoqui e012d9411e
Task lifecycle restart (#14127)
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.

This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
2022-08-24 17:43:07 -04:00
Tim Gross c732b215f0
vault: detect namespace change in config reload (#14298)
The `namespace` field was not included in the equality check between old and new
Vault configurations, which meant that a Vault config change that only changed
the namespace would not be detected as a change and the clients would not be
reloaded.

Also, the comparison for boolean fields such as `enabled` and
`allow_unauthenticated` was on the pointer and not the value of that pointer,
which results in spurious reloads in real config reload that is easily missed in
typical test scenarios.

Includes a minor refactor of the order of fields for `Copy` and `Merge` to match
the struct fields in hopes it makes it harder to make this mistake in the
future, as well as additional test coverage.
2022-08-24 17:03:29 -04:00
Seth Hoenig 6398548ebd
Merge pull request #14283 from hashicorp/f-java-corretto-test-case
drivers/java: add parsing test case for corretto 17
2022-08-24 15:28:20 -05:00
Seth Hoenig 5e18c7b5b2
Merge pull request #14297 from hashicorp/b-logmon-fork-mystery-bin
client/logmon: acquire executable in init block
2022-08-24 15:25:09 -05:00
Seth Hoenig ff59b90d41 testing: fix flakey check status test
This PR fixes a flakey test where we did not wait on the check
status to actually become failing (go too fast and you just get
a pending check).

Instead add a helper for waiting on any check in the alloc to become
the state we are looking for.
2022-08-24 15:11:41 -05:00
Seth Hoenig 062c817450 cleanup: move fs helpers into escapingfs 2022-08-24 14:45:34 -05:00
Seth Hoenig 423ea1a5c4 client/logmon: acquire executable in init block
This PR causes the logmon task runner to acquire the binary of the
Nomad executable in an 'init' block, so as to almost certainly get
the name while the nomad file still exists.

This is an attempt at fixing the case where a deleted Nomad file
(e.g. during upgrade) may be getting renamed with a mysterious
suffix first.

If this doesn't work, as a last resort we can literally just trim
the mystery string.

Fixes: #14079
2022-08-24 13:17:20 -05:00
Piotr Kazmierczak 7077d1f9aa
template: custom change_mode scripts (#13972)
This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707
2022-08-24 17:43:01 +02:00
Luiz Aoqui 848f2dcc22
changelog: update #14212 to breaking-change (#14292) 2022-08-24 11:36:53 -04:00
Seth Hoenig bff6c88683 cleanup: remove more copies of min/max from helper 2022-08-24 09:56:15 -05:00
Luiz Aoqui ea1802ffa0
deps: sync versions of go-discover in go.mod (#14269)
In #13491 the version of `go-discover` was updated in `go.mod` but the
comment above it mentions that it also needs to be updated in the
`replace` directive.
2022-08-24 10:32:13 -04:00
Piotr Kazmierczak 077b6e7098
docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212)
This PR documents a change made in the enterprise version of nomad that addresses the following issue:

When a user tries to filter audit logs, they do so with a stanza that looks like the following:

audit {
  enabled = true

  filter "remove deletes" {
    type = "HTTPEvent"
    endpoints  = ["*"]
    stages = ["OperationComplete"]
    operations = ["DELETE"]
  }
}

When specifying both an "endpoint" and a "stage", the events with both matching a "endpoint" AND a matching "stage" will be filtered.

When specifying both an "endpoint" and an "operation" the events with both matching a "endpoint" AND a matching "operation" will be filtered.

When specifying both a "stage" and an "operation" the events with a matching a "stage" OR a matching "operation" will be filtered.

The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".
2022-08-24 16:31:49 +02:00
Seth Hoenig 0d97a94814 drivers/java: add parsing test case for corretto 17 2022-08-24 09:16:38 -05:00
James Rasell 2ccc48c167
cli: use policy flag for role creation and update. 2022-08-24 15:15:02 +01:00
James Rasell 7401677e4e
cli: output none when a token has no expiration. 2022-08-24 15:14:49 +01:00
Seth Hoenig cfe9db0f66
build: set osusergo build tag by default (#14248)
This PR activates the osuergo build tag in GNUMakefile. This forces the os/user
package to be compiled without CGO. Doing so seems to resolve a race condition
in getpwnam_r that causes alloc creation to hang or panic on `user.Lookup("nobody")`.
2022-08-24 08:11:56 -05:00
James Rasell 9782d6d7ff
acl: allow tokens to lookup linked roles. (#14227)
When listing or reading an ACL role, roles linked to the ACL token
used for authentication can be returned to the caller.
2022-08-24 13:51:51 +02:00
Luiz Aoqui 7ee3de3ea5
fix minor issues found durint ENT merge (#14250) 2022-08-23 17:22:18 -04:00
Michael Schurter 0114bcfe5b
core: move LicenseConfig to shared file (#14247)
This moves LicenseConfig and its Copy method to a shared file so that it can be shared with enterprise code.
2022-08-23 13:44:10 -07:00
Luiz Aoqui af5c01a070
ui: use task state to determine if task is active (#14224)
The current implementation uses the task's finishedAt field to determine
if a task is active of not, but this check is not accurate. A task in
the "pending" state will not have finishedAt value but it's also not
active.

This discrepancy results in some components, like the inline stats chart
of the task row component, to be displayed even whey they shouldn't.
2022-08-23 15:50:40 -04:00
Luiz Aoqui d3be0abf61
ci: fix gofmt on tasklifecycle (#14232) 2022-08-23 15:47:15 -04:00
Tim Gross afb9fe6a4e
docs: fix an anchor link in secure vars docs (#14231) 2022-08-23 10:46:24 -04:00
Seth Hoenig b5427a9f3b
Merge pull request #14215 from hashicorp/docs-update-checks-for-nsd
docs: update check documentation with NSD specifics
2022-08-23 09:23:53 -05:00
Seth Hoenig fb82f11e70
docs: fix checks doc typo
Co-authored-by: Piotr Kazmierczak <phk@mm.st>
2022-08-23 09:23:36 -05:00
Seth Hoenig 60e22d2b13
Merge pull request #14221 from hashicorp/build-require-go1.19
build: go.mod should require go1.19
2022-08-23 07:53:13 -05:00
Luiz Aoqui 7a8cacc9ec
allocrunner: refactor task coordinator (#14009)
The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).

This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.

With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.

This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.
2022-08-22 18:38:49 -04:00
Phil Renaud 662721ce75
Remove a test pause and a lint error from #14199 (#14222) 2022-08-22 16:51:49 -04:00
Tim Gross bf57d76ec7
allow ACL policies to be associated with workload identity (#14140)
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.

This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
2022-08-22 16:41:21 -04:00
Charlie Voiselle 29e63a6cb2
Make var get a blocking query as expected (#14205) 2022-08-22 16:37:21 -04:00
Charlie Voiselle b79447d013
Add .0 to make goenv happy (#14218) 2022-08-22 16:36:02 -04:00
Seth Hoenig 3fd0ec6569
Merge pull request #14219 from hashicorp/e2e-nsd-checks
e2e: add e2e tests for nomad service disco checks
2022-08-22 15:31:35 -05:00
Seth Hoenig 38727b6ab9 e2e: add e2e tests for nomad service disco checks
This PR adds 2 e2e tests for ensuring nomad service discovery checks
get created and produce status results as expected.
2022-08-22 15:31:13 -05:00
Luiz Aoqui dbffdca92e
template: use pointer values for gid and uid (#14203)
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.

In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
2022-08-22 16:25:49 -04:00
Michael Schurter d36e0c02c9
client: stats need latest allocdir (#14204)
In #14139 this code was changed to use the original copy of the config,
but Config.AllocDir is updated in the `Client.init()` method for dev
agents.

This uses the latest version of the alloc dir (which cannot change
further at runtime without a client restart which would reinitialize
the stats collector as well).
2022-08-22 09:28:53 -07:00
Seth Hoenig ea6d010790 docs: update check documentation with NSD specifics
This PR updates the checks documentation to mention support for checks
when using the Nomad service provider. There are limitations of NSD
compared to Consul, and those configuration options are now noted as
being Consul-only.
2022-08-22 10:50:26 -05:00
Phil Renaud fcf2c40c60
[ui] Allocation route services table: show task-level services (#14199)
Adds service fragments to allocations and union taskGroup and task services
2022-08-22 11:45:12 -04:00
James Rasell 2736cf0dfa
acl: make listing RPC and HTTP API a stub return object. (#14211)
Making the ACL Role listing return object a stub future-proofs the
endpoint. In the event the role object grows, we are not bound by
having to return all fields within the list endpoint or change the
signature of the endpoint to reduce the list return size.
2022-08-22 17:20:23 +02:00
Seth Hoenig 09e275ccba
Merge pull request #14196 from hashicorp/f-nsd-checks-in-alloc-status
cli: display nomad service check status output in CLI commands
2022-08-22 08:33:56 -05:00
James Rasell d9eaf15c79
readme: remove Gitter lobby link. (#14195)
gitter is not an officially supported forum, so we should not link
to it from the readme.
2022-08-22 10:33:20 +02:00
James Rasell 802d005ef5
acl: add replication to ACL Roles from authoritative region. (#14176)
ACL Roles along with policies and global token will be replicated
from the authoritative region to all federated regions. This
involves a new replication loop running on the federated leader.

Policies and roles may be replicated at different times, meaning
the policies and role references may not be present within the
local state upon replication upsert. In order to bypass the RPC
and state check, a new RPC request parameter has been added. This
is used by the replication process; all other callers will trigger
the ACL role policy validation check.

There is a new ACL RPC endpoint to allow the reading of a set of
ACL Roles which is required by the replication process and matches
ACL Policies and Tokens. A bug within the ACL Role listing RPC has
also been fixed which returned incorrect data during blocking
queries where a deletion had occurred.
2022-08-22 08:54:07 +02:00
Seth Hoenig 9bce3a2e36 build: go.mod should require go1.19
Since we started using atomic.Pointer, we should specify the go1.19
requirement in our go.mod files.
2022-08-21 20:41:49 -05:00
Michael Schurter 26637ab55d
core: fix race mutating jobs in scaling api (#14192)
Since the state store returns a pointer to the shared job structs in
memdb we must always copy it before mutating it and applying the new
version via raft. Otherwise if the rpc fails before the mutated job is
committed to raft (either due to validation, bug, crash, or other exit
condition), the leader server will have an updated copy of the job that
other servers will not have.
2022-08-19 15:46:54 -07:00
Seth Hoenig 88a1353149 cli: display nomad service check status output in CLI commands
This PR adds some NSD check status output to the CLI.

1. The 'nomad alloc status' command produces nsd check summary output (if present)
2. The 'nomad alloc checks' sub-command is added to produce complete nsd check output (if present)
2022-08-19 09:18:29 -05:00
dependabot[bot] 05d943ed51
build(deps): bump github.com/shoenig/test from 0.3.0 to 0.3.1 in /api (#14194)
Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.3.0 to 0.3.1.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](https://github.com/shoenig/test/compare/v0.3.0...v0.3.1)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-08-19 11:32:59 +02:00
Michael Schurter 3b57df33e3
client: fix data races in config handling (#14139)
Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked.

This change takes the following approach to safely handling configs in the client:

1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex
2. Since the mutex *only protects the config pointer itself and not fields inside the Config struct:* all config mutation must be done on a *copy* of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates.
3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion.
4. To facilitate deep copying I made an *internally backward incompatible API change:* our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"
2022-08-18 16:32:04 -07:00