Commit graph

2054 commits

Author SHA1 Message Date
Seth Hoenig f1b902beac
consul: do not re-register already registered services (#14917)
This PR updates Nomad's Consul service client to do map comparisons
using maps.Equal instead of reflect.DeepEqual. The bug fix is in how
DeepEqual treats nil slices different from empty slices, when actually
they should be treated the same.
2022-10-18 08:10:59 -05:00
Seth Hoenig 306b4dd38e
cleanup: remove another string-set helper function (#14902) 2022-10-17 14:14:52 -05:00
Michael Schurter 45ce8c13cf
client: remove unused LogOutput and LogLevel (#14867)
* client: remove unused LogOutput

* client: remove unused config.LogLevel
2022-10-11 09:24:40 -07:00
Seth Hoenig 5e38a0e82c
cleanup: rename Equals to Equal for consistency (#14759) 2022-10-10 09:28:46 -05:00
hc-github-team-nomad-core 4fdcd197c0 Generate files for 1.4.0 release 2022-10-06 09:16:00 -07:00
Seth Hoenig c68ed3b4c8
client: protect user lookups with global lock (#14742)
* client: protect user lookups with global lock

This PR updates Nomad client to always do user lookups while holding
a global process lock. This is to prevent concurrency unsafe implementations
of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo).

* cl: add cl
2022-09-29 09:30:13 -05:00
hc-github-team-nomad-core 2fe5a962f3 Generate files for 1.4.0-rc.1 release 2022-09-27 17:33:32 -04:00
Jorge Marey 584ddfe859
Add Namespace, Job and Group to envoy stats (#14311) 2022-09-22 10:38:21 -04:00
Seth Hoenig 2088ca3345
cleanup more helper updates (#14638)
* cleanup: refactor MapStringStringSliceValueSet to be cleaner

* cleanup: replace SliceStringToSet with actual set

* cleanup: replace SliceStringSubset with real set

* cleanup: replace SliceStringContains with slices.Contains

* cleanup: remove unused function SliceStringHasPrefix

* cleanup: fixup StringHasPrefixInSlice doc string

* cleanup: refactor SliceSetDisjoint to use real set

* cleanup: replace CompareSliceSetString with SliceSetEq

* cleanup: replace CompareMapStringString with maps.Equal

* cleanup: replace CopyMapStringString with CopyMap

* cleanup: replace CopyMapStringInterface with CopyMap

* cleanup: fixup more CopyMapStringString and CopyMapStringInt

* cleanup: replace CopySliceString with slices.Clone

* cleanup: remove unused CopySliceInt

* cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice

* cleanup: replace CopyMap with maps.Clone

* cleanup: run go mod tidy
2022-09-21 14:53:25 -05:00
Luiz Aoqui c3c8ae584f
api: provide more detail on ACL bootstrap request error (#14629) 2022-09-20 21:20:04 -04:00
hc-github-team-nomad-core a3a718e167 Generate files for 1.4.0-beta.1 release 2022-09-14 19:32:18 +00:00
hc-github-team-nomad-core b91437bb68 Generate files for 1.4.0-beta.1 release 2022-09-14 18:59:59 +00:00
Seth Hoenig 5187f92c5e
cleanup: create interface for check watcher and mock it in nsd tests (#14577)
* cleanup: create interface for check watcher and mock it in nsd tests

* cleanup: add comments for check watcher interface
2022-09-14 08:25:20 -05:00
Seth Hoenig 9a943107c7 servicedisco: implement check_restart for nomad service checks
This PR implements support for check_restart for checks registered
in the Nomad service provider.

Unlike Consul, Nomad service checks never report a "warning" status,
and so the check_restart.ignore_warnings configuration is not valid
for Nomad service checks.
2022-09-13 08:59:23 -05:00
Seth Hoenig feff36f3f7 client: refactor check watcher to be reusable
This PR refactors agent/consul/check_watcher into client/serviceregistration,
and abstracts away the Consul-specific check lookups.

In doing so we should be able to reuse the existing check watcher logic for
also watching NSD checks in a followup PR.

A chunk of consul/unit_test.go is removed - we'll cover that in e2e tests
in a follow PR if needed. In the long run I'd like to remove this whole file.
2022-09-12 10:13:31 -05:00
Seth Hoenig 31234d6a62 cleanup: consolidate interfaces for workload restarting
This PR combines two of the same interface definitions around workload restarting
2022-09-09 08:59:04 -05:00
Tim Gross 9259a373cd
remove root keyring install API (#14514)
* keyring rotate API should require put/post method
* remove keyring install API
2022-09-09 08:50:35 -04:00
James Rasell 3fa8b0b270
client: fix RPC forwarding when querying checks for alloc. (#14498)
When querying the checks for an allocation, the request must be
forwarded to the agent that is running the allocation. If the
initial request is made to a server agent, the request can be made
directly to the client agent running the allocation. If the
request is made to a client agent not running the alloc, the
request needs to be forwarded to a server and then the correct
client.
2022-09-08 16:55:23 +02:00
Tim Gross 7921f044e5
migrate autopilot implementation to raft-autopilot (#14441)
Nomad's original autopilot was importing from a private package in Consul. It
has been moved out to a shared library. Switch Nomad to use this library so that
we can eliminate the import of Consul, which is necessary to build Nomad ENT
with the current version of the Consul SDK. This also will let us pick up
autopilot improvements shared with Consul more easily.
2022-09-01 14:27:10 -04:00
Derek Strickland 35e91ff376
Merge release 1.3.5 files (#14425)
* Merge release 1.3.5 files

* Generate files for 1.3.5 release

* Prepare for next release

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2022-08-31 18:31:56 -04:00
Charlie Voiselle 5c0e34dd33
Vars: Update CT dependency to support variables. (#14399)
* Update Consul Template dep to support Nomad vars

* Remove `Peering` config for Consul Testservers
Upgrading to the 1.14 Consul SDK introduces and additional default
configuration—`Peering`—that is not compatible with versions of Consul
before v1.13.0. because Nomad tests against Consul v1.11.1, this
configuration has to be nil'ed out before passing it to the Consul
binary.
2022-08-30 15:26:01 -04:00
Tim Gross cc9b480996
testing: setting env var incompatible with parallel tests (#14405)
Neither the `os.Setenv` nor `t.Setenv` helper are safe to use in parallel tests
because environment variables are process-global. The stdlib panics if you try
to do this. Remove the `ci.Parallel()` call from all tests where we're setting
environment variables.
2022-08-30 14:49:03 -04:00
Tim Gross 5784fb8c58
search: enforce correct ACL for search over variables (#14397) 2022-08-30 13:27:31 -04:00
Seth Hoenig 52de2dc09d
Merge pull request #14290 from hashicorp/cleanup-more-helper-cleanup
cleanup: tidy up helper package some more
2022-08-30 08:19:48 -05:00
James Rasell 755b4745ed
Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-30 08:59:13 +01:00
Tim Gross 62a968f443
Merge pull request #14351 from hashicorp/variables-rename
Variables rename
2022-08-29 11:36:50 -04:00
Michael Schurter dbffe22465
consul: allow stale namespace results (#12953)
Nomad reconciles services it expects to be registered in Consul with
what is actually registered in the local Consul agent. This is necessary
to prevent leaking service registrations if Nomad crashes at certain
points (or if there are bugs).

When Consul has namespaces enabled, we must iterate over each available
namespace to be sure no services were leaked into non-default
namespaces.

Since this reconciliation happens often, there's no need to require
results from the Consul leader server. In large clusters this creates
far more load than the "freshness" of the response is worth.

Therefore this patch switches the request to AllowStale=true
2022-08-26 16:05:12 -07:00
Tim Gross 1dc053b917 rename SecureVariables to Variables throughout 2022-08-26 16:06:24 -04:00
Tim Gross dcfd31296b file rename 2022-08-26 16:06:24 -04:00
Luiz Aoqui ad84b22a72
Post 1.3.4 release (#14329)
* Generate files for 1.3.4 release

* Prepare for next release

* Update CHANGELOG.md

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2022-08-26 10:09:13 -04:00
Charlie Voiselle ad737d008b
SV API: return upserted variable to caller (#14325)
* Return created variable to caller in HTTP and Go APIs
* Update tests for returned values
2022-08-25 17:38:15 -04:00
James Rasell 601588df6b
Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-25 12:14:29 +01:00
Luiz Aoqui e012d9411e
Task lifecycle restart (#14127)
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.

This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
2022-08-24 17:43:07 -04:00
Tim Gross c732b215f0
vault: detect namespace change in config reload (#14298)
The `namespace` field was not included in the equality check between old and new
Vault configurations, which meant that a Vault config change that only changed
the namespace would not be detected as a change and the clients would not be
reloaded.

Also, the comparison for boolean fields such as `enabled` and
`allow_unauthenticated` was on the pointer and not the value of that pointer,
which results in spurious reloads in real config reload that is easily missed in
typical test scenarios.

Includes a minor refactor of the order of fields for `Copy` and `Merge` to match
the struct fields in hopes it makes it harder to make this mistake in the
future, as well as additional test coverage.
2022-08-24 17:03:29 -04:00
Seth Hoenig 062c817450 cleanup: move fs helpers into escapingfs 2022-08-24 14:45:34 -05:00
Piotr Kazmierczak 7077d1f9aa
template: custom change_mode scripts (#13972)
This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707
2022-08-24 17:43:01 +02:00
James Rasell 9782d6d7ff
acl: allow tokens to lookup linked roles. (#14227)
When listing or reading an ACL role, roles linked to the ACL token
used for authentication can be returned to the caller.
2022-08-24 13:51:51 +02:00
Luiz Aoqui 7a8cacc9ec
allocrunner: refactor task coordinator (#14009)
The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).

This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.

With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.

This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.
2022-08-22 18:38:49 -04:00
Luiz Aoqui dbffdca92e
template: use pointer values for gid and uid (#14203)
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.

In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
2022-08-22 16:25:49 -04:00
James Rasell 2736cf0dfa
acl: make listing RPC and HTTP API a stub return object. (#14211)
Making the ACL Role listing return object a stub future-proofs the
endpoint. In the event the role object grows, we are not bound by
having to return all fields within the list endpoint or change the
signature of the endpoint to reduce the list return size.
2022-08-22 17:20:23 +02:00
James Rasell 802d005ef5
acl: add replication to ACL Roles from authoritative region. (#14176)
ACL Roles along with policies and global token will be replicated
from the authoritative region to all federated regions. This
involves a new replication loop running on the federated leader.

Policies and roles may be replicated at different times, meaning
the policies and role references may not be present within the
local state upon replication upsert. In order to bypass the RPC
and state check, a new RPC request parameter has been added. This
is used by the replication process; all other callers will trigger
the ACL role policy validation check.

There is a new ACL RPC endpoint to allow the reading of a set of
ACL Roles which is required by the replication process and matches
ACL Policies and Tokens. A bug within the ACL Role listing RPC has
also been fixed which returned incorrect data during blocking
queries where a deletion had occurred.
2022-08-22 08:54:07 +02:00
Michael Schurter 3b57df33e3
client: fix data races in config handling (#14139)
Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked.

This change takes the following approach to safely handling configs in the client:

1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex
2. Since the mutex *only protects the config pointer itself and not fields inside the Config struct:* all config mutation must be done on a *copy* of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates.
3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion.
4. To facilitate deep copying I made an *internally backward incompatible API change:* our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"
2022-08-18 16:32:04 -07:00
Piotr Kazmierczak b63944b5c1
cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Seth Hoenig 7728cf5a9a
Merge pull request #14132 from hashicorp/build-update-go1.19
build: update to go1.19
2022-08-16 11:20:27 -05:00
Seth Hoenig b3ea68948b build: run gofmt on all go source files
Go 1.19 will forecefully format all your doc strings. To get this
out of the way, here is one big commit with all the changes gofmt
wants to make.
2022-08-16 11:14:11 -05:00
Seth Hoenig 56b0b456dc
Merge pull request #14102 from hashicorp/cleanup-mesh-gateway-value
cleanup: consul mesh gateway type need not be pointer
2022-08-16 10:07:16 -05:00
Tim Gross 4005759d28
move secure variable conflict resolution to state store (#13922)
Move conflict resolution implementation into the state store with a new Apply RPC. 
This also makes the RPC for secure variables much more similar to Consul's KV, 
which will help us support soft deletes in a post-1.4.0 version of Nomad.

Reimplement quotas in the state store functions.

Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2022-08-15 11:19:53 -04:00
Seth Hoenig f9355c29fb cleanup: consul mesh gateway type need not be pointer
This PR changes the use of structs.ConsulMeshGateway to value types
instead of via pointers. This will help in a follow up PR where we
cleanup a lot of custom comparison code with helper functions instead.
2022-08-13 11:26:58 -05:00
James Rasell 9cd0dd2ff7
http: add ACL Role HTTP endpoints for CRUD actions.
These new endpoints are exposed under the /v1/acl/roles and
/v1/acl/role endpoints.
2022-08-11 08:44:19 +01:00
Luiz Aoqui 815adbada5
Post 1.3.3 release (#14064)
* Generate files for 1.3.3 release

* Prepare for next release

* Merge release 1.3.3 files

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2022-08-09 17:27:29 -04:00