Commit graph

4792 commits

Author SHA1 Message Date
Jerome Eteve c26f01eefd
client checks kernel module in /sys/module for WSL2 bridge networking (#17306) 2023-06-06 10:26:50 -04:00
Seth Hoenig d1d4d22f8e
test: ensure cpuset cgroup is setup before fingerprinting (#17428)
This PR fixes a racey test where we need to ensure the cpuset cgroup
is setup before trying to fingerprint it.
2023-06-05 14:15:00 -05:00
hashicorp-copywrite[bot] 0f4532f138
[COMPLIANCE] Add Copyright and License Headers (#17429)
Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
2023-06-05 13:23:59 -04:00
Luiz Aoqui 6039c18ab6
node pools: register a node in a node pool (#17405) 2023-06-02 17:50:50 -04:00
Tim Gross 06972fae0c
prioritized client updates (#17354)
The allocrunner sends several updates to the server during the early lifecycle
of an allocation and its tasks. Clients batch-up allocation updates every 200ms,
but experiments like the C2M challenge has shown that even with this batching,
servers can be overwhelmed with client updates during high volume
deployments. Benchmarking done in #9451 has shown that client updates can easily
represent ~70% of all Nomad Raft traffic.

Each allocation sends many updates during its lifetime, but only those that
change the `ClientStatus` field are critical for progressing a deployment or
kicking off a reschedule to recover from failures.

Add a priority to the client allocation sync and update the `syncTicker`
receiver so that we only send an update if there's a high priority update
waiting, or on every 5th tick. This means when there are no high priority
updates, the client will send updates at most every 1s instead of
200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as
well as reduce client-to-server RPC traffic.

This changeset also switches from a channel-based collection of updates to a
shared buffer, so as to split batching from sending and prevent backpressure
onto the allocrunner when the RPC is slow. This doesn't have a major performance
benefit in the benchmarks but makes the implementation of the prioritized update
simpler.

Fixes: #9451
2023-05-31 15:34:16 -04:00
Luiz Aoqui bb2395031b
client: fix Consul version finterprint (#17349)
Consul v1.13.8 was released with a breaking change in the /v1/agent/self
endpoint version where a line break was being returned.

This caused the Nomad finterprint to fail because `NewVersion` errors on
parse.

This commit removes any extra space from the Consul version returned by
the API.
2023-05-30 11:07:57 -04:00
Seth Hoenig acfdf0f479
compliance: add headers with fixed copywrite tool (#17353)
Closes #17117
2023-05-30 09:20:32 -05:00
Charlie Voiselle 86e04a4c6c
[core] nil check and error handling for client status in heartbeat responses (#17316)
Add a nil check to constructNodeServerInfoResponse to manage an apparent race between deregister and client heartbeats. Fixes #17310
2023-05-25 16:04:54 -04:00
Lance Haig 568da5918b
cli: tls certs not created with correct SANs (#16959)
The `nomad tls cert` command did not create certificates with the correct SANs for
them to work with non default domain and region names. This changset updates the
code to support non default domains and regions in the certificates.
2023-05-22 09:31:56 -04:00
Seth Hoenig e04ff0d935
client: ignore restart issued to terminal allocations (#17175)
* client: ignore restart issued to terminal allocations

This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

* e2e: add e2e test for alloc restart zombies

* cl: tweak text

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-05-16 10:19:41 -05:00
Tim Gross 6814e8e6d9
drivers: make internal DisableLogCollection capability public (#17196)
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.

This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.

Fixes: #14636 #15686
2023-05-16 09:16:03 -04:00
Tim Gross d018fcbff7
allocrunner: provide factory function so we can build mock ARs (#17161)
Tools like `nomad-nodesim` are unable to implement a minimal implementation of
an allocrunner so that we can test the client communication without having to
lug around the entire allocrunner/taskrunner code base. The allocrunner was
implemented with an interface specifically for this purpose, but there were
circular imports that made it challenging to use in practice.

Move the AllocRunner interface into an inner package and provide a factory
function type. Provide a minimal test that exercises the new function so that
consumers have some idea of what the minimum implementation required is.
2023-05-12 13:29:44 -04:00
Tim Gross 9ed75e1f72
client: de-duplicate alloc updates and gate during restore (#17074)
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:

* The `allocSync` method that updates the servers isn't gated on first contact
  with the servers. This means that if a server updates the desired state while
  the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
  RPC. This will typically result in the client updating the server with "running"
  and then immediately thereafter "complete".

* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
  if it's possible to assert that the server has definitely seen the client
  state. The allocrunner may queue-up updates even if we gate sending them. So
  then we end up with a race between the allocrunner updating its internal state
  to overwrite the previous update and `allocSync` sending the bogus or duplicate
  update.

This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.

The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.

Fixes #16381
2023-05-11 09:05:24 -04:00
Seth Hoenig 6f4992ef29
client: unveil /etc/ssh/ssh_known_hosts for artifact downloads (#17122)
This PR fixes a bug where nodes configured with populated
/etc/ssh/ssh_known_hosts files would be unable to read them during
artifact downloading.

Fixes #17086
2023-05-09 09:43:52 -05:00
Daniel Bennett a7ed6f5c53
full task cleanup when alloc prerun hook fails (#17104)
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.

now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.

removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
 * set task state as dead
 * save task runner local state
 * task stop hooks

also done in tr.Run() now that it's not skipped:
 * handleKill() to kill tasks while respecting
   their shutdown delay, and retrying as needed
   * also includes task preKill hooks
 * clearDriverHandle() to destroy the task
   and associated resources
 * task exited hooks
2023-05-08 13:17:10 -05:00
Tim Gross 17bd930ca9
logs: fix missing allocation logs after update to Nomad 1.5.4 (#17087)
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.

This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.

Fixes #17076
2023-05-04 16:01:18 -04:00
Seth Hoenig e9fec4ebc8
connect: remove unusable path for fallback envoy image names (#17044)
This PR does some cleanup of an old code path for versions of Consul that
did not support reporting the supported versions of Envoy in its API. Those
versions are no longer supported for years at this point, and the fallback
version of envoy hasn't been supported by any version of Consul for almost
as long. Remove this code path that is no longer useful.
2023-05-02 09:48:44 -05:00
Seth Hoenig e8d53ea30b
connect: use explicit docker.io prefix in default envoy image names (#17045)
This PR modifies references to the envoyproxy/envoy docker image to
explicitly include the docker.io prefix. This does not affect existing
users, but makes things easier for Podman users, who otherwise need to
specify the full name because Podman does not default to docker.io
2023-05-02 09:27:48 -05:00
Luiz Aoqui 7b5a8f1fb0
Revert "hashicorp/go-msgpack v2 (#16810)" (#17047)
This reverts commit 8a98520d56eed3848096734487d8bd3eb9162a65.
2023-05-01 17:18:34 -04:00
Seth Hoenig 86f6a38867
connect: do not restrict auto envoy version to docker task driver (#17041)
This PR updates the envoy_bootstrap_hook to no longer disable itself if
the task driver in use is not docker. In other words, make it work for
podman and other image based task drivers. The hook now only checks that

1. the task is a connect sidecar
2. the task.config block contains an "image" field
2023-05-01 15:07:35 -05:00
Seth Hoenig 753c17c9de
services: un-mark group services as deregistered if restart hook runs (#16905)
* services: un-mark group services as deregistered if restart hook runs

This PR may fix a bug where group services will never be deregistered if the
group undergoes a task restart.

* e2e: add test case for restart and deregister group service

* cl: add cl

* e2e: add wait for service list call
2023-04-24 14:24:51 -05:00
Tim Gross 72cbe53f19
logs: allow disabling log collection in jobspec (#16962)
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.

This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.

---

Fixes: #11175

Co-authored-by: Thomas Weber <towe75@googlemail.com>
2023-04-24 10:00:27 -04:00
valodzka 379497a484
fix host port handling for ipv6 (#16723) 2023-04-20 19:53:20 -07:00
Etienne Bruines 1e3531b978
cni: fix plugin fingerprinting versions (#16776)
CNI plugins v1.2.0 and above output a second line, containing supported protocol versions.
2023-04-20 18:44:39 -07:00
astudentofblake 42c4c8d5ea
fix: added landlock access to /usr/libexec for getter (#16900) 2023-04-20 11:16:04 -05:00
Luiz Aoqui fb588fcbb8
allocrunner: prevent panic on network manager (#16921) 2023-04-18 13:39:13 -07:00
Ian Fijolek 619f49afcf
hashicorp/go-msgpack v2 (#16810)
* Upgrade from hashicorp/go-msgpack v1.1.5 to v2.1.0

Fixes #16808

* Update hashicorp/net-rpc-msgpackrpc to v2 to match go-msgpack

* deps: use go-msgpack v2.0.0

go-msgpack v2.1.0 includes some code changes that we will need to
investigate furthere to assess its impact on Nomad, so keeping this
dependency on v2.0.0 for now since it's no-op.

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-04-17 17:02:05 -04:00
Seth Hoenig 92d4a05534
users: eliminate nobody user memoization (#16904)
This PR eliminates code specific to looking up and caching the uid/gid/user.User
object associated with the nobody user in an init block. This code existed before
adding the generic users cache and was meant to optimize the one search path we
knew would happen often. Now that we have the cache, seems reasonable to eliminate
this init block and use the cache instead like for any other user.

Also fixes a constraint on the podman (and other) drivers, where building without
CGO became problematic on some OS like Fedora IoT where the nobody user cannot
be found with the pure-Go standard library.

Fixes github.com/hashicorp/nomad-driver-podman/issues/228
2023-04-17 12:30:30 -05:00
Tim Gross 62548616d4
client: allow drain_on_shutdown configuration (#16827)
Adds a new configuration to clients to optionally allow them to drain their
workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting
itself and then monitors the drain state as seen by the server until the drain
is complete or the deadline expires. If it loses connection with the server, it
will monitor local client status instead to ensure allocations are stopped
before exiting.
2023-04-14 15:35:32 -04:00
James Rasell bc01d47071
consul/connect: fixed a bug where restarting proxy tasks failed. (#16815)
The first start of a Consul Connect proxy sidecar triggers a run
of the envoy_version hook which modifies the task config image
entry. The modification takes into account a number of factors to
correctly populate this. Importantly, once the hook has run, it
marks itself as done so the taskrunner will not execute it again.

When the client receives a non-destructive update for the
allocation which the proxy sidecar is a member of, it will update
and overwrite the task definition within the taskerunner. In doing
so it overwrite the modification performed by the hook. If the
allocation is restarted, the envoy_version hook will be skipped as
it previously marked itself as done, and therefore the sidecar
config image is incorrect and causes a driver error.

The fix removes the hook in marking itself as done to the view of
the taskrunner.
2023-04-11 15:56:03 +01:00
Seth Hoenig ba728f8f97
api: enable support for setting original job source (#16763)
* api: enable support for setting original source alongside job

This PR adds support for setting job source material along with
the registration of a job.

This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.

The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.

The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.

* api: avoid writing var content to disk for parsing

* api: move submission validation into RPC layer

* api: return an error if updating a job submission without namespace or job id

* api: be exact about the job index we associate a submission with (modify)

* api: reword api docs scheduling

* api: prune all but the last 6 job submissions

* api: protect against nil job submission in job validation

* api: set max job source size in test server

* api: fixups from pr
2023-04-11 08:45:08 -05:00
Michael Schurter a407cbb626
Merge pull request #16836 from hashicorp/compliance/add-headers
[COMPLIANCE] Add Copyright and License Headers
2023-04-10 16:32:03 -07:00
Daniel Bennett fa33ee567a
gracefully recover tasks that use csi node plugins (#16809)
new WaitForPlugin() called during csiHook.Prerun,
so that on startup, clients can recover running
tasks that use CSI volumes, instead of them being
terminated and rescheduled because they need a
node plugin that is "not found" *yet*, only because
the plugin task has not yet been recovered.
2023-04-10 17:15:33 -05:00
hashicorp-copywrite[bot] 005636afa0 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Tim Gross 1335543731
ephemeral disk: migrate should imply sticky (#16826)
The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.

The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!

Update so that `migrate=true` implies `sticky=true` as follows:

* The failure mode when client ACLs are enabled comes from the server not passing
  along a migration token. Update the server so that the server provides a
  migration token whenever `migrate=true` and not just when `sticky=true` too.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate.
* Refactor the E2E tests to move them off the old framework and make the intention
  of the test more clear.
2023-04-07 16:33:45 -04:00
James Rasell 15eee99db4
client: ensure envoy version hook uses all pointer receiver funcs. (#16813) 2023-04-06 14:47:00 +01:00
the-nando f541f2e59b
Do not set attributes when spawning the getter child (#16791)
* Do not set attributes when spawning the getter child

* Cleanup

* Cleanup

---------

Co-authored-by: the-nando <the-nando@invalid.local>
2023-04-05 11:47:51 -05:00
Tim Gross 118b703164
CSI: set mounts in alloc hook resources atomically (#16722)
The allocrunner has a facility for passing data written by allocrunner hooks to
taskrunner hooks. Currently the only consumers of this facility are the
allocrunner CSI hook (which writes data) and the taskrunner volume hook (which
reads that same data).

The allocrunner hook for CSI volumes doesn't set the alloc hook resources
atomically. Instead, it gets the current resources and then writes a new version
back. Because the CSI hook is currently the only writer and all readers happen
long afterwards, this should be safe but #16623 shows there's some sequence of
events during restore where this breaks down.

Refactor hook resources so that hook data is accessed via setters and getters
that hold the mutex.
2023-04-03 11:03:36 -04:00
Seth Hoenig 87f4b71df0
client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips (#16672)
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips

This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.

Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.

For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.

Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.

node attributes before

cpu.arch                  = arm64
cpu.modelname             = Apple M2 Pro
cpu.numcores              = 12
cpu.reservablecores       = 0
cpu.totalcompute          = 1000

node attributes after

cpu.arch                  = arm64
cpu.frequency.efficiency  = 2424
cpu.frequency.power       = 3504
cpu.modelname             = Apple M2 Pro
cpu.numcores.efficiency   = 4
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 37728

* fingerprint/cpu: follow up cr items
2023-03-28 08:27:58 -05:00
Seth Hoenig 07543f8bdf
nsd: always set deregister flag after deregistration of group (#16289)
* services: always set deregister flag after deregistration of group

This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).

In the tests ... we used to assert on the wrong behvior (remove twice) which
has now been corrected to assert we remove only once.

This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.

* services: cleanup group service hook tests
2023-03-17 09:44:21 -05:00
Tim Gross ec47b245d0
client: don't use Status RPC for Consul discovery (#16490)
In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.

Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
2023-03-16 15:38:33 -04:00
Seth Hoenig 5b1970468e
artifact: git needs more files for private repositories (#16508)
* landlock: git needs more files for private repositories

This PR fixes artifact downloading so that git may work when cloning from
private repositories. It needs

- file read on /etc/passwd
- dir read on /root/.ssh
- file write on /root/.ssh/known_hosts

Add these rules to the landlock rules for the artifact sandbox.

* cr: use nonexistent instead of devnull

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>

* cr: use go-homdir for looking up home directory

* pr: pull go-homedir into explicit require

* cr: fixup homedir tests in homeless root cases

* cl: fix root test for real

---------

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-03-16 12:22:25 -05:00
Seth Hoenig d2e8fb626a
artifact: do not set process attributes on darwin (#16511)
This PR fixes the non-root macOS use case where artifact downloads
stopped working. It seems setting a Credential on a SysProcAttr
used by the exec package will always cause fork/exec to fail -
even if the credential contains our own UID/GID or nil UID/GID.

Technically we do not need to set this as the child process will
inherit the parent UID/GID anyway... and not setting it makes
things work again ... /shrug
2023-03-16 11:31:18 -05:00
Seth Hoenig a25d3ea792
cgv1: do not disable cpuset manager if reserved interface already exists (#16467)
* cgv1: do not disable cpuset manager if reserved interface already exists

This PR fixes a bug where restarting a Nomad Client on a machine using cgroups
v1 (e.g. Ubuntu 20.04) would cause the cpuset cgroups manager to disable itself.

This is being caused by incorrectly interpreting a "file exists" error as
problematic when ensuring the reserved cpuset exists. If we get a "file exists"
error, that just means the Client was likely restarted.

Note that a machine reboot would fix the issue - the groups interfaces are
ephemoral.

* cl: add cl
2023-03-13 17:00:17 -05:00
Luiz Aoqui 7305a374e3
allocrunner: fix health check monitoring for Consul services (#16402)
Services must be interpolated to replace runtime variables before they
can be compared against the values returned by Consul.
2023-03-10 14:43:31 -05:00
Michael Schurter 0021b282ef
env/aws: update ec2 cpu info data (#16417)
Update AWS EC2 CPU tables using `make ec2info`
2023-03-09 14:33:21 -08:00
Seth Hoenig ff4503aac6
client: disable running artifact downloader as nobody (#16375)
* client: disable running artifact downloader as nobody

This PR reverts a change from Nomad 1.5 where artifact downloads were
executed as the nobody user on Linux systems. This was done as an attempt
to improve the security model of artifact downloading where third party
tools such as git or mercurial would be run as the root user with all
the security implications thereof.

However, doing so conflicts with Nomad's own advice for securing the
Client data directory - which when setup with the recommended directory
permissions structure prevents artifact downloads from working as intended.

Artifact downloads are at least still now executed as a child process of
the Nomad agent, and on modern Linux systems make use of the kernel Landlock
feature for limiting filesystem access of the child process.

* docs: update upgrade guide for 1.5.1 sandboxing

* docs: add cl

* docs: add title to upgrade guide fix
2023-03-08 15:58:43 -06:00
Lance Haig 35c17b2e56
deps: Update ioutil deprecated library references to os and io respectively in the client package (#16318)
* Update ioutil deprecated library references to os and io respectively

* Deal with the errors produced.

Add error handling to filEntry info
Add error handling to info
2023-03-08 13:25:10 -06:00
Luiz Aoqui 2a1a790820
client: don't emit task shutdown delay event if not waiting (#16281) 2023-03-03 18:22:06 -05:00
Farbod Ahmadian 629ac58763
tests: add functionality to skip a test if it's not running in CI and not with root user (#16222) 2023-03-02 13:38:27 -05:00
Tim Gross bb4880ec13
client: use RPC address and not serf after initial Consul discovery (#16217)
Nomad servers can advertise independent IP addresses for `serf` and
`rpc`. Somewhat unexpectedly, the `serf` address is also used for both Serf and
server-to-server RPC communication (including Raft RPC). The address advertised
for `rpc` is only used for client-to-server RPC. This split was introduced
intentionally in Nomad 0.8.

When clients are using Consul discovery for connecting to servers, they get an
initial discovery set from Consul and use the correct `rpc` tag in Consul to get
a list of adddresses for servers. The client then makes a `Status.Peers` RPC to
get the list of those servers that are raft peers. But this endpoint is shared
between servers and clients, and provides the address used for Raft.

Most of the time this is harmless because servers will bind on 0.0.0.0 anyways.,
But in topologies where servers are on a private network and clients are on
separate subnets (or even public subnets), clients will make initial contact
with the server to get the list of peers but then populate their local server
set with unreachable addresses.

Cluster administrators can work around this problem by using `server_join` with
specific IP addresses (or DNS names), because the `Node.UpdateStatus` endpoint
returns the correct set of RPC addresses when updating the node. So once a
client has registered, it will get the correct set of RPC addresses.

This changeset updates the client logic to query `Status.Members` instead of
`Status.Peers`, and then extract the correctly advertised address and port from
the response body.
2023-03-02 13:36:45 -05:00
Michael Schurter bd7b60712e
Accept Workload Identities for Client RPCs (#16254)
This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs.

Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled.

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-02-27 10:17:47 -08:00
Tim Gross 79844048e6
populate Nomad token for task runner update hooks (#16266)
The `TaskUpdateRequest` struct we send to task runner update hooks was not
populating the Nomad token that we get from the task runner (which we do for the
Vault token). This results in task runner hooks like the template hook
overwriting the Nomad token with the zero value for the token. This causes
in-place updates of a task to break templates (but not other uses that rely on
identity but don't currently bother to update it, like the identity hook).
2023-02-27 10:48:13 -05:00
Seth Hoenig 61404b2551
services: Set Nomad's User-Agent by default on HTTP checks for nomad services (#16248) 2023-02-23 08:10:42 -06:00
Seth Hoenig 804f9fdb93
services: ensure task group is set on service hook (#16240)
This PR fixes a bug where the task group information was not being set
on the serviceHook.AllocInfo struct, which is needed later on for calculating
the CheckID of a nomad service check. The CheckID is calculated independently
from multiple callsites, and the information being passed in must be consistent,
including the group name.

The workload.AllocInfo.Group was not set at this callsite, due to the bug fixed in this PR.
 https://github.com/hashicorp/nomad/blob/main/client/serviceregistration/nsd/nsd.go#L114
2023-02-22 10:22:48 -06:00
Seth Hoenig 5d325decca
cgutil: handle panic from runc helper method (#16180)
This PR wraps the cgroups.IsCgroup2UnifiedMode() helper method from
runc in a defer/recover block because it might panic in some cases.

Upstream fix in: https://github.com/opencontainers/runc/pull/3745

Closes #16179
2023-02-14 15:09:43 -06:00
Seth Hoenig 165791dd89
artifact: protect against unbounded artifact decompression (1.5.0) (#16151)
* artifact: protect against unbounded artifact decompression

Starting with 1.5.0, set defaut values for artifact decompression limits.

artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation

artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.

* artifact: assert limits cannot be nil in validation
2023-02-14 09:28:39 -06:00
Seth Hoenig 490c902c62
cni: handle multi-path cni_path when fingerprinting plugins (#16163)
This PR fixes the CNI plugin fingerprinter to take into account the fact
that the cni_path config can be a multi-path (e.g. `/foo:/bar:/baz`).

Accumulate plugins from each of the possible path elements. If scanning
any of the named directory fails, the fingerprinter fails.

Fixes #16083

No CL/BP - has not shipped yet.
2023-02-13 14:55:56 -06:00
Charlie Voiselle 7cfb938723
[chore] Move TestUtil_loadVersionControlGlobalConfigs into build flagged file (#16114) 2023-02-09 14:25:26 -05:00
Seth Hoenig 0e7bf87ee1
deps: upgrade to hashicorp/golang-lru/v2 (#16085) 2023-02-08 15:20:33 -06:00
Seth Hoenig a4519c739d
users: eliminate LookupGroupId and its one use case (#16093)
This PR deletes the user.LookupGroupId function as it was only being used
in a single test case, and its value was not important to the test.
2023-02-08 14:57:09 -06:00
Michael Schurter 35d65c7c7e
Dynamic Node Metadata (#15844)
Fixes #14617
Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted.

Includes new UI for editing metadata as well.

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
2023-02-07 14:42:25 -08:00
Seth Hoenig 590ae08752
main: remove deprecated uses of rand.Seed (#16074)
* main: remove deprecated uses of rand.Seed

go1.20 deprecates rand.Seed, and seeds the rand package
automatically. Remove cases where we seed the random package,
and cleanup the one case where we intentionally create a
known random source.

* cl: update cl

* mod: update go mod
2023-02-07 09:19:38 -06:00
Luiz Aoqui d6bb417795
docs: update default Nomad bridge config (#16072) 2023-02-07 09:47:41 -05:00
Seth Hoenig 68894bdc62
docker: disable driver when running as non-root on cgroups v2 hosts (#16063)
* docker: disable driver when running as non-root on cgroups v2 hosts

This PR modifies the docker driver to behave like exec when being run
as a non-root user on a host machine with cgroups v2 enabled. Because
of how cpu resources are managed by the Nomad client, the nomad agent
must be run as root to manage docker-created cgroups.

* cl: update cl
2023-02-06 14:09:19 -06:00
Michael Schurter 0a496c845e
Task API via Unix Domain Socket (#15864)
This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled.

This PR contains the core feature and tests but followup work is required for the following TODO items:

- Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
- Unit tests for auth middleware
- Caching for auth middleware
- Rate limiting on negative lookups for auth middleware

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-02-06 11:31:22 -08:00
Daniel Bennett 335f0a5371
docs: how to troubleshoot consul connect envoy (#15908)
* largely a doc-ification of this commit message:
  d47678074bf8ae9ff2da3c91d0729bf03aee8446
  this doesn't spell out all the possible failure modes,
  but should be a good starting point for folks.

* connect: add doc link to envoy bootstrap error

* add Unwrap() to RecoverableError
  mainly for easier testing
2023-02-02 14:20:26 -06:00
Charlie Voiselle dec41f7f01
Update networking_bridge_linux.go (#16025)
* Removed line from previous implementation
* remove import

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-02-02 14:03:02 -05:00
Charlie Voiselle cc6f4719f1
Add option to expose workload token to task (#15755)
Add `identity` jobspec block to expose workload identity tokens to tasks.

---------

Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-02-02 10:59:14 -08:00
Charlie Voiselle 4caac1a92f
client: Add option to enable hairpinMode on Nomad bridge (#15961)
* Add `bridge_network_hairpin_mode` client config setting
* Add node attribute: `nomad.bridge.hairpin_mode`
* Changed format string to use `%q` to escape user provided data
* Add test to validate template JSON for developer safety

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-02-02 10:12:15 -05:00
Piotr Kazmierczak 14b53df3b6
renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
舍我其谁 3abb453bd0
volume: Add the missing option propagation_mode (#15626) 2023-01-30 09:32:07 -05:00
Seth Hoenig 0fac4e19b3
client: always run alloc cleanup hooks on final update (#15855)
* client: run alloc pre-kill hooks on last pass despite no live tasks

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

* client: do not run ar cleanup hooks if client is shutting down
2023-01-27 09:59:31 -06:00
Luiz Aoqui de87cdc816
template: restore driver handle on update (#15915)
When the template hook Update() method is called it may recreate the
template manager if the Nomad or Vault token has been updated.

This caused the new template manager did not have a driver handler
because this was only being set on the Poststart hook, which is not
called for inplace updates.
2023-01-27 10:55:59 -05:00
Tim Gross 881a4cfaff
metrics: Add remaining server RPC rate metrics (#15901) 2023-01-27 08:29:53 -05:00
Seth Hoenig caba88128e
artifact: enable reading system git/mercurial configuration (#15903)
This PR adjusts the artifact sandbox on Linux to enable reading from known
system-wide git or mercurial configuration, if they exist.

Folks doing something odd like specifying custom paths for global config will
need to use the standard locations, or disable artifact filesystem isolation.
2023-01-26 13:07:40 -06:00
Seth Hoenig 7375fd40fc
nsd: block on removal of services (#15862)
* nsd: block on removal of services

This PR uses a WaitGroup to ensure workload removals are complete
before returning from ServiceRegistrationHandler.RemoveWorkload of
the nomad service provider. The de-registration of individual services
still occurs asynchrously, but we must block on the parent removal
call so that we do not race with further operations on the same set
of services - e.g. in the case of a task restart where we de-register
and then re-register the services in quick succession.

Fixes #15032

* nsd: add e2e test for initial failing check and restart
2023-01-26 08:17:57 -06:00
Yorick Gersie 2a5c423ae0
Allow per_alloc to be used with host volumes (#15780)
Disallowing per_alloc for host volumes in some cases makes life of a nomad user much harder.
When we rely on the NOMAD_ALLOC_INDEX for any configuration that needs to be re-used across
restarts we need to make sure allocation placement is consistent. With CSI volumes we can
use the `per_alloc` feature but for some reason this is explicitly disabled for host volumes.

Ensure host volumes understand the concept of per_alloc
2023-01-26 09:14:47 -05:00
Tim Gross 2030d62920
implement pre-forwarding auth on select RPCs (#15513)
In #15417 we added a new `Authenticate` method to the server that returns an
`AuthenticatedIdentity` struct. This changeset implements this method for a
small number of RPC endpoints that together represent all the various ways in
which RPCs are sent, so that we can validate that we're happy with this
approach.
2023-01-24 10:52:07 -05:00
Karl Johann Schubert b773a1b77f
client: add disk_total_mb and disk_free_mb config options (#15852) 2023-01-24 09:14:22 -05:00
Michael Schurter 92c7d96e0a
Add INFO task even log line and make logmon less noisy (#15842)
* client: log task events at INFO level

Fixes #15840

Example INFO level client logs with this enabled:

```
[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stdout.fifo @module=logmon timestamp=2023-01-20T11:19:34.275-0800
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy @module=logmon path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stderr.fifo timestamp=2023-01-20T11:19:34.275-0800
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Started msg="Task started by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
[INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy path=/home/schmichael/go/bin/nomad pid=27668
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Terminated msg="Exit Code: 130, Signal: 2" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killed msg="Task successfully killed" failed=false
[INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/home/schmichael/go/bin/nomad pid=27653
[INFO]  client.gc: marking allocation for GC: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51
```

So task events will approximately *double* the number of per-task log
lines, but I think they add a lot of value.

* client: drop logmon 'opening' from debug->info

Cannot imagine why users care and removes 2 log lines per task
invocation.

```

[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false
<<< 2 "opening fifo" lines elided here >>>
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Started msg="Task started by client" failed=false
```

* docs: add changelog for #15842
2023-01-20 14:35:00 -08:00
Seth Hoenig d2d8ebbeba
consul: correctly interpret missing consul checks as unhealthy (#15822)
* consul: correctly understand missing consul checks as unhealthy

This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, the
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.

In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.

Fixes #15536

* consul: minor CR refactor using maps not sets

* consul: observe transition from healthy to unhealthy checks

* consul: spell healthy correctly
2023-01-19 14:01:12 -06:00
Seth Hoenig 8cd77c14a2
env/aws: update ec2 cpu info data (#15770) 2023-01-13 09:58:23 -06:00
Seth Hoenig fe7795ce16
consul/connect: support for proxy upstreams opaque config (#15761)
This PR adds support for configuring `proxy.upstreams[].config` for
Consul Connect upstreams. This is an opaque config value to Nomad -
the data is passed directly to Consul and is unknown to Nomad.
2023-01-12 08:20:54 -06:00
Seth Hoenig 719eee8112
consul: add client configuration for grpc_ca_file (#15701)
* [no ci] first pass at plumbing grpc_ca_file

* consul: add support for grpc_ca_file for tls grpc connections in consul 1.14+

This PR adds client config to Nomad for specifying consul.grpc_ca_file

These changes combined with https://github.com/hashicorp/consul/pull/15913 should
finally enable Nomad users to upgrade to Consul 1.14+ and use tls grpc connections.

* consul: add cl entgry for grpc_ca_file

* docs: mention grpc_tls changes due to Consul 1.14
2023-01-11 09:34:28 -06:00
Seth Hoenig 83450c8762
vault: configure user agent on Nomad vault clients (#15745)
* vault: configure user agent on Nomad vault clients

This PR attempts to set the User-Agent header on each Vault API client
created by Nomad. Still need to figure a way to set User-Agent on the
Vault client created internally by consul-template.

* vault: fixup find-and-replace gone awry
2023-01-10 10:39:45 -06:00
Seth Hoenig 2a7c7d85a5
artifact: fix sandbox behavior when destination is shared alloc directory (#15712)
This PR fixes the artifact sandbox (new in Nomad 1.5) to allow downloading
artifacts into the shared 'alloc' directory made available to each task in
a common allocation. Previously we assumed the 'alloc' dir would be mounted
under the 'task' dir, but this is only the case in fs isolation: chroot; in
other modes the alloc dir is elsewhere.
2023-01-09 09:46:32 -06:00
Seth Hoenig 7214e21402
ci: swap freeport for portal in packages (#15661) 2023-01-03 11:25:20 -06:00
Tim Gross 989d7d9fcf
csi: avoid a nil pointer when handling plugin events (#15518)
If a plugin crashes quickly enough, we can get into a situation where the
deregister function is called before it's ever registered. Safely handle the
resulting nil pointer in the dynamic registry by not emitting a plugin event,
but also update the plugin event handler to tolerate nil pointers in case we
wire it up elsewhere in the future.
2022-12-12 08:42:57 -05:00
Seth Hoenig be3f89b5f9
artifact: enable inheriting environment variables from client (#15514)
* artifact: enable inheriting environment variables from client

This PR adds client configuration for specifying environment variables that
should be inherited by the artifact sandbox process from the Nomad Client agent.

Most users should not need to set these values but the configuration is provided
to ensure backwards compatability. Configuration of go-getter should ideally be
done through the artifact block in a jobspec task.

e.g.

```hcl
client {
  artifact {
    set_environment_variables = "TMPDIR,GIT_SSH_OPTS"
  }
}
```

Closes #15498

* website: update set_environment_variables text to mention PATH
2022-12-09 15:46:07 -06:00
Seth Hoenig 825c5cc65e
artifact: add client toggle to disable filesystem isolation (#15503)
This PR adds the client config option for turning off filesystem isolation,
applicable on Linux systems where filesystem isolation is possible and
enabled by default.

```hcl
client{
  artifact {
    disable_filesystem_isolation = <bool:false>
  }
}
```

Closes #15496
2022-12-08 12:29:23 -06:00
Seth Hoenig 51a2212d3d
client: sandbox go-getter subprocess with landlock (#15328)
* client: sandbox go-getter subprocess with landlock

This PR re-implements the getter package for artifact downloads as a subprocess.

Key changes include

On all platforms, run getter as a child process of the Nomad agent.
On Linux platforms running as root, run the child process as the nobody user.
On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock).
On all platforms, restrict environment variables of the child process to a static set.
notably TMP/TEMP now points within the allocation's task directory
kernel.landlock attribute is fingerprinted (version number or unavailable)
These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector.

Adds new e2e/artifact suite for ensuring artifact downloading works.

TODO: Windows git test (need to modify the image, etc... followup PR)

* landlock: fixup items from cr

* cr: fixup tests and go.mod file
2022-12-07 16:02:25 -06:00
Seth Hoenig 3ed37b0b1d
fingerprint: add fingerprinting for CNI plugins presense and version (#15452)
This PR adds a fingerprinter to set the attribute
"plugins.cni.version.<name>" => "<version>"

for each CNI plugin in <client>.cni_path (/opt/cni/bin by default).
2022-12-05 14:22:47 -06:00
Seth Hoenig a65fbeb3b3
client: manually cleanup leaked iptables rules (#15407)
This PR adds a secondary path for cleaning up iptables created for an allocation
when the normal CNI library fails to do so. This typically happens when the state
of the pause container is unexpected - e.g. deleted out of band from Nomad. Before,
the iptables rules would be leaked which could lead to unexpected nat routing
behavior later on (in addition to leaked resources). With this change, we scan
for the rules created on behalf of the allocation being GC'd and delete them.

Fixes #6385
2022-11-28 11:32:16 -06:00
Sam 4689822628
Fix missing host header in http check (#15337) 2022-11-23 08:58:13 -05:00
James Rasell e2a2ea68fc
client: accommodate Consul 1.14.0 gRPC and agent self changes. (#15309)
* client: accommodate Consul 1.14.0 gRPC and agent self changes.

Consul 1.14.0 changed the way in which gRPC listeners are
configured, particularly when using TLS. Prior to the change, a
single listener was responsible for handling plain-text and
encrypted gRPC requests. In 1.14.0 and beyond, separate listeners
will be used for each, defaulting to 8502 and 8503 for plain-text
and TLS respectively.

The change means that Nomad’s Consul Connect integration would not
work when integrated with Consul clusters using TLS and running
1.14.0 or greater.

The Nomad Consul fingerprinter identifies the gRPC port Consul has
exposed using the "DebugConfig.GRPCPort" value from Consul’s
“/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only
represents the plain-text gRPC port which is likely to be disbaled
in clusters running TLS. In order to fix this issue, Nomad now
takes into account the Consul version and configured scheme to
optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent
self return.

The “consul_grcp_socket” allocrunner hook has also been updated so
that the fingerprinted gRPC port attribute is passed in. This
provides a better fallback method, when the operator does not
configure the “consul.grpc_address” option.

* docs: modify Consul Connect entries to detail 1.14.0 changes.

* changelog: add entry for #15309

* fixup: tidy tests and clean version match from review feedback.

* fixup: use strings tolower func.
2022-11-21 09:19:09 -06:00
Charlie Voiselle c73fb51d3a
[bug] Return a spec on reconnect (#15214)
client: fixed a bug where non-`docker` tasks with network isolation would leak network namespaces and iptables rules if the client was restarted while they were running
2022-11-11 13:27:36 -05:00
Seth Hoenig 21237d8337
client: avoid unconsumed channel in timer construction (#15215)
* client: avoid unconsumed channel in timer construction

This PR fixes a bug introduced in #11983 where a Timer initialized with 0
duration causes an immediate tick, even if Reset is called before reading the
channel. The fix is to avoid doing that, instead creating a Timer with a non-zero
initial wait time, and then immediately calling Stop.

* pr: remove redundant stop
2022-11-11 09:31:34 -06:00
Seth Hoenig 6e3309ebc6
template: protect use of template manager with a lock (#15192)
This PR protects access to `templateHook.templateManager` with its lock. So
far we have not been able to reproduce the panic - but it seems either Poststart
is running without a Prestart being run first (should be impossible), or the
Update hook is running concurrently with Poststart, nil-ing out the templateManager
in a race with Poststart.

Fixes #15189
2022-11-10 08:30:27 -06:00
Derek Strickland 80b6f27efd
api: remove mapstructure tags fromPort struct (#12916)
This PR solves a defect in the deserialization of api.Port structs when returning structs from theEventStream.

Previously, the api.Port struct's fields were decorated with both mapstructure and hcl tags to support the network.port stanza's use of the keyword static when posting a static port value. This works fine when posting a job and when retrieving any struct that has an embedded api.Port instance as long as the value is deserialized using JSON decoding. The EventStream, however, uses mapstructure to decode event payloads in the api package. mapstructure expects an underlying field named static which does not exist. The result was that the Port.Value field would always be set to 0.

Upon further inspection, a few things became apparent.

The struct already has hcl tags that support the indirection during job submission.
Serialization/deserialization with both the json and hcl packages produce the desired result.
The use of of the mapstructure tags provided no value as the Port struct contains only fields with primitive types.
This PR:

Removes the mapstructure tags from the api.Port structs
Updates the job parsing logic to use hcl instead of mapstructure when decoding Port instances.
Closes #11044

Co-authored-by: DerekStrickland <dstrickland@hashicorp.com>
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2022-11-08 11:26:28 +01:00