Commit graph

4783 commits

Author SHA1 Message Date
Michael Schurter 43fb0e82dc client: prevent watching stale alloc state (#18612)
When waiting on a previous alloc we must query against the leader before
switching to a stale query with index set.

Also check that the response is fresh before using it, as in #18269.
2023-09-29 14:37:10 -07:00
Michael Schurter 547a95795a client: prevent using stale allocs (#18601)
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs, the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
2023-09-29 14:34:04 -07:00
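The freshness rule from the glossary in #18601 can be sketched as follows. This is a minimal illustration, not Nomad's code: `allocsResponse`, `isFresh`, and `allocsToStop` are hypothetical names, and the query index is assumed to be the index the client has already observed.

```go
package main

import (
	"fmt"
	"sort"
)

// allocsResponse is a hypothetical stand-in for an RPC response. Index is
// the state index the responding server had reached when it answered.
type allocsResponse struct {
	Index    uint64
	AllocIDs []string
}

// isFresh applies the glossary: fresh means index > query index,
// stale means index <= query index.
func isFresh(resp allocsResponse, queryIndex uint64) bool {
	return resp.Index > queryIndex
}

// allocsToStop returns the running allocs that are missing from resp, but
// only when resp is fresh. A stale response never triggers a deletion.
func allocsToStop(running []string, resp allocsResponse, queryIndex uint64) []string {
	if !isFresh(resp, queryIndex) {
		return nil
	}
	present := make(map[string]bool, len(resp.AllocIDs))
	for _, id := range resp.AllocIDs {
		present[id] = true
	}
	var stop []string
	for _, id := range running {
		if !present[id] {
			stop = append(stop, id)
		}
	}
	sort.Strings(stop)
	return stop
}

func main() {
	running := []string{"alloc-created-at-123"}
	stale := allocsResponse{Index: 100} // server lagging behind the client
	fresh := allocsResponse{Index: 130} // server ahead of the client
	fmt.Println(allocsToStop(running, stale, 123))
	fmt.Println(allocsToStop(running, fresh, 123))
}
```

In the scenario from the commit message, the response at index 100 is ignored, so the alloc created at index 123 keeps running.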
hc-github-team-nomad-core a6ecf954b0
backport of commit 7bd5c6e84eef890cebdb404d9cb2e281919d4529 (#18555)
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-09-21 17:16:14 -05:00
hc-github-team-nomad-core a2f56797a0
backport of commit 4895d708b438b42e52fd54a128f9ec4cb6d72277 (#18531)
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-09-18 14:29:29 -05:00
hc-github-team-nomad-core 46b4847885
backport of commit c6dbba7cde911bb08f1f8da445a44a0125cd2047 (#18505)
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-09-14 14:38:05 -05:00
hc-github-team-nomad-core 6ae643a3bf
backport of commit 12580c345a89312542c18878680dd581da3d44eb (#18479)
Co-authored-by: Shantanu Gadgil <shantanugadgil@users.noreply.github.com>
2023-09-13 10:16:07 -04:00
hc-github-team-nomad-core 156db8d368
backport of commit 668dc5f7a767e85d62379e3e02405d2afa93f1db (#18448)
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-09-11 13:22:30 +01:00
hc-github-team-nomad-core a7f85c804f
backport of commit 22cbb913db0fa1cbb4e24d197b067d64ea02739a (#18437)
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-09-08 17:10:25 -05:00
hc-github-team-nomad-core ef780825d4
backport of commit 05c332221471d39053eaecafe4832ddd6e1b3b89 (#18365)
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-08-30 09:05:57 -05:00
hc-github-team-nomad-core 4b59840bb1
backport of commit d0a93f12d1ec1e2b276f9958898c9a6fe4f6b077 (#18351)
Co-authored-by: Matthew Salsamendi <matthewsalsamendi@gmail.com>
2023-08-28 19:44:39 -04:00
hc-github-team-nomad-core d8ff618c40
backport of commit f25480c9e929c27476c8930f05832e8b96167660 (#18341)
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
2023-08-25 16:36:35 -07:00
James Rasell 3730b66d8c
test: use correct parallel test setup func (#18326) (#18330) 2023-08-25 14:48:06 +01:00
hc-github-team-nomad-core 621bce1da2
backport of commit 14a38bee7bc4386e74157f6a99f3db7382d7e6a5 (#18275)
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-08-21 16:34:32 -04:00
Tim Gross 0a19fe3b60 fix multiple overflow errors in exponential backoff (#18200)
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow
in all places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
2023-08-15 14:39:09 -04:00
Seth Hoenig a45b689d8e update go1.21 (#18184)
* build: update to go1.21

* go: eliminate helpers in favor of min/max

* build: run go mod tidy

* build: swap depguard for semgrep

* command: fixup broken tls error check on go1.21
2023-08-15 14:40:33 +02:00
Charlie Voiselle bac4d112d1 [dep] bump golang.org/x/exp (#18102)
There are some refactorings that have to be made in the getter and state
where the api changed in `slices`

* Bump golang.org/x/exp
* Bump golang.org/x/exp in api
* Update job_endpoint_test
* [feedback] unexport sort function
2023-08-03 15:14:39 -04:00
hc-github-team-nomad-core 9301daa8e8
backport of commit a3a637ee8efe5e1251f60f781369bd9052c4d4a2 (#18132)
This pull request was automerged via backport-assistant
2023-08-02 08:47:19 -05:00
hc-github-team-nomad-core b75f552246
fingerprint: fix 'default' alias not added to interface specified by network_interface (#18096) (#18116)
Co-authored-by: Kevin Schoonover <github@kschoon.me>
2023-08-01 08:38:03 -04:00
hc-github-team-nomad-core 2ed92e0c6c
Backport of feature: Add new field render_templates on restart block into release/1.6.x (#18094)
This pull request was automerged via backport-assistant
2023-07-28 13:54:00 -05:00
James Rasell b8cb1e79a3
chore(lint): use Go stdlib variables for HTTP methods and status codes (#17968) (#18074)
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
2023-07-26 16:38:39 +01:00
hc-github-team-nomad-core 02c2f1a50f
Backport of Retain task states for post stop tasks at the time of node GC into release/1.6.x (#18033)
This pull request was automerged via backport-assistant
2023-07-21 12:55:29 -05:00
hc-github-team-nomad-core b1bfb59394
Backport of metrics: report task memory_max value into release/1.6.x (#18004)
This pull request was automerged via backport-assistant
2023-07-19 15:50:34 -05:00
hc-github-team-nomad-core b7689e87ec
Backport of nsd: retain query params in HTTP health checks into release/1.6.x (#18003)
This pull request was automerged via backport-assistant
2023-07-19 15:47:02 -05:00
hc-github-team-nomad-core e5fb6fe687
backport of commit 615e76ef3c23497f768ebd175f0c624d32aeece8 (#17993)
This pull request was automerged via backport-assistant
2023-07-19 13:31:14 -05:00
Michael Schurter c82f439a6d
remove empty file (#17853) 2023-07-10 16:34:10 -07:00
Tim Gross ad7355e58b
CSI: persist previous mounts on client to restore during restart (#17840)
When claiming a CSI volume, we need to ensure the CSI node plugin is running
before we send any CSI RPCs. This extends even to the controller publish RPC
because it requires the storage provider's "external node ID" for the
client. This primarily impacts client restarts but also is a problem if the node
plugin exits (and fingerprints) while the allocation that needs a CSI volume
claim is being placed.

Unfortunately there's no mapping of volume to plugin ID available in the
jobspec, so we don't have enough information to wait on plugins until we either
get the volume from the server or retrieve the plugin ID from data we've
persisted on the client.

If we always require getting the volume from the server before making the claim,
a client restart for disconnected clients will cause all the allocations that
need CSI volumes to fail. Even while connected, checking in with the server to
verify the volume's plugin before trying to make a claim RPC is inherently racy,
so we'll leave that case as-is and it will fail the claim if the node plugin
needed to support a newly-placed allocation is flapping such that the node
fingerprint is changing.

This changeset persists a minimum subset of data about the volume and its plugin
in the client state DB, and retrieves that data during the CSI hook's prerun to
avoid re-claiming and remounting the volume unnecessarily.

This changeset also updates the RPC handler to use the external node ID from the
claim whenever it is available.

Fixes: #13028
2023-07-10 13:20:15 -04:00
Devashish Taneja 0d9dee3cbe
Include parent job ID as a Docker container label (#17843)
Fixes: #17751
2023-07-10 11:27:45 -04:00
Seth Hoenig 4452f0623b
env/aws: updates from ec2info (#17835) 2023-07-07 10:12:05 -05:00
Yorick Gersie 3e66291b0e
cni: ensure to setup CNI addresses in deterministic order (#17766)
* cni: ensure to setup CNI addresses in deterministic order

  Currently, as commented in the code, the go-cni library returns an unordered map
  of interfaces. When multiple CNI interfaces are created, this causes a problem
  with service registration and health checking because the first address in the
  map is used.

  The use case we have where this is an issue is that we run CNI with the macvlan
  plugin to isolate workloads, but they still need to be able to access the host on
  a static address to be able to perform local resolving and hit host services like
  the Consul agent API. There are two ways to make this work: either add a
  macvlan interface on the host with an assigned address for each VLAN you have, or
  create an additional veth bridged interface in the container namespace.
  We chose the latter option through a custom CNI plugin but the ordering issue
  leaves us with incorrect service registration.

* Updates after feedback

 * First check for the CNIResult interfaces length, if it's zero we don't need to proceed
   at all.
 * Use sorted interfaces list for the address fallback scenario as well.
 * Remove "found" log message logic, when an address isn't found an error is returned stating
   the allocation could not be configured as an address was missing from the CNIResult. If we
   still need a Warn message then we can add it to the condition that returns the error if no
   address could be found instead of using the "found" bool logic.
2023-07-06 13:25:29 -07:00
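The ordering fix in #17766 amounts to iterating the interface map by sorted name rather than in Go's randomized map order. The sketch below illustrates that idea only: the `iface` type, its `Sandbox` field, and `firstAddress` are hypothetical simplifications, and preferring interfaces that have a sandbox set is an assumption, not something the commit message states.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// iface is a hypothetical, simplified view of one interface in a CNI result.
type iface struct {
	Sandbox string   // assumed: non-empty for interfaces inside the container
	Addrs   []string // addresses assigned to the interface
}

// firstAddress picks an address deterministically by visiting interface
// names in sorted order, both in the main pass and in the fallback pass.
func firstAddress(ifaces map[string]iface) (name, addr string, err error) {
	if len(ifaces) == 0 {
		return "", "", errors.New("no interfaces in CNI result")
	}
	names := make([]string, 0, len(ifaces))
	for n := range ifaces {
		names = append(names, n)
	}
	sort.Strings(names)

	// Main pass: first sandboxed interface with an address.
	for _, n := range names {
		if i := ifaces[n]; i.Sandbox != "" && len(i.Addrs) > 0 {
			return n, i.Addrs[0], nil
		}
	}
	// Fallback pass: any interface with an address, still in sorted order.
	for _, n := range names {
		if i := ifaces[n]; len(i.Addrs) > 0 {
			return n, i.Addrs[0], nil
		}
	}
	return "", "", errors.New("no address found in CNI result")
}

func main() {
	ifaces := map[string]iface{
		"eth1": {Sandbox: "/var/run/netns/x", Addrs: []string{"10.0.0.2"}},
		"eth0": {Sandbox: "/var/run/netns/x", Addrs: []string{"10.0.0.1"}},
	}
	fmt.Println(firstAddress(ifaces))
}
```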
Patric Stout ebb363d43e
metrics: add "total_ticks_count" for CPU metrics (#17579)
This counter tells you the total amount of ticks for that CPU
entry since the start of Nomad.
2023-07-05 10:28:55 -04:00
Tim Gross f65a925096
adjust prioritized client updates (#17541)
In #17354 we made client updates prioritized to reduce client-to-server
traffic. When the client has no previously-acknowledged update we assume that
the update is of typical priority; although we don't know that for sure, in
practice an allocation will never become healthy quickly enough that the first
update we send is the update saying the alloc is healthy.

But that doesn't account for allocations that quickly fail in an unrecoverable
way because of allocrunner hook failures, and it'd be nice to be able to send
those failure states to the server more quickly. This changeset does so and adds
some extra comments on reasoning behind priority.
2023-06-26 09:14:24 -04:00
grembo 7936c1e33f
Add disable_file parameter to job's vault stanza (#13343)
This complements the `env` parameter, so that the operator can author
tasks that don't share their Vault token with the workload when using 
`image` filesystem isolation. As a result, more powerful tokens can be used 
in a job definition, allowing it to use template stanzas to issue all kinds of 
secrets (database secrets, Vault tokens with very specific policies, etc.), 
without sharing that issuing power with the task itself.

This is accomplished by creating a directory called `private` within
the task's working directory, which shares many properties of
the `secrets` directory (tmpfs where possible, not accessible by
`nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the
container.

If the `disable_file` parameter is set to `false` (its default), the Vault token
is also written to the NOMAD_SECRETS_DIR, so the default behavior is
backwards compatible. Even if the operator never changes the default,
they will still benefit from the improved behavior of Nomad never reading
the token back in from that - potentially altered - location.
2023-06-23 15:15:04 -04:00
James Rasell b9440965db
client: remove unused nsd check allocation result diff func (#17695) 2023-06-23 15:26:06 +01:00
Tim Gross 11216d09af
client: send node secret with every client-to-server RPC (#16799)
In Nomad 1.5.3 we fixed a security bug that allowed bypass of ACL checks if the
request came through a client node first. But this fix broke (knowingly) the
identification of many client-to-server RPCs. These will now be measured as if
they were anonymous. The reason for this is that many client-to-server RPCs do
not send the node secret and instead rely on the protection of mTLS.

This changeset ensures that the node secret is being sent with every
client-to-server RPC request. In a future version of Nomad we can add
enforcement on the server side, but this was left out of this changeset to
reduce risks to the safe upgrade path.

Sending the node secret as an auth token introduces a new problem during initial
introduction of a client. Clients send many RPCs concurrently with
`Node.Register`, but until the node is registered the node secret is unknown to
the server and will be rejected as invalid. This causes permission denied
errors.

To fix that, this changeset introduces a gate on having successfully made a
`Node.Register` RPC before any other RPCs can be sent (except for `Status.Ping`,
which we need earlier but which also ignores the error because that handler
doesn't do an authorization check). This ensures that we only send requests with
a node secret already known to the server. This also makes client startup a
little easier to reason about because we know `Node.Register` must succeed
first, and it should make for a good place to hook in future plans for secure
introduction of nodes. The tradeoff is that an existing client that has running
allocs will take slightly longer (a second or two) to transition to ready after
a restart, because the transition in `Node.UpdateStatus` is gated at the server
by first submitting `Node.UpdateAlloc` with client alloc updates.
2023-06-22 11:06:49 -04:00
Seth Hoenig 5138c5b99e
client: do not disable memory swappiness if kernel does not support it (#17625)
* client: do not disable memory swappiness if kernel does not support it

This PR adds a workaround for very old Linux kernels which do not support
the memory swappiness interface file. Normally we write a "0" to the file
to explicitly disable swap. In the case the kernel does not support it,
give libcontainer a nil value so it does not write anything.

Fixes #17448

* client: detect swappiness by writing to the file

* fixup changelog

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-06-22 09:36:31 -05:00
VishnuJin 67efb19e94
fingerprint: added windows os.build attribute to host fingerprint (#17576) 2023-06-21 10:53:50 -04:00
Patric Stout 4767d44b94
Fix DevicesSets being removed when cpusets are reloaded with cgroup v2 (#17535)
* Fix DevicesSets being removed when cpusets are reloaded with cgroup v2

This meant that if any allocation was created or removed, all
active DevicesSets were removed from all cgroups of all tasks.

This was most noticeable with "exec" and "raw_exec", as it meant
they no longer had access to /dev files.

* e2e: add test for verifying cgroups do not interfere with access to devices

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-06-15 09:39:36 -05:00
Tim Gross dc9fae34ca
node pools: add pool as label on client metrics (#17528)
This changeset adds the node pool as a label anywhere we're already emitting
labels with additional information such as node class or ID about the client.
2023-06-14 15:58:38 -04:00
Luiz Aoqui ec80d051d8
client: fix panic on alloc stop in non-Linux environments (#17515)
Provide a no-op implementation of the drivers.DriverNetworkManager
interface to be used by systems that don't support network isolation and
prevent panics where a network manager is expected.
2023-06-14 10:22:38 -04:00
Seth Hoenig 557a6b4a5e
docker: stop network pause container of lost alloc after node restart (#17455)
This PR fixes a bug where the docker network pause container would not be
stopped and removed in the case where a node is restarted, the alloc is
moved to another node, and the node comes back up. See the issue below for
full repro conditions.

Basically in the DestroyNetwork PostRun hook we would depend on the
NetworkIsolationSpec field not being nil - which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state and previously
would not be able to find the pause container to remove. Now, we manually
find the pause container by scanning them and looking for the associated
allocID.

Fixes #17299
2023-06-09 08:46:29 -05:00
Seth Hoenig 134e70cbab
client: fix client panic during drain caused by shutdown (#17450)
During shutdown of a client with drain_on_shutdown there is a race between
the Client ending the cgroup and the task's cpuset manager cleaning up
the cgroup. During the path traversal, skip anything we cannot read, which
avoids the nil DirEntry we try to dereference now.
2023-06-07 15:12:44 -05:00
Jerome Eteve c26f01eefd
client checks kernel module in /sys/module for WSL2 bridge networking (#17306) 2023-06-06 10:26:50 -04:00
Seth Hoenig d1d4d22f8e
test: ensure cpuset cgroup is setup before fingerprinting (#17428)
This PR fixes a racy test where we need to ensure the cpuset cgroup
is set up before trying to fingerprint it.
2023-06-05 14:15:00 -05:00
hashicorp-copywrite[bot] 0f4532f138
[COMPLIANCE] Add Copyright and License Headers (#17429)
Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
2023-06-05 13:23:59 -04:00
Luiz Aoqui 6039c18ab6
node pools: register a node in a node pool (#17405) 2023-06-02 17:50:50 -04:00
Tim Gross 06972fae0c
prioritized client updates (#17354)
The allocrunner sends several updates to the server during the early lifecycle
of an allocation and its tasks. Clients batch-up allocation updates every 200ms,
but experiments like the C2M challenge have shown that even with this batching,
servers can be overwhelmed with client updates during high volume
deployments. Benchmarking done in #9451 has shown that client updates can easily
represent ~70% of all Nomad Raft traffic.

Each allocation sends many updates during its lifetime, but only those that
change the `ClientStatus` field are critical for progressing a deployment or
kicking off a reschedule to recover from failures.

Add a priority to the client allocation sync and update the `syncTicker`
receiver so that we only send an update if there's a high priority update
waiting, or on every 5th tick. This means when there are no high priority
updates, the client will send updates at most every 1s instead of
200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as
well as reduce client-to-server RPC traffic.

This changeset also switches from a channel-based collection of updates to a
shared buffer, so as to split batching from sending and prevent backpressure
onto the allocrunner when the RPC is slow. This doesn't have a major performance
benefit in the benchmarks but makes the implementation of the prioritized update
simpler.

Fixes: #9451
2023-05-31 15:34:16 -04:00
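The send rule from #17354 (send when a high priority update is waiting, or on every 5th tick) can be sketched as a small decision function called once per 200ms tick. `syncState` and `shouldSend` are hypothetical names, and resetting the tick counter after every send is an assumption made for this sketch.

```go
package main

import "fmt"

// ticksPerForcedSync mirrors the "every 5th tick" rule: with a 200ms
// ticker, low priority updates go out at most once per second.
const ticksPerForcedSync = 5

// syncState is a hypothetical tracker for the client allocation sync loop.
type syncState struct {
	ticks int
}

// shouldSend is called once per tick. It returns true when an urgent
// update is waiting or when enough ticks have passed since the last send.
func (s *syncState) shouldSend(urgent bool) bool {
	s.ticks++
	if urgent || s.ticks >= ticksPerForcedSync {
		s.ticks = 0 // assumption: any send restarts the count
		return true
	}
	return false
}

func main() {
	s := &syncState{}
	for tick := 1; tick <= 10; tick++ {
		urgent := tick == 7 // e.g. a ClientStatus change arrives on tick 7
		fmt.Printf("tick %d urgent=%v send=%v\n", tick, urgent, s.shouldSend(urgent))
	}
}
```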
Luiz Aoqui bb2395031b
client: fix Consul version fingerprint (#17349)
Consul v1.13.8 was released with a breaking change in the /v1/agent/self
endpoint version where a line break was being returned.

This caused the Nomad fingerprint to fail because `NewVersion` errors on
parse.

This commit removes any extra space from the Consul version returned by
the API.
2023-05-30 11:07:57 -04:00
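The fix in #17349 boils down to trimming whitespace before parsing. The sketch below uses a strict regular expression as a hypothetical stand-in for the real version parser, so that it runs with the standard library only; `parseStrict` and `consulVersion` are illustrative names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// strictVersion is a stand-in for a strict version parser: it rejects any
// input with leading or trailing whitespace, including a line break.
var strictVersion = regexp.MustCompile(`^\d+\.\d+\.\d+$`)

// parseStrict is a hypothetical parser used only for this illustration.
func parseStrict(v string) (string, error) {
	if !strictVersion.MatchString(v) {
		return "", fmt.Errorf("malformed version: %q", v)
	}
	return v, nil
}

// consulVersion removes surrounding whitespace before parsing, so a
// trailing line break in the API response no longer causes a failure.
func consulVersion(raw string) (string, error) {
	return parseStrict(strings.TrimSpace(raw))
}

func main() {
	raw := "1.13.8\n" // version string with a trailing line break
	_, err := parseStrict(raw)
	fmt.Println("without trim:", err)
	v, err := consulVersion(raw)
	fmt.Println("with trim:", v, err)
}
```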
Seth Hoenig acfdf0f479
compliance: add headers with fixed copywrite tool (#17353)
Closes #17117
2023-05-30 09:20:32 -05:00
Charlie Voiselle 86e04a4c6c
[core] nil check and error handling for client status in heartbeat responses (#17316)
Add a nil check to constructNodeServerInfoResponse to manage an apparent race between deregister and client heartbeats. Fixes #17310
2023-05-25 16:04:54 -04:00
Lance Haig 568da5918b
cli: tls certs not created with correct SANs (#16959)
The `nomad tls cert` command did not create certificates with the correct SANs for
them to work with non-default domain and region names. This changeset updates the
code to support non-default domains and regions in the certificates.
2023-05-22 09:31:56 -04:00