Commit graph

24599 commits

Author SHA1 Message Date
James Rasell a53f9a4094
docs: fix-up legacy link in client config page. (#16678) 2023-03-28 09:32:34 +01:00
Tobias Birkefeld 581eba9f41
docs: fix link of Read Stats API (#16673)
The former link results in a 404. Update the link to the correct developer docs.
2023-03-28 08:49:44 +01:00
James Rasell 28c142c1a6
dev: account for non-vagrant machines on Linux config priv. (#16657) 2023-03-27 17:13:18 +01:00
Juana De La Cuesta 320884b8ee
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.

* style: refactor force run function

* fix: remove defer and inline unlock for speed optimization

* Update nomad/leader.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* style: refactor tests to use must

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* fix: move back from defer to calling unlock before returning.

createEval can't be called with the lock held

* style: refactor test to use must

* added new entry to changelog and update comments

---------

Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-27 17:25:05 +02:00
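The locking note in the commit above ("createEval can't be called with the lock held") can be illustrated with a minimal sketch. The `dispatcher` type and its fields are hypothetical stand-ins, not Nomad's actual code; only the names `createEval` and `forceEval` echo the commit messages, and their bodies here are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher is a hypothetical stand-in for a periodic dispatcher.
type dispatcher struct {
	mu       sync.Mutex
	running  map[string]bool // job ID -> a child instance is still running
	launched []string        // job IDs for which an evaluation was created
}

// createEval takes the lock itself, so callers must not hold it.
func (d *dispatcher) createEval(jobID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.launched = append(d.launched, jobID)
	d.running[jobID] = true
}

// forceEval launches jobID unless overlap is prohibited and a child is
// still running. It reports whether a launch happened.
func (d *dispatcher) forceEval(jobID string, prohibitOverlap bool) bool {
	d.mu.Lock()
	if prohibitOverlap && d.running[jobID] {
		d.mu.Unlock()
		return false
	}
	// Unlock explicitly instead of deferring: createEval re-acquires the
	// same mutex and would deadlock if the lock were still held here.
	d.mu.Unlock()
	d.createEval(jobID)
	return true
}

func main() {
	d := &dispatcher{running: map[string]bool{}}
	fmt.Println(d.forceEval("backup", true)) // first launch
	fmt.Println(d.forceEval("backup", true)) // skipped, child still running
}
```

The sketch leaves a small window between the unlock and `createEval` in which another goroutine could run; whether that matters depends on how callers are serialized, which this example does not model.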
Juana De La Cuesta 21b675244e
style: rename ForceRun to ForceEval, for clarity (#16617) 2023-03-27 15:38:48 +02:00
Luiz Aoqui 8070882c4b
scheduler: fix reconciliation of reconnecting allocs (#16609)
When a disconnected client reconnects, the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.

This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.

This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.

This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier in the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.

It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the `untainted` set.

The `SystemScheduler` is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.
2023-03-24 19:38:31 -04:00
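The difference between searching only the untainted subset and searching the full set of allocations, as described in the commit above, can be sketched as follows. The `alloc` type, its fields, and both helper functions are hypothetical simplifications and do not mirror the reconciler's real data structures.

```go
package main

import "fmt"

// alloc is a simplified, hypothetical allocation record.
type alloc struct {
	ID        string
	Previous  string // ID of the allocation this one was created to replace
	Untainted bool   // whether the alloc is in the untainted set
}

// replacementsOf returns the IDs of candidates that were created to
// replace any of the reconnecting allocations.
func replacementsOf(reconnecting, candidates []alloc) []string {
	ids := make(map[string]bool, len(reconnecting))
	for _, a := range reconnecting {
		ids[a.ID] = true
	}
	var out []string
	for _, c := range candidates {
		if ids[c.Previous] {
			out = append(out, c.ID)
		}
	}
	return out
}

// untaintedOnly models the narrower search described as the bug.
func untaintedOnly(all []alloc) []alloc {
	var out []alloc
	for _, a := range all {
		if a.Untainted {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	reconnecting := []alloc{{ID: "a1"}}
	all := []alloc{
		{ID: "a1"},
		{ID: "r1", Previous: "a1", Untainted: true},
		{ID: "r2", Previous: "a1", Untainted: false},
	}
	fmt.Println(replacementsOf(reconnecting, untaintedOnly(all))) // misses r2
	fmt.Println(replacementsOf(reconnecting, all))                // finds r1 and r2
}
```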
ron-savoia 743414739d
docs: added section of needed ACL rules for Nomad UI (#16494) 2023-03-24 08:57:16 -04:00
Luiz Aoqui e5d31bca61
cli: job restart command (#16278)
Implement the new `nomad job restart` command that allows operators to
restart allocation tasks or reschedule the entire allocation.

Restarts can be batched to target multiple allocations in parallel.
Between each batch the command can stop and hold for a predefined time
or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.

Co-authored-by: Shishir Mahajan <smahajan@roblox.com>
2023-03-23 18:28:26 -04:00
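The batching behaviour described in the commit above (restart a batch in parallel, then hold before the next batch) might look roughly like the sketch below. It is not the `nomad job restart` implementation; `splitBatches`, `restartInBatches`, and the `restart` and `hold` callbacks are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// splitBatches cuts ids into consecutive groups of at most size elements.
func splitBatches(ids []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

// restartInBatches restarts each batch in parallel and calls hold between
// consecutive batches. It stops early if hold returns false and returns
// the number of allocations restarted.
func restartInBatches(ids []string, size int, restart func(string), hold func() bool) int {
	done := 0
	batches := splitBatches(ids, size)
	for i, batch := range batches {
		var wg sync.WaitGroup
		for _, id := range batch {
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				restart(id)
			}(id)
		}
		wg.Wait()
		done += len(batch)
		if i < len(batches)-1 && !hold() {
			break
		}
	}
	return done
}

func main() {
	ids := []string{"a1", "a2", "a3", "a4", "a5"}
	n := restartInBatches(ids, 2,
		func(id string) { fmt.Println("restarting", id) },
		func() bool { fmt.Println("holding before next batch"); return true },
	)
	fmt.Println("restarted:", n)
}
```

The `hold` callback stands in for either option the commit mentions: sleeping for a predefined time or asking the user to confirm.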
Luiz Aoqui 4ccd999304
ci: send notification when prepare is complete (#16627) 2023-03-23 17:34:45 -04:00
Tim Gross 977c88dcea
drainer: test refactoring to clarify behavior around delete/down nodes (#16612)
This changeset refactors the tests of the draining node watcher so that we don't
mock the node watcher's `Remove` and `Update` methods for its own tests. Instead
we'll mock the node watcher's dependencies (the job watcher and deadline
notifier) and now unit tests can cover the real code. This allows us to remove a
bunch of TODOs in `watch_nodes.go` around testing and clarify some important
behaviors:

* Nodes that are down or disconnected will still be watched until the scheduler
  decides what to do with their allocations. This will drive the job watcher but
  not the node watcher, and that lets the node watcher gracefully handle cases
  where a heartbeat fails but the node heartbeats again before its allocs can be
  evicted.

* Stop watching nodes that have been deleted. The blocking query for nodes set
  the maximum index to the highest index of a node it found, rather than the
  index of the nodes table. This misses updates to the index from deleting
  nodes. This was done as a performance optimization to avoid excessive
  unblocking, but because the query is over all nodes anyways there's no
  optimization to be had here. Remove the optimization so we can detect deleted
  nodes without having to wait for an update to an unrelated node.
2023-03-23 14:07:09 -04:00
Michael Schurter 5e6799164f
Post 1.5.2 release (#16614)
* Generate files for 1.5.2 release

* Prepare for next release

* add 1.4.7 and 1.3.12 to the changelog

---------

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2023-03-22 14:23:38 -07:00
Phil Renaud 11de45d17b
[ui] Copyable server and client attribute values (#16548)
* Copyable server and client attribute values

* Changelog
2023-03-22 15:05:01 -04:00
Juana De La Cuesta 5892839c83
Fix broken test for quotas CLI (#16610)
* fix: fix broken test

* fix: fix broken test for quota status
2023-03-22 19:07:37 +01:00
James Rasell 7dd1484757
docs: detail support for Nomad checks in service block. (#16598) 2023-03-22 09:27:58 +01:00
Michael Schurter d2aa8fcdc7
taskapi: use HasSuffix to detect errors from rpcs (#16594)
Matches the "normal" HTTP error detection logic in the same file.
2023-03-21 14:38:07 -07:00
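A minimal sketch of why a suffix match can be needed for errors returned from RPCs, as in the commit above: once an error crosses an RPC boundary only its message survives, so identity checks no longer match. The sentinel `errPermissionDenied`, the `overRPC` helper, and the "rpc error: " prefix are assumptions for illustration, not the identifiers or wire format used in the Nomad source.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errPermissionDenied is a hypothetical sentinel error.
var errPermissionDenied = errors.New("Permission denied")

// overRPC mimics an error crossing an RPC boundary: the original value
// is lost and only its message, with a prefix, comes back.
func overRPC(err error) error {
	return errors.New("rpc error: " + err.Error())
}

// identityMatches uses errors.Is, which needs the original error value.
func identityMatches(err error) bool {
	return errors.Is(err, errPermissionDenied)
}

// isPermissionDenied matches on the message suffix instead.
func isPermissionDenied(err error) bool {
	return err != nil && strings.HasSuffix(err.Error(), errPermissionDenied.Error())
}

func main() {
	err := overRPC(errPermissionDenied)
	fmt.Println(identityMatches(err))    // false
	fmt.Println(isPermissionDenied(err)) // true
}
```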
Michael Schurter 4678dc7b4d
e2e: sleep to ensure logs are picked up (#16596)
:(
2023-03-21 14:10:50 -07:00
Tim Gross ad774ccfa1
E2E: fix events tests (#16595)
In #12916 we updated the events test as part of a larger set of changes around
mapstructure serialization fixes. But the changes to the jobs we're deploying in
the tests had invalid task configs so they never result in good deployments and
the test will always fail. Make the before/after jobs identical (except for the
version bump) and make them valid. Also wait for allocations for the 2nd job run
to appear before checking the deployment list, so that we don't race with the
scheduler.
2023-03-21 14:01:40 -07:00
Michael Schurter 15fe2ade18
Windows fixes for e2e tests (#16592)
* e2e: skip task api test when windows too old

* e2e: don't run proxy on windows
2023-03-21 13:55:32 -07:00
Suselz b3d2ec7634
Update csi_plugin.mdx (#16584)
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-21 16:16:18 +01:00
Tim Gross 1763622dfd
contrib: architecture guide to the drainer (#16569)
The drainer component is fairly complex. As part of upcoming work to fix some of
the drainer's rough edges, document the drainer's architecture from a Nomad
developer perspective.
2023-03-21 09:17:24 -04:00
Luiz Aoqui 518fd610b3
changelog: update #16427 to improvement (#16565)
The security fix in Go 1.20.2 does not apply to Nomad.
2023-03-20 21:24:53 -04:00
Michael Schurter f8884d8b52
client/metadata: fix crasher caused by AllowStale = false (#16549)
Fixes #16517

Given a 3 Server cluster with at least 1 Client connected to Follower 1:

If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.

The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.

Follower 1, seeing AllowStale=false, will forward the request to the
Leader.

The Leader, not being connected to... well hopefully you get the
picture: an infinite loop occurs.
2023-03-20 16:32:32 -07:00
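The forwarding loop described in the commit above can be modelled in a few lines. This is a toy simulation under stated assumptions: the `request` type and `route` function are hypothetical, the hop limit exists only so the example terminates, and the assumption that a follower serves the request locally when `AllowStale` is true is an inference from the commit text rather than something it states.

```go
package main

import "fmt"

// request is a hypothetical stand-in for a NodeMeta RPC request.
type request struct {
	AllowStale bool
}

// route follows a request that arrives at Follower 1, the only server
// connected to the target client. It returns the number of forwards
// before the request is handled, or -1 if maxHops forwards were made
// without anyone handling it.
func route(req request, maxHops int) int {
	const leader, follower1 = "leader", "follower1"
	at := follower1
	for hops := 0; hops < maxHops; hops++ {
		switch at {
		case follower1:
			if req.AllowStale {
				return hops // assumed: handled locally
			}
			at = leader // AllowStale=false: forward to the leader
		case leader:
			at = follower1 // leader has no connection to the client
		}
	}
	return -1
}

func main() {
	fmt.Println(route(request{AllowStale: true}, 10))  // 0
	fmt.Println(route(request{AllowStale: false}, 10)) // -1
}
```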
Mike Nomitch ae99a24de5 Adds public roadmap project to readme 2023-03-20 15:11:38 -07:00
Tim Gross d1b35c6bd0
contrib: mock driver (#16573) 2023-03-20 16:35:32 -04:00
James Rasell 2f4680680f
dev: remove use of cfssl and use Nomad CLI for TLS certs. (#16145) 2023-03-20 17:06:15 +01:00
James Rasell 4825b40e9a
docs: remove Java and Scala SDKs from supported list. (#16555) 2023-03-20 15:35:02 +01:00
Phil Renaud ccce4b68f2
[ui] Perform common job tasks with keyboard shortcuts (#16378)
* Throw your mouse into traffic

* Add node metadata with a shortcut

* Re-labelled

* Adds a toast notification to job start/stop on keyboard shortcut

* Typo fix
2023-03-20 09:24:39 -04:00
Juana De La Cuesta 47be374bbd
Add -json flag to quota inspect command (#16478)
* Added `-json` and `-t` flag to `quota inspect` command

* cli[style]: small refactor to avoid confusion with tmpl variable

* Update inspect.mdx

* cli: add changelog entry

* Update .changelog/16478.txt

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update command/quota_inspect.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-20 10:40:51 +01:00
Juana De La Cuesta ed44f50091
cli: add -json and -t flags to quota status command (#16485)
* cli: add json and t flags to quota status command

* cli: add entry to changelog

* Update command/quota_status.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-20 10:39:56 +01:00
Juana De La Cuesta eeb3766575
cli: Add -json and -t flags to server members command (#16444)
* cli: Add -json and -t flags to server members

* Update website/content/docs/commands/server/members.mdx

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update website/content/docs/commands/server/members.mdx

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* cli: update the server members tests to use must

* cli: add flags addition to changelog

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-20 10:39:24 +01:00
Adam Pugh e4e53872be
Spelling update (#16553)
updated propogating to propagating
2023-03-20 09:24:41 +01:00
Seth Hoenig d6dcc53c0a
tls enforcement flaky tests (#16543)
* tests: add WaitForLeaders helpers using must/wait timings

* tests: start servers for mtls tests together

Fixes #16253 (hopefully)
2023-03-17 14:11:13 -05:00
Piotr Kazmierczak 0a2b425eb5
cli: nomad login command should not require a -type flag and should respect default auth method (#16504)
The nomad login command does not need to know the ACL Auth Method's type, since
all method names are unique.

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-17 19:14:28 +01:00
Seth Hoenig 07543f8bdf
nsd: always set deregister flag after deregistration of group (#16289)
* services: always set deregister flag after deregistration of group

This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).

In the tests ... we used to assert on the wrong behavior (remove twice) which
has now been corrected to assert we remove only once.

This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.

* services: cleanup group service hook tests
2023-03-17 09:44:21 -05:00
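The double-deregistration bug in the commit above comes down to a guard flag that was not set on every path. A minimal sketch of the corrected shape follows; the `groupServiceHook` type here, its fields, and the `PreKill` and `Postrun` entry points are simplified assumptions and do not reproduce the real hook.

```go
package main

import (
	"fmt"
	"sync"
)

// groupServiceHook is a hypothetical, stripped-down hook.
type groupServiceHook struct {
	mu           sync.Mutex
	deregistered bool
	removals     int // how many times services were actually removed
}

// deregister removes services at most once. The flag is set on every
// path that performs a removal, so later callers become no-ops.
func (h *groupServiceHook) deregister() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.deregistered {
		return
	}
	h.removals++ // stands in for the real service removal
	h.deregistered = true
}

// PreKill and Postrun both try to deregister; during an alloc
// replacement both may fire for the same hook.
func (h *groupServiceHook) PreKill() { h.deregister() }
func (h *groupServiceHook) Postrun() { h.deregister() }

func main() {
	h := &groupServiceHook{}
	h.PreKill()
	h.Postrun()
	fmt.Println(h.removals) // 1
}
```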
Piotr Kazmierczak 14927e93bc
acl: fix canonicalization of OIDC auth method mock (#16534) 2023-03-17 15:37:54 +01:00
James Rasell 4a5d7d3793
docs: add binding-rule selector escape example on Windows PS (#16273) 2023-03-17 15:13:35 +01:00
Michael Schurter a875bad6e5
Enable ACLs on E2E test clients (#16530)
* e2e: uniformly enable acls across all agents

* docs: clarify that acls should be set everywhere
2023-03-16 14:22:41 -07:00
Tim Gross ec47b245d0
client: don't use Status RPC for Consul discovery (#16490)
In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.

Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
2023-03-16 15:38:33 -04:00
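The retry policy in the last sentence of the commit above (rediscover and retry immediately on a "no servers" or "no path to region" response instead of waiting 15s) could be expressed as a small decision function. The error values, the `retryDelay` name, and the `errTimeout` example are hypothetical; only the 15-second interval and the two response kinds come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors for the responses named in the commit.
var (
	errNoServers    = errors.New("no servers")
	errNoRegionPath = errors.New("no path to region")
	errTimeout      = errors.New("timeout") // any other failure
)

// normalRetry is the regular wait between registration attempts.
const normalRetry = 15 * time.Second

// retryDelay decides how long to wait before the next registration
// attempt and whether server discovery should be kicked off first.
func retryDelay(err error) (delay time.Duration, rediscover bool) {
	if errors.Is(err, errNoServers) || errors.Is(err, errNoRegionPath) {
		return 0, true // rediscover and retry immediately
	}
	return normalRetry, false
}

func main() {
	for _, err := range []error{errNoServers, errNoRegionPath, errTimeout} {
		delay, rediscover := retryDelay(err)
		fmt.Printf("%v: wait %v, rediscover=%v\n", err, delay, rediscover)
	}
}
```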
Seth Hoenig 5b1970468e
artifact: git needs more files for private repositories (#16508)
* landlock: git needs more files for private repositories

This PR fixes artifact downloading so that git may work when cloning from
private repositories. It needs

- file read on /etc/passwd
- dir read on /root/.ssh
- file write on /root/.ssh/known_hosts

Add these rules to the landlock rules for the artifact sandbox.

* cr: use nonexistent instead of devnull

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>

* cr: use go-homedir for looking up home directory

* pr: pull go-homedir into explicit require

* cr: fixup homedir tests in homeless root cases

* cl: fix root test for real

---------

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-03-16 12:22:25 -05:00
Michael Schurter 81b8c52472
docs: dispatch_payload and jobs api docs had some weirdness (#16514)
* docs: dispatch_payload docs had some weirdness

Docs said "Examples" when there was only 1 example. Not sure what the
floating "to" in the description was for.

* docs: missing a heading level on jobs api docs
2023-03-16 09:42:46 -07:00
Seth Hoenig d2e8fb626a
artifact: do not set process attributes on darwin (#16511)
This PR fixes the non-root macOS use case where artifact downloads
stopped working. It seems setting a Credential on a SysProcAttr
used by the exec package will always cause fork/exec to fail -
even if the credential contains our own UID/GID or nil UID/GID.

Technically we do not need to set this as the child process will
inherit the parent UID/GID anyway... and not setting it makes
things work again ... /shrug
2023-03-16 11:31:18 -05:00
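One way to express "do not set process attributes on darwin", as in the commit above, is to build the attributes conditionally. The `sysProcAttr` helper below is a hypothetical sketch, not the function used in Nomad, and real code might use build-tagged files instead of a runtime check. It compiles on Unix-like systems only, since `syscall.Credential` is not available on Windows.

```go
package main

import (
	"fmt"
	"syscall"
)

// sysProcAttr returns the attributes to attach to a child process.
// On darwin it returns nil so the child simply inherits the parent's
// UID/GID; elsewhere it sets an explicit credential.
func sysProcAttr(goos string, uid, gid uint32) *syscall.SysProcAttr {
	if goos == "darwin" {
		return nil
	}
	return &syscall.SysProcAttr{
		Credential: &syscall.Credential{Uid: uid, Gid: gid},
	}
}

func main() {
	fmt.Println(sysProcAttr("darwin", 1000, 1000) == nil) // true
	fmt.Println(sysProcAttr("linux", 1000, 1000) != nil)  // true
}
```

A caller would assign the result to the `SysProcAttr` field of an `exec.Cmd`, typically passing `runtime.GOOS` as the first argument; a nil value leaves the defaults in place.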
Seth Hoenig 25944cbb7d
artifact: use specific version link for zipbomb artifact (#16513)
Fix the e2e case where we download the go-getter bomb.zip test file, which
is being removed on main. We can still get it from the version tag - yay git!
2023-03-16 10:18:46 -05:00
James Rasell 184733a126
build: fix test-nomad make target when running locally. (#16506) 2023-03-16 09:32:14 +01:00
Daniel Bennett 0331dd71ca
test: set BuildDate in default TestAgent config (#16499)
so enterprise tests don't fail due to the default zero time
2023-03-15 11:47:15 -05:00
James Rasell b0a3964e6b
cli: fix login help output formatting. (#16502) 2023-03-15 13:23:26 +01:00
Seth Hoenig ed7177de76
scheduler: annotate tasksUpdated with reason and purge DeepEquals (#16421)
* scheduler: annotate tasksUpdated with reason and purge DeepEquals

* cr: move opaque into helper

* cr: swap affinity/spread hashing for slice equal

* contributing: update checklist-jobspec with notes about struct methods

* cr: add more cases to wait config equal method

* cr: use reflect when comparing envoy config blocks

* cl: add cl
2023-03-14 09:46:00 -05:00
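The idea in the commit above, replacing a reflective deep-equality check with explicit comparisons that report why two tasks differ, can be sketched like this. The `task` and `comparison` types, the field set, and the reason strings are hypothetical; only the name `tasksUpdated` is taken from the commit title, and its body here is an assumption.

```go
package main

import "fmt"

// task is a hypothetical, reduced task definition.
type task struct {
	Driver string
	CPU    int
	Env    map[string]string
}

// comparison reports whether something changed and, if so, what.
type comparison struct {
	Modified bool
	Reason   string
}

// envEqual compares two maps key by key without reflection.
func envEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// tasksUpdated checks each field explicitly and names the first
// difference it finds.
func tasksUpdated(a, b task) comparison {
	switch {
	case a.Driver != b.Driver:
		return comparison{Modified: true, Reason: "task driver"}
	case a.CPU != b.CPU:
		return comparison{Modified: true, Reason: "task cpu"}
	case !envEqual(a.Env, b.Env):
		return comparison{Modified: true, Reason: "task env"}
	}
	return comparison{}
}

func main() {
	old := task{Driver: "docker", CPU: 100, Env: map[string]string{"K": "v"}}
	updated := task{Driver: "docker", CPU: 200, Env: map[string]string{"K": "v"}}
	fmt.Printf("%+v\n", tasksUpdated(old, updated))
	fmt.Printf("%+v\n", tasksUpdated(old, old))
}
```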
Anthony 6a7e22d546
Merge pull request #16484 from hashicorp/tunzor-patch-1
Update for enterprise trial wording and link
2023-03-14 10:19:29 -04:00
Anthony 9a3d2924e4
Updated trial license link and wording 2023-03-14 09:31:06 -04:00
Juana De La Cuesta c235bafa3f
cli: Add -json and -t flags to namespace status command (#16442)
* cli: Add -json and -t flag to namespace status command

* Update command/namespace_status.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* cli: update tests for namespace status command to use must

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-14 14:23:04 +01:00
Tim Gross 16b731e456
docs: clarify migration behavior under nomad alloc stop (#16468) 2023-03-14 09:00:29 -04:00