Commit Graph

24690 Commits

Author SHA1 Message Date
Tim Gross c3d9c598f5 Merge release 1.5.3 files 2023-04-05 12:32:00 -04:00
hc-github-team-nomad-core 3578078caf Prepare for next release 2023-04-05 12:31:42 -04:00
hc-github-team-nomad-core b64ee2726d Generate files for 1.5.3 release 2023-04-05 12:31:30 -04:00
Tim Gross 66a01bb35a upgrade go to 1.20.3 2023-04-05 12:18:19 -04:00
Tim Gross 8278f23042 acl: fix ACL bypass for anon requests that pass thru client HTTP
Requests without an ACL token that pass thru the client's HTTP API are treated
as though they come from the client itself. This allows bypass of ACLs on RPC
requests where ACL permissions are checked (like `Job.Register`). Invalid tokens
are correctly rejected.

Fix the bypass by only setting a client ID on the identity if we have a valid node secret.

Note that this changeset will break rate metrics for RPCs sent by clients
without a client secret such as `Node.GetClientAllocs`; these requests will be
recorded as anonymous.

Future work should:
* Ensure the node secret is sent with all client-driven RPCs except
  `Node.Register` which is TOFU.
* Create a new `acl.ACL` object from client requests so that we
  can enforce ACLs for all endpoints in a uniform way that's less error-prone.~
2023-04-05 12:17:51 -04:00
Juana De La Cuesta 9b4871fece
Prevent kill_timeout greater than progress_deadline (#16761)
* func: add validation for kill timeout smaller than progress dealine

* style: add changelog

* style: typo in changelog

* style: remove refactored test

* Update .changelog/16761.txt

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/structs/structs.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-04-04 18:17:10 +02:00
Seth Hoenig 15a2d912b3
cleanup: use jobID name rather than jobName in job endpoints (#16777)
These endpoints all refer to JobID by the time you get to the RPC request
layer, but the HTTP handler functions call the field JobName, which is a different
field (... though often with the same value).
2023-04-04 09:11:58 -05:00
James Rasell bcfb4ea1f2
cli: fix up failing quota inspect enterprise test. (#16781) 2023-04-04 15:02:40 +01:00
James Rasell cb6ba80f0f
cli: stream both stdout and stderr when following an alloc. (#16556)
This update changes the behaviour when following logs from an
allocation, so that both stdout and stderr files streamed when the
operator supplies the follow flag. The previous behaviour is held
when all other flags and situations are provided.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-04-04 10:42:27 +01:00
Mike Nomitch b5a1051fe6
Merge pull request #16575 from hashicorp/docs-add-roadmap-project
Adds public roadmap project to readme
2023-04-03 08:21:13 -07:00
Tim Gross 118b703164
CSI: set mounts in alloc hook resources atomically (#16722)
The allocrunner has a facility for passing data written by allocrunner hooks to
taskrunner hooks. Currently the only consumers of this facility are the
allocrunner CSI hook (which writes data) and the taskrunner volume hook (which
reads that same data).

The allocrunner hook for CSI volumes doesn't set the alloc hook resources
atomically. Instead, it gets the current resources and then writes a new version
back. Because the CSI hook is currently the only writer and all readers happen
long afterwards, this should be safe but #16623 shows there's some sequence of
events during restore where this breaks down.

Refactor hook resources so that hook data is accessed via setters and getters
that hold the mutex.
2023-04-03 11:03:36 -04:00
Tim Gross 0c582a2c94
docs: fix use of gpg to avoid teeing binary to terminal (#16767) 2023-04-03 10:54:21 -04:00
Tim Gross ffd5435ceb
docs: fix install instructions for apt (#16764)
The workflow described in the docs for apt installation is deprecated. Update to
match the workflow described in the Tutorials and official packaging guide.
2023-04-03 10:06:59 -04:00
Georgy Buranov ca80546ef7
take maximum processor Mhz (#16740)
* take maximum processor Mhz

* remove break

* cl: add cl for 16740

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-03-31 11:25:32 -05:00
Juana De La Cuesta 89baa13b14
Update quota name on failing test for quota status (#16662)
* fix: update quota name on test

* Update quota_status_test.go

* Update quota_status_test.go

* fix: simplify template call for quota status
2023-03-31 18:07:21 +02:00
Juana De La Cuesta 1fc13b83d8
style: update documentation (#16729) 2023-03-31 16:38:16 +02:00
Daniel Bennett c9adc22eec
Update enterprise licensing documentation (#16615)
updated various docs for new expiration behavior
and new command `nomad license inspect` to validate pre-upgrade
2023-03-30 16:40:19 -05:00
Daniel Bennett c42950e342
ent: move all license info into LicenseConfig{} (#16738)
and add new TestConfigForServer() to get a
valid nomad.Config to use in tests

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-03-30 16:15:05 -05:00
Horacio Monsalvo 20372b1721
connect: add meta on ConsulSidecarService (#16705)
Co-authored-by: Sol-Stiep <sol.stiep@southworks.com>
2023-03-30 16:09:28 -04:00
Luiz Aoqui fa4ee68c6a
ci: use `BACKPORT_MERGE_COMMIT` option (#16730)
Instead of attempting to pick each individual commit in a PR using
`BACKPORT_MERGE_COMMIT` only picks the commit that was merged into
`main`.

This reduces the amount of work done during a backport, generating
cleaner merges and avoiding potential issues on specific commits.

With this setting PRs that are not squashed will fail to backport and
must be handled manually, but those are considered exceptions.
2023-03-30 11:49:46 -04:00
Piotr Kazmierczak 1470d2ff62
Merge pull request #15897 from hashicorp/f-sso-jwt-auth-method
acl: JWT as SSO auth method
2023-03-30 17:07:50 +02:00
Piotr Kazmierczak 1a5eba24a6 acl: set minACLJWTAuthMethodVersion to 1.5.3 and adjust code comment 2023-03-30 15:30:42 +02:00
Phil Renaud e9a114e249 [ui] Web sign-in with JWT (#16625)
* Bones of JWT detection

* JWT to token pipeline complete

* Some live-demo fixes for template language

* findSelf and loginJWT funcs made async

* Acceptance tests and mirage mocks for JWT login

* [ui] Allow for multiple JWT auth methods in the UI (#16665)

* Split selectable jwt methods

* repositions the dropdown to be next to the input field
2023-03-30 09:40:12 +02:00
Piotr Kazmierczak d98c8f6759 acl: rebased on main and changed the gate to 1.5.3-dev 2023-03-30 09:40:12 +02:00
Piotr Kazmierczak acfc266c30 acl: JWT changelog entry and typo fix 2023-03-30 09:40:11 +02:00
Piotr Kazmierczak 4609119fb5 acl: JWT auth CLI (#16532) 2023-03-30 09:39:56 +02:00
Piotr Kazmierczak 16b6bd9ff2 acl: fix canonicalization of JWT auth method mock (#16531) 2023-03-30 09:39:56 +02:00
Piotr Kazmierczak 2b353902a1 acl: HTTP endpoints for JWT auth (#16519) 2023-03-30 09:39:56 +02:00
Piotr Kazmierczak e48c48e89b acl: RPC endpoints for JWT auth (#15918) 2023-03-30 09:39:56 +02:00
Piotr Kazmierczak a9230fb0b7 acl: JWT auth method 2023-03-30 09:39:56 +02:00
Tim Gross 76284a09a0
docker: move pause container recovery to after `SetConfig` (#16713)
When we added recovery of pause containers in #16352 we called the recovery
function from the plugin factory function. But in our plugin setup protocol, a
plugin isn't ready for use until we call `SetConfig`. This meant that
recovering pause containers was always done with the default
config. Setting up the Docker client only happens once, so setting the wrong
config in the recovery function also means that all other Docker API calls will
use the default config.

Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error
handling so that we return any error but also don't log when the context is
canceled, which happens twice during normal startup as we fingerprint the
driver.
2023-03-29 16:20:37 -04:00
dependabot[bot] afa9608475
build(deps): bump github.com/opencontainers/runc from 1.1.4 to 1.1.5 (#16712)
* build(deps): bump github.com/opencontainers/runc from 1.1.4 to 1.1.5

Bumps [github.com/opencontainers/runc](https://github.com/opencontainers/runc) from 1.1.4 to 1.1.5.
- [Release notes](https://github.com/opencontainers/runc/releases)
- [Changelog](https://github.com/opencontainers/runc/blob/v1.1.5/CHANGELOG.md)
- [Commits](https://github.com/opencontainers/runc/compare/v1.1.4...v1.1.5)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runc
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* changelog entry

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-03-29 15:05:05 -04:00
Juana De La Cuesta dd770027df
fix: clean the output writter to avoid duplicates when testing for json output (#16619) 2023-03-29 12:05:23 +02:00
Max Fröhlich ba590b081e
docs: mention Nomad Admission Control Proxy (#16702) 2023-03-28 15:18:26 -04:00
Tim Gross f22ff2b847
docs: clarify capabilities options for `docker` driver (#16693)
The `docker` driver cannot expand capabilities beyond the default set when the
task is a non-root user. Clarify this in the documentation of `allow_caps` and
update the `cap_add` and `cap_drop` to match the `exec` driver, which has more
clear language overall.
2023-03-28 13:32:08 -04:00
Elvis Pranskevichus 11a9bb6ce7
drivers/exec: Fix handling of capabilities for unprivileged tasks (#16643)
Currently, the `exec` driver is only setting the Bounding set, which is
not sufficient to actually enable the requisite capabilities for the
task process.  In order for the capabilities to survive `execve`
performed by libcontainer, the `Permitted`, `Inheritable`, and `Ambient`
sets must also be set.

Per CAPABILITIES (7):

> Ambient: This is a set of capabilities that are preserved across an
> execve(2) of a program that is not privileged.  The ambient capability
> set obeys the invariant that no capability can ever be ambient if it
> is not both permitted and inheritable.
2023-03-28 12:12:55 -04:00
James Rasell 17fd1a2e35
dev: make cni, consul, dev, docker, and vault scripts Lima compat. (#16689) 2023-03-28 16:21:14 +01:00
Seth Hoenig 87f4b71df0
client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips (#16672)
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips

This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.

Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.

For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.

Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.

node attributes before

cpu.arch                  = arm64
cpu.modelname             = Apple M2 Pro
cpu.numcores              = 12
cpu.reservablecores       = 0
cpu.totalcompute          = 1000

node attributes after

cpu.arch                  = arm64
cpu.frequency.efficiency  = 2424
cpu.frequency.power       = 3504
cpu.modelname             = Apple M2 Pro
cpu.numcores.efficiency   = 4
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 37728

* fingerprint/cpu: follow up cr items
2023-03-28 08:27:58 -05:00
James Rasell a18e480a57
dev: modify Go install to support arch64 and non-vagrant machines. (#16651) 2023-03-28 14:18:48 +01:00
Tim Gross 78acc75b57
docs: add notes about keyring to snapshot restore (#16663)
When cluster administrators restore from Raft snapshot, they also need to ensure the
keyring is in place. For on-prem users doing in-place upgrades this is less of a
concern but for typical cloud workflows where the whole host is replaced, it's
an important warning (at least until #14852 has been implemented).
2023-03-28 08:31:01 -04:00
Tim Gross a953456460
docs: fix template retry attempts default documentation (#16667)
The configuration docs for `client.template.vault_retry`, `consul_retry`, and
`nomad_retry` incorrectly document the default number of attempts to be
unlimited (0). When we added these config blocks, we defaulted the fields to
`nil` for backwards compatibility, which causes them to fall back to the default
consul-template configuration values.
2023-03-28 08:27:06 -04:00
James Rasell a53f9a4094
docs: fix-up legacy link in client config page. (#16678) 2023-03-28 09:32:34 +01:00
Tobias Birkefeld 581eba9f41
docs: fix link of Read Stats API (#16673)
The former link results in a 404. Update the link to the correct developer docs.
2023-03-28 08:49:44 +01:00
James Rasell 28c142c1a6
dev: account for non-vagrant machines on Linux config priv. (#16657) 2023-03-27 17:13:18 +01:00
Juana De La Cuesta 320884b8ee
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.

* style: refactor force run function

* fix: remove defer and inline unlock for speed optimization

* Update nomad/leader.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* style: refactor tests to use must

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* fix: move back from defer to calling unlock before returning.

createEval cant be called with the lock on

* style: refactor test to use must

* added new entry to changelog and update comments

---------

Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-27 17:25:05 +02:00
Juana De La Cuesta 21b675244e
style: rename ForceRun to ForceEval, for clarity (#16617) 2023-03-27 15:38:48 +02:00
Luiz Aoqui 8070882c4b
scheduler: fix reconciliation of reconnecting allocs (#16609)
When a disconnect client reconnects the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.

This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.

This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.

This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier into the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.

It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the `untainted` set.

The system `SystemScheduler` is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.
2023-03-24 19:38:31 -04:00
ron-savoia 743414739d
docs: added section of needed ACL rules for Nomad UI (#16494) 2023-03-24 08:57:16 -04:00
Luiz Aoqui e5d31bca61
cli: job restart command (#16278)
Implement the new `nomad job restart` command that allows operators to
restart allocations tasks or reschedule then entire allocation.

Restarts can be batched to target multiple allocations in parallel.
Between each batch the command can stop and hold for a predefined time
or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.

Co-authored-by: Shishir Mahajan <smahajan@roblox.com>
2023-03-23 18:28:26 -04:00
Luiz Aoqui 4ccd999304
ci: send notification when prepare is complete (#16627) 2023-03-23 17:34:45 -04:00