Commit graph

1176 commits

Author SHA1 Message Date
Luiz Aoqui cfb3bb517f
np: scheduler configuration updates (#17575)
* jobspec: rename node pool scheduler_configuration

In HCL specifications we usually call configuration blocks `config`
instead of `configuration`.

* np: add memory oversubscription config

* np: make scheduler config ENT
2023-06-19 11:41:46 -04:00
Luiz Aoqui d5aa72190f
node pools: namespace integration (#17562)
Add structs and fields to support the Nomad Pools Governance Enterprise
feature of controlling node pool access via namespaces.

Nomad Enterprise allows users to specify a default node pool to be used
by jobs that don't specify one. In order to accomplish this, it's
necessary to distinguish between a job that explicitly uses the
`default` node pool and one that did not specify any.

If the `default` node pool is set during job canonicalization it's
impossible to do this, so this commit allows a job to have an empty node
pool value during registration but sets to `default` at the admission
controller mutator.

In order to guarantee state consistency the state store validates that
the job node pool is set and exists before inserting it.
2023-06-16 16:30:22 -04:00
Tim Gross fbaf4c8b69
node pools: implement support in scheduler (#17443)
Implement scheduler support for node pool:

* When a scheduler is invoked, we get a set of the ready nodes in the DCs that
  are allowed for that job. Extend the filter to include the node pool.
* Ensure that changes to a job's node pool are picked up as destructive
  allocation updates.
* Add `NodesInPool` as a metric to all reporting done by the scheduler.
* Add the node-in-pool the filter to the `Node.Register` RPC so that we don't
  generate spurious evals for nodes in the wrong pool.
2023-06-07 10:39:03 -04:00
Luiz Aoqui 5878113c41
node pool: implement nomad node pool nodes CLI (#17444) 2023-06-07 10:37:27 -04:00
Tim Gross c0f2295510
node pools: implement HTTP API to list jobs in pool (#17431)
Implements the HTTP API associated with the `NodePool.ListJobs` RPC, including
the `api` package for the public API and documentation.

Update the `NodePool.ListJobs` RPC to fix the missing handling of the special
"all" pool.
2023-06-06 11:40:13 -04:00
Luiz Aoqui aa1b33d157
node pools: add event stream support (#17412) 2023-06-06 10:14:47 -04:00
Luiz Aoqui 6039c18ab6
node pools: register a node in a node pool (#17405) 2023-06-02 17:50:50 -04:00
Luiz Aoqui b770f2b1ef
node pools: implement CLI (#17388) 2023-06-02 15:49:57 -04:00
Samantha b92a782b6e
check: Add support for Consul field tls_server_name (#17334) 2023-06-02 10:19:12 -04:00
Tim Gross 4f14fa0518
node pools: add node_pool field to job spec (#17379)
This changeset only adds the `node_pool` field to the jobspec, and ensures that
it gets picked up correctly as a change. Without the rest of the implementation
landed yet, the field will be ignored.
2023-06-01 16:08:55 -04:00
Seth Hoenig acfdf0f479
compliance: add headers with fixed copywrite tool (#17353)
Closes #17117
2023-05-30 09:20:32 -05:00
Piotr Kazmierczak cea48b24ee
fix: job canonicalization should set job priority to 50, not 0. (#17314)
Nomad API will reject jobs with priority set to 0.
2023-05-30 09:05:32 +02:00
Charlie Voiselle fc313b7f8f
[api] Return a shapely error for unexpected response (#16743)
* Add UnexpectedResultError to nomad/api

This allows users to perform additional status-based behavior by rehydrating the error using `errors.As` inside of consumers.
2023-05-22 11:45:31 -04:00
dependabot[bot] 31a38d750b
build(deps): bump github.com/shoenig/test from 0.6.4 to 0.6.6 in /api (#17178)
* build(deps): bump github.com/shoenig/test from 0.6.4 to 0.6.5 in /api

* deps: update shoenig/test to v0.6.5

* deps: update again to v0.6.6

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-05-22 07:53:12 -05:00
Phil Renaud 0b729e4bb9
Fixes to scheduling-filtering-in-ui (#17244) 2023-05-18 17:38:34 -04:00
Phil Renaud 7e56ca62d1
[ui] Adds a "Scheduling" filter to the job.allocations page (#17227)
* Basic filter concept

* Make sure NextAllocation gets sent up with allocation stub
2023-05-18 16:24:41 -04:00
James Rasell c60c5ace60
api: update Go mod go version to 1.20 to match main mod. (#17137) 2023-05-10 16:29:06 +01:00
Tim Gross 17bd930ca9
logs: fix missing allocation logs after update to Nomad 1.5.4 (#17087)
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.

This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.

Fixes #17076
2023-05-04 16:01:18 -04:00
dependabot[bot] 1633cab363
build(deps): bump github.com/shoenig/test from 0.6.3 to 0.6.4 in /api (#16895)
Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.6.3 to 0.6.4.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](https://github.com/shoenig/test/compare/v0.6.3...v0.6.4)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-04-24 11:39:37 -05:00
Tim Gross 72cbe53f19
logs: allow disabling log collection in jobspec (#16962)
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.

This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.

---

Fixes: #11175

Co-authored-by: Thomas Weber <towe75@googlemail.com>
2023-04-24 10:00:27 -04:00
Seth Hoenig ba728f8f97
api: enable support for setting original job source (#16763)
* api: enable support for setting original source alongside job

This PR adds support for setting job source material along with
the registration of a job.

This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.

The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.

The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.

* api: avoid writing var content to disk for parsing

* api: move submission validation into RPC layer

* api: return an error if updating a job submission without namespace or job id

* api: be exact about the job index we associate a submission with (modify)

* api: reword api docs scheduling

* api: prune all but the last 6 job submissions

* api: protect against nil job submission in job validation

* api: set max job source size in test server

* api: fixups from pr
2023-04-11 08:45:08 -05:00
hashicorp-copywrite[bot] 005636afa0 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Charlie Voiselle 9dfe4aa7c0
Set RequireRoot to be a test helper. (#16641) 2023-04-06 14:34:36 -04:00
James Rasell cb6ba80f0f
cli: stream both stdout and stderr when following an alloc. (#16556)
This update changes the behaviour when following logs from an
allocation, so that both stdout and stderr files streamed when the
operator supplies the follow flag. The previous behaviour is held
when all other flags and situations are provided.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-04-04 10:42:27 +01:00
Horacio Monsalvo 20372b1721
connect: add meta on ConsulSidecarService (#16705)
Co-authored-by: Sol-Stiep <sol.stiep@southworks.com>
2023-03-30 16:09:28 -04:00
Piotr Kazmierczak 2b353902a1 acl: HTTP endpoints for JWT auth (#16519) 2023-03-30 09:39:56 +02:00
Seth Hoenig 87f4b71df0
client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips (#16672)
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips

This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.

Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.

For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.

Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.

node attributes before

cpu.arch                  = arm64
cpu.modelname             = Apple M2 Pro
cpu.numcores              = 12
cpu.reservablecores       = 0
cpu.totalcompute          = 1000

node attributes after

cpu.arch                  = arm64
cpu.frequency.efficiency  = 2424
cpu.frequency.power       = 3504
cpu.modelname             = Apple M2 Pro
cpu.numcores.efficiency   = 4
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 37728

* fingerprint/cpu: follow up cr items
2023-03-28 08:27:58 -05:00
Luiz Aoqui e5d31bca61
cli: job restart command (#16278)
Implement the new `nomad job restart` command that allows operators to
restart allocations tasks or reschedule then entire allocation.

Restarts can be batched to target multiple allocations in parallel.
Between each batch the command can stop and hold for a predefined time
or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.

Co-authored-by: Shishir Mahajan <smahajan@roblox.com>
2023-03-23 18:28:26 -04:00
Michael Schurter f8884d8b52
client/metadata: fix crasher caused by AllowStale = false (#16549)
Fixes #16517

Given a 3 Server cluster with at least 1 Client connected to Follower 1:

If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.

The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.

Follower 1, seeing AllowStale=false, will forward the request to the
Leader.

The Leader, not being connected to... well hoppefully you get the
picture: an infinite loop occurs.
2023-03-20 16:32:32 -07:00
Juana De La Cuesta 47be374bbd
Add -json flag to quota inspect command (#16478)
* Added  and  flag to  command

* cli[style]: small refactor to avoid confussion with tmpl variable

* Update inspect.mdx

* cli: add changelog entry

* Update .changelog/16478.txt

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update command/quota_inspect.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-03-20 10:40:51 +01:00
Lance Haig ae256e28d8
Update ioutil library references to os and io respectively for API and Plugins package (#16330)
No user facing changes so I assume no change log is required
2023-03-08 10:25:09 -06:00
Seth Hoenig 1c8b408a81
deps: update test to 0.6.2 for new functions (#16326) 2023-03-06 09:24:45 -06:00
Luiz Aoqui 3f1ea9da4b
api: set last index and request time on alloc stop (#16319)
Some of the methods in `Allocations()` incorrectly use the `putQuery`
in API calls where `put` is more appropriate since they are not reading
information back. These methods are also not returning request metadata
such as `LastIndex` back to callers, which can be useful to have in some
scenarios.

They also provide poor developer experience as they take an
`*api.Allocation` struct when only the allocation ID is necessary. This
can lead consumers to make unnecessary API calls to fetch the full
allocation.

Fixing these problems require updating the methods' signatures so they
take `*WriteOptions` instead of `*QueryOptions` and return `*WriteMeta`,
but this is a breaking change that requires advanced notice to consumers.

This commit adds a future breaking change notice and also fixes the
`Stop` method so it properly returns request metadata in a backwards
compatible way.
2023-03-03 15:52:41 -05:00
Dao Thanh Tung 62a69552c1
api: add new test case for force-leave (#16260)
Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
2023-03-03 10:38:40 -05:00
Tim Gross 0e1b554299
handle FSM.Apply errors in raftApply (#16287)
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
https://github.com/hashicorp/consul/pull/9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
2023-03-02 13:51:09 -05:00
Seth Hoenig c9ffd1274b
api: fix a panic and tweak some exported types (#16237)
This PR
 - fixes a panic in GetItems when looking up a variable that does not exist.
 - deprecates GetItems in favor of GetVariableItems which avoids returning a pointer to a map
 - deprecates ErrVariableNotFound in favor of ErrVariablePathNotFound which is an actual error type
 - does some minor code cleanup to make linters happier
2023-02-22 08:17:22 -06:00
James Rasell 8295d0e516
acl: add validation to binding rule selector on upsert. (#16210)
* acl: add validation to binding rule selector on upsert.

* docs: add more information on binding rule selector escaping.
2023-02-17 15:38:55 +01:00
Alessio Perugini 4e9ec24b22
Allow configurable range of Job priorities (#16084) 2023-02-17 09:23:13 -05:00
Pierre Cauchois 74cf372e20
api: fix missing Node Status "disconnected" in API (#16166) 2023-02-14 09:43:23 -05:00
Michael Schurter 35d65c7c7e
Dynamic Node Metadata (#15844)
Fixes #14617
Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted.

Includes new UI for editing metadata as well.

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
2023-02-07 14:42:25 -08:00
Charlie Voiselle cc6f4719f1
Add option to expose workload token to task (#15755)
Add `identity` jobspec block to expose workload identity tokens to tasks.

---------

Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-02-02 10:59:14 -08:00
Tristan Pemble 5440965260
fix(#13844): canonicalize job to avoid nil pointer deference (#13845) 2023-02-01 16:01:28 -05:00
James Rasell 9e8325d63c
acl: fix a bug in token creation when parsing expiration TTLs. (#15999)
The ACL token decoding was not correctly handling time duration
syntax such as "1h" which forced people to use the nanosecond
representation via the HTTP API.

The change adds an unmarshal function which allows this syntax to
be used, along with other styles correctly.
2023-02-01 17:43:41 +01:00
Jorge Marey d1c9aad762
Rename fields on proxyConfig (#15541)
* Change api Fields for expose and paths

* Add changelog entry

* changelog: add deprecation notes about connect fields

* api: minor style tweaks

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-01-30 09:31:16 -06:00
Piotr Kazmierczak 14b53df3b6
renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
dependabot[bot] 52a86b9d32
build(deps): bump github.com/shoenig/test from 0.6.0 to 0.6.1 in /api (#15939)
* build(deps): bump github.com/shoenig/test from 0.6.0 to 0.6.1 in /api

Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.6.0 to 0.6.1.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](https://github.com/shoenig/test/compare/v0.6.0...v0.6.1)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps: update test

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-01-29 14:03:56 -06:00
James Rasell 5d33891910
sso: allow binding rules to create management ACL tokens. (#15860)
* sso: allow binding rules to create management ACL tokens.

* docs: update binding rule docs to detail management type addition.
2023-01-26 09:57:44 +01:00
James Rasell fad9b40e53
Merge branch 'main' into sso/gh-13120-oidc-login 2023-01-18 10:05:31 +00:00
Benjamin Buzbee 13cc30ebeb
Return buffered text from log endpoint if decoding fails (#15558)
To see why I think this is a good change lets look at why I am making it

My disk was full, which means GC was happening agressively. So by the
time I called the logging endpoint from the SDK, the logs were GC'd

The error I was getting before was:
```
invalid character 'i' in literal false (expecting 'l')
```

Now the error I get is:
```
failed to decode log endpoint response as JSON: "failed to list entries: open /tmp/nomad.data.4219353875/alloc/f11fee50-2b66-a7a2-d3ec-8442cb3d557a/alloc/logs: no such file or directory"
```

Still not super descriptive but much more debugable
2023-01-16 10:39:56 +01:00
James Rasell b3a6cfecc4
api: add OIDC HTTP API endpoints and SDK. 2023-01-13 13:15:58 +00:00