Commit graph

402 commits

Author SHA1 Message Date
Tim Gross 33558cb51e
csi: fix handling of garbage collected node in node unpublish (#12350)
When a node is garbage collected, we assume that the volume is no
longer attached to it and ignore the `ErrUnknownNode` error. But we
used `errors.Is` to check for a wrapped error, and RPC flattens the
errors during serialization. This results in an error check that works
in automated testing but not in real clusters. Use a string contains
check instead.
2022-03-22 15:40:24 -04:00
Tim Gross 60cfeacd76
drainer: defer CSI plugins until last (#12324)
When a node is drained, system jobs are left until last so that
operators can rely on things like log shippers running even as their
applications are getting drained off. Include CSI plugins in this set
so that Controller plugins deployed as services can be handled as
gracefully as Node plugins that are running as system jobs.
2022-03-22 10:26:56 -04:00
Luiz Aoqui 68e5b58007
cli: display Raft version in server members (#12317)
The previous output of the `nomad server members` command would output a
column named `Protocol` that displayed the Serf protocol being currently
used by servers.

This is not a configurable option, so it holds very little value to
operators. It is also easy to confuse it with the Raft Protocol version,
which is configurable and highly relevant to operators.

This commit replaces the previous `Protocol` column with the new `Raft
Version`. It also updates the `-detailed` flag to be called `-verbose`
so it matches other commands. The detailed output now also outputs the
same information as the standard output with the addition of the
previous `Protocol` column and `Tags`.
2022-03-17 14:15:10 -04:00
Luiz Aoqui 15089f055f
api: add related evals to eval details (#12305)
The `related` query param is used to indicate that the request should
return a list of related (next, previous, and blocked) evaluations.

Co-authored-by: Jasmine Dahilig <jasmine@hashicorp.com>
2022-03-17 13:56:14 -04:00
Luiz Aoqui 8db12c2a17
server: transfer leadership in case of error (#12293)
When a Nomad server becomes the Raft leader, it must perform several
actions defined in the establishLeadership function. If any of these
actions fail, Raft will think the node is the leader, but it will not
actually be able to act as a Nomad leader.

In this scenario, leadership must be revoked and transferred to another
server if possible, or the node should retry the establishLeadership
steps.
2022-03-17 11:10:57 -04:00
Luiz Aoqui ab8ce87bba
Add pagination, filtering and sort to more API endpoints (#12186) 2022-03-08 20:54:17 -05:00
Michael Schurter 7bb8de68e5
Merge pull request #12138 from jorgemarey/f-ns-meta
Add metadata to namespaces
2022-03-07 10:19:33 -08:00
Tim Gross b94837a2b8
csi: add pagination args to volume snapshot list (#12193)
The snapshot list API supports pagination as part of the CSI
specification, but we didn't have it plumbed through to the command
line.
2022-03-07 12:19:28 -05:00
Tim Gross 2dafe46fe3
CSI: allow updates to volumes on re-registration (#12167)
CSI `CreateVolume` RPC is idempotent given that the topology,
capabilities, and parameters are unchanged. CSI volumes have many
user-defined fields that are immutable once set, and many fields that
are not user-settable.

Update the `Register` RPC so that updating a volume via the API merges
onto any existing volume without touching Nomad-controlled fields,
while validating it with the same strict requirements expected for
idempotent `CreateVolume` RPCs.

Also, clarify that this state store method is used for everything, not just
for the `Register` RPC.
2022-03-07 11:06:59 -05:00
Tim Gross 09a7612150
csi: volume snapshot list plugin option is required (#12197)
The RPC for listing volume snapshots requires a plugin ID. Update the
`volume snapshot list` command to find the specific plugin from the
provided prefix.
2022-03-07 09:58:29 -05:00
Tim Gross 3a692a4360
csi: get plugin ID for creating snapshot from volume, not args (#12195)
The `CreateSnapshot` RPC expects a plugin ID to be set by the API, but
in the common case of the `nomad volume snapshot create` command, we
don't ask the user for the plugin ID because it's available from the
volume we're snapshotting.

Change the order of the RPC so that we get the volume first and then
use the volume's plugin ID for the plugin if the API didn't set the
value.
2022-03-07 09:06:50 -05:00
Jorge Marey 372ea7479b Add changelog file. Add meta to ns mock for testing 2022-03-07 10:56:56 +01:00
Tim Gross b776c1c196
csi: fix prefix queries for plugin list RPC (#12194)
The `CSIPlugin.List` RPC was intended to accept a prefix to filter the
list of plugins being listed. This was being accidentally being done
in the state store instead, which contributed to incorrect filtering
behavior for plugins in the `volume plugin status` command.

Move the prefix matching into the RPC so that it calls the
prefix-matching method in the state store if we're looking for a
prefix.

Update the `plugin status command` to accept a prefix for the plugin
ID argument so that it matches the expected behavior of other commands.
2022-03-04 16:44:09 -05:00
Luiz Aoqui b1809eb48c
Fix CSI volume list with prefix and * namespace (#12184)
When using a prefix value and the * wildcard for namespace, the endpoint
would not take the prefix value into consideration due to the order in
which the checks were executed but also the logic for retrieving volumes
from the state store.

This commit changes the order to check for a prefix first and wraps the
result iterator of the state store query in a filter to apply the
prefix.
2022-03-03 17:27:04 -05:00
Tim Gross 3247e422d1
csi: add missing fields to HTTP API response (#12178)
The HTTP endpoint for CSI manually serializes the internal struct to
the API struct for purposes of redaction (see also #10470). Add fields
that were missing from this serialization so they don't show up as
always empty in the API response.
2022-03-03 15:15:28 -05:00
Michael Schurter 0f6923c750
Merge pull request #10808 from hashicorp/f-curl
cli: add operator api command
2022-03-02 10:12:16 -08:00
Tim Gross f2a4ad0949
CSI: implement support for topology (#12129) 2022-03-01 10:15:46 -05:00
Tim Gross c90e674918
CSI: use HTTP headers for passing CSI secrets (#12144) 2022-03-01 08:47:01 -05:00
Tim Gross a499401b34
csi: fix redaction of volume status mount flags (#12150)
The `volume status` command and associated API redacts the entire
mount options instead of just the `MountFlags` field that can contain
sensitive data. Return a redacted value so that the return value makes
sense to operators who have set this field.
2022-03-01 08:34:03 -05:00
Tim Gross 99d03cdc6c
CSI: sort capabilities in plugin status (#12154)
Also fix `LIST_SNAPSHOTS` capability name
2022-03-01 07:59:31 -05:00
Tim Gross 02ae95ab22
csi: respect -verbose flag for allocs in volume status (#12153) 2022-03-01 07:57:29 -05:00
Michael Schurter f6342d1d45 docs: add changelog for #10808 2022-02-24 17:13:42 -08:00
Seth Hoenig 1274aa690f tests: deflake test that joins a server with non-voting servers to form qourum
This PR
 - upgrades the serf library
 - has the test start the join process using the un-joined server first
 - disables schedulers on the servers
 - uses the WaitForLeader and wantPeers helpers

Not sure which, if any of these actually improves the flakiness of this test.
2022-02-24 17:02:58 -06:00
Tim Gross 13ea2c7fb3
CSI: display plugin capabilities in verbose status (#12116)
The behaviors of CSI plugins are governed by their capabilities as
defined by the CSI specification. When debugging plugin issues, it's
useful to know which behaviors are expected so they can be matched
against RPC calls made to the plugin allocations.

Expose the plugin capabilities as named in the CSI spec in the `nomad
plugin status -verbose` output.
2022-02-24 13:51:38 -05:00
Tim Gross 22cf24a6bd
CSI: retry claims from client when max claims are reached (#12113)
When the alloc runner claims a volume, an allocation for a previous
version of the job may still have the volume claimed because it's
still shutting down. In this case we'll receive an error from the
server. Retry this error until we succeed or until a very long timeout
expires, to give operators a chance to recover broken plugins.

Make the alloc runner hook tolerant of temporary RPC failures.
2022-02-24 10:39:07 -05:00
Tim Gross cfe3117af8
CSI: enforce usage at claim time (#12112)
* Remove redundant schedulable check in `FreeWriteClaims`. If a volume
  has been created but not yet claimed, its capabilities will be checked
  in `WriteSchedulable` at both scheduling time and claim time. We don't
  need to also check them in the `FreeWriteClaims` method.

* Enforce maximum volume claims for writers.

  When the scheduler checks feasibility for CSI volumes, the check is
  fairly loose: earlier versions of the same job are not counted as
  active claims. This allows the scheduler to place new allocations
  for the new version of a job, under the assumption that we'll replace
  the existing allocations and their volume claims.

  But when the alloc runner claims the volume, we need to enforce the
  active claims even if they're for allocations of an earlier version of
  the job. Otherwise we'll try to mount a volume that's currently being
  unmounted, and this will cause replacement allocations to frequently
  fail.

* Enforce single-node reader check for read-only volumes. When the
  alloc runner makes a claim for a read-only volume, we only check that
  the volume is potentially schedulable and not that it actually has
  free read claims.
2022-02-24 09:37:37 -05:00
Sander Mol 42b338308f
add go-sockaddr templating support to nomad consul address (#12084) 2022-02-24 09:34:54 -05:00
Florian Apolloner 3bced8f558
namespaces: allow enabling/disabling allowed drivers per namespace 2022-02-24 09:27:32 -05:00
Seth Hoenig 57b9c64b8f
Merge pull request #12107 from hashicorp/use-bbolt
core: swap bolt impl and enable configuring raft freelist sync behavior
2022-02-24 08:25:54 -06:00
Tim Gross 5b7b9fdafb
csi: tolerate missing plugins on job delete (#12114)
If a plugin job fails before successfully fingerprinting the plugins,
the plugin will not exist when we try to delete the job. Tolerate
missing plugins.
2022-02-24 08:53:15 -05:00
Seth Hoenig ca84ba12ac agent: switch to go.etc.io/bbolt for state store
This PR modifies the server and client agents to use `go.etc.io/bbolt` as the
implementation for their state stores.
2022-02-23 14:28:31 -06:00
Seth Hoenig de95998faa core: switch to go.etc.io/bbolt
This PR swaps the underlying BoltDB implementation from boltdb/bolt
to go.etc.io/bbolt.

In addition, the Server has a new configuration option for disabling
NoFreelistSync on the underlying database.

Freelist option: https://github.com/etcd-io/bbolt/blob/master/db.go#L81
Consul equivelent PR: https://github.com/hashicorp/consul/pull/11720
2022-02-23 14:26:41 -06:00
Tim Gross 246db87a74
CSI: allow for concurrent plugin allocations (#12078)
The dynamic plugin registry assumes that plugins are singletons, which
matches the behavior of other Nomad plugins. But because dynamic
plugins like CSI are implemented by allocations, we need to handle the
possibility of multiple allocations for a given plugin type + ID, as
well as behaviors around interleaved allocation starts and stops.

Update the data structure for the dynamic registry so that more recent
allocations take over as the instance manager singleton, but we still
preserve the previous running allocations so that restores work
without racing.

Multiple allocations can run on a client for the same plugin, even if
only during updates. Provide each plugin task a unique path for the
control socket so that the tasks don't interfere with each other.
2022-02-23 15:23:07 -05:00
Michael Schurter 5410ec81c5 docs: add changelog for #11600 2022-02-18 16:16:19 -08:00
Michael Schurter 48aaa2c7d9
Merge pull request #11975 from hashicorp/f-connect-debugging
connect: write envoy bootstrap debugging info
2022-02-18 13:56:22 -08:00
Seth Hoenig 6550c90198 connect: bootstrap envoy using -proxy-id
This PR modifies the Consul CLI arguments used to bootstrap envoy for
Connect sidecars to make use of '-proxy-id' instead of '-sidecar-for'.

Nomad registers the sidecar service, so we know what ID it has. The
'-sidecar-for' was intended for use when you only know the name of the
service for which the sidecar is being created.

The improvement here is that using '-proxy-id' does not require an underlying
request for listing Consul services. This will make make the interaction
between Nomad and Consul more efficient.

Closes #10452
2022-02-18 14:58:23 -06:00
Michael Schurter 27b8112123 connect: write envoy bootstrap debugging info
When Consul Connect just works, it's wonderful. When it doesn't work it
can be exceeding difficult to debug: operators have to check task
events, Nomad logs, Consul logs, Consul APIs, and even then critical
information is missing.

Using Consul to generate a bootstrap config for Envoy is notoriously
difficult. Nomad doesn't even log stderr, so operators are left trying
to piece together what went wrong.

This patch attempts to provide *maximal* context which unfortunately
includes secrets. **Secrets are always restricted to the secrets/
directory.** This makes debugging a little harder, but allows operators
to know exactly what operation Nomad was trying to perform.

What's added:

- stderr is sent to alloc/logs/envoy_bootstrap.stderr.0
- the CLI is written to secrets/.envoy_bootstrap.cmd
- the environment is written to secrets/.envoy_bootstrap.env as JSON

Accessing this information is unfortunately awkward:
```
nomad alloc exec -task connect-proxy-count-countdash b36a cat secrets/.envoy_bootstrap.env
nomad alloc exec -task connect-proxy-count-countdash b36a cat secrets/.envoy_bootstrap.cmd
nomad alloc fs b36a alloc/logs/envoy_bootstrap.stderr.0
```

The above assumes an alloc id that starts with `b36a` and a Connect
sidecar proxy for a service named `count-countdash`.

If the alloc is unable to start successfully, the debugging files are
only accessible from the host filesystem.
2022-02-18 12:02:36 -08:00
Seth Hoenig c8d27257e7 deps: upgrade hashicorp/raft to v1.3.5 2022-02-17 13:49:56 -06:00
Seth Hoenig 9fccc8f8bc
Merge pull request #12077 from hashicorp/b-makefile-use-gobin
build: respect GOBIN when using make targets
2022-02-16 13:25:03 -06:00
Seth Hoenig 98758d4287 build: respect GOBIN when using make targets
This PR updates GNUMakefile to respect $GOBIN if it is set in the
environment or via an $GOENV file. Previously we hard-coded the output
to $GOPATH/bin, which is not necessarily the desired behavior.
2022-02-16 12:05:55 -06:00
Luiz Aoqui 110dbeeb9d
Add go-bexpr filters to evals and deployment list endpoints (#12034) 2022-02-16 11:40:30 -05:00
Tiernan c30b4617aa
interpolate network.dns block on client (#12021) 2022-02-16 08:39:44 -05:00
Tim Gross 27bb2da5ee
CSI: make gRPC client creation more robust (#12057)
Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.

If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that
connection failed so long as its own pings continue to work. A
transient failure during plugin startup may mislead the plugin
supervisor hook into thinking the plugin is up (so there's no need to
restart the allocation) but no fingerprinter is started.

* Refactors the gRPC client to connect on first use. This provides the
  plugin manager instance the ability to retry the gRPC client
  connection until success.
* Add a 30s timeout to the plugin supervisor so that we don't poll
  forever waiting for a plugin that will never come back up.

Minor improvements:
* The plugin supervisor hook creates a new gRPC client for every probe
  and then throws it away. Instead, reuse the client as we do for the
  plugin manager.
* The gRPC client constructor has a 1 second timeout. Clarify that this
  timeout applies to the connection and not the rest of the client
  lifetime.
2022-02-15 16:57:29 -05:00
Seth Hoenig ac3cd73d00
Merge pull request #12054 from hashicorp/b-creation-indexes
api: return sorted results in certain list endpoints
2022-02-15 15:08:38 -06:00
Seth Hoenig 40c714a681 api: return sorted results in certain list endpoints
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.

- Eval.List
- Deployment.List
2022-02-15 13:48:28 -06:00
Seth Hoenig f8f0d92469
Merge pull request #11955 from hashicorp/f-update-gopsutil
Update gopsutil to 3.21.12
2022-02-15 08:31:57 -06:00
Seth Hoenig d1ce07cbf7 cl: shorten changelog entry 2022-02-15 08:31:25 -06:00
Tim Gross bed9b3c248
changelog entry (#12072) 2022-02-15 09:00:30 -05:00
Tim Gross 2f79a260fe
csi: volume cli prefix matching should accept exact match (#12051)
The `volume detach`, `volume deregister`, and `volume status` commands
accept a prefix argument for the volume ID. Update the behavior on
exact matches so that if there is more than one volume that matches
the prefix, we should only return an error if one of the volume IDs is
not an exact match. Otherwise we won't be able to use these commands
at all on those volumes. This also makes the behavior of these commands
consistent with `job stop`.
2022-02-11 08:53:03 -05:00
Tim Gross 8ffe7aa76f
csi: provide CSI_ENDPOINT env var to plugins (#12050)
The CSI specification says:
> The CO SHALL provide the listen-address for the Plugin by way of the
`CSI_ENDPOINT` environment variable.

Note that plugins without filesystem isolation won't have the plugin
dir bind-mounted to their alloc dir, but we can provide a path to the
socket anyways.

Refactor to use opts struct for plugin supervisor hook config.
The parameter list for configuring the plugin supervisor hook has
grown enough where is makes sense to use an options struct similiar to
many of the other task runner hooks (ex. template).
2022-02-11 08:46:21 -05:00
James Rasell d7f90a4fbb
Merge pull request #12041 from hashicorp/b-gh-12040
changelog: add entry for #12040
2022-02-11 10:15:09 +01:00
Luiz Aoqui 3bf6036487 Version 1.2.6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJiBIXqAAoJELC0QQl2hbZ2M8cP/A7LENJbFSph25M1aGItra5j
 BphSX//Sq/v9ZzO44rOGNYQGfTpFT8STJgj2GC50qR/ilF4KX4D0oZlDyu/6D0NG
 ouN9RUjnFd6IEDQrjqqqhr3F69Z95SWVfi1rfgn/pIgOYkVEXfi6DXaulVVyd2ZT
 J0G5w5ryl5d8PhuL7TWw4zbhZRQn0hVspZv/1s3/I9aG6Sew8SMweeOxbN9lBr7E
 H19Amdjh6ugRuPgU7YMpKDVrZQRv9Wt7BUP/uc0u3LiW9z3Ko8ZKnCRKErtL5Kc3
 HDZsWe+t3va4Uekzd0HULNcYU4kwjogdRYRzX5kRsOyXelrZkQIqYFiKrk1wVbq/
 cYM5DUak6eUQBGhgi3UY0fklBFq4GDGpiwEzn7rvQb0PRSuVyykgbZ12fzyIu8dp
 tWbR/WOEg9F+jva6HkR2kDIcr5mDmny3Pxi5aUT6lMk1111nCzOjDzhLkQVtfsex
 FDMByXxM4oWAK3ouq2OIdxDL2c742A2933C4/30KWE7Xy7twsvkGw52irw66VO3V
 4PHP880cDvEDaEh15mY/8FlaAE7t/gsCUuYLxGwl33TaXSRBLc9vVNrrp89q53TD
 ZcvXTBpHUOWa6ZlHF/4f8LW44rowM6bU0Wili7NaWOKx86dnUJMG4sqJifNgcpS/
 7lXogv98CYLbMy4X4if0
 =NY1Z
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEElFaq1Z5DKdB91i+lKfRZwNnLtXMFAmIFbbkACgkQKfRZwNnL
 tXOr/g/+N2ZBMK8ohEvtdXLl7WXrVhgJfUSVbdD5Kfshul9CPn3yWRxJzqtEN2Pf
 55ozeWLpoziP9y9LviJ7rDidXcTmDFutbFdGJ3L+ZLdLILsNOq1A+lbuwO3fJngZ
 5aiPoJLsw4sqj6uHaM6Cls2f145O92nT7GXEHCxuvGHeSf3NkcR+zRY5nPrLTIrA
 uxYefCOzP6C2I+W7dL4Oj5R5EZd4UDi1WiL8pGzwm24LcagZN2ctctolAeF9OlJX
 M58UUv9b4GObe617u8MeH0LIlyZiNwn9JqrV33dKVTyrkBIYfYxkzdzMKf1csVYk
 kQb13KPdPTASBAGTl+sxeXXnw/bg09JXGcvREX5lLyQqY8xGwTv2FpTmybKWLiss
 Bg6BbejrgtCPBik0EAHWV0+kVzhi9bPfUYwTXLDCzMtrbyCyPoWchruel2sm41U1
 ezRDzlSvf6nrXf7sAv6umJICck4Bc5Gol+8W7fxvWqnY9rQ3ds2v7E5lXZMBbOmE
 JSi+EDWBJjBAXehE6pLxeVsvlHMRWN007Z2UeD4neGIgG7xFJLq6nKeUKoiNIpgk
 hKBL8iwHyuJfrBB/dcPzI9NV+jL6OZ/oI1RWxSj0MX/B4VXZp8HrqZA5JxzQolUg
 KIxqe4iX3WIkQv+UU4WiELvs4O7fujB4KWz3iQokhwDxqGUpffk=
 =5EG2
 -----END PGP SIGNATURE-----

Merge tag 'v1.2.6' into merge-release-1.2.6-branch

Version 1.2.6
2022-02-10 14:55:34 -05:00
James Rasell d96f9684be
changelog: add entry for #12040 2022-02-10 08:36:32 +01:00
Tim Gross 74486d86fb
scheduler: prevent panic in spread iterator during alloc stop
The spread iterator can panic when processing an evaluation, resulting
in an unrecoverable state in the cluster. Whenever a panicked server
restarts and quorum is restored, the next server to dequeue the
evaluation will panic.

To trigger this state:
* The job must have `max_parallel = 0` and a `canary >= 1`.
* The job must not have a `spread` block.
* The job must have a previous version.
* The previous version must have a `spread` block and at least one
  failed allocation.

In this scenario, the desired changes include `(place 1+) (stop
1+), (ignore n) (canary 1)`. Before the scheduler can place the canary
allocation, it tries to find out which allocations can be
stopped. This passes back through the stack so that we can determine
previous-node penalties, etc. We call `SetJob` on the stack with the
previous version of the job, which will include assessing the `spread`
block (even though the results are unused). The task group spread info
state from that pass through the spread iterator is not reset when we
call `SetJob` again. When the new job version iterates over the
`groupPropertySets`, it will get an empty `spreadAttributeMap`,
resulting in an unexpected nil pointer dereference.

This changeset resets the spread iterator internal state when setting
the job, logging with a bypass around the bug in case we hit similar
cases, and a test that panics the scheduler without the patch.
2022-02-09 19:53:06 -05:00
Luiz Aoqui 15f9d54dea
api: prevent excessice CPU load on job parse
Add new namespace ACL requirement for the /v1/jobs/parse endpoint and
return early if HCLv2 parsing fails.

The endpoint now requires the new `parse-job` ACL capability or
`submit-job`.
2022-02-09 19:51:47 -05:00
Seth Hoenig 437bb4b86d
client: check escaping of alloc dir using symlinks
This PR adds symlink resolution when doing validation of paths
to ensure they do not escape client allocation directories.
2022-02-09 19:50:13 -05:00
Seth Hoenig de078e7ac6
client: fix race condition in use of go-getter
go-getter creates a circular dependency between a Client and Getter,
which means each is inherently thread-unsafe if you try to re-use
on or the other.

This PR fixes Nomad to no longer make use of the default Getter objects
provided by the go-getter package. Nomad must create a new Client object
on every artifact download, as the Client object controls the Src and Dst
among other things. When Caling Client.Get, the Getter modifies its own
Client reference, creating the circular reference and race condition.

We can still achieve most of the desired connection caching behavior by
re-using a shared HTTP client with transport pooling enabled.
2022-02-09 19:48:28 -05:00
Charlie Voiselle 6750235152 Add changelog 2022-02-09 19:31:42 -05:00
Tim Gross 59c8558969
docs and changelog for nomad config validate (#12031) 2022-02-09 10:20:45 -05:00
Tim Gross d9d4da1e9f
scheduler: seed random shuffle nodes with eval ID (#12008)
Processing an evaluation is nearly a pure function over the state
snapshot, but we randomly shuffle the nodes. This means that
developers can't take a given state snapshot and pass an evaluation
through it and be guaranteed the same plan results.

But the evaluation ID is already random, so if we use this as the seed
for shuffling the nodes we can greatly reduce the sources of
non-determinism. Unfortunately golang map iteration uses a global
source of randomness and not a goroutine-local one, but arguably
if the scheduler behavior is impacted by this, that's a bug in the
iteration.
2022-02-08 12:16:33 -05:00
Seth Hoenig a06ae106f0
cl: fix DO name
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-02-08 10:28:57 -06:00
Seth Hoenig a21776a82d changelog: update changelog for DO 2022-02-08 08:43:49 -06:00
Tim Gross 464026c87b
scheduler: recover from panic (#12009)
If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.

Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking a downtime.
2022-02-07 11:47:53 -05:00
ttys3 5faf344152
style: fix up very long tag word breaking the allocation service table width (#11995) 2022-02-04 19:40:03 -05:00
Karthick Ramachandran 0600bc32e2
improve error message on service length (#12012) 2022-02-04 19:39:34 -05:00
Tim Gross 7ad15b2b42
raft: default to protocol v3 (#11572)
Many of Nomad's Autopilot features require raft protocol version
3. Set the default raft protocol to 3, and improve the upgrade
documentation.
2022-02-03 15:03:12 -05:00
Seth Hoenig 5f48e18189
Merge pull request #11983 from hashicorp/b-select-after
cleanup: prevent leaks from time.After
2022-02-03 09:38:06 -06:00
ttys3 1ab3b4d3d8
correct task row memory unit (#11980) 2022-02-02 17:00:25 -05:00
Samantha 54f8c04c91
Fix health checking for ephemeral poststart tasks (#11945)
Update the logic in the Nomad client's alloc health tracker which
erroneously marks existing healthy allocations with dead poststart ephemeral
tasks as unhealthy even if they were already successful during a previous
deployment.
2022-02-02 16:29:49 -05:00
Seth Hoenig db2347a86c cleanup: prevent leaks from time.After
This PR replaces use of time.After with a safe helper function
that creates a time.Timer to use instead. The new function returns
both a time.Timer and a Stop function that the caller must handle.

Unlike time.NewTimer, the helper function does not panic if the duration
set is <= 0.
2022-02-02 14:32:26 -06:00
Luiz Aoqui c4cff5359f
Verify TLS certificate on endpoints that are used between agents only (#11956) 2022-02-02 15:03:18 -05:00
Michael Schurter fd242ab7f8
Merge pull request #11878 from kainoaseto/fix/multi-task-group-canary-deploys
Bugfix: auto-promote canary taskgroups when mixed with non-canary taskgroups
2022-01-31 16:22:51 -08:00
Michael Schurter dcf15d5960 docs: add changelog for #11878 2022-01-31 12:21:31 -08:00
Tim Gross 6af1b359ed docs: missing changelog for #11892 (#11959) 2022-01-28 15:08:48 -05:00
Tim Gross ea69eda522
docs: missing changelog for #11892 (#11959) 2022-01-28 15:04:32 -05:00
Tim Gross 951661db04 CSI: resolve invalid claim states (#11890)
* csi: resolve invalid claim states on read

It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
2022-01-28 14:43:35 -05:00
Tim Gross 4e559c6255 csi: update leader's ACL in volumewatcher (#11891)
The volumewatcher that runs on the leader needs to make RPC calls
rather than writing to raft (as we do in the deploymentwatcher)
because the unpublish workflow needs to make RPC calls to the
clients. This requires that the volumewatcher has access to the
leader's ACL token.

But when leadership transitions, the new leader creates a new leader
ACL token. This ACL token needs to be passed into the volumewatcher
when we enable it, otherwise the volumewatcher can find itself with a
stale token.
2022-01-28 14:43:27 -05:00
Derek Strickland 460416e787 Update IsEmpty to check for pre-1.2.4 fields (#11930) 2022-01-28 14:41:49 -05:00
Tim Gross a2433e35fb
CSI: resolve invalid claim states (#11890)
* csi: resolve invalid claim states on read

It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
2022-01-27 09:30:03 -05:00
André 518fc11dca
ui: move volume link to the source column and fix the link target (#11896)
The link target used the volume name instead of the volume id.
Fixes issue #11884.
2022-01-26 14:17:29 -05:00
Derek Strickland b3c8ab9be7
Update IsEmpty to check for pre-1.2.4 fields (#11930) 2022-01-26 11:31:37 -05:00
Seth Hoenig 86330e43c8 changelog: use pr number not issue number 2022-01-26 06:32:10 -06:00
Seth Hoenig ffe7f87912 connect: fix bug where sidecar_task.resources was ignored with hcl1
The HCL1 parser did not respect connect.sidecar_task.resources if the
connect.sidecar_service block was not set (an optimiztion that no longer
makes sense with connect gateways).

Fixes #10899
2022-01-25 10:17:54 -06:00
Seth Hoenig ef9b84ad82 deps: update api go version and dependencies
This PR sets the minimum Go version for the `api` submodule to Go 1.17.

It also upgrades
 - gorilla/websocket 1.4.1 -> 1.4.2
 - mitchelh/mapstructure 1.4.2 -> 1.4.3
 - stretchr/testify 1.5.1 -> 1.7.0

Closes #11518 #11602 #11528
2022-01-24 12:23:26 -06:00
Tim Gross 04977525dd
csi: update leader's ACL in volumewatcher (#11891)
The volumewatcher that runs on the leader needs to make RPC calls
rather than writing to raft (as we do in the deploymentwatcher)
because the unpublish workflow needs to make RPC calls to the
clients. This requires that the volumewatcher has access to the
leader's ACL token.

But when leadership transitions, the new leader creates a new leader
ACL token. This ACL token needs to be passed into the volumewatcher
when we enable it, otherwise the volumewatcher can find itself with a
stale token.
2022-01-24 11:49:50 -05:00
Seth Hoenig 0030424384
Merge pull request #11889 from hashicorp/build-update-circle
build: upgrade circleci configuration
2022-01-24 10:18:21 -06:00
Seth Hoenig c220d3009b
Merge pull request #11910 from hashicorp/deps-update-containernetworking
deps: upgrade containernetworking/plugins
2022-01-24 10:14:50 -06:00
Seth Hoenig 8a96e5d567 deps: add missing cl note 2022-01-24 10:13:13 -06:00
Tim Gross b9456f2f72
changelog: fix entry markdown (#11911) 2022-01-24 11:04:14 -05:00
Seth Hoenig 2f0cfb5740 build: upgrade and speedup circleci configuration
This PR upgrades our CI images and fixes some affected tests.

- upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02)

- eliminate go-machine-recent-image (no longer necessary)

- manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174)

- fix tcp dial error check (message seems to be OS specific)

- spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2)

- use safe MkdirTemp for generating tmpfiles

NOT applied: (too flakey)

- eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting)

- upgrade resource type for all imanges to large (2C -> 4C)
2022-01-24 08:28:14 -06:00
Seth Hoenig f2a71fd0d9 deps: pty has new home
github.com/kr/pty was moved to github.com/creack/pty

Swap this dependency so we can upgrade to the latest version
and no longer need a replace directive.
2022-01-19 12:33:05 -06:00
Seth Hoenig 9a6988f55b deps: adjust to gzip handler zero length response body
After swapping gzip handler to use the gorilla library, we
must account for a quirk in how zero/minimal length response
bodies are delivered.

The previous gzip handler was configured to compress all responses
regardless of size - even if the data was zero length or below the
network MTU. This behavior changed in [v1.1.0](c551b6c3b4 (diff-de723e6602cc2f16f7a9d85fd89d69954edc12a49134dab8901b10ee06d1879d))
which is why we could not upgrade.

The Nomad HTTP Client mutates the http.Response.Body object, making
a strong assumption that if the Content-Encoding header is set to "gzip",
the response will be readable via gzip decoder. This is no longer true
for the nytimes gzip handler, and is also not true for the gorilla gzip
handler.

It seems in practice this only makes a difference on the /v1/operator/license
endpoint which returns an empty response in OSS Nomad.

The fix here is to simply not wrap the response body reader if we
encounter an io.EOF while creating the gzip reader - indicating there
is no data to decode.
2022-01-19 11:52:19 -06:00
Seth Hoenig 4650e97d29 deps: upgrade docker and runc
This PR upgrades
 - docker dependency to the latest tagged release (v20.10.12)
 - runc dependency to the latest tagged release (v1.0.3)

Docker does not abide by [semver](https://github.com/moby/moby/issues/39302), so it is marked +incompatible,
and transitive dependencies are upgrade manually.

Runc made three relevant breaking changes

 * cgroup manager .Set changed to accept Resources instead of Cgroup
   3f65946756

 * config.Device moved to devices.Device
   https://github.com/opencontainers/runc/pull/2679

 * mountinfo.Mounted now returns an error if the specified path does not exist
   https://github.com/moby/sys/blob/mountinfo/v0.5.0/mountinfo/mountinfo.go#L16
2022-01-18 08:35:26 -06:00
Dave May 330d24a873
cli: Add event stream capture to nomad operator debug (#11865) 2022-01-17 21:35:51 -05:00
Michael Schurter 99c863f909
cli: improve debug error messages (#11507)
Improves `nomad debug` error messages when contacting agents that do not
have /v1/agent/host endpoints (the endpoint was added in v0.12.0)

Part of #9568 and manually tested against Nomad v0.8.7.

Hopefully isRedirectError can be reused for more cases listed in #9568
2022-01-17 11:15:17 -05:00
Luiz Aoqui 57a61b50f4
changelog: add entry for #11793 (#11862) 2022-01-17 11:08:29 -05:00
James Rasell 29fb0ddb4e
Merge pull request #11849 from hashicorp/b-changelog-11848
changelog: add entry for #11848
2022-01-17 09:35:10 +01:00
Jai 7ad9ec1143
Merge pull request #11820 from hashicorp/f-ui/alloc-legend
feat:  add links to legend items in `allocation-summary`
2022-01-14 14:02:54 -05:00
Tim Gross d7756f8cdb
csi: volume deregistration should require exact ID (#11852)
The command line client sends a specific volume ID, but this isn't
enforced at the API level and we were incorrectly using a prefix match
for volume deregistration, resulting in cases where a volume with a
shorter ID that's a prefix of another volume would be deregistered
instead of the intended volume.
2022-01-14 12:26:03 -05:00
Tim Gross 33f7c6cba4
csi: when warning for multiple prefix matches, use full ID (#11853)
When the `volume deregister` or `volume detach` commands get an ID
prefix that matches multiple volumes, show the full length of the
volume IDs in the list of volumes shown so so that the user can select
the correct one.
2022-01-14 12:25:48 -05:00
Tim Gross 9c4864badd
freebsd: build fix for ARM7 32-bit (#11854)
The size of `stat_t` fields is architecture dependent, which was
reportedly causing a build failure on FreeBSD ARM7 32-bit
systems. This changeset matches the behavior we have on Linux.
2022-01-14 12:25:32 -05:00
Tim Gross 73d0779858
drivers: set world-readable permissions on copied resolv.conf (#11856)
When we copy the system DNS to a task's `resolv.conf`, we should set
the permissions as world-readable so that unprivileged users within
the task can read it.
2022-01-14 12:25:23 -05:00
Jai Bhagat 764936a768 chore: add changelog 2022-01-14 10:23:09 -05:00
James Rasell 10405e9c6b
changelog: add entry for #11848 2022-01-14 13:40:50 +01:00
Luiz Aoqui d48e50da9a
Fix log level parsing from lines that include a timestamp (#11838) 2022-01-13 09:56:35 -05:00
Luiz Aoqui c7ae13a1f3
Fix ACL requirements for job details UI (#11672) 2022-01-12 21:26:02 -05:00
Michael Schurter ebadaabc71 doc: add changelog for #11830 2022-01-12 14:21:47 -08:00
Tim Gross b62da8fc9a
docs: improve changelog for PR #11783 (#11818) 2022-01-11 11:54:12 -05:00
Tim Gross 1a5973184e
docs: changelog for PR #11783 (#11812) 2022-01-10 16:39:21 -05:00
Alessandro De Blasis e647549ecf
metrics: added mapped_file metric (#11500)
Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
Co-authored-by: Nate <37554478+servusdei2018@users.noreply.github.com>
2022-01-10 15:35:19 -05:00
Derek Strickland 0a8e03f0f7
Expose Consul template configuration parameters (#11606)
This PR exposes the following existing`consul-template` configuration options to Nomad jobspec authors in the `{job.group.task.template}` stanza.

- `wait`

It also exposes the following`consul-template` configuration to Nomad operators in the `{client.template}` stanza.

- `max_stale`
- `block_query_wait`
- `consul_retry`
- `vault_retry` 
- `wait` 

Finally, it adds the following new Nomad-specific configuration to the `{client.template}` stanza that allows Operators to set bounds on what `jobspec` authors configure.

- `wait_bounds`

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-01-10 10:19:07 -05:00
Tim Gross 32f150d469
docs: new scheduler metrics (#11790)
* Fixed name of `nomad.scheduler.allocs.reschedule` metric
* Added new metrics to metrics reference documentation
* Expanded definitions of "waiting" metrics
* Changelog entry for #10236 and #10237
2022-01-07 09:51:15 -05:00
Charlie Voiselle 98a240cd99
Make number of scheduler workers reloadable (#11593)
## Development Environment Changes
* Added stringer to build deps

## New HTTP APIs
* Added scheduler worker config API
* Added scheduler worker info API

## New Internals
* (Scheduler)Worker API refactor—Start(), Stop(), Pause(), Resume()
* Update shutdown to use context
* Add mutex for contended server data
    - `workerLock` for the `workers` slice
    - `workerConfigLock` for the `Server.Config.NumSchedulers` and
      `Server.Config.EnabledSchedulers` values

## Other
* Adding docs for scheduler worker api
* Add changelog message

Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
2022-01-06 11:56:13 -05:00
Michael Schurter 1af8d47de2
Merge pull request #11744 from hashicorp/b-node-copy
Fix Node.Copy()
2022-01-05 17:01:53 -08:00
Jai c7e581d879
Merge pull request #11590 from hashicorp/e-ui/breadcrumbs-service
Refactor:  Breadcrumbs Service
2022-01-05 17:46:48 -05:00
Tim Gross 51f512a3e6
csi: reap unused volume claims at leadership transitions (#11776)
When `volumewatcher.Watcher` starts on the leader, it starts a watch
on every volume and triggers a reap of unused claims on any change to
that volume. But if a reaping is in-flight during leadership
transitions, it will fail and the event that triggered the reap will
be dropped. Perform one reap of unused claims at the start of the
watcher so that leadership transitions don't drop this event.
2022-01-05 11:40:20 -05:00
Arkadiusz ffb174b596
Fix log streaming missing frames (#11721)
Perform one more read after receiving cancel when streaming file from the allocation API
2022-01-04 14:07:16 -05:00
Tim Gross 2806dc2bd7
docs/tests for multiple HTTP address config (#11760) 2022-01-03 10:17:13 -05:00
Tim Gross 395628efe1
api: paginate deployment list and accept wildcard namespace (#11743)
Add `per_page` and `next_token` handling to `Deployment.List` RPC, and
allow the use of a wildcard namespace for namespace filtering.
2022-01-03 08:36:02 -05:00
Michael Schurter c4d03815e1 add changelog for Node.Copy fix 2021-12-23 12:34:05 -08:00
Tim Gross 265e488ab4
task runner: fix goroutine leak in prestart hook (#11741)
The task runner prestart hooks take a `joincontext` so they have the
option to exit early if either of two contexts are canceled: from
killing the task or client shutdown. Some tasks exit without being
shutdown from the server, so neither of the joined contexts ever gets
canceled and we leak the `joincontext` (48 bytes) and its internal
goroutine. This primarily impacts batch jobs and any task that fails
or completes early such as non-sidecar prestart lifecycle tasks.
Cancel the `joincontext` after the prestart call exits to fix the
leak.
2021-12-23 11:50:51 -05:00
Luiz Aoqui 4bdd2c84e3
fix host network reserved port fingerprint (#11728) 2021-12-22 15:29:54 -05:00
Luiz Aoqui 40093f97cd
api: support namespace wildcard in CSI volume list (#11724) 2021-12-21 17:19:45 -05:00
Shishir 65eab35412
Add support for setting pids_limit in docker plugin config. (#11526) 2021-12-21 13:31:34 -05:00
Tim Gross b0c3b99b03
scheduler: fix quadratic performance with spread blocks (#11712)
When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.

Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks that are eligible on a large number of nodes
can take longer than the nack timeout to evaluate (60s). Typical
evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.

This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
2021-12-21 10:10:01 -05:00
Jai Bhagat 1887e34b75 chore: edit mirage scenario to populate csi 2021-12-21 07:42:23 -05:00
Luiz Aoqui 3d3b5a2c8e
changelog: add entries for #11555, #11557, and #11687 (#11706) 2021-12-20 13:45:20 -05:00
Tim Gross e046bb31e9
api: respect wildcard in evaluations list API (#11710) 2021-12-20 12:23:50 -05:00
Jai 93d5ef596f
Merge pull request #11545 from hashicorp/f-ui/add-alloc-filters-on-table
Add Allocation Filters in Client View
2021-12-18 09:39:53 -05:00
Jai 2b7fb2c5bd
Merge pull request #11544 from hashicorp/f-ui/add-filters-to-allocs
Add filters to Allocations
2021-12-18 09:38:28 -05:00
Luiz Aoqui e067b3d75f
changelog: fix entry for #11544 2021-12-17 18:57:54 -05:00
Luiz Aoqui a1c4536523
changelog: add entry for #11545 2021-12-17 18:49:56 -05:00
Tim Gross f2615992a4
cli: unhide advanced operator raft debugging commands (#11682)
The `nomad operator raft` and `nomad operator snapshot state`
subcommands for inspecting on-disk raft state were hidden and
undocumented. Expose and document these so that advanced operators
have support for these tools.
2021-12-16 10:32:11 -05:00
Tim Gross 536e3c5282
nomad eval list command (#11675)
Use the new filtering and pagination capabilities of the `Eval.List`
RPC to provide filtering and pagination at the command line.

Also includes note that `nomad eval status -json` is deprecated and
will be replaced with a single evaluation view in a future version of
Nomad.
2021-12-15 11:58:38 -05:00
Tim Gross f8a133a810
cli: ensure -stale flag is respected by nomad operator debug (#11678)
When a cluster doesn't have a leader, the `nomad operator debug`
command can safely use stale queries to gracefully degrade the
consistency of almost all its queries. The query parameter for these
API calls was not being set by the command.

Some `api` package queries do not include `QueryOptions` because
they target a specific agent, but they can potentially be forwarded to
other agents. If there is no leader, these forwarded queries will
fail. Provide methods to call these APIs with `QueryOptions`.
2021-12-15 10:44:03 -05:00
Luiz Aoqui 05bb65779c
api: return error when LicenseGet status is not 200 (#11644) 2021-12-14 19:47:09 -05:00
Tim Gross a0cf5db797
provide -no-shutdown-delay flag for job/alloc stop (#11596)
Some operators use very long group/task `shutdown_delay` settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to cause
that drain to be skipped so they can quickly shed load.

Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and
`nomad job stop` commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.

Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads as the
tasks will exit before clients receive updated service discovery
information and won't be gracefully drained.
2021-12-13 14:54:53 -05:00
Tim Gross 5a68373e7f Version 1.2.3
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJhs7QgAAoJELC0QQl2hbZ2IQQP/3aKKgsptB0IPGx4vAAlIfMY
 IUyj9KdQ0SRN4B0C4h/T3CxqIhFPGmrV2RkOtEpDyBJuTUbH4FBjCscsKePFON+g
 Kfk/SoP05AQSksXFiKVK99UxUjg43SdqvatwnmLH4hafapbq5mMouTkBho+i05xK
 n6853DOwoq5qsPs6ihwRddRtpduozBKWLBMoBUm3syf8erWX0dafU5WszvLvG16R
 YJxTNr0nwQFhKDfY1CFUHJglj1s521poA9Zj6Xa1fNnIQ2JdKW1kElPUXmra1w7X
 0Wussv4fgJAetTO2bz0+IeuQf+EzxQ7vKDklt4ORypXkwiC9h7x2ZNCKRL+GReyU
 wUnzccXBfOsgpvW5EAoNXCGOQa6c2+uvHAAd62AAqljLh+B+yDJysvPobihfbSsu
 E2kXJEd3N6GndDjFfzUaYPhhGkvBaPUTNxybSaaREShJ7a7c8tedxfMpNYt1RwGz
 llJEoeZZketwjEFLEHp9xjNeqXdAXyrqCkluMvy+foU72HaRPFc0tlDnRsqirZ0p
 hBxLxPp5oM4V/RegTa3z8P4J0kMSvCdCE4bPNgyiEJDmvxRYDVk5YorLTCDTGrWU
 4WO7fue0bOwhGBYWRfAWzfpoHrCvRLto2vVdtBaFwlzmGP8j/QjM8ANrGyiJeiuY
 IPZSM93pAAcWQEV9id/E
 =G3In
 -----END PGP SIGNATURE-----

Merge tag 'v1.2.3' into merge-release-1.2.3-branch

Version 1.2.3
2021-12-13 10:12:07 -05:00
Tim Gross 46e1d29298 golang security update 1.17.5 2021-12-10 13:50:22 -05:00
Tim Gross 624ecab901
evaluations list pagination and filtering (#11648)
API queries can request pagination using the `NextToken` and `PerPage`
fields of `QueryOptions`, when supported by the underlying API.

Add a `NextToken` field to the `structs.QueryMeta` so that we have a
common field across RPCs to tell the caller where to resume paging
from on their next API call. Include this field on the `api.QueryMeta`
as well so that it's available for future versions of List HTTP APIs
that wrap the response with `QueryMeta` rather than returning a simple
list of structs. In the meantime callers can get the `X-Nomad-NextToken`.

Add pagination to the `Eval.List` RPC by checking for pagination token
and page size in `QueryOptions`. This will allow resuming from the
last ID seen so long as the query parameters and the state store
itself are unchanged between requests.

Add filtering by job ID or evaluation status over the results we get
out of the state store.

Parse the query parameters of the `Eval.List` API into the arguments
expected for filtering in the RPC call.
2021-12-10 13:43:03 -05:00
Lukas W 0e5958d671
CLI: Return non-zero exit code when deployment fails in nomad run (#11550)
* Exit non-zero from run command if deployment fails
* Fix typo in deployment monitor introduced in 0edda11
2021-12-09 09:09:28 -05:00
Vyacheslav Morov 6a244f18ad
cli: Add var args to plan output. (#11631) 2021-12-07 10:43:52 -05:00
Tim Gross 03e697a69d
scheduler: config option to reject job registration (#11610)
During incident response, operators may find that automated processes
elsewhere in the organization can be generating new workloads on Nomad
clusters that are unable to handle the workload. This changeset adds a
field to the `SchedulerConfiguration` API that causes all job
registration calls to be rejected unless the request has a management
ACL token.
2021-12-06 15:20:34 -05:00
Derek Strickland 8595e3ed6a
Add change log entry for PR 11592 (#11609) 2021-12-02 16:18:56 -05:00
Tim Gross 5097546153
changelog: new metrics in Nomad Enterprise (#11591)
This changelog is for a PR that landed in Nomad Enterprise only.
2021-12-01 09:15:12 -05:00
Michael Schurter 3d248153f4
Merge pull request #11579 from hashicorp/b-getscalingpolicy-rpc-index-response
rpc: fix scaling policy get index response when policy is found.
2021-11-30 10:43:20 -08:00
Tim Gross 6e1311a265
client: respect client_auto_join after connection loss (#11585)
The `consul.client_auto_join` configuration block tells the Nomad
client whether to use Consul service discovery to find Nomad
servers. By default it is set to `true`, but contrary to the
documentation it was only respected during the initial client
registration. If a client missed a heartbeat, failed a
`Node.UpdateStatus` RPC, or if there was no Nomad leader, the client
would fallback to Consul even if `client_auto_join` was set to
`false`. This changeset returns early from the client's trigger for
Consul discovery if the `client_auto_join` field is set to `false`.
2021-11-30 13:20:42 -05:00
James Rasell a9a624574f
changelog: add entry for #11579 2021-11-26 11:16:17 +01:00
Tim Gross 74768eb7d3
scheduler: fix panic in system jobs when nodes filtered by class (#11565)
In the system scheduler, if a subset of clients are filtered by class,
we hit a code path where the `AllocMetric` has been copied, but the
`Copy` method does not instantiate the various maps. This leads to an
assignment to a nil map. This changeset ensures that the maps are
non-nil before continuing.

The `Copy` method relies on functions in the `helper` package that all
return nil slices or maps when passed zero-length inputs. This
changeset to fix the panic bug intentionally defers updating those
functions because it'll have potential impact on memory usage. See
https://github.com/hashicorp/nomad/issues/11564 for more details.
2021-11-24 12:59:15 -05:00
Tim Gross ba38008596
scheduler: fix panic in system jobs when nodes filtered by class (#11565)
In the system scheduler, if a subset of clients are filtered by class,
we hit a code path where the `AllocMetric` has been copied, but the
`Copy` method does not instantiate the various maps. This leads to an
assignment to a nil map. This changeset ensures that the maps are
non-nil before continuing.

The `Copy` method relies on functions in the `helper` package that all
return nil slices or maps when passed zero-length inputs. This
changeset to fix the panic bug intentionally defers updating those
functions because it'll have potential impact on memory usage. See
https://github.com/hashicorp/nomad/issues/11564 for more details.
2021-11-24 12:28:47 -05:00