An ACL roles name must be unique, however, a bug meant multiple
roles of the same same could be created. This fixes that problem
with checks in the RPC handler and state store.
* allocrunner: handle lifecycle when all tasks die
When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.
* allocrunner: implement different alloc restarts
Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.
* allocrunner: allow tasks to run again
Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.
The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.
* api/cli: add support for all tasks alloc restart
Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.
* test: fix tasklifecycle Coordinator test
* allocrunner: kill taskrunners if all tasks are dead
When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.
* taskrunner: fix tests that waited on WaitCh
Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.
* tests: add tests for all tasks alloc restart
* changelog: add entry for #14127
* taskrunner: fix restore logic.
The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.
It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.
The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.
This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.
* address code review requests
* apply more code review changes
* taskrunner: add different Restart modes
Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.
So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
restart if its `Run()` method is still active. Callers will need to
restart the allocRunner taskCoordinator to make sure it will allow the
task to run again.
* minor fixes
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.
This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.
In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
Making the ACL Role listing return object a stub future-proofs the
endpoint. In the event the role object grows, we are not bound by
having to return all fields within the list endpoint or change the
signature of the endpoint to reduce the list return size.
This PR adds some NSD check status output to the CLI.
1. The 'nomad alloc status' command produces nsd check summary output (if present)
2. The 'nomad alloc checks' sub-command is added to produce complete nsd check output (if present)
HTTP API consumers that have network line-of-sight to client nodes can connect
directly for a small number of APIs. But in environments where the consumer
doesn't have line-of-sight, there's a long pause waiting for the
`api.ClientConnTimeout` to expire. Warn about this in the API docs so that
authors can avoid the extra timeout.
UID/GID 0 is usually reserved for the root user/group. While Nomad
clients are expected to run as root it may not always be the case.
Setting these values as -1 if not defined will fallback to the pervious
behaviour of not attempting to set file ownership and use whatever
UID/GID the Nomad agent is running as. It will also keep backwards
compatibility, which is specially important for platforms where this
feature is not supported, like Windows.
* Allow specification of CSI staging and publishing directory path
* Add website documentation for stage_publish_dir
* Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir
* Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir
Move the secure variables quota enforcement calls into the state store to ensure
quota checks are atomic with quota updates (in the same transaction).
Switch to a machine-size int instead of a uint64 for quota tracking. The
ENT-side quota spec is described as int, and negative values have a meaning as
"not permitted at all". Using the same type for tracking will make it easier to
the math around checks, and uint64 is infeasibly large anyways.
Add secure vars to quota HTTP API and CLI outputs and API docs.
Most of our objects use int64 timestamps derived from `UnixNano()` instead of
`time.Time` objects. Switch the keyring metadata to use `UnixNano()` for
consistency across the API.
When the `Full` flag is passed for key rotation, we kick off a core
job to decrypt and re-encrypt all the secure variables so that they
use the new key.
* Add Path only index for SecureVariables
* Add GetSecureVariablesByPrefix; refactor tests
* Add search for SecureVariables
* Add prefix search for secure variables
After internal design review, we decided to remove exposing algorithm
choice to the end-user for the initial release. We'll solve nonce
rotation by forcing rotations automatically on key GC (in a core job,
not included in this changeset). Default to AES-256 GCM for the
following criteria:
* faster implementation when hardware acceleration is available
* FIPS compliant
* implementation in pure go
* post-quantum resistance
Also fixed a bug in the decoding from keystore and switched to a
harder-to-misuse encoding method.
When a server becomes leader, it will check if there are any keys in
the state store, and create one if there is not. The key metadata will
be replicated via raft to all followers, who will then get the key
material via key replication (not implemented in this changeset).
This changeset implements the keystore serialization/deserialization:
* Adds a JSON serialization extension for the `RootKey` struct, along with a metadata stub. When we serialize RootKey to the on-disk keystore, we want to base64 encode the key material but also exclude any frequently-changing fields which are stored in raft.
* Implements methods for loading/saving keys to the keystore.
* Implements methods for restoring the whole keystore from disk.
* Wires it all up with the `Keyring` RPC handlers and fixes up any fallout on tests.
* core: allow pause/un-pause of eval broker on region leader.
* agent: add ability to pause eval broker via scheduler config.
* cli: add operator scheduler commands to interact with config.
* api: add ability to pause eval broker via scheduler config
* e2e: add operator scheduler test for eval broker pause.
* docs: include new opertor scheduler CLI and pause eval API info.
Closes#12927Closes#12958
This PR updates the version of redis used in our examples from 3.2 to 7.
The old version is very not supported anymore, and we should be setting
a good example by using a supported version.
The long-form example job is now fixed so that the service stanza uses
nomad as the service discovery provider, and so now the job runs without
a requirement of having Consul running and configured.
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading itself as another Vault
entity.
This PR introduces the `address` field in the `service` block so that Nomad
or Consul services can be registered with a custom `.Address.` to advertise.
The address can be an IP address or domain name. If the `address` field is
set, the `service.address_mode` must be set in `auto` mode.
* Add os to NodeListStub struct.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add test: os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
We expect every Nomad API client to use a single connection to any
given agent, so take advantage of keep-alive by switching the default
transport to `DefaultPooledClient`. Provide a facility to close idle
connections for testing purposes.
Restores the previously reverted #12409
Co-authored-by: Ben Buzbee <bbuzbee@cloudflare.com>
The only difference is DefaultTransport sets DisableKeepAlives
This doesn't make much sense to me - every http connection from the
nomad client goes to the same NOMAD_ADDR so it's a great case for keep
alive. Except round robin DNS and anycast perhaps.
Consul does this already
1e47e3c82b/api/api.go (L397)
Move some common Vault API data struct decoding out of the Vault client
so it can be reused in other situations.
Make Vault job validation its own function so it's easier to expand it.
Rename the `Job.VaultPolicies` method to just `Job.Vault` since it
returns the full Vault block, not just their policies.
Set `ChangeMode` on `Vault.Canonicalize`.
Add some missing tests.
Allows specifying an entity alias that will be used by Nomad when
deriving the task Vault token.
An entity alias assigns an indentity to a token, allowing better control
and management of Vault clients since all tokens with the same indentity
alias will now be considered the same client. This helps track Nomad
activity in Vault's audit logs and better control over Vault billing.
Add support for a new Nomad server configuration to define a default
entity alias to be used when deriving Vault tokens. This default value
will be used if the task doesn't have an entity alias defined.
Pass-through the `-secret` and `-parameter` flags to allow setting
parameters for the snapshot and overriding the secrets we've stored on
the CSI volume in the state store.
The service registration client name was used to provide a
distinction between the service block and the service client. This
however creates new wording to understand and does not match the
CLI, therefore this change fixes that so we have a Services
client.
Consul specific objects within the service file have been moved to
the consul location to create a clearer separation.
The `related` query param is used to indicate that the request should
return a list of related (next, previous, and blocked) evaluations.
Co-authored-by: Jasmine Dahilig <jasmine@hashicorp.com>
This is a followup to running tests in serial in CI.
Since the API package cannot import anything outside of api/,
copy the ci.Parallel function into api/internal/testutil, and
have api tests use that.
When we set the headers for CSI secrets in the `WriteOptions`, it
turns out that we're not always passing a non-nil object. In that
case, instanstiate it on demand in the API.
The paginator logic was built when go-memdb iterators would return items
ordered lexicographically by their ID prefixes, but #12054 added the
option for some tables to return results ordered by their `CreateIndex`
instead, which invalidated the previous paginator assumption.
The iterator used for pagination must still return results in some order
so that the paginator can properly handle requests where the next_token
value is not present in the results anymore (e.g., the eval was GC'ed).
In these situations, the paginator will start the returned page in the
first element right after where the requested token should've been.
This commit moves the logic to generate pagination tokens from the
elements being paginated to the iterator itself so that callers can have
more control over the token format to make sure they are properly
ordered and stable.
It also allows configuring the paginator as being ordered in ascending
or descending order, which is relevant when looking for a token that may
not be present anymore.
Detection of the full set of plugin capabilities was added in Nomad
1.1 for the volume creation workflow, but these were not added to the
API response for plugins.
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.
- Eval.List
- Deployment.List
This PR replaces use of time.After with a safe helper function
that creates a time.Timer to use instead. The new function returns
both a time.Timer and a Stop function that the caller must handle.
Unlike time.NewTimer, the helper function does not panic if the duration
set is <= 0.
This PR sets the minimum Go version for the `api` submodule to Go 1.17.
It also upgrades
- gorilla/websocket 1.4.1 -> 1.4.2
- mitchelh/mapstructure 1.4.2 -> 1.4.3
- stretchr/testify 1.5.1 -> 1.7.0
Closes#11518#11602#11528
After swapping gzip handler to use the gorilla library, we
must account for a quirk in how zero/minimal length response
bodies are delivered.
The previous gzip handler was configured to compress all responses
regardless of size - even if the data was zero length or below the
network MTU. This behavior changed in [v1.1.0](c551b6c3b4 (diff-de723e6602cc2f16f7a9d85fd89d69954edc12a49134dab8901b10ee06d1879d))
which is why we could not upgrade.
The Nomad HTTP Client mutates the http.Response.Body object, making
a strong assumption that if the Content-Encoding header is set to "gzip",
the response will be readable via gzip decoder. This is no longer true
for the nytimes gzip handler, and is also not true for the gorilla gzip
handler.
It seems in practice this only makes a difference on the /v1/operator/license
endpoint which returns an empty response in OSS Nomad.
The fix here is to simply not wrap the response body reader if we
encounter an io.EOF while creating the gzip reader - indicating there
is no data to decode.
This PR exposes the following existing`consul-template` configuration options to Nomad jobspec authors in the `{job.group.task.template}` stanza.
- `wait`
It also exposes the following`consul-template` configuration to Nomad operators in the `{client.template}` stanza.
- `max_stale`
- `block_query_wait`
- `consul_retry`
- `vault_retry`
- `wait`
Finally, it adds the following new Nomad-specific configuration to the `{client.template}` stanza that allows Operators to set bounds on what `jobspec` authors configure.
- `wait_bounds`
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
## Development Environment Changes
* Added stringer to build deps
## New HTTP APIs
* Added scheduler worker config API
* Added scheduler worker info API
## New Internals
* (Scheduler)Worker API refactor—Start(), Stop(), Pause(), Resume()
* Update shutdown to use context
* Add mutex for contended server data
- `workerLock` for the `workers` slice
- `workerConfigLock` for the `Server.Config.NumSchedulers` and
`Server.Config.EnabledSchedulers` values
## Other
* Adding docs for scheduler worker api
* Add changelog message
Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
When a cluster doesn't have a leader, the `nomad operator debug`
command can safely use stale queries to gracefully degrade the
consistency of almost all its queries. The query parameter for these
API calls was not being set by the command.
Some `api` package queries do not include `QueryOptions` because
they target a specific agent, but they can potentially be forwarded to
other agents. If there is no leader, these forwarded queries will
fail. Provide methods to call these APIs with `QueryOptions`.
Some operators use very long group/task `shutdown_delay` settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to cause
that drain to be skipped so they can quickly shed load.
Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and
`nomad job stop` commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.
Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads as the
tasks will exit before clients receive updated service discovery
information and won't be gracefully drained.
API queries can request pagination using the `NextToken` and `PerPage`
fields of `QueryOptions`, when supported by the underlying API.
Add a `NextToken` field to the `structs.QueryMeta` so that we have a
common field across RPCs to tell the caller where to resume paging
from on their next API call. Include this field on the `api.QueryMeta`
as well so that it's available for future versions of List HTTP APIs
that wrap the response with `QueryMeta` rather than returning a simple
list of structs. In the meantime callers can get the `X-Nomad-NextToken`.
Add pagination to the `Eval.List` RPC by checking for pagination token
and page size in `QueryOptions`. This will allow resuming from the
last ID seen so long as the query parameters and the state store
itself are unchanged between requests.
Add filtering by job ID or evaluation status over the results we get
out of the state store.
Parse the query parameters of the `Eval.List` API into the arguments
expected for filtering in the RPC call.
During incident response, operators may find that automated processes
elsewhere in the organization can be generating new workloads on Nomad
clusters that are unable to handle the workload. This changeset adds a
field to the `SchedulerConfiguration` API that causes all job
registration calls to be rejected unless the request has a management
ACL token.
This change modifies the Nomad job register and deregister RPCs to
accept an updated option set which includes eval priority. This
param is optional and override the use of the job priority to set
the eval priority.
In order to ensure all evaluations as a result of the request use
the same eval priority, the priority is shared to the
allocReconciler and deploymentWatcher. This creates a new
distinction between eval priority and job priority.
The Nomad agent HTTP API has been modified to allow setting the
eval priority on job update and delete. To keep consistency with
the current v1 API, job update accepts this as a payload param;
job delete accepts this as a query param.
Any user supplied value is validated within the agent HTTP handler
removing the need to pass invalid requests to the server.
The register and deregister opts functions now all for setting
the eval priority on requests.
The change includes a small change to the DeregisterOpts function
which handles nil opts. This brings the function inline with the
RegisterOpts.
* build(deps): bump github.com/kr/pretty from 0.1.0 to 0.3.0 in /api
Bumps [github.com/kr/pretty](https://github.com/kr/pretty) from 0.1.0 to 0.3.0.
- [Release notes](https://github.com/kr/pretty/releases)
- [Commits](https://github.com/kr/pretty/compare/v0.1.0...v0.3.0)
---
updated-dependencies:
- dependency-name: github.com/kr/pretty
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* update in core as well and tidy
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tim Gross <tim@0x74696d.com>
Enhance the CLI in order to return the host network in two flavors
(default, verbose) of the `node status` command.
Fixes: #11223.
Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
Add a new hostname string parameter to the network block which
allows operators to specify the hostname of the network namespace.
Changing this causes a destructive update to the allocation and it
is omitted if empty from API responses. This parameter also supports
interpolation.
In order to have a hostname passed as a configuration param when
creating an allocation network, the CreateNetwork func of the
DriverNetworkManager interface needs to be updated. In order to
minimize the disruption of future changes, rather than add another
string func arg, the function now accepts a request struct along with
the allocID param. The struct has the hostname as a field.
The in-tree implementations of DriverNetworkManager.CreateNetwork
have been modified to account for the function signature change.
In updating for the change, the enhancement of adding hostnames to
network namespaces has also been added to the Docker driver, whilst
the default Linux manager does not current implement it.
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527
* api: revert to defaulting to http/1
PR #10778 incidentally changed the api http client to connect with
HTTP/2 first. However, the websocket libraries used in `alloc exec`
features don't handle http/2 well, and don't downgrade to http/1
gracefully.
Given that the switch is incidental, and not requested by users.
Furthermore, api consumers can opt-in to forcing http/2 by setting
custom http clients.
Fixes#10922
This PR fixes the API to _not_ set the default mesh gateway mode. Before,
the mode would be set to "none" in Canonicalize, which is incorrect. We
should pass through the empty string so that folks can make use of Consul
service-defaults Config entries to configure the default mode.
This PR implements first-class support for Nomad running Consul
Connect Mesh Gateways. Mesh gateways enable services in the Connect
mesh to make cross-DC connections via gateways, where each datacenter
may not have full node interconnectivity.
Consul docs with more information:
https://www.consul.io/docs/connect/gateways/mesh-gateway
The following group level service block can be used to establish
a Connect mesh gateway.
service {
connect {
gateway {
mesh {
// no configuration
}
}
}
}
Services can make use of a mesh gateway by configuring so in their
upstream blocks, e.g.
service {
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "<service>"
local_bind_port = <port>
datacenter = "<datacenter>"
mesh_gateway {
mode = "<mode>"
}
}
}
}
}
}
Typical use of a mesh gateway is to create a bridge between datacenters.
A mesh gateway should then be configured with a service port that is
mapped from a host_network configured on a WAN interface in Nomad agent
config, e.g.
client {
host_network "public" {
interface = "eth1"
}
}
Create a port mapping in the group.network block for use by the mesh
gateway service from the public host_network, e.g.
network {
mode = "bridge"
port "mesh_wan" {
host_network = "public"
}
}
Use this port label for the service.port of the mesh gateway, e.g.
service {
name = "mesh-gateway"
port = "mesh_wan"
connect {
gateway {
mesh {}
}
}
}
Currently Envoy is the only supported gateway implementation in Consul.
By default Nomad client will run the latest official Envoy docker image
supported by the local Consul agent. The Envoy task can be customized
by setting `meta.connect.gateway_image` in agent config or by setting
the `connect.sidecar_task` block.
Gateways require Consul 1.8.0+, enforced by the Nomad scheduler.
Closes#9446
refactor the api handling of `nomad exec`, and ensure that we process
all received events before handling websocket closing.
The exit code should be the last message received, and we ought to
ignore any websocket close error we receive afterwards.
Previously, we used two channels: one for websocket frames and another
for handling errors. This raised the possibility that we processed the
error before processing the frames, resulting into an "unexpected EOF"
error.
Add a new driver capability: RemoteTasks.
When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.
This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.
See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
Small change to pull in ent struct types in a switch
statement used by ent. They are benign in oss, this
is just to make sure OSS->ENT merges don't create a
diff.
This PR introduces the /v1/search/fuzzy API endpoint, used for fuzzy
searching objects in Nomad. The fuzzy search endpoint routes requests
to the Nomad Server leader, which implements the Search.FuzzySearch RPC
method.
Requests to the fuzzy search API are based on the api.FuzzySearchRequest
object, e.g.
{
"Text": "ed",
"Context": "all"
}
Responses from the fuzzy search API are based on the api.FuzzySearchResponse
object, e.g.
{
"Index": 27,
"KnownLeader": true,
"LastContact": 0,
"Matches": {
"tasks": [
{
"ID": "redis",
"Scope": [
"default",
"example",
"cache"
]
}
],
"evals": [],
"deployment": [],
"volumes": [],
"scaling_policy": [],
"images": [
{
"ID": "redis:3.2",
"Scope": [
"default",
"example",
"cache",
"redis"
]
}
]
},
"Truncations": {
"volumes": false,
"scaling_policy": false,
"evals": false,
"deployment": false
}
}
The API is tunable using the new server.search stanza, e.g.
server {
search {
fuzzy_enabled = true
limit_query = 200
limit_results = 1000
min_term_length = 5
}
}
These values can be increased or decreased, so as to provide more
search results or to reduce load on the Nomad Server. The fuzzy search
API can be disabled entirely by setting `fuzzy_enabled` to `false`.
Consul allows specifying the HTTP body to send in a health check. Nomad
uses Consul for health checking so this just plumbs the value through to
where the Consul API is called.
There is no validation that `body` is not used with an incompatible
check method like GET.