* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.
* style: refactor force run function
* fix: remove defer and inline unlock for speed optimization
* Update nomad/leader.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* style: refactor tests to use must
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* fix: move back from defer to calling unlock before returning.
createEval cant be called with the lock on
* style: refactor test to use must
* added new entry to changelog and update comments
---------
Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
When a disconnect client reconnects the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.
This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.
This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.
This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier into the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.
It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the `untainted` set.
The system `SystemScheduler` is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.
Implement the new `nomad job restart` command that allows operators to
restart allocations tasks or reschedule then entire allocation.
Restarts can be batched to target multiple allocations in parallel.
Between each batch the command can stop and hold for a predefined time
or until the user confirms that the process should proceed.
This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.
Co-authored-by: Shishir Mahajan <smahajan@roblox.com>
Fixes#16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well hoppefully you get the
picture: an infinite loop occurs.
* Throw your mouse into traffic
* Add node metadata with a shortcut
* Re-labelled
* Adds a toast notification to job start/stop on keyboard shortcut
* Typo fix
* Added and flag to command
* cli[style]: small refactor to avoid confussion with tmpl variable
* Update inspect.mdx
* cli: add changelog entry
* Update .changelog/16478.txt
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update command/quota_inspect.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: add json and t flags to quota status command
* cli: add entry to changelog
* Update command/quota_status.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: Add and flags to server members
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: update the server memebers tests to use must
* cli: add flags addition to changelog
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
nomad login command does not need to know ACL Auth Method's type, since all
method names are unique.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* services: always set deregister flag after deregistration of group
This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).
In the tests ... we used to assert on the wrong behvior (remove twice) which
has now been corrected to assert we remove only once.
This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.
* services: cleanup group service hook tests
In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.
Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
* landlock: git needs more files for private repositories
This PR fixes artifact downloading so that git may work when cloning from
private repositories. It needs
- file read on /etc/passwd
- dir read on /root/.ssh
- file write on /root/.ssh/known_hosts
Add these rules to the landlock rules for the artifact sandbox.
* cr: use nonexistent instead of devnull
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* cr: use go-homdir for looking up home directory
* pr: pull go-homedir into explicit require
* cr: fixup homedir tests in homeless root cases
* cl: fix root test for real
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* cli: Add and flag to namespace status command
* Update command/namespace_status.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: update tests for namespace status command to use must
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Our auth token parsing code trims space around the `Authorization` header but
not around `X-Nomad-Token`. When using the UI, it's easy to accidentally
introduce a leading or trailing space, which results in spurious authentication
errors. Trim the space at the HTTP server.
* cgv1: do not disable cpuset manager if reserved interface already exists
This PR fixes a bug where restarting a Nomad Client on a machine using cgroups
v1 (e.g. Ubuntu 20.04) would cause the cpuset cgroups manager to disable itself.
This is being caused by incorrectly interpreting a "file exists" error as
problematic when ensuring the reserved cpuset exists. If we get a "file exists"
error, that just means the Client was likely restarted.
Note that a machine reboot would fix the issue - the groups interfaces are
ephemoral.
* cl: add cl
The job evaluate endpoint creates a new evaluation for the job which is
a write operation. This change modifies the necessary capability from
`read-job` to `submit-job` to better reflect this.
ACL policies can be associated with a job so that the job's Workload Identity
can have expanded access to other policy objects, including other
variables. Policies set on the variables the job automatically has access to
were ignored, but this includes policies with `deny` capabilities.
Additionally, when resolving claims for a workload identity without any attached
policies, the `ResolveClaims` method returned a `nil` ACL object, which is
treated similarly to a management token. While this was safe in Nomad 1.4.x,
when the workload identity token was exposed to the task via the `identity`
block, this allows a user with `submit-job` capabilities to escalate their
privileges.
We originally implemented automatic workload access to Variables as a separate
code path in the Variables RPC endpoint so that we don't have to generate
on-the-fly policies that blow up the ACL policy cache. This is fairly brittle
but also the behavior around wildcard paths in policies different from the rest
of our ACL polices, which is hard to reason about.
Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we
can push all this logic into the `acl` package and the behavior can be
consistent. This will allow a `deny` policy to override automatic access (and
probably speed up checks of non-automatic variable access).
* cli: add -json flag to alloc checks for completion
* CLI: Expand test to include testing the json flag for allocation checks
* Documentation: Add the checks command
* Documentation: Add example for alloc check command
* Update website/content/docs/commands/alloc/checks.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* CLI: Add template flag to alloc checks command
* Update website/content/docs/commands/alloc/checks.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* CLI: Extend test to include -t flag for alloc checks
* func: add changelog for added flags to alloc checks
* cli[doc]: Make usage section on alloc checks clearer
* Update website/content/docs/commands/alloc/checks.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Delete modd.conf
* cli[doc]: add -t flag to command description for alloc checks
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Co-authored-by: Juanita De La Cuesta Morales <juanita.delacuestamorales@juanita.delacuestamorales-LHQ7X0QG9X>
Most job subcommands allow for job ID prefix match as a convenience
functionality so users don't have to type the full job ID.
But this introduces a hard ACL requirement that the token used to run
these commands have the `list-jobs` permission, even if the token has
enough permission to execute the basic command action and the user
passed an exact job ID.
This change softens this requirement by not failing the prefix match in
case the request results in a permission denied error and instead using
the information passed by the user directly.
When the scheduler tries to find a placement for a new allocation, it iterates
over a subset of nodes. For each node, we populate a `NetworkIndex` bitmap with
the ports of all existing allocations and any other allocations already proposed
as part of this same evaluation via its `SetAllocs` method. Then we make an
"ask" of the `NetworkIndex` in `AssignPorts` for any ports we need and receive
an "offer" in return. The offer will include both static ports and any dynamic
port assignments.
The `AssignPorts` method was written to support group networks, and it shares
code that selects dynamic ports with the original `AssignTaskNetwork`
code. `AssignTaskNetwork` can request multiple ports from the bitmap at a
time. But `AssignPorts` requests them one at a time and does not account for
possible collisions, and doesn't return an error in that case.
What happens next varies:
1. If the scheduler doesn't place the allocation on that node, the port
conflict is thrown away and there's no problem.
2. If the node is picked and this is the only allocation (or last allocation),
the plan applier will reject the plan when it calls `SetAllocs`, as we'd expect.
3. If the node is picked and there are additional allocations in the same eval
that iterate over the same node, their call to `SetAllocs` will detect the
impossible state and the node will be rejected. This can have the puzzling
behavior where a second task group for the job without any networking at all
can hit a port collision error!
It looks like this bug has existed since we implemented group networks, but
there are several factors that add up to making the issue rare for many users
yet frustratingly frequent for others:
* You're more likely to hit this bug the more tightly packed your range for
dynamic ports is. With 12000 ports in the range by default, many clusters can
avoid this for a long time.
* You're more likely to hit case (3) for jobs with lots of allocations or if a
scheduler has to iterate over a large number of nodes, such as with system jobs,
jobs with `spread` blocks, or (sometimes) jobs using `unique` constraints.
For unlucky combinations of these factors, it's possible that case (3) happens
repeatedly, preventing scheduling of a given job until a client state
change (ex. restarting the agent so all its allocations are rescheduled
elsewhere) re-opens the range of dynamic ports available.
This changeset:
* Fixes the bug by accounting for collisions in dynamic port selection in
`AssignPorts`.
* Adds test coverage for `AssignPorts`, expands coverage of this case for the
deprecated `AssignTaskNetwork`, and tightens the dynamic port range in a
scheduler test for spread scheduling to more easily detect this kind of problem
in the future.
* Adds a `String()` method to `Bitmap` so that any future "screaming" log lines
have a human-readable list of used ports.
* client: disable running artifact downloader as nobody
This PR reverts a change from Nomad 1.5 where artifact downloads were
executed as the nobody user on Linux systems. This was done as an attempt
to improve the security model of artifact downloading where third party
tools such as git or mercurial would be run as the root user with all
the security implications thereof.
However, doing so conflicts with Nomad's own advice for securing the
Client data directory - which when setup with the recommended directory
permissions structure prevents artifact downloads from working as intended.
Artifact downloads are at least still now executed as a child process of
the Nomad agent, and on modern Linux systems make use of the kernel Landlock
feature for limiting filesystem access of the child process.
* docs: update upgrade guide for 1.5.1 sandboxing
* docs: add cl
* docs: add title to upgrade guide fix
Wildcard datacenters introduced a bug where a job with any wildcard datacenters
will always be treated as a destructive update when we check whether a
datacenter has been removed from the jobspec.
Includes updating the helper so that callers don't have to loop over the job's
datacenters.
* Fix for wildcard DC sys/sysbatch jobs
* A few extra modules for wildcard DC in systemish jobs
* doesMatchPattern moved to its own util as match-glob
* DC glob lookup using matchGlob
* PR feedback
Some of the methods in `Allocations()` incorrectly use the `putQuery`
in API calls where `put` is more appropriate since they are not reading
information back. These methods are also not returning request metadata
such as `LastIndex` back to callers, which can be useful to have in some
scenarios.
They also provide poor developer experience as they take an
`*api.Allocation` struct when only the allocation ID is necessary. This
can lead consumers to make unnecessary API calls to fetch the full
allocation.
Fixing these problems require updating the methods' signatures so they
take `*WriteOptions` instead of `*QueryOptions` and return `*WriteMeta`,
but this is a breaking change that requires advanced notice to consumers.
This commit adds a future breaking change notice and also fixes the
`Stop` method so it properly returns request metadata in a backwards
compatible way.
In Nomad 0.12.1 we introduced atomic job registration/deregistration, where the
new eval was written in the same raft entry. Backwards-compatibility checks were
supposed to have been removed in Nomad 1.1.0, but we missed that. This is long
safe to remove.
Several `nomad job` subcommands had duplicate or slightly similar logic
for resolving a job ID from a CLI argument prefix, while others did not
have this functionality at all.
This commit pulls the shared logic to the command Meta and updates all
`nomad job` subcommands to use it.
When native service discovery was added, we used the node secret as the auth
token. Once Workload Identity was added in Nomad 1.4.x we needed to use the
claim token for `template` blocks, and so we allowed valid claims to bypass the
ACL policy check to preserve the existing behavior. (Invalid claims are still
rejected, so this didn't widen any security boundary.)
In reworking authentication for 1.5.0, we unintentionally removed this
bypass. For WIs without a policy attached to their job, everything works as
expected because the resulting `acl.ACL` is nil. But once a policy is attached
to the job the `acl.ACL` is no longer nil and this causes permissions errors.
Fix the regression by adding back the bypass for valid claims. In future work,
we should strongly consider getting turning the implicit policies into real
`ACLPolicy` objects (even if not stored in state) so that we don't have these
kind of brittle exceptions to the auth code.
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.
Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
https://github.com/hashicorp/consul/pull/9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:
* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
discovery or CSI plugin states
Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
Nomad servers can advertise independent IP addresses for `serf` and
`rpc`. Somewhat unexpectedly, the `serf` address is also used for both Serf and
server-to-server RPC communication (including Raft RPC). The address advertised
for `rpc` is only used for client-to-server RPC. This split was introduced
intentionally in Nomad 0.8.
When clients are using Consul discovery for connecting to servers, they get an
initial discovery set from Consul and use the correct `rpc` tag in Consul to get
a list of adddresses for servers. The client then makes a `Status.Peers` RPC to
get the list of those servers that are raft peers. But this endpoint is shared
between servers and clients, and provides the address used for Raft.
Most of the time this is harmless because servers will bind on 0.0.0.0 anyways.,
But in topologies where servers are on a private network and clients are on
separate subnets (or even public subnets), clients will make initial contact
with the server to get the list of peers but then populate their local server
set with unreachable addresses.
Cluster administrators can work around this problem by using `server_join` with
specific IP addresses (or DNS names), because the `Node.UpdateStatus` endpoint
returns the correct set of RPC addresses when updating the node. So once a
client has registered, it will get the correct set of RPC addresses.
This changeset updates the client logic to query `Status.Members` instead of
`Status.Peers`, and then extract the correctly advertised address and port from
the response body.
* build: add BuildDate to version info
will be used in enterprise to compare to license expiration time
* cli: multi-line version output, add BuildDate
before:
$ nomad version
Nomad v1.4.3 (coolfakecommithashomgoshsuchacoolonewoww)
after:
$ nomad version
Nomad v1.5.0-dev
BuildDate 2023-02-17T19:29:26Z
Revision coolfakecommithashomgoshsuchacoolonewoww
compare consul:
$ consul version
Consul v1.14.4
Revision dae670fe
Build Date 2023-01-26T15:47:10Z
Protocol 2 spoken by default, blah blah blah...
and vault:
$ vault version
Vault v1.12.3 (209b3dd99fe8ca320340d08c70cff5f620261f9b), built 2023-02-02T09:07:27Z
* docs: update version command output
The `TaskUpdateRequest` struct we send to task runner update hooks was not
populating the Nomad token that we get from the task runner (which we do for the
Vault token). This results in task runner hooks like the template hook
overwriting the Nomad token with the zero value for the token. This causes
in-place updates of a task to break templates (but not other uses that rely on
identity but don't currently bother to update it, like the identity hook).
The `CSIVolume` struct has references to allocations that are "denormalized"; we
don't store them on the `CSIVolume` struct but hydrate them on read. Tests
detecting potential state store corruptions found two locations where we're not
copying the volume before denormalizing:
* When garbage collecting CSI volume claims.
* When checking if it's safe to force-deregister the volume.
There are no known user-visible problems associated with these bugs but both
have the potential of mutating volume claims outside of a FSM transaction. This
changeset also cleans up state mutations in some CSI tests so as to avoid having
working tests cover up potential future bugs.
This PR fixes a bug where the task group information was not being set
on the serviceHook.AllocInfo struct, which is needed later on for calculating
the CheckID of a nomad service check. The CheckID is calculated independently
from multiple callsites, and the information being passed in must be consistent,
including the group name.
The workload.AllocInfo.Group was not set at this callsite, due to the bug fixed in this PR.
https://github.com/hashicorp/nomad/blob/main/client/serviceregistration/nsd/nsd.go#L114
This PR
- fixes a panic in GetItems when looking up a variable that does not exist.
- deprecates GetItems in favor of GetVariableItems which avoids returning a pointer to a map
- deprecates ErrVariableNotFound in favor of ErrVariablePathNotFound which is an actual error type
- does some minor code cleanup to make linters happier
The `nomad fmt -check` command incorrectly writes to file because we didn't
return before writing the file on a diff. Fix this bug and update the command
internals to differentiate between the write-to-file and write-to-stdout code
paths, which are activated by different combinations of options and flags.
The docstring for the `-list` and `-write` flags is also unclear and can be
easily misread to be the opposite of the actual behavior. Clarify this and fix
up the docs to match.
This changeset also refactors the tests quite a bit so as to make the test
outputs clear when something is incorrect.
* artifact: protect against unbounded artifact decompression
Starting with 1.5.0, set defaut values for artifact decompression limits.
artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation
artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.
* artifact: assert limits cannot be nil in validation
* Warn when Items key isn't directly accessible
Go template requires that map keys are alphanumeric for direct access
using the dotted reference syntax. This warns users when they create
keys that run afoul of this requirement.
- cli: use regex to detect invalid indentifiers in var keys
- test: fix slash in escape test case
- api: share warning formatting function between API and CLI
- ui: warn if var key has characters other than _, letter, or number
---------
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
The eval broker's `Cancelable` method used by the cancelable eval reaper mutates
the slice of cancelable evals by removing a batch at a time from the slice. But
this method unsafely uses a read lock despite this mutation. Under normal
workloads this is likely to be safe but when the eval broker is under the heavy
load this feature is intended to fix, we're likely to have a race
condition. Switch this to a write lock, like the other locks that mutate the
eval broker state.
This changeset also adjusts the timeout to allow poorly-sized Actions runners
more time to schedule the appropriate goroutines. The test has also been updated
to use `shoenig/test/wait` so we can have sensible reporting of the results
rather than just a timeout error when things go wrong.
* users: create cache for user lookups
This PR introduces a global cache for OS user lookups. This should
relieve pressure on the OS domain/directory lookups, which would be
queried more now that Task API exists.
Hits are cached for 1 hour, and misses are cached for 1 minute. These
values are fairly arbitrary - we can tweak them if there is any reason to.
Closes#16010
* users: delete expired negative entry from cache
Fixes#14617
Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted.
Includes new UI for editing metadata as well.
---------
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
* main: remove deprecated uses of rand.Seed
go1.20 deprecates rand.Seed, and seeds the rand package
automatically. Remove cases where we seed the random package,
and cleanup the one case where we intentionally create a
known random source.
* cl: update cl
* mod: update go mod
* docker: disable driver when running as non-root on cgroups v2 hosts
This PR modifies the docker driver to behave like exec when being run
as a non-root user on a host machine with cgroups v2 enabled. Because
of how cpu resources are managed by the Nomad client, the nomad agent
must be run as root to manage docker-created cgroups.
* cl: update cl
This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled.
This PR contains the core feature and tests but followup work is required for the following TODO items:
- Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
- Unit tests for auth middleware
- Caching for auth middleware
- Rate limiting on negative lookups for auth middleware
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* Demoable state
* Demo mirage color
* Label as a block with foreground and background colours
* Test mock updates
* Go test updated
* Documentation update for label support
Service jobs should have unique allocation Names, derived from the
Job.ID. System jobs do not have unique allocation Names because the index is
intended to indicated the instance out of a desired count size. Because system
jobs do not have an explicit count but the results are based on the targeted
nodes, the index is less informative and this was intentionally omitted from the
original design.
Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for
system/sysbatch jobs
Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs.
System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So
interpolation via `per_alloc` will not work as soon as there's more than one
allocation placed. Validate against this on job submission.
* largely a doc-ification of this commit message:
d47678074bf8ae9ff2da3c91d0729bf03aee8446
this doesn't spell out all the possible failure modes,
but should be a good starting point for folks.
* connect: add doc link to envoy bootstrap error
* add Unwrap() to RecoverableError
mainly for easier testing
Add `identity` jobspec block to expose workload identity tokens to tasks.
---------
Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
If a deployment fails, the `deployment status` command can get a nil deployment
when it checks for a rollback deployment if there isn't one (or at least not one
at the time of the query). Fix the panic.
* Extend variables under the nomad path prefix to allow for job-templates (#15570)
* Extend variables under the nomad path prefix to allow for job-templates
* Add job-templates to error message hinting
* RadioCard component for Job Templates (#15582)
* chore: add
* test: component API
* ui: component template
* refact: remove bc naming collission
* styles: remove SASS var causing conflicts
* Disallow specific variable at nomad/job-templates (#15681)
* Disallows variables at exactly nomad/job-templates
* idiomatic refactor
* Expanding nomad job init to accept a template flag (#15571)
* Adding a string flag for templates on job init
* data-down actions-up version of a custom template editor within variable
* Dont force grid on job template editor
* list-templates flag started
* Correctly slice from end of path name
* Pre-review cleanup
* Variable form acceptance test for job template editing
* Some review cleanup
* List Job templates test
* Example from template test
* Using must.assertions instead of require etc
* ui: add choose template button (#15596)
* ui: add new routes
* chore: update file directory
* ui: add choose template button
* test: button and page navigation
* refact: update var name
* ui: use `Button` component from `HDS` (#15607)
* ui: integrate buttons
* refact: remove helper
* ui: remove icons on non-tertiary buttons
* refact: update normalize method for key/value pairs (#15612)
* `revert`: `onCancel` for `JobDefinition`
The `onCancel` method isn't included in the component API for `JobEditor` and the primary cancel behavior exists outside of the component. With the exception of the `JobDefinition` page where we include this button in the top right of the component instead of next to the `Plan` button.
* style: increase button size
* style: keep lime green
* ui: select template (#15613)
* ui: deprecate unused component
* ui: deprecate tests
* ui: jobs.run.templates.index
* ui: update logic to handle templates
* refact: revert key/value changes
* style: padding for cards + buttons
* temp: fixtures for mirage testing
* Revert "refact: revert key/value changes"
This reverts commit 124e95d12140be38fc921f7e15243034092c4063.
* ui: guard template for unsaved job
* ui: handle reading template variable
* Revert "refact: update normalize method for key/value pairs (#15612)"
This reverts commit 6f5ffc9b610702aee7c47fbff742cc81f819ab74.
* revert: remove test fixtures
* revert: prettier problems
* refact: test doesnt need filter expression
* styling: button sizes and responsive cards
* refact: remove route guarding
* ui: update variable adapter
* refact: remove model editing behavior
* refact: model should query variables to populate editor
* ui: clear qp on exit
* refact: cleanup deprecated API
* refact: query all namespaces
* refact: deprecate action
* ui: rely on collection
* refact: patch deprecate transition API
* refact: patch test to expect namespace qp
* styling: padding, conditionals
* ui: flashMessage on 404
* test: update for o(n+1) query
* ui: create new job template (#15744)
* refact: remove unused code
* refact: add type safety
* test: select template flow
* test: add data-test attrs
* chore: remove dead code
* test: create new job flow
* ui: add create button
* ui: create job template
* refact: no need for wildcard
* refact: record instead of delete
* styling: spacing
* ui: add error handling and form validation to job create template (#15767)
* ui: handle server side errors
* ui: show error to prevent duplicate
* refact: conditional namespace
* ui: save as template flow (#15787)
* bug: patches failing tests associated with `pretender` (#15812)
* refact: update assertion
* refact: test set-up
* ui: job templates manager view (#15815)
* ui: manager list view
* test: edit flow
* refact: deprecate column-helper
* ui: template edit and delete flow (#15823)
* ui: manager list view
* refact: update title
* refact: update permissions
* ui: template edit page
* bug: typo
* refact: update toast messages
* bug: clear selections on exit (#15827)
* bug: clear controllers on exit
* test: mirage config changes (#15828)
* refact: deprecate column-helper
* style: update z-index for HDS
* Revert "style: update z-index for HDS"
This reverts commit d3d87ceab6d083f7164941587448607838944fc1.
* refact: update delete button
* refact: edit redirect
* refact: patch reactivity issues
* styling: fixed width
* refact: override defaults
* styling: edit text causing overflow
* styling: add inline text
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
* bug: edit `text` to `template`
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
* test: delete flow job templates (#15896)
* refact: edit names
* bug: set correct ref to store
* chore: trim whitespace:
* test: delete flow
* bug: reactively update view (#15904)
* Initialized default jobs (#15856)
* Initialized default jobs
* More jobs scaffolded
* Better commenting on a couple example job specs
* Adapter doing the work
* fall back to epic config
* Label format helper and custom serialization logic
* Test updates to account for a never-empty state
* Test suite uses settled and maintain RecordArray in adapter return
* Updates to hello-world and variables example jobspecs
* Parameterized job gets optional payload output
* Formatting changes for param and service discovery job templates
* Multi-group service discovery job
* Basic test for default templates (#15965)
* Basic test for default templates
* Percy snapshot for manage page
* Some late-breaking design changes
* Some copy edits to the header paragraphs for job templates (#15967)
* Added some init options for job templates (#15994)
* Async method for populating default job templates from the variable adapter
---------
Co-authored-by: Jai <41024828+ChaiWithJai@users.noreply.github.com>
* Add `bridge_network_hairpin_mode` client config setting
* Add node attribute: `nomad.bridge.hairpin_mode`
* Changed format string to use `%q` to escape user provided data
* Add test to validate template JSON for developer safety
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
The ACL token decoding was not correctly handling time duration
syntax such as "1h" which forced people to use the nanosecond
representation via the HTTP API.
The change adds an unmarshal function which allows this syntax to
be used, along with other styles correctly.
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.
Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
* docker: set force=true on remove image to handle images referenced by multiple tags
This PR changes our call of docker client RemoveImage() to RemoveImageExtended with
the Force=true option set. This fixes a bug where an image referenced by more than
one tag could never be garbage collected by Nomad. The Force option only applies to
stopped containers; it does not affect running workloads.
* docker: add note about image_delay and multiple tags
* Ensure infra_image gets proper label used for reconciliation
Currently infra containers are not cleaned up as part of the dangling container
cleanup routine. The reason is that Nomad checks if a container is a Nomad owned
container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label.
Ensure we set this label on the infra container as well.
* fix unit test
* changelog: add entry
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
When registering a job with a service and 'consul.allow_unauthenticated=false',
we scan the given Consul token for an acceptable policy or role with an
acceptable policy, but did not scan for an acceptable service identity (which
is backed by an acceptable virtual policy). This PR updates our consul token
validation to also accept a matching service identity when registering a service
into Consul.
Fixes#15902
* client: run alloc pre-kill hooks on last pass despite no live tasks
This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.
Fixes#15477
* client: do not run ar cleanup hooks if client is shutting down
When the template hook Update() method is called it may recreate the
template manager if the Nomad or Vault token has been updated.
This caused the new template manager did not have a driver handler
because this was only being set on the Poststart hook, which is not
called for inplace updates.