Fix numerous go-getter security issues:
- Add timeouts to http, git, and hg operations to prevent DoS
- Add size limit to http to prevent resource exhaustion
- Disable following symlinks in both artifacts and `job run`
- Stop performing initial HEAD request to avoid file corruption on
retries and DoS opportunities.
**Approach**
Since Nomad has no ability to differentiate a DoS-via-large-artifact vs
a legitimate workload, all of the new limits are configurable at the
client agent level.
The max size of HTTP downloads is also exposed as a node attribute so
that if some workloads have large artifacts they can specify a high
limit in their jobspecs.
In the future all of this plumbing could be extended to enable/disable
specific getters or artifact downloading entirely on a per-node basis.
The description of `mount_flags` provides incorrect example
of the accepted value format.
This fixes the issue by changing the example from a string
`ro,noatime` to a slice of strings `["ro", "noatime"]`.
Nomad errors out when attempting to specify a task for a service that uses consul connect but does not have script or gRPC checks. See 304d0cf595/nomad/structs/structs.go (L6643) for details.
Closes#12927Closes#12958
This PR updates the version of redis used in our examples from 3.2 to 7.
The old version is very not supported anymore, and we should be setting
a good example by using a supported version.
The long-form example job is now fixed so that the service stanza uses
nomad as the service discovery provider, and so now the job runs without
a requirement of having Consul running and configured.
The shortlink /s/port-plan-failure is logged when a plan for a node is
rejected to help users debug and mitigate repeated `plan for node
rejected` failures.
The current link to #9506 is... less than useful. It is not clear to
users what steps they should take to either fix their cluster or
contribute to the issue.
While .../monitoring-nomad#progess isn't as comprehensive as it could
be, it's a much more gentle introduction to the class of bug than the
original issue.
In #12324 we made it so that plugins wait until the node drain is
complete, as we do for system jobs. But we neglected to mark the node
drain as complete once only plugins (or system jobs) remaining, which
means that the node drain is left in a draining state until the
`deadline` time expires. This was incorrectly documented as expected
behavior in #12324.
notably:
- name of the compiled binary is 'nomad-device-nvidia', not 'nvidia-gpu'
- link to Nvidia docs for installing the container runtime toolkit
- list docker v19.03 as minimum version, to track with nvidia's new container runtime toolkit
The capacity fields for `create volume` set bounds on the resulting
size of the volume, but the ultimate size of the volume will be
determined by the storage provider (between the min and max). Clarify
this in the documentation and provide a suggestion for how to set a
exact size.
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
* docs: update json jobs docs
Did you know that Nomad has not 1 but 2 JSON formats for jobs? 2½ if you
want to acknowledge that sometimes our JSON job representations have a
Job top-level wrapper and sometimes do not.
The 2½ formats are:
```
1. HCL JSON
2. Input API JSON (top-level Job field)
2.5. Output API JSON (lacks top-level Job field)
```
`#2` is what our docs consider our API JSON. `#2.5` seems to be an
accident of history we can't fix with breaking API compatibility.
`#1` is an even more interesting accident of history: the `jobspec2`
package automatically detects if the input to Parse is JSON and switches
to a JSON parser. This behavior is undocumented, the format is
unspecified, and there is no official HashiCorp tooling to produce this
JSON from HCL. The plot thickens when you discover popular third party
tools like hcl2json.com and https://github.com/tmccombs/hcl2json seem to
produce JSON that `nomad run` accepts!
Since we have no telemetry around whether or not anyone passes HCL JSON
to `nomad run`, and people don't file bugs around features that Just
Work, I'm choosing to leave that code path in place and *acknowledged
but not suggested* in documentation.
See https://github.com/hashicorp/hcl/issues/498 for a more comprehensive
discussion of what officially supporting HCL JSON in Nomad would look
like.
(I also added some of the missing fields to the (Input API flavor) JSON
Job documentation, but it still needs a lot of work to be
comprehensive.)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading itself as another Vault
entity.
This PR introduces the `address` field in the `service` block so that Nomad
or Consul services can be registered with a custom `.Address.` to advertise.
The address can be an IP address or domain name. If the `address` field is
set, the `service.address_mode` must be set in `auto` mode.
* cli: add -json flag to support job commands
While the CLI has always supported running JSON jobs, its support has
been via HCLv2's JSON parsing. I have no idea what format it expects the
job to be in, but it's absolutely not in the same format as the API
expects.
So I ignored that and added a new -json flag to explicitly support *API*
style JSON jobspecs.
The jobspecs can even have the wrapping {"Job": {...}} envelope or not!
* docs: fix example for `nomad job validate`
We haven't been able to validate inside driver config stanzas ever since
the move to task driver plugins. 😭
The new `namespace apply` feature that allows for passing a namespace
specification file detects the difference between an empty namespace
and a namespace specification by checking if the file exists. For most
cases, the file will have an extension like `.hcl` and so there's
little danger that a user will apply a file spec when they intended to
apply a file name.
But because directory names typically don't include an extension,
you're much more likely to collide when trying to `namespace apply` by
name only, and then you get a confusing error message of the form:
Failed to read file: read $namespace: is a directory
Detect the case where the namespace name collides with a directory in
the current working directory, and skip trying to load the directory.
This PR updates the changelog, adds notes the 1.3 upgrade guide, and
updates the connect integration docs with documentation about the new
requirement on Consul ACL policies of Consul agent default anonymous ACL
tokens.
* Add os to NodeListStub struct.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add test: os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
- Moved federation docs to the bottom since *everyone* is potentially
affected by the other sections on the page, but only users of
federation are affected by it.
- Added section on the plan for node rejected bug since it is fairly
easy to diagnose and removing affected nodes is a fairly reliable
workaround.
- Mention 5s cliff for wait_for_index.
- Remove the lie that we do not have job status metrics! How old was
that?!
- Reinforce the importance of monitoring basic system resources
This PR expands on the work done in #12543 to
- prefix the tag, so it is now "nomad.alloc_id" to be more consistent with Consul tags
- merge into pre-existing envoy_stats_tags fields
- update the upgrade guide docs
- update changelog
* services: add pagination and filter support to info RPC.
* cli: add filter flag to service info command.
* docs: add pagination and filter details to services info API.
* paginator: minor updates to comment and func signature.
Many of our scripts have a non-portable interpreter line for bash and
use bash-specific variables like `BASH_SOURCE`. Update the interpreter
line to be portable between various Linuxes and macOS without
complaint from posix shell users.
Custom variable validation is a useful feature that is supported by
Nomad and not just Terraform. As such it should be documented on the
input variable page.
I've cribbed the content from the terraform docs so this should be
consistent across projects
We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The actual command in use is mostly fine (although I've fixed some quirks here), so what's really happened is that the change has revealed some existing issues in the tests. Summary of changes:
* Make first pprof collection synchronous to preserve the existing
behavior for the common case where the pprof interval matches the
duration.
* Clamp `operator debug` pprof timing to that of the command. The
`pprof-duration` should be no more than `duration` and the
`pprof-interval` should be no more than `pprof-duration`. Clamp the
values rather than throwing errors, which could change the commands
that existing users might already have in debugging scripts
* Testing: remove test parallelism
The `operator debug` tests that stand up servers can't be run in
parallel, because we don't have a way of canceling the API calls for
pprof. The agent will still be running the last pprof when we exit,
and that breaks the next test that talks to that same agent.
(Because you can only run one pprof at a time on any process!)
We could split off each subtest into its own server, but this test
suite is already very slow. In future work we should fix this "for
real" by making the API call cancelable.
* Testing: assert against unexpected errors in `operator debug` tests.
If we assert there are no unexpected error outputs, it's easier for
the developer to debug when something is going wrong with the tests
because the error output will be presented as a failing test, rather
than just a failing exit code check. Or worse, no failing exit code
check!
This also forces us to be explicit about which tests will return 0
exit codes but still emit (presumably ignorable) error outputs.
Additional minor bug fixes (mostly in tests) and test refactorings:
* Fix text alignment on pprof Duration in `operator debug` output
* Remove "done" channel from `operator debug` event stream test. The
goroutine we're blocking for here already tells us it's done by
sending a value, so block on that instead of an extraneous channel
* Event stream test timer should start at current time, not zero
* Remove noise from `operator debug` test log output. The `t.Logf`
calls already are picked out from the rest of the test output by
being prefixed with the filename.
* Remove explicit pprof args so we use the defaults clamped from
duration/interval
This PR injects the 'NOMAD_CPU_CORES' environment variable into
tasks that have been allocated reserved cpu cores. The value uses
normal cpuset notation, as found in cpuset.cpu cgroup interface files.
Note this value is not necessiarly the same as the content of the actual
cpuset.cpus interface file, which will also include shared cpu cores when
using cgroups v2. This variable is a workaround for users who used to be
able to read the reserved cgroup cpuset file, but lose the information
about distinct reserved cores when using cgroups v2.
Side discussion in: https://github.com/hashicorp/nomad/issues/12374
Move some common Vault API data struct decoding out of the Vault client
so it can be reused in other situations.
Make Vault job validation its own function so it's easier to expand it.
Rename the `Job.VaultPolicies` method to just `Job.Vault` since it
returns the full Vault block, not just their policies.
Set `ChangeMode` on `Vault.Canonicalize`.
Add some missing tests.
Allows specifying an entity alias that will be used by Nomad when
deriving the task Vault token.
An entity alias assigns an indentity to a token, allowing better control
and management of Vault clients since all tokens with the same indentity
alias will now be considered the same client. This helps track Nomad
activity in Vault's audit logs and better control over Vault billing.
Add support for a new Nomad server configuration to define a default
entity alias to be used when deriving Vault tokens. This default value
will be used if the task doesn't have an entity alias defined.
This PR adds support for the raw_exec driver on systems with only cgroups v2.
The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.
There is a bit of refactoring in this PR, but the fundamental design remains
the same.
Closes#12351#12348
The client configuration options for drivers have been deprecated
since 0.9. We haven't torn them out completely but because they're
deprecated it's been hard to guarantee correct behavior. Remove the
documentation so that users aren't misled about their viability.
Resolves#12095 by WONTFIXing it.
This approach disables `writeToFile` as it allows arbitrary host
filesystem writes and is only a small quality of life improvement over
multiple `template` stanzas.
This approach has the significant downside of leaving people who have
altered their `template.function_denylist` *still vulnerable!* I added
an upgrade note, but we should have implemented the denylist as a
`map[string]bool` so that new funcs could be denied without overriding
custom configurations.
This PR also includes a bug fix that broke enabling all consul-template
funcs. We repeatedly failed to differentiate between a nil (unset)
denylist and an empty (allow all) one.
Pass-through the `-secret` and `-parameter` flags to allow setting
parameters for the snapshot and overriding the secrets we've stored on
the CSI volume in the state store.
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.
Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.
Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.
When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.
Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.
The new cpuset management strategy fixes#11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.htmlCloses#11289Fixes#11705#11773#11933
When a node is drained, system jobs are left until last so that
operators can rely on things like log shippers running even as their
applications are getting drained off. Include CSI plugins in this set
so that Controller plugins deployed as services can be handled as
gracefully as Node plugins that are running as system jobs.