Fix a bug where a millicious user can access or manipulate an alloc in a
namespace they don't have access to. The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and performs the allocation lookup independently from namespaces.
Here, we check that the requested can access the alloc namespace
regardless of the declared request namespace.
Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace. Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this will be very disruptive and inappropriate for a nomad point
release. As such, we maintain current behavior that doesn't require
passing the proper namespace in request. A future major release may
start enforcing checking declared namespace.
fixes https://github.com/hashicorp/nomad/issues/6382
The prestart hook for templates blocks while it resolves vault secrets.
If the secret is not found it continues to retry. If a task is shutdown
during this time, the prestart hook currently does not receive
shutdownCtxCancel, causing it to hang.
This PR joins the two contexts so either killCtx or shutdownCtx cancel
and stop the task.
In a job registration request, ensure that the request namespace "header" and job
namespace field match. This should be the case already in prod, as http
handlers ensures that the values match [1].
This mitigates bugs that exploit bugs where we may check a value but act
on another, resulting into bypassing ACL system.
[1] https://github.com/hashicorp/nomad/blob/v0.9.5/command/agent/job_endpoint.go#L415-L418
Some drivers will automatically create directories when trying to mount
a path into a container, but some will not.
To unify this behaviour, this commit requires that host volumes already exist,
and can be stat'd by the Nomad Agent during client startup.
Currently, using a Volume in a job uses the following configuration:
```
volume "alias-name" {
type = "volume-type"
read_only = true
config {
source = "host_volume_name"
}
}
```
This commit migrates to the following:
```
volume "alias-name" {
type = "volume-type"
source = "host_volume_name"
read_only = true
}
```
The original design was based due to being uncertain about the future of storage
plugins, and to allow maxium flexibility.
However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from and end users perspective
- It complicates the ability to do validation
As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:
- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
to the plugin name, and a `config` block will
allow you to pass meta to the plugin. Or will
point to a pre-configured ephemeral config.
*If implemented
The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
On macOS, `os.TempDir` returns a symlinked path under `/var` which is
outside of the directories shared into the VM used for Docker, and
that fails tests using Docker that need that mount. If we expand the
symlink to get the real path in `/private`, we're in the shared
folders and can safely mount them.
This is an attempt to ease dependency management for external driver
plugins, by avoiding requiring them to compile ugorji/go generated
files. Plugin developers reported some pain with the brittleness of
ugorji/go dependency in particular, specially when using go mod, the
default go mod manager in golang 1.13.
Context
--------
Nomad uses msgpack to persist and serialize internal structs, using
ugorji/go library. As an optimization, we use ugorji/go code generation
to speedup process and aovid the relection-based slow path.
We commit these generated files in repository when we cut and tag the
release to ease reproducability and debugging old releases. Thus,
downstream projects that depend on release tag, indirectly depends on
ugorji/go generated code.
Sadly, the generated code is brittle and specific to the version of
ugorji/go being used. When go mod picks another version of ugorji/go
then nomad (go mod by default uses release according to semver),
downstream projects face compilation errors.
Interestingly, downstream projects don't commonly serialize nomad
internal structs. Drivers and device plugins use grpc instead of
msgpack for the most part. In the few cases where they use msgpag (e.g.
decoding task config), they do without codegen path as they run on
driver specific structs not the nomad internal structs. Also, the
ugorji/go serialization through reflection is generally backward
compatible (mod some ugorji/go regression bugs that get introduced every
now and then :( ).
Proposal
---------
The proposal here is to keep committing ugorji/go codec generated files
for releases but to use a go tag for them.
All nomad development through the makefile, including releasing, CI and
dev flow, has the tag enabled.
Downstream plugin projects, by default, will skip these files and life
proceed as normal for them.
The downside is that nomad developers who use generated code but avoid
using make must start passing additional go tag argument. Though this
is not a blessed configuration.
Splitting the immutable and mutable components of the scriptCheck led
to a bug where the environment interpolation wasn't being incorporated
into the check's ID, which caused the UpdateTTL to update for a check
ID that Consul didn't have (because our Consul client creates the ID
from the structs.ServiceCheck each time we update).
Task group services don't have access to a task environment at
creation, so their checks get registered before the check can be
interpolated. Use the original check ID so they can be updated.
* ar: refactor network bridge config to use go-cni lib
* ar: use eth as the iface prefix for bridged network namespaces
* vendor: update containerd/go-cni package
* ar: update network hook to use TODO contexts when calling configurator
* unnecessary conversion
In Nomad prior to Consul Connect, all Consul checks work the same
except for Script checks. Because the Task being checked is running in
its own container namespaces, the check is executed by Nomad in the
Task's context. If the Script check passes, Nomad uses the TTL check
feature of Consul to update the check status. This means in order to
run a Script check, we need to know what Task to execute it in.
To support Consul Connect, we need Group Services, and these need to
be registered in Consul along with their checks. We could push the
Service down into the Task, but this doesn't work if someone wants to
associate a service with a task's ports, but do script checks in
another task in the allocation.
Because Nomad is handling the Script check and not Consul anyways,
this moves the script check handling into the task runner so that the
task runner can own the script check's configuration and
lifecycle. This will allow us to pass the group service check
configuration down into a task without associating the service itself
with the task.
When tasks are checked for script checks, we walk back through their
task group to see if there are script checks associated with the
task. If so, we'll spin off script check tasklets for them. The
group-level service and any restart behaviors it needs are entirely
encapsulated within the group service hook.
* connect: add unix socket to proxy grpc for envoy
Fixes#6124
Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.
* connect: pointer receiver on structs with mutexes
* connect: warn on all proxy errors
The ClientState being pending isn't a good criteria; as an alloc may
have been updated in-place before it was completed.
Also, updated the logic so we only check for task states. If an alloc
has deployment state but no persisted tasks at all, restore will still
fail.
* taskenv: add connect upstream env vars + test
* set taskenv upstreams instead of appending
* Update client/taskenv/env.go
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
This uses an alternative approach where we avoid restoring the alloc
runner in the first place, if we suspect that the alloc may have been
completed already.
This commit aims to help users running with clients suseptible to the
destroyed alloc being restrarted bug upgrade to latest. Without this,
such users will have their tasks run unexpectedly on upgrade and only
see the bug resolved after subsequent restart.
If, on restore, the client sees a pending alloc without any other
persisted info, then err on the side that it's an corrupt persisted
state of an alloc instead of the client happening to be killed right
when alloc is assigned to client.
Few reasons motivate this behavior:
Statistically speaking, corruption being the cause is more likely. A
long running client will have higher chance of having allocs persisted
incorrectly with pending state. Being killed right when an alloc is
about to start is relatively unlikely.
Also, delaying starting an alloc that hasn't started (by hopefully
seconds) is not as severe as launching too many allocs that may bring
client down.
More importantly, this helps customers upgrade their clients without
risking taking their clients down and destablizing their cluster. We
don't want existing users to force triggering the bug while they upgrade
and restart cluster.
Protect against a race where destroying and persist state goroutines
race.
The downside is that the database io operation will run while holding
the lock and may run indefinitely. The risk of lock being long held is
slow destruction, but slow io has bigger problems.