Pulling large docker containers can take longer than the default
context timeout. Without a way to change this it is very hard for
users to utilise Nomad properly without hacky work arounds.
This change adds an optional pull_timeout config parameter which
gives operators the possibility to account for increase pull times
where needed. The infra docker image also has the option to set a
custom timeout to keep consistency.
The CSI client RPC uses error wrapping to detect the type of error bubbling up
from plugins, but if the errors we get aren't wrapped at each layer, we can't
unwrap the inner error.
Also eliminates some unused args.
adds in oss components to support enterprise multi-vault namespace feature
upgrade specific doc on vault multi-namespaces
vault docs
update test to reflect new error
Go 1.14 now streams t.Log output as it happens [1], so we no longer need
to maintain our log STDOUT helper.
I preserved the interface, so `testlog` still takes in a `*testing.T`
though unused. Changing it requires so too many changes that I didn't
want to make quite yet.
[1] https://golang.org/doc/go1.14#go-test
This fixes few cases where driver eventor goroutines are leaked during
normal operations, but especially so in tests.
This change makes few modifications:
First, it switches drivers to use `Context`s to manage shutdown events.
Previously, it relied on callers invoking `.Shutdown()` function that is
specific to internal drivers only and require casting. Using `Contexts`
provide a consistent idiomatic way to manage lifecycle for both internal
and external drivers.
Also, I discovered few places where we don't clean up a temporary driver
instance in the plugin catalog code, where we dispense a driver to
inspect and validate the schema config without properly cleaning it up.
The plugin loader supplies a version map to ensure the Nomad agent
can support the plugins at the version they specify. The map was
incorrectly mapping the driver type to the device API supported
version identifier. This currently does not cause a bug as both
device and driver versions are the same string. This could cause
problems in the future, however, if either plugin interface were
to change and require version updates.
* command/csi: csi, csi_plugin, csi_volume
* helper/funcs: move ExtraKeys from parse_config to UnusedKeys
* command/agent/config_parse: use helper.UnusedKeys
* api/csi: annotate CSIVolumes with hcl fields
* command/csi_plugin: add Synopsis
* command/csi_volume_register: use hcl.Decode style parsing
* command/csi_volume_list
* command/csi_volume_status: list format, cleanup
* command/csi_plugin_list
* command/csi_plugin_status
* command/csi_volume_deregister
* command/csi_volume: add Synopsis
* api/contexts/contexts: add csi search contexts to the constants
* command/commands: register csi commands
* api/csi: fix struct tag for linter
* command/csi_plugin_list: unused struct vars
* command/csi_plugin_status: unused struct vars
* command/csi_volume_list: unused struct vars
* api/csi: add allocs to CSIPlugin
* command/csi_plugin_status: format the allocs
* api/allocations: copy Allocation.Stub in from structs
* nomad/client_rpc: add some error context with Errorf
* api/csi: collapse read & write alloc maps to a stub list
* command/csi_volume_status: cleanup allocation display
* command/csi_volume_list: use Schedulable instead of Healthy
* command/csi_volume_status: use Schedulable instead of Healthy
* command/csi_volume_list: sprintf string
* command/csi: delete csi.go, csi_plugin.go
* command/plugin: refactor csi components to sub-command plugin status
* command/plugin: remove csi
* command/plugin_status: remove csi
* command/volume: remove csi
* command/volume_status: split out csi specific
* helper/funcs: add RemoveEqualFold
* command/agent/config_parse: use helper.RemoveEqualFold
* api/csi: do ,unusedKeys right
* command/volume: refactor csi components to `nomad volume`
* command/volume_register: split out csi specific
* command/commands: use the new top level commands
* command/volume_deregister: hardwired type csi for now
* command/volume_status: csiFormatVolumes rescued from volume_list
* command/plugin_status: avoid a panic on no args
* command/volume_status: avoid a panic on no args
* command/plugin_status: predictVolumeType
* command/volume_status: predictVolumeType
* nomad/csi_endpoint_test: move CreateTestPlugin to testing
* command/plugin_status_test: use CreateTestCSIPlugin
* nomad/structs/structs: add CSIPlugins and CSIVolumes search consts
* nomad/state/state_store: add CSIPlugins and CSIVolumesByIDPrefix
* nomad/search_endpoint: add CSIPlugins and CSIVolumes
* command/plugin_status: move the header to the csi specific
* command/volume_status: move the header to the csi specific
* nomad/state/state_store: CSIPluginByID prefix
* command/status: rename the search context to just Plugins/Volumes
* command/plugin,volume_status: test return ids now
* command/status: rename the search context to just Plugins/Volumes
* command/plugin_status: support -json and -t
* command/volume_status: support -json and -t
* command/plugin_status_csi: comments
* command/*_status: clean up text
* api/csi: fix stale comments
* command/volume: make deregister sound less fearsome
* command/plugin_status: set the id length
* command/plugin_status_csi: more compact plugin health
* command/volume: better error message, comment
This package introduces some basic abstractions around mount utilties
for various platforms. Initially it only supports linux, but the plan is
to expand this as CSI expands across to other platforms.
MultiplexV2 is a new connection multiplex header that supports multiplex both
RPC and streaming requests over the same Yamux connection.
MultiplexV2 was added in 0.8.0 as part of
https://github.com/hashicorp/nomad/pull/3892 . So Nomad 0.11 can expect it to
be supported. Though, some more rigorous testing is required before merging
this.
I want to call out some implementation details:
First, the current connection pool reuses the Yamux stream for multiple RPC calls,
and doesn't close them until an error is encountered. This commit doesn't
change it, and sets the `RpcNomad` byte only at stream creation.
Second, the StreamingRPC session gets closed by callers and cannot be reused.
Every StreamingRPC opens a new Yamux session.
Pick up some fixes from Consul:
* If a stream returns an EOF error, clear session from cache/pool and start a
new one.
* Close the codec when closing StreamClient
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.
Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.
This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
The existing version constraint uses logic optimized for package
managers, not schedulers, when checking prereleases:
- 1.3.0-beta1 will *not* satisfy ">= 0.6.1"
- 1.7.0-rc1 will *not* satisfy ">= 1.6.0-beta1"
This is due to package managers wishing to favor final releases over
prereleases.
In a scheduler versions more often represent the earliest release all
required features/APIs are available in a system. Whether the constraint
or the version being evaluated are prereleases has no impact on
ordering.
This commit adds a new constraint - `semver` - which will use Semver
v2.0 ordering when evaluating constraints. Given the above examples:
- 1.3.0-beta1 satisfies ">= 0.6.1" using `semver`
- 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver`
Since existing jobspecs may rely on the old behavior, a new constraint
was added and the implicit Consul Connect and Vault constraints were
updated to use it.
This is the basic code to add the Windows Service Manager hooks to Nomad.
Includes vendoring golang.org/x/sys/windows/svc and added Docs:
* guide for installing as a windows service.
* configuration for logging to file from PR #6429
Adds new package that can be used by client and server RPC endpoints to
facilitate monitoring based off of a logger
clean up old code
small comment about write
rm old comment about minsize
rename to Monitor
Removes connection logic from monitor command
Keep connection logic in endpoints, use a channel to send results from
monitoring
use new multisink logger and interfaces
small test for dropped messages
update go-hclogger and update sink/intercept logger interfaces
Consul Connect must route traffic between network namespaces through a
public interface (i.e. not localhost). In order to support testing in
dev mode, users needed to manually set the interface which doesn't
make for a smooth experience.
This commit adds a facility for adding optional parameters to the
`nomad agent -dev` flag and uses it to add a `-dev=connect` flag that
binds to a public interface on the host.
Handle stalled readers (e.g. network write got stalled), by having
escaping io have a buffer so it looks for escaped characters in the
stream.
This simplifies the implementation considerably, as we can look
for new lines followed by escaped characters directly.
Also, we add a test to ensure that any partial results are flushed to
readers.
nvidia library use of dynamic library seems to conflict with alpine and
musl based OSes. This adds a `nonvidia` tag to allow compiling nomad
for alpine images.
The nomad releases currently only support glibc based OS environments,
so we default to compiling with nvidia.
The e2e test code is absolutely hideous and leaks processes and files
on disk. NomadAgent seems useful, but the clientstate e2e tests are very
messy and slow. The last test "Corrupt" is probably the most useful as
it explicitly corrupts the state file whereas the other tests attempt to
reproduce steps thought to cause corruption in earlier releases of
Nomad.
This ensures that `port_map` along with other block like attribute
declarations (e.g. ulimit, labels, etc) can handle various hcl and json
syntax that was supported in 0.8.
In 0.8.7, the following declarations are effectively equivalent:
```
// hcl block
port_map {
http = 80
https = 443
}
// hcl assignment
port_map = {
http = 80
https = 443
}
// json single element array of map (default in API response)
{"port_map": [{"http": 80, "https": 443}]}
// json array of individual maps (supported accidentally iiuc)
{"port_map: [{"http": 80}, {"https": 443}]}
```
We achieve compatbility by using `NewAttr("...", "list(map(string))",
false)` to be serialized to a `map[string]string` wrapper, instead of using
`BlockAttrs` declaration. The wrapper merges the list of maps
automatically, to ease driver development.
This approach is closer to how v0.8.7 implemented the fields [1][2], and
despite its verbosity, seems to perserve 0.8.7 behavior in hcl2.
This is only required for built-in types that have backward
compatibility constraints. External drivers should use `BlockAttrs`
instead, as they see fit.
[1] https://github.com/hashicorp/nomad/blob/v0.8.7/client/driver/docker.go#L216
[2] https://github.com/hashicorp/nomad/blob/v0.8.7/client/driver/docker.go#L698-L700
The test used old local copies of Docker structs and appeared to be
testing an outdated approach to task config decoding.
Updated to use real Docker structs so we can do end-to-end unit testing
of real Docker task configs.
* Migrated all of the old leader task tests and got them passing
* Refactor and consolidate task killing code in AR to always kill leader
tasks first
* Fixed lots of issues with state restoring
* Fixed deadlock in AR.Destroy if AR.Run had never been called
* Added a new in memory statedb for testing
Saves a tiny bit of cpu and some IO. Sadly doesn't prevent all IO on
duplicate writes as the transactions are still created and committed.
$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/hashicorp/nomad/helper/boltdd
BenchmarkWriteDeduplication_On-4 500 4059591 ns/op 23736 B/op 56 allocs/op
BenchmarkWriteDeduplication_Off-4 300 4115319 ns/op 25942 B/op 55 allocs/op
Previously we did a validation pass over CA PEM files before calling
Go's CertPool.AppendCertsFromPEM to provide more detailed error messages
than the stdlib provides.
Unfortunately our validation was overly strict and rejected valid CA
files. This is actually the reason the stdlib PEM parser doesn't return
meaningful errors: PEM files are extremely permissive and it's difficult
to tell the difference between invalid data and valid metadata.
This PR removes our custom validation as it would reject valid data and
the extra error messages were not useful in diagnosing the error
encountered.
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names ephemeral dir migrations were broken when TLS was enabled.
Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
The default timeout is too short for some overburdened or resource
constrained machines to complete the WMI query before the context
deadline expires. This causes them to be unable to fingerprint the CPU
properly.
The new version of gopsutil introduces a 3 second timeout that could come up as an error here; however, we are outputting the wrong variable and eating the error.
This PR enhances the API package by having client only RPCs route
through the server when they are low cost and for filesystem access to
first attempt a direct connection to the node and then falling back to
a server routed request.
From [https://github.com/appc/spec/blob/master/spec/aci.md](https://github.com/appc/spec/blob/master/spec/aci.md):
>environment (list of objects, optional) represents the app's environment variables (ACE can append). The listed objects must have two key-value pairs: name and value. The name must consist solely of letters, digits, and underscores '_' as outlined in IEEE Std 1003.1-2008, 2016 Edition, with practical considerations dictating that the name may also include periods '.' and hyphens '-'. The value is an arbitrary string. These values are not evaluated in any way, and no substitutions are made.
Dotted environment variables are frequently used as a part of the Spring Boot pattern. (re: ZD-6116)
This PR specifically doesn't address the conversion of hyphens (`-`) due to an issue with rkt [[Nomad GH # 2358]](https://github.com/hashicorp/nomad/issues/2358).
* Allow server TLS configuration to be reloaded via SIGHUP
* dynamic tls reloading for nomad agents
* code cleanup and refactoring
* ensure keyloader is initialized, add comments
* allow downgrading from TLS
* initalize keyloader if necessary
* integration test for tls reload
* fix up test to assert success on reloaded TLS configuration
* failure in loading a new TLS config should remain at current
Reload only the config if agent is already using TLS
* reload agent configuration before specific server/client
lock keyloader before loading/caching a new certificate
* introduce a get-or-set method for keyloader
* fixups from code review
* fix up linting errors
* fixups from code review
* add lock for config updates; improve copy of tls config
* GetCertificate only reloads certificates dynamically for the server
* config updates/copies should be on agent
* improve http integration test
* simplify agent reloading storing a local copy of config
* reuse the same keyloader when reloading
* Test that server and client get reloaded but keep keyloader
* Keyloader exposes GetClientCertificate as well for outgoing connections
* Fix spelling
* correct changelog style
Fixes an issue in which the allocation health watcher was checking for
allocations health based on un-interpolated services and checks. Change
the interface for retrieving check information from Consul to retrieving
all registered services and checks by allocation. In the future this
will allow us to output nicer messages.
Fixes https://github.com/hashicorp/nomad/issues/2969
This PR enhances the distinct_property constraint such that a limit can
be specified in the RTarget/value parameter. This allows constraints
such as:
```
constraint {
distinct_property = "${meta.rack}"
value = "2"
}
```
This restricts any given rack from running more than 2 allocations from
the task group.
Fixes https://github.com/hashicorp/nomad/issues/1146
This PR takes the host ID and runs it through a hash so that it is well
distributed. This makes it so that machines that report similar host IDs
are easily distinguished.
Instances of similar IDs occur on EC2 where the ID is prefixed and on
motherboards created in the same batch.
Fixes https://github.com/hashicorp/nomad/issues/2534
This PR:
* Uses Go 1.8 executable lookup
* Stores any err message from stats init method
* Allows overriding of Cpu Compute for hosts where it can't be detected
This PR fixes our vet script and fixes all the missed vet changes.
It also fixes pointers being printed in `nomad stop <job>` and `nomad
node-status <node>`.