Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.
Consul provides a feature of Service Definitions where the tags
associated with a service can be modified through the Catalog API,
overriding the value(s) configured in the agent's service configuration.
To enable this feature, the flag enable_tag_override must be configured
in the service definition.
Previously, Nomad did not allow configuring this flag, and thus the default
value of false was used. Now, it is configurable.
Because Nomad itself acts as a state machine around the the service definitions
of the tasks it manages, it's worth describing what happens when this feature
is enabled and why.
Consider the basic case where there is no Nomad, and your service is provided
to consul as a boring JSON file. The ultimate source of truth for the definition
of that service is the file, and is stored in the agent. Later, Consul performs
"anti-entropy" which synchronizes the Catalog (stored only the leaders). Then
with enable_tag_override=true, the tags field is available for "external"
modification through the Catalog API (rather than directly configuring the
service definition file, or using the Agent API). The important observation
is that if the service definition ever changes (i.e. the file is changed &
config reloaded OR the Agent API is used to modify the service), those
"external" tag values are thrown away, and the new service definition is
once again the source of truth.
In the Nomad case, Nomad itself is the source of truth over the Agent in
the same way the JSON file was the source of truth in the example above.
That means any time Nomad sets a new service definition, any externally
configured tags are going to be replaced. When does this happen? Only on
major lifecycle events, for example when a task is modified because of an
updated job spec from the 'nomad job run <existing>' command. Otherwise,
Nomad's periodic re-sync's with Consul will now no longer try to restore
the externally modified tag values (as long as enable_tag_override=true).
Fixes#2057
Test set Agent.client=nil which prevented the client from being
shutdown. This leaked goroutines and could cause panics due to the
leaked client goroutines logging after their parent test had finished.
Removed ACLs from the server test because I couldn't get it to work with
the test agent, and it tested very little.
Fixes a panic when accessing a.agent.Server() when agent is a client
instead. This pr removes a redundant ACL check since ACLs are validated
at the RPC layer. It also nil checks the agent server and uses Client()
when appropriate.
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
This change provides an initial pass at setting up the configuration necessary to
enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul
Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert`
commands (similar to Vault tokens).
These values are not actually used yet in this changeset.
Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:
* `{https,rpc}_handshake_timeout`
* `{http,rpc}_max_conns_per_client`
The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to first byte being read so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.
The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.
All limits are configurable and may be disabled by setting them to `0`.
This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
The system command includes gc and reconcile-summaries subcommands
which covers all currently available system API calls. The help
information is largely pulled from the current Nomad website API
documentation.
Passes in agent enable_debug config to nomad server and client configs.
This allows for rpc endpoints to have more granular control if they
should be enabled or not in combination with ACLs.
enable debug on client test
copy struct values
ensure groupserviceHook implements RunnerPreKillhook
run deregister first
test that shutdown times are delayed
move magic number into variable
Fixes a bug where if a command flag parsing errors, the resulting error
and help usage messages get interleaved in unexpected and non-user
friendly way.
The reason is that we have flag parsing library effectively writes to
ui.Error in a goroutine. This is problematic: first, we lose the sequencing between help
usage and error message; second, cli.Ui methods are not concurrent safe.
Here, we introduce a custom error writer that buffers result and calls
ui.Error() in the write method and in the same goroutine.
For context, we need to wrap ui.Error because it's line-oriented, while
flags library expects a io.Writer which is bytes oriented.
Currently `nomad monitor -node-id` will panic when a node-id does not
match any nodes, as there is no empty result bounds checking. Here we
return an error to the user when no nodes are found.
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.
Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.
This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
The test asserts that alloc counts get reported accurately in metrics by
inspecting the metrics endpoint directly. Sadly, the metrics as
collected by `armon/go-metrics` seem to be stateful and may contain info
from other tests.
This means that the test can fail depending on the order of returned
metrics.
Inspecting the metrics output of one failing run, you can see the
duplicate guage entries but for different node_ids:
```
{
"Name": "service-name.default-0a3ba4b6-2109-485e-be74-6864228aed3d.client.allocations.terminal",
"Value": 10,
"Labels": {
"datacenter": "dc1",
"node_class": "none",
"node_id": "67402bf4-00f3-bd8d-9fa8-f4d1924a892a"
}
},
{
"Name": "service-name.default-0a3ba4b6-2109-485e-be74-6864228aed3d.client.allocations.terminal",
"Value": 0,
"Labels": {
"datacenter": "dc1",
"node_class": "none",
"node_id": "a2945b48-7e66-68e2-c922-49b20dd4e20c"
}
},
```
Noticed that ACL endpoints return 500 status code for user errors. This
is confusing and can lead to false monitoring alerts.
Here, I introduce a concept of RPCCoded errors to be returned by RPC
that signal a code in addition to error message. Codes for now match
HTTP codes to ease reasoning.
```
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 500 (ACL bootstrap already done (reset index: 9))
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 400 (ACL bootstrap already done (reset index: 9))
```
* client: improve group service stanza interpolation and check_restart support
Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.
The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
handle the case where we request a server-id which is this current server
update docs, error on node and server id params
more accurate names for tests
use shared no leader err, formatting
rm bad comment
remove redundant variable
When inferring whether to use TTY, check both stdin and stdout are
terminals.
Otherwise, we get failures like the following:
```
$ nomad alloc exec --job example echo hi
hi
$ echo | nomad alloc exec --job example echo hi
hi
$ nomad alloc exec --job example echo hi | head -n1
failed to exec into task: not a terminal
```
This is the basic code to add the Windows Service Manager hooks to Nomad.
Includes vendoring golang.org/x/sys/windows/svc and added Docs:
* guide for installing as a windows service.
* configuration for logging to file from PR #6429