This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:
* A `csi_plugin` stanza as part of a Nomad task configuration, to
allow a task to expose that it is a plugin.
* A new task runner hook: `csi_plugin_supervisor`. This hook does two
things. When the `csi_plugin` stanza is detected, it will
automatically configure the plugin task to receive bidirectional
mounts to the CSI intermediary directory. At runtime, it will then
perform an initial heartbeat of the plugin and handle submitting it to
the new `dynamicplugins.Registry` for further use by the client, and
then run a lightweight heartbeat loop that will emit task events
when health changes.
* The `dynamicplugins.Registry` for handling plugins that run
as Nomad tasks, in contrast to the existing catalog that requires
`go-plugin` type plugins and to know the plugin configuration in
advance.
* The `csimanager` which fingerprints CSI plugins, in a similar way to
`drivermanager` and `devicemanager`. It currently only fingerprints
the NodeID from the plugin, and assumes that all plugins are
monolithic.
Missing features
* We do not use the live updates of the `dynamicplugin` registry in
the `csimanager` yet.
* We do not deregister the plugins from the client when they shutdown
yet, they just become indefinitely marked as unhealthy. This is
deliberate until we figure out how we should manage deploying new
versions of plugins/transitioning them.
allow oss to parse sink duration
clean up audit sink parsing
ent eventer config reload
fix typo
SetEnabled to eventer interface
client acl test
rm dead code
fix failing test
The test starts enough connections to hit the limit, then closes the
connection and immediately starts one expecting the new one to succeed.
We must wait until the server side recognizes the connection
closing and free up a limits slot. The current test attempts to achieve
that by waiting to get an error on conn.Read, however, this error is
returned from local client without waiting for server update.
As such, I change the logic so it retries on connection rejection but
force the first non-EOF failure to be a deadline error.
This fixes a bug where a forwarded node update request may be assumed
to be the actual direct client connection if the server just lost
leadership.
When a nomad non-leader server receives a Node.UpdateStatus request, it
forwards the RPC request to the leader, and holds on the request
Yamux connection in a cache to allow for server<->client forwarding.
When the leader handles the request, it must differentiate between a
forwarded connection vs the actual connection. This is done in
https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L412
Now, consider if the non-leader server forwards to the connection to a
recently deposed nomad leader, which in turn forwards the RPC request to
the new leader.
Without this change, the deposed leader will mistake the forwarded
connection for the actual client connection and cache it mapped to the
client ID. If the server attempts to connect to that client, it will
attempt to start a connection/session to the other server instead and
the call will hang forever.
This change ensures that we only add node connection mapping if the
request is not a forwarded request, regardless of circumstances.
This deflake the tests in the deploymentwatcher package. The package
uses a mock deployment watcher backend, where the watcher in a
background goroutine calls UpdateDeploymentStatus . If the mock isn't
configured to expect the call, the background goroutine will fail. One
UpdateDeploymentStatus call is made at the end of the background
goroutine, which may occur after the test completes, thus explaining the
flakiness.
This change updates tests to honor `BootstrapExpect` exclusively when
forming test clusters and removes test only knobs, e.g.
`config.DevDisableBootstrap`.
Background:
Test cluster creation is fragile. Test servers don't follow the
BootstapExpected route like production clusters. Instead they start as
single node clusters and then get rejoin and may risk causing brain
split or other test flakiness.
The test framework expose few knobs to control those (e.g.
`config.DevDisableBootstrap` and `config.Bootstrap`) that control
whether a server should bootstrap the cluster. These flags are
confusing and it's unclear when to use: their usage in multi-node
cluster isn't properly documented. Furthermore, they have some bad
side-effects as they don't control Raft library: If
`config.DevDisableBootstrap` is true, the test server may not
immediately attempt to bootstrap a cluster, but after an election
timeout (~50ms), Raft may force a leadership election and win it (with
only one vote) and cause a split brain.
The knobs are also confusing as Bootstrap is an overloaded term. In
BootstrapExpect, we refer to bootstrapping the cluster only after N
servers are connected. But in tests and the knobs above, it refers to
whether the server is a single node cluster and shouldn't wait for any
other server.
Changes:
This commit makes two changes:
First, it relies on `BootstrapExpected` instead of `Bootstrap` and/or
`DevMode` flags. This change is relatively trivial.
Introduce a `Bootstrapped` flag to track if the cluster is bootstrapped.
This allows us to keep `BootstrapExpected` immutable. Previously, the
flag was a config value but it gets set to 0 after cluster bootstrap
completes.
Fix a bug where consul service definitions would not be updated if changes
were made to the service in the Nomad job. Currently this only fixes the
bug for cases where the fix is a matter of updating consul agent's service
registration. There is related bug where destructive changes are required
(see #6877) which will be fixed in another PR.
The enable_tag_override configuration setting for the parent service is
applied to the sidecar service.
Fixes#6459
Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.
Consul provides a feature of Service Definitions where the tags
associated with a service can be modified through the Catalog API,
overriding the value(s) configured in the agent's service configuration.
To enable this feature, the flag enable_tag_override must be configured
in the service definition.
Previously, Nomad did not allow configuring this flag, and thus the default
value of false was used. Now, it is configurable.
Because Nomad itself acts as a state machine around the the service definitions
of the tasks it manages, it's worth describing what happens when this feature
is enabled and why.
Consider the basic case where there is no Nomad, and your service is provided
to consul as a boring JSON file. The ultimate source of truth for the definition
of that service is the file, and is stored in the agent. Later, Consul performs
"anti-entropy" which synchronizes the Catalog (stored only the leaders). Then
with enable_tag_override=true, the tags field is available for "external"
modification through the Catalog API (rather than directly configuring the
service definition file, or using the Agent API). The important observation
is that if the service definition ever changes (i.e. the file is changed &
config reloaded OR the Agent API is used to modify the service), those
"external" tag values are thrown away, and the new service definition is
once again the source of truth.
In the Nomad case, Nomad itself is the source of truth over the Agent in
the same way the JSON file was the source of truth in the example above.
That means any time Nomad sets a new service definition, any externally
configured tags are going to be replaced. When does this happen? Only on
major lifecycle events, for example when a task is modified because of an
updated job spec from the 'nomad job run <existing>' command. Otherwise,
Nomad's periodic re-sync's with Consul will now no longer try to restore
the externally modified tag values (as long as enable_tag_override=true).
Fixes#2057
The default RPC addr was a global which is fine for normal runtime use
when it only has a single user.
However many tests modify it and cause races. Follow our convention of
returning defaults from funcs instead of using globals.
MultiplexV2 is a new connection multiplex header that supports multiplex both
RPC and streaming requests over the same Yamux connection.
MultiplexV2 was added in 0.8.0 as part of
https://github.com/hashicorp/nomad/pull/3892 . So Nomad 0.11 can expect it to
be supported. Though, some more rigorous testing is required before merging
this.
I want to call out some implementation details:
First, the current connection pool reuses the Yamux stream for multiple RPC calls,
and doesn't close them until an error is encountered. This commit doesn't
change it, and sets the `RpcNomad` byte only at stream creation.
Second, the StreamingRPC session gets closed by callers and cannot be reused.
Every StreamingRPC opens a new Yamux session.
This is from a merge conflict resolution that went the wrong direction.
I assumed the block had been added, but really it had been removed. Now,
it is removed once again.
Apply smaller suggestions like doc strings, variable names, etc.
Co-Authored-By: Nick Ethier <nethier@hashicorp.com>
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request from Consul (through Nomad Server)
Service Identity tokens specific to those tasks.
Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.
The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.
This change provides an initial pass at setting up the configuration necessary to
enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul
Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert`
commands (similar to Vault tokens).
These values are not actually used yet in this changeset.
Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:
* `{https,rpc}_handshake_timeout`
* `{http,rpc}_max_conns_per_client`
The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to first byte being read so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.
The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.
All limits are configurable and may be disabled by setting them to `0`.
This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
Allows addressing servers with nomad monitor using the servers name or
ID.
Also unifies logic for addressing servers for client_agent_endpoint
commands and makes addressing logic region aware.
rpc getServer test
Fixes a deadlock in leadership handling if leadership flapped.
Raft propagates leadership transition to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[1] and does not move on to executing follower
related logic (in `raft.runFollower`).
While Raft `runLeader` defer function blocks, raft cannot process any other
raft operations. For example, `run{Leader|Follower}` methods consume
`raft.applyCh`, and while runLeader defer is blocked, all raft log applications
or config lookup will block indefinitely.
Sadly, `leaderLoop` and `establishLeader` makes few Raft calls!
`establishLeader` attempts to auto-create autopilot/scheduler config [3]; and
`leaderLoop` attempts to check raft configuration [4]. All of these calls occur
without a timeout.
Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is
invoked and hit any of these Raft calls, Raft handler _deadlock_ forever.
Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to get in the following case:
* Agent metrics/stats http and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and server attempts to handle RPC calls
(e.g. node/alloc updates) and these hang as well
As we create goroutines per RPC call, the number of goroutines grow over time
and may trigger a out of memory errors in addition to missed updates.
[1] d90d6d6bda/config.go (L190-L193)
[2] d90d6d6bda/raft.go (L425-L436)
[3] 2a89e47746/nomad/leader.go (L198-L202)
[4] 2a89e47746/nomad/leader.go (L877)