Add the ingress gateway example from the noamd connect examples
to the e2e Connect suite. Includes the ACLs enabled version,
which means the nomad server consul acl policy will require
operator=write permission.
The API is missing values for `ReadAllocs` and `WriteAllocs` fields, resulting
in allocation claims not being populated in the web UI. These fields mirror
the fields in `nomad/structs.CSIVolume`. Returning a separate list of stubs
for read and write would be ideal, but this can't be done without either
bloating the API response with repeated full `Allocation` data, or causing a
panic in previous versions of the CLI.
The `nomad/structs` fields are persisted with nil values and are populated
during RPC, so we'll do the same in the HTTP API and populate the `ReadAllocs`
and `WriteAllocs` fields with a map of allocation IDs, but with null
values. The web UI will then create its `ReadAllocations` and
`WriteAllocations` fields by mapping from those IDs to the values in
`Allocations`, instead of flattening the map into a list.
The clean up in #8908 inadvertently caused the output from the scripts
involved in the Consul ACL bootstrap process to be printed as a big blob
of bytes, which is slightly less useful than the text version.
Use targetted ignore comments for the cases where we are bound by
backward compatibility.
I've left some file based linters, especially when the file is riddled
with linter voilations (e.g. enum names), or if it's a property of the
file (e.g. package and file names).
I encountered an odd behavior related to RPC_REQUEST_RESPONSE_UNIQUE and
RPC_REQUEST_STANDARD_NAME. Apparently, if they target a `stream` type,
we must separate them into separate lines so that the ignore comment
targets the type specifically.
Plugin health for controllers should show "Node Only" in the UI only when both
conditions are true: controllers are not required, and no controllers have
registered themselves (0 expected controllers). This accounts for "monolith"
plugins which might register as both controllers and nodes but not necessarily
have `ControllerRequired = true` because they don't implement the Controller
RPC endpoints we need (this requirement was added in #7844)
This changeset includes the following fixes:
* Update the Plugins tab of the UI so that monolith plugins don't show "Node
Only" once they've registered.
* Add the missing "Node Only" logic to the Volumes tab of the UI.
Always wait 200ms before calling the Node.UpdateAlloc RPC to send
allocation updates to servers.
Prior to this change we only reset the update ticker when an error was
encountered. This meant the 200ms ticker was running while the RPC was
being performed. If the RPC was slow due to network latency or server
load and took >=200ms, the ticker would tick during the RPC.
Then on the next loop only the select would randomly choose between the
two viable cases: receive an update or fire the RPC again.
If the RPC case won it would immediately loop again due to there being
no updates to send.
When the update chan receive is selected a single update is added to the
slice. The odds are then 50/50 that the subsequent loop will send the
single update instead of receiving any more updates.
This could cause a couple of problems:
1. Since only a small number of updates are sent, the chan buffer may
fill, applying backpressure, and slowing down other client
operations.
2. The small number of updates sent may already be stale and not
represent the current state of the allocation locally.
A risk here is that it's hard to reason about how this will interact
with the 50ms batches on servers when the servers under load.
A further improvement would be to completely remove the alloc update
chan and instead use a mutex to build a map of alloc updates. I wanted
to test the lowest risk possible change on loaded servers first before
making more drastic changes.
When making updates to CSI plugins, the state store methods that have open
write transactions were querying the state store using the same methods used
by the CSI RPC endpoint, but these method creates their own top-level read
transactions. During concurrent plugin updates (as happens when a plugin job
is stopped), this can cause write skew in the plugin counts.
* Refactor the CSIPlugin query methods to have an implementation method that
accepts a transaction, which can be called with either a read txn or a write
txn.
* Refactor the CSIVolume query methods to have an implementation method that
accepts a transaction, which can be called with either a read txn or a write
txn.
* CSI volumes need to be "denormalized" with their plugins and (optionally)
allocations. Read-only RPC endpoints should take a snapshot so that we can
make multiple state store method calls with a consistent view.
Fix#9210 .
This update the executor so it honors the User when using nomad alloc exec. The bug was that the exec task didn't honor the init command when execing.
* e2e/csi: wait longer for plugins to become healthy
Plugins are Docker containers, and as such sometimes we get delays in startup
due to pulling from the registry and this is a source of test flakiness. Give
the plugins a little longer to start up.
* e2e/csi: version bump for AWS EBS plugins
The cloud-init configuration runs on boot, which can result in a race
condition between that and service startup. This has caused provisioning
failures because Nomad expects the userdata to have configured a host volume
directory. Diagnosing this was also compounded by a warning being fired by
systemd for the Nomad unit file.
* Update the location of the `StartLimitIntervalSec` field to it's
post-systemd-230 location.
* Ensure that the weekly AMI build is up-to-date to reduce the risk of
unexpected system software changes.
* Move the host volume to a directory we can set up at AMI build time rather
than in userdata.
Only `change_mode = "restart"` will result in the template environment
variables being updated in the task. Clarify the behavior of the unsupported
options.
* `nomad operator keyring` was missing the general options section
* `nomad operator metrics` was missing a page in the docs entirely
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Create a new "Operating Nomad" section of the docs where we can put reference
material for operators that doesn't quite fit in either configuration file /
command line documentation or a step-by-step Learn Guide. Pre-populate this
with the existing telemetry docs and some links out to the Learn Guide
sections.