* scheduler: stopped-yet-running allocs are still running
* scheduler: test new stopped-but-running logic
* test: assert nonoverlapping alloc behavior
Also add a simpler Wait test helper to improve line numbers in test
failures and save a few lines of code.
* docs: tried my best to describe #10446
it's not concise... feedback welcome
* scheduler: fix test that allowed overlapping allocs
* devices: only free devices when ClientStatus is terminal
* test: output nicer failure message if err==nil
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* Fixing heading order, adding text for links
* Apply suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Applying more suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This PR implements support for check_restart for checks registered
in the Nomad service provider.
Unlike Consul, Nomad service checks never report a "warning" status,
and so the check_restart.ignore_warnings configuration is not valid
for Nomad service checks.
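As a rough illustration of the restriction above, a validation step could reject ignore_warnings for Nomad-provider checks. This is only a sketch: the `serviceCheck` and `checkRestart` types, the `validateCheckRestart` function, and the error text are hypothetical and are not taken from the Nomad code base.

```go
package main

import (
	"errors"
	"fmt"
)

// checkRestart and serviceCheck are simplified, hypothetical stand-ins for
// the job spec structures; only the fields needed here are modeled.
type checkRestart struct {
	Limit          int
	IgnoreWarnings bool
}

type serviceCheck struct {
	Provider     string // "consul" or "nomad"
	CheckRestart *checkRestart
}

// validateCheckRestart rejects ignore_warnings for the Nomad provider,
// because Nomad service checks never report a "warning" status.
func validateCheckRestart(c serviceCheck) error {
	if c.CheckRestart == nil {
		return nil
	}
	if c.Provider == "nomad" && c.CheckRestart.IgnoreWarnings {
		return errors.New("check_restart.ignore_warnings is not valid for Nomad service checks")
	}
	return nil
}

func main() {
	c := serviceCheck{
		Provider:     "nomad",
		CheckRestart: &checkRestart{Limit: 3, IgnoreWarnings: true},
	}
	fmt.Println(validateCheckRestart(c))
}
```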
* Preliminary version
* Addition of a filtering helper and more styling for service check history
* Fixed-widths on table cols
* Account for new rows in test
* Explanation for magic numbers
Restrict variable paths to RFC3986 URL-safe characters that don't conflict with
the use of characters "@" and "." in `template` blocks. This prevents users from
writing variables that will require tricky templating syntax or that they simply
won't be able to use.
Also restrict the length so that a user can't make queries in the state store
unusually expensive (as they are O(k) on the key length).
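A minimal sketch of what such a path check could look like. The exact character set and the 128-character limit below are assumptions for illustration (the description above only says the set is URL-safe and excludes "@" and "."), and `validatePath` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxPathLen is an assumed limit for illustration; the description above
// does not state the actual value.
const maxPathLen = 128

// validPath is a hypothetical pattern: alphanumerics plus "-", "_", "~",
// and "/" as a separator. It deliberately excludes "@" and ".".
var validPath = regexp.MustCompile(`^[a-zA-Z0-9_~/-]+$`)

// validatePath returns an error if the path is empty, too long, or contains
// a character outside the allowed set.
func validatePath(path string) error {
	if len(path) == 0 || len(path) > maxPathLen {
		return fmt.Errorf("path length must be between 1 and %d, got %d", maxPathLen, len(path))
	}
	if !validPath.MatchString(path) {
		return fmt.Errorf("path %q contains characters outside [a-zA-Z0-9-_~/]", path)
	}
	return nil
}

func main() {
	for _, p := range []string{"nomad/jobs/example", "app.config", "user@host"} {
		fmt.Printf("%q -> %v\n", p, validatePath(p))
	}
}
```

Checking the length before running the regular expression keeps the cost of rejecting an oversized path constant.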
This PR refactors agent/consul/check_watcher into client/serviceregistration,
and abstracts away the Consul-specific check lookups.
In doing so we should be able to reuse the existing check watcher logic to
also watch NSD checks in a followup PR.
A chunk of consul/unit_test.go is removed - we'll cover that in e2e tests
in a followup PR if needed. In the long run I'd like to remove this whole file.
* Test for aggregate service health and consul agg service health
* If a consul UI link is present, show a nice little link
* Also add to job services page
* Reallocate consul url in mock
Running `make check` on macOS identifies some dead code because the code is used
only with the Linux build tag. Move this code into appropriately-tagged code
files.
The ec2info tool was never intuitive to run: it required setting the AWS
environment variables, cd'ing into tools/ec2info, and knowing which
command to invoke.
This PR makes it so we can run ec2info just by running
make ec2info
The command now also checks for the AWS environment variables being
set, and provides a useful error if they are not.
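The environment check could be sketched roughly as follows. The variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS credential variables, but which variables ec2info actually checks is an assumption here, and `missingEnv` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingEnv returns the names that are unset or empty according to lookup.
// Passing the lookup function in makes the check easy to test without
// touching the real process environment.
func missingEnv(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, n := range names {
		if v, ok := lookup(n); !ok || v == "" {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	// Assumed set of required variables, for illustration only.
	required := []string{"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}
	if missing := missingEnv(required, os.LookupEnv); len(missing) > 0 {
		fmt.Printf("missing required environment variables: %s\n", strings.Join(missing, ", "))
		return
	}
	fmt.Println("AWS environment variables are set")
}
```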
The NSD checks tests were racy: the check may not have been
triggered by the time it was queried. This change wraps the
check so it can account for this.
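One common way to account for this kind of race in a test is to poll until a condition holds or a deadline passes. The sketch below shows that general technique; `waitFor` and its defaults are hypothetical and are not the helper used in the Nomad tests.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed defaults for illustration.
const (
	defaultTimeout  = 2 * time.Second
	defaultInterval = 10 * time.Millisecond
)

// waitFor polls cond until it returns true or the timeout elapses.
// cond is always evaluated at least once.
func waitFor(timeout, interval time.Duration, cond func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate a check result that only becomes available a little later.
	ready := time.Now().Add(50 * time.Millisecond)
	ok := waitFor(defaultTimeout, defaultInterval, func() bool {
		return time.Now().After(ready)
	})
	fmt.Println("check observed:", ok)
}
```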
This removes the current ACL expiration GC section in order to get
the tests passing and allow more time to investigate the test. I
have full confidence the feature is working as expected and have
tested extensively locally.
A Nomad user reported problems with CSI volumes associated with failed
allocations, where the Nomad server did not send a controller unpublish RPC.
The controller unpublish is skipped if other non-terminal allocations on the
same node claim the volume. The check had a bug: it incorrectly included the
allocation belonging to the claim being freed. During a normal allocation stop
(a job stop or a new version of the job), the allocation is already terminal,
but allocations that fail are not yet marked terminal when the client sends the
unpublish RPC to the server.
For CSI plugins that support controller attach/detach, this means that the
controller will not be able to detach the volume from the allocation's host and
the replacement claim will fail until a GC is run. This changeset fixes the
conditional so that the claim's own allocation is not included, and makes the
logic easier to read. Include a test case covering this path.
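The corrected conditional can be sketched as below. This is a simplified model, not the server code: the `alloc` type and `needsControllerUnpublish` function are hypothetical names used only to show the claim's own allocation being excluded.

```go
package main

import "fmt"

// alloc is a simplified stand-in for an allocation that claims the volume.
type alloc struct {
	ID       string
	NodeID   string
	Terminal bool
}

// needsControllerUnpublish reports whether the controller unpublish should
// run when freeing the claim held by claimAllocID on nodeID. It is skipped
// only if some *other* non-terminal allocation on the same node still
// claims the volume.
func needsControllerUnpublish(claimAllocID, nodeID string, claimants []alloc) bool {
	for _, a := range claimants {
		if a.ID == claimAllocID {
			continue // the claim being freed must not block its own unpublish
		}
		if a.NodeID == nodeID && !a.Terminal {
			return false
		}
	}
	return true
}

func main() {
	// A failed allocation is not yet marked terminal when the client sends
	// the unpublish RPC, so it still appears as a non-terminal claimant.
	claimants := []alloc{{ID: "failed-alloc", NodeID: "node1", Terminal: false}}
	fmt.Println(needsControllerUnpublish("failed-alloc", "node1", claimants))
}
```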
Also includes two minor extra bugfixes:
* Entities we get from the state store should always be copied before
altering. Ensure that we copy the volume in the top-level unpublish workflow
before handing off to the steps.
* The list stub object for volumes in `nomad/structs` did not match the stub
object in `api`. The `api` package also did not include the current
readers/writers fields that are expected by the UI. True up the two objects and
add the previously undocumented fields to the docs.
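The copy-before-altering rule from the first bullet can be illustrated with a small sketch. The `volume` type and its `Copy` method here are hypothetical simplifications; the point is only that a deep copy keeps changes away from the object the state store still holds.

```go
package main

import "fmt"

// volume is a simplified stand-in for a volume fetched from a state store.
type volume struct {
	ID         string
	ReadClaims map[string]string // alloc ID -> node ID
}

// Copy returns a deep copy so callers can alter the result without
// touching the object that the state store still references.
func (v *volume) Copy() *volume {
	if v == nil {
		return nil
	}
	out := &volume{ID: v.ID, ReadClaims: make(map[string]string, len(v.ReadClaims))}
	for k, val := range v.ReadClaims {
		out.ReadClaims[k] = val
	}
	return out
}

func main() {
	stored := &volume{ID: "vol1", ReadClaims: map[string]string{"alloc1": "node1"}}

	// Work on a copy; the stored object stays unchanged.
	working := stored.Copy()
	delete(working.ReadClaims, "alloc1")

	fmt.Println(len(stored.ReadClaims), len(working.ReadClaims))
}
```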
When querying the checks for an allocation, the request must be
forwarded to the agent that is running the allocation. If the
initial request is made to a server agent, the server can forward it
directly to the client agent running the allocation. If the
request is made to a client agent that is not running the allocation,
it must be forwarded to a server and then on to the correct
client.
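The routing decision described above can be summarized in a small sketch. The `agent` type, the `nextHop` function, and the action names are hypothetical, and the model assumes an agent is either a server or a client, never both.

```go
package main

import "fmt"

type action string

const (
	serveLocally    action = "serve locally"
	forwardToClient action = "forward to the client running the alloc"
	forwardToServer action = "forward to a server"
)

// agent is a hypothetical, simplified view of the agent receiving the request.
type agent struct {
	isServer    bool
	localAllocs map[string]bool // allocations running on this client
}

// nextHop decides what the receiving agent does with a checks request.
func nextHop(a agent, allocID string) action {
	switch {
	case a.isServer:
		// Servers can reach the client running the allocation directly.
		return forwardToClient
	case a.localAllocs[allocID]:
		return serveLocally
	default:
		// A client not running the alloc goes through a server, which
		// then forwards to the correct client.
		return forwardToServer
	}
}

func main() {
	client := agent{localAllocs: map[string]bool{"alloc1": true}}
	fmt.Println(nextHop(client, "alloc1"))
	fmt.Println(nextHop(client, "alloc2"))
	fmt.Println(nextHop(agent{isServer: true}, "alloc1"))
}
```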
The map of in-flight RPCs gets cleared by a goroutine in the test without first
locking it to make sure that it's not being accessed concurrently by the stats
fetcher itself. This can cause a panic in tests.
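A minimal sketch of the locking pattern, assuming the in-flight map is guarded by a mutex. The `statsFetcher` type and its fields and methods below are hypothetical and only model the shape of the problem.

```go
package main

import (
	"fmt"
	"sync"
)

// statsFetcher is a hypothetical model: a map of in-flight RPCs guarded by
// a mutex that every reader and writer must hold.
type statsFetcher struct {
	mu       sync.Mutex
	inflight map[string]struct{}
}

func newStatsFetcher() *statsFetcher {
	return &statsFetcher{inflight: make(map[string]struct{})}
}

// markInflight records an in-flight RPC for the given server.
func (f *statsFetcher) markInflight(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.inflight[id] = struct{}{}
}

// clearInflight resets the map while holding the lock, so it cannot run
// concurrently with markInflight.
func (f *statsFetcher) clearInflight() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.inflight = make(map[string]struct{})
}

// inflightCount returns the number of in-flight RPCs.
func (f *statsFetcher) inflightCount() int {
	f.mu.Lock()
	defer f.mu.Unlock()
	return len(f.inflight)
}

func main() {
	f := newStatsFetcher()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			f.markInflight(fmt.Sprintf("server-%d", i))
		}(i)
	}
	// A concurrent clear, as a test goroutine might do.
	wg.Add(1)
	go func() {
		defer wg.Done()
		f.clearInflight()
	}()
	wg.Wait()
	fmt.Println("in flight after concurrent clear:", f.inflightCount())
}
```

Without the lock in `clearInflight`, the Go runtime can abort with a concurrent map access error, which matches the panic described above.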