Commit Graph

35 Commits

Author SHA1 Message Date
Luiz Aoqui 3479e2231f
core: enforce strict steps for clients reconnect (#15808)
When a Nomad client that is running an allocation with
`max_client_disconnect` set misses a heartbeat the Nomad server will
update its status to `disconnected`.

Upon reconnecting, the client will make three main RPC calls:

- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about
  allocations, such as their `ClientStatus`, task states etc.
- `Node.Register` is used to upsert the entire node information,
  including its status.

These calls are made concurrently and are also running in parallel with
the scheduler. Depending on the order they run the scheduler may end up
with incomplete data when reconciling allocations.

For example, a client disconnects and its replacement allocation cannot
be placed anywhere else, so there's a pending eval waiting for
resources.

When this client comes back the order of events may be:

1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc to
   the client. The client is now assigned two allocations: the original
   alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to
   `running`.
4. Scheduler notices too many allocs and stops the replacement.

This creates unnecessary placements or, in a different order of events,
may leave the job without any allocations running until the whole state
is updated and reconciled.

To avoid problems like this clients must update _all_ of its relevant
information before they can be considered `ready` and available for
scheduling.

To achieve this goal the RPC endpoints mentioned above have been
modified to enforce strict steps for nodes reconnecting:

- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing`
  status until it successfully calls `Node.UpdateAlloc`.

These changes are done server-side to avoid the need of additional
coordination between clients and servers. Clients are kept oblivious of
these changes and will keep making these calls as they normally would.

The verification of whether allocations have been updates is done by
storing and comparing the Raft index of the last time the client missed
a heartbeat and the last time it updated its allocations.
2023-01-25 15:53:59 -05:00
Michael Schurter ed3218c3dd
Fixing flaky TestOverlap test (#14780)
* test: ensure feasible node selected in overlap test

* test: warn when getting close to retry limit
2022-10-03 14:35:02 -07:00
Mahmood Ali a9d5e4c510
scheduler: stopped-yet-running allocs are still running (#10446)
* scheduler: stopped-yet-running allocs are still running

* scheduler: test new stopped-but-running logic

* test: assert nonoverlapping alloc behavior

Also add a simpler Wait test helper to improve line numbers and save few
lines of code.

* docs: tried my best to describe #10446

it's not concise... feedback welcome

* scheduler: fix test that allowed overlapping allocs

* devices: only free devices when ClientStatus is terminal

* test: output nicer failure message if err==nil

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-09-13 12:52:47 -07:00
Michael Schurter fd68bbc342 test: update tests to properly use AllocDir
Also use t.TempDir when possible.
2021-10-19 10:49:07 -07:00
Dave May c37a6ed583
cli: rename paths in debug bundle for clarity (#11307)
* Rename folders to reflect purpose
* Improve captured files test coverage
* Rename CSI plugins output file
* Add changelog entry
* fix test and make changelog message more explicit

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2021-10-13 18:00:55 -04:00
Dave May 2d14c54fa0
debug: Improve namespace and region support (#11269)
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
2021-10-12 16:58:41 -04:00
Dave May 1e51d00d98
Add remaining pprof profiles to nomad operator debug (#10748)
* Add remaining pprof profiles to debug dump
* Refactor pprof profile capture
* Add WaitForFilesUntil and WaitForResultUntil utility functions
* Add CHANGELOG entry
2021-06-21 14:22:49 -04:00
Mahmood Ali aa77c2731b tests: use standard library testing.TB
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
2021-06-09 16:18:45 -07:00
Dave May e89302aa4b
nomad operator debug - add client node filtering arguments (#9331)
* operator debug - add client node filtering arguments

* add WaitForClient helper function

* use RPC in WaitForClient to avoid unnecessary imports

* guard against nil values

* move initialization up and shorten test duration

* cleanup nodeLookupFailCount logic

* only display max node notice if we actually tried to capture nodes
2020-11-12 11:25:28 -05:00
Dave May f37e90be18
Metrics gotemplate support, debug bundle features (#9067)
* add goroutine text profiles to nomad operator debug

* add server-id=all to nomad operator debug

* fix bug from changing metrics from string to []byte

* Add function to return MetricsSummary struct, metrics gotemplate support

* fix bug resolving 'server-id=all' when no servers are available

* add url to operator_debug tests

* removed test section which is used for future operator_debug.go changes

* separate metrics from operator, use only structs from go-metrics

* ensure parent directories are created as needed

* add suggested comments for text debug pprof

* move check down to where it is used

* add WaitForFiles helper function to wait for multiple files to exist

* compact metrics check

Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>

* fix github's silly apply suggestion

Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
2020-10-14 15:16:10 -04:00
Mahmood Ali 0a539c629f tests: wait until leadership loop finishes
Reverts d5c7d6e491e36a11159211f5236c19a41bed4d8e .

We actually need to forward the request to ensure that the leader is
properly configured and that establishedLeadership completes.
2020-03-06 14:41:59 -05:00
Mahmood Ali 6daac02548 avoid forwarding leadership checks in tests
The tests only care if a test server recognizes the leader.
2020-03-02 13:47:43 -05:00
Mahmood Ali 4b2ba62e35 acl: check ACL against object namespace
Fix a bug where a millicious user can access or manipulate an alloc in a
namespace they don't have access to.  The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and performs the allocation lookup independently from namespaces.

Here, we check that the requested can access the alloc namespace
regardless of the declared request namespace.

Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace.  Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this will be very disruptive and inappropriate for a nomad point
release.  As such, we maintain current behavior that doesn't require
passing the proper namespace in request.  A future major release may
start enforcing checking declared namespace.
2019-10-08 12:59:22 -04:00
Mahmood Ali 6cefd8f97e tests: attempt to fix TestAutopilot_CleanupStaleRaftServer
Also add a utility function for waiting for stable leadership
2019-09-04 08:49:33 -04:00
Mahmood Ali fc72fff0ed test helper for registering jobs with acl
Test helper that allows registration of jobs when ACL is activated.
2019-04-30 10:23:56 -04:00
Mahmood Ali 8c82c19831 tests: IsTravis() -> IsCI()
Replace IsTravis() references that is intended for more CI environments
rather than for Travis environment specifically.
2019-02-20 08:21:03 -05:00
Mahmood Ali 33ff8c3e8d tests: expect Docker on AppVeyor
Prepare to run docker on AppVeyor Windows environment
2019-02-20 07:41:47 -05:00
Alex Dadgar 4bdccab550 goimports 2019-01-22 15:44:31 -08:00
Mahmood Ali b7d421a149 tests: WaitForRunning checks for pending only
WaitForRunning risks a race condition where the allocation succeeds and
completes before WaitForRunning is called (or while it is running).

Here, I made the behavior match the function documentation.

I considered making it stricter, but callers need to account for
allocation terminating immediately after WaitForRunning terminates
anyway.
2019-01-10 15:36:57 -05:00
Mahmood Ali c3eaa0f4c8 tests: enable and fix tests requiring mock driver 2019-01-10 10:10:11 -05:00
Michael Schurter f279b1d1b1 tests: test logs endpoint against pending task
Although the really exciting change is making WaitForRunning return the
allocations that it started. This should cut down test boilerplate
significantly.
2018-10-16 16:56:55 -07:00
Michael Schurter 6def5bc4f9 client: set host name when migrating over tls
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names ephemeral dir migrations were broken when TLS was enabled.

Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
2018-09-05 17:24:17 -07:00
Alex Dadgar af1b185ce4 Fix flaky deadline tests 2018-04-03 16:51:57 -07:00
Alex Dadgar b18f789020 Unmark drain when nodes hit their deadline and only batch/system left and add all job type integration test 2018-03-28 17:25:58 -07:00
Michael Schurter 5f1f91a46c Use go-testing-interface instead of testing
This drops the testings stdlib pkg from our dependencies. Saves a
whopping 46kb on our binary (was really hoping for more of a win there),
but also avoids potential ugliness with how testing sets flags.
2017-07-25 15:35:19 -07:00
Michael Schurter aede1478db Create AssertUntil helper func 2017-04-06 17:05:09 -07:00
Alex Dadgar d82747bd33 Benchmark 2016-12-09 14:44:50 -08:00
Abhishek Chanda babd75c55c Use the CI env variable
Travis exports this variable to all builds. We don't need out own.
A number of other CI systems use this variable too.
2016-02-06 01:11:44 -08:00
Alex Dadgar e8067029cc Small fixes 2016-02-04 14:19:27 -08:00
Alex Dadgar a4ddfc306b ordering issue 2016-01-21 13:28:48 -08:00
Alex Dadgar fd21e890db Time Duration fixes 2016-01-21 12:29:13 -08:00
Alex Dadgar 1ceb6f012a Fix a bunch of tests
Up timeouts

trusty travis beta

Increase timeouts
2016-01-20 16:03:53 -08:00
Armon Dadgar 42734d2c3a nomad: allow region forwarding for status endpoints 2015-09-06 18:07:05 -07:00
Armon Dadgar 0bca2285a4 nomad: testing status endpoint 2015-06-04 13:26:16 +02:00
Armon Dadgar c18c068bf2 nomad: testing remove peer 2015-06-04 13:02:39 +02:00