Commit Graph

73 Commits

Author SHA1 Message Date
Luiz Aoqui 1d3a38aae0
Revert "deps: update go-metrics to v0.5.3 in 1.6.x" (#19375)
* Revert "deps: update go-metrics to v0.5.3 (#19190) (#19208)"

This reverts commit 1112a282d76e67e26b3973a1e4cfc85b22678072.

* changelog add entry for #19375
2023-12-08 08:47:02 -05:00
Luiz Aoqui e552e1726f
deps: update go-metrics to v0.5.3 (#19190) (#19208)
Update `go-metrics` to v0.5.3 to pick
https://github.com/hashicorp/go-metrics/pull/146.
2023-11-28 13:52:25 -05:00
hashicorp-copywrite[bot] 005636afa0 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Tim Gross 5e75ea9fb3
metrics: Add RPC rate metrics to endpoints that validate TLS names (#15900) 2023-01-26 15:04:25 -05:00
Tim Gross 6677a103c2
metrics: measure rate of RPC requests that serve API (#15876)
This changeset configures the RPC rate metrics that were added in #15515 to all
the RPCs that support authenticated HTTP API requests. These endpoints already
configured with pre-forwarding authentication in #15870, and a handful of others
were done already as part of the proof-of-concept work. So this changeset is
entirely copy-and-pasting one method call into a whole mess of handlers.

Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining
set of RPCs that have no API consumers or aren't authenticated, in smaller
chunks that can be more thoughtfully reviewed.
2023-01-25 16:37:24 -05:00
Tim Gross f3f64af821
WI: allow workloads to use RPCs associated with HTTP API (#15870)
This changeset allows Workload Identities to authenticate to all the RPCs that
support HTTP API endpoints, for use with PR #15864.

* Extends the work done for pre-forwarding authentication to all RPCs that
  support a HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node
  endpoints that are currently used to support both tokens and client secrets.

Intentionally excluded from this changeset:
* The Variables endpoint still has custom handling because of the implicit
  policies. Ideally we'll figure out an efficient way to resolve those into real
  policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't
  support HTTP endpoints) have not been updated with the new pre-forwarding auth
  We'll be doing this under a separate PR to support RPC rate metrics.
2023-01-25 14:33:06 -05:00
Tim Gross 2030d62920
implement pre-forwarding auth on select RPCs (#15513)
In #15417 we added a new `Authenticate` method to the server that returns an
`AuthenticatedIdentity` struct. This changeset implements this method for a
small number of RPC endpoints that together represent all the various ways in
which RPCs are sent, so that we can validate that we're happy with this
approach.
2023-01-24 10:52:07 -05:00
Tim Gross f61f801e77
provide `RPCContext` to all RPC handlers (#15430)
Upcoming work to instrument the rate of RPC requests by consumer (and eventually
rate limit) requires that we thread the `RPCContext` through all RPC
handlers so that we can access the underlying connection. This changeset adds
the context to everywhere we intend to initially support it and intentionally
excludes streaming RPCs and client RPCs.

To improve the ergonomics of adding the context everywhere its needed and to
clarify the requirements of dynamic vs static handlers, I've also done a good
bit of refactoring here:

* canonicalized the RPC handler fields so they're as close to identical as
  possible without introducing unused fields (i.e. I didn't add loggers if the
  handler doesn't use them already).
* canonicalized the imports in the handler files.
* added a `NewExampleEndpoint` function for each handler that ensures we're
  constructing the handlers with the required arguments.
* reordered the registration in server.go to match the order of the files (to
  make it easier to see if we've missed one), and added a bunch of commentary
  there as to what the difference between static and dynamic handlers is.
2022-12-01 10:05:15 -05:00
Tim Gross 05a46e6648
make eval cancelation really async with `Eval.Ack` (#15298)
Ensure we never block in the `Eval.Ack`
2022-11-18 08:38:17 -05:00
Tim Gross b74a868aae
make eval cancelation async with `Eval.Ack` (#15294)
In #14621 we added an eval canelation reaper goroutine with a channel that
allowed us to wake it up. But we forgot to actually send on this channel from
`Eval.Ack` and are still committing the cancelations synchronously. Fix this by
sending on the buffered channel to wake up the reaper instead.
2022-11-17 16:40:41 -05:00
Tim Gross 6415fb4284
eval broker: shed all but one blocked eval per job after ack (#14621)
When an evaluation is acknowledged by a scheduler, the resulting plan is
guaranteed to cover up to the `waitIndex` set by the worker based on the most
recent evaluation for that job in the state store. At that point, we no longer
need to retain blocked evaluations in the broker that are older than that index.

Move all but the highest priority / highest `ModifyIndex` blocked eval into a
canceled set. When the `Eval.Ack` RPC returns from the eval broker it will
signal a reap of a batch of cancelable evals to write to raft. This paces the
cancelations limited by how frequently the schedulers are acknowledging evals;
this should reduce the risk of cancelations from overwhelming raft relative to
scheduler progress. In order to avoid straggling batches when the cluster is
quiet, we also include a periodic sweep through the cancelable list.
2022-11-16 16:10:11 -05:00
Tim Gross 37134a4a37
eval delete: move batching of deletes into RPC handler and state (#15117)
During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending larges batches of IDs. Although the command's batch size
was carefully tuned, we still need to be JSON deserialize, re-serialize to
MessagePack, send the log entries through raft, and get the FSM applied.

To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look a the failed options first:

* A naive solution here would be to just send the filter as the raft request and
  let the FSM apply delete the whole set in a single operation. Benchmarking with
  1M evals on a 3 node cluster demonstrated this can block the FSM apply for
  several minutes, which puts the cluster at risk if there's a leadership
  failover (the barrier write can't be made while this apply is in-flight).

* A less naive but still bad solution would be to have the RPC handler filter
  and paginate, and then hand a list of IDs to the existing raft log
  entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
  took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only abut 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completes in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.
2022-11-14 14:08:13 -05:00
Tim Gross 9e1c0b46d8
API for `Eval.Count` (#15147)
Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is
designed to support interactive use in the `nomad eval delete` command to get a
count of evals expected to be deleted before doing so.

The state store operations to do this sort of thing are somewhat expensive, but
it's cheaper than serializing a big list of evals to JSON. Note that although it
seems like this could be done as an extra parameter and response field on
`Eval.List`, having it as its own endpoint avoids having to change the response
body shape and lets us avoid handling the legacy filter params supported by
`Eval.List`.
2022-11-07 08:53:19 -05:00
Tim Gross dab9388c75
refactor eval delete safety check (#15070)
The `Eval.Delete` endpoint has a helper that takes a list of jobs and allocs and
determines whether the eval associated with those is safe to delete (based on
their state). Filtering improvements to the `Eval.Delete` endpoint are going to
need this check to run in the state store itself for consistency.

Refactor to push this check down into the state store to keep the eventual diff
for that work reasonable.
2022-10-28 09:10:33 -04:00
Phil Renaud e9219a1ae0
Allow wildcard for Evaluations API (#13530)
* Failing test and TODO for wildcard

* Alias the namespace query parameter for Evals

* eval: fix list when using ACLs and * namespace

Apply the same verification process as in job, allocs and scaling
policy list endpoints to handle the eval list when using an ACL token
with limited namespace support but querying using the `*` wildcard
namespace.

* changelog: add entry for #13530

* ui: set namespace when querying eval

Evals have a unique UUID as ID, but when querying them the Nomad API
still expects a namespace query param, otherwise it assumes `default`.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-07-11 16:42:17 -04:00
James Rasell 0c0b028a59
core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell 181b247384
core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new opertor scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Luiz Aoqui 15089f055f
api: add related evals to eval details (#12305)
The `related` query param is used to indicate that the request should
return a list of related (next, previous, and blocked) evaluations.

Co-authored-by: Jasmine Dahilig <jasmine@hashicorp.com>
2022-03-17 13:56:14 -04:00
Luiz Aoqui 2876739a51
api: apply consistent behaviour of the reverse query parameter (#12244) 2022-03-11 19:44:52 -05:00
Luiz Aoqui ab8ce87bba
Add pagination, filtering and sort to more API endpoints (#12186) 2022-03-08 20:54:17 -05:00
Luiz Aoqui 01931587ba
api: paginated results with different ordering (#12128)
The paginator logic was built when go-memdb iterators would return items
ordered lexicographically by their ID prefixes, but #12054 added the
option for some tables to return results ordered by their `CreateIndex`
instead, which invalidated the previous paginator assumption.

The iterator used for pagination must still return results in some order
so that the paginator can properly handle requests where the next_token
value is not present in the results anymore (e.g., the eval was GC'ed).

In these situations, the paginator will start the returned page in the
first element right after where the requested token should've been.

This commit moves the logic to generate pagination tokens from the
elements being paginated to the iterator itself so that callers can have
more control over the token format to make sure they are properly
ordered and stable.

It also allows configuring the paginator as being ordered in ascending
or descending order, which is relevant when looking for a token that may
not be present anymore.
2022-03-01 15:36:49 -05:00
Luiz Aoqui de91954582
initial base work for implementing sorting and filter across API endpoints (#12076) 2022-02-16 14:34:36 -05:00
Luiz Aoqui 110dbeeb9d
Add `go-bexpr` filters to evals and deployment list endpoints (#12034) 2022-02-16 11:40:30 -05:00
Seth Hoenig 40c714a681 api: return sorted results in certain list endpoints
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.

- Eval.List
- Deployment.List
2022-02-15 13:48:28 -06:00
Luiz Aoqui 0e09b120e4
fix mTLS certificate check on agent to agent RPCs (#11998)
PR #11956 implemented a new mTLS RPC check to validate the role of the
certificate used in the request, but further testing revealed two flaws:

  1. client-only endpoints did not accept server certificates so the
     request would fail when forwarded from one server to another.
  2. the certificate was being checked after the request was forwarded,
     so the check would happen over the server certificate, not the
     actual source.

This commit checks for the desired mTLS level, where the client level
accepts both, a server or a client certificate. It also validates the
cercertificate before the request is forwarded.
2022-02-04 20:35:20 -05:00
Luiz Aoqui c4cff5359f
Verify TLS certificate on endpoints that are used between agents only (#11956) 2022-02-02 15:03:18 -05:00
Tim Gross e046bb31e9
api: respect wildcard in evaluations list API (#11710) 2021-12-20 12:23:50 -05:00
Tim Gross 624ecab901
evaluations list pagination and filtering (#11648)
API queries can request pagination using the `NextToken` and `PerPage`
fields of `QueryOptions`, when supported by the underlying API.

Add a `NextToken` field to the `structs.QueryMeta` so that we have a
common field across RPCs to tell the caller where to resume paging
from on their next API call. Include this field on the `api.QueryMeta`
as well so that it's available for future versions of List HTTP APIs
that wrap the response with `QueryMeta` rather than returning a simple
list of structs. In the meantime callers can get the `X-Nomad-NextToken`.

Add pagination to the `Eval.List` RPC by checking for pagination token
and page size in `QueryOptions`. This will allow resuming from the
last ID seen so long as the query parameters and the state store
itself are unchanged between requests.

Add filtering by job ID or evaluation status over the results we get
out of the state store.

Parse the query parameters of the `Eval.List` API into the arguments
expected for filtering in the RPC call.
2021-12-10 13:43:03 -05:00
James Rasell b6813f1221
chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
Kris Hicks d71a90c8a4
Fix some errcheck errors (#9811)
* Throw away result of multierror.Append

When given a *multierror.Error, it is mutated, therefore the return
value is not needed.

* Simplify MergeMultierrorWarnings, use StringBuilder

* Hash.Write() never returns an error

* Remove error that was always nil

* Remove error from Resources.Add signature

When this was originally written it could return an error, but that was
refactored away, and callers of it as of today never handle the error.

* Throw away results of io.Copy during Bridge

* Handle errors when computing node class in test
2021-01-14 12:46:35 -08:00
Michael Schurter 8ccbd92cb6 api: add field filters to /v1/{allocations,nodes}
Fixes #9017

The ?resources=true query parameter includes resources in the object
stub listings. Specifically:

- For `/v1/nodes?resources=true` both the `NodeResources` and
  `ReservedResources` field are included.
- For `/v1/allocations?resources=true` the `AllocatedResources` field is
  included.

The ?task_states=false query parameter removes TaskStates from
/v1/allocations responses. (By default TaskStates are included.)
2020-10-14 10:35:22 -07:00
Mahmood Ali 4b2ba62e35 acl: check ACL against object namespace
Fix a bug where a millicious user can access or manipulate an alloc in a
namespace they don't have access to.  The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and performs the allocation lookup independently from namespaces.

Here, we check that the requested can access the alloc namespace
regardless of the declared request namespace.

Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace.  Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this will be very disruptive and inappropriate for a nomad point
release.  As such, we maintain current behavior that doesn't require
passing the proper namespace in request.  A future major release may
start enforcing checking declared namespace.
2019-10-08 12:59:22 -04:00
Alex Dadgar 1857f5d7c1
Update nomad/eval_endpoint.go
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2019-03-05 15:19:07 -08:00
Michael Schurter e37bbb21a5 nomad: simplify code and improve parameter name 2019-03-04 13:44:14 -08:00
Michael Schurter 05f51499ba nomad: compare current eval when setting WaitIndex
Consider currently dequeued Evaluation's ModifyIndex when determining
its WaitIndex. Normally the Evaluation itself would already be in the
state store snapshot used to determine the WaitIndex. However, since the FSM
applies Raft messages to the state store concurrently with Dequeueing,
it's possible the currently dequeued Evaluation won't yet exist in the
state store snapshot used by JobsForEval.

This can be solved by always considering the current eval's modify index
and using it if it is greater than all of the evals returned by the
state store.
2019-03-01 15:23:39 -08:00
Alex Dadgar 4bdccab550 goimports 2019-01-22 15:44:31 -08:00
Alex Dadgar 3c19d01d7a server 2018-09-15 16:23:13 -07:00
Michael Schurter 84d8a51be1 SecretID -> AuthToken 2017-10-12 15:16:33 -07:00
Michael Schurter e50acae1a9 ForceLeave endpoint must use Server.ResolveToken
The ForceLeaveRequest endpoint may only be called on servers, but the
code was using a Client to resolve tokens. This would cause a panic when
an agent wasn't both a Server and a Client.
2017-10-09 15:49:04 -07:00
Michael Schurter 22169a7cd4 Eval.Allocations ACL enforcement 2017-10-03 14:57:47 -07:00
Michael Schurter b3db8f41fd Eval.List ACL enforcement 2017-10-03 14:57:47 -07:00
Michael Schurter fae1be5ab2 Eval.GetEval ACL enforcement 2017-10-03 14:57:47 -07:00
Alex Dadgar 567eef50a8 Address feedback 2017-09-14 14:28:43 -07:00
Alex Dadgar 6911bd7676 Worker waits til max ModifyIndex across EvalsByJob
This PR fixes a scheduling race condition in which the plan results from
one invocation of the scheduler were not being considered by the next
since the Worker was not waiting for the correct index.

Fixes https://github.com/hashicorp/nomad/issues/3198
2017-09-14 14:28:43 -07:00
Alex Dadgar 84d06f6abe Sync namespace changes 2017-09-07 17:04:21 -07:00
Alex Dadgar b69b357c7f Nomad builds 2017-02-07 20:31:23 -08:00
Alex Dadgar a1d08c2aba Add scheduler version enforcement 2016-10-26 14:52:48 -07:00
Alex Dadgar 060318845f Comments addressed 2016-05-31 11:39:03 -07:00
Alex Dadgar 1f9f015c1b Fix race condition in which a reblocked evaluation could be dropped 2016-05-27 16:53:10 -07:00
Alex Dadgar 18d9e89065 Reuse the same evaluation and reblock it until there is no more work to do 2016-05-24 20:10:56 -07:00