Commit graph

3078 commits

Author SHA1 Message Date
Drew Bailey da4af9bef3
fix tests, update changelog 2020-01-29 13:55:39 -05:00
Drew Bailey a61bf32314
Allow nomad monitor command to lookup server UUID
Allows addressing servers with nomad monitor using the servers name or
ID.

Also unifies logic for addressing servers for client_agent_endpoint
commands and makes addressing logic region aware.

rpc getServer test
2020-01-29 13:55:29 -05:00
Mahmood Ali 9611324654
Merge pull request #6922 from hashicorp/b-alloc-canoncalize
Handle Upgrades and Alloc.TaskResources modification
2020-01-28 15:12:41 -05:00
Mahmood Ali 90cae566e5
Merge pull request #6935 from hashicorp/b-default-preemption-flag
scheduler: allow configuring default preemption for system scheduler
2020-01-28 15:11:06 -05:00
Mahmood Ali af17b4afc7 Support customizing full scheduler config 2020-01-28 14:51:42 -05:00
Mahmood Ali f7a51a14c6
Merge pull request #6977 from hashicorp/b-leadership-flapping-2
Handle Nomad leadership flapping (attempt 2)
2020-01-28 11:40:41 -05:00
Mahmood Ali 687d2b7054 tests: defer closing shutdownCh 2020-01-28 09:53:48 -05:00
Mahmood Ali ded4233c27 tweak leadership flapping log messages 2020-01-28 09:49:36 -05:00
Mahmood Ali 79823ae07d handle channel close signal
Always deliver last value then send close signal.
2020-01-28 09:44:34 -05:00
Mahmood Ali d202924a93 include test and address review comments 2020-01-28 09:06:52 -05:00
Nick Ethier 5cbb94e16e consul: add support for canary meta 2020-01-27 09:53:30 -05:00
Mahmood Ali e436d2701a Handle Nomad leadership flapping
Fixes a deadlock in leadership handling if leadership flapped.

Raft propagates leadership transition to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[1] and does not move on to executing follower
related logic (in `raft.runFollower`).

While Raft `runLeader` defer function blocks, raft cannot process any other
raft operations.  For example, `run{Leader|Follower}` methods consume
`raft.applyCh`, and while runLeader defer is blocked, all raft log applications
or config lookup will block indefinitely.

Sadly, `leaderLoop` and `establishLeader` makes few Raft calls!
`establishLeader` attempts to auto-create autopilot/scheduler config [3]; and
`leaderLoop` attempts to check raft configuration [4].  All of these calls occur
without a timeout.

Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is
invoked and hit any of these Raft calls, Raft handler _deadlock_ forever.

Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to get in the following case:

* Agent metrics/stats http and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and server attempts to handle RPC calls
  (e.g. node/alloc updates) and these hang as well

As we create goroutines per RPC call, the number of goroutines grow over time
and may trigger a out of memory errors in addition to missed updates.

[1] d90d6d6bda/config.go (L190-L193)
[2] d90d6d6bda/raft.go (L425-L436)
[3] 2a89e47746/nomad/leader.go (L198-L202)
[4] 2a89e47746/nomad/leader.go (L877)
2020-01-22 13:08:34 -05:00
Mahmood Ali 129c884105 extract leader step function 2020-01-22 10:55:48 -05:00
Mahmood Ali f36cc54efd actually always canonicalize alloc.Job
alloc.Job may be stale as well and need to migrate it.  It does cost
extra cycles but should be negligible.
2020-01-15 09:02:48 -05:00
Mahmood Ali b1b714691c address review comments 2020-01-15 08:57:05 -05:00
Mahmood Ali 1ab682f622 scheduler: allow configuring default preemption for system scheduler
Some operators want a greater control over when preemption is enabled,
especially during an upgrade to limit potential side-effects.
2020-01-13 08:30:49 -05:00
Drew Bailey ff4bfb8809
Merge pull request #6841 from hashicorp/f-agent-pprof-acl
Remote agent pprof endpoints
2020-01-10 14:52:39 -05:00
Mahmood Ali bfa33cf471 canonicalize allocs from plan results too 2020-01-10 10:41:12 -05:00
Nick Ethier 1f28633954
Merge pull request #6816 from hashicorp/b-multiple-envoy
connect: configure envoy to support multiple sidecars in the same alloc
2020-01-09 23:25:39 -05:00
Drew Bailey b702dede49
adds qc param, address pr feedback 2020-01-09 15:15:11 -05:00
Drew Bailey 45210ed901
Rename profile package to pprof
Address pr feedback, rename profile package to pprof to more accurately
describe its purpose. Adds gc param for heap lookup profiles.
2020-01-09 15:15:10 -05:00
Drew Bailey 1b8af920f3
address pr feedback 2020-01-09 15:15:09 -05:00
Drew Bailey 4ced73875b
leave acl checking to rpc endpoints
fix test expectation

test wrapNonJSON
2020-01-09 15:15:08 -05:00
Drew Bailey 279512c7f8
provide helpful error, cleanup logic 2020-01-09 15:15:08 -05:00
Drew Bailey 7bbba613a5
prevent doubly wrapping with rpc error 2020-01-09 15:15:07 -05:00
Drew Bailey fd42020ad6
RPC server EnableDebug option
Passes in agent enable_debug config to nomad server and client configs.
This allows for rpc endpoints to have more granular control if they
should be enabled or not in combination with ACLs.

enable debug on client test
2020-01-09 15:15:07 -05:00
Drew Bailey 31c0aca10a
rename forward func, add comment for why we forward 2020-01-09 15:15:06 -05:00
Drew Bailey 9a80938fb1
region forwarding; prevent recursive forwards for impossible requests
prevent region forwarding loop, backfill tests

fix failing test
2020-01-09 15:15:06 -05:00
Drew Bailey 46121fe3fd
move shared structs out of client and into nomad 2020-01-09 15:15:05 -05:00
Drew Bailey aec81a0b99
api agent endpoints
helper func to return serverPart based off of serverID
2020-01-09 15:15:05 -05:00
Drew Bailey 3672414888
test pprof headers and profile methods
tidy up, add comments

clean up seconds param assignment
2020-01-09 15:15:04 -05:00
Drew Bailey fc37448683
warn when enabled debug is on when registering
m -> a receiver name

return codederrors, fix query
2020-01-09 15:15:04 -05:00
Drew Bailey 50288461c9
Server request forwarding for Agent.Profile
Return rpc errors for profile requests, set up remote forwarding to
target leader or server id for profile requests.

server forwarding, endpoint tests
2020-01-09 15:15:03 -05:00
Mahmood Ali d740d347ce Migrate old alloc structs on read
This commit ensures that Alloc.AllocatedResources is properly populated
when read from persistence stores (namely Raft and client state store).
The alloc struct may have been written previously by an arbitrary old
version that may only populate Alloc.TaskResources.
2020-01-09 08:46:50 -05:00
James Rasell df2dc48790 Fix error parsing config when setting consul.timeout. (#6907)
When parsing a config file which had the consul.timeout param set,
Nomad was reporting an error causing startup to fail. This seems
to be caused by the HCL decoder interpreting the timeout type as
an int rather than a string. This is caused by the struct
TimeoutHCL param having a hcl key of timeout alongside a Timeout
struct param of type time.Duration (int). Ensuring the decoder
ignores the Timeout struct param ensure the decoder runs
correctly.
2020-01-07 13:40:55 -05:00
Nick Ethier 677e9cdc16 connect: configure envoy such that multiple sidecars can run in the same alloc 2020-01-06 11:26:27 -05:00
Michael Schurter 92cdc9de01 nomad/state: remove dead upgrade path code
It is uncalled so there hsould be no runtime changes.
2019-12-20 11:10:22 -08:00
Drew Bailey d9e41d2880
docs for shutdown delay
update docs, address pr comments

ensure pointer is not nil

use pointer for diff tests, set vs unset
2019-12-16 11:38:35 -05:00
Drew Bailey ae145c9a37
allow only positive shutdown delay
more explicit test case, remove select statement
2019-12-16 11:38:30 -05:00
Drew Bailey 24929776a2
shutdown delay for task groups
copy struct values

ensure groupserviceHook implements RunnerPreKillhook

run deregister first

test that shutdown times are delayed

move magic number into variable
2019-12-16 11:38:16 -05:00
Seth Hoenig 270233e23d tests: remove trace statements from nodeDrainWatcher.watch
Avoid logging in the `watch` function as much as possible, since
it is not waited on during a server shutdown. When the logger
logs after a test passes, it may or may not cause the testing
framework to panic.

More info in:
https://github.com/golang/go/issues/29388#issuecomment-453648436
2019-12-16 07:08:11 -06:00
Michael Schurter 95fd2643d7 connect: canonicalize before adding sidecar
Fixes #6853

Canonicalize jobs first before adding any sidecars. This fixes a bug
where sidecar tasks were added without interpolated names and broke
validation. Sidecar tasks must be canonicalized independently.

Also adds a group network to the mock connect job because it wasn't a
valid connect job before!
2019-12-12 20:55:56 -08:00
Seth Hoenig d45dec1ca8 tests: parallelize state store tests
It has been decided we're going to live in a many core world.
Let's take advantage of that and parallelize these state store
tests which all run in memory and are largely CPU bound.

An unscientific benchmark demonstrating the improvement:

[mp state (master)] $ go test
PASS
ok  	github.com/hashicorp/nomad/nomad/state	5.162s

[mp state (f-parallelize-state-store-tests)] $ go test
PASS
ok  	github.com/hashicorp/nomad/nomad/state	1.527s
2019-12-11 09:36:37 -06:00
Seth Hoenig f0c3dca49c tests: swap lib/freeport for tweaked helper/freeport
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.

Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.

This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
2019-12-09 08:37:32 -06:00
Danielle Lancashire d2075ebae9 spellcheck: Fix spelling of retrieve 2019-12-05 18:59:47 -06:00
Mahmood Ali b3e557cae3 address feedback review
apply `s/requestAuthToken/requestACLToken/g`
2019-11-26 08:39:04 -05:00
Mahmood Ali 02e20c720b acl_endpoint: permission denied for unauthenticated requests
If ACL Request is unauthenticated, we should honor the anonymous token.
This PR makes few changes:

* `GetPolicy` endpoints may return policy if anonymous policy allows it,
or return permission denied otherwise.
* `ListPolicies` returns an empty policy list, or one with anonymous
policy if one exists.

Without this PR, the we return an incomprehensible error.

Before:
```
$ curl http://localhost:4646/v1/acl/policy/doesntexist; echo
acl token lookup failed: index error: UUID must be 36 characters
$ curl http://localhost:4646/v1/acl/policies; echo
acl token lookup failed: index error: UUID must be 36 characters
```

After:
```
$ curl http://localhost:4646/v1/acl/policy/doesntexist; echo
Permission denied
$ curl http://localhost:4646/v1/acl/policies; echo
[]
```
2019-11-22 08:43:09 -05:00
Michael Schurter 4b6762511d
Merge pull request #6021 from hashicorp/f-anonymous-policy-access
api: Update policy endpoint to permit anonymous access
2019-11-20 15:33:45 -08:00
Buck Doyle 5fcc00d0f9 Add gofmt changes 2019-11-20 12:47:01 -06:00
Buck Doyle dc9c0d5ead
Add explanatory comment 2019-11-20 11:45:44 -06:00
Buck Doyle bea9837510
Remove extraneous else block 2019-11-20 11:37:45 -06:00
Buck Doyle d6a3e571bd
Remove extraneous whitespace 2019-11-20 11:37:01 -06:00
Buck Doyle db77a24ed3
Merge branch 'master' into f-policy-json 2019-11-20 11:20:07 -06:00
Mahmood Ali 7fb4c35831 comments and casing 2019-11-19 16:03:55 -05:00
Mahmood Ali 97d0fd009d 404 if token isn't found 2019-11-19 15:52:53 -05:00
Mahmood Ali 6f8bb5e90b api: acl bootstrap errors aren't 500
Noticed that ACL endpoints return 500 status code for user errors.  This
is confusing and can lead to false monitoring alerts.

Here, I introduce a concept of RPCCoded errors to be returned by RPC
that signal a code in addition to error message.  Codes for now match
HTTP codes to ease reasoning.

```
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 500 (ACL bootstrap already done (reset index: 9))

$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 400 (ACL bootstrap already done (reset index: 9))
```
2019-11-19 15:51:57 -05:00
Michael Schurter 796758b8a5 core: add semver constraint
The existing version constraint uses logic optimized for package
managers, not schedulers, when checking prereleases:

- 1.3.0-beta1 will *not* satisfy ">= 0.6.1"
- 1.7.0-rc1 will *not* satisfy ">= 1.6.0-beta1"

This is due to package managers wishing to favor final releases over
prereleases.

In a scheduler versions more often represent the earliest release all
required features/APIs are available in a system. Whether the constraint
or the version being evaluated are prereleases has no impact on
ordering.

This commit adds a new constraint - `semver` - which will use Semver
v2.0 ordering when evaluating constraints. Given the above examples:

- 1.3.0-beta1 satisfies ">= 0.6.1" using `semver`
- 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver`

Since existing jobspecs may rely on the old behavior, a new constraint
was added and the implicit Consul Connect and Vault constraints were
updated to use it.
2019-11-19 08:40:19 -08:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Luiz Aoqui e499c5bddc
Merge pull request #6698 from hashicorp/f-add-drain-start-time
api: add `StartedAt` in `Node.DrainStrategy`
2019-11-15 15:38:38 -05:00
Luiz Aoqui e862b61daa
api: use the same initial time for all drain properties 2019-11-14 16:06:09 -05:00
Drew Bailey 9b63828658
serverID to target remote leader or server
handle the case where we request a server-id which is this current server

update docs, error on node and server id params

more accurate names for tests

use shared no leader err, formatting

rm bad comment

remove redundant variable
2019-11-14 10:07:35 -05:00
Drew Bailey b644e1f47d
add server-id to monitor specific server 2019-11-14 09:53:41 -05:00
Drew Bailey 2185c1a89e
Allows monitor to target leader server
Allows user to pass in node-id=leader to forward monitor request to
remote a remote leader.
2019-11-14 09:53:40 -05:00
Luiz Aoqui 5bd7cdd5c3
api: add StartedAt in Node.DrainStrategy 2019-11-13 17:54:40 -05:00
Lars Lehtonen 22a3c21dd0
nomad: fix dropped test error 2019-11-13 12:49:41 -08:00
Michael Schurter 08afb7d605 vault: allow overriding implicit vault constraint
There's a bug in version parsing that breaks this constraint when using
a prerelease enterprise version of Vault (eg 1.3.0-beta1+ent). While
this does not fix the underlying bug it does provide a workaround for
future issues related to the implicit constraint. Like the implicit
Connect constraint: *all* implicit constraints should be overridable to
allow users to workaround bugs or other factors should the need arise.
2019-11-12 12:26:36 -08:00
Mahmood Ali c4c37cb42e vault: check token_explicit_max_ttl as well
Vault 1.2.0 deprecated `explicit_max_ttl` in favor of
`token_explicit_max_ttl`.
2019-11-12 08:47:23 -05:00
Lars Lehtonen adbab29228 nomad: TestEvalBroker_Dequeue_Empty_Timeout() proper goroutine error handling (#6657) 2019-11-08 14:35:06 -05:00
Drew Bailey 7420446458
Merge pull request #6639 from hashicorp/return-after-forward
return after request has been forwarded
2019-11-08 09:48:35 -05:00
Lars Lehtonen 39b68e0b88 TestEvalBroker_Dequeue_Blocked() proper goroutine error handling (#6651)
TestEvalBroker_Dequeue_Blocked() improve test readability
2019-11-08 08:52:23 -05:00
Nick Ethier e947aaed4f
nomad: fix bug that didn't allow for multiple connect services in same tg 2019-11-08 04:33:39 -05:00
Lars Lehtonen 6deae70e35 TestEvalBroker_PauseResumeNackTimeout() proper goroutine error handling (#6649)
TestEvalBroker_PauseResumeNackTimeout() improve test readability
2019-11-07 16:04:59 -05:00
Lars Lehtonen 2638cbb31d nomad: TestEvalBroker_EnqueueAll_Dequeue_Fair() proper goroutine error handling (#6636)
nomad: TestEvalBroker_EnqueueAll_Dequeue_Fair() improve test readability
2019-11-07 10:39:29 -05:00
Drew Bailey a5e2e1805f
return after request has been forwarded 2019-11-07 08:33:53 -05:00
Lars Lehtonen e64f98837c nomad: fix dropped error in TestJobEndpoint_Deregister_ACL (#6602) 2019-11-06 16:40:45 -05:00
Drew Bailey f4a7e3dc75
coordinate closing of doneCh, use interface to simplify callers
comments
2019-11-05 11:44:26 -05:00
Drew Bailey fe542680dc
log-json -> json
fix typo command/agent/monitor/monitor.go

Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com>

Update command/agent/monitor/monitor.go

Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com>

address feedback, lock to prevent send on closed channel

fix lock/unlock for dropped messages
2019-11-05 09:51:59 -05:00
Drew Bailey 298b8358a9
move forwarded monitor request into helper 2019-11-05 09:51:56 -05:00
Drew Bailey 8726b685de
address feedback 2019-11-05 09:51:56 -05:00
Drew Bailey 0e759c401c
moving endpoints over to frames 2019-11-05 09:51:54 -05:00
Drew Bailey 17d876d5ef
rename function, initialize log level better
underscores instead of dashes for query params
2019-11-05 09:51:53 -05:00
Drew Bailey 8178beecf0
address feedback, use agent_endpoint instead of monitor 2019-11-05 09:51:53 -05:00
Drew Bailey db65b1f4a5
agent:read acl policy for monitor 2019-11-05 09:51:52 -05:00
Drew Bailey 2533617888
rpc acl tests for both monitor endpoints 2019-11-05 09:51:51 -05:00
Drew Bailey 3c33747e1f
client monitor endpoint tests 2019-11-05 09:51:50 -05:00
Drew Bailey 4bc68855d0
use intercepting loggers for rpchandlers 2019-11-05 09:51:50 -05:00
Drew Bailey 3b9c33a5f0
new hclog with standardlogger intercept 2019-11-05 09:51:49 -05:00
Drew Bailey a45ae1cd58
enable json formatting, use queryoptions 2019-11-05 09:51:49 -05:00
Drew Bailey 786989dbe3
New monitor pkg for shared monitor functionality
Adds new package that can be used by client and server RPC endpoints to
facilitate monitoring based off of a logger

clean up old code

small comment about write

rm old comment about minsize

rename to Monitor

Removes connection logic from monitor command

Keep connection logic in endpoints, use a channel to send results from
monitoring

use new multisink logger and interfaces

small test for dropped messages

update go-hclogger and update sink/intercept logger interfaces
2019-11-05 09:51:49 -05:00
Lars Lehtonen 0a4542fadc nomad: fix test goroutine (#6593) 2019-10-31 08:23:32 -04:00
Seth Hoenig 98592113a3
Merge pull request #6582 from hashicorp/b-vault-createToken-log-msg
nomad: fix vault.CreateToken log message printing wrong error
2019-10-29 17:35:05 -05:00
Mahmood Ali 7f2e4dc5d8
Merge pull request #6574 from hashicorp/b-gh-6570-vault-role-validation
vault: honor new `token_period` in vault token role
2019-10-29 10:18:59 -04:00
Seth Hoenig 838c6e3329 nomad: fix vault.CreateToken log message printing wrong error
Fixes typo in word "failed".

Fixes bug where incorrect error is printed. The old code would only
ever print a nil error, instead of the validationErr which is being
created.
2019-10-28 23:05:32 -05:00
Mahmood Ali c5d8d66787 Fix admissionValidators
`admissionValidators` doesn't aggregate errors correctly, as it
aggregates errors in `errs` reference yet it always returns the nil
`err`.

Here, we avoid shadowing `err`, and move variable declarations to where
they are used.
2019-10-28 10:52:53 -04:00
Mahmood Ali abb930249a consul connect: do basic validation before mutating job
`groupConnectHook` assumes that Networks is a non-empty slice, but TG
hasn't been validated yet and validation may depend on mutation results.
As such, we do basic check here before dereferencing network slice
elements.
2019-10-28 10:49:02 -04:00
Mahmood Ali bb45a7a776 add tests for consul connect validation 2019-10-28 10:41:51 -04:00
Mahmood Ali 4c64658397 vault: Support new role field token_role
Vault 1.2.0 deprecated `period` field in favor of `token_period` in auth
role:

>  * Token store roles use new, common token fields for the values
>    that overlap with other auth backends. `period`, `explicit_max_ttl`, and
>    `bound_cidrs` will continue to work, with priority being given to the
>    `token_` prefixed versions of those parameters. They will also be returned
>    when doing a read on the role if they were used to provide values initially;
>    however, in Vault 1.4 if `period` or `explicit_max_ttl` is zero they will no
>    longer be returned. (`explicit_max_ttl` was already not returned if empty.)
https://github.com/hashicorp/vault/blob/master/CHANGELOG.md#120-july-30th-2019
2019-10-28 09:33:26 -04:00
Seth Hoenig 8b03477f46
Merge pull request #6448 from hashicorp/f-set-connect-sidecar-tags
connect: enable setting tags on consul connect sidecar service in job…
2019-10-17 15:14:09 -05:00
Seth Hoenig 039fbd3f3b connect: enable setting tags on consul connect sidecar service in jobspec (#6415) 2019-10-17 19:25:20 +00:00
Mahmood Ali 4e4a9b252c
Merge pull request #6290 from hashicorp/r-generated-code-refactor
dev: avoid codecgen code in downstream projects
2019-10-15 08:22:31 -04:00