Commit graph

3095 commits

Author SHA1 Message Date
Jasmine Dahilig 81d051d7e8 fix bug in lifecycle scheduler test mocks 2020-03-21 17:52:51 -04:00
Jasmine Dahilig b7f08c9d13 add appropriate lifecycle deadline default of 120s 2020-03-21 17:52:48 -04:00
Jasmine Dahilig 0cc9212a54 add test cases for scheduler alloc placement with lifecycle resources 2020-03-21 17:52:47 -04:00
Jasmine Dahilig 0d2988652c add lifecycle job mock 2020-03-21 17:52:47 -04:00
Jasmine Dahilig c27223207c update task hook coordinator tests 2020-03-21 17:52:46 -04:00
Mahmood Ali b880607bad update scheduler to account for hooks 2020-03-21 17:52:45 -04:00
Jasmine Dahilig 12393f90e7 add test for lifecycle coordinator 2020-03-21 17:52:42 -04:00
Jasmine Dahilig f6e58d6dad add canonicalize in the right place 2020-03-21 17:52:41 -04:00
Jasmine Dahilig 4498c8c24f add canonicalization 2020-03-21 17:52:39 -04:00
Jasmine Dahilig 67262d841b add validation tests and more validation 2020-03-21 17:52:39 -04:00
Mahmood Ali 214d128bd9 it's running now 2020-03-21 17:52:37 -04:00
Jasmine Dahilig fc13fa9739 change TaskLifecycle RunLevel to Hook and add Deadline time duration 2020-03-21 17:52:37 -04:00
Mahmood Ali 4ebeac721a update structs with lifecycle 2020-03-21 17:52:36 -04:00
Yoan Blanc 67692789b7
vendor: vault api and sdk
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-21 17:57:48 +01:00
Mahmood Ali 53e20e5cc2 Deflake TestRPC_Limits_Streaming test
The test starts enough connections to hit the limit, then closes the
connection and immediately starts one expecting the new one to succeed.

We must wait until the server side recognizes the connection
closing and free up a limits slot.  The current test attempts to achieve
that by waiting to get an error on conn.Read, however, this error is
returned from local client without waiting for server update.

As such, I change the logic so it retries on connection rejection but
force the first non-EOF failure to be a deadline error.
2020-03-20 17:21:43 -04:00
Mahmood Ali 0da7130a1a Protect against args being modified 2020-03-18 08:11:16 -04:00
Mahmood Ali 52fd31af80 server: node connections must not be forwarded
This fixes a bug where a forwarded node update request may be assumed
to be the actual direct client connection if the server just lost
leadership.

When a nomad non-leader server receives a Node.UpdateStatus request, it
forwards the RPC request to the leader, and holds on the request
Yamux connection in a cache to allow for server<->client forwarding.

When the leader handles the request, it must differentiate between a
forwarded connection vs the actual connection.  This is done in
https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L412

Now, consider if the non-leader server forwards to the connection to a
recently deposed nomad leader, which in turn forwards the RPC request to
the new leader.

Without this change, the deposed leader will mistake the forwarded
connection for the actual client connection and cache it mapped to the
client ID.  If the server attempts to connect to that client, it will
attempt to start a connection/session to the other server instead and
the call will hang forever.

This change ensures that we only add node connection mapping if the
request is not a forwarded request, regardless of circumstances.
2020-03-17 16:39:01 -04:00
Mahmood Ali 9d88f1d568 tests: deflake deploymentwatcher package
This deflake the tests in the deploymentwatcher package.  The package
uses a mock deployment watcher backend, where the watcher in a
background goroutine calls UpdateDeploymentStatus .  If the mock isn't
configured to expect the call, the background goroutine will fail.  One
UpdateDeploymentStatus call is made at the end of the background
goroutine, which may occur after the test completes, thus explaining the
flakiness.
2020-03-12 15:42:01 -04:00
Michael Schurter 2dcc85bed1 jobspec: fixup vault_grace deprecation
Followup to #7170

- Moved canonicalization of VaultGrace back into `api/` package.
- Fixed tests.
- Made docs styling consistent.
2020-03-10 14:58:49 -07:00
Michael Schurter b72b3e765c
Merge pull request #7170 from fredrikhgrelland/consul_template_upgrade
Update consul-template to v0.24.1 and remove deprecated vault grace
2020-03-10 14:15:47 -07:00
Mahmood Ali 005bd37758 tests: deflake TestServer_ReconcileMember
TestServer_ReconcileMember assumes that S3 isn't the leader:
`reconcileMembers` call would fail when attempting to remove itself!
2020-03-06 14:14:41 -05:00
Mahmood Ali 17ee94b52b fix typo 2020-03-03 16:55:54 -05:00
Mahmood Ali acbfeb5815 Simplify Bootstrap logic in tests
This change updates tests to honor `BootstrapExpect` exclusively when
forming test clusters and removes test only knobs, e.g.
`config.DevDisableBootstrap`.

Background:

Test cluster creation is fragile.  Test servers don't follow the
BootstapExpected route like production clusters.  Instead they start as
single node clusters and then get rejoin and may risk causing brain
split or other test flakiness.

The test framework expose few knobs to control those (e.g.
`config.DevDisableBootstrap` and `config.Bootstrap`) that control
whether a server should bootstrap the cluster.  These flags are
confusing and it's unclear when to use: their usage in multi-node
cluster isn't properly documented.  Furthermore, they have some bad
side-effects as they don't control Raft library: If
`config.DevDisableBootstrap` is true, the test server may not
immediately attempt to bootstrap a cluster, but after an election
timeout (~50ms), Raft may force a leadership election and win it (with
only one vote) and cause a split brain.

The knobs are also confusing as Bootstrap is an overloaded term.  In
BootstrapExpect, we refer to bootstrapping the cluster only after N
servers are connected.  But in tests and the knobs above, it refers to
whether the server is a single node cluster and shouldn't wait for any
other server.

Changes:

This commit makes two changes:

First, it relies on `BootstrapExpected` instead of `Bootstrap` and/or
`DevMode` flags.  This change is relatively trivial.

Introduce a `Bootstrapped` flag to track if the cluster is bootstrapped.
This allows us to keep `BootstrapExpected` immutable.  Previously, the
flag was a config value but it gets set to 0 after cluster bootstrap
completes.
2020-03-02 13:47:43 -05:00
Fredrik Hoem Grelland edb3bd0f3f Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170) 2020-02-23 16:24:53 +01:00
Seth Hoenig 0f99cdd0d9
Merge pull request #7192 from hashicorp/b-connect-stanza-ignore
consul/connect: in-place update sidecar service registrations on changes
2020-02-21 09:24:53 -06:00
Seth Hoenig 07b9b24ceb
nomad: note why AddressMode is not part of CSD hash
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2020-02-21 09:24:42 -06:00
Seth Hoenig 54b5173eca consul/connect: in-place update sidecar service registrations on changes
Fix a bug where consul service definitions would not be updated if changes
were made to the service in the Nomad job. Currently this only fixes the
bug for cases where the fix is a matter of updating consul agent's service
registration. There is related bug where destructive changes are required
(see #6877) which will be fixed in another PR.

The enable_tag_override configuration setting for the parent service is
applied to the sidecar service.

Fixes #6459
2020-02-19 13:07:04 -06:00
Mahmood Ali 98ad59b1de update rest of consul packages 2020-02-16 16:25:04 -06:00
Mahmood Ali f492ab6d9e implement MinQuorum 2020-02-16 16:04:59 -06:00
Mahmood Ali 3dcc65d58d Update consul autopilot dependency 2020-02-16 15:41:43 -06:00
Mahmood Ali cf53ee57cd remove unused dropButLastChannel 2020-02-13 18:56:53 -05:00
Mahmood Ali fd51982018 tests: Avoid StartAsLeader raft config flag
It's being deprecated
2020-02-13 18:56:53 -05:00
Mahmood Ali 367133a399 Use latest raft patterns 2020-02-13 18:56:52 -05:00
Seth Hoenig 543354aabe
Merge pull request #7106 from hashicorp/f-ctag-override
client: enable configuring enable_tag_override for services
2020-02-13 12:34:48 -06:00
Michael Schurter 8c332a3757
Merge pull request #7102 from hashicorp/test-limits
Fix some race conditions and flaky tests
2020-02-13 10:19:11 -08:00
Mahmood Ali bc70beeb4a
Merge pull request #7044 from hashicorp/f-use-multiplexv2
rpc: Use MultiplexV2 for connections
2020-02-13 12:07:20 -05:00
Drew Bailey 24a5d36fcf
Merge pull request #7112 from hashicorp/f-include-pro-tag
include pro tag in serveral oss.go files
2020-02-13 11:26:41 -05:00
Seth Hoenig 2829b4cd23
Merge pull request #7129 from hashicorp/b-consistent-ct-name
command: use consistent CONSUL_HTTP_TOKEN name
2020-02-12 12:27:46 -06:00
Seth Hoenig 7f33b92e0b command: use consistent CONSUL_HTTP_TOKEN name
Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.
2020-02-12 10:42:33 -06:00
Seth Hoenig ce50345b7a nomad: assert consul token is unset on job register in tests 2020-02-12 10:17:42 -06:00
Seth Hoenig 02151dee45 nomad: unset consul token on job register 2020-02-12 09:58:51 -06:00
Drew Bailey 6bd6c6638c
include pro tag in serveral oss.go files 2020-02-10 15:56:14 -05:00
Seth Hoenig 0e44094d1a client: enable configuring enable_tag_override for services
Consul provides a feature of Service Definitions where the tags
associated with a service can be modified through the Catalog API,
overriding the value(s) configured in the agent's service configuration.

To enable this feature, the flag enable_tag_override must be configured
in the service definition.

Previously, Nomad did not allow configuring this flag, and thus the default
value of false was used. Now, it is configurable.

Because Nomad itself acts as a state machine around the the service definitions
of the tasks it manages, it's worth describing what happens when this feature
is enabled and why.

Consider the basic case where there is no Nomad, and your service is provided
to consul as a boring JSON file. The ultimate source of truth for the definition
of that service is the file, and is stored in the agent. Later, Consul performs
"anti-entropy" which synchronizes the Catalog (stored only the leaders). Then
with enable_tag_override=true, the tags field is available for "external"
modification through the Catalog API (rather than directly configuring the
service definition file, or using the Agent API). The important observation
is that if the service definition ever changes (i.e. the file is changed &
config reloaded OR the Agent API is used to modify the service), those
"external" tag values are thrown away, and the new service definition is
once again the source of truth.

In the Nomad case, Nomad itself is the source of truth over the Agent in
the same way the JSON file was the source of truth in the example above.
That means any time Nomad sets a new service definition, any externally
configured tags are going to be replaced. When does this happen? Only on
major lifecycle events, for example when a task is modified because of an
updated job spec from the 'nomad job run <existing>' command. Otherwise,
Nomad's periodic re-sync's with Consul will now no longer try to restore
the externally modified tag values (as long as enable_tag_override=true).

Fixes #2057
2020-02-10 08:00:55 -06:00
Michael Schurter c5073f61a7 test: add timeout to ease debugging 2020-02-07 15:50:53 -08:00
Michael Schurter 9905dec6a3 test: workaround limits race 2020-02-07 15:50:53 -08:00
Michael Schurter 14c5ef3a8d test: fix race around reused default rpc addr
The default RPC addr was a global which is fine for normal runtime use
when it only has a single user.

However many tests modify it and cause races. Follow our convention of
returning defaults from funcs instead of using globals.
2020-02-07 15:50:53 -08:00
Mahmood Ali e106d373b2 rpc: Use MultiplexV2 for connections
MultiplexV2 is a new connection multiplex header that supports multiplex both
RPC and streaming requests over the same Yamux connection.

MultiplexV2 was added in 0.8.0 as part of
https://github.com/hashicorp/nomad/pull/3892 .  So Nomad 0.11 can expect it to
be supported.  Though, some more rigorous testing is required before merging
this.

I want to call out some implementation details:

First, the current connection pool reuses the Yamux stream for multiple RPC calls,
and doesn't close them until an error is encountered.  This commit doesn't
change it, and sets the `RpcNomad` byte only at stream creation.

Second, the StreamingRPC session gets closed by callers and cannot be reused.
Every StreamingRPC opens a new Yamux session.
2020-02-03 19:31:39 -05:00
Drew Bailey 9a65556211
add state store test to ensure PlacedCanaries is updated 2020-02-03 13:58:01 -05:00
Drew Bailey f51a3d1f37
nomad state store must be modified through raft, rm local state change 2020-02-03 13:57:34 -05:00
Drew Bailey 74779f23e6
keep placed canaries aligned with alloc status 2020-02-03 13:57:33 -05:00