Commit Graph

1472 Commits

Author SHA1 Message Date
Jasmine Dahilig 73a64e4397 change jobspec lifecycle stanza to use sidecar attribute instead of
block_until status
2020-03-21 17:52:57 -04:00
Jasmine Dahilig 1485b342e2 remove deadline code for now 2020-03-21 17:52:56 -04:00
Jasmine Dahilig b7f08c9d13 add appropriate lifecycle deadline default of 120s 2020-03-21 17:52:48 -04:00
Mahmood Ali b880607bad update scheduler to account for hooks 2020-03-21 17:52:45 -04:00
Jasmine Dahilig f6e58d6dad add canonicalize in the right place 2020-03-21 17:52:41 -04:00
Jasmine Dahilig 4498c8c24f add canonicalization 2020-03-21 17:52:39 -04:00
Jasmine Dahilig 67262d841b add validation tests and more validation 2020-03-21 17:52:39 -04:00
Mahmood Ali 214d128bd9 it's running now 2020-03-21 17:52:37 -04:00
Jasmine Dahilig fc13fa9739 change TaskLifecycle RunLevel to Hook and add Deadline time duration 2020-03-21 17:52:37 -04:00
Mahmood Ali 4ebeac721a update structs with lifecycle 2020-03-21 17:52:36 -04:00
Michael Schurter 2dcc85bed1 jobspec: fixup vault_grace deprecation
Followup to #7170

- Moved canonicalization of VaultGrace back into `api/` package.
- Fixed tests.
- Made docs styling consistent.
2020-03-10 14:58:49 -07:00
Fredrik Hoem Grelland edb3bd0f3f Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170) 2020-02-23 16:24:53 +01:00
Seth Hoenig 0f99cdd0d9
Merge pull request #7192 from hashicorp/b-connect-stanza-ignore
consul/connect: in-place update sidecar service registrations on changes
2020-02-21 09:24:53 -06:00
Seth Hoenig 07b9b24ceb
nomad: note why AddressMode is not part of CSD hash
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2020-02-21 09:24:42 -06:00
Seth Hoenig 54b5173eca consul/connect: in-place update sidecar service registrations on changes
Fix a bug where consul service definitions would not be updated if changes
were made to the service in the Nomad job. Currently this only fixes the
bug for cases where the fix is a matter of updating consul agent's service
registration. There is related bug where destructive changes are required
(see #6877) which will be fixed in another PR.

The enable_tag_override configuration setting for the parent service is
applied to the sidecar service.

Fixes #6459
2020-02-19 13:07:04 -06:00
Mahmood Ali f492ab6d9e implement MinQuorum 2020-02-16 16:04:59 -06:00
Seth Hoenig 543354aabe
Merge pull request #7106 from hashicorp/f-ctag-override
client: enable configuring enable_tag_override for services
2020-02-13 12:34:48 -06:00
Seth Hoenig 7f33b92e0b command: use consistent CONSUL_HTTP_TOKEN name
Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.
2020-02-12 10:42:33 -06:00
Seth Hoenig 0e44094d1a client: enable configuring enable_tag_override for services
Consul provides a feature of Service Definitions where the tags
associated with a service can be modified through the Catalog API,
overriding the value(s) configured in the agent's service configuration.

To enable this feature, the flag enable_tag_override must be configured
in the service definition.

Previously, Nomad did not allow configuring this flag, and thus the default
value of false was used. Now, it is configurable.

Because Nomad itself acts as a state machine around the the service definitions
of the tasks it manages, it's worth describing what happens when this feature
is enabled and why.

Consider the basic case where there is no Nomad, and your service is provided
to consul as a boring JSON file. The ultimate source of truth for the definition
of that service is the file, and is stored in the agent. Later, Consul performs
"anti-entropy" which synchronizes the Catalog (stored only the leaders). Then
with enable_tag_override=true, the tags field is available for "external"
modification through the Catalog API (rather than directly configuring the
service definition file, or using the Agent API). The important observation
is that if the service definition ever changes (i.e. the file is changed &
config reloaded OR the Agent API is used to modify the service), those
"external" tag values are thrown away, and the new service definition is
once again the source of truth.

In the Nomad case, Nomad itself is the source of truth over the Agent in
the same way the JSON file was the source of truth in the example above.
That means any time Nomad sets a new service definition, any externally
configured tags are going to be replaced. When does this happen? Only on
major lifecycle events, for example when a task is modified because of an
updated job spec from the 'nomad job run <existing>' command. Otherwise,
Nomad's periodic re-sync's with Consul will now no longer try to restore
the externally modified tag values (as long as enable_tag_override=true).

Fixes #2057
2020-02-10 08:00:55 -06:00
Seth Hoenig 587a5d4a8d nomad: make TaskGroup.UsesConnect helper a public helper 2020-01-31 19:05:11 -06:00
Seth Hoenig 78a7d1e426 comments: cleanup some leftover debug comments and such 2020-01-31 19:04:35 -06:00
Seth Hoenig 9df33f622f nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Seth Hoenig 93cf770edb client: enable nomad client to request and set SI tokens for tasks
When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request from Consul (through Nomad Server)
Service Identity tokens specific to those tasks.
2020-01-31 19:03:38 -06:00
Seth Hoenig 2b66ce93bb nomad: ensure a unique ClusterID exists when leader (gh-6702)
Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.

The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.
2020-01-31 19:03:26 -06:00
Seth Hoenig f030a22c7c command, docs: create and document consul token configuration for connect acls (gh-6716)
This change provides an initial pass at setting up the configuration necessary to
enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul
Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert`
commands (similar to Vault tokens).

These values are not actually used yet in this changeset.
2020-01-31 19:02:53 -06:00
Michael Schurter c82b14b0c4 core: add limits to unauthorized connections
Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:

 * `{https,rpc}_handshake_timeout`
 * `{http,rpc}_max_conns_per_client`

The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to first byte being read so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.

The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.

All limits are configurable and may be disabled by setting them to `0`.

This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
2020-01-30 10:38:25 -08:00
Mahmood Ali 9611324654
Merge pull request #6922 from hashicorp/b-alloc-canoncalize
Handle Upgrades and Alloc.TaskResources modification
2020-01-28 15:12:41 -05:00
Mahmood Ali 90cae566e5
Merge pull request #6935 from hashicorp/b-default-preemption-flag
scheduler: allow configuring default preemption for system scheduler
2020-01-28 15:11:06 -05:00
Mahmood Ali af17b4afc7 Support customizing full scheduler config 2020-01-28 14:51:42 -05:00
Nick Ethier 5cbb94e16e consul: add support for canary meta 2020-01-27 09:53:30 -05:00
Mahmood Ali f36cc54efd actually always canonicalize alloc.Job
alloc.Job may be stale as well and need to migrate it.  It does cost
extra cycles but should be negligible.
2020-01-15 09:02:48 -05:00
Mahmood Ali b1b714691c address review comments 2020-01-15 08:57:05 -05:00
Drew Bailey b702dede49
adds qc param, address pr feedback 2020-01-09 15:15:11 -05:00
Drew Bailey 45210ed901
Rename profile package to pprof
Address pr feedback, rename profile package to pprof to more accurately
describe its purpose. Adds gc param for heap lookup profiles.
2020-01-09 15:15:10 -05:00
Drew Bailey 4ced73875b
leave acl checking to rpc endpoints
fix test expectation

test wrapNonJSON
2020-01-09 15:15:08 -05:00
Drew Bailey 7bbba613a5
prevent doubly wrapping with rpc error 2020-01-09 15:15:07 -05:00
Drew Bailey 46121fe3fd
move shared structs out of client and into nomad 2020-01-09 15:15:05 -05:00
Mahmood Ali d740d347ce Migrate old alloc structs on read
This commit ensures that Alloc.AllocatedResources is properly populated
when read from persistence stores (namely Raft and client state store).
The alloc struct may have been written previously by an arbitrary old
version that may only populate Alloc.TaskResources.
2020-01-09 08:46:50 -05:00
James Rasell df2dc48790 Fix error parsing config when setting consul.timeout. (#6907)
When parsing a config file which had the consul.timeout param set,
Nomad was reporting an error causing startup to fail. This seems
to be caused by the HCL decoder interpreting the timeout type as
an int rather than a string. This is caused by the struct
TimeoutHCL param having a hcl key of timeout alongside a Timeout
struct param of type time.Duration (int). Ensuring the decoder
ignores the Timeout struct param ensure the decoder runs
correctly.
2020-01-07 13:40:55 -05:00
Drew Bailey d9e41d2880
docs for shutdown delay
update docs, address pr comments

ensure pointer is not nil

use pointer for diff tests, set vs unset
2019-12-16 11:38:35 -05:00
Drew Bailey ae145c9a37
allow only positive shutdown delay
more explicit test case, remove select statement
2019-12-16 11:38:30 -05:00
Drew Bailey 24929776a2
shutdown delay for task groups
copy struct values

ensure groupserviceHook implements RunnerPreKillhook

run deregister first

test that shutdown times are delayed

move magic number into variable
2019-12-16 11:38:16 -05:00
Michael Schurter 95fd2643d7 connect: canonicalize before adding sidecar
Fixes #6853

Canonicalize jobs first before adding any sidecars. This fixes a bug
where sidecar tasks were added without interpolated names and broke
validation. Sidecar tasks must be canonicalized independently.

Also adds a group network to the mock connect job because it wasn't a
valid connect job before!
2019-12-12 20:55:56 -08:00
Buck Doyle 5fcc00d0f9 Add gofmt changes 2019-11-20 12:47:01 -06:00
Buck Doyle dc9c0d5ead
Add explanatory comment 2019-11-20 11:45:44 -06:00
Buck Doyle db77a24ed3
Merge branch 'master' into f-policy-json 2019-11-20 11:20:07 -06:00
Mahmood Ali 7fb4c35831 comments and casing 2019-11-19 16:03:55 -05:00
Mahmood Ali 6f8bb5e90b api: acl bootstrap errors aren't 500
Noticed that ACL endpoints return 500 status code for user errors.  This
is confusing and can lead to false monitoring alerts.

Here, I introduce a concept of RPCCoded errors to be returned by RPC
that signal a code in addition to error message.  Codes for now match
HTTP codes to ease reasoning.

```
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 500 (ACL bootstrap already done (reset index: 9))

$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 400 (ACL bootstrap already done (reset index: 9))
```
2019-11-19 15:51:57 -05:00
Michael Schurter 796758b8a5 core: add semver constraint
The existing version constraint uses logic optimized for package
managers, not schedulers, when checking prereleases:

- 1.3.0-beta1 will *not* satisfy ">= 0.6.1"
- 1.7.0-rc1 will *not* satisfy ">= 1.6.0-beta1"

This is due to package managers wishing to favor final releases over
prereleases.

In a scheduler versions more often represent the earliest release all
required features/APIs are available in a system. Whether the constraint
or the version being evaluated are prereleases has no impact on
ordering.

This commit adds a new constraint - `semver` - which will use Semver
v2.0 ordering when evaluating constraints. Given the above examples:

- 1.3.0-beta1 satisfies ">= 0.6.1" using `semver`
- 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver`

Since existing jobspecs may rely on the old behavior, a new constraint
was added and the implicit Consul Connect and Vault constraints were
updated to use it.
2019-11-19 08:40:19 -08:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00