open-nomad

Author	SHA1	Message	Date
Seth Hoenig	a1c4f16cf1	connect: prefix tag with nomad.; merge into envoy_stats_tags; update docs This PR expands on the work done in #12543 to - prefix the tag, so it is now "nomad.alloc_id" to be more consistent with Consul tags - merge into pre-existing envoy_stats_tags fields - update the upgrade guide docs - update changelog	2022-04-14 12:52:52 -05:00
James Rasell	a646333263	Merge branch 'main' into f-1.3-boogie-nights	2022-03-23 09:41:25 +01:00
Seth Hoenig	2631659551	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
James Rasell	7cd28a6fb6	client: refactor common service registration objects from Consul. This commit performs refactoring to pull out common service registration objects into a new `client/serviceregistration` package. This new package will form the base point for all client specific service registration functionality. The Consul specific implementation is not moved as it also includes non-service registration implementations; this reduces the blast radius of the changes as well.	2022-03-15 09:38:30 +01:00
Seth Hoenig	519429a2de	consul: probe consul namespace feature before using namespace api This PR changes Nomad's wrapper around the Consul NamespaceAPI so that it will detect if the Consul Namespaces feature is enabled before making a request to the Namespaces API. Namespaces are not enabled in Consul OSS, and require a suitable license to be used with Consul ENT. Previously Nomad would check for a 404 status code when makeing a request to the Namespaces API to "detect" if Consul OSS was being used. This does not work for Consul ENT with Namespaces disabled, which returns a 500. Now we avoid requesting the namespace API altogether if Consul is detected to be the OSS sku, or if the Namespaces feature is not licensed. Since Consul can be upgraded from OSS to ENT, or a new license applied, we cache the value for 1 minute, refreshing on demand if expired. Fixes https://github.com/hashicorp/nomad-enterprise/issues/575 Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688. This turns out not to be possible due to a chicken-egg situation between bootstrapping the agent and setting up the consul client. Also fun: the Consul fingerprinter creates its own Consul client, because there is no [currently] no way to pass the agent's client through the fingerprint factory.	2021-06-07 12:19:25 -05:00
Seth Hoenig	f17ba33f61	consul: plubming for specifying consul namespace in job/group This PR adds the common OSS changes for adding support for Consul Namespaces, which is going to be a Nomad Enterprise feature. There is no new functionality provided by this changeset and hopefully no new bugs.	2021-04-05 10:03:19 -06:00
Charlie Voiselle	0473f35003	Fixup uses of `sanity` (#10187 ) * Fixup uses of `sanity` * Remove unnecessary comments. These checks are better explained by earlier comments about the context of the test. Per @tgross, moved the tests together to better reinforce the overall shared context. * Update nomad/fsm_test.go	2021-03-16 18:05:08 -04:00
Nick Ethier	bb060bd46b	command/agent/consul: remove duplicated tests	2021-01-06 14:11:31 -05:00
Nick Ethier	4903e5b114	Consul with CNI and host_network addresses (#9095 ) * consul: advertise cni and multi host interface addresses * structs: add service/check address_mode validation * ar/groupservices: fetch networkstatus at hook runtime * ar/groupservice: nil check network status getter before calling * consul: comment network status can be nil	2020-10-15 15:32:21 -04:00
Mahmood Ali	b0cc23ae63	tests: deflake TestConsul_PeriodicSync	2020-03-30 07:06:47 -04:00
Seth Hoenig	0e44094d1a	client: enable configuring enable_tag_override for services Consul provides a feature of Service Definitions where the tags associated with a service can be modified through the Catalog API, overriding the value(s) configured in the agent's service configuration. To enable this feature, the flag enable_tag_override must be configured in the service definition. Previously, Nomad did not allow configuring this flag, and thus the default value of false was used. Now, it is configurable. Because Nomad itself acts as a state machine around the the service definitions of the tasks it manages, it's worth describing what happens when this feature is enabled and why. Consider the basic case where there is no Nomad, and your service is provided to consul as a boring JSON file. The ultimate source of truth for the definition of that service is the file, and is stored in the agent. Later, Consul performs "anti-entropy" which synchronizes the Catalog (stored only the leaders). Then with enable_tag_override=true, the tags field is available for "external" modification through the Catalog API (rather than directly configuring the service definition file, or using the Agent API). The important observation is that if the service definition ever changes (i.e. the file is changed & config reloaded OR the Agent API is used to modify the service), those "external" tag values are thrown away, and the new service definition is once again the source of truth. In the Nomad case, Nomad itself is the source of truth over the Agent in the same way the JSON file was the source of truth in the example above. That means any time Nomad sets a new service definition, any externally configured tags are going to be replaced. When does this happen? Only on major lifecycle events, for example when a task is modified because of an updated job spec from the 'nomad job run <existing>' command. Otherwise, Nomad's periodic re-sync's with Consul will now no longer try to restore the externally modified tag values (as long as enable_tag_override=true). Fixes #2057	2020-02-10 08:00:55 -06:00
Nick Ethier	5636203d4e	consul: fix var name from rebase	2020-01-27 14:00:19 -05:00
Nick Ethier	5cbb94e16e	consul: add support for canary meta	2020-01-27 09:53:30 -05:00
Nick Ethier	bd454a4c6f	client: improve group service stanza interpolation and check_re… (#6586 ) * client: improve group service stanza interpolation and check_restart support Interpolation can now be done on group service stanzas. Note that some task runtime specific information that was previously available when the service was registered poststart of a task is no longer available. The check_restart stanza for checks defined on group services will now properly restart the allocation upon check failures if configured.	2019-11-18 13:04:01 -05:00
Tim Gross	0f29dcc935	support script checks for task group services (#6197 ) In Nomad prior to Consul Connect, all Consul checks work the same except for Script checks. Because the Task being checked is running in its own container namespaces, the check is executed by Nomad in the Task's context. If the Script check passes, Nomad uses the TTL check feature of Consul to update the check status. This means in order to run a Script check, we need to know what Task to execute it in. To support Consul Connect, we need Group Services, and these need to be registered in Consul along with their checks. We could push the Service down into the Task, but this doesn't work if someone wants to associate a service with a task's ports, but do script checks in another task in the allocation. Because Nomad is handling the Script check and not Consul anyways, this moves the script check handling into the task runner so that the task runner can own the script check's configuration and lifecycle. This will allow us to pass the group service check configuration down into a task without associating the service itself with the task. When tasks are checked for script checks, we walk back through their task group to see if there are script checks associated with the task. If so, we'll spin off script check tasklets for them. The group-level service and any restart behaviors it needs are entirely encapsulated within the group service hook.	2019-09-03 15:09:04 -04:00
Evan Ercolano	fcf66918d0	Remove unused canary param from MakeTaskServiceID	2019-08-31 16:53:23 -04:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, boostrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Mahmood Ali	e07413c420	Avoid de-registering slowly restored services When a nomad client restarts/upgraded, nomad restores state from running task and starts the sync loop. If sync loop runs early, it may deregister services from Consul prematurely even when Consul has the running service as healthy. This is not ideal, as re-registering the service means potentially waiting a whole service health check interval before declaring the service healthy. We attempt to mitigate this by introducing an initialization probation period. During this time, we only deregister services and checks that were explicitly deregistered, and leave unrecognized ones alone. This serves as a grace period for restoring to complete, or for operators to restore should they recognize they restored with the wrong nomad data directory.	2019-06-14 11:15:21 -04:00
Danielle Lancashire	8112177503	consul: Include port-label in service registration It is possible to provide multiple identically named services with different port assignments in a Nomad configuration. We introduced a regression when migrating to stable service identifiers where multiple services with the same name would conflict, and the last definition would take precedence. This commit includes the port label in the stable service identifier to allow the previous behaviour where this was supported, for example providing: ```hcl service { name = "redis-cache" tags = ["global", "cache"] port = "db" check { name = "alive" type = "tcp" interval = "10s" timeout = "2s" } } service { name = "redis-cache" tags = ["global", "foo"] port = "foo" check { name = "alive" type = "tcp" port = "db" interval = "10s" timeout = "2s" } } service { name = "redis-cache" tags = ["global", "bar"] port = "bar" check { name = "alive" type = "tcp" port = "db" interval = "10s" timeout = "2s" } } ``` in a nomad task definition is now completely valid. Each service definition with the same name must still have a unique port label however.	2019-06-13 15:24:54 +02:00
Mahmood Ali	2ddc39973d	Merge pull request #5668 from hashicorp/flaky-test-20190430 fix flaky test by allowing for call invocation overhead	2019-05-13 12:33:44 -04:00
Mahmood Ali	d405fcb093	fix flaky test by allowing for call invocation overhead	2019-05-08 18:04:37 -04:00
Danielle Lancashire	0b8e85118e	consul: Use a stable identifier for services The current implementation of Service Registration uses a hash of the nomad-internal state of a service to register it with Consul, this means that any update to the service invalidates this name and we then deregister, and recreate the service in Consul. While this behaviour slightly simplifies reasoning about service registration, this becomes problematic when we add consul health checks to a service. When the service is re-registered, so are the checks, which default to failing for at least one check period. This commit migrates us to using a stable identifier based on the allocation, task, and service identifiers, and uses the difference between the remote and local state to decide when to push updates. It uses the existing hashing mechanic to decide when UpdateTask should regenerate service registrations for providing to Sync, but this should be removable as part of a future refactor. It additionally introduces the _nomad-check- prefix for check definitions, to allow for future allowing of consul features like maintenance mode.	2019-05-02 16:54:18 +02:00
Mahmood Ali	916a40bb9e	move cstructs.DeviceNetwork to drivers pkg	2019-01-08 09:11:47 -05:00
Michael Schurter	6459c19ffc	consul: fix script checks exiting after 1 run Fixes a regression caused in d335a82859ca2177bc6deda0c2c85b559daf2db3 The removal of the inner context made the remaining cancels cancel the outer context and cause script checks to exit prematurely.	2018-12-03 18:50:02 -08:00
Mahmood Ali	c7610d8c22	mark and skip failing consul failing tests	2018-11-13 10:21:40 -05:00
Michael Schurter	2b1b3d7e1e	tests: get tests building if not yet passing	2018-10-16 16:56:57 -07:00
Alex Dadgar	7946a14aa8	Fix lints	2018-10-16 16:56:56 -07:00
Alex Dadgar	7739ef51ce	agent + consul	2018-09-13 10:43:40 -07:00
Preetha Appan	ce6d4a8d7a	Fix tests and move isClient to constructor	2018-06-01 15:59:53 -05:00
Alex Dadgar	51e67daf69	Use Tags when CanaryTags isn't specified This PR fixes a bug where we weren't defaulting to `tags` when `canary_tags` was empty and adds documentation.	2018-05-23 13:07:47 -07:00
Michael Schurter	f1d13683e6	consul: remove services with/without canary tags Guard against Canary being set to false at the same time as an allocation is being stopped: this could cause RemoveTask to be called with the wrong Canary value and leaking a service. Deleting both Canary values is the safest route.	2018-05-07 14:55:01 -05:00
Michael Schurter	50e04c976e	consul: support canary tags for services Also refactor Consul ServiceClient to take a struct instead of a massive set of arguments. Meant updating a lot of code but it should be far easier to extend in the future as you will only need to update a single struct instead of every single call site. Adds an e2e test for canary tags.	2018-05-07 14:55:01 -05:00
Michael Schurter	f6a4713141	consul: make grpc checks more like http checks	2018-05-04 11:08:11 -07:00
Michael Schurter	cfcbb9fa21	consul: periodically reconcile services/checks Periodically sync services and checks from Nomad to Consul. This is mostly useful when testing with the Consul dev agent which does not persist state across restarts. However, this is a reasonable safety measure to prevent skew between Consul's state and Nomad's services+checks. Also modernized the test suite a bit.	2018-04-19 15:45:42 -07:00
Michael Schurter	0971114f0c	Replace Consul TLSSkipVerify handling Instead of checking Consul's version on startup to see if it supports TLSSkipVerify, assume that it does and only log in the job service handler if we discover Consul does not support TLSSkipVerify. The old code would break TLSSkipVerify support if Nomad started before Consul (such as on system boot) as TLSSkipVerify would default to false if Consul wasn't running. Since TLSSkipVerify has been supported since Consul 0.7.2, it's safe to relax our handling.	2018-03-14 17:43:06 -07:00
Josh Soref	1359fd2c3d	spelling: unexpected	2018-03-11 19:08:07 +00:00
Josh Soref	05305afcd9	spelling: services	2018-03-11 18:53:58 +00:00
Josh Soref	85fabc63c8	spelling: expected	2018-03-11 17:57:01 +00:00
Michael Schurter	8a0cf66822	Improve invalid port error message for services Related to #3681 If a user specifies an invalid port label when using address_mode=driver they'll get an error message about the label being an invalid number which is very confusing. I also added a bunch of testing around Service.AddressMode validation since I was concerned by the linked issue that there were cases I was missing. Unfortunately when address_mode=driver is used there's only so much validation that can be done as structs/structs.go validation never peeks into the driver config which would be needed to verify the port labels/map.	2018-01-18 15:35:24 -08:00
Michael Schurter	447dc5bbd3	Fix test	2018-01-18 15:35:24 -08:00
Michael Schurter	583e17fad5	Always advertise driver IP when in driver mode Fixes #3681 When in drive address mode Nomad should always advertise the driver's IP in Consul even when no network exists. This matches the 0.6 behavior. When in host address mode Nomad advertises the alloc's network's IP if one exists. Otherwise it lets Consul determine the IP. I also added some much needed logging around Docker's network discovery.	2018-01-18 15:35:24 -08:00
Michael Schurter	714eb0b266	Services should not require a port Fixes #3673	2017-12-19 15:50:23 -08:00
Michael Schurter	b71edf846f	Hash fields used in task service IDs Fixes #3620 Previously we concatenated tags into task service IDs. This could break deregistration of tag names that contained double //s like some Fabio tags. This change breaks service ID backward compatibility so on upgrade all users services and checks will be removed and re-added with new IDs. This change has the side effect of including all service fields in the ID's hash, so we no longer have to track PortLabel and AddressMode changes independently.	2017-12-08 12:03:43 -08:00
Michael Schurter	91282315d1	Prevent using port 0 with address_mode=driver	2017-12-08 12:03:43 -08:00
Michael Schurter	4b20441eef	Validate port label for host address mode Also skip getting an address for script checks which don't use them. Fixed a weird invalid reserved port in a TaskRunner test helper as well as a problem with our mock Alloc/Job. Hopefully the latter doesn't cause other tests to fail, but we were referencing an invalid PortLabel and just not catching it before.	2017-12-08 12:03:43 -08:00
Michael Schurter	4ae115dc59	Allow custom ports for services and checks Fixes #3380 Adds address_mode to checks (but no auto) and allows services and checks to set literal port numbers when using address_mode=driver. This allows SDNs, overlays, etc to advertise internal and host addresses as well as do checks against either.	2017-12-08 12:03:00 -08:00
Michael Schurter	0f2a3dcec9	Test check watch updates	2017-09-14 16:48:39 -07:00
Michael Schurter	73fb71ca10	RestartDelay isn't needed as checks are re-added on restarts @dadgar made the excellent observation in #3105 that TaskRunner removes and re-registers checks on restarts. This means checkWatcher doesn't need to do any internal restart tracking. Individual checks can just remove themselves and be re-added when the task restarts.	2017-09-14 16:48:39 -07:00
Michael Schurter	72e5c0c0aa	Fix whitespace	2017-09-14 16:47:41 -07:00
Michael Schurter	ade29ecbed	Improve check watcher logging and add tests Also expose a mock Consul Agent to allow testing ServiceClient and checkWatcher from TaskRunner without actually talking to a real Consul.	2017-09-14 16:47:41 -07:00

1 2

67 commits