Commit graph

265 commits

Author SHA1 Message Date
Seth Hoenig 78a7d1e426 comments: cleanup some leftover debug comments and such 2020-01-31 19:04:35 -06:00
Seth Hoenig 8219c78667 nomad: handle SI token revocations concurrently
Be able to revoke SI token accessors concurrently, and also
ratelimit the requests being made to Consul for the various
ACL API uses.
2020-01-31 19:04:14 -06:00
Seth Hoenig 2c7ac9a80d nomad: fixup token policy validation 2020-01-31 19:04:08 -06:00
Seth Hoenig 9df33f622f nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Nick Ethier 5636203d4e consul: fix var name from rebase 2020-01-27 14:00:19 -05:00
Nick Ethier 0ae99b3c9c consul: fix var name from rebase 2020-01-27 12:55:52 -05:00
Nick Ethier 5cbb94e16e consul: add support for canary meta 2020-01-27 09:53:30 -05:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Michael Schurter 9fed8d1bed client: fix panic from 0.8 -> 0.10 upgrade
makeAllocTaskServices did not do a nil check on AllocatedResources
which causes a panic when upgrading directly from 0.8 to 0.10. While
skipping 0.9 is not supported we intend to fix serious crashers caused
by such upgrades to prevent cluster outages.

I did a quick audit of the client package and everywhere else that
accesses AllocatedResources appears to be properly guarded by a nil
check.
2019-11-01 07:47:03 -07:00
Seth Hoenig 039fbd3f3b connect: enable setting tags on consul connect sidecar service in jobspec (#6415) 2019-10-17 19:25:20 +00:00
Tim Gross cd9c23617f
client/connect: ConsulProxy LocalServicePort/Address (#6358)
Without a `LocalServicePort`, Connect services will try to use the
mapped port even when delivering traffic locally. A user can override
this behavior by pinning the port value in the `service` stanza but
this prevents us from using the Consul service name to reach the
service.

This commits configures the Consul proxy with its `LocalServicePort`
and `LocalServiceAddress` fields.
2019-09-23 14:30:48 -04:00
Tim Gross e3e30c15a9
remove resolved TODO from UpdateTTL docstring (#6336) 2019-09-16 16:26:06 -04:00
Tim Gross 0f29dcc935
support script checks for task group services (#6197)
In Nomad prior to Consul Connect, all Consul checks work the same
except for Script checks. Because the Task being checked is running in
its own container namespaces, the check is executed by Nomad in the
Task's context. If the Script check passes, Nomad uses the TTL check
feature of Consul to update the check status. This means in order to
run a Script check, we need to know what Task to execute it in.

To support Consul Connect, we need Group Services, and these need to
be registered in Consul along with their checks. We could push the
Service down into the Task, but this doesn't work if someone wants to
associate a service with a task's ports, but do script checks in
another task in the allocation.

Because Nomad is handling the Script check and not Consul anyways,
this moves the script check handling into the task runner so that the
task runner can own the script check's configuration and
lifecycle. This will allow us to pass the group service check
configuration down into a task without associating the service itself
with the task.

When tasks are checked for script checks, we walk back through their
task group to see if there are script checks associated with the
task. If so, we'll spin off script check tasklets for them. The
group-level service and any restart behaviors it needs are entirely
encapsulated within the group service hook.
2019-09-03 15:09:04 -04:00
Evan Ercolano fcf66918d0 Remove unused canary param from MakeTaskServiceID 2019-08-31 16:53:23 -04:00
Michael Schurter 67b7bc1e90 consul: ignore connect services when syncing
Consul registers Connect services automatically, however Nomad thinks it
owns them due to the _nomad prefix. Since the services are managed by
Consul, Nomad needs to explicitly ignore them or otherwies they will be
removed.
2019-08-30 11:53:41 -07:00
Jerome Gravel-Niquet cbdc1978bf Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Michael Schurter 59e0b67c7f connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Michael Schurter b008fd1724 connect: register group services with Consul
Fixes #6042

Add new task group service hook for registering group services like
Connect-enabled services.

Does not yet support checks.
2019-08-20 12:25:10 -07:00
Michael Schurter db4de5fae9
Merge pull request #5975 from hashicorp/b-check-watcher-deadlock
consul: fix deadlock in check-based restarts
2019-07-18 13:13:40 -07:00
Michael Schurter 6d095b3b36 consul: add test for check watcher deadlock 2019-07-18 08:24:09 -07:00
Michael Schurter 826d2503e6
Update command/agent/consul/check_watcher.go
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-07-18 07:08:27 -07:00
Michael Schurter 5407584bc3 consul: fix deadlock in check-based restarts
Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts.
This matches the pre-0.9 behavior where TaskRunner.Restart was an
asynchronous signal. The check-based restarting code was not designed
to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and
could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and
reproducer!

I created this alternative as changing the functionality of
TaskRunner.Restart has a much larger impact. This approach reverts to
old known-good behavior and minimizes the number of places changes are
made.
2019-07-17 15:22:21 -07:00
Mahmood Ali ec7e258d71 address review feedback 2019-07-17 10:43:13 +07:00
Mahmood Ali e07413c420 Avoid de-registering slowly restored services
When a nomad client restarts/upgraded, nomad restores state from running
task and starts the sync loop.  If sync loop runs early, it may
deregister services from Consul prematurely even when Consul has the
running service as healthy.

This is not ideal, as re-registering the service means potentially
waiting a whole service health check interval before declaring the
service healthy.

We attempt to mitigate this by introducing an initialization probation
period.  During this time, we only deregister services and checks that
were explicitly deregistered, and leave unrecognized ones alone.  This
serves as a grace period for restoring to complete, or for operators to
restore should they recognize they restored with the wrong nomad data
directory.
2019-06-14 11:15:21 -04:00
Danielle Lancashire 8112177503
consul: Include port-label in service registration
It is possible to provide multiple identically named services with
different port assignments in a Nomad configuration.

We introduced a regression when migrating to stable service identifiers where
multiple services with the same name would conflict, and the last definition
would take precedence.

This commit includes the port label in the stable service identifier to
allow the previous behaviour where this was supported, for example
providing:

```hcl
service {
  name = "redis-cache"
  tags = ["global", "cache"]
  port = "db"
  check {
    name     = "alive"
    type     = "tcp"
    interval = "10s"
    timeout  = "2s"
  }
}

service {
  name = "redis-cache"
  tags = ["global", "foo"]
  port = "foo"

  check {
    name     = "alive"
    type     = "tcp"
    port     = "db"
    interval = "10s"
    timeout  = "2s"
  }
}

service {
  name = "redis-cache"
  tags = ["global", "bar"]
  port = "bar"

  check {
    name     = "alive"
    type     = "tcp"
    port     = "db"
    interval = "10s"
    timeout  = "2s"
  }
}
```

in a nomad task definition is now completely valid. Each service
definition with the same name must still have a unique port label however.
2019-06-13 15:24:54 +02:00
Mahmood Ali 2ddc39973d
Merge pull request #5668 from hashicorp/flaky-test-20190430
fix flaky test by allowing for call invocation overhead
2019-05-13 12:33:44 -04:00
Danielle Lancashire 0da2924b2a consul: Document example check id 2019-05-09 13:22:22 +02:00
Mahmood Ali d405fcb093 fix flaky test by allowing for call invocation overhead 2019-05-08 18:04:37 -04:00
Danielle Lancashire d824e00d1a consul: Do not deregister external checks
This commit causes sync to skip deregistering checks that are not
managed by nomad, such as service maintenance mode checks.  This is
handled in the same way as service registrations - by doing a Nomad
specific prefix match.
2019-05-02 16:54:18 +02:00
Danielle Lancashire 0b8e85118e consul: Use a stable identifier for services
The current implementation of Service Registration uses a hash of the
nomad-internal state of a service to register it with Consul, this means that
any update to the service invalidates this name and we then deregister, and
recreate the service in Consul.

While this behaviour slightly simplifies reasoning about service registration,
this becomes problematic when we add consul health checks to a service. When
the service is re-registered, so are the checks, which default to failing for
at least one check period.

This commit migrates us to using a stable identifier based on the
allocation, task, and service identifiers, and uses the difference
between the remote and local state to decide when to push updates.

It uses the existing hashing mechanic to decide when UpdateTask should
regenerate service registrations for providing to Sync, but this should
be removable as part of a future refactor.

It additionally introduces the _nomad-check- prefix for check
definitions, to allow for future allowing of consul features like
maintenance mode.
2019-05-02 16:54:18 +02:00
Michael Schurter e3e1797850 consul: squelch noisy useless logs
Only log when syncing actually did something.
2019-02-04 11:07:57 -08:00
Michael Schurter fc1bb95ef8 Remove old comment; it's been fixed! 2019-01-14 09:56:53 -08:00
Mahmood Ali 916a40bb9e move cstructs.DeviceNetwork to drivers pkg 2019-01-08 09:11:47 -05:00
Nick Ethier 82175d1328
client/drivermananger: add driver manager
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.

Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
2018-12-18 22:55:18 -05:00
Michael Schurter 8fa5e90095 consul: add ScriptExecutor context wrapper
Since d335a82859ca2177bc6deda0c2c85b559daf2db3 ScriptExecutors now take
a timeout duration instead of a context. This broke the script check
removal code which used context cancelation propagation to remove
script checks while they were executing.

This commit adds a wrapper around ScriptExecutors that obeys context
cancelation again. The only downside is that it leaks a goroutine until
the underlying Exec call completes or timeouts.

Since check removal is relatively rare, check timeouts usually low, and
scripts usually fast, the risk of leaking a goroutine seems very small.
2018-12-03 20:26:31 -08:00
Michael Schurter 6459c19ffc consul: fix script checks exiting after 1 run
Fixes a regression caused in d335a82859ca2177bc6deda0c2c85b559daf2db3

The removal of the inner context made the remaining cancels cancel the
outer context and cause script checks to exit prematurely.
2018-12-03 18:50:02 -08:00
Alex Dadgar 4ee603c382 Device hook and devices affect computed node class
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.

TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
2018-11-27 17:25:33 -08:00
Preetha Appan 18708d3f0b
Pass service metadata "external-source" for consul UI integration 2018-11-16 11:28:56 -06:00
Mahmood Ali c7610d8c22 mark and skip failing consul failing tests 2018-11-13 10:21:40 -05:00
Michael Schurter 6bdbfb8129 tests: get consul integration tests building 2018-11-05 12:32:05 -08:00
Michael Schurter d71a1b4547 tests: more fixes due to api changes 2018-10-29 15:25:22 -07:00
Michael Schurter 2b1b3d7e1e tests: get tests building if not yet passing 2018-10-16 16:56:57 -07:00
Nick Ethier f192c3752a client: refactor post allocrunnerv2 finalization 2018-10-16 16:56:56 -07:00
Nick Ethier 4a4c7dbbfc client: begin driver plugin integration
client: fingerprint driver plugins
2018-10-16 16:56:56 -07:00
Alex Dadgar 7946a14aa8 Fix lints 2018-10-16 16:56:56 -07:00
Michael Schurter a4b4d7b266 consul service hook
Deregistration works but difficult to test due to terminal updates not
being fully implemented in the new client/ar/tr.
2018-10-16 16:53:29 -07:00
Alex Dadgar 52f9cd7637 fixing tests 2018-10-04 14:26:19 -07:00
Alex Dadgar ca28afa3b2 small fixes 2018-09-15 16:42:38 -07:00
Alex Dadgar 7739ef51ce agent + consul 2018-09-13 10:43:40 -07:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Alex Dadgar 90c2108bfb Fix gc tests + parallel destroy + small test fixes 2018-06-12 10:23:45 -07:00
Preetha Appan ce6d4a8d7a
Fix tests and move isClient to constructor 2018-06-01 15:59:53 -05:00
Preetha Appan a5bfaa098c
Fix unnecessary deregistration in consul sync
This commit fixes an issue where if a nomad client and server shared the same consul instance, the server would deregister any services and checks registered by clients for running tasks.
2018-06-01 14:48:25 -05:00
Alex Dadgar 51e67daf69 Use Tags when CanaryTags isn't specified
This PR fixes a bug where we weren't defaulting to `tags` when
`canary_tags` was empty and adds documentation.
2018-05-23 13:07:47 -07:00
Michael Schurter f1d13683e6
consul: remove services with/without canary tags
Guard against Canary being set to false at the same time as an
allocation is being stopped: this could cause RemoveTask to be called
with the wrong Canary value and leaking a service.

Deleting both Canary values is the safest route.
2018-05-07 14:55:01 -05:00
Michael Schurter 50e04c976e
consul: support canary tags for services
Also refactor Consul ServiceClient to take a struct instead of a massive
set of arguments. Meant updating a lot of code but it should be far
easier to extend in the future as you will only need to update a single
struct instead of every single call site.

Adds an e2e test for canary tags.
2018-05-07 14:55:01 -05:00
Michael Schurter f6a4713141 consul: make grpc checks more like http checks 2018-05-04 11:08:11 -07:00
Michael Schurter 382caec1e1 consul: initial grpc implementation
Needs to be more like http.
2018-05-04 11:08:11 -07:00
Michael Schurter cfcbb9fa21 consul: periodically reconcile services/checks
Periodically sync services and checks from Nomad to Consul. This is
mostly useful when testing with the Consul dev agent which does not
persist state across restarts. However, this is a reasonable safety
measure to prevent skew between Consul's state and Nomad's
services+checks.

Also modernized the test suite a bit.
2018-04-19 15:45:42 -07:00
Michael Schurter d3650fb2cd test: build with mock_driver by default
`make release` and `make prerelease` set a `release` tag to disable
enabling the `mock_driver`
2018-04-18 14:45:33 -07:00
Michael Schurter 86f562be3a Remove unnecessary conversions 2018-03-16 16:32:59 -07:00
Michael Schurter c3e8f6319c gofmt -s (simplify) files 2018-03-16 16:31:16 -07:00
Michael Schurter 0971114f0c Replace Consul TLSSkipVerify handling
Instead of checking Consul's version on startup to see if it supports
TLSSkipVerify, assume that it does and only log in the job service
handler if we discover Consul does not support TLSSkipVerify.

The old code would break TLSSkipVerify support if Nomad started before
Consul (such as on system boot) as TLSSkipVerify would default to false
if Consul wasn't running. Since TLSSkipVerify has been supported since
Consul 0.7.2, it's safe to relax our handling.
2018-03-14 17:43:06 -07:00
Josh Soref 1359fd2c3d spelling: unexpected 2018-03-11 19:08:07 +00:00
Josh Soref 42d7f19861 spelling: supports 2018-03-11 19:00:11 +00:00
Josh Soref 05305afcd9 spelling: services 2018-03-11 18:53:58 +00:00
Josh Soref ad55e85e73 spelling: registrations 2018-03-11 18:40:53 +00:00
Josh Soref 3c1ce6d16d spelling: otherwise 2018-03-11 18:34:27 +00:00
Josh Soref 85fabc63c8 spelling: expected 2018-03-11 17:57:01 +00:00
Josh Soref 7a6dfa4b1a spelling: current 2018-03-11 17:52:32 +00:00
Josh Soref 5f87691df1 spelling: asynchronously 2018-03-11 17:41:50 +00:00
Michael Schurter 8a0cf66822 Improve invalid port error message for services
Related to #3681

If a user specifies an invalid port *label* when using
address_mode=driver they'll get an error message about the label being
an invalid number which is very confusing.

I also added a bunch of testing around Service.AddressMode validation
since I was concerned by the linked issue that there were cases I was
missing. Unfortunately when address_mode=driver is used there's only so
much validation that can be done as structs/structs.go validation never
peeks into the driver config which would be needed to verify the port
labels/map.
2018-01-18 15:35:24 -08:00
Michael Schurter 447dc5bbd3 Fix test 2018-01-18 15:35:24 -08:00
Michael Schurter 583e17fad5 Always advertise driver IP when in driver mode
Fixes #3681

When in drive address mode Nomad should always advertise the driver's IP
in Consul even when no network exists. This matches the 0.6 behavior.

When in host address mode Nomad advertises the alloc's network's IP if
one exists. Otherwise it lets Consul determine the IP.

I also added some much needed logging around Docker's network discovery.
2018-01-18 15:35:24 -08:00
Michael Schurter 714eb0b266 Services should not require a port
Fixes #3673
2017-12-19 15:50:23 -08:00
Michael Schurter cdcefd0908 Use the Service.Hash() method in agent service ids
The allocID and taskName parameters are useless for agents, but it's
still nice to reuse the same hash method for agent and task services.
This brings in the lowercase mode for the agent hash as well.
2017-12-11 16:50:15 -08:00
Michael Schurter 4f1002c1a8 Be more defensive in port checks 2017-12-08 12:27:57 -08:00
Michael Schurter d613e0aaf5 Move service hash logic to Service.Hash method 2017-12-08 12:03:43 -08:00
Michael Schurter b71edf846f Hash fields used in task service IDs
Fixes #3620

Previously we concatenated tags into task service IDs. This could break
deregistration of tag names that contained double //s like some Fabio
tags.

This change breaks service ID backward compatibility so on upgrade all
users services and checks will be removed and re-added with new IDs.

This change has the side effect of including all service fields in the
ID's hash, so we no longer have to track PortLabel and AddressMode
changes independently.
2017-12-08 12:03:43 -08:00
Michael Schurter 91282315d1 Prevent using port 0 with address_mode=driver 2017-12-08 12:03:43 -08:00
Michael Schurter 4b20441eef Validate port label for host address mode
Also skip getting an address for script checks which don't use them.

Fixed a weird invalid reserved port in a TaskRunner test helper as well
as a problem with our mock Alloc/Job. Hopefully the latter doesn't cause
other tests to fail, but we were referencing an invalid PortLabel and
just not catching it before.
2017-12-08 12:03:43 -08:00
Michael Schurter 4347026f83 Test Consul from TaskRunner thoroughly
Rely less on the mockConsulServiceClient because the real
consul.ServiceClient needs all the testing it can get!
2017-12-08 12:03:00 -08:00
Michael Schurter 4ae115dc59 Allow custom ports for services and checks
Fixes #3380

Adds address_mode to checks (but no auto) and allows services and checks
to set literal port numbers when using address_mode=driver.

This allows SDNs, overlays, etc to advertise internal and host addresses
as well as do checks against either.
2017-12-08 12:03:00 -08:00
Jens Herrmann 5680fcccc2 Fix typos in metric names. #3610 2017-12-01 15:24:14 +01:00
Michael Schurter 0aace3d749 Don't set Interval on TTL health checks 2017-10-16 17:35:47 -07:00
Alex Dadgar 4173834231 Enable more linters 2017-09-26 15:26:33 -07:00
Michael Schurter a844fba8d2 Fix comments: task -> check 2017-09-15 15:19:53 -07:00
Michael Schurter 0f2a3dcec9 Test check watch updates 2017-09-14 16:48:39 -07:00
Michael Schurter 847fe080f6 Rename unhealthy var and fix test indeterminism 2017-09-14 16:48:39 -07:00
Michael Schurter 573a0df03d Watched -> TriggersRestart
Watched was a silly name
2017-09-14 16:48:39 -07:00
Michael Schurter 4ea19baa52 Handle multiple failing checks on a single task
Before this commit if a task had 2 checks cause restarts at the same
time, both would trigger restarts of the task! This change removes all
checks for a task whenever one of them is restarted.
2017-09-14 16:48:39 -07:00
Michael Schurter 73fb71ca10 RestartDelay isn't needed as checks are re-added on restarts
@dadgar made the excellent observation in #3105 that TaskRunner removes
and re-registers checks on restarts. This means checkWatcher doesn't
need to do *any* internal restart tracking. Individual checks can just
remove themselves and be re-added when the task restarts.
2017-09-14 16:48:39 -07:00
Michael Schurter 448ad3945f Simplify from 2 select loops to one 2017-09-14 16:48:39 -07:00
Michael Schurter 550e631eea Wrap check watch updates in a struct
Reusing checkRestart for both adds/removes and the main check restarting
logic was confusing.
2017-09-14 16:48:39 -07:00
Michael Schurter 72e5c0c0aa Fix whitespace 2017-09-14 16:47:41 -07:00
Michael Schurter ade29ecbed Improve check watcher logging and add tests
Also expose a mock Consul Agent to allow testing ServiceClient and
checkWatcher from TaskRunner without actually talking to a real Consul.
2017-09-14 16:47:41 -07:00
Michael Schurter a137676358 Add comments and move delay calc to TaskRunner 2017-09-14 16:46:54 -07:00
Michael Schurter a180c00fc3 on_warning=false -> ignore_warnings=false
Treat warnings as unhealthy by default
2017-09-14 16:46:54 -07:00
Michael Schurter 8a87475498 Use existing restart policy infrastructure 2017-09-14 16:46:54 -07:00
Michael Schurter 22690c5f4c Add check watcher for restarting unhealthy tasks 2017-09-14 16:46:54 -07:00