Commit Graph

2394 Commits

Author SHA1 Message Date
Michael Schurter c1b8bef813 Use broadcast send retry logic everywhere 2017-07-18 14:36:32 -07:00
Alex Dadgar d2381c9263 Merge pull request #2853 from hashicorp/b-watcher
Improve alloc health watcher
2017-07-18 14:12:28 -07:00
Alex Dadgar bd43bd509c Save deployment status 2017-07-18 12:37:52 -07:00
Alex Dadgar 41f67e3535 Small fixes 2017-07-18 12:19:57 -07:00
Michael Schurter c24e73ede7 Fix deadlock caused by syncing during destroy
When replacing an alloc the new alloc is blocked until the old alloc is
destroyed. This could cause a deadlock:

1. Destroying the old alloc includes a final sync of its status
2. Syncing status causes a GC
3. A GC looks for terminal allocs to cleanup
4. The GC waits for an alloc to stop completely before GC'ing

If the GC chooses the currently-being-destroyed-alloc to GC, the GC
deadlocks. If `client.max_parallel` deadlocks happen the GC is wedged
until the Nomad process is restarted.

Performing the final sync asynchronously is an ugly hack but prevents
the deadlock by allowing the final sync to occur after the alloc runner
has shutdown and been destroyed.
2017-07-18 11:12:56 -07:00
Michael Schurter 420be86e39 Test AllocDir.Copy 2017-07-17 15:46:54 -07:00
Michael Schurter cdb2e96d99 Add AllocRunner.allocID for ease-of-use
Since the AllocRunner.alloc struct can be mutated, most of AllocRunner
needs to acquire a lock to get the alloc's ID. Log lines always need to
include the alloc ID, so we often skipped acquiring a lock just to grab
the ID and accepted the race.

Let's make the race detector a little happier by storing the ID in a
single assignment field.
2017-07-17 15:46:54 -07:00
Michael Schurter 181fda825a Fix log level 2017-07-17 15:46:54 -07:00
Michael Schurter 98f6e7f10f Don't fail if task dirs don't exist on creation
Task dir metadata is created in AllocRunner.Run which may not run before
an alloc is sync'd and Nomad exits. There's no reason not to just create
task dir metadata on restore if it doesn't exist.
2017-07-17 15:46:54 -07:00
Michael Schurter 51515cbe0c Ensure allocDir is never nil and persisted safely
Fixes #2834
2017-07-17 15:46:54 -07:00
Alex Dadgar 0821ee67f5 Fix alloc broadcaster panic on double close 2017-07-17 14:09:05 -07:00
Michael Schurter 0a6bf87365 Fix nil panic in Docker error condition
Fixes #2835

Yet another bug caused by overwriting container and then trying to
reference container.ID in the err handling block. Did a quick audit of
docker.go and it seems to be the last offender. See #2804 for previous
bug.
2017-07-14 10:48:19 -07:00
Alex Dadgar 05894f4611 Small fixes 2017-07-07 17:34:50 -07:00
Michael Schurter fecb16cfb2 Merge pull request #2793 from hashicorp/b-2776-ct-vault-servername
Propagate vault.tls_server_name to consul-template
2017-07-07 16:44:19 -07:00
Michael Schurter 95a9a5da71 Merge pull request #2787 from hashicorp/f-docker-test-mac
Test #2652 - Docker MAC Address option
2017-07-07 16:22:10 -07:00
Michael Schurter 4be4df21c9 Merge pull request #2797 from hashicorp/f-2785-docker-bridge-ip
Add driver.docker.bridge_ip node attribute
2017-07-07 16:20:20 -07:00
Michael Schurter 94389c3ecc Remove debug logging 2017-07-07 16:19:42 -07:00
Michael Schurter 5e3e3818db Merge pull request #2804 from hashicorp/b-2802-docker-panic
Don't panic in container list/remove/inspect race
2017-07-07 15:35:51 -07:00
Michael Schurter 67a7b0eac9 Don't panic in container list/remove/inspect race
Fixes #2802

While it's hard to reproduce the theoretical race is:

1. This goroutine calls ListContainers()
2. Another goroutine removes a container X
3. This goroutine attempts to InspectContainer(X)

However, this bug could be hit in the much simpler case of
InspectContainer() timing out.

In those cases an error is returned and the old code attempted to wrap
the error with the now-nil container.ID. Storing the container ID fixes
that panic.
2017-07-07 15:10:59 -07:00
Alex Dadgar bf97a2455c Vet and small improvement on watcher failure detection 2017-07-07 14:53:01 -07:00
Alex Dadgar 45712c6ca3 test fixes 2017-07-07 14:11:27 -07:00
Alex Dadgar ade9a7c768 @jippi Changed my mind! Good suggestion 2017-07-07 12:12:48 -07:00
Alex Dadgar c063eba836 Warn log 2017-07-07 12:10:04 -07:00
Alex Dadgar 067ed86a47 Client watches for allocation health using task state and Consul checks
This PR adds watching of allocation health at the client. The client can
watch for health based on the tasks running on time and also based on
the consul checks passing.
2017-07-07 12:10:04 -07:00
Alex Dadgar 001058227e watcher per alloc 2017-07-07 12:07:08 -07:00
Alex Dadgar 2e2fd26bed Update index 2017-07-07 12:07:08 -07:00
Alex Dadgar ecee5e370e initial watcher 2017-07-07 12:07:08 -07:00
Alex Dadgar c77944ed29 assign names 2017-07-07 12:03:11 -07:00
Michael Schurter 084dd384c1 Add driver.docker.bridge_ip node attribute
Fixes #2785
2017-07-07 10:14:10 -07:00
Michael Schurter d38d48151a Propagate vault.tls_server_name to consul-template
Fixes #2776
2017-07-06 16:56:50 -07:00
Michael Schurter 39edf23fd5 Merge pull request #2786 from hashicorp/f-docker-auth-soft-fail
Default to auth hard fail but optionally soft fail
2017-07-06 13:25:56 -07:00
Michael Schurter bae1b7db2d Test #2652
Also cleanup docker config opts docs
2017-07-06 12:46:25 -07:00
Michael Schurter 8f4353779a Merge branch 'master' into master 2017-07-06 12:09:36 -07:00
Michael Schurter 2900f941b5 Default to auth hard fail but optionally soft fail 2017-07-06 11:35:34 -07:00
Michael Schurter 08b452adf5 Merge pull request #2781 from hashicorp/f-2678-getter-mode
Add support for go-getter modes
2017-07-06 11:06:40 -07:00
Michael Schurter b000bb8598 Merge pull request #2744 from aep/master
Do not fail when no docker registry auth is available
2017-07-06 11:04:11 -07:00
Michael Schurter 0d3bdf7210 Add support for go-getter modes
Fixes #2678
2017-07-06 10:45:44 -07:00
Michael Schurter 644f0cfaa4 Consistently quote alloc ids in client logs 2017-07-06 10:24:52 -07:00
Michael Schurter 4fd9ef6a8c Tiny client race condition fix
Plus some logging improvements that may help with #2563
2017-07-05 16:15:19 -07:00
Michael Schurter 8e2e26c607 rkt: use %s instead of %q when interpolating env
Fixes #2686
2017-07-05 09:36:17 -07:00
Michael Schurter b2382f99f2 0 compute == error 2017-07-03 14:51:02 -07:00
Michael Schurter ecf090e980 Fix cpu_total_compute override 2017-07-03 14:51:02 -07:00
Michael Schurter 2d741c770b Merge pull request #2732 from hashicorp/b-persist-alloc-updates
Persist Alloc when EvalID changes
2017-07-03 14:46:43 -07:00
Michael Schurter 56a6f8ca8a Merge pull request #2763 from hashicorp/f-bad-state-help
Add more logging to restore state errors
2017-07-03 14:45:03 -07:00
Michael Schurter 9d4b0651ef Merge pull request #2753 from hashicorp/b-leader-dies-first
Destroy task group leader first
2017-07-03 14:38:04 -07:00
Michael Schurter 6e7cc3964e Merge pull request #2709 from hashicorp/f-advertise-docker-ips
Advertise driver-specific addresses
2017-07-03 14:04:12 -07:00
Michael Schurter 5ec52ec24a Destroy task group leader first
Before this commit all tasks in a task group were destroyed
concurrently. This meant logging sidecars might be stopped before the
leader task whose logs still need to be shipped.

This commit blocks on the leader shutting down before signalling to
followers to shutdown.
2017-07-03 13:56:56 -07:00
Michael Schurter 596727230b Suggest wiping out alloc dir too 2017-07-03 12:29:21 -07:00
Michael Schurter 11f68bfca2 Add more logging to restore state errors 2017-07-03 11:58:41 -07:00
Arvid E. Picciani aa4f029f10 Do not fail when no docker registry auth is available
this amends the behaviour introduced with #2651
and allows pulling public images when docker.auth.helper is set
2017-06-27 11:11:18 +02:00