Commit Graph

190 Commits

Author SHA1 Message Date
Alex Dadgar f5ff509fa5 Refactor - wip 2018-06-12 10:23:45 -07:00
Alex Dadgar 552604451c
rework where time gets set 2018-05-07 14:50:01 -05:00
Alex Dadgar ee50789c22
Initial implementation 2018-05-07 14:50:01 -05:00
Preetha Appan 6363a6fb4d
Remove old comment 2018-04-04 15:01:48 -05:00
Preetha Appan 5e4525bd30
Moves setting finishedAt to the right place and adds two unit tests. 2018-04-04 14:38:15 -05:00
Preetha Appan e6bbce3fa0
Add comment 2018-04-03 19:49:03 -05:00
Preetha Appan 00537c739b
Fixes edge cases around timing and task finish time being set more than once 2018-04-03 16:34:59 -05:00
Alex Dadgar beee130a6e Always capture the finish time 2018-03-29 11:27:22 -07:00
Josh Soref 09970343b5 spelling: destruction 2018-03-11 17:54:39 +00:00
Michael Schurter ca946679f6 Destroy partially migrated alloc dirs
Test that snapshot errors don't return a valid tar currently fails.
2017-11-29 17:26:11 -08:00
Michael Schurter f86f0bd9ea Handle leader task being dead in RestoreState
Fixes the panic mentioned in
https://github.com/hashicorp/nomad/issues/3420#issuecomment-341666932

While a leader task dying serially stops all follower tasks, the
synchronizing of state is asynchrnous. Nomad can shutdown before all
follower tasks have updated their state to dead thus saving the state
necessary to hit this panic: *have a non-terminal alloc with a dead
leader.*

The actual fix is a simple nil check to not assume non-terminal allocs
leader's have a TaskRunner.
2017-11-15 10:36:13 -08:00
Alex Dadgar b4af10edde Alloc Runner doesn't panic on restoration. 2017-11-02 16:14:13 -07:00
Diptanu Choudhury cb68889652 Added the node_id as a tag 2017-11-02 13:29:10 -07:00
Diptanu Choudhury 8a9d0d40b1 Added support for tagged metrics 2017-11-02 10:07:57 -07:00
Diptanu Choudhury 5f522c6de3 Incrementing the start counter when we are actually starting a container 2017-11-02 09:51:20 -07:00
Diptanu Choudhury 44535e5d10 Recording counter for dead allocs properly 2017-11-02 09:51:20 -07:00
Diptanu Choudhury 0b34e811b7 Added metrics to track task/alloc start/restarts/dead events 2017-11-02 09:51:20 -07:00
Michael Schurter 73e9b57908 Trigger GCs after alloc changes
GC much more aggressively by triggering GCs when allocations become
terminal as well as after new allocations are added.
2017-11-01 15:16:38 -05:00
Michael Schurter 2a81160dcd Fix GC'd alloc tracking
The Client.allocs map now contains all AllocRunners again, not just
un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs
allocs.

Also stops logging "marked for GC" twice.
2017-11-01 15:16:38 -05:00
Alex Dadgar 4173834231 Enable more linters 2017-09-26 15:26:33 -07:00
Michael Schurter 8a87475498 Use existing restart policy infrastructure 2017-09-14 16:46:54 -07:00
Alex Dadgar 1a86aecf55 Add version package
This PR adds a version package and consolidates version strings into a
Version struct.
2017-08-16 15:44:21 -07:00
Michael Schurter 7342e23669 Move migrating state into prevAllocWatcher 2017-08-14 16:02:28 -07:00
Michael Schurter 4601419d63 Soft fail on migration errors 2017-08-11 16:50:30 -07:00
Michael Schurter ad6cec9e82 Set failed status instead of panic'ing
Fixup some TODOs and formatting left from new prevAllocWatcher code.
2017-08-11 16:21:35 -07:00
Michael Schurter e41a654917 switch from alloc blocker to new interface
interface has 3 implementations:

1. local for blocking and moving data locally
2. remote for blocking and moving data from another node
3. noop for allocs that don't need to block
2017-08-11 16:21:35 -07:00
Michael Schurter ee04717a0b initial attempt at refactoring blocked/migrating 2017-08-11 16:21:35 -07:00
Michael Schurter ec6e6e6c66 Only set alloc status if it's not already terminal 2017-08-11 16:21:35 -07:00
Alex Dadgar 1b061b8f47 Unmount task directories when alloc is terminal
This PR unmounts directories from tasks when the alloc is terminal
rather than when it is garbage collected.

/cc @angrycub
2017-08-10 13:28:17 -07:00
Alex Dadgar 83ba2f1814 Template emits events explaining why it is blocked
This PR does the following:
* Adds a mechanism to emit events in the TaskRunner
* Vendors a new version of Consul-Template that allows extraction of
missing dependencies
* Adds logic to our consul_template.go to determine missing events and
emit them in a batched fashion.
* Refactors the consul_template code to split the run method and take in
a config struct rather than many parameters.

Fixes https://github.com/hashicorp/nomad/issues/2578
2017-08-09 18:01:27 -07:00
Alex Dadgar 4f6f6a13c8 Emit generic task events 2017-08-07 21:26:04 -07:00
Michael Schurter c76b3b54b9 Merge branch 'master' into fix-pending-state 2017-08-03 17:27:03 -07:00
Michael Schurter b01dd31f26 Don't attempt to restore tasks that never sync'd 2017-07-24 15:58:46 -07:00
Michael Schurter 9a7a1d8c13 Fix race by not accessing tr.task from ar 2017-07-21 16:16:53 -07:00
Michael Schurter 2e9a1e3fa6 Remove unneeded saveTaskRunnerState method
Collapse it into the one place it's called
2017-07-21 16:16:02 -07:00
Alex Dadgar 09c8ee621b Destroy tasks that are part of terminal alloc 2017-07-20 12:02:04 -07:00
Alex Dadgar 64776b1370 Should not persist state after alloc_runner is garbage collected 2017-07-19 17:31:30 -07:00
Michael Schurter c1b8bef813 Use broadcast send retry logic everywhere 2017-07-18 14:36:32 -07:00
Alex Dadgar d2381c9263 Merge pull request #2853 from hashicorp/b-watcher
Improve alloc health watcher
2017-07-18 14:12:28 -07:00
Alex Dadgar bd43bd509c Save deployment status 2017-07-18 12:37:52 -07:00
Alex Dadgar 41f67e3535 Small fixes 2017-07-18 12:19:57 -07:00
Michael Schurter c24e73ede7 Fix deadlock caused by syncing during destroy
When replacing an alloc the new alloc is blocked until the old alloc is
destroyed. This could cause a deadlock:

1. Destroying the old alloc includes a final sync of its status
2. Syncing status causes a GC
3. A GC looks for terminal allocs to cleanup
4. The GC waits for an alloc to stop completely before GC'ing

If the GC chooses the currently-being-destroyed-alloc to GC, the GC
deadlocks. If `client.max_parallel` deadlocks happen the GC is wedged
until the Nomad process is restarted.

Performing the final sync asynchronously is an ugly hack but prevents
the deadlock by allowing the final sync to occur after the alloc runner
has shutdown and been destroyed.
2017-07-18 11:12:56 -07:00
Michael Schurter cdb2e96d99 Add AllocRunner.allocID for ease-of-use
Since the AllocRunner.alloc struct can be mutated, most of AllocRunner
needs to acquire a lock to get the alloc's ID. Log lines always need to
include the alloc ID, so we often skipped acquiring a lock just to grab
the ID and accepted the race.

Let's make the race detector a little happier by storing the ID in a
single assignment field.
2017-07-17 15:46:54 -07:00
Michael Schurter 181fda825a Fix log level 2017-07-17 15:46:54 -07:00
Michael Schurter 98f6e7f10f Don't fail if task dirs don't exist on creation
Task dir metadata is created in AllocRunner.Run which may not run before
an alloc is sync'd and Nomad exits. There's no reason not to just create
task dir metadata on restore if it doesn't exist.
2017-07-17 15:46:54 -07:00
Michael Schurter 51515cbe0c Ensure allocDir is never nil and persisted safely
Fixes #2834
2017-07-17 15:46:54 -07:00
Michael Schurter e9a416b731 Merge branch 'master' into fix-pending-state 2017-07-10 10:43:23 -07:00
unknown 26b16fa3ce #2563 fixed pending state for allocations with terminal status 2017-07-09 16:18:06 +03:00
Alex Dadgar bf97a2455c Vet and small improvement on watcher failure detection 2017-07-07 14:53:01 -07:00
Alex Dadgar ade9a7c768 @jippi Changed my mind! Good suggestion 2017-07-07 12:12:48 -07:00