open-nomad

Commit Graph

Author	SHA1	Message	Date
Michael Schurter	ca946679f6	Destroy partially migrated alloc dirs Test that snapshot errors don't return a valid tar currently fails.	2017-11-29 17:26:11 -08:00
Michael Schurter	f86f0bd9ea	Handle leader task being dead in RestoreState Fixes the panic mentioned in https://github.com/hashicorp/nomad/issues/3420#issuecomment-341666932 While a leader task dying serially stops all follower tasks, the synchronizing of state is asynchronous. Nomad can shutdown before all follower tasks have updated their state to dead thus saving the state necessary to hit this panic: have a non-terminal alloc with a dead leader. The actual fix is a simple nil check to not assume non-terminal allocs leader's have a TaskRunner.	2017-11-15 10:36:13 -08:00
Alex Dadgar	b4af10edde	Alloc Runner doesn't panic on restoration.	2017-11-02 16:14:13 -07:00
Diptanu Choudhury	cb68889652	Added the node_id as a tag	2017-11-02 13:29:10 -07:00
Diptanu Choudhury	8a9d0d40b1	Added support for tagged metrics	2017-11-02 10:07:57 -07:00
Diptanu Choudhury	5f522c6de3	Incrementing the start counter when we are actually starting a container	2017-11-02 09:51:20 -07:00
Diptanu Choudhury	44535e5d10	Recording counter for dead allocs properly	2017-11-02 09:51:20 -07:00
Diptanu Choudhury	0b34e811b7	Added metrics to track task/alloc start/restarts/dead events	2017-11-02 09:51:20 -07:00
Michael Schurter	73e9b57908	Trigger GCs after alloc changes GC much more aggressively by triggering GCs when allocations become terminal as well as after new allocations are added.	2017-11-01 15:16:38 -05:00
Michael Schurter	2a81160dcd	Fix GC'd alloc tracking The Client.allocs map now contains all AllocRunners again, not just un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs allocs. Also stops logging "marked for GC" twice.	2017-11-01 15:16:38 -05:00
Alex Dadgar	4173834231	Enable more linters	2017-09-26 15:26:33 -07:00
Michael Schurter	8a87475498	Use existing restart policy infrastructure	2017-09-14 16:46:54 -07:00
Alex Dadgar	1a86aecf55	Add version package This PR adds a version package and consolidates version strings into a Version struct.	2017-08-16 15:44:21 -07:00
Michael Schurter	7342e23669	Move migrating state into prevAllocWatcher	2017-08-14 16:02:28 -07:00
Michael Schurter	4601419d63	Soft fail on migration errors	2017-08-11 16:50:30 -07:00
Michael Schurter	ad6cec9e82	Set failed status instead of panic'ing Fixup some TODOs and formatting left from new prevAllocWatcher code.	2017-08-11 16:21:35 -07:00
Michael Schurter	e41a654917	switch from alloc blocker to new interface interface has 3 implementations: 1. local for blocking and moving data locally 2. remote for blocking and moving data from another node 3. noop for allocs that don't need to block	2017-08-11 16:21:35 -07:00
Michael Schurter	ee04717a0b	initial attempt at refactoring blocked/migrating	2017-08-11 16:21:35 -07:00
Michael Schurter	ec6e6e6c66	Only set alloc status if it's not already terminal	2017-08-11 16:21:35 -07:00
Alex Dadgar	1b061b8f47	Unmount task directories when alloc is terminal This PR unmounts directories from tasks when the alloc is terminal rather than when it is garbage collected. /cc @angrycub	2017-08-10 13:28:17 -07:00
Alex Dadgar	83ba2f1814	Template emits events explaining why it is blocked This PR does the following: * Adds a mechanism to emit events in the TaskRunner * Vendors a new version of Consul-Template that allows extraction of missing dependencies * Adds logic to our consul_template.go to determine missing events and emit them in a batched fashion. * Refactors the consul_template code to split the run method and take in a config struct rather than many parameters. Fixes https://github.com/hashicorp/nomad/issues/2578	2017-08-09 18:01:27 -07:00
Alex Dadgar	4f6f6a13c8	Emit generic task events	2017-08-07 21:26:04 -07:00
Michael Schurter	c76b3b54b9	Merge branch 'master' into fix-pending-state	2017-08-03 17:27:03 -07:00
Michael Schurter	b01dd31f26	Don't attempt to restore tasks that never sync'd	2017-07-24 15:58:46 -07:00
Michael Schurter	9a7a1d8c13	Fix race by not accessing tr.task from ar	2017-07-21 16:16:53 -07:00
Michael Schurter	2e9a1e3fa6	Remove unneeded saveTaskRunnerState method Collapse it into the one place it's called	2017-07-21 16:16:02 -07:00
Alex Dadgar	09c8ee621b	Destroy tasks that are part of terminal alloc	2017-07-20 12:02:04 -07:00
Alex Dadgar	64776b1370	Should not persist state after alloc_runner is garbage collected	2017-07-19 17:31:30 -07:00
Michael Schurter	c1b8bef813	Use broadcast send retry logic everywhere	2017-07-18 14:36:32 -07:00
Alex Dadgar	d2381c9263	Merge pull request #2853 from hashicorp/b-watcher Improve alloc health watcher	2017-07-18 14:12:28 -07:00
Alex Dadgar	bd43bd509c	Save deployment status	2017-07-18 12:37:52 -07:00
Alex Dadgar	41f67e3535	Small fixes	2017-07-18 12:19:57 -07:00
Michael Schurter	c24e73ede7	Fix deadlock caused by syncing during destroy When replacing an alloc the new alloc is blocked until the old alloc is destroyed. This could cause a deadlock: 1. Destroying the old alloc includes a final sync of its status 2. Syncing status causes a GC 3. A GC looks for terminal allocs to cleanup 4. The GC waits for an alloc to stop completely before GC'ing If the GC chooses the currently-being-destroyed-alloc to GC, the GC deadlocks. If `client.max_parallel` deadlocks happen the GC is wedged until the Nomad process is restarted. Performing the final sync asynchronously is an ugly hack but prevents the deadlock by allowing the final sync to occur after the alloc runner has shutdown and been destroyed.	2017-07-18 11:12:56 -07:00
Michael Schurter	cdb2e96d99	Add AllocRunner.allocID for ease-of-use Since the AllocRunner.alloc struct can be mutated, most of AllocRunner needs to acquire a lock to get the alloc's ID. Log lines always need to include the alloc ID, so we often skipped acquiring a lock just to grab the ID and accepted the race. Let's make the race detector a little happier by storing the ID in a single assignment field.	2017-07-17 15:46:54 -07:00
Michael Schurter	181fda825a	Fix log level	2017-07-17 15:46:54 -07:00
Michael Schurter	98f6e7f10f	Don't fail if task dirs don't exist on creation Task dir metadata is created in AllocRunner.Run which may not run before an alloc is sync'd and Nomad exits. There's no reason not to just create task dir metadata on restore if it doesn't exist.	2017-07-17 15:46:54 -07:00
Michael Schurter	51515cbe0c	Ensure allocDir is never nil and persisted safely Fixes #2834	2017-07-17 15:46:54 -07:00
Michael Schurter	e9a416b731	Merge branch 'master' into fix-pending-state	2017-07-10 10:43:23 -07:00
unknown	26b16fa3ce	#2563 fixed pending state for allocations with terminal status	2017-07-09 16:18:06 +03:00
Alex Dadgar	bf97a2455c	Vet and small improvement on watcher failure detection	2017-07-07 14:53:01 -07:00
Alex Dadgar	ade9a7c768	@jippi Changed my mind! Good suggestion	2017-07-07 12:12:48 -07:00
Alex Dadgar	067ed86a47	Client watches for allocation health using task state and Consul checks This PR adds watching of allocation health at the client. The client can watch for health based on the tasks running on time and also based on the consul checks passing.	2017-07-07 12:10:04 -07:00
Alex Dadgar	001058227e	watcher per alloc	2017-07-07 12:07:08 -07:00
Alex Dadgar	ecee5e370e	initial watcher	2017-07-07 12:07:08 -07:00
Michael Schurter	2d741c770b	Merge pull request #2732 from hashicorp/b-persist-alloc-updates Persist Alloc when EvalID changes	2017-07-03 14:46:43 -07:00
Michael Schurter	5ec52ec24a	Destroy task group leader first Before this commit all tasks in a task group were destroyed concurrently. This meant logging sidecars might be stopped before the leader task whose logs still need to be shipped. This commit blocks on the leader shutting down before signalling to followers to shutdown.	2017-07-03 13:56:56 -07:00
Michael Schurter	cff8546035	Fix spelling & re-add immutable state struct	2017-06-23 13:01:39 -07:00
Michael Schurter	d359d3b554	Rename immutable -> alloc meh; naming is hard	2017-06-23 10:58:36 -07:00
Michael Schurter	af2fc0f1bc	Persist Alloc when EvalID changes	2017-06-22 17:33:12 -07:00
Alex Dadgar	ee8dd84965	Fix nil job on allocation The way the copying was happening on the alloc_runner was by temporarily setting the alloc.Job to nil, copying and then restoring it. This created an issue in which when the alloc was shared (which it is in server/client mode and between alloc_runner/task_runner) there were race conditions that could create a panic. Fixes https://github.com/hashicorp/nomad/issues/2605	2017-05-17 14:07:06 -04:00


181 Commits