open-nomad

Author	SHA1	Message	Date
Mahmood Ali	3d166e6e9c	Merge pull request #4984 from hashicorp/b-client-update-driver client: update driver info on new driver fingerprint	2018-12-11 18:01:03 -05:00
Alex Dadgar	1531b6d534	Merge pull request #4970 from hashicorp/f-no-iops Deprecate IOPS	2018-12-11 12:51:22 -08:00
Mahmood Ali	ba515947c2	client: update driver info on new fingerprint Fixes a bug where a driver health and attributes are never updated from their initial status. If a driver started unhealthy, it may never go into a healthy status.	2018-12-11 14:25:10 -05:00
Danielle Tomlinson	805669ead4	client: Correctly pass a noop PrevAllocMigrator when restoring	2018-12-11 15:46:58 +01:00
Danielle Tomlinson	83720575de	client: Unify handling of previous and preempted allocs	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	dff7093243	client: Wait for preempted allocs to terminate When starting an allocation that is preempting other allocs, we create a new group allocation watcher, and then wait for the allocations to terminate in the allocation PreRun hooks. If there's no preempted allocations, then we simply provide a NoopAllocWatcher.	2018-12-11 00:59:18 +01:00
Alex Dadgar	1e3c3cb287	Deprecate IOPS IOPS have been modelled as a resource since Nomad 0.1 but has never actually been detected and there is no plan in the short term to add detection. This is because IOPS is a bit simplistic of a unit to define the performance requirements from the underlying storage system. In its current state it adds unnecessary confusion and can be removed without impacting any users. This PR leaves IOPS defined at the jobspec parsing level and in the api/ resources since these are the two public uses of the field. These should be considered deprecated and only exist to allow users to stop using them during the Nomad 0.9.x release. In the future, there should be no expectation that the field will exist.	2018-12-06 15:09:26 -08:00
Danielle Tomlinson	66c521ca17	client: Move fingerprint structs to pkg This removes a cyclical dependency when importing client/structs from dependencies of the plugin_loader, specifically, drivers. Due to client/config also depending on the plugin_loader. It also better reflects the ownership of fingerprint structs, as they are fairly internal to the fingerprint manager.	2018-12-01 17:10:39 +01:00
Alex Dadgar	4ee603c382	Device hook and devices affect computed node class This PR introduces a device hook that retrieves the device mount information for an allocation. It also updates the computed node class computation to take into account devices. TODO Fix the task runner unit test. The environment variable is being lost even though it is being properly set in the prestart hook.	2018-11-27 17:25:33 -08:00
Michael Schurter	1e4ef139dd	Merge pull request #4883 from hashicorp/f-graceful-shutdown Support graceful shutdowns in agent	2018-11-27 15:55:15 -06:00
Michael Schurter	4f7e6f9464	client: fix races in use of goroutine group The group utility struct does not support asynchronously launched goroutines (goroutines-inside-of-goroutines), so switch those uses to a normal go call. This means watchNodeUpdates and watchNodeEvents may not be shutdown when Shutdown() exits. During nomad agent shutdown this does not matter. During tests this means a test may leak those goroutines or be unable to know when those goroutines have exited. Since there's no runtime impact and these goroutines do not affect alloc state syncing it seems ok to risk leaking them.	2018-11-26 12:52:55 -08:00
Michael Schurter	9f43fb6d29	client: reuse group instead of diy'ing it	2018-11-26 12:52:31 -08:00
Michael Schurter	5bd744ac3d	client: support graceful shutdowns Client.Shutdown now blocks until all AllocRunners and TaskRunners have exited their Run loops. Tasks are left running.	2018-11-19 16:39:30 -08:00
Mahmood Ali	f139234372	address review comments	2018-11-16 17:13:01 -05:00
Mahmood Ali	f72e599ee7	Populate alloc stats API with device stats This change makes few compromises: * Looks up the devices associated with tasks at look up time. Given that `nomad alloc status` is called rarely generally (compared to stats telemetry and general job reporting), it seems fine. However, the lookup overhead grows bounded by number of `tasks x total-host-devices`, which can be significant. * `client.Client` performs the task devices->statistics lookup. It passes self to alloc/task runners so they can look up the device statistics allocated to them. * Currently alloc/task runners are responsible for constructing the entire RPC response for stats * The alternatives for making task runners device statistics aware don't seem appealing (e.g. having task runners contain reference to hostStats) * On the alloc aggregation resource usage, I did a naive merging of task device statistics. * Personally, I question the value of such aggregation, compared to costs of struct duplication and bloating the response - but opted to be consistent in the API. * With naive concatination, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.	2018-11-16 10:26:32 -05:00
Mahmood Ali	046f098bac	Track Node Device attributes and serve them in API	2018-11-14 14:42:29 -05:00
Mahmood Ali	b74ccc742c	Expose Device Stats in /client/stats API endpoint	2018-11-14 14:41:19 -05:00
Alex Dadgar	a7ca737fb6	review comments	2018-11-07 11:31:52 -08:00
Alex Dadgar	204ca8230c	Device manager Introduce a device manager that manages the lifecycle of device plugins on the client. It fingerprints, collects stats, and forwards Reserve requests to the correct plugin. The manager, also handles device plugins failing and validates their output.	2018-11-07 10:43:15 -08:00
Michael Schurter	b7a9d61a38	ar: initialize allocwatcher on restore Fixes a panic. Left a comment on how the behavior could be improved, but this is what releases <0.9.0 did.	2018-10-19 09:45:45 -07:00
Michael Schurter	e060174130	ar: fix leader handling, state restoring, and destroying unrun ARs * Migrated all of the old leader task tests and got them passing * Refactor and consolidate task killing code in AR to always kill leader tasks first * Fixed lots of issues with state restoring * Fixed deadlock in AR.Destroy if AR.Run had never been called * Added a new in memory statedb for testing	2018-10-19 09:45:45 -07:00
Nick Ethier	3183b33d24	client: review comments and fixup/skip tests	2018-10-16 16:56:56 -07:00
Nick Ethier	f192c3752a	client: refactor post allocrunnerv2 finalization	2018-10-16 16:56:56 -07:00
Nick Ethier	4a4c7dbbfc	client: begin driver plugin integration client: fingerprint driver plugins	2018-10-16 16:56:56 -07:00
Alex Dadgar	45e41cca03	allocrunnerv2 -> allocrunner	2018-10-16 16:56:56 -07:00
Alex Dadgar	6c9d9d5173	move files around	2018-10-16 16:56:55 -07:00
Michael Schurter	960f3be76c	client: expose task state to client The interesting decision in this commit was to expose AR's state and not a fully materialized Allocation struct. AR.clientAlloc builds an Alloc that contains the task state, so I considered simply memoizing and exposing that method. However, that would lead to AR having two awkwardly similar methods: - Alloc() - which returns the server-sent alloc - ClientAlloc() - which returns the fully materialized client alloc Since ClientAlloc() could be memoized it would be just as cheap to call as Alloc(), so why not replace Alloc() entirely? Replacing Alloc() entirely would require Update() to immediately materialize the task states on server-sent Allocs as there may have been local task state changes since the server received an Alloc update. This quickly becomes difficult to reason about: should Update hooks use the TaskStates? Are state changes caused by TR Update hooks immediately reflected in the Alloc? Should AR persist its copy of the Alloc? If so, are its TaskStates canonical or the TaskStates on TR? So! Forget that. Let's separate the static Allocation from the dynamic AR & TR state! - AR.Alloc() is for static Allocation access (often for the Job) - AR.AllocState() is for the dynamic AR & TR runtime state (deployment status, task states, etc). If code needs to know the status of a task: AllocState() If code needs to know the names of tasks: Alloc() It should be very easy for a developer to reason about which method they should call and what they can do with the return values.	2018-10-16 16:56:55 -07:00
Michael Schurter	8d1419c62b	client: fix accessing alloc runners * GetClientAlloc() gains nothing from using allAllocs() * getAllocatedResources was calling getAllocRunners() twice	2018-10-16 16:56:55 -07:00
Michael Schurter	e6e2930a00	tr: implement stats collection hook Tested except for the net/rpc specific error case which may need changing in the gRPC world.	2018-10-16 16:53:31 -07:00
Alex Dadgar	cebfead6bc	add logger back	2018-10-16 16:53:30 -07:00
Alex Dadgar	8504505c0d	client uses passed logger and fix fingerprinters	2018-10-16 16:53:30 -07:00
Michael Schurter	9d1ea3b228	client: hclog-ify most of the client Leaving fingerprinters in case that interface changes with plugins.	2018-10-16 16:53:30 -07:00
Michael Schurter	e42154fc46	implement stopping, destroying, and disk migration * Stopping an alloc is implemented via Updates but update hooks are not run. * Destroying an alloc is a best effort cleanup. * AllocRunner destroy hooks implemented. * Disk migration and blocking on a previous allocation exiting moved to its own package to avoid cycles. Now only depends on alloc broadcaster instead of also using a waitch. * AllocBroadcaster now only drops stale allocations and always keeps the latest version. * Made AllocDir safe for concurrent use Lots of internal contexts that are currently unused. Unsure if they should be used or removed.	2018-10-16 16:53:30 -07:00
Michael Schurter	4236255686	lots of comment/log fixes	2018-10-16 16:53:30 -07:00
Michael Schurter	357641c364	persist alloc state on changes, not periodically Allow alloc and task runners to persist their own state when something changes instead of periodically syncing all state.	2018-10-16 16:53:30 -07:00
Michael Schurter	a3fe0510d1	Move all encoding and put deduping into state db Still WIP as it does not handle deletions.	2018-10-16 16:53:30 -07:00
Michael Schurter	533bc93b3a	implement all boltdb interactions behind StateDB	2018-10-16 16:53:30 -07:00
Michael Schurter	a5d3e3fb0a	Implement alloc updates in arv2 Updates are applied asynchronously but sequentially	2018-10-16 16:53:30 -07:00
Michael Schurter	a4b4d7b266	consul service hook Deregistration works but difficult to test due to terminal updates not being fully implemented in the new client/ar/tr.	2018-10-16 16:53:29 -07:00
Michael Schurter	5be982e674	restore vault client	2018-10-16 16:53:29 -07:00
Alex Dadgar	fd3bc1bd39	Update state with server	2018-10-16 16:53:29 -07:00
Michael Schurter	7f4ec50906	missed locking around c.allocs access	2018-10-16 16:53:29 -07:00
Michael Schurter	516d641db0	client: implement all-or-nothing alloc restoration Restoring calls NewAR -> Restore -> Run NewAR now calls NewTR AR.Restore calls TR.Restore AR.Run calls TR.Run	2018-10-16 16:53:29 -07:00
Alex Dadgar	80f6ce50c0	vault hook	2018-10-16 16:53:29 -07:00
Michael Schurter	b360f6f96e	fix hclog level	2018-10-16 16:53:29 -07:00
Michael Schurter	4f43ff5c51	pass statedb into allocrunnerv2	2018-10-16 16:53:29 -07:00
Michael Schurter	0f7dcfdc9a	example redis job "runs" on arv2! see below Tons left to do and lots of churn: 1. No state saving 2. No shutdown or gc 3. Removed AR factory for now 4. Made all "Config" structs local to the package they configure 5. Added allocID to GC to avoid a lookup Really hating how many things use *structs.Allocation. It's not bad without state saving, but if AllocRunner starts updating its copy things get racy fast.	2018-10-16 16:53:29 -07:00
Alex Dadgar	01f8e5b95f	renames	2018-10-04 14:57:25 -07:00
Alex Dadgar	52f9cd7637	fixing tests	2018-10-04 14:26:19 -07:00
Alex Dadgar	5c8697667e	Node reserved resources	2018-09-29 18:44:55 -07:00

1 2 3 4 5 ...

535 commits