open-nomad

Author	SHA1	Message	Date
Danielle Tomlinson	4184eadaf4	client: updateAlloc release lock after read The allocLock is used to synchronize access to the alloc runner map, not to ensure internal consistency of the alloc runners themselves. This updates the updateAlloc process to avoid hanging on to an exclusive lock of the map while applying changes to allocrunners themselves, as they should be internally consistent. This fixes a bug where any client allocation api will block during the shutdown or updating of an allocrunner and its child taskrunners.	2018-12-12 16:30:01 +01:00
Mahmood Ali	3d166e6e9c	Merge pull request #4984 from hashicorp/b-client-update-driver client: update driver info on new driver fingerprint	2018-12-11 18:01:03 -05:00
Alex Dadgar	1531b6d534	Merge pull request #4970 from hashicorp/f-no-iops Deprecate IOPS	2018-12-11 12:51:22 -08:00
Mahmood Ali	ba515947c2	client: update driver info on new fingerprint Fixes a bug where a driver health and attributes are never updated from their initial status. If a driver started unhealthy, it may never go into a healthy status.	2018-12-11 14:25:10 -05:00
Danielle Tomlinson	805669ead4	client: Correctly pass a noop PrevAllocMigrator when restoring	2018-12-11 15:46:58 +01:00
Danielle Tomlinson	83720575de	client: Unify handling of previous and preempted allocs	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	dff7093243	client: Wait for preempted allocs to terminate When starting an allocation that is preempting other allocs, we create a new group allocation watcher, and then wait for the allocations to terminate in the allocation PreRun hooks. If there's no preempted allocations, then we simply provide a NoopAllocWatcher.	2018-12-11 00:59:18 +01:00
Alex Dadgar	1e3c3cb287	Deprecate IOPS IOPS have been modelled as a resource since Nomad 0.1 but has never actually been detected and there is no plan in the short term to add detection. This is because IOPS is a bit simplistic of a unit to define the performance requirements from the underlying storage system. In its current state it adds unnecessary confusion and can be removed without impacting any users. This PR leaves IOPS defined at the jobspec parsing level and in the api/ resources since these are the two public uses of the field. These should be considered deprecated and only exist to allow users to stop using them during the Nomad 0.9.x release. In the future, there should be no expectation that the field will exist.	2018-12-06 15:09:26 -08:00
Danielle Tomlinson	66c521ca17	client: Move fingerprint structs to pkg This removes a cyclical dependency when importing client/structs from dependencies of the plugin_loader, specifically, drivers. Due to client/config also depending on the plugin_loader. It also better reflects the ownership of fingerprint structs, as they are fairly internal to the fingerprint manager.	2018-12-01 17:10:39 +01:00
Alex Dadgar	4ee603c382	Device hook and devices affect computed node class This PR introduces a device hook that retrieves the device mount information for an allocation. It also updates the computed node class computation to take into account devices. TODO Fix the task runner unit test. The environment variable is being lost even though it is being properly set in the prestart hook.	2018-11-27 17:25:33 -08:00
Michael Schurter	1e4ef139dd	Merge pull request #4883 from hashicorp/f-graceful-shutdown Support graceful shutdowns in agent	2018-11-27 15:55:15 -06:00
Michael Schurter	4f7e6f9464	client: fix races in use of goroutine group The group utility struct does not support asynchronously launched goroutines (goroutines-inside-of-goroutines), so switch those uses to a normal go call. This means watchNodeUpdates and watchNodeEvents may not be shutdown when Shutdown() exits. During nomad agent shutdown this does not matter. During tests this means a test may leak those goroutines or be unable to know when those goroutines have exited. Since there's no runtime impact and these goroutines do not affect alloc state syncing it seems ok to risk leaking them.	2018-11-26 12:52:55 -08:00
Michael Schurter	9f43fb6d29	client: reuse group instead of diy'ing it	2018-11-26 12:52:31 -08:00
Michael Schurter	5bd744ac3d	client: support graceful shutdowns Client.Shutdown now blocks until all AllocRunners and TaskRunners have exited their Run loops. Tasks are left running.	2018-11-19 16:39:30 -08:00
Mahmood Ali	f139234372	address review comments	2018-11-16 17:13:01 -05:00
Mahmood Ali	f72e599ee7	Populate alloc stats API with device stats This change makes a few compromises: * Looks up the devices associated with tasks at look up time. Given that `nomad alloc status` is generally called rarely (compared to stats telemetry and general job reporting), it seems fine. However, the lookup overhead grows bounded by number of `tasks x total-host-devices`, which can be significant. * `client.Client` performs the task devices->statistics lookup. It passes self to alloc/task runners so they can look up the device statistics allocated to them. * Currently alloc/task runners are responsible for constructing the entire RPC response for stats * The alternatives for making task runners device statistics aware don't seem appealing (e.g. having task runners contain reference to hostStats) * On the alloc aggregation resource usage, I did a naive merging of task device statistics. * Personally, I question the value of such aggregation, compared to costs of struct duplication and bloating the response - but opted to be consistent in the API. * With naive concatenation, device instances from a single device group used by separate tasks in the alloc would be aggregated in two separate device group statistics.	2018-11-16 10:26:32 -05:00
Mahmood Ali	046f098bac	Track Node Device attributes and serve them in API	2018-11-14 14:42:29 -05:00
Mahmood Ali	b74ccc742c	Expose Device Stats in /client/stats API endpoint	2018-11-14 14:41:19 -05:00
Alex Dadgar	a7ca737fb6	review comments	2018-11-07 11:31:52 -08:00
Alex Dadgar	204ca8230c	Device manager Introduce a device manager that manages the lifecycle of device plugins on the client. It fingerprints, collects stats, and forwards Reserve requests to the correct plugin. The manager also handles device plugins failing and validates their output.	2018-11-07 10:43:15 -08:00
Michael Schurter	b7a9d61a38	ar: initialize allocwatcher on restore Fixes a panic. Left a comment on how the behavior could be improved, but this is what releases <0.9.0 did.	2018-10-19 09:45:45 -07:00
Michael Schurter	e060174130	ar: fix leader handling, state restoring, and destroying unrun ARs * Migrated all of the old leader task tests and got them passing * Refactor and consolidate task killing code in AR to always kill leader tasks first * Fixed lots of issues with state restoring * Fixed deadlock in AR.Destroy if AR.Run had never been called * Added a new in memory statedb for testing	2018-10-19 09:45:45 -07:00
Nick Ethier	3183b33d24	client: review comments and fixup/skip tests	2018-10-16 16:56:56 -07:00
Nick Ethier	f192c3752a	client: refactor post allocrunnerv2 finalization	2018-10-16 16:56:56 -07:00
Nick Ethier	4a4c7dbbfc	client: begin driver plugin integration client: fingerprint driver plugins	2018-10-16 16:56:56 -07:00
Alex Dadgar	45e41cca03	allocrunnerv2 -> allocrunner	2018-10-16 16:56:56 -07:00
Alex Dadgar	6c9d9d5173	move files around	2018-10-16 16:56:55 -07:00
Michael Schurter	960f3be76c	client: expose task state to client The interesting decision in this commit was to expose AR's state and not a fully materialized Allocation struct. AR.clientAlloc builds an Alloc that contains the task state, so I considered simply memoizing and exposing that method. However, that would lead to AR having two awkwardly similar methods: - Alloc() - which returns the server-sent alloc - ClientAlloc() - which returns the fully materialized client alloc Since ClientAlloc() could be memoized it would be just as cheap to call as Alloc(), so why not replace Alloc() entirely? Replacing Alloc() entirely would require Update() to immediately materialize the task states on server-sent Allocs as there may have been local task state changes since the server received an Alloc update. This quickly becomes difficult to reason about: should Update hooks use the TaskStates? Are state changes caused by TR Update hooks immediately reflected in the Alloc? Should AR persist its copy of the Alloc? If so, are its TaskStates canonical or the TaskStates on TR? So! Forget that. Let's separate the static Allocation from the dynamic AR & TR state! - AR.Alloc() is for static Allocation access (often for the Job) - AR.AllocState() is for the dynamic AR & TR runtime state (deployment status, task states, etc). If code needs to know the status of a task: AllocState() If code needs to know the names of tasks: Alloc() It should be very easy for a developer to reason about which method they should call and what they can do with the return values.	2018-10-16 16:56:55 -07:00
Michael Schurter	8d1419c62b	client: fix accessing alloc runners * GetClientAlloc() gains nothing from using allAllocs() * getAllocatedResources was calling getAllocRunners() twice	2018-10-16 16:56:55 -07:00
Michael Schurter	e6e2930a00	tr: implement stats collection hook Tested except for the net/rpc specific error case which may need changing in the gRPC world.	2018-10-16 16:53:31 -07:00
Alex Dadgar	cebfead6bc	add logger back	2018-10-16 16:53:30 -07:00
Alex Dadgar	8504505c0d	client uses passed logger and fix fingerprinters	2018-10-16 16:53:30 -07:00
Michael Schurter	9d1ea3b228	client: hclog-ify most of the client Leaving fingerprinters in case that interface changes with plugins.	2018-10-16 16:53:30 -07:00
Michael Schurter	e42154fc46	implement stopping, destroying, and disk migration * Stopping an alloc is implemented via Updates but update hooks are not run. * Destroying an alloc is a best effort cleanup. * AllocRunner destroy hooks implemented. * Disk migration and blocking on a previous allocation exiting moved to its own package to avoid cycles. Now only depends on alloc broadcaster instead of also using a waitch. * AllocBroadcaster now only drops stale allocations and always keeps the latest version. * Made AllocDir safe for concurrent use Lots of internal contexts that are currently unused. Unsure if they should be used or removed.	2018-10-16 16:53:30 -07:00
Michael Schurter	4236255686	lots of comment/log fixes	2018-10-16 16:53:30 -07:00
Michael Schurter	357641c364	persist alloc state on changes, not periodically Allow alloc and task runners to persist their own state when something changes instead of periodically syncing all state.	2018-10-16 16:53:30 -07:00
Michael Schurter	a3fe0510d1	Move all encoding and put deduping into state db Still WIP as it does not handle deletions.	2018-10-16 16:53:30 -07:00
Michael Schurter	533bc93b3a	implement all boltdb interactions behind StateDB	2018-10-16 16:53:30 -07:00
Michael Schurter	a5d3e3fb0a	Implement alloc updates in arv2 Updates are applied asynchronously but sequentially	2018-10-16 16:53:30 -07:00
Michael Schurter	a4b4d7b266	consul service hook Deregistration works but difficult to test due to terminal updates not being fully implemented in the new client/ar/tr.	2018-10-16 16:53:29 -07:00
Michael Schurter	5be982e674	restore vault client	2018-10-16 16:53:29 -07:00
Alex Dadgar	fd3bc1bd39	Update state with server	2018-10-16 16:53:29 -07:00
Michael Schurter	7f4ec50906	missed locking around c.allocs access	2018-10-16 16:53:29 -07:00
Michael Schurter	516d641db0	client: implement all-or-nothing alloc restoration Restoring calls NewAR -> Restore -> Run NewAR now calls NewTR AR.Restore calls TR.Restore AR.Run calls TR.Run	2018-10-16 16:53:29 -07:00
Alex Dadgar	80f6ce50c0	vault hook	2018-10-16 16:53:29 -07:00
Michael Schurter	b360f6f96e	fix hclog level	2018-10-16 16:53:29 -07:00
Michael Schurter	4f43ff5c51	pass statedb into allocrunnerv2	2018-10-16 16:53:29 -07:00
Michael Schurter	0f7dcfdc9a	example redis job "runs" on arv2! see below Tons left to do and lots of churn: 1. No state saving 2. No shutdown or gc 3. Removed AR factory for now 4. Made all "Config" structs local to the package they configure 5. Added allocID to GC to avoid a lookup Really hating how many things use *structs.Allocation. It's not bad without state saving, but if AllocRunner starts updating its copy things get racy fast.	2018-10-16 16:53:29 -07:00
Alex Dadgar	01f8e5b95f	renames	2018-10-04 14:57:25 -07:00
Alex Dadgar	52f9cd7637	fixing tests	2018-10-04 14:26:19 -07:00
Alex Dadgar	5c8697667e	Node reserved resources	2018-09-29 18:44:55 -07:00
Alex Dadgar	3183153315	Node resources on client	2018-09-29 17:23:41 -07:00
Alex Dadgar	9971b3393f	yamux	2018-09-17 14:22:40 -07:00
Alex Dadgar	7739ef51ce	agent + consul	2018-09-13 10:43:40 -07:00
Michael Schurter	08862fc177	fix race around error handling	2018-09-05 17:34:17 -07:00
Preetha	043f4c208b	Merge pull request #3882 from burdandrei/telemetry-add-node-class-tag Added node class to tagged metrics	2018-06-21 17:04:35 -05:00
Alex Dadgar	b61051b3cd	Merge pull request #4409 from hashicorp/r-client-packages Refactor client packages	2018-06-13 17:32:25 -07:00
Alex Dadgar	90c2108bfb	Fix gc tests + parallel destroy + small test fixes	2018-06-12 10:23:45 -07:00
Alex Dadgar	f5ff509fa5	Refactor - wip	2018-06-12 10:23:45 -07:00
Chelsea Holland Komlo	f74e74b22d	add client logic to determine whether TLS RPC connections should reload	2018-06-08 14:38:58 -04:00
Chelsea Holland Komlo	064b5481e0	add server join info to server and client	2018-05-31 10:50:03 -07:00
Chelsea Holland Komlo	38f611a7f2	refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing add missing fields to TLS merge method	2018-05-23 18:35:30 -04:00
Chelsea Holland Komlo	796bae6f1b	allow configurable cipher suites disallow 3DES and RC4 ciphers add documentation for tls_cipher_suites	2018-05-09 17:15:31 -04:00
Chelsea Holland Komlo	9b8a079558	fix up comments	2018-04-17 11:53:08 -04:00
Alex Dadgar	9d612c8cb0	Cleanup	2018-04-16 15:48:34 -07:00
Alex Dadgar	32adaf9dfc	Copy the config given to the alloc runner	2018-04-16 15:45:52 -07:00
Alex Dadgar	4f2a7b6949	Fix copying drivers	2018-04-16 15:45:51 -07:00
Alex Dadgar	0b799822ff	Operate on copy	2018-04-16 15:45:49 -07:00
Alex Dadgar	ff1a1a63e8	Move where attribute for driver detection is set	2018-04-12 15:50:25 -07:00
Alex Dadgar	f24ce2c50c	Driver health detection cleanups This PR does: 1. Health message based on detection has format "Driver XXX detected" and "Driver XXX not detected" 2. Set initial health description based on detection status and don't wait for the first health check. 3. Combine updating attributes on the node, fingerprint and health checking update for drivers into a single call back. 4. Condensed driver info in `node status` only shows detected drivers and make the output less wide by removing spaces.	2018-04-12 12:46:40 -07:00
Andrei Burd	502d17fa90	Added node class to tagged metrics	2018-04-11 12:20:59 +03:00
Alex Dadgar	3d367d6fd7	Fix client uptime metric missing client prefix	2018-04-10 10:39:36 -07:00
Alex Dadgar	ae1f76477e	Start rebalance after discovering new servers	2018-04-05 15:41:59 -07:00
Alex Dadgar	be2513e0f9	more jitter	2018-04-05 13:48:33 -07:00
Alex Dadgar	bd3345942c	Handle no leader and faster retries near limit Handle the ErrNoLeader case and apply slower retries. Also when we have missed the heartbeat retry aggressively, backing off after we have missed for more than 30 seconds.	2018-04-05 11:22:47 -07:00
Alex Dadgar	279b5c22e5	Scale heartbeat retrying based on remaining heartbeat time	2018-04-05 10:58:13 -07:00
Alex Dadgar	7941f4eb2d	Fire retry only when consul discovers new servers	2018-04-05 10:40:17 -07:00
Alex Dadgar	86c32358d4	Spelling error	2018-04-03 18:30:01 -07:00
Alex Dadgar	01a6beafbf	RPC Retry Watcher	2018-04-03 18:05:28 -07:00
Alex Dadgar	58a3ec3fb2	Improve Vault error handling	2018-04-03 14:29:22 -07:00
Chelsea Holland Komlo	2174ede6b9	add clarifying comment	2018-03-29 10:58:39 -04:00
Chelsea Holland Komlo	e3319afee1	emit first node event	2018-03-28 17:26:53 -04:00
Chelsea Holland Komlo	efc03e252c	specify driver health messages	2018-03-28 11:35:21 -04:00
Chelsea Holland Komlo	003bc209b9	use time.Time for node events for compatibility	2018-03-27 15:43:57 -04:00
Chelsea Holland Komlo	f801709a0a	fix issue when updating node events	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	60f12d206f	improve comments; update watchDriver	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	739784736a	remove unused function	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d92703617c	simplify logic bump log level	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	86b7b3d2d9	fix up health check logic comparison; add node events to client driver checks	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	53a5bc2bb3	Code review feedback	2018-03-21 15:15:26 -04:00
Alex Dadgar	34dc58421c	notes from walk through	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	44b6951dda	improve tests	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	0425be8f48	updating comments; locking concurrent node access	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	c50d02ae93	go style; update comments	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3aa726baab	fix scheduler driver name; create node structs file	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3cba95e8a7	allow nomad to schedule based on the status of a client driver health check Slight updates for go style	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	0bde357731	add concept of health checks to fingerprinters and nodes fix up feedback from code review add driver info for all drivers to node	2018-03-21 15:15:25 -04:00
Preetha Appan	3c38eededd	Fix spelling in comment	2018-03-14 15:54:25 -05:00
Alex Dadgar	bef4a8ee09	fix clearing node events	2018-03-14 09:48:59 -07:00
Chelsea Komlo	810eedfa2a	Merge pull request #3945 from hashicorp/f-add-node-events Add node events	2018-03-14 08:42:55 -04:00

586 commits
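
Commit 4184eadaf4 above describes holding the allocLock only long enough to read the alloc runner map, then releasing it before applying changes to the runner itself. The following is a minimal sketch of that pattern, not the actual Nomad code: the `client` and `runner` types, their fields, and the use of a read lock are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// runner is a hypothetical stand-in for an alloc runner.
// It guards its own state, so callers need no outer lock.
type runner struct {
	mu      sync.Mutex
	version int
}

// Update applies a change under the runner's own lock.
func (r *runner) Update(v int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.version = v
}

// Version returns the last applied version.
func (r *runner) Version() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.version
}

// client is a hypothetical stand-in for the Nomad client.
// allocLock protects only the allocs map, not the runners in it.
type client struct {
	allocLock sync.RWMutex
	allocs    map[string]*runner
}

// updateAlloc looks the runner up while holding the map lock,
// releases the lock, and only then calls into the runner. A slow
// Update therefore cannot block other readers of the map.
func (c *client) updateAlloc(id string, v int) bool {
	c.allocLock.RLock()
	r, ok := c.allocs[id]
	c.allocLock.RUnlock()
	if !ok {
		return false
	}
	r.Update(v)
	return true
}

func main() {
	c := &client{allocs: map[string]*runner{"a1": {}}}
	fmt.Println(c.updateAlloc("a1", 2), c.allocs["a1"].Version())
	fmt.Println(c.updateAlloc("missing", 2))
}
```

The design choice is that the map lock and the per-runner lock are never held at the same time, so a runner that is shutting down or updating slowly delays only callers of that runner.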