open-nomad

Author	SHA1	Message	Date
Nick Ethier	f192c3752a	client: refactor post allocrunnerv2 finalization	2018-10-16 16:56:56 -07:00
Nick Ethier	4a4c7dbbfc	client: begin driver plugin integration client: fingerprint driver plugins	2018-10-16 16:56:56 -07:00
Alex Dadgar	45e41cca03	allocrunnerv2 -> allocrunner	2018-10-16 16:56:56 -07:00
Alex Dadgar	6c9d9d5173	move files around	2018-10-16 16:56:55 -07:00
Michael Schurter	960f3be76c	client: expose task state to client The interesting decision in this commit was to expose AR's state and not a fully materialized Allocation struct. AR.clientAlloc builds an Alloc that contains the task state, so I considered simply memoizing and exposing that method. However, that would lead to AR having two awkwardly similar methods: - Alloc() - which returns the server-sent alloc - ClientAlloc() - which returns the fully materialized client alloc Since ClientAlloc() could be memoized it would be just as cheap to call as Alloc(), so why not replace Alloc() entirely? Replacing Alloc() entirely would require Update() to immediately materialize the task states on server-sent Allocs as there may have been local task state changes since the server received an Alloc update. This quickly becomes difficult to reason about: should Update hooks use the TaskStates? Are state changes caused by TR Update hooks immediately reflected in the Alloc? Should AR persist its copy of the Alloc? If so, are its TaskStates canonical or the TaskStates on TR? So! Forget that. Let's separate the static Allocation from the dynamic AR & TR state! - AR.Alloc() is for static Allocation access (often for the Job) - AR.AllocState() is for the dynamic AR & TR runtime state (deployment status, task states, etc). If code needs to know the status of a task: AllocState() If code needs to know the names of tasks: Alloc() It should be very easy for a developer to reason about which method they should call and what they can do with the return values.	2018-10-16 16:56:55 -07:00
Michael Schurter	8d1419c62b	client: fix accessing alloc runners * GetClientAlloc() gains nothing from using allAllocs() * getAllocatedResources was calling getAllocRunners() twice	2018-10-16 16:56:55 -07:00
Michael Schurter	e6e2930a00	tr: implement stats collection hook Tested except for the net/rpc specific error case which may need changing in the gRPC world.	2018-10-16 16:53:31 -07:00
Alex Dadgar	cebfead6bc	add logger back	2018-10-16 16:53:30 -07:00
Alex Dadgar	8504505c0d	client uses passed logger and fix fingerprinters	2018-10-16 16:53:30 -07:00
Michael Schurter	9d1ea3b228	client: hclog-ify most of the client Leaving fingerprinters in case that interface changes with plugins.	2018-10-16 16:53:30 -07:00
Michael Schurter	e42154fc46	implement stopping, destroying, and disk migration * Stopping an alloc is implemented via Updates but update hooks are not run. * Destroying an alloc is a best effort cleanup. * AllocRunner destroy hooks implemented. * Disk migration and blocking on a previous allocation exiting moved to its own package to avoid cycles. Now only depends on alloc broadcaster instead of also using a waitch. * AllocBroadcaster now only drops stale allocations and always keeps the latest version. * Made AllocDir safe for concurrent use Lots of internal contexts that are currently unused. Unsure if they should be used or removed.	2018-10-16 16:53:30 -07:00
Michael Schurter	4236255686	lots of comment/log fixes	2018-10-16 16:53:30 -07:00
Michael Schurter	357641c364	persist alloc state on changes, not periodically Allow alloc and task runners to persist their own state when something changes instead of periodically syncing all state.	2018-10-16 16:53:30 -07:00
Michael Schurter	a3fe0510d1	Move all encoding and put deduping into state db Still WIP as it does not handle deletions.	2018-10-16 16:53:30 -07:00
Michael Schurter	533bc93b3a	implement all boltdb interactions behind StateDB	2018-10-16 16:53:30 -07:00
Michael Schurter	a5d3e3fb0a	Implement alloc updates in arv2 Updates are applied asynchronously but sequentially	2018-10-16 16:53:30 -07:00
Michael Schurter	a4b4d7b266	consul service hook Deregistration works but difficult to test due to terminal updates not being fully implemented in the new client/ar/tr.	2018-10-16 16:53:29 -07:00
Michael Schurter	5be982e674	restore vault client	2018-10-16 16:53:29 -07:00
Alex Dadgar	fd3bc1bd39	Update state with server	2018-10-16 16:53:29 -07:00
Michael Schurter	7f4ec50906	missed locking around c.allocs access	2018-10-16 16:53:29 -07:00
Michael Schurter	516d641db0	client: implement all-or-nothing alloc restoration Restoring calls NewAR -> Restore -> Run NewAR now calls NewTR AR.Restore calls TR.Restore AR.Run calls TR.Run	2018-10-16 16:53:29 -07:00
Alex Dadgar	80f6ce50c0	vault hook	2018-10-16 16:53:29 -07:00
Michael Schurter	b360f6f96e	fix hclog level	2018-10-16 16:53:29 -07:00
Michael Schurter	4f43ff5c51	pass statedb into allocrunnerv2	2018-10-16 16:53:29 -07:00
Michael Schurter	0f7dcfdc9a	example redis job "runs" on arv2! see below Tons left to do and lots of churn: 1. No state saving 2. No shutdown or gc 3. Removed AR factory for now 4. Made all "Config" structs local to the package they configure 5. Added allocID to GC to avoid a lookup Really hating how many things use *structs.Allocation. It's not bad without state saving, but if AllocRunner starts updating its copy things get racy fast.	2018-10-16 16:53:29 -07:00
Alex Dadgar	01f8e5b95f	renames	2018-10-04 14:57:25 -07:00
Alex Dadgar	52f9cd7637	fixing tests	2018-10-04 14:26:19 -07:00
Alex Dadgar	5c8697667e	Node reserved resources	2018-09-29 18:44:55 -07:00
Alex Dadgar	3183153315	Node resources on client	2018-09-29 17:23:41 -07:00
Alex Dadgar	9971b3393f	yamux	2018-09-17 14:22:40 -07:00
Alex Dadgar	7739ef51ce	agent + consul	2018-09-13 10:43:40 -07:00
Michael Schurter	08862fc177	fix race around error handling	2018-09-05 17:34:17 -07:00
Preetha	043f4c208b	Merge pull request #3882 from burdandrei/telemetry-add-node-class-tag Added node class to tagged metrics	2018-06-21 17:04:35 -05:00
Alex Dadgar	b61051b3cd	Merge pull request #4409 from hashicorp/r-client-packages Refactor client packages	2018-06-13 17:32:25 -07:00
Alex Dadgar	90c2108bfb	Fix gc tests + parallel destroy + small test fixes	2018-06-12 10:23:45 -07:00
Alex Dadgar	f5ff509fa5	Refactor - wip	2018-06-12 10:23:45 -07:00
Chelsea Holland Komlo	f74e74b22d	add client logic to determine whether TLS RPC connections should reload	2018-06-08 14:38:58 -04:00
Chelsea Holland Komlo	064b5481e0	add server join info to server and client	2018-05-31 10:50:03 -07:00
Chelsea Holland Komlo	38f611a7f2	refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing add missing fields to TLS merge method	2018-05-23 18:35:30 -04:00
Chelsea Holland Komlo	796bae6f1b	allow configurable cipher suites disallow 3DES and RC4 ciphers add documentation for tls_cipher_suites	2018-05-09 17:15:31 -04:00
Chelsea Holland Komlo	9b8a079558	fix up comments	2018-04-17 11:53:08 -04:00
Alex Dadgar	9d612c8cb0	Cleanup	2018-04-16 15:48:34 -07:00
Alex Dadgar	32adaf9dfc	Copy the config given to the alloc runner	2018-04-16 15:45:52 -07:00
Alex Dadgar	4f2a7b6949	Fix copying drivers	2018-04-16 15:45:51 -07:00
Alex Dadgar	0b799822ff	Operate on copy	2018-04-16 15:45:49 -07:00
Alex Dadgar	ff1a1a63e8	Move where attribute for driver detection is set	2018-04-12 15:50:25 -07:00
Alex Dadgar	f24ce2c50c	Driver health detection cleanups This PR does: 1. Health message based on detection has format "Driver XXX detected" and "Driver XXX not detected" 2. Set initial health description based on detection status and don't wait for the first health check. 3. Combine updating attributes on the node, fingerprint and health checking update for drivers into a single call back. 4. Condensed driver info in `node status` only shows detected drivers and make the output less wide by removing spaces.	2018-04-12 12:46:40 -07:00
Andrei Burd	502d17fa90	Added node class to tagged metrics	2018-04-11 12:20:59 +03:00
Alex Dadgar	3d367d6fd7	Fix client uptime metric missing client prefix	2018-04-10 10:39:36 -07:00
Alex Dadgar	ae1f76477e	Start rebalance after discovering new servers	2018-04-05 15:41:59 -07:00
Alex Dadgar	be2513e0f9	more jitter	2018-04-05 13:48:33 -07:00
Alex Dadgar	bd3345942c	Handle no leader and faster retries near limit Handle the ErrNoLeader case and apply slower retries. Also when we have missed the heartbeat retry aggressively, backing off after we have missed for more than 30 seconds.	2018-04-05 11:22:47 -07:00
Alex Dadgar	279b5c22e5	Scale heartbeat retrying based on remaining heartbeat time	2018-04-05 10:58:13 -07:00
Alex Dadgar	7941f4eb2d	Fire retry only when consul discovers new servers	2018-04-05 10:40:17 -07:00
Alex Dadgar	86c32358d4	Spelling error	2018-04-03 18:30:01 -07:00
Alex Dadgar	01a6beafbf	RPC Retry Watcher	2018-04-03 18:05:28 -07:00
Alex Dadgar	58a3ec3fb2	Improve Vault error handling	2018-04-03 14:29:22 -07:00
Chelsea Holland Komlo	2174ede6b9	add clarifying comment	2018-03-29 10:58:39 -04:00
Chelsea Holland Komlo	e3319afee1	emit first node event	2018-03-28 17:26:53 -04:00
Chelsea Holland Komlo	efc03e252c	specify driver health messages	2018-03-28 11:35:21 -04:00
Chelsea Holland Komlo	003bc209b9	use time.Time for node events for compatibility	2018-03-27 15:43:57 -04:00
Chelsea Holland Komlo	f801709a0a	fix issue when updating node events	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	60f12d206f	improve comments; update watchDriver	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	739784736a	remove unused function	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d92703617c	simplify logic bump log level	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	86b7b3d2d9	fix up health check logic comparison; add node events to client driver checks	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	53a5bc2bb3	Code review feedback	2018-03-21 15:15:26 -04:00
Alex Dadgar	34dc58421c	notes from walk through	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	44b6951dda	improve tests	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	0425be8f48	updating comments; locking concurrent node access	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	c50d02ae93	go style; update comments	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3aa726baab	fix scheduler driver name; create node structs file	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3cba95e8a7	allow nomad to schedule based on the status of a client driver health check Slight updates for go style	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	0bde357731	add concept of health checks to fingerprinters and nodes fix up feedback from code review add driver info for all drivers to node	2018-03-21 15:15:25 -04:00
Preetha Appan	3c38eededd	Fix spelling in comment	2018-03-14 15:54:25 -05:00
Alex Dadgar	bef4a8ee09	fix clearing node events	2018-03-14 09:48:59 -07:00
Chelsea Komlo	810eedfa2a	Merge pull request #3945 from hashicorp/f-add-node-events Add node events	2018-03-14 08:42:55 -04:00
Preetha	360d6e5a92	Merge pull request #3968 from hashicorp/f-nicer-vault-error Make server side error messages from vault more clearer	2018-03-13 20:49:39 -05:00
Alex Dadgar	de6ebb6e6c	small cleanup	2018-03-13 18:08:22 -07:00
Chelsea Holland Komlo	b41501e442	code review feedback	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	1488b076d1	code review feedback	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	a8655320fd	fix up go check warnings	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	0934769b04	add client side emitting of node events Changelog	2018-03-13 18:08:21 -07:00
Preetha Appan	914eaed64f	Address some code review comments	2018-03-13 18:19:16 -05:00
Preetha Appan	09c231ce43	Return the err from server correctly	2018-03-13 18:10:14 -05:00
Preetha Appan	9618f52746	Remove error wrapping and make vault connection server side errors clearer.	2018-03-13 17:09:03 -05:00
Alex Dadgar	4844317cc2	Merge pull request #3890 from hashicorp/b-heartbeat Heartbeat improvements and handling failures during establishing leadership	2018-03-12 14:41:59 -07:00
Josh Soref	173ce63fe9	spelling: transition	2018-03-11 19:06:05 +00:00
Josh Soref	782c704de6	spelling: thresholds	2018-03-11 19:03:47 +00:00
Josh Soref	8149694f3a	spelling: server	2018-03-11 18:55:30 +00:00
Josh Soref	258d76ec13	spelling: registry	2018-03-11 18:41:13 +00:00
Josh Soref	3c1ce6d16d	spelling: otherwise	2018-03-11 18:34:27 +00:00
Josh Soref	1ef6d6319e	spelling: labels	2018-03-11 18:21:44 +00:00
Josh Soref	52b83328fc	spelling: heartbeating	2018-03-11 18:12:19 +00:00
Josh Soref	c9b86bbc2f	spelling: controls	2018-03-11 17:50:39 +00:00
Josh Soref	e78cf9c81a	spelling: already	2018-03-11 17:39:04 +00:00
Josh Soref	b8b46d3f74	spelling: allocation	2018-03-11 17:37:22 +00:00
Chelsea Holland Komlo	122d1c4e4a	simplify retry logic	2018-03-01 09:48:26 -05:00
Chelsea Holland Komlo	355805db56	reset timer after updating node copy	2018-02-27 17:18:10 -05:00
Chelsea Holland Komlo	a72aaaf47f	add network resources equal method, use time ticker remove impossible test case	2018-02-27 12:42:53 -05:00
Chelsea Holland Komlo	e736e31820	use time ticker, update how network resources are compared	2018-02-26 18:47:11 -05:00
Chelsea Holland Komlo	5059065b52	improved testing; node networks comparison	2018-02-26 15:55:38 -05:00
Chelsea Holland Komlo	1f31b39fe8	code review fixups	2018-02-26 12:36:30 -05:00
Chelsea Holland Komlo	ed8c8afbcd	edge trigger node update test update config copy trigger	2018-02-26 12:36:04 -05:00
Alex Dadgar	49a47483d1	Registering back to initializing Fix a bug in which if the node attributes/meta changed, we would re-register the node in status initializing. This would incorrectly trigger the client to log that it missed its heartbeat. It would change the status of the Node to initializing until the next heartbeat occurred.	2018-02-16 17:49:31 -08:00
Alex Dadgar	eff4455c68	Fix original client server list behavior	2018-02-15 16:04:53 -08:00
Alex Dadgar	f9cf642436	Client tls	2018-02-15 15:22:57 -08:00
Alex Dadgar	e685211892	Code review feedback	2018-02-15 13:59:02 -08:00
Alex Dadgar	2c0ad26374	New RPC Modes and basic setup for streaming RPC handlers	2018-02-15 13:59:01 -08:00
Alex Dadgar	9bc75f0ad4	Fix manager tests and make testagent recover from port conflicts	2018-02-15 13:59:01 -08:00
Alex Dadgar	3f1f8604bb	initial round of comment review	2018-02-15 13:59:01 -08:00
Alex Dadgar	c8c1284bc3	SetServer command actually returns an error if given an invalid server	2018-02-15 13:59:01 -08:00
Alex Dadgar	3f786b904b	use server manager	2018-02-15 13:59:01 -08:00
Alex Dadgar	6dd1c9f49d	Refactor	2018-02-15 13:59:00 -08:00
Alex Dadgar	1472b943d6	Stats Endpoint	2018-02-15 13:59:00 -08:00
Chelsea Holland Komlo	4a26959825	code review feedback	2018-02-07 18:10:55 -05:00
Chelsea Holland Komlo	d626d24488	remove dependency on client for fingerprint manager	2018-02-07 18:10:45 -05:00
Chelsea Holland Komlo	e012e5ab8a	add fingerprint manager	2018-02-07 18:10:33 -05:00
Chelsea Holland Komlo	b21233fe23	update log message	2018-02-01 19:46:57 -05:00
Chelsea Holland Komlo	6f9c0ab361	req/resp should be within config locks; rename for detected fingerprints changelog	2018-02-01 19:00:39 -05:00
Chelsea Holland Komlo	b8e8064835	code review fixup	2018-01-31 18:34:03 -05:00
Chelsea Holland Komlo	7b53474a6e	add applicable boolean to fingerprint response public fields and remove getter functions	2018-01-31 13:21:45 -05:00
Chelsea Holland Komlo	9482c322b7	locks for fingerprint reads/writes	2018-01-30 11:32:45 -05:00
Chelsea Holland Komlo	7c19de797c	create safe getters and setters for fingerprint response	2018-01-26 11:22:05 -05:00
Chelsea Holland Komlo	896d6f8058	fixups from code review	2018-01-26 07:04:32 -05:00
Chelsea Holland Komlo	9a8344333b	refactor Fingerprint to request/response construct	2018-01-24 11:54:02 -05:00
Chelsea Holland Komlo	649f86f094	refactor creating a new tls configuration	2018-01-16 08:02:39 -05:00
Chelsea Holland Komlo	6c9f9c8ac3	adding additional test assertions; differentiate reloading agent and http server	2018-01-16 07:34:39 -05:00
Chelsea Holland Komlo	214d128eb9	reload raft transport layer fix up linting	2018-01-08 14:52:28 -05:00
Chelsea Holland Komlo	0708d34135	call reload on agent, client, and server separately	2018-01-08 09:56:31 -05:00
Chelsea Holland Komlo	9741097406	reloading tls config should be atomic for clients/servers	2018-01-08 09:21:06 -05:00
Chelsea Holland Komlo	ae7fc4695e	fixups from code review Revert "close raft long-lived connections" This reverts commit 3ffda28206fcb3d63ad117fd1d27ae6f832b6625. reload raft connections on changing tls	2018-01-08 09:21:06 -05:00
Chelsea Holland Komlo	acd3d1b162	fix up downgrading client to plaintext add locks around changing server configuration	2018-01-08 09:21:06 -05:00
Chelsea Holland Komlo	c0ad9a4627	add ability to upgrade/downgrade nomad agents tls configurations via sighup	2018-01-08 09:21:06 -05:00
Alex Dadgar	91ffbbb517	Review feedback	2017-12-07 16:10:57 -08:00
Alex Dadgar	02baa6c52b	Handle race between fingerprinters and registration	2017-12-07 13:09:37 -08:00
Alex Dadgar	4409fdacc0	Drop trace logging	2017-12-06 18:02:24 -08:00
Alex Dadgar	cd9a7f14b8	Add logging around heartbeats	2017-12-06 17:57:50 -08:00
Chelsea Komlo	2dfda33703	Nomad agent reload TLS configuration on SIGHUP (#3479) * Allow server TLS configuration to be reloaded via SIGHUP * dynamic tls reloading for nomad agents * code cleanup and refactoring * ensure keyloader is initialized, add comments * allow downgrading from TLS * initialize keyloader if necessary * integration test for tls reload * fix up test to assert success on reloaded TLS configuration * failure in loading a new TLS config should remain at current Reload only the config if agent is already using TLS * reload agent configuration before specific server/client lock keyloader before loading/caching a new certificate * introduce a get-or-set method for keyloader * fixups from code review * fix up linting errors * fixups from code review * add lock for config updates; improve copy of tls config * GetCertificate only reloads certificates dynamically for the server * config updates/copies should be on agent * improve http integration test * simplify agent reloading storing a local copy of config * reuse the same keyloader when reloading * Test that server and client get reloaded but keep keyloader * Keyloader exposes GetClientCertificate as well for outgoing connections * Fix spelling * correct changelog style	2017-11-14 17:53:23 -08:00
Michael Schurter	1769db98b7	Fix regression by returning error on unknown alloc	2017-11-01 15:16:38 -05:00
Michael Schurter	73e9b57908	Trigger GCs after alloc changes GC much more aggressively by triggering GCs when allocations become terminal as well as after new allocations are added.	2017-11-01 15:16:38 -05:00
Michael Schurter	2a81160dcd	Fix GC'd alloc tracking The Client.allocs map now contains all AllocRunners again, not just un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs allocs. Also stops logging "marked for GC" twice.	2017-11-01 15:16:38 -05:00
Alex Dadgar	4831380e57	Node access is done using locked Node copy Fixes https://github.com/hashicorp/nomad/issues/3454 Reliably reproduced the data race before by having a fingerprinter change the nodes attributes every millisecond and syncing at the same rate. With fix, did not ever panic.	2017-10-27 13:27:24 -07:00
Michael Schurter	15b991e039	base64 migrate token HTTP header values must be ASCII. Also constant time compare tokens and test the generate and compare helper functions.	2017-10-13 10:59:13 -07:00
Chelsea Holland Komlo	e1c4701a43	fix up build warnings	2017-10-11 17:11:57 -07:00
Chelsea Holland Komlo	b018ca4d46	fixing up code review comments	2017-10-11 17:09:20 -07:00
Chelsea Holland Komlo	410adaf726	Add functionality for authenticated volumes	2017-10-11 17:09:20 -07:00
Michael Schurter	a66c53d45a	Remove `structs` import from `api` Goes a step further and removes structs import from api's tests as well by moving GenerateUUID to its own package.	2017-09-29 10:36:08 -07:00
Alex Dadgar	4173834231	Enable more linters	2017-09-26 15:26:33 -07:00
Chelsea Holland Komlo	b26454cf99	Move setGaugeForAllocationStats to emitClientMetrics	2017-09-25 16:05:49 +00:00
Alex Dadgar	d306da846c	changelog and feedback	2017-09-14 14:08:58 -07:00
Alex Dadgar	07ed83fdd5	Non-locked accessors to common Node fields This PR removes locking around commonly accessed node attributes that do not need to be locked. The locking could cause nodes to TTL as the heartbeat code path was acquiring a lock that could be held for an excessively long time. An example of this is when Vault is inaccessible, since the fingerprint is run with a lock held but the Vault fingerprinter makes the API calls with a large timeout. Fixes https://github.com/hashicorp/nomad/issues/2689	2017-09-14 14:08:26 -07:00
Chelsea Holland Komlo	848af92183	fix panic in emitting tagged metrics	2017-09-11 15:32:37 +00:00
Chelsea Holland Komlo	0ef43c3c5f	final code review fixups	2017-09-05 18:47:44 +00:00
Chelsea Holland Komlo	a8cbd0b559	fixups from code review	2017-09-05 14:13:34 +00:00
Chelsea Holland Komlo	f72e4aad13	labels depend on full setup of client beforehand	2017-09-05 14:13:34 +00:00
Chelsea Holland Komlo	87a814397d	refactor to use baseLabels	2017-09-05 14:13:34 +00:00
Chelsea Holland Komlo	b2953d905a	pass in commonly used values	2017-09-05 14:13:34 +00:00
Chelsea Holland Komlo	c634043069	create base labels to be used in every metric	2017-09-05 14:13:34 +00:00
Chelsea Holland Komlo	f5ea83da8d	emit metrics using labels, add option for backwards compatibility	2017-09-05 14:12:57 +00:00
Armon Dadgar	76a03f2d8e	Address @dadgar feedback	2017-09-04 13:05:53 -07:00
Armon Dadgar	688897561b	client: adding token cache for ACL resolution	2017-09-04 13:05:36 -07:00
Armon Dadgar	c2e72e8a9c	client: create ACL and Policy cache	2017-09-04 13:05:35 -07:00
Michael Schurter	7342e23669	Move migrating state into prevAllocWatcher	2017-08-14 16:02:28 -07:00
Michael Schurter	e41a654917	switch from alloc blocker to new interface interface has 3 implementations: 1. local for blocking and moving data locally 2. remote for blocking and moving data from another node 3. noop for allocs that don't need to block	2017-08-11 16:21:35 -07:00
Michael Schurter	ee04717a0b	initial attempt at refactoring blocked/migrating	2017-08-11 16:21:35 -07:00
Alex Dadgar	ecee5e370e	initial watcher	2017-07-07 12:07:08 -07:00
Michael Schurter	644f0cfaa4	Consistently quote alloc ids in client logs	2017-07-06 10:24:52 -07:00
Michael Schurter	4fd9ef6a8c	Tiny client race condition fix Plus some logging improvements that may help with #2563	2017-07-05 16:15:19 -07:00
Michael Schurter	596727230b	Suggest wiping out alloc dir too	2017-07-03 12:29:21 -07:00
Michael Schurter	11f68bfca2	Add more logging to restore state errors	2017-07-03 11:58:41 -07:00
Mark Mickan	c196d320f8	Add tests for migrating symlinks in alloc and local directories	2017-06-04 15:56:22 +09:30
Mark Mickan	236f24c9a4	Include symlinks in snapshots when migrating disks Fixes #2685	2017-06-04 00:36:18 +09:30
Alex Dadgar	b1eea2269a	Fix deadlock	2017-05-31 14:05:47 -07:00
Michael Schurter	ffc2b36dc7	Merge pull request #2636 from hashicorp/f-gc-alloc-limit Add new gc_max_allocs tuneable	2017-05-30 16:14:09 -07:00
Michael Schurter	dd51aa1cb9	Merge pull request #2654 from hashicorp/f-env-consul Add envconsul-like support and refactor environment handling	2017-05-30 14:40:14 -07:00
Alex Dadgar	28aef447e9	Fix perms to just set exec bit	2017-05-25 14:44:13 -07:00
Michael Schurter	fd9bef768f	Move task env into execcontext Also inject PATH into rkt commands since we're no longer appending host env vars for it.	2017-05-23 13:53:34 -07:00
Michael Schurter	3841692138	gc_max_allocs should include blocked & migrating	2017-05-12 16:03:22 -07:00
Michael Schurter	0453c2709c	Add new gc_max_allocs tuneable More than gc_max_allocs may be running on a node, but terminal allocs will be garbage collected to try to keep the total number below the limit.	2017-05-11 17:18:02 -07:00
Alex Dadgar	68c3a2bd98	Fix vet errors	2017-05-11 13:08:08 -07:00
Alex Dadgar	843bc26e5d	Respond to comments	2017-05-09 10:50:24 -07:00
Alex Dadgar	e00f9c9413	Restore state + upgrade path	2017-05-02 18:21:49 -07:00
Alex Dadgar	ec101b4760	Revert "metrics" This reverts commit 4d6a012c6fb6f1fba6c62985d091b1a20c3198e7.	2017-05-02 09:28:11 -07:00
Alex Dadgar	8e516b5dc2	Async and sync saving of client state	2017-05-01 16:16:53 -07:00
Alex Dadgar	a7fd08d42a	perf	2017-05-01 16:01:50 -07:00
Alex Dadgar	e010fdf8c0	metrics	2017-05-01 14:51:27 -07:00
Alex Dadgar	b94f855326	boltDB database for client state	2017-05-01 14:50:34 -07:00
Michael Schurter	e204a287ed	Refactor Consul Syncer into new ServiceClient Fixes #2478 #2474 #1995 #2294 The new client only handles agent and task service advertisement. Server discovery is mostly unchanged. The Nomad client agent now handles all Consul operations instead of the executor handling task related operations. When upgrading from an earlier version of Nomad existing executors will be told to deregister from Consul so that the Nomad agent can re-register the task's services and checks. Drivers - other than qemu - now support an Exec method for executing arbitrary commands in a task's environment. This is used to implement script checks. Interfaces are used extensively to avoid interacting with Consul in tests that don't assert any Consul related behavior.	2017-04-19 12:42:47 -07:00
Alex Dadgar	2321e8a4a0	Hash host ID so its stable and well distributed This PR takes the host ID and runs it through a hash so that it is well distributed. This makes it so that machines that report similar host IDs are easily distinguished. Instances of similar IDs occur on EC2 where the ID is prefixed and on motherboards created in the same batch. Fixes https://github.com/hashicorp/nomad/issues/2534	2017-04-10 11:44:51 -07:00
Alex Dadgar	81b78f77e1	Track task start/finish time & improve logs errors This PR adds tracking to when a task starts and finishes and the logs API takes advantage of this and returns better errors when asking for logs that do not exist.	2017-03-31 16:14:11 -07:00
Alex Dadgar	5e7e19de4b	Merge pull request #2461 from hashicorp/b-groups Various fixes for setting user/group of task	2017-03-28 11:13:27 -07:00
Alex Dadgar	4ecebe7d8c	Proper reference counting through task restarts This PR fixes an issue in which the reference count on a Docker image would become inflated through task restarts.	2017-03-25 17:05:53 -07:00
Alex Dadgar	a171a014b3	Various fixes for setting user/group of task This PR fixes two issues: * Folder permissions in -dev mode were incorrect and not suitable for running as a particular user. * Was not setting the group membership properly for the launched process. Fixes https://github.com/hashicorp/nomad/issues/2160	2017-03-20 14:21:13 -07:00
Alex Dadgar	70e4feb045	Limit parallelism during garbage collection This PR introduces a parallelism limit during garbage collection. This is used to avoid large resource usage spikes if garbage collecting many allocations at once.	2017-03-10 16:27:00 -08:00
Alex Dadgar	9011a7984c	Add metrics to show allocations on the client This PR adds the following metrics to the client: client.allocations.migrating client.allocations.blocked client.allocations.pending client.allocations.running client.allocations.terminal Also adds some missing fields to the API version of the evaluation.	2017-03-09 12:37:41 -08:00
Alex Dadgar	5be806a3df	Fix vet script and fix vet problems This PR fixes our vet script and fixes all the missed vet changes. It also fixes pointers being printed in `nomad stop <job>` and `nomad node-status <node>`.	2017-02-27 16:00:19 -08:00
Alex Dadgar	6910678c21	Allow random UUID	2017-02-27 13:42:37 -08:00
Alex Dadgar	7203dee7ab	Add allocated/unallocated metrics to client	2017-02-16 18:28:11 -08:00
Sean Chittenden	c4c321c770	Unconditionally lowercase the node ID read from disk.	2017-02-06 16:20:17 -08:00
Sean Chittenden	adb5be23ef	Add better verification of a host's HostID.	2017-02-02 16:24:32 -08:00
Sean Chittenden	bb4347e277	Slight mis-merge: secret-id in dev mode is random and needs to be returned.	2017-02-01 22:20:52 -08:00
Sean Chittenden	bb422a2258	Generate a durable NodeID if possible, otherwise fall back to a random HostID.	2017-02-01 22:11:33 -08:00
Diptanu Choudhury	11d7cb1230	Making the GC related fields tunable	2017-01-31 15:51:20 -08:00
Diptanu Choudhury	84a491f85a	Locking appropriately before closing the channel to indicate migration	2017-01-23 10:46:57 -08:00
Michael Schurter	054ee8df59	Fix index we get allocs by	2017-01-20 16:30:40 -08:00
Diptanu Choudhury	1999b7eebb	Merge pull request #2159 from hashicorp/b-consul-config Fixed merging consul config	2017-01-18 16:14:54 -08:00
Diptanu Choudhury	e927de02d2	Moved functions to helper from structs	2017-01-18 15:55:14 -08:00
Alex Dadgar	5d2b56b387	Random wait	2017-01-11 13:24:23 -08:00
Alex Dadgar	c19985244a	GetAllocs uses a blocking query This PR makes GetAllocs use a blocking query as well as adding a sanity check to the client's watchAllocation code to ensure it gets the correct allocations. This PR fixes https://github.com/hashicorp/nomad/issues/2119 and https://github.com/hashicorp/nomad/issues/2153. The issue was that the client was talking to two different servers, one to check which allocations to pull and the other to pull those allocations. However the latter call was not with a blocking query and thus the client would not retrieve the allocations it requested. The logging has been improved to make the problem more clear as well.	2017-01-10 13:30:35 -08:00
Michael Schurter	86fcf96f72	Put a logger in AllocDir/TaskDir	2017-01-05 16:31:56 -08:00
Diptanu Choudhury	247bda9a88	Unlocking if we return before adding a new alloc runner	2017-01-05 13:18:48 -08:00
Diptanu Choudhury	9721a1ab04	Fixed how alloc lock is held	2017-01-05 13:06:56 -08:00
Michael Schurter	13064768ac	Fix race when shutting down in dev mode Client.Shutdown holds the allocLock when destroying alloc runners in dev mode. Client.updateAllocStatus can be called during AllocRunner shutdown and calls getAllocRunners which tries to acquire allocLock.RLock. This deadlocks since Client.Shutdown already has the write lock. Switching Client.Shutdown to use getAllocRunners and not hold a lock during AllocRunner shutdown is the solution.	2017-01-03 17:21:50 -08:00
Michael Schurter	4a9a574d9d	Merge pull request #2054 from hashicorp/f-prestart Add Driver.Prestart method	2016-12-20 16:18:56 -08:00
Diptanu Choudhury	b6120e2fc8	Removing the alloc runner from GC if it is destroyed by the server	2016-12-20 11:14:22 -08:00
Diptanu Choudhury	6e6e0d364a	Added comments	2016-12-20 10:49:48 -08:00
Diptanu Choudhury	36b5545d6b	Making the gc allocator understand real disk usage	2016-12-16 18:34:59 -08:00
Diptanu Choudhury	7aef9bcabe	Added the stats collector to GC	2016-12-14 15:11:11 -08:00
Diptanu Choudhury	e855cd587b	Refactored hoststats collector	2016-12-14 15:07:42 -08:00
Diptanu Choudhury	0ffd92668d	GC-ing before we start a new allocation	2016-12-14 15:04:06 -08:00
Diptanu Choudhury	afdaa979f7	Added a garbage collector for allocations	2016-12-14 15:01:12 -08:00
Alex Dadgar	648ad2ebc5	Merge pull request #2096 from hashicorp/b-addAlloc Fix race and remove panic	2016-12-13 13:50:17 -08:00
Diptanu Choudhury	53fb09023c	cancelling waiting for remote allocation if the alloc doesn't need migration	2016-12-13 13:06:33 -08:00
Alex Dadgar	3cbd237512	Fix race and remove panic	2016-12-13 12:34:23 -08:00
Christoffer Kylvåg	6a1f32b8ba	#1680 : Continue after not being able to stat a mountpoint	2016-12-13 12:28:57 +01:00
Diptanu Choudhury	cbf73908ff	Setting the appropriate file permissions while un-archiving compressed alloc dir	2016-12-05 17:04:43 -08:00
Diptanu Choudhury	bc17cacca0	Merge pull request #2017 from hashicorp/b-sticky Not moving alloc data when sticky is turned off	2016-12-05 14:11:45 -08:00
Diptanu Choudhury	21f49564d3	Not moving alloc data when sticky is turned off	2016-12-05 14:00:01 -08:00
Michael Schurter	770ed703d0	Add Driver.Prestart method The Driver.Prestart method currently does very little but lays the foundation for where lifecycle plugins can interleave execution _after_ task environment setup but _before_ the task starts. Currently Prestart does two things: * Any driver specific task environment building * Download Docker images This change also attaches a TaskEvent emitter to Drivers, so they can emit events during task initialization.	2016-12-02 11:03:48 -08:00
Alex Dadgar	86ed1fb2e5	Disallow stale queries when deriving Vault tokens This PR disallows stale queries when deriving a Vault token. Allowing stale queries could result in the allocation not existing on the server that is servicing the request.	2016-12-01 11:13:36 -08:00
Alex Dadgar	ec4d6936ff	add debug panic	2016-11-29 15:57:40 -08:00
Diptanu Choudhury	f67217297c	Ensuring allocs are not added multiple times to blocking queue	2016-11-29 11:19:37 -08:00
Alex Dadgar	88c7e04348	Check for Ephemeral Disk being nil	2016-11-15 10:03:06 -08:00
Alex Dadgar	ee921ccbb2	Merge pull request #1949 from carlpett/blacklist-fingerprints-and-drivers Support blacklisting fingerprinters	2016-11-09 10:31:17 -08:00
Calle Pettersson	4304755c12	Address comments from PR	2016-11-09 11:50:16 +01:00
Calle Pettersson	8632696e2d	Add blacklisting of drivers	2016-11-08 18:30:07 +01:00
Calle Pettersson	b603bb007e	Add blacklisting of fingerprinters	2016-11-08 18:29:44 +01:00
Alex Dadgar	9015e79aaa	Add compatibility code for secret ID while upgrading cluster in both server/client mode on single nodes	2016-11-07 16:52:08 -08:00
Diptanu Choudhury	1a8fa8c8d5	Making Nomad TLS configs region aware	2016-11-01 11:55:29 -07:00
Diptanu Choudhury	4079545a92	Making the client use tls if the node from which migration has to be made has enabled tls	2016-10-31 10:20:04 -07:00
Michael Schurter	cc115fe984	Swap log line classifiers to be consistent	2016-10-28 14:59:48 -07:00
Diptanu Choudhury	3182d0454f	Adding the alloc if we can't find the TG	2016-10-27 15:45:10 -07:00
Diptanu Choudhury	0682a1a113	Not blocking for remote alloc if the alloc is not sticky	2016-10-27 12:04:55 -07:00
Alex Dadgar	150b678a6b	Merge pull request #1806 from hashicorp/f-docker4mac-fixes A couple fixes to make Docker For Mac work	2016-10-27 09:29:40 -07:00
Diptanu Choudhury	50ca5e1e9d	Merge pull request #1853 from hashicorp/f-rpc-http-tls TLS support for http and RPC	2016-10-25 16:14:43 -07:00
Diptanu Choudhury	7c61e115bd	Moved tlsutil into helpers	2016-10-25 16:05:37 -07:00
Diptanu Choudhury	353e7fc7f1	Moving the certs into tlsutil package	2016-10-25 16:01:53 -07:00
Diptanu Choudhury	cf35aeac84	Moving the TLSConfig to structs	2016-10-25 15:57:38 -07:00
Alex Dadgar	03eba049ed	Merge pull request #1848 from hashicorp/f-vault-error Thread through whether DeriveToken error is recoverable or not	2016-10-24 15:01:18 -07:00

713 commits