open-nomad

Commit Graph

Author	SHA1	Message	Date
Michael Schurter	526af6a246	framer: fix early exit/truncation in framer	2018-05-02 10:46:16 -07:00
Michael Schurter	f1a6aa103a	framer: fix race and remove unused error var In the old code `sending` in the `send()` method shared the Data slice's underlying backing array with its caller. Clearing StreamFrame.Data didn't break the reference from the sent frame to the StreamFramer's data slice.	2018-05-02 10:46:16 -07:00
Michael Schurter	7360fe3a6d	client: squelch errors on cleanly closed pipes	2018-05-02 10:46:16 -07:00
Michael Schurter	ffff97e25f	client: don't spin on read errors	2018-05-02 10:46:16 -07:00
Michael Schurter	5ef0a82e6e	client: reset encoders between uses According to go/codec's docs, Reset(...) should be called on Decoders/Encoders before reuse: https://godoc.org/github.com/ugorji/go/codec I could find no evidence that not calling Reset() caused bugs, but might as well do what the docs say?	2018-05-02 10:46:16 -07:00
Alex Dadgar	de4af37249	version bump and remove generated	2018-04-27 11:10:00 -07:00
Alex Dadgar	845a43864a	generated files	2018-04-27 10:45:40 -07:00
Alex Dadgar	35e06ddb31	Remove generated and version bump	2018-04-26 16:49:19 -07:00
Alex Dadgar	43192cefae	generated files	2018-04-26 16:28:58 -07:00
Michael Schurter	0e602d4779	Merge pull request #4188 from hashicorp/f-rkt-stats rkt: create parent cgroup to enable stats	2018-04-24 14:54:36 -07:00
Michael Schurter	d687761ebf	rkt: test Stats() and always run tests Remove the NOMAD_TEST_RKT flag as a guard for rkt tests. Still require Linux, root, and rkt to be installed. Only check for rkt installation once in hopes of speeding up rkt tests a bit.	2018-04-24 11:05:42 -07:00
Javier Palomo Almena	3e6c01ffa1	docker tests: Fix usage of NewDriverContext	2018-04-23 22:51:06 +02:00
Javier Palomo Almena	74d3c5df07	DriverContext: Add the TaskGroup and the Job name Adding this fields to the DriverContext object, will allow us to pass them to the drivers. An use case for this, will be to emit tagged metrics in the drivers, which contain all relevant information: - Job - TaskGroup - Task - ... Ref: https://github.com/hashicorp/nomad/pull/4185	2018-04-23 00:15:29 +02:00
Michael Schurter	4cee6cca6c	rkt: create parent cgroup to enable stats Having the Nomad executor create parent cgroups that rkt is launched within allows the stats collection code used for the exec driver to Just Work. The only downside is that now the Nomad executor's resource utilization counts against the cgroups resource limits just as it does for the exec driver.	2018-04-19 15:14:56 -07:00
Michael Schurter	1a85d0c990	run goimports	2018-04-19 11:16:28 -07:00
Michael Schurter	d77c265d1f	Merge pull request #4168 from ninoles/b-2117-windows-group-process B 2117 windows group process	2018-04-19 11:10:51 -07:00
Michael Schurter	fdbcbd4e5b	Merge pull request #4058 from hashicorp/f-mock-by-default [Post-0.8] test: build with mock_driver by default	2018-04-18 15:57:00 -07:00
Michael Schurter	d3650fb2cd	test: build with mock_driver by default `make release` and `make prerelease` set a `release` tag to disable enabling the `mock_driver`	2018-04-18 14:45:33 -07:00
Michael Schurter	a991923389	tests: fix race in alloc_runner_test.go I could not reproduce the failure locally even with `stress -cpu ...` eating all the cpu it could on my machine. But I think the race was in one of two places: * The task could restart which could create new events * I think there could be a race between the updater's version of events and alloc runners as updates are async I fixed both. Here's hoping that fixes this flaky test.	2018-04-17 17:14:59 -07:00
Fabien Ninoles	c81bec48c9	Merge branch 'master' into b-2117-windows-group-process	2018-04-17 13:47:25 -04:00
Fabien Ninoles	35cf641416	Update based on PR request.	2018-04-17 13:43:04 -04:00
Alex Dadgar	c4ad76091d	Merge pull request #4166 from hashicorp/b-panic-fix-update Fixes races accessing node and updating it during fingerprinting	2018-04-17 10:02:19 -07:00
Chelsea Holland Komlo	9b8a079558	fix up comments	2018-04-17 11:53:08 -04:00
Alex Dadgar	9d612c8cb0	Cleanup	2018-04-16 15:48:34 -07:00
Alex Dadgar	32adaf9dfc	Copy the config given to the alloc runner	2018-04-16 15:45:52 -07:00
Alex Dadgar	3ff2d4d795	fix race node access	2018-04-16 15:45:51 -07:00
Alex Dadgar	4f2a7b6949	Fix copying drivers	2018-04-16 15:45:51 -07:00
Alex Dadgar	0b799822ff	Operate on copy	2018-04-16 15:45:49 -07:00
Fabien Ninoles	27cf4995ce	- Clean up for windows compilation. - Set CREATE_NEW_PROCESS_GROUP for Windows subprocess. - Ensure we only kill actual process that need to.	2018-04-14 13:58:42 -04:00
Michael Schurter	3836b8a335	Merge pull request #3572 from emate/master Create new process group on process startup.	2018-04-13 11:56:38 -07:00
Alex Dadgar	adaf4fa7e0	Remove generated structs	2018-04-12 16:35:31 -07:00
Alex Dadgar	663c4d0433	Version bump and generated files	2018-04-12 16:21:50 -07:00
Alex Dadgar	ff1a1a63e8	Move where attribute for driver detection is set	2018-04-12 15:50:25 -07:00
Chelsea Holland Komlo	5291788b40	delete driver name from only health check attributes	2018-04-12 18:24:41 -04:00
Alex Dadgar	3d53d380f7	Fix tests	2018-04-12 14:29:30 -07:00
Alex Dadgar	f24ce2c50c	Driver health detection cleanups This PR does: 1. Health message based on detection has format "Driver XXX detected" and "Driver XXX not detected" 2. Set initial health description based on detection status and don't wait for the first health check. 3. Combine updating attributes on the node, fingerprint and health checking update for drivers into a single call back. 4. Condensed driver info in `node status` only shows detected drivers and make the output less wide by removing spaces.	2018-04-12 12:46:40 -07:00
Charlie Voiselle	ba88f00ccb	Changed "til" to "until" Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.	2018-04-11 12:36:28 -05:00
Andrei Burd	502d17fa90	Added node class to tagged metrics	2018-04-11 12:20:59 +03:00
Chelsea Komlo	eb5aac16e6	Merge pull request #4111 from hashicorp/b-undetected-set-health-to-false Immediately set driver health status to false when driver moves to undetected	2018-04-10 18:30:31 -04:00
Chelsea Holland Komlo	d58b3e473c	update comment for when the fingerprinter setting health status	2018-04-10 16:53:00 -04:00
Chelsea Holland Komlo	f7ef13cc64	fingerprinter should set health check status if health check is not periodic	2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo	ede4f518bd	add setters for access to the fingerprint manager's node refactor extracting driver info	2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo	f479da19f5	guard against overwriting health status	2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo	ece1618815	immediately set healthy to false when driver moves to undetected	2018-04-10 15:29:51 -04:00
Alex Dadgar	3d367d6fd7	Fix client uptime metric missing client prefix	2018-04-10 10:39:36 -07:00
Seth Vargo	df4fe7e76c	Set user-agent when talking to GCE metadata	2018-04-10 10:36:46 -04:00
Chelsea Komlo	d3bd8fb96e	Merge pull request #4109 from hashicorp/f-shorten-docker-health-timeout Shorten docker health timeout	2018-04-09 15:38:39 -04:00
Chelsea Holland Komlo	ea4b65dd41	only initialize docker clients if they are nil	2018-04-09 14:13:07 -04:00
Chelsea Holland Komlo	288c7a33a1	refactoring simplification from code review	2018-04-09 10:34:17 -04:00
Chelsea Holland Komlo	6e3b056c37	only run health check if driver moves from undetected to detected	2018-04-09 10:10:43 -04:00
Alex Dadgar	ae1f76477e	Start rebalance after discovering new servers	2018-04-05 15:41:59 -07:00
Alex Dadgar	929b6823a3	Merge pull request #4106 from hashicorp/b-servers Improved Client handling of failed RPCs	2018-04-05 13:48:50 -07:00
Alex Dadgar	be2513e0f9	more jitter	2018-04-05 13:48:33 -07:00
Chelsea Holland Komlo	d3637825ef	group similar functions; update comments health check timeout should be 1 minute	2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo	e8743f1f7b	remove do once block when creating a new docker client only set cached connections upon no error	2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo	d0d793fc23	use client with shorter timeouts for health checks	2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo	5d1b2b77cb	refactor docker clients method to be able to extend to creating new clients	2018-04-05 16:19:02 -04:00
Alex Dadgar	bd3345942c	Handle no leader and faster retries near limit Handle the ErrNoLeader case and apply slower retries. Also when we have missed the heartbeat retry aggressively, backing off after we have missed for more than 30 seconds.	2018-04-05 11:22:47 -07:00
Alex Dadgar	279b5c22e5	Scale heartbeat retrying based on remaining heartbeat time	2018-04-05 10:58:13 -07:00
Alex Dadgar	7941f4eb2d	Fire retry only when consul discovers new servers	2018-04-05 10:40:17 -07:00
Preetha	6254d75eee	Merge pull request #4101 from hashicorp/b-rescheduling-edge-fixes Fixes edge cases around timing/ task finish time being set more than once	2018-04-04 16:18:21 -05:00
Preetha Appan	12ba4c45da	remove outdated commented out test code	2018-04-04 15:03:24 -05:00
Preetha Appan	6363a6fb4d	Remove old comment	2018-04-04 15:01:48 -05:00
Preetha Appan	5e4525bd30	Moves setting finishedAt to the right place and adds two unit tests.	2018-04-04 14:38:15 -05:00
Alex Dadgar	86c32358d4	Spelling error	2018-04-03 18:30:01 -07:00
Alex Dadgar	01a6beafbf	RPC Retry Watcher	2018-04-03 18:05:28 -07:00
Preetha Appan	e6bbce3fa0	Add comment	2018-04-03 19:49:03 -05:00
Alex Dadgar	ec844f19d9	randomize servers	2018-04-03 17:46:13 -07:00
Preetha Appan	00537c739b	Fixes edge cases around timing and task finish time being set more than once	2018-04-03 16:34:59 -05:00
Alex Dadgar	58a3ec3fb2	Improve Vault error handling	2018-04-03 14:29:22 -07:00
Alex Dadgar	86f9044676	remove generated files	2018-03-30 16:52:49 -07:00
Alex Dadgar	af81349dbe	Generated files	2018-03-30 16:14:40 -07:00
Michael Schurter	257ba5937d	test: don't rely on alloc runner update count We were incorrectly relying on the count of alloc updates in a number of tests. Since alloc updates are async, their number is non-deterministic and largely meaningless. This should fix quite a few flaky tests in Travis and prevent future mistaken assumptions in tests.	2018-03-30 09:34:33 -07:00
Michael Schurter	62e9553333	Merge pull request #4069 from hashicorp/f-hashealth add HasHealth helper for nil checks	2018-03-29 17:03:20 -07:00
Alex Dadgar	beee130a6e	Always capture the finish time	2018-03-29 11:27:22 -07:00
Michael Schurter	91b5bb58d9	add HasHealth helper for nil checks We performed the DeploymentStatus nil checks a couple different ways, so hopefully this helper will consolidate them and make it more clear what the code is doing.	2018-03-29 09:29:19 -07:00
Chelsea Komlo	4338360da9	Merge pull request #4065 from hashicorp/emit-node-event-on-first-health-change Emit first node event after initialization on health status change	2018-03-29 11:23:25 -04:00
Chelsea Holland Komlo	2174ede6b9	add clarifying comment	2018-03-29 10:58:39 -04:00
Michael Schurter	3a79c32677	Merge pull request #4059 from hashicorp/b-drain-health-svc-only only service allocs should have health watched	2018-03-28 16:49:22 -07:00
Michael Schurter	5eb0cb7176	only service allocs should have health watched	2018-03-28 16:20:11 -07:00
Chelsea Holland Komlo	e3319afee1	emit first node event	2018-03-28 17:26:53 -04:00
Chelsea Komlo	7812ac5abf	Merge pull request #4057 from hashicorp/specify-docker-msg Specify docker name in driver health messages	2018-03-28 13:32:36 -04:00
Preetha	177d2d6010	Merge pull request #4052 from hashicorp/f-specify-total-memory Allow to specify total memory on agent configuration	2018-03-28 12:28:41 -05:00
Chelsea Holland Komlo	efc03e252c	specify driver health messages	2018-03-28 11:35:21 -04:00
Preetha Appan	329428b49f	Code review feedback and unit test	2018-03-28 10:07:15 -05:00
Charlie Voiselle	ea10588227	rkt: logging enhancements (#4044) * Added extra debug logging; extended timeout; added jitter. * small log changes * increase timeout * remove unnecessary uuid	2018-03-27 17:30:06 -07:00
Michael Schurter	fcaee471a0	client: always mark exited sys/svc allocs as failed When restarts.attempts=0 was set in a jobspec a system or service alloc that exited with 0 status would be marked as `completed` instead of `failed`. Since system and service jobs are intended to run until stopped or updated, they should always be marked as failed when they exit even in cases where the exit code is 0.	2018-03-27 14:30:19 -07:00
Mildred Ki'Lya	1017cbe8ab	Allow to specify total memory on agent configuration Allow to set the total memory of an agent in its configuration file. This can be used in case the automatic detection doesn't work or in specific environments when memory overcommit (using swap for example) can be desirable.	2018-03-27 15:46:18 -05:00
Chelsea Holland Komlo	003bc209b9	use time.Time for node events for compatibility	2018-03-27 15:43:57 -04:00
Alex Dadgar	432784dae3	Fix alloc watcher snapshot streaming	2018-03-27 11:14:53 -07:00
Alex Dadgar	05449fea09	drop stats fetching log	2018-03-23 12:01:50 -07:00
Chelsea Komlo	5f0c382021	Merge pull request #4030 from hashicorp/health-check-ux UX improvements to driver health checks	2018-03-23 09:46:50 -04:00
Alex Dadgar	da27fc3880	Driver Info output	2018-03-22 17:18:32 -07:00
Chelsea Holland Komlo	e9005d8cfb	ux improvements to driver health checks	2018-03-22 18:38:29 -04:00
Michael Schurter	a318684738	Merge pull request #4022 from hashicorp/f-more-executor-logging executor: increase level for helpful log lines	2018-03-22 15:21:20 -07:00
Michael Schurter	a4f346abeb	remove spurious TODOs and FIXMEs	2018-03-21 16:55:22 -07:00
Michael Schurter	8b346c6176	test: try to prevent flakiness on travis	2018-03-21 16:51:45 -07:00
Michael Schurter	1b7ac447e9	alloc_runner: watch health for deployed batch jobs	2018-03-21 16:51:45 -07:00
Michael Schurter	62960ed7bd	client: don't monitor health of non-service jobs Also fix system job draining; won't work without deadline fixes	2018-03-21 16:51:44 -07:00
Alex Dadgar	a37329189a	Improve DeadlineTime helper	2018-03-21 16:51:44 -07:00
Alex Dadgar	db4a634072	RPC, FSM, State Store for marking DesiredTransistion fix build tag	2018-03-21 16:49:48 -07:00
Michael Schurter	bb0ff44fb4	mock_driver: improve Kill() logging	2018-03-21 16:49:48 -07:00
Michael Schurter	c0542474db	drain: initial drainv2 structs and impl	2018-03-21 16:49:48 -07:00
Chelsea Holland Komlo	f329e45e03	always set initial health status for every driver	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	bbaffe3eca	set driver to unhealthy once if it cannot be detected in periodic check	2018-03-21 15:15:26 -04:00
Alex Dadgar	5df4b3728d	Docker driver doesn't return errors but injects into the DriverInfo	2018-03-21 15:15:26 -04:00
Alex Dadgar	4365bb7f59	Only run health check if driver is detected	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	f801709a0a	fix issue when updating node events	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	285729aee2	function rename and re-arrange functions in fingerprint_manager	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	60f12d206f	improve comments; update watchDriver	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	739784736a	remove unused function	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d92703617c	simplify logic bump log level	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	86b7b3d2d9	fix up health check logic comparison; add node events to client driver checks	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	53a5bc2bb3	Code review feedback	2018-03-21 15:15:26 -04:00
Alex Dadgar	34dc58421c	notes from walk through	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	44b6951dda	improve tests	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d740a6a46e	refresh driver information for non-health checking drivers periodically	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d8f68e5ef8	fix up codereview feedback	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d5f6c940c4	fix up racy tests	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	0425be8f48	updating comments; locking concurrent node access	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	c50d02ae93	go style; update comments	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3aa726baab	fix scheduler driver name; create node structs file	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3cba95e8a7	allow nomad to schedule based on the status of a client driver health check Slight updates for go style	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	0bde357731	add concept of health checks to fingerprinters and nodes fix up feedback from code review add driver info for all drivers to node	2018-03-21 15:15:25 -04:00
Michael Schurter	1022170bf3	executor: increase level for helpful log lines Should help with debugging issues like #3971	2018-03-21 11:53:58 -07:00
Marcin Matlaszek	6019a88824	Make raw_exec processes cleanup function more precise.	2018-03-20 13:40:21 +01:00
Marcin Matlaszek	bb36c122e2	Fix errors when trying to kill whole process group.	2018-03-20 13:40:21 +01:00
Marcin Matlaszek	86d650d7b0	Make starting & cleaning process group Windows compatible.	2018-03-20 13:40:21 +01:00
Marcin Matlaszek	79c139f2ef	Create new process group on process startup. Clean up by sending SIGKILL to the whole process group.	2018-03-20 13:40:21 +01:00
Michael Schurter	1044bc0feb	Merge pull request #3984 from hashicorp/f-loosen-consul-skipverify Replace Consul TLSSkipVerify handling	2018-03-16 11:21:28 -07:00
Michael Schurter	32ee5e0d53	Merge pull request #3990 from hashicorp/f-rkt-groups rkt: allow specifying --group	2018-03-16 11:19:53 -07:00
Michael Schurter	bd78cfb039	rkt: allow specifying --group	2018-03-16 11:08:22 -07:00
Michael Schurter	fb10ec9c01	docker: make volume errors recoverable The interface+mock just to test this one little error handling may seem like overkill but there was just no other way to write an automated test around this logic as there's no way to simulate this error with stock Docker.	2018-03-15 17:52:43 -07:00
Michael Schurter	0971114f0c	Replace Consul TLSSkipVerify handling Instead of checking Consul's version on startup to see if it supports TLSSkipVerify, assume that it does and only log in the job service handler if we discover Consul does not support TLSSkipVerify. The old code would break TLSSkipVerify support if Nomad started before Consul (such as on system boot) as TLSSkipVerify would default to false if Consul wasn't running. Since TLSSkipVerify has been supported since Consul 0.7.2, it's safe to relax our handling.	2018-03-14 17:43:06 -07:00
Preetha Appan	3c38eededd	Fix spelling in comment	2018-03-14 15:54:25 -05:00
Alex Dadgar	bef4a8ee09	fix clearing node events	2018-03-14 09:48:59 -07:00
Chelsea Komlo	810eedfa2a	Merge pull request #3945 from hashicorp/f-add-node-events Add node events	2018-03-14 08:42:55 -04:00
Preetha	360d6e5a92	Merge pull request #3968 from hashicorp/f-nicer-vault-error Make server side error messages from vault more clearer	2018-03-13 20:49:39 -05:00
Alex Dadgar	de6ebb6e6c	small cleanup	2018-03-13 18:08:22 -07:00
Chelsea Holland Komlo	b41501e442	code review feedback	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	1488b076d1	code review feedback	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	a8655320fd	fix up go check warnings	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	0934769b04	add client side emitting of node events Changelog	2018-03-13 18:08:21 -07:00
Preetha Appan	914eaed64f	Address some code review comments	2018-03-13 18:19:16 -05:00
Preetha Appan	09c231ce43	Return the err from server correctly	2018-03-13 18:10:14 -05:00
Preetha Appan	9618f52746	Remove error wrapping and make vault connection server side errors clearer.	2018-03-13 17:09:03 -05:00
Michael Schurter	79df90acb0	Merge pull request #3958 from simplesurance/swappiness fix: disable swap for executor_linux allocations	2018-03-13 10:10:22 -07:00
Fabian Holler	e6af051c93	fix: disable swap for executor_linux allocations A comment in the nomad source code states that swapping for executor_linux allocations is disabled but it wasn't. Nomad wrote -1 to the memsw.limit_in_bytes cgroup file to disable swapping. This has the following problems: 1.) Writing -1 to the file does not disable swapping. It sets the limit for memory and swap to unlimited. 2.) On common Linux distributions like Ubuntu 16.04 LTS the memsw.limit_in_bytes cgroup file does not exist by default. The memsw.limit_in_bytes file only exist if the Linux kernel is build with CONFIG_MEMCG_SWAP=yes and either CONFIG_MEMCG_SWAP_ENABLED=yes or when the kernel parameter swapaccount=1 is passed during boot. Most Linux distributions disable swap accounting by default because of higher memory usage. Nomad silently ignores if writing to the memsw.limit_in_bytes file fails. The allocation succeeds, no message is logged to notify the user. To ensure that disabling swap works on common Linux kernels, disable swapping by writing 0 to the memory.swappiness file. Using the memory.swappiness file only requires that the kernel is compiled with CONFIG_MEMCG=yes. This is the default in common Linux kernels.	2018-03-13 10:52:50 +01:00
Alex Dadgar	4844317cc2	Merge pull request #3890 from hashicorp/b-heartbeat Heartbeat improvements and handling failures during establishing leadership	2018-03-12 14:41:59 -07:00
Michael Schurter	7dd7fbcda2	non-Existent -> nonexistent Reverting from #3963 https://www.merriam-webster.com/dictionary/existent	2018-03-12 11:59:33 -07:00
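
Commit f1a6aa103a above describes a race caused by a sent frame sharing a slice's backing array with its producer. The Go sketch below illustrates that general aliasing hazard and the usual fix of copying before handing data off. It is not the Nomad code: the `frame` type and the `sendShared` and `sendCopied` helpers are hypothetical names invented for this illustration.

```go
package main

import "fmt"

// frame is a hypothetical stand-in for a stream frame carrying a payload.
type frame struct {
	Data []byte
}

// sendShared returns a frame whose Data aliases buf's backing array,
// so later writes through buf are visible in the frame.
func sendShared(buf []byte) frame {
	return frame{Data: buf}
}

// sendCopied returns a frame that owns a private copy of the payload.
func sendCopied(buf []byte) frame {
	data := make([]byte, len(buf))
	copy(data, buf)
	return frame{Data: data}
}

func main() {
	buf := []byte("abc")
	shared := sendShared(buf)
	copied := sendCopied(buf)

	// The producer reuses its buffer: truncating keeps the same backing
	// array, so the append overwrites the bytes the shared frame points at.
	buf = append(buf[:0], "xyz"...)

	fmt.Println(string(shared.Data)) // prints "xyz": the sent frame changed
	fmt.Println(string(copied.Data)) // prints "abc": the copy is unaffected
}
```

When a consumer goroutine reads the shared frame while the producer keeps appending, the same aliasing becomes a data race, which is why copying (or allocating a fresh slice per frame) is the common remedy.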


3132 Commits