open-nomad

Author	SHA1	Message	Date
Mahmood Ali	6301725002	logmon: revert workaround for Windows go1.11 bug Revert e0126123ab1ba848f72458538bc6118c978245e6 now that we are running with Golang 1.12, and https://github.com/golang/go/issues/29119 is no longer relevant.	2019-08-24 08:19:44 -04:00
Mahmood Ali	df1f3eb9ee	Merge pull request #6201 from hashicorp/b-device-stats-interval initialize device manager stats interval	2019-08-24 08:16:03 -04:00
Mahmood Ali	07b5f4c530	Merge pull request #6146 from hashicorp/b-config-template-copy clientConfig.Copy() to copy template config too	2019-08-23 19:00:57 -04:00
Mahmood Ali	b98568774b	clientConfig.Copy() to copy template config too	2019-08-23 18:43:22 -04:00
Lang Martin	4f6493a301	taskrunner getter set Umask for go-getter, setuid test	2019-08-23 15:59:03 -04:00
Mahmood Ali	3890619100	initialize device manager stats interval Fixes a bug where the CPU is pegged at 100% due to collecting device statistics. The passed stats interval was ignored, and the default zero value causes a very tight loop of stats collection. FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a `g2.2xlarge` ec2 instance. The stats interval defaults to 1 second and is user configurable. I believe this is too frequent as a default, and I may advocate for reducing it to a value closer to 5s or 10s, but keeping it as is for now. Fixes https://github.com/hashicorp/nomad/issues/6057 .	2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet	cbdc1978bf	Consul service meta (#6193 ) * adds meta object to service in job spec, sends it to consul * adds tests for service meta * fix tests * adds docs * better hashing for service meta, use helper for copying meta when registering service * tried to be DRY, but looks like it would be more work to use the helper function	2019-08-23 12:49:02 -04:00
Nick Ethier	96d379071d	ar: fix bridge networking port mapping when port.To is unset (#6190 )	2019-08-22 21:53:52 -04:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, bootstrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Michael Schurter	b008fd1724	connect: register group services with Consul Fixes #6042 Add new task group service hook for registering group services like Connect-enabled services. Does not yet support checks.	2019-08-20 12:25:10 -07:00
lchayoun	2307c9d1d2	allow dash in non generated environment variable names - should only clean generated environment variables	2019-08-16 11:11:47 +03:00
Nick Ethier	965f00b2fc	Builtin Admission Controller Framework (#6116 ) * nomad: add admission controller framework * nomad: add admission controller framework and Consul Connect hooks * run admission controllers before checking permissions * client: add default node meta for connect configurables * nomad: remove validateJob func since it has been moved to admission controller * nomad: use new TaskKind type * client: use consts for connect sidecar image and log level * Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad: add job register test with connect sidecar * Update nomad/job_endpoint_hooks.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-15 11:22:37 -04:00
lchayoun	c5a38a045a	allow dash in non generated environment variable names - should only clean generated environment variables	2019-08-13 19:23:13 +03:00
Tim Gross	03433f35d4	client/template: configuration for function blacklist and sandboxing When rendering a task template, the `plugin` function is no longer permitted by default and will raise an error. An operator can opt-in to permitting this function with the new `template.function_blacklist` field in the client configuration. When rendering a task template, path parameters for the `file` function will be treated as relative to the task directory by default. Relative paths or symlinks that point outside the task directory will raise an error. An operator can opt-out of this protection with the new `template.disable_file_sandbox` field in the client configuration.	2019-08-12 16:34:48 -04:00
Danielle Lancashire	7e6c8e5ac1	Copy documentation to api/tasks	2019-08-12 16:22:27 +02:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	e132a30899	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle Lancashire	6ef8d5233e	client: Add volume_hook for mounting volumes	2019-08-12 15:39:08 +02:00
Danielle Lancashire	063e4240c1	client: Add parsing and registration of HostVolume configuration	2019-08-12 15:39:08 +02:00
lchayoun	ca892163b2	allow dash in non generated environment variable names	2019-08-11 12:51:42 +03:00
Nick Ethier	7806f4c597	Revert "client: add autofetch for CNI plugins" This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.	2019-08-08 15:10:19 -04:00
Nick Ethier	7d28ece8de	Revert "client: remove debugging lines" This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.	2019-08-08 14:52:52 -04:00
Liel Chayoun	24dcb2379c	Update env_test.go	2019-08-06 11:59:31 +03:00
Mahmood Ali	b17bac5101	Render consul templates using task env only (#6055 ) When rendering a task consul template, ensure that only task environment variables are used. Currently, `consul-template` always falls back to host process environment variables when key isn't a task env var[1]. Thus, we add an empty entry for each host process env-var not found in task env-vars. [1] `bfa5d0e133/template/funcs.go (L61-L75)`	2019-08-05 16:30:47 -04:00
Mahmood Ali	f66169cd6a	Merge pull request #6065 from hashicorp/b-nil-driver-exec Check if driver handle is nil before execing	2019-08-02 09:48:28 -05:00
Mahmood Ali	a4670db9b7	Check if driver handle is nil before execing Defend against tr.getDriverHandle being nil. Exec handler checks if task is running, but it may be stopped between check and driver handler fetching.	2019-08-02 10:07:41 +08:00
Nick Ethier	7de0bec8ab	client/cni: updated comments and simplified logic to auto download plugins	2019-07-31 01:04:10 -04:00
Nick Ethier	b16640c50d	Apply suggestions from code review Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2019-07-31 01:04:10 -04:00
Nick Ethier	321d10a041	client: remove debugging lines	2019-07-31 01:04:09 -04:00
Nick Ethier	af6b191963	client: add autofetch for CNI plugins	2019-07-31 01:04:09 -04:00
Nick Ethier	1e9dd1b193	remove unused file	2019-07-31 01:04:09 -04:00
Nick Ethier	09a4cfd8d7	fix failing tests	2019-07-31 01:04:07 -04:00
Nick Ethier	ef83f0831b	ar: plumb client config for networking into the network hook	2019-07-31 01:04:06 -04:00
Nick Ethier	af66a35924	networking: Add new bridge networking mode implementation	2019-07-31 01:04:06 -04:00
Michael Schurter	fb487358fb	connect: add group.service stanza support	2019-07-31 01:04:05 -04:00
Nick Ethier	63c5504d56	ar: fix lint errors	2019-07-31 01:03:19 -04:00
Nick Ethier	e312201d18	ar: rearrange network hook to support building on windows	2019-07-31 01:03:19 -04:00
Nick Ethier	370533c9c7	ar: fix test that failed due to error renaming	2019-07-31 01:03:19 -04:00
Nick Ethier	2d60ef64d9	plugins/driver: make DriverNetworkManager interface optional	2019-07-31 01:03:19 -04:00
Nick Ethier	f87e7e9c9a	ar: plumb error handling into alloc runner hook initialization	2019-07-31 01:03:18 -04:00
Nick Ethier	ef1795b344	ar: add tests for network hook	2019-07-31 01:03:18 -04:00
Nick Ethier	15989bba8e	ar: cleanup lint errors	2019-07-31 01:03:18 -04:00
Nick Ethier	220cba3e7e	ar: move linux specific code to its own file and add tests	2019-07-31 01:03:18 -04:00
Nick Ethier	548f78ef15	ar: initial driver based network management	2019-07-31 01:03:17 -04:00
Nick Ethier	66c514a388	Add network lifecycle management Adds new Prerun and Postrun hooks to manage set up of network namespaces on linux. Work still needs to be done to make the code platform agnostic and support Docker style network initialization.	2019-07-31 01:03:17 -04:00
Preetha Appan	d048029b5a	remove generated code and change version to 0.10.0	2019-07-30 15:56:05 -05:00
Nomad Release bot	e39fb11531	Generate files for 0.9.4 release	2019-07-30 19:05:18 +00:00
Preetha Appan	6b4c40f5a8	remove generated code	2019-07-23 12:07:49 -05:00
Nomad Release bot	04187c8b86	Generate files for 0.9.4-rc1 release	2019-07-22 21:42:36 +00:00
Michael Schurter	d90680021e	logmon: fix comment formatting	2019-07-22 13:05:01 -07:00
Michael Schurter	e37bc3513c	logmon: ensure errors are still handled properly ...and add a comment to switch back to the old error handling once we switch to Go 1.12.	2019-07-22 12:49:48 -07:00
Danielle Lancashire	1bcbbbfbe6	logmon: Workaround golang/go#29119 There's a bug in go1.11 that causes some io operations on windows to return incorrect errors for some cases when Stat-ing files. To avoid upgrading to go1.12 in a point release, here we loosen up the cases where we will attempt to create fifos, and add some logging of underlying stat errors to help with debugging.	2019-07-22 18:28:12 +02:00
Jasmine Dahilig	2157f6ddf1	add formatting for hcl parsing error messages (#5972 )	2019-07-19 10:04:39 -07:00
Mahmood Ali	cd6f1d3102	Update consul-template dependency to latest To pick up the fix in https://github.com/hashicorp/consul-template/pull/1231 .	2019-07-18 07:32:03 +07:00
Mahmood Ali	8a82260319	log unrecoverable errors	2019-07-17 11:01:59 +07:00
Mahmood Ali	1a299c7b28	client/taskrunner: fix stats retry logic Previously, if a channel is closed, we retry the Stats call. But, if that call fails, we go in a backoff loop without calling Stats ever again. Here, we use a utility function for calling driverHandle.Stats call that retries as one expects. I aimed to preserve the logging formats but made small improvements as I saw fit.	2019-07-11 13:58:07 +08:00
Preetha Appan	7d645c5ad9	Test file for detect content type that satisfies linter and encoding	2019-07-10 11:42:04 -05:00
Preetha Appan	ef9a71c68b	code review feedback	2019-07-10 10:41:06 -05:00
Preetha Appan	990e468edc	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Preetha Appan	108a292cc0	fix linting failure in test case file	2019-07-08 11:29:12 -05:00
Michael Lange	b2e9570075	Use consistent casing in the JSON representation of the AllocFileInfo struct	2019-07-02 17:27:31 -07:00
Preetha Appan	8495fb9055	Added additional test cases and fixed go test case	2019-07-02 13:25:29 -05:00
Mahmood Ali	a97d451ac7	Merge pull request #5905 from hashicorp/b-ar-failed-prestart Fail alloc if alloc runner prestart hooks fail	2019-07-02 20:25:53 +08:00
Danielle	c6872cdf12	Merge pull request #5864 from hashicorp/dani/win-pipe-cleaner windows: Fix restarts using the raw_exec driver	2019-07-02 13:58:56 +02:00
Danielle Lancashire	e20300313f	fifo: Safer access to Conn	2019-07-02 13:12:54 +02:00
Mahmood Ali	f10201c102	run post-run/post-stop task runner hooks Handle when prestart failed while restoring a task, to prevent accidentally leaking consul/logmon processes.	2019-07-02 18:38:32 +08:00
Mahmood Ali	4afd7835e3	Fail alloc if alloc runner prestart hooks fail When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are: * Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861 * Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed * Alloc not being restarted/rescheduled to another node (as it's still in pending state) * Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time! Here, we treat all tasks to have failed if alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node. While it's desirable to retry alloc runner in such failures, I opted to treat it out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow up PR. This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840 .	2019-07-02 18:35:47 +08:00
Mahmood Ali	7614b8f09e	Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2 task runner to avoid running task if terminal	2019-07-02 15:31:17 +08:00
Mahmood Ali	7bfad051b9	address review comments	2019-07-02 14:53:50 +08:00
Mahmood Ali	c0c00ecc07	Merge pull request #5906 from hashicorp/b-alloc-stale-updates client: defensive against getting stale alloc updates	2019-07-02 12:40:17 +08:00
Preetha Appan	c09342903b	Improve test cases for detecting content type	2019-07-01 16:24:48 -05:00
Danielle Lancashire	688f82f07d	fifo: Close connections and cleanup lock handling	2019-07-01 14:14:29 +02:00
Danielle Lancashire	2c7d1f1b99	logmon: Add windows compatibility test	2019-07-01 14:14:06 +02:00
Mahmood Ali	c5f5a1fcb9	client: defensive against getting stale alloc updates When fetching node alloc assignments, be defensive against a stale read before killing local nodes allocs. The bug is when both client and servers are restarting and the client requests the node allocation for the node, it might get stale data as server hasn't finished applying all the restored raft transaction to store. Consequently, client would kill and destroy the alloc locally, just to fetch it again moments later when server store is up to date. The bug can be reproduced quite reliably with single node setup (configured with persistence). I suspect it's too edge-casey to occur in production cluster with multiple servers, but we may need to examine leader failover scenarios more closely. In this commit, we only remove and destroy allocs if the removal index is more recent than the alloc index. This seems like a cheap resiliency fix we already use for detecting alloc updates. A more proper fix would be to ensure that a nomad server only serves RPC calls when state store is fully restored or up to date in leadership transition cases.	2019-06-29 04:17:35 -05:00
Preetha Appan	3345ce3ba4	Infer content type in alloc fs stat endpoint	2019-06-28 20:31:28 -05:00
Danielle Lancashire	e1151f743b	appveyor: Run logmon tests	2019-06-28 16:01:41 +02:00
Danielle Lancashire	634ada671e	fifo: Require that fifos do not exist for create Although this operation is safe on linux, it is not safe on Windows when using the named pipe interface. To provide a ~reasonable common api abstraction, here we switch to returning File exists errors on the unix api.	2019-06-28 13:47:18 +02:00
Danielle Lancashire	0ff27cfc0f	vendor: Use dani fork of go-winio	2019-06-28 13:47:18 +02:00
Danielle Lancashire	514a2a6017	logmon: Refactor fifo access for windows safety On unix platforms, it is safe to re-open fifos for reading after the first creation if the file is already a fifo, however this is not possible on windows where this triggers a permissions error on the socket path, as you cannot recreate it. We can't transparently handle this in the CreateAndRead handle, because the Access Is Denied error is too generic to reliably be an IO error. Instead, we add an explicit API for opening a reader to an existing FIFO, and check to see if the fifo already exists inside the calling package (e.g. logmon)	2019-06-28 13:41:54 +02:00
Mahmood Ali	3d89ae0f1e	task runner to avoid running task if terminal This change fixes a bug where nomad would avoid running alloc tasks if the alloc is client terminal but the server copy on the client isn't marked as running. Here, we fix the case by having task runner use the allocRunner.shouldRun() instead of only checking the server updated alloc. Here, we preserve most of the invariants such that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles. Fixes https://github.com/hashicorp/nomad/issues/5883	2019-06-27 11:27:34 +08:00
Danielle Lancashire	b9ac184e1f	tr: Fetch Wait channel before killTask in restart Currently, if killTask results in the termination of a process before calling WaitTask, Restart() will incorrectly return a TaskNotFound error when using the raw_exec driver on Windows.	2019-06-26 15:20:57 +02:00
Mahmood Ali	b209584dce	Merge pull request #5726 from hashicorp/b-plugins-via-init Use init() to handle plugin invocation	2019-06-18 21:09:03 -04:00
Mahmood Ali	ac64509c59	comment on use of init() for plugin handlers	2019-06-18 20:54:55 -04:00
Chris Baker	f71114f5b8	cleanup test	2019-06-18 14:15:25 +00:00
Chris Baker	a2dc351fd0	formatting and clarity	2019-06-18 14:00:57 +00:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Mahmood Ali	962921f86c	Use init to handle plugin invocation Currently, nomad "plugin" processes (e.g. executor, logmon, docker_logger) are started as CLI commands to be handled by command CLI framework. Plugin launchers use `discover.NomadBinary()` to identify the binary and start it. This has a few downsides: The trivial one is that when running tests, one must re-compile the nomad binary as the tests need to invoke the nomad executable to start plugin. This is frequently overlooked, resulting in puzzlement. The more significant issue with `executor` in particular is in relation to external driver: * Plugin must identify the path of invoking nomad binary, which is not trivial; `discover.NomadBinary()` now returns the path to the plugin rather than to nomad, preventing external drivers from launching executors. * The external driver may get a different version of executor than it expects (especially if we make a binary incompatible change in future). This commit addresses both downsides by having the plugin invocation handling through an `init()` call, similar to how libcontainer init handler is done in [1] and recommended by libcontainer [2]. `init()` will be invoked and handled properly in tests and external drivers. For external drivers, this change will cause external drivers to launch the executor that's compiled against. There are a couple of downsides to this approach: * These specific packages (i.e. executor, logmon, and dockerlog) need to be careful in use of `init()`, package initializers. Must avoid having command execution rely on any other init in the package. I prefixed files with `z_` (golang processes files in lexical order), but ensured we don't depend on order. * The command handling is spread in multiple packages making it a bit less obvious how plugin starts are handled. [1] drivers/shared/executor/libcontainer_nsenter_linux.go [2] `eb4aeed24f/libcontainer (using-libcontainer)`	2019-06-13 16:48:01 -04:00
Jasmine Dahilig	ed9740db10	Merge pull request #5664 from hashicorp/f-http-hcl-region backfill region from hcl for jobUpdate and jobPlan	2019-06-13 12:25:01 -07:00
Jasmine Dahilig	51e141be7a	backfill region from job hcl in jobUpdate and jobPlan endpoints - updated region in job metadata that gets persisted to nomad datastore - fixed many unrelated unit tests that used an invalid region value (they previously passed because hcl wasn't getting picked up and the job would default to global region)	2019-06-13 08:03:16 -07:00
Mahmood Ali	e31159bf1f	Prepare for 0.9.4 dev cycle	2019-06-12 18:47:50 +00:00
Nomad Release bot	4803215109	Generate files for 0.9.3 release	2019-06-12 16:11:16 +00:00
Danielle	f923b568e0	Merge pull request #5821 from hashicorp/dani/b-5770 trhooks: Add TaskStopHook interface to services	2019-06-12 17:30:49 +02:00
Danielle Lancashire	c326344b57	trt: Fix test	2019-06-12 17:06:11 +02:00
Danielle Lancashire	13d76e35fd	trhooks: Add TaskStopHook interface to services We currently only run cleanup Service Hooks when a task is either Killed, or Exited. However, due to the implementation of a task runner, tasks are only Exited if they ever correctly started running, which is not true when you receive an error early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770	2019-06-12 16:00:21 +02:00
Mahmood Ali	2acf30fdd3	Fallback to `alloc.TaskResources` for old allocs When a client is running against an old server (e.g. running 0.8), `alloc.AllocatedResources` may be nil, and we need to check the deprecated `alloc.TaskResources` instead. Fixes https://github.com/hashicorp/nomad/issues/5810	2019-06-11 10:32:53 -04:00
Mahmood Ali	7a4900aaa4	client/allocrunner: depend on internal task state Alloc runner already tracks tasks associated with alloc. Here, we become defensive by relying on the alloc runner tracked tasks, rather than depend on server never updating the job unexpectedly.	2019-06-10 18:42:51 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	935ee86e92	Merge pull request #5737 from fwkz/fix-restart-attempts Fix restart attempts of `restart` stanza in `delay` mode.	2019-06-05 19:05:07 -04:00
Mahmood Ali	97957fbf75	Prepare for 0.9.3 dev cycle	2019-06-05 14:54:00 +00:00
Nomad Release bot	43bfbf3fcc	Generate files for 0.9.2 release	2019-06-05 11:59:27 +00:00
Mahmood Ali	a9f81f2daa	client config flag to disable remote exec This exposes a client flag to disable nomad remote exec support in environments where access to tasks ought to be restricted. I used `disable_remote_exec` client flag that defaults to allowing remote exec. Opted for a client config that can be used to disable remote exec globally, or to a subset of the cluster if necessary.	2019-06-03 15:31:39 -04:00
Mahmood Ali	a4ead8ff79	remove 0.9.2-rc1 generated code	2019-05-23 11:14:24 -04:00
Nomad Release bot	6d6bc59732	Generate files for 0.9.2-rc1 release	2019-05-22 19:29:30 +00:00
Michael Schurter	a54511b304	Merge pull request #5731 from hashicorp/b-ignore-dc client: drop unused DC field from servers list	2019-05-22 08:42:15 -07:00
Mahmood Ali	84419f08ce	client: synchronize client.invalidAllocs access invalidAllocs may be accessed and manipulated from different goroutines, so must be locked.	2019-05-22 09:37:49 -04:00
Danielle Lancashire	27583ed8c1	client: Pass servers contacted ch to allocrunner This fixes an issue where batch and service workloads would never be restarted due to indefinitely blocking on a nil channel. It also raises the restoration logging message to `Info` to simplify log analysis.	2019-05-22 13:47:35 +02:00
Mahmood Ali	9df1e00f35	tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal Given that Signal may be called multiple times, blocking for `SignalCh` isn't sufficient to synchronize access to Signals field.	2019-05-21 13:56:58 -04:00
Mahmood Ali	b06e585713	Merge pull request #5739 from hashicorp/r-rm-logmon-syslog-deadcode logmon: remove syslog server deadcode	2019-05-21 11:46:48 -04:00
Mahmood Ali	eca23bf9c4	Merge pull request #5742 from hashicorp/b-test-fixes-20190520 Grab bag of (primarily race) test fixes	2019-05-21 11:46:36 -04:00
Mahmood Ali	e88bb61488	Merge pull request #5740 from hashicorp/b-nomad-exec-term-race exec: allow drivers to handle stream termination	2019-05-21 11:24:12 -04:00
Mahmood Ali	b475ccbe3e	client: synchronize access to ar.alloc `allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's use `allocRunner.Alloc()` helper function to access it.	2019-05-21 09:55:05 -04:00
Mahmood Ali	2a7b073167	tests: fix fifo lib race Accidentally accessed outer `err` variable inside a goroutine	2019-05-21 09:49:56 -04:00
Mahmood Ali	296bd41c9e	tests: fix data race in client TestDriverManager_Fingerprint_Periodic	2019-05-21 09:49:56 -04:00
Mahmood Ali	d9e59eece0	tests: fix client TestFS_Stream data race Close is invoked in a different goroutine from test	2019-05-21 09:49:56 -04:00
Mahmood Ali	75e0a3f405	exec: allow drivers to handle stream termination Without this change, alloc_endpoint cancels the context passed to handler when we detect EOF. This races with the driver setting the exit code; and we run into a case where the exec process terminates cleanly yet we attempt to mark it as failed with context error. Here, we rely on the driver to handle errors returned from Stream and without racing to set an error.	2019-05-21 09:40:25 -04:00
Mahmood Ali	974bcbecc9	logmon: remove syslog server deadcode Remove unused syslog server related code that got replaced by the docker logger in Nomad 0.9	2019-05-21 09:36:43 -04:00
fwkz	8b84bec95a	Fix restart attempts of `restart` stanza. Number of restarts during 2nd interval is off by one.	2019-05-21 13:27:19 +02:00
Michael Schurter	d41abda957	client: drop unused DC field from servers list See #5730 for details.	2019-05-20 14:19:15 -07:00
Michael Schurter	2fe0768f3b	docs: changelog entry for #5669 and fix comment	2019-05-14 10:54:00 -07:00
Michael Schurter	af9096c8ba	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	e07f73bfe0	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	d7e5ace1ed	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	3b1f8991a1	client: log when server list changes Stop logging in the happy path when nothing has changed.	2019-05-13 15:42:55 -07:00
Michael Schurter	48db8135da	Merge pull request #5492 from hashicorp/f-allocated-mem client: expose allocated memory per task	2019-05-13 13:31:22 -07:00
Lang Martin	1d03a43ce2	Merge pull request #5642 from hashicorp/b-network-fingerprinting-ipv4 network fingerprinting multiple IPs on the configured network device	2019-05-13 11:46:53 -04:00
Michael Schurter	1c4e585fa7	client: expose allocated memory per task Related to #4280 This PR adds `client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge in bytes to metrics to ease calculating how close a task is to OOMing. ``` 'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000 'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000 'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000 'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000 'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000 ```	2019-05-10 11:12:12 -07:00
Lang Martin	f6bc45dd23	client improve a comment in updateNetworks	2019-05-10 11:25:04 -04:00
Mahmood Ali	919827f2df	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	ab2cae0625	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Preetha	1d02886bb6	Merge pull request #5654 from hashicorp/b-hearbeat-lockfix Remove unnecessary locking and serverlist syncing in heartbeats	2019-05-08 13:36:39 -05:00
Preetha Appan	3289e7f4a0	fix typo and add one more test scenario	2019-05-08 10:54:22 -05:00
Preetha Appan	db6b291a5a	code review feedback	2019-05-07 16:23:32 -05:00
Chris Baker	93ec1293be	stale allocation data leads to incorrect (and even negative) metrics (#5637 ) * client: was not using up-to-date client state in determining which alloc count towards allocated resources * Update client/client.go Co-Authored-By: cgbaker <cgbaker@hashicorp.com>	2019-05-07 15:54:36 -04:00
Preetha Appan	b063fc81a4	Remove unnecessary locking and serverlist syncing in heartbeats This removes an unnecessary shared lock between discovery and heartbeating which was causing heartbeats to be missed upon retries when a single server fails. Also made a drive by fix to call the periodic server shuffler goroutine.	2019-05-06 14:44:55 -05:00
Michael Schurter	8c7b3ff45a	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:01:30 -05:00
Michael Schurter	e19fa33f9c	Remove unnecessary boolean clause Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:00:17 -05:00
Preetha Appan	b99a204582	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Lang Martin	c32cce51f0	client fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Lang Martin	94f23016a2	client_test new test fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Mahmood Ali	7a32d3f3aa	client: handle 0.8 server network resources Fixes https://github.com/hashicorp/nomad/issues/5587 When a nomad 0.9 client is handling an alloc generated by a nomad 0.8 server, we should check the alloc.TaskResources for networking details rather than task.Resources. We check alloc.TaskResources for networking for other tasks in the task group [1], so it's a bit odd that we used the task.Resources struct here. TaskRunner also uses `alloc.TaskResources`[2]. The task.Resources struct in 0.8 was sparsely populated, resulting in 0 being stored in port mapping env vars: ``` vagrant@nomad-server-01:~$ nomad version Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES) vagrant@nomad-server-01:~$ nomad server members Name Address Port Status Leader Protocol Build Datacenter Region nomad-server-01.global 10.199.0.11 4648 alive true 2 0.8.7 dc1 global vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.Job.TaskGroups[0].Tasks[0].Resources.Networks' [ { "CIDR": "", "Device": "", "DynamicPorts": [ { "Label": "db", "Value": 0 } ], "IP": "", "MBits": 10, "ReservedPorts": null } ] vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.TaskResources' { "redis": { "CPU": 500, "DiskMB": 0, "IOPS": 0, "MemoryMB": 256, "Networks": [ { "CIDR": "", "Device": "eth1", "DynamicPorts": [ { "Label": "db", "Value": 21722 } ], "IP": "10.199.0.21", "MBits": 10, "ReservedPorts": null } ] } } ``` Also, updated the test values to mimic how Nomad 0.8 structs are represented, and made its result match the non compact values in `TestEnvironment_AsList`. [1] `24e9040b18/client/taskenv/env.go (L624-L639)` [2] https://github.com/hashicorp/nomad/blob/master/client/allocrunner/taskrunner/task_runner.go#L287-L303	2019-05-02 12:08:38 -04:00
Mahmood Ali	446f06721d	aux: helper method that returns token as well as ACL policy This helper returns the token as well as the ACL policy, to be used in a later commit for logging the token info associated with nomad exec invocation.	2019-04-30 10:23:56 -04:00
Lang Martin	371014b781	Merge pull request #5553 from hashicorp/b-fingerprinter-manual-config client fingerprinter doesn't overwrite manual configuration	2019-04-26 12:55:34 -04:00
Danielle	79515496cb	Merge pull request #5515 from hashicorp/dani/f-alloc-signal allocs: Add nomad alloc signal command	2019-04-26 14:21:05 +02:00
Danielle Lancashire	a8880f9643	alloc_signal: Add autocompletion and cmd tests	2019-04-26 12:47:53 +02:00
Mahmood Ali	bf0a09e270	retry grpc unavailable errors even if not shutting down	2019-04-25 18:39:17 -04:00
Mahmood Ali	81841e8528	try checking process status	2019-04-25 18:16:13 -04:00
Mahmood Ali	fc78521f29	add logging about attempts	2019-04-25 18:09:36 -04:00
Mahmood Ali	e6ca8641a8	try sleeping for stop signal to take effect	2019-04-25 17:16:29 -04:00
Mahmood Ali	ff3a095015	add a test that simulates logmon dying during Start() call	2019-04-25 16:41:17 -04:00
Mahmood Ali	bbac73883c	logmon: retry starting logmon if it exits Retry if we detect shutting down during Start() api call is started, locally.	2019-04-25 15:10:16 -04:00
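
Commit 3890619100 above describes a zero-valued stats interval turning the device stats collection into a tight loop. The following is a minimal sketch of that class of bug and one way to guard against it; it is not the code from the commit. The names `defaultStatsInterval`, `normalizeInterval`, and `collectLoop` are hypothetical, and the 1 second default is taken from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// defaultStatsInterval mirrors the 1 second default mentioned in the commit
// message. The constant name is hypothetical.
const defaultStatsInterval = 1 * time.Second

// normalizeInterval returns a usable collection interval. A zero or negative
// value falls back to the default so the loop below cannot spin.
func normalizeInterval(d time.Duration) time.Duration {
	if d <= 0 {
		return defaultStatsInterval
	}
	return d
}

// collectLoop calls collect once per interval until stop is closed.
// time.NewTicker panics on a non-positive duration, so the interval is
// normalized before the ticker is created.
func collectLoop(interval time.Duration, stop <-chan struct{}, collect func()) {
	ticker := time.NewTicker(normalizeInterval(interval))
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			collect()
		}
	}
}

func main() {
	// An unset (zero) interval falls back to the default.
	fmt.Println(normalizeInterval(0))
	// A configured interval is kept as is.
	fmt.Println(normalizeInterval(5 * time.Second))
}
```

Normalizing at the point where the loop is built, rather than trusting every caller to pass a valid value, keeps an uninitialized config field from degrading into a busy loop.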


3946 commits
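
Commit c5f5a1fcb9 in the log above states that allocs are only removed and destroyed "if the removal index is more recent than the alloc index". A rough sketch of that comparison follows, assuming a simplified model: the `localAlloc` type, its `ModifyIndex` field, and the `shouldRemove` helper are hypothetical stand-ins and do not reproduce Nomad's actual structs or client code.

```go
package main

import "fmt"

// localAlloc is a hypothetical stand-in for an allocation tracked by a client.
type localAlloc struct {
	ID          string
	ModifyIndex uint64 // index at which the client last saw this alloc change
}

// shouldRemove reports whether an alloc that is absent from a server response
// may be removed locally. responseIndex is the index of that response. A
// response that is not newer than the alloc is treated as a stale read, and
// the alloc is kept.
func shouldRemove(a localAlloc, responseIndex uint64) bool {
	return responseIndex > a.ModifyIndex
}

func main() {
	a := localAlloc{ID: "example", ModifyIndex: 10}
	fmt.Println(shouldRemove(a, 5))  // response older than the alloc: keep it
	fmt.Println(shouldRemove(a, 11)) // response newer than the alloc: remove it
}
```

The check is deliberately conservative: when the indexes are equal the alloc is kept, since the response carries no evidence of a later removal.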