open-nomad

Commit Graph

Author	SHA1	Message	Date
Pete Woods	49b7d23cea	Add node "status" and "scheduling eligibility" tags to client metrics (#6130 ) When summing up the capability of your Nomad fleet for scaling purposes, it's important to exclude draining nodes, as they won't accept new jobs.	2019-09-03 12:11:11 -04:00
Michael Schurter	5957030d18	connect: add unix socket to proxy grpc for envoy (#6232 ) * connect: add unix socket to proxy grpc for envoy Fixes #6124 Implement a L4 proxy from a unix socket inside a network namespace to Consul's gRPC endpoint on the host. This allows Envoy to connect to Consul's xDS configuration API. * connect: pointer receiver on structs with mutexes * connect: warn on all proxy errors	2019-09-03 08:43:38 -07:00
Mahmood Ali	0f401e6b8b	lint: ignore generated windows syscall wrappers	2019-09-03 10:59:58 -04:00
Jasmine Dahilig	4edebe389a	add default update stanza and max_parallel=0 disables deployments (#6191 )	2019-09-02 10:30:09 -07:00
Evan Ercolano	fcf66918d0	Remove unused canary param from MakeTaskServiceID	2019-08-31 16:53:23 -04:00
Danielle Lancashire	4fcb7394e9	client: Fix memory fingerprinting on 32bit Also introduce regression ci for 32 bit fingerprinting	2019-08-31 18:33:59 +02:00
Mahmood Ali	0bd2eee87f	Merge pull request #6216 from hashicorp/b-recognize-pending-allocs alloc_runner: wait when starting suspicious allocs	2019-08-28 14:46:09 -04:00
Mahmood Ali	e0da3c5d0e	rename to hasLocalState, and ignore clientstate The ClientState being pending isn't a good criteria; as an alloc may have been updated in-place before it was completed. Also, updated the logic so we only check for task states. If an alloc has deployment state but no persisted tasks at all, restore will still fail.	2019-08-28 11:44:48 -04:00
Nick Ethier	cf014c7fd5	ar: ensure network forwarding is allowed for bridged allocs (#6196 ) * ar: ensure network forwarding is allowed in iptables for bridged allocs * ensure filter rule exists at setup time	2019-08-28 10:51:34 -04:00
Nick Ethier	cbb27e74bc	Add environment variables for connect upstreams (#6171 ) * taskenv: add connect upstream env vars + test * set taskenv upstreams instead of appending * Update client/taskenv/env.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-27 23:41:38 -04:00
Mahmood Ali	90c5eefbab	Alternative approach: avoid restoring This uses an alternative approach where we avoid restoring the alloc runner in the first place, if we suspect that the alloc may have been completed already.	2019-08-27 17:30:55 -04:00
Jasmine Dahilig	4078393bb6	expose nomad namespace as environment variable in allocation #5692 (#6192 )	2019-08-27 08:38:07 -07:00
Mahmood Ali	647c1457cb	alloc_runner: wait when starting suspicious allocs This commit aims to help users running with clients suseptible to the destroyed alloc being restrarted bug upgrade to latest. Without this, such users will have their tasks run unexpectedly on upgrade and only see the bug resolved after subsequent restart. If, on restore, the client sees a pending alloc without any other persisted info, then err on the side that it's an corrupt persisted state of an alloc instead of the client happening to be killed right when alloc is assigned to client. Few reasons motivate this behavior: Statistically speaking, corruption being the cause is more likely. A long running client will have higher chance of having allocs persisted incorrectly with pending state. Being killed right when an alloc is about to start is relatively unlikely. Also, delaying starting an alloc that hasn't started (by hopefully seconds) is not as severe as launching too many allocs that may bring client down. More importantly, this helps customers upgrade their clients without risking taking their clients down and destablizing their cluster. We don't want existing users to force triggering the bug while they upgrade and restart cluster.	2019-08-26 22:05:31 -04:00
Mahmood Ali	dfdf0edd3b	Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun Don't persist allocs of destroyed alloc runners	2019-08-26 17:26:18 -04:00
Mahmood Ali	cc460d4804	Write to client store while holding lock Protect against a race where destroying and persist state goroutines race. The downside is that the database io operation will run while holding the lock and may run indefinitely. The risk of lock being long held is slow destruction, but slow io has bigger problems.	2019-08-26 13:45:58 -04:00
Mahmood Ali	1851820f20	logmon: log stat error to help debugging	2019-08-26 10:10:20 -04:00
Mahmood Ali	c132623ffc	Don't persist allocs of destroyed alloc runners This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information in non-transaction non-atomic concurrently while alloc runner is running and potentially changing state is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890	2019-08-25 11:21:28 -04:00
Mahmood Ali	6301725002	logmon: revert workaround for Windows go1.11 bug Revert e0126123ab1ba848f72458538bc6118c978245e6 now that we are running with Golang 1.12, and https://github.com/golang/go/issues/29119 is no longer relevant.	2019-08-24 08:19:44 -04:00
Mahmood Ali	df1f3eb9ee	Merge pull request #6201 from hashicorp/b-device-stats-interval initialize device manager stats interval	2019-08-24 08:16:03 -04:00
Mahmood Ali	07b5f4c530	Merge pull request #6146 from hashicorp/b-config-template-copy clientConfig.Copy() to copy template config too	2019-08-23 19:00:57 -04:00
Mahmood Ali	b98568774b	clientConfig.Copy() to copy template config too	2019-08-23 18:43:22 -04:00
Lang Martin	4f6493a301	taskrunner getter set Umask for go-getter, setuid test	2019-08-23 15:59:03 -04:00
Mahmood Ali	3890619100	initialize device manager stats interval Fixes a bug where we cpu is pigged at 100% due to collecting devices statistics. The passed stats interval was ignored, and the default zero value causes a very tight loop of stats collection. FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a `g2.2xlarge` ec2 instance. The stats interval defaults to 1 second and is user configurable. I believe this is too frequent as a default, and I may advocate for reducing it to a value closer to 5s or 10s, but keeping it as is for now. Fixes https://github.com/hashicorp/nomad/issues/6057 .	2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet	cbdc1978bf	Consul service meta (#6193 ) * adds meta object to service in job spec, sends it to consul * adds tests for service meta * fix tests * adds docs * better hashing for service meta, use helper for copying meta when registering service * tried to be DRY, but looks like it would be more work to use the helper function	2019-08-23 12:49:02 -04:00
Nick Ethier	96d379071d	ar: fix bridge networking port mapping when port.To is unset (#6190 )	2019-08-22 21:53:52 -04:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, boostrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Michael Schurter	b008fd1724	connect: register group services with Consul Fixes #6042 Add new task group service hook for registering group services like Connect-enabled services. Does not yet support checks.	2019-08-20 12:25:10 -07:00
lchayoun	2307c9d1d2	allow dash in non generated environment variable names - should only clean generate environment variables	2019-08-16 11:11:47 +03:00
Nick Ethier	965f00b2fc	Builtin Admission Controller Framework (#6116 ) * nomad: add admission controller framework * nomad: add admission controller framework and Consul Connect hooks * run admission controllers before checking permissions * client: add default node meta for connect configurables * nomad: remove validateJob func since it has been moved to admission controller * nomad: use new TaskKind type * client: use consts for connect sidecar image and log level * Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad: add job register test with connect sidecar * Update nomad/job_endpoint_hooks.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-15 11:22:37 -04:00
lchayoun	c5a38a045a	allow dash in non generated environment variable names - should only clean generate environment variables	2019-08-13 19:23:13 +03:00
Tim Gross	03433f35d4	client/template: configuration for function blacklist and sandboxing When rendering a task template, the `plugin` function is no longer permitted by default and will raise an error. An operator can opt-in to permitting this function with the new `template.function_blacklist` field in the client configuration. When rendering a task template, path parameters for the `file` function will be treated as relative to the task directory by default. Relative paths or symlinks that point outside the task directory will raise an error. An operator can opt-out of this protection with the new `template.disable_file_sandbox` field in the client configuration.	2019-08-12 16:34:48 -04:00
Danielle Lancashire	7e6c8e5ac1	Copy documentation to api/tasks	2019-08-12 16:22:27 +02:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	e132a30899	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle Lancashire	6ef8d5233e	client: Add volume_hook for mounting volumes	2019-08-12 15:39:08 +02:00
Danielle Lancashire	063e4240c1	client: Add parsing and registration of HostVolume configuration	2019-08-12 15:39:08 +02:00
lchayoun	ca892163b2	allow dash in non generated environment variable names	2019-08-11 12:51:42 +03:00
Nick Ethier	7806f4c597	Revert "client: add autofetch for CNI plugins" This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.	2019-08-08 15:10:19 -04:00
Nick Ethier	7d28ece8de	Revert "client: remove debugging lines" This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.	2019-08-08 14:52:52 -04:00
Liel Chayoun	24dcb2379c	Update env_test.go	2019-08-06 11:59:31 +03:00
Mahmood Ali	b17bac5101	Render consul templates using task env only (#6055 ) When rendering a task consul template, ensure that only task environment variables are used. Currently, `consul-template` always falls back to host process environment variables when key isn't a task env var[1]. Thus, we add an empty entry for each host process env-var not found in task env-vars. [1] `bfa5d0e133/template/funcs.go (L61-L75)`	2019-08-05 16:30:47 -04:00
Mahmood Ali	f66169cd6a	Merge pull request #6065 from hashicorp/b-nil-driver-exec Check if driver handle is nil before execing	2019-08-02 09:48:28 -05:00
Mahmood Ali	a4670db9b7	Check if driver handle is nil before execing Defend against tr.getDriverHandle being nil. Exec handler checks if task is running, but it may be stopped between check and driver handler fetching.	2019-08-02 10:07:41 +08:00
Nick Ethier	7de0bec8ab	client/cni: updated comments and simplified logic to auto download plugins	2019-07-31 01:04:10 -04:00
Nick Ethier	b16640c50d	Apply suggestions from code review Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2019-07-31 01:04:10 -04:00
Nick Ethier	321d10a041	client: remove debugging lines	2019-07-31 01:04:09 -04:00
Nick Ethier	af6b191963	client: add autofetch for CNI plugins	2019-07-31 01:04:09 -04:00
Nick Ethier	1e9dd1b193	remove unused file	2019-07-31 01:04:09 -04:00
Nick Ethier	09a4cfd8d7	fix failing tests	2019-07-31 01:04:07 -04:00
Nick Ethier	ef83f0831b	ar: plumb client config for networking into the network hook	2019-07-31 01:04:06 -04:00
Nick Ethier	af66a35924	networking: Add new bridge networking mode implementation	2019-07-31 01:04:06 -04:00
Michael Schurter	fb487358fb	connect: add group.service stanza support	2019-07-31 01:04:05 -04:00
Nick Ethier	63c5504d56	ar: fix lint errors	2019-07-31 01:03:19 -04:00
Nick Ethier	e312201d18	ar: rearrange network hook to support building on windows	2019-07-31 01:03:19 -04:00
Nick Ethier	370533c9c7	ar: fix test that failed due to error renaming	2019-07-31 01:03:19 -04:00
Nick Ethier	2d60ef64d9	plugins/driver: make DriverNetworkManager interface optional	2019-07-31 01:03:19 -04:00
Nick Ethier	f87e7e9c9a	ar: plumb error handling into alloc runner hook initialization	2019-07-31 01:03:18 -04:00
Nick Ethier	ef1795b344	ar: add tests for network hook	2019-07-31 01:03:18 -04:00
Nick Ethier	15989bba8e	ar: cleanup lint errors	2019-07-31 01:03:18 -04:00
Nick Ethier	220cba3e7e	ar: move linux specific code to it's own file and add tests	2019-07-31 01:03:18 -04:00
Nick Ethier	548f78ef15	ar: initial driver based network management	2019-07-31 01:03:17 -04:00
Nick Ethier	66c514a388	Add network lifecycle management Adds a new Prerun and Postrun hooks to manage set up of network namespaces on linux. Work still needs to be done to make the code platform agnostic and support Docker style network initalization.	2019-07-31 01:03:17 -04:00
Preetha Appan	d048029b5a	remove generated code and change version to 0.10.0	2019-07-30 15:56:05 -05:00
Nomad Release bot	e39fb11531	Generate files for 0.9.4 release	2019-07-30 19:05:18 +00:00
Preetha Appan	6b4c40f5a8	remove generated code	2019-07-23 12:07:49 -05:00
Nomad Release bot	04187c8b86	Generate files for 0.9.4-rc1 release	2019-07-22 21:42:36 +00:00
Michael Schurter	d90680021e	logmon: fix comment formattinglogmon: fix comment formattinglogmon: fix comment formattinglogmon: fix comment formattinglogmon: fix comment formatting	2019-07-22 13:05:01 -07:00
Michael Schurter	e37bc3513c	logmon: ensure errors are still handled properly ...and add a comment to switch back to the old error handling once we switch to Go 1.12.	2019-07-22 12:49:48 -07:00
Danielle Lancashire	1bcbbbfbe6	logmon: Workaround golang/go#29119 There's a bug in go1.11 that causes some io operations on windows to return incorrect errors for some cases when Stat-ing files. To avoid upgrading to go1.12 in a point release, here we loosen up the cases where we will attempt to create fifos, and add some logging of underlying stat errors to help with debugging.	2019-07-22 18:28:12 +02:00
Jasmine Dahilig	2157f6ddf1	add formatting for hcl parsing error messages (#5972 )	2019-07-19 10:04:39 -07:00
Mahmood Ali	cd6f1d3102	Update consul-template dependency to latest To pick up the fix in https://github.com/hashicorp/consul-template/pull/1231 .	2019-07-18 07:32:03 +07:00
Mahmood Ali	8a82260319	log unrecoverable errors	2019-07-17 11:01:59 +07:00
Mahmood Ali	1a299c7b28	client/taskrunner: fix stats stats retry logic Previously, if a channel is closed, we retry the Stats call. But, if that call fails, we go in a backoff loop without calling Stats ever again. Here, we use a utility function for calling driverHandle.Stats call that retries as one expects. I aimed to preserve the logging formats but made small improvements as I saw fit.	2019-07-11 13:58:07 +08:00
Preetha Appan	7d645c5ad9	Test file for detect content type that satisfies linter and encoding	2019-07-10 11:42:04 -05:00
Preetha Appan	ef9a71c68b	code review feedback	2019-07-10 10:41:06 -05:00
Preetha Appan	990e468edc	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Preetha Appan	108a292cc0	fix linting failure in test case file	2019-07-08 11:29:12 -05:00
Michael Lange	b2e9570075	Use consistent casing in the JSON representation of the AllocFileInfo struct	2019-07-02 17:27:31 -07:00
Preetha Appan	8495fb9055	Added additional test cases and fixed go test case	2019-07-02 13:25:29 -05:00
Mahmood Ali	a97d451ac7	Merge pull request #5905 from hashicorp/b-ar-failed-prestart Fail alloc if alloc runner prestart hooks fail	2019-07-02 20:25:53 +08:00
Danielle	c6872cdf12	Merge pull request #5864 from hashicorp/dani/win-pipe-cleaner windows: Fix restarts using the raw_exec driver	2019-07-02 13:58:56 +02:00
Danielle Lancashire	e20300313f	fifo: Safer access to Conn	2019-07-02 13:12:54 +02:00
Mahmood Ali	f10201c102	run post-run/post-stop task runner hooks Handle when prestart failed while restoring a task, to prevent accidentally leaking consul/logmon processes.	2019-07-02 18:38:32 +08:00
Mahmood Ali	4afd7835e3	Fail alloc if alloc runner prestart hooks fail When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are: * Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861 * Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed * Alloc not being restarted/rescheduled to another node (as it's still in pending state) * Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time! Here, we treat all tasks to have failed if alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node. While it's desirable to retry alloc runner in such failures, I opted to treat it out of scope. I'm afraid of some subtles about alloc and task runners and their idempotency that's better handled in a follow up PR. This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840 .	2019-07-02 18:35:47 +08:00
Mahmood Ali	7614b8f09e	Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2 task runner to avoid running task if terminal	2019-07-02 15:31:17 +08:00
Mahmood Ali	7bfad051b9	address review comments	2019-07-02 14:53:50 +08:00
Mahmood Ali	c0c00ecc07	Merge pull request #5906 from hashicorp/b-alloc-stale-updates client: defensive against getting stale alloc updates	2019-07-02 12:40:17 +08:00
Preetha Appan	c09342903b	Improve test cases for detecting content type	2019-07-01 16:24:48 -05:00
Danielle Lancashire	688f82f07d	fifo: Close connections and cleanup lock handling	2019-07-01 14:14:29 +02:00
Danielle Lancashire	2c7d1f1b99	logmon: Add windows compatibility test	2019-07-01 14:14:06 +02:00
Mahmood Ali	c5f5a1fcb9	client: defensive against getting stale alloc updates When fetching node alloc assignments, be defensive against a stale read before killing local nodes allocs. The bug is when both client and servers are restarting and the client requests the node allocation for the node, it might get stale data as server hasn't finished applying all the restored raft transaction to store. Consequently, client would kill and destroy the alloc locally, just to fetch it again moments later when server store is up to date. The bug can be reproduced quite reliably with single node setup (configured with persistence). I suspect it's too edge-casey to occur in production cluster with multiple servers, but we may need to examine leader failover scenarios more closely. In this commit, we only remove and destroy allocs if the removal index is more recent than the alloc index. This seems like a cheap resiliency fix we already use for detecting alloc updates. A more proper fix would be to ensure that a nomad server only serves RPC calls when state store is fully restored or up to date in leadership transition cases.	2019-06-29 04:17:35 -05:00
Preetha Appan	3345ce3ba4	Infer content type in alloc fs stat endpoint	2019-06-28 20:31:28 -05:00
Danielle Lancashire	e1151f743b	appveyor: Run logmon tests	2019-06-28 16:01:41 +02:00
Danielle Lancashire	634ada671e	fifo: Require that fifos do not exist for create Although this operation is safe on linux, it is not safe on Windows when using the named pipe interface. To provide a ~reasonable common api abstraction, here we switch to returning File exists errors on the unix api.	2019-06-28 13:47:18 +02:00
Danielle Lancashire	0ff27cfc0f	vendor: Use dani fork of go-winio	2019-06-28 13:47:18 +02:00
Danielle Lancashire	514a2a6017	logmon: Refactor fifo access for windows safety On unix platforms, it is safe to re-open fifo's for reading after the first creation if the file is already a fifo, however this is not possible on windows where this triggers a permissions error on the socket path, as you cannot recreate it. We can't transparently handle this in the CreateAndRead handle, because the Access Is Denied error is too generic to reliably be an IO error. Instead, we add an explict API for opening a reader to an existing FIFO, and check to see if the fifo already exists inside the calling package (e.g logmon)	2019-06-28 13:41:54 +02:00
Mahmood Ali	3d89ae0f1e	task runner to avoid running task if terminal This change fixes a bug where nomad would avoid running alloc tasks if the alloc is client terminal but the server copy on the client isn't marked as running. Here, we fix the case by having task runner uses the allocRunner.shouldRun() instead of only checking the server updated alloc. Here, we preserve much of the invariants such that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles. Fixes https://github.com/hashicorp/nomad/issues/5883	2019-06-27 11:27:34 +08:00
Danielle Lancashire	b9ac184e1f	tr: Fetch Wait channel before killTask in restart Currently, if killTask results in the termination of a process before calling WaitTask, Restart() will incorrectly return a TaskNotFound error when using the raw_exec driver on Windows.	2019-06-26 15:20:57 +02:00
Mahmood Ali	b209584dce	Merge pull request #5726 from hashicorp/b-plugins-via-init Use init() to handle plugin invocation	2019-06-18 21:09:03 -04:00
Mahmood Ali	ac64509c59	comment on use of init() for plugin handlers	2019-06-18 20:54:55 -04:00

1 2 3 4 5 ...

3913 Commits