open-nomad

Commit Graph

Author	SHA1	Message	Date
Michael Schurter	81a30ae106	Revert "Use joint context to cancel prestart hooks"	2019-10-08 11:27:08 -07:00
Drew Bailey	69eebcd241	simplify logic to check for vault read event defer shutdown to cleanup after failed run Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> update comment to include ctx note for shutdown	2019-09-30 11:02:14 -07:00
Drew Bailey	7565b8a8d9	Use joint context to cancel prestart hooks fixes https://github.com/hashicorp/nomad/issues/6382 The prestart hook for templates blocks while it resolves vault secrets. If the secret is not found it continues to retry. If a task is shutdown during this time, the prestart hook currently does not receive shutdownCtxCancel, causing it to hang. This PR joins the two contexts so either killCtx or shutdownCtx cancel and stop the task.	2019-09-30 10:48:01 -07:00
Tim Gross	a6aadb3714	connect: remove proxy socket for restarted client	2019-09-25 14:58:17 -04:00
Tim Gross	e43d33aa50	client: don't run alloc postrun during shutdown	2019-09-25 14:58:17 -04:00
Tim Gross	d965a15490	driver/networking: don't recreate existing network namespaces	2019-09-25 14:58:17 -04:00
Danielle Lancashire	4f2343e1c0	client: Return empty values when host stats fail Currently, there is an issue when running on Windows whereby under some circumstances the Windows stats APIs will begin to return errors (such as internal timeouts) when a client is under high load, and potentially other forms of resource contention / system states (and other unknown cases). When an error occurs during this collection, we then short circuit further metrics emission from the client until the next interval. This can be problematic if it happens for a sustained number of intervals, as our metrics aggregator will begin to age out older metrics, and we will eventually stop emitting various types of metrics including `nomad.client.unallocated.*` metrics. However, when metrics collection fails on Linux, gopsutil will in many cases (e.g. cpu.Times) silently return 0 values, rather than an error. Here, we switch to returning empty metrics in these failures, and logging the error at the source. This brings the behaviour into line with Linux/Unix platforms, and although making aggregation a little sadder on intermittent failures, will result in more desirable overall behaviour of keeping metrics available for further investigation if things look unusual.	2019-09-19 01:22:07 +02:00
Danielle	ec3ecdecfc	Merge pull request #6321 from hashicorp/dani/remove-config Hoist Volume.Config.Source into Volume.Source	2019-09-16 10:12:58 -07:00
Tim Gross	a6ef8c5d42	client/networking: wrap error message from CNI plugin (#6316)	2019-09-13 08:20:05 -04:00
Danielle Lancashire	78b61de45f	config: Hoist volume.config.source into volume Currently, using a Volume in a job uses the following configuration: ``` volume "alias-name" { type = "volume-type" read_only = true config { source = "host_volume_name" } } ``` This commit migrates to the following: ``` volume "alias-name" { type = "volume-type" source = "host_volume_name" read_only = true } ``` The original design was based on being uncertain about the future of storage plugins, and to allow maximum flexibility. However, this causes a few issues, namely: - We frequently need to parse this configuration during submission, scheduling, and mounting - It complicates the configuration from an end user's perspective - It complicates the ability to do validation As we understand the problem space of CSI a little more, it has become clear that we won't need the `source` to be in config, as it will be used in the majority of cases: - Host Volumes: Always need a source - Preallocated CSI Volumes: Always needs a source from a volume or claim name - Dynamic Persistent CSI Volumes: Always needs a source to attach the volumes to for managing upgrades and to avoid dangling. - Dynamic Ephemeral CSI Volumes: Less thought out, but `source` will probably point to the plugin name, and a `config` block will allow you to pass meta to the plugin. Or will point to a pre-configured ephemeral config. *If implemented The new design simplifies this by merging the source into the volume stanza to solve the above issues with usability, performance, and error handling.	2019-09-13 04:37:59 +02:00
Tim Gross	3fa4bca4a0	script checks: Update needs to update Alloc as well (#6291)	2019-09-06 11:18:00 -04:00
Tim Gross	8ce201854a	client: recreate script checks on Update (#6265) Splitting the immutable and mutable components of the scriptCheck led to a bug where the environment interpolation wasn't being incorporated into the check's ID, which caused the UpdateTTL to update for a check ID that Consul didn't have (because our Consul client creates the ID from the structs.ServiceCheck each time we update). Task group services don't have access to a task environment at creation, so their checks get registered before the check can be interpolated. Use the original check ID so they can be updated.	2019-09-05 11:43:23 -04:00
Michael Schurter	ee06c36345	Merge pull request #6254 from hashicorp/test-connect-e2e-demo e2e: test demo job for connect	2019-09-04 14:33:26 -07:00
Nick Ethier	e440ba80f1	ar: refactor network bridge config to use go-cni lib (#6255) * ar: refactor network bridge config to use go-cni lib * ar: use eth as the iface prefix for bridged network namespaces * vendor: update containerd/go-cni package * ar: update network hook to use TODO contexts when calling configurator * unnecessary conversion	2019-09-04 16:33:25 -04:00
Michael Schurter	93b47f4ddc	client: reword error message	2019-09-04 12:40:09 -07:00
Tim Gross	0f29dcc935	support script checks for task group services (#6197) In Nomad prior to Consul Connect, all Consul checks work the same except for Script checks. Because the Task being checked is running in its own container namespaces, the check is executed by Nomad in the Task's context. If the Script check passes, Nomad uses the TTL check feature of Consul to update the check status. This means in order to run a Script check, we need to know what Task to execute it in. To support Consul Connect, we need Group Services, and these need to be registered in Consul along with their checks. We could push the Service down into the Task, but this doesn't work if someone wants to associate a service with a task's ports, but do script checks in another task in the allocation. Because Nomad is handling the Script check and not Consul anyways, this moves the script check handling into the task runner so that the task runner can own the script check's configuration and lifecycle. This will allow us to pass the group service check configuration down into a task without associating the service itself with the task. When tasks are checked for script checks, we walk back through their task group to see if there are script checks associated with the task. If so, we'll spin off script check tasklets for them. The group-level service and any restart behaviors it needs are entirely encapsulated within the group service hook.	2019-09-03 15:09:04 -04:00
Michael Schurter	5957030d18	connect: add unix socket to proxy grpc for envoy (#6232) * connect: add unix socket to proxy grpc for envoy Fixes #6124 Implement a L4 proxy from a unix socket inside a network namespace to Consul's gRPC endpoint on the host. This allows Envoy to connect to Consul's xDS configuration API. * connect: pointer receiver on structs with mutexes * connect: warn on all proxy errors	2019-09-03 08:43:38 -07:00
Jasmine Dahilig	4edebe389a	add default update stanza and max_parallel=0 disables deployments (#6191)	2019-09-02 10:30:09 -07:00
Evan Ercolano	fcf66918d0	Remove unused canary param from MakeTaskServiceID	2019-08-31 16:53:23 -04:00
Nick Ethier	cf014c7fd5	ar: ensure network forwarding is allowed for bridged allocs (#6196) * ar: ensure network forwarding is allowed in iptables for bridged allocs * ensure filter rule exists at setup time	2019-08-28 10:51:34 -04:00
Mahmood Ali	cc460d4804	Write to client store while holding lock Protect against a race where destroying and persist state goroutines race. The downside is that the database io operation will run while holding the lock and may run indefinitely. The risk of lock being long held is slow destruction, but slow io has bigger problems.	2019-08-26 13:45:58 -04:00
Mahmood Ali	c132623ffc	Don't persist allocs of destroyed alloc runners This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information in non-transaction non-atomic concurrently while alloc runner is running and potentially changing state is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890	2019-08-25 11:21:28 -04:00
Lang Martin	4f6493a301	taskrunner getter set Umask for go-getter, setuid test	2019-08-23 15:59:03 -04:00
Jerome Gravel-Niquet	cbdc1978bf	Consul service meta (#6193) * adds meta object to service in job spec, sends it to consul * adds tests for service meta * fix tests * adds docs * better hashing for service meta, use helper for copying meta when registering service * tried to be DRY, but looks like it would be more work to use the helper function	2019-08-23 12:49:02 -04:00
Nick Ethier	96d379071d	ar: fix bridge networking port mapping when port.To is unset (#6190)	2019-08-22 21:53:52 -04:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, bootstrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Michael Schurter	b008fd1724	connect: register group services with Consul Fixes #6042 Add new task group service hook for registering group services like Connect-enabled services. Does not yet support checks.	2019-08-20 12:25:10 -07:00
Tim Gross	03433f35d4	client/template: configuration for function blacklist and sandboxing When rendering a task template, the `plugin` function is no longer permitted by default and will raise an error. An operator can opt-in to permitting this function with the new `template.function_blacklist` field in the client configuration. When rendering a task template, path parameters for the `file` function will be treated as relative to the task directory by default. Relative paths or symlinks that point outside the task directory will raise an error. An operator can opt-out of this protection with the new `template.disable_file_sandbox` field in the client configuration.	2019-08-12 16:34:48 -04:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	e132a30899	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle Lancashire	6ef8d5233e	client: Add volume_hook for mounting volumes	2019-08-12 15:39:08 +02:00
Mahmood Ali	b17bac5101	Render consul templates using task env only (#6055) When rendering a task consul template, ensure that only task environment variables are used. Currently, `consul-template` always falls back to host process environment variables when key isn't a task env var[1]. Thus, we add an empty entry for each host process env-var not found in task env-vars. [1] `bfa5d0e133/template/funcs.go (L61-L75)`	2019-08-05 16:30:47 -04:00
Mahmood Ali	f66169cd6a	Merge pull request #6065 from hashicorp/b-nil-driver-exec Check if driver handle is nil before execing	2019-08-02 09:48:28 -05:00
Mahmood Ali	a4670db9b7	Check if driver handle is nil before execing Defend against tr.getDriverHandle being nil. Exec handler checks if task is running, but it may be stopped between check and driver handler fetching.	2019-08-02 10:07:41 +08:00
Nick Ethier	321d10a041	client: remove debugging lines	2019-07-31 01:04:09 -04:00
Nick Ethier	af6b191963	client: add autofetch for CNI plugins	2019-07-31 01:04:09 -04:00
Nick Ethier	1e9dd1b193	remove unused file	2019-07-31 01:04:09 -04:00
Nick Ethier	09a4cfd8d7	fix failing tests	2019-07-31 01:04:07 -04:00
Nick Ethier	ef83f0831b	ar: plumb client config for networking into the network hook	2019-07-31 01:04:06 -04:00
Nick Ethier	af66a35924	networking: Add new bridge networking mode implementation	2019-07-31 01:04:06 -04:00
Nick Ethier	63c5504d56	ar: fix lint errors	2019-07-31 01:03:19 -04:00
Nick Ethier	e312201d18	ar: rearrange network hook to support building on windows	2019-07-31 01:03:19 -04:00
Nick Ethier	370533c9c7	ar: fix test that failed due to error renaming	2019-07-31 01:03:19 -04:00
Nick Ethier	2d60ef64d9	plugins/driver: make DriverNetworkManager interface optional	2019-07-31 01:03:19 -04:00
Nick Ethier	f87e7e9c9a	ar: plumb error handling into alloc runner hook initialization	2019-07-31 01:03:18 -04:00
Nick Ethier	ef1795b344	ar: add tests for network hook	2019-07-31 01:03:18 -04:00
Nick Ethier	15989bba8e	ar: cleanup lint errors	2019-07-31 01:03:18 -04:00
Nick Ethier	220cba3e7e	ar: move linux specific code to its own file and add tests	2019-07-31 01:03:18 -04:00
Nick Ethier	548f78ef15	ar: initial driver based network management	2019-07-31 01:03:17 -04:00
Nick Ethier	66c514a388	Add network lifecycle management Adds a new Prerun and Postrun hooks to manage set up of network namespaces on linux. Work still needs to be done to make the code platform agnostic and support Docker style network initialization.	2019-07-31 01:03:17 -04:00
Jasmine Dahilig	2157f6ddf1	add formatting for hcl parsing error messages (#5972)	2019-07-19 10:04:39 -07:00
Mahmood Ali	cd6f1d3102	Update consul-template dependency to latest To pick up the fix in https://github.com/hashicorp/consul-template/pull/1231.	2019-07-18 07:32:03 +07:00
Mahmood Ali	8a82260319	log unrecoverable errors	2019-07-17 11:01:59 +07:00
Mahmood Ali	1a299c7b28	client/taskrunner: fix stats retry logic Previously, if a channel is closed, we retry the Stats call. But, if that call fails, we go in a backoff loop without calling Stats ever again. Here, we use a utility function for calling driverHandle.Stats call that retries as one expects. I aimed to preserve the logging formats but made small improvements as I saw fit.	2019-07-11 13:58:07 +08:00
Preetha Appan	ef9a71c68b	code review feedback	2019-07-10 10:41:06 -05:00
Preetha Appan	990e468edc	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Mahmood Ali	f10201c102	run post-run/post-stop task runner hooks Handle when prestart failed while restoring a task, to prevent accidentally leaking consul/logmon processes.	2019-07-02 18:38:32 +08:00
Mahmood Ali	4afd7835e3	Fail alloc if alloc runner prestart hooks fail When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are: * Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861 * Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed * Alloc not being restarted/rescheduled to another node (as it's still in pending state) * Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time! Here, we treat all tasks to have failed if alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node. While it's desirable to retry alloc runner in such failures, I opted to treat it out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that's better handled in a follow up PR. This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840.	2019-07-02 18:35:47 +08:00
Mahmood Ali	7614b8f09e	Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2 task runner to avoid running task if terminal	2019-07-02 15:31:17 +08:00
Mahmood Ali	7bfad051b9	address review comments	2019-07-02 14:53:50 +08:00
Mahmood Ali	3d89ae0f1e	task runner to avoid running task if terminal This change fixes a bug where nomad would avoid running alloc tasks if the alloc is client terminal but the server copy on the client isn't marked as running. Here, we fix the case by having the task runner use allocRunner.shouldRun() instead of only checking the server updated alloc. Here, we preserve much of the invariants such that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles. Fixes https://github.com/hashicorp/nomad/issues/5883	2019-06-27 11:27:34 +08:00
Danielle Lancashire	b9ac184e1f	tr: Fetch Wait channel before killTask in restart Currently, if killTask results in the termination of a process before calling WaitTask, Restart() will incorrectly return a TaskNotFound error when using the raw_exec driver on Windows.	2019-06-26 15:20:57 +02:00
Chris Baker	f71114f5b8	cleanup test	2019-06-18 14:15:25 +00:00
Chris Baker	a2dc351fd0	formatting and clarity	2019-06-18 14:00:57 +00:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Danielle	f923b568e0	Merge pull request #5821 from hashicorp/dani/b-5770 trhooks: Add TaskStopHook interface to services	2019-06-12 17:30:49 +02:00
Danielle Lancashire	c326344b57	trt: Fix test	2019-06-12 17:06:11 +02:00
Danielle Lancashire	13d76e35fd	trhooks: Add TaskStopHook interface to services We currently only run cleanup Service Hooks when a task is either Killed, or Exited. However, due to the implementation of a task runner, tasks are only Exited if they ever correctly started running, which is not true when you receive an error early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770	2019-06-12 16:00:21 +02:00
Mahmood Ali	2acf30fdd3	Fallback to `alloc.TaskResources` for old allocs When a client is running against an old server (e.g. running 0.8), `alloc.AllocatedResources` may be nil, and we need to check the deprecated `alloc.TaskResources` instead. Fixes https://github.com/hashicorp/nomad/issues/5810	2019-06-11 10:32:53 -04:00
Mahmood Ali	7a4900aaa4	client/allocrunner: depend on internal task state Alloc runner already tracks tasks associated with alloc. Here, we become defensive by relying on the alloc runner tracked tasks, rather than depend on server never updating the job unexpectedly.	2019-06-10 18:42:51 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	935ee86e92	Merge pull request #5737 from fwkz/fix-restart-attempts Fix restart attempts of `restart` stanza in `delay` mode.	2019-06-05 19:05:07 -04:00
Danielle Lancashire	27583ed8c1	client: Pass servers contacted ch to allocrunner This fixes an issue where batch and service workloads would never be restarted due to indefinitely blocking on a nil channel. It also raises the restoration logging message to `Info` to simplify log analysis.	2019-05-22 13:47:35 +02:00
Mahmood Ali	9df1e00f35	tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal Given that Signal may be called multiple times, blocking for `SignalCh` isn't sufficient for synchronizing access to Signals field.	2019-05-21 13:56:58 -04:00
Mahmood Ali	b475ccbe3e	client: synchronize access to ar.alloc `allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's use `allocRunner.Alloc()` helper function to access it.	2019-05-21 09:55:05 -04:00
fwkz	8b84bec95a	Fix restart attempts of `restart` stanza. Number of restarts during 2nd interval is off by one.	2019-05-21 13:27:19 +02:00
Michael Schurter	2fe0768f3b	docs: changelog entry for #5669 and fix comment	2019-05-14 10:54:00 -07:00
Michael Schurter	af9096c8ba	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	e07f73bfe0	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	d7e5ace1ed	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	1c4e585fa7	client: expose allocated memory per task Related to #4280 This PR adds `client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge in bytes to metrics to ease calculating how close a task is to OOMing. ``` 'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000 'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000 'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000 'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000 'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000 ```	2019-05-10 11:12:12 -07:00
Mahmood Ali	919827f2df	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	ab2cae0625	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Chris Baker	93ec1293be	stale allocation data leads to incorrect (and even negative) metrics (#5637) * client: was not using up-to-date client state in determining which alloc count towards allocated resources * Update client/client.go Co-Authored-By: cgbaker <cgbaker@hashicorp.com>	2019-05-07 15:54:36 -04:00
Michael Schurter	8c7b3ff45a	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:01:30 -05:00
Michael Schurter	e19fa33f9c	Remove unnecessary boolean clause Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:00:17 -05:00
Preetha Appan	b99a204582	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Danielle	79515496cb	Merge pull request #5515 from hashicorp/dani/f-alloc-signal allocs: Add nomad alloc signal command	2019-04-26 14:21:05 +02:00
Mahmood Ali	bf0a09e270	retry grpc unavailable errors even if not shutting down	2019-04-25 18:39:17 -04:00
Mahmood Ali	81841e8528	try checking process status	2019-04-25 18:16:13 -04:00
Mahmood Ali	fc78521f29	add logging about attempts	2019-04-25 18:09:36 -04:00
Mahmood Ali	e6ca8641a8	try sleeping for stop signal to take effect	2019-04-25 17:16:29 -04:00
Mahmood Ali	ff3a095015	add a test that simulates logmon dying during Start() call	2019-04-25 16:41:17 -04:00
Mahmood Ali	bbac73883c	logmon: retry starting logmon if it exits Retry if we detect shutting down during Start() api call is started, locally.	2019-04-25 15:10:16 -04:00
Danielle Lancashire	3409e0be89	allocs: Add nomad alloc signal command This command will be used to send a signal to either a single task within an allocation, or all of the tasks if <task-name> is omitted. If the sent signal terminates the allocation, it will be treated as if the allocation has crashed, rather than as if it was operator-terminated. Signal validation is currently handled by the driver itself and nomad does not attempt to restrict or validate them.	2019-04-25 12:43:32 +02:00
Michael Schurter	61f17a1043	tweak logging level for failed log line Co-Authored-By: notnoop <mahmood@notnoop.com>	2019-04-22 14:40:17 -04:00
Danielle Lancashire	c31966fc71	logging: Attempt to recover logmon failures Currently, when logmon fails to reattach, we will retry reattachment to the same pid until the task restart specification is exhausted. Because we cannot clear hook state during error conditions, it is not possible for us to signal to a future restart that it _shouldn't_ attempt to reattach to the plugin. Here we revert to explicitly detecting reattachment separately from a launch of a new logmon, so we can recover from scenarios where a logmon plugin has failed. This is a net improvement over the current hard failure situation, as it means in the most common case (the pid has gone away), we can recover. Other reattachment failure modes where the plugin may still be running could potentially cause a duplicate process, or a subsequent failure to launch a new plugin. If there was a duplicate process, it could potentially cause duplicate logging. This is better than a production workload outage. If there was a subsequent failure to launch a new plugin, it would fail in the same way (retry until restarts are exhausted) as the current failure mode.	2019-04-18 13:41:56 +02:00
Michael Schurter	f7a7acc345	Merge pull request #5518 from hashicorp/f-simplify-kill client: simplify kill logic	2019-04-15 14:11:58 -07:00
Chris Baker	6848591914	vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation	2019-04-12 15:06:34 +00:00
Danielle Lancashire	e135876493	allocs: Add nomad alloc restart This adds a `nomad alloc restart` command and api that allows a job operator with the alloc-lifecycle acl to perform an in-place restart of a Nomad allocation, or a given subtask.	2019-04-11 14:25:49 +02:00
Chris Baker	c0a7aee610	vault e2e: pass vault version into setup instead of having to infer it from test name	2019-04-10 10:34:10 -05:00
Chris Baker	f0c184fc29	taskrunner: removed some unnecessary config from a test	2019-04-10 10:34:10 -05:00
Chris Baker	170f5239c8	client: gofmt	2019-04-10 10:34:10 -05:00
Chris Baker	a1d7971b2e	taskrunner: pass configured Vault namespace into TaskTemplateConfig	2019-04-10 10:34:10 -05:00
Michael Schurter	f7d4428855	client: simplify kill logic Remove runLaunched tracking as Run is always called for killable TaskRunners. TaskRunners which fail before Run can be called (during NewTaskRunner or Restore) are not killable as they're never added to the client's alloc map.	2019-04-04 15:18:33 -07:00
Michael Schurter	1d569a27dc	Revert "executor/linux: add defensive checks to binary path" This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.	2019-04-02 11:17:12 -07:00
Michael Schurter	fc5487dbbc	executor/linux: add defensive checks to binary path	2019-04-02 09:40:53 -07:00
Michael Schurter	7d49bc4c71	executor/linux: make chroot binary paths absolute Avoid libcontainer.Process trying to lookup the binary via $PATH as the executor has already found where the binary is located.	2019-04-01 15:45:31 -07:00
Michael Schurter	a4572919cd	Merge pull request #5456 from hashicorp/test-taskenv tests: port pre-0.9 task env tests	2019-03-25 10:41:38 -07:00
Michael Schurter	8efad12538	tests: port pre-0.9 task env tests I chose to make them more of integration tests since there's a lot more plumbing involved. The internal implementation details of how we craft task envs can now change and these tests will still properly assert the task runtime environment is setup properly.	2019-03-25 09:46:53 -07:00
Nick Ethier	dc18b8928a	logmon: make Start rpc idempotent and simplify hook	2019-03-19 14:02:36 -04:00
Nick Ethier	ac7fbee1b8	logmon:add static check for logmon exited hook	2019-03-18 15:59:43 -04:00
Nick Ethier	7dc3d83634	client/logmon: restart log collection correctly when a task is restarted	2019-03-15 23:59:18 -04:00
Michael Schurter	0ba1a5251b	client: cleanup and document context uses Some of the context uses in TR hooks are useless (Killed during Stop never seems meaningful). None of the hooks are interruptible for graceful shutdown which is unfortunate and probably needs fixing.	2019-03-12 15:03:54 -07:00
Michael Schurter	32d31575cc	client: emit event and call exited hooks during cleanup Builds upon earlier commit that cleans up restored handles of terminal allocs by also emitting terminated events and calling exited hooks when appropriate.	2019-03-05 15:12:02 -08:00
Michael Schurter	64e145ebdb	logmon: drop reattach log level as it's expected Logged once per terminal task on agent restart.	2019-03-04 13:26:01 -08:00
Michael Schurter	c5271d3fa5	client: test logmon cleanup The test is sadly quite complicated and peeks into things (logmon's reattach config) AR doesn't normally have access to. However, I couldn't find another way of asserting logmon got cleaned up without resorting to smaller unit tests. Smaller unit tests risk re-implementing dependencies in an unrealistic way, so I opted for an ugly integration test.	2019-03-04 13:15:15 -08:00
Michael Schurter	ef8d284352	client: ensure task is cleaned up when terminal This commit is a significant change. TR.Run is now always executed, even for terminal allocations. This was changed to allow TR.Run to cleanup (run stop hooks) if a handle was recovered. This is intended to handle the case of Nomad receiving a DesiredStatus=Stop allocation update, persisting it, but crashing before stopping AR/TR. The commit also renames task runner hook data as it was very easy to accidentally set state on Requests instead of Responses using the old field names.	2019-03-01 14:00:23 -08:00
Michael Schurter	812f1679e2	Merge pull request #5352 from hashicorp/b-leaked-logmon logmon fixes	2019-02-26 08:35:46 -08:00
Michael Schurter	e39a10a1f4	tests: move unix-specific test to its own file Other logmon tests should be portable.	2019-02-26 07:56:44 -08:00
Michael Schurter	3b2a592e93	client: restart task on logmon failures This code chooses to be conservative as opposed to optimal: when failing to reattach to logmon simply return a recoverable error instead of immediately trying to restart logmon. The recoverable error will cause the task's restart policy to be applied and a new logmon will be launched upon restart. Trying to do the optimal approach of simply starting a new logmon requires error string comparison and should be tested against a task actively logging to assert the behavior (are writes blocked? dropped?).	2019-02-25 15:42:45 -08:00
Michael Schurter	8830b00866	client: test logmon_hook	2019-02-23 15:36:48 -08:00
Preetha Appan	43679f4ce1	More alloc runner tests ported from 0.8.7	2019-02-22 17:58:06 -06:00
Mahmood Ali	32551fb0e5	emit TaskRestartSignal event on vault restart When Vault token expires and task is restarted, emit `TaskRestartSignal` similar to v0.8.7	2019-02-22 15:56:14 -05:00
Mahmood Ali	8cb4bbcc08	address review comments	2019-02-22 15:56:14 -05:00
Mahmood Ali	216eaa4843	tests: port TestTaskRunner_VaultManager_Signal From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427	2019-02-22 15:53:04 -05:00
Mahmood Ali	8e9e732319	tests: port TestTaskRunner_VaultManager_Restart From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352	2019-02-22 15:53:04 -05:00
Mahmood Ali	33122ca7c0	tests: port TestTaskRunner_UnregisterConsul_Retries From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620	2019-02-22 15:53:04 -05:00
Mahmood Ali	0128b0ce7a	tests: port TestTaskRunner_Template_NewVaultToken From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275	2019-02-22 15:53:04 -05:00
Mahmood Ali	cfb80583af	tests: port TestTaskRunner_Template_Artifact From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1195	2019-02-22 15:52:59 -05:00
Mahmood Ali	1b14214a88	tests: port TestAllocRunner_RetryArtifact Port TestAllocRunner_RetryArtifact from https://github.com/hashicorp/nomad/blob/v0.8.7/client/alloc_runner_test.go#L610-L672 I changed the test name because it doesn't actually test that artifact hooks is retried	2019-02-22 15:50:39 -05:00
Mahmood Ali	c827e6e05a	tests: port TestAllocRunner_MoveAllocDir test	2019-02-22 15:50:39 -05:00
Michael Schurter	a2e3ea6dc9	logmon: fix reattach configuration There were multiple bugs here: 1. Reattach unmarshalling always returned an error because you can't unmarshal into a nil pointer. 2. The hook data wasn't being saved because it was put on the request struct, not the response struct. 3. The plugin configuration should only have reattach or a command set. Not both. 4. Setting Done=true meant the hook was never re-run on agent restart so reattaching was never attempted.	2019-02-21 15:32:18 -08:00
Michael Schurter	01cabdff88	client: restart on recoverable StartTask errors Fixes restarting on recoverable errors from StartTask. Ports TestTaskRunner_Run_RecoverableStartError from 0.8 which discovered the bug.	2019-02-21 15:30:49 -08:00
Michael Schurter	e3f321cd27	test: port TestTaskRunner_RestartSignalTask_NotRunning from 0.8	2019-02-21 15:30:49 -08:00
Michael Schurter	f3aa945a00	test: port TestTaskRunner_DriverNetwork from 0.8	2019-02-21 15:30:49 -08:00
Michael Schurter	518405ac33	Merge pull request #5322 from hashicorp/b-artifact-retries Fix regression by restarting on artifact download errors	2019-02-21 15:28:51 -08:00
Michael Schurter	2553800eb8	tests: port TestAllocRunner_Destroy from 0.8 Also add destroy(ar) helper to fix a bunch of shutdown races in AR tests.	2019-02-20 12:35:09 -08:00
Michael Schurter	6580ed668e	client: don't redownload completed artifacts on retries Track the download status of each artifact independently so that if only one of many artifacts fails to download, completed artifacts aren't downloaded again.	2019-02-20 08:45:12 -08:00
Michael Schurter	908bfab4c2	client: artifact errors are retry-able 0.9.0beta2 contains a regression where artifact download errors would not cause a task restart and instead immediately fail the task. This restores the pre-0.9 behavior of retrying all artifact errors and adds missing tests.	2019-02-20 07:21:27 -08:00
Michael Schurter	79ccf00b72	tests: add new task runner test helper Adds a new helper and removes a duplicated test.	2019-02-20 07:21:27 -08:00
Michael Schurter	159042a1a3	client: fix setting alloc unhealthy at deadline During the 0.9 client refactor the code to fail a deployment when the deadline was reached was broken. This restores and tests that behavior.	2019-02-19 07:44:14 -08:00
Mahmood Ali	87be233aca	test: improve readability of duration Co-Authored-By: schmichael <michael.schurter@gmail.com>	2019-02-14 08:12:06 -08:00
Mahmood Ali	16d3414842	test: improve failure message Co-Authored-By: schmichael <michael.schurter@gmail.com>	2019-02-14 08:11:37 -08:00
Michael Schurter	4814f0fb0b	tests: port TestTaskRunner_Download_List from 0.8	2019-02-12 15:48:04 -08:00
Michael Schurter	a152e3ef17	consul: fix task deregistration hook Broke ShutdownDelay but the test was timing dependent so it just appeared flaky. Made the test slower so that it should never incorrectly pass.	2019-02-12 15:36:02 -08:00
Michael Schurter	4ad879e75e	tests: port TaskRunner_DeriveToken tests from 0.8	2019-02-12 15:36:02 -08:00
Michael Schurter	6743ed9fdc	tests: port TestTaskRunner_BlockForVault from 0.8 Also fix race conditions in the mock vault client.	2019-02-12 13:46:09 -08:00
Michael Schurter	6c0cc65b2e	simplify hcl2 parsing helper No need to pass in the entire eval context	2019-02-04 11:07:57 -08:00
Alex Dadgar	5062c54874	Fix usage of fsi variable	2019-01-29 14:07:55 -08:00
Alex Dadgar	6f418ebaf0	Always populate task dir environment variables Fixes an issue where if a task was restarted after restarting the client, the task dir environment variables would not be populated. This PR fixes this for both upgrades from 0.8.X and for normal 0.9 restarts.	2019-01-29 13:17:10 -08:00
Alex Dadgar	5da21635fb	Fix env templates having interpolated destinations Fixes an issue where env templates that had interpolated destinations would not work. Fixes https://github.com/hashicorp/nomad/issues/5250	2019-01-28 10:28:53 -08:00
Alex Dadgar	d6412fd8e7	Fix double restart counting for templates This PR fixes an issue where template restarts would count twice since it was emitting a restarting event.	2019-01-25 15:38:13 -08:00
Nick Ethier	a36c4320ff	Merge pull request #5227 from hashicorp/b-client-highcpu-usage Fix bug related to high cpu usage	2019-01-23 14:27:51 -05:00
Michael Schurter	13f061a83f	Merge pull request #5196 from hashicorp/f-plugin-utils Make plugins/shared external and make pluginutls/	2019-01-23 06:59:32 -08:00
Preetha	05bf183ba3	Merge pull request #5225 from hashicorp/b-notaskevent-terminalallocs Don't emit task events after alloc is in a terminal DesiredState	2019-01-23 08:54:10 -06:00
Michael Schurter	32daa7b47b	goimports until make check is happy	2019-01-23 06:27:14 -08:00
Nick Ethier	bcc3935228	tr: use context in as select statement	2019-01-22 20:11:39 -05:00
Michael Schurter	be0bab7c3f	move pluginutils -> helper/pluginutils I wanted a different color bikeshed, so I get to paint it	2019-01-22 15:50:08 -08:00
Alex Dadgar	2ca0e97361	Split hclspec	2019-01-22 15:43:34 -08:00
Alex Dadgar	5ca6dd7988	move hclutils	2019-01-22 15:43:34 -08:00
Alex Dadgar	72a5691897	Driver tests do not use hcl2/hcl, hclspec, or hclutils	2019-01-22 15:43:34 -08:00
Preetha Appan	38422642cb	Use DesiredState to determine whether to stop sending task events	2019-01-22 16:43:32 -06:00
Preetha Appan	862c9b7de5	dont emit events for terminal allocs	2019-01-22 16:26:33 -06:00
Michael Schurter	1fa376cac6	Merge pull request #5211 from hashicorp/test-porting-08 Port some 0.8 TaskRunner tests	2019-01-22 14:05:53 -08:00
Michael Schurter	8ced0adb67	test: port TestTaskRunner_CheckWatcher_Restart Added ability to adjust the number of events the TaskRunner keeps as there's no way to observe all events otherwise. Task events differ slightly from 0.8 because 0.9 emits Terminated every time a task exits instead of only when it exits on its own (not due to restart or kill). 0.9 does not emit Killing/Killed for restarts like 0.8 which seems fine as `Restart Signaled/Terminated/Restarting` is more descriptive. Original v0.8 events emitted: ``` expected := []string{ "Received", "Task Setup", "Started", "Restart Signaled", "Killing", "Killed", "Restarting", "Started", "Restart Signaled", "Killing", "Killed", "Restarting", "Started", "Restart Signaled", "Killing", "Killed", "Not Restarting", } ```	2019-01-22 09:46:46 -08:00
Michael Schurter	1719752a9d	test: port RestartTask from 0.8	2019-01-22 08:08:08 -08:00
Michael Schurter	9edff19625	test: port SignalFailure test from 0.8 Also fix signal error handling in mock_driver.	2019-01-22 08:08:08 -08:00
Preetha Appan	299a5fc821	Rename TaskKillRequest/Response to TaskPreKillRequest/Response	2019-01-22 09:54:02 -06:00
Preetha Appan	5a5b9c5666	Fix log comments	2019-01-22 09:45:58 -06:00
Preetha Appan	06e15f8381	Rename TaskKillHook to TaskPreKillHook to more closely match usage Also added/fixed comments	2019-01-22 09:41:56 -06:00
Michael Schurter	3b02af9386	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-01-22 09:41:21 -06:00
Preetha Appan	09291c689b	Rename TaskKillHook to TaskPreKillHook to more closely match usage Also added/fixed comments	2019-01-22 09:41:21 -06:00
Nick Ethier	47127de671	ar: return error from hooks if occurred	2019-01-18 18:31:02 -05:00
Mahmood Ali	5df63fda7c	Merge pull request #5190 from hashicorp/f-memory-usage Track Basic Memory Usage as reported by cgroups	2019-01-18 16:46:02 -05:00
Chris Baker	290c3f36ad	set TaskGroupName in task_runner	2019-01-18 20:25:11 +00:00
Chris Baker	8917961caa	documenting test for task runner failure to set TaskGroupName	2019-01-18 20:00:49 +00:00
Michael Schurter	cfadacfd95	Merge pull request #5203 from hashicorp/b-terminated client: restore Terminated event on every exit	2019-01-18 08:54:15 -08:00
Preetha Appan	e0b68a19c6	Fix one more place that should be using taskResources taskResources handles new resource fields in a backwards compatible way	2019-01-17 15:52:51 -06:00
Michael Schurter	a20ac7c1de	client: restore Terminated event on every exit v0.9.0-dev started emitting a Terminated event every time a task process exited. While this wasn't true in previous versions, it's a useful task event because it's the only place for job operators to view the task's exit code. This behavior is asserted in the e2e/taskevents tests.	2019-01-17 10:02:25 -08:00
Danielle Tomlinson	a695b3562c	Merge pull request #5193 from hashicorp/dani/logmon-reattach logmon: Reattach to existing loggers	2019-01-16 17:34:13 +01:00
Danielle Tomlinson	99da4c780d	logmon: Reattach to existing loggers This commit prevents us from creating duplicate logmon hooks when restoring allocations by persisting the logmon reattach config using HookData.	2019-01-16 14:56:10 +01:00
Michael Schurter	daa7d029a1	test: porting TestTaskRunner_SimpleRun_Dispatch Porting test from 0.8 to 0.9.	2019-01-15 15:22:13 -08:00
Michael Schurter	48afda786b	Merge pull request #5187 from hashicorp/test-consul Port a bunch of pre-0.9 Consul tests to 0.9	2019-01-15 07:41:50 -08:00
Mahmood Ali	9909d98bee	Track Basic Memory Usage as reported by cgroups Track current memory usage, `memory.usage_in_bytes`, in addition to `memory.max_memory_usage_in_bytes` and friends. This number is closer to what Docker reports. Related to https://github.com/hashicorp/nomad/issues/5165 .	2019-01-14 18:47:52 -05:00
Nick Ethier	c619e70d39	Merge pull request #5018 from hashicorp/f-executor-stats executor: streaming stats api	2019-01-14 15:02:35 -05:00
Michael Schurter	4e7ea460e8	test: port some pre-0.9 DeploymentHealth tests Skipping a failing one as I need to move to some other work and don't want to leave this work orphaned on my machine.	2019-01-14 09:56:53 -08:00
Michael Schurter	ff2f23f5f9	test: assert service interpolation behavior Ported from pre-0.9 tests.	2019-01-14 09:56:53 -08:00
Michael Schurter	e877bb6370	test: assert shutdown delay deregs first Restore a pre-0.9 test that asserts Consul services are deregistered before a task's shutdown delay.	2019-01-14 09:56:53 -08:00
Michael Schurter	1ca858fa92	Update client/allocrunner/taskrunner/stats_hook.go Co-Authored-By: nickethier <ncethier@gmail.com>	2019-01-14 12:31:27 -05:00
Nick Ethier	fbd403df96	tr: stop stats collection on Exited hook	2019-01-14 12:30:14 -05:00
Nick Ethier	597b7b751d	tr: add retry /w backoff to stats_hook failure	2019-01-12 12:18:24 -05:00
Nick Ethier	7e306afde3	executor: fix failing stats related test	2019-01-12 12:18:23 -05:00
Nick Ethier	9fea54e0dc	executor: implement streaming stats API plugins/driver: update driver interface to support streaming stats client/tr: use streaming stats api TODO: * how to handle errors and closed channel during stats streaming * prevent tight loop if Stats(ctx) returns an error drivers: update drivers TaskStats RPC to handle streaming results executor: better error handling in stats rpc docker: better control and error handling of stats rpc driver: allow stats to return a recoverable error	2019-01-12 12:18:22 -05:00
Preetha Appan	f059ef8a47	Modified destroy failure handling to rely on allocrunner's destroy method Added a unit test with custom statedb implementation that errors, to use to verify destroy errors	2019-01-12 10:37:12 -06:00
Alex Dadgar	bd12e0b1f7	Merge pull request #5168 from hashicorp/b-kill-race Improve Kill handling on task runner	2019-01-09 12:05:10 -08:00
Alex Dadgar	069e181e8f	add more comments	2019-01-09 12:04:22 -08:00
Michael Schurter	e5ddff861c	Spelling fix Co-Authored-By: dadgar <alex@hashicorp.com>	2019-01-09 11:42:40 -08:00
Mahmood Ali	90f3cea187	Merge pull request #5157 from hashicorp/r-drivers-no-cstructs drivers: avoid referencing client/structs package	2019-01-09 13:06:46 -05:00
Alex Dadgar	149dec2169	Improve Kill handling on task runner This PR improves how killing a task is handled. Before the kill function directly orchestrated the killing and was only valid while the task was running. The new behavior is to mark the desired state and wait for the task runner to converge to that state.	2019-01-08 16:42:26 -08:00


480 Commits