open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	e43d33aa50	client: don't run alloc postrun during shutdown	2019-09-25 14:58:17 -04:00
Tim Gross	d965a15490	driver/networking: don't recreate existing network namespaces	2019-09-25 14:58:17 -04:00
Jasmine Dahilig	0780adfa7f	timeout after 5 seconds when client opens a data directory (#6348)	2019-09-24 16:28:21 -07:00
Danielle Lancashire	c9bcb7e76a	client_stats: Always emit client stats	2019-09-19 01:22:08 +02:00
Danielle Lancashire	4f2343e1c0	client: Return empty values when host stats fail Currently, there is an issue when running on Windows whereby under some circumstances the Windows stats APIs will begin to return errors (such as internal timeouts) when a client is under high load, and potentially other forms of resource contention / system states (and other unknown cases). When an error occurs during this collection, we then short circuit further metrics emission from the client until the next interval. This can be problematic if it happens for a sustained number of intervals, as our metrics aggregator will begin to age out older metrics, and we will eventually stop emitting various types of metrics including `nomad.client.unallocated.*` metrics. However, when metrics collection fails on Linux, gopsutil will in many cases (e.g. cpu.Times) silently return 0 values, rather than an error. Here, we switch to returning empty metrics in these failures, and logging the error at the source. This brings the behaviour into line with Linux/Unix platforms, and although making aggregation a little sadder on intermittent failures, will result in more desirable overall behaviour of keeping metrics available for further investigation if things look unusual.	2019-09-19 01:22:07 +02:00
Danielle	f8b64ee1ab	Merge pull request #6330 from hashicorp/f-host-vols-fail-startup client: Fail startup if host volumes do not exist	2019-09-17 10:55:30 -07:00
Danielle	ec3ecdecfc	Merge pull request #6321 from hashicorp/dani/remove-config Hoist Volume.Config.Source into Volume.Source	2019-09-16 10:12:58 -07:00
Danielle Lancashire	e3796e9d60	client: Fail startup if host volumes do not exist Some drivers will automatically create directories when trying to mount a path into a container, but some will not. To unify this behaviour, this commit requires that host volumes already exist, and can be stat'd by the Nomad Agent during client startup.	2019-09-13 23:28:10 +02:00
Ruslan Usifov	b3c72d1729	close file handle when FileRotator object is closed. Fixes https://github.com/hashicorp/nomad/issues/6309 (#6323)	2019-09-13 10:31:13 -04:00
Tim Gross	a6ef8c5d42	client/networking: wrap error message from CNI plugin (#6316)	2019-09-13 08:20:05 -04:00
Danielle Lancashire	78b61de45f	config: Hoist volume.config.source into volume Currently, using a Volume in a job uses the following configuration: ``` volume "alias-name" { type = "volume-type" read_only = true config { source = "host_volume_name" } } ``` This commit migrates to the following: ``` volume "alias-name" { type = "volume-type" source = "host_volume_name" read_only = true } ``` The original design was chosen because we were uncertain about the future of storage plugins, and to allow maximum flexibility. However, this causes a few issues, namely: - We frequently need to parse this configuration during submission, scheduling, and mounting - It complicates the configuration from an end user's perspective - It complicates the ability to do validation As we understand the problem space of CSI a little more, it has become clear that we won't need the `source` to be in config, as it will be used in the majority of cases: - Host Volumes: Always need a source - Preallocated CSI Volumes: Always needs a source from a volume or claim name - Dynamic Persistent CSI Volumes: Always needs a source to attach the volumes to for managing upgrades and to avoid dangling. - Dynamic Ephemeral CSI Volumes: Less thought out, but `source` will probably point to the plugin name, and a `config` block will allow you to pass meta to the plugin. Or will point to a pre-configured ephemeral config. *If implemented The new design simplifies this by merging the source into the volume stanza to solve the above issues with usability, performance, and error handling.	2019-09-13 04:37:59 +02:00
Mahmood Ali	78f62d3670	Merge pull request #6080 from lchayoun/bug-6079 Allow dash in environment variable names	2019-09-11 11:17:24 -07:00
Mahmood Ali	b6bf7f9a6c	Merge pull request #6260 from hashicorp/c-circleci-tweak-20190903 ci: ignore nested pkgs in GOTEST_PKGS_EXCLUDE	2019-09-11 11:17:10 -07:00
Tim Gross	b05cd4c430	test: expand symlink for temp dir for macOS compatibility (#6303) On macOS, `os.TempDir` returns a symlinked path under `/var` which is outside of the directories shared into the VM used for Docker, and that fails tests using Docker that need that mount. If we expand the symlink to get the real path in `/private`, we're in the shared folders and can safely mount them.	2019-09-10 12:20:09 -04:00
Mahmood Ali	4b8280e51d	remove generated code	2019-09-06 19:24:15 +00:00
Nomad Release bot	dc7d728a82	Generate files for 0.10.0-beta1 release	2019-09-06 18:47:09 +00:00
Tim Gross	3fa4bca4a0	script checks: Update needs to update Alloc as well (#6291)	2019-09-06 11:18:00 -04:00
Mahmood Ali	01f42053e4	dev: avoid codecgen code in downstream projects This is an attempt to ease dependency management for external driver plugins, by avoiding requiring them to compile ugorji/go generated files. Plugin developers reported some pain with the brittleness of ugorji/go dependency in particular, especially when using go mod, the default go mod manager in golang 1.13. Context -------- Nomad uses msgpack to persist and serialize internal structs, using ugorji/go library. As an optimization, we use ugorji/go code generation to speed up the process and avoid the reflection-based slow path. We commit these generated files in repository when we cut and tag the release to ease reproducibility and debugging old releases. Thus, downstream projects that depend on release tag, indirectly depend on ugorji/go generated code. Sadly, the generated code is brittle and specific to the version of ugorji/go being used. When go mod picks another version of ugorji/go than nomad (go mod by default uses release according to semver), downstream projects face compilation errors. Interestingly, downstream projects don't commonly serialize nomad internal structs. Drivers and device plugins use grpc instead of msgpack for the most part. In the few cases where they use msgpack (e.g. decoding task config), they do without codegen path as they run on driver specific structs not the nomad internal structs. Also, the ugorji/go serialization through reflection is generally backward compatible (mod some ugorji/go regression bugs that get introduced every now and then :( ). Proposal --------- The proposal here is to keep committing ugorji/go codec generated files for releases but to use a go tag for them. All nomad development through the makefile, including releasing, CI and dev flow, has the tag enabled. Downstream plugin projects, by default, will skip these files and life proceeds as normal for them. The downside is that nomad developers who use generated code but avoid using make must start passing additional go tag argument. Though this is not a blessed configuration.	2019-09-06 09:22:00 -04:00
Tim Gross	8ce201854a	client: recreate script checks on Update (#6265) Splitting the immutable and mutable components of the scriptCheck led to a bug where the environment interpolation wasn't being incorporated into the check's ID, which caused the UpdateTTL to update for a check ID that Consul didn't have (because our Consul client creates the ID from the structs.ServiceCheck each time we update). Task group services don't have access to a task environment at creation, so their checks get registered before the check can be interpolated. Use the original check ID so they can be updated.	2019-09-05 11:43:23 -04:00
Michael Schurter	ee06c36345	Merge pull request #6254 from hashicorp/test-connect-e2e-demo e2e: test demo job for connect	2019-09-04 14:33:26 -07:00
Nick Ethier	e440ba80f1	ar: refactor network bridge config to use go-cni lib (#6255) * ar: refactor network bridge config to use go-cni lib * ar: use eth as the iface prefix for bridged network namespaces * vendor: update containerd/go-cni package * ar: update network hook to use TODO contexts when calling configurator * unnecessary conversion	2019-09-04 16:33:25 -04:00
Michael Schurter	93b47f4ddc	client: reword error message	2019-09-04 12:40:09 -07:00
Mahmood Ali	b76de72943	Merge pull request #6251 from hashicorp/b-port-map-regression NOMAD_PORT_<label> regression	2019-09-04 11:54:09 -04:00
Danielle	2564a1dabd	Merge pull request #6239 from hashicorp/b-32bitmem Fix memory fingerprinting on 32bit	2019-09-04 17:39:07 +02:00
Danielle	aa5605fce1	fingerprint: Add backwards compatibility comment Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-09-04 17:38:35 +02:00
Mahmood Ali	87f0457973	fix qemu and update docker with tests	2019-09-04 11:27:51 -04:00
Jasmine Dahilig	5b6e39b37c	fix portmap envvars in docker driver	2019-09-04 11:26:13 -04:00
Danielle Lancashire	67715d846e	fingerprint: Restore support for legacy memory fingerprint	2019-09-04 17:10:28 +02:00
Tim Gross	0f29dcc935	support script checks for task group services (#6197) In Nomad prior to Consul Connect, all Consul checks work the same except for Script checks. Because the Task being checked is running in its own container namespaces, the check is executed by Nomad in the Task's context. If the Script check passes, Nomad uses the TTL check feature of Consul to update the check status. This means in order to run a Script check, we need to know what Task to execute it in. To support Consul Connect, we need Group Services, and these need to be registered in Consul along with their checks. We could push the Service down into the Task, but this doesn't work if someone wants to associate a service with a task's ports, but do script checks in another task in the allocation. Because Nomad is handling the Script check and not Consul anyways, this moves the script check handling into the task runner so that the task runner can own the script check's configuration and lifecycle. This will allow us to pass the group service check configuration down into a task without associating the service itself with the task. When tasks are checked for script checks, we walk back through their task group to see if there are script checks associated with the task. If so, we'll spin off script check tasklets for them. The group-level service and any restart behaviors it needs are entirely encapsulated within the group service hook.	2019-09-03 15:09:04 -04:00
Pete Woods	49b7d23cea	Add node "status" and "scheduling eligibility" tags to client metrics (#6130) When summing up the capability of your Nomad fleet for scaling purposes, it's important to exclude draining nodes, as they won't accept new jobs.	2019-09-03 12:11:11 -04:00
Michael Schurter	5957030d18	connect: add unix socket to proxy grpc for envoy (#6232) * connect: add unix socket to proxy grpc for envoy Fixes #6124 Implement a L4 proxy from a unix socket inside a network namespace to Consul's gRPC endpoint on the host. This allows Envoy to connect to Consul's xDS configuration API. * connect: pointer receiver on structs with mutexes * connect: warn on all proxy errors	2019-09-03 08:43:38 -07:00
Mahmood Ali	0f401e6b8b	lint: ignore generated windows syscall wrappers	2019-09-03 10:59:58 -04:00
Jasmine Dahilig	4edebe389a	add default update stanza and max_parallel=0 disables deployments (#6191)	2019-09-02 10:30:09 -07:00
Evan Ercolano	fcf66918d0	Remove unused canary param from MakeTaskServiceID	2019-08-31 16:53:23 -04:00
Danielle Lancashire	4fcb7394e9	client: Fix memory fingerprinting on 32bit Also introduce regression ci for 32 bit fingerprinting	2019-08-31 18:33:59 +02:00
Mahmood Ali	0bd2eee87f	Merge pull request #6216 from hashicorp/b-recognize-pending-allocs alloc_runner: wait when starting suspicious allocs	2019-08-28 14:46:09 -04:00
Mahmood Ali	e0da3c5d0e	rename to hasLocalState, and ignore clientstate The ClientState being pending isn't a good criterion, as an alloc may have been updated in-place before it was completed. Also, updated the logic so we only check for task states. If an alloc has deployment state but no persisted tasks at all, restore will still fail.	2019-08-28 11:44:48 -04:00
Nick Ethier	cf014c7fd5	ar: ensure network forwarding is allowed for bridged allocs (#6196) * ar: ensure network forwarding is allowed in iptables for bridged allocs * ensure filter rule exists at setup time	2019-08-28 10:51:34 -04:00
Nick Ethier	cbb27e74bc	Add environment variables for connect upstreams (#6171) * taskenv: add connect upstream env vars + test * set taskenv upstreams instead of appending * Update client/taskenv/env.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-27 23:41:38 -04:00
Mahmood Ali	90c5eefbab	Alternative approach: avoid restoring This uses an alternative approach where we avoid restoring the alloc runner in the first place, if we suspect that the alloc may have been completed already.	2019-08-27 17:30:55 -04:00
Jasmine Dahilig	4078393bb6	expose nomad namespace as environment variable in allocation #5692 (#6192)	2019-08-27 08:38:07 -07:00
Mahmood Ali	647c1457cb	alloc_runner: wait when starting suspicious allocs This commit aims to help users running with clients susceptible to the destroyed alloc being restarted bug upgrade to latest. Without this, such users will have their tasks run unexpectedly on upgrade and only see the bug resolved after subsequent restart. If, on restore, the client sees a pending alloc without any other persisted info, then err on the side that it's a corrupt persisted state of an alloc instead of the client happening to be killed right when alloc is assigned to client. Few reasons motivate this behavior: Statistically speaking, corruption being the cause is more likely. A long running client will have higher chance of having allocs persisted incorrectly with pending state. Being killed right when an alloc is about to start is relatively unlikely. Also, delaying starting an alloc that hasn't started (by hopefully seconds) is not as severe as launching too many allocs that may bring client down. More importantly, this helps customers upgrade their clients without risking taking their clients down and destabilizing their cluster. We don't want existing users to force triggering the bug while they upgrade and restart cluster.	2019-08-26 22:05:31 -04:00
Mahmood Ali	dfdf0edd3b	Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun Don't persist allocs of destroyed alloc runners	2019-08-26 17:26:18 -04:00
Mahmood Ali	cc460d4804	Write to client store while holding lock Protect against a race where destroying and persist state goroutines race. The downside is that the database io operation will run while holding the lock and may run indefinitely. The risk of lock being long held is slow destruction, but slow io has bigger problems.	2019-08-26 13:45:58 -04:00
Mahmood Ali	1851820f20	logmon: log stat error to help debugging	2019-08-26 10:10:20 -04:00
Mahmood Ali	c132623ffc	Don't persist allocs of destroyed alloc runners This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information in non-transaction non-atomic concurrently while alloc runner is running and potentially changing state is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890	2019-08-25 11:21:28 -04:00
Mahmood Ali	6301725002	logmon: revert workaround for Windows go1.11 bug Revert e0126123ab1ba848f72458538bc6118c978245e6 now that we are running with Golang 1.12, and https://github.com/golang/go/issues/29119 is no longer relevant.	2019-08-24 08:19:44 -04:00
Mahmood Ali	df1f3eb9ee	Merge pull request #6201 from hashicorp/b-device-stats-interval initialize device manager stats interval	2019-08-24 08:16:03 -04:00
Mahmood Ali	07b5f4c530	Merge pull request #6146 from hashicorp/b-config-template-copy clientConfig.Copy() to copy template config too	2019-08-23 19:00:57 -04:00
Mahmood Ali	b98568774b	clientConfig.Copy() to copy template config too	2019-08-23 18:43:22 -04:00
Lang Martin	4f6493a301	taskrunner getter set Umask for go-getter, setuid test	2019-08-23 15:59:03 -04:00
Mahmood Ali	3890619100	initialize device manager stats interval Fixes a bug where the CPU is pegged at 100% due to collecting devices statistics. The passed stats interval was ignored, and the default zero value causes a very tight loop of stats collection. FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a `g2.2xlarge` ec2 instance. The stats interval defaults to 1 second and is user configurable. I believe this is too frequent as a default, and I may advocate for reducing it to a value closer to 5s or 10s, but keeping it as is for now. Fixes https://github.com/hashicorp/nomad/issues/6057.	2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet	cbdc1978bf	Consul service meta (#6193) * adds meta object to service in job spec, sends it to consul * adds tests for service meta * fix tests * adds docs * better hashing for service meta, use helper for copying meta when registering service * tried to be DRY, but looks like it would be more work to use the helper function	2019-08-23 12:49:02 -04:00
Nick Ethier	96d379071d	ar: fix bridge networking port mapping when port.To is unset (#6190)	2019-08-22 21:53:52 -04:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, bootstrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Michael Schurter	b008fd1724	connect: register group services with Consul Fixes #6042 Add new task group service hook for registering group services like Connect-enabled services. Does not yet support checks.	2019-08-20 12:25:10 -07:00
lchayoun	2307c9d1d2	allow dash in non generated environment variable names - should only clean generate environment variables	2019-08-16 11:11:47 +03:00
Nick Ethier	965f00b2fc	Builtin Admission Controller Framework (#6116) * nomad: add admission controller framework * nomad: add admission controller framework and Consul Connect hooks * run admission controllers before checking permissions * client: add default node meta for connect configurables * nomad: remove validateJob func since it has been moved to admission controller * nomad: use new TaskKind type * client: use consts for connect sidecar image and log level * Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad: add job register test with connect sidecar * Update nomad/job_endpoint_hooks.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-15 11:22:37 -04:00
lchayoun	c5a38a045a	allow dash in non generated environment variable names - should only clean generate environment variables	2019-08-13 19:23:13 +03:00
Tim Gross	03433f35d4	client/template: configuration for function blacklist and sandboxing When rendering a task template, the `plugin` function is no longer permitted by default and will raise an error. An operator can opt-in to permitting this function with the new `template.function_blacklist` field in the client configuration. When rendering a task template, path parameters for the `file` function will be treated as relative to the task directory by default. Relative paths or symlinks that point outside the task directory will raise an error. An operator can opt-out of this protection with the new `template.disable_file_sandbox` field in the client configuration.	2019-08-12 16:34:48 -04:00
Danielle Lancashire	7e6c8e5ac1	Copy documentation to api/tasks	2019-08-12 16:22:27 +02:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	e132a30899	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle Lancashire	6ef8d5233e	client: Add volume_hook for mounting volumes	2019-08-12 15:39:08 +02:00
Danielle Lancashire	063e4240c1	client: Add parsing and registration of HostVolume configuration	2019-08-12 15:39:08 +02:00
lchayoun	ca892163b2	allow dash in non generated environment variable names	2019-08-11 12:51:42 +03:00
Nick Ethier	7806f4c597	Revert "client: add autofetch for CNI plugins" This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.	2019-08-08 15:10:19 -04:00
Nick Ethier	7d28ece8de	Revert "client: remove debugging lines" This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.	2019-08-08 14:52:52 -04:00
Liel Chayoun	24dcb2379c	Update env_test.go	2019-08-06 11:59:31 +03:00
Mahmood Ali	b17bac5101	Render consul templates using task env only (#6055) When rendering a task consul template, ensure that only task environment variables are used. Currently, `consul-template` always falls back to host process environment variables when key isn't a task env var[1]. Thus, we add an empty entry for each host process env-var not found in task env-vars. [1] `bfa5d0e133/template/funcs.go (L61-L75)`	2019-08-05 16:30:47 -04:00
Mahmood Ali	f66169cd6a	Merge pull request #6065 from hashicorp/b-nil-driver-exec Check if driver handle is nil before execing	2019-08-02 09:48:28 -05:00
Mahmood Ali	a4670db9b7	Check if driver handle is nil before execing Defend against tr.getDriverHandle being nil. Exec handler checks if task is running, but it may be stopped between check and driver handler fetching.	2019-08-02 10:07:41 +08:00
Nick Ethier	7de0bec8ab	client/cni: updated comments and simplified logic to auto download plugins	2019-07-31 01:04:10 -04:00
Nick Ethier	b16640c50d	Apply suggestions from code review Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2019-07-31 01:04:10 -04:00
Nick Ethier	321d10a041	client: remove debugging lines	2019-07-31 01:04:09 -04:00
Nick Ethier	af6b191963	client: add autofetch for CNI plugins	2019-07-31 01:04:09 -04:00
Nick Ethier	1e9dd1b193	remove unused file	2019-07-31 01:04:09 -04:00
Nick Ethier	09a4cfd8d7	fix failing tests	2019-07-31 01:04:07 -04:00
Nick Ethier	ef83f0831b	ar: plumb client config for networking into the network hook	2019-07-31 01:04:06 -04:00
Nick Ethier	af66a35924	networking: Add new bridge networking mode implementation	2019-07-31 01:04:06 -04:00
Michael Schurter	fb487358fb	connect: add group.service stanza support	2019-07-31 01:04:05 -04:00
Nick Ethier	63c5504d56	ar: fix lint errors	2019-07-31 01:03:19 -04:00
Nick Ethier	e312201d18	ar: rearrange network hook to support building on windows	2019-07-31 01:03:19 -04:00
Nick Ethier	370533c9c7	ar: fix test that failed due to error renaming	2019-07-31 01:03:19 -04:00
Nick Ethier	2d60ef64d9	plugins/driver: make DriverNetworkManager interface optional	2019-07-31 01:03:19 -04:00
Nick Ethier	f87e7e9c9a	ar: plumb error handling into alloc runner hook initialization	2019-07-31 01:03:18 -04:00
Nick Ethier	ef1795b344	ar: add tests for network hook	2019-07-31 01:03:18 -04:00
Nick Ethier	15989bba8e	ar: cleanup lint errors	2019-07-31 01:03:18 -04:00
Nick Ethier	220cba3e7e	ar: move linux specific code to its own file and add tests	2019-07-31 01:03:18 -04:00
Nick Ethier	548f78ef15	ar: initial driver based network management	2019-07-31 01:03:17 -04:00
Nick Ethier	66c514a388	Add network lifecycle management Adds new Prerun and Postrun hooks to manage set up of network namespaces on linux. Work still needs to be done to make the code platform agnostic and support Docker style network initialization.	2019-07-31 01:03:17 -04:00
Preetha Appan	d048029b5a	remove generated code and change version to 0.10.0	2019-07-30 15:56:05 -05:00
Nomad Release bot	e39fb11531	Generate files for 0.9.4 release	2019-07-30 19:05:18 +00:00
Preetha Appan	6b4c40f5a8	remove generated code	2019-07-23 12:07:49 -05:00
Nomad Release bot	04187c8b86	Generate files for 0.9.4-rc1 release	2019-07-22 21:42:36 +00:00
Michael Schurter	d90680021e	logmon: fix comment formatting	2019-07-22 13:05:01 -07:00
Michael Schurter	e37bc3513c	logmon: ensure errors are still handled properly ...and add a comment to switch back to the old error handling once we switch to Go 1.12.	2019-07-22 12:49:48 -07:00
Danielle Lancashire	1bcbbbfbe6	logmon: Workaround golang/go#29119 There's a bug in go1.11 that causes some io operations on windows to return incorrect errors for some cases when Stat-ing files. To avoid upgrading to go1.12 in a point release, here we loosen up the cases where we will attempt to create fifos, and add some logging of underlying stat errors to help with debugging.	2019-07-22 18:28:12 +02:00
Jasmine Dahilig	2157f6ddf1	add formatting for hcl parsing error messages (#5972)	2019-07-19 10:04:39 -07:00
Mahmood Ali	cd6f1d3102	Update consul-template dependency to latest To pick up the fix in https://github.com/hashicorp/consul-template/pull/1231.	2019-07-18 07:32:03 +07:00
Mahmood Ali	8a82260319	log unrecoverable errors	2019-07-17 11:01:59 +07:00
Mahmood Ali	1a299c7b28	client/taskrunner: fix stats retry logic Previously, if a channel is closed, we retry the Stats call. But, if that call fails, we go in a backoff loop without calling Stats ever again. Here, we use a utility function for calling driverHandle.Stats call that retries as one expects. I aimed to preserve the logging formats but made small improvements as I saw fit.	2019-07-11 13:58:07 +08:00
Preetha Appan	7d645c5ad9	Test file for detect content type that satisfies linter and encoding	2019-07-10 11:42:04 -05:00
Preetha Appan	ef9a71c68b	code review feedback	2019-07-10 10:41:06 -05:00
Preetha Appan	990e468edc	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Preetha Appan	108a292cc0	fix linting failure in test case file	2019-07-08 11:29:12 -05:00
Michael Lange	b2e9570075	Use consistent casing in the JSON representation of the AllocFileInfo struct	2019-07-02 17:27:31 -07:00
Preetha Appan	8495fb9055	Added additional test cases and fixed go test case	2019-07-02 13:25:29 -05:00
Mahmood Ali	a97d451ac7	Merge pull request #5905 from hashicorp/b-ar-failed-prestart Fail alloc if alloc runner prestart hooks fail	2019-07-02 20:25:53 +08:00
Danielle	c6872cdf12	Merge pull request #5864 from hashicorp/dani/win-pipe-cleaner windows: Fix restarts using the raw_exec driver	2019-07-02 13:58:56 +02:00
Danielle Lancashire	e20300313f	fifo: Safer access to Conn	2019-07-02 13:12:54 +02:00
Mahmood Ali	f10201c102	run post-run/post-stop task runner hooks Handle when prestart failed while restoring a task, to prevent accidentally leaking consul/logmon processes.	2019-07-02 18:38:32 +08:00
Mahmood Ali	4afd7835e3	Fail alloc if alloc runner prestart hooks fail When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are: * Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861 * Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed * Alloc not being restarted/rescheduled to another node (as it's still in pending state) * Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time! Here, we treat all tasks to have failed if alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node. While it's desirable to retry alloc runner in such failures, I opted to treat it out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow up PR. This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840.	2019-07-02 18:35:47 +08:00
Mahmood Ali	7614b8f09e	Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2 task runner to avoid running task if terminal	2019-07-02 15:31:17 +08:00
Mahmood Ali	7bfad051b9	address review comments	2019-07-02 14:53:50 +08:00
Mahmood Ali	c0c00ecc07	Merge pull request #5906 from hashicorp/b-alloc-stale-updates client: defensive against getting stale alloc updates	2019-07-02 12:40:17 +08:00
Preetha Appan	c09342903b	Improve test cases for detecting content type	2019-07-01 16:24:48 -05:00
Danielle Lancashire	688f82f07d	fifo: Close connections and cleanup lock handling	2019-07-01 14:14:29 +02:00
Danielle Lancashire	2c7d1f1b99	logmon: Add windows compatibility test	2019-07-01 14:14:06 +02:00
Mahmood Ali	c5f5a1fcb9	client: defensive against getting stale alloc updates When fetching node alloc assignments, be defensive against a stale read before killing local nodes allocs. The bug is when both client and servers are restarting and the client requests the node allocation for the node, it might get stale data as server hasn't finished applying all the restored raft transaction to store. Consequently, client would kill and destroy the alloc locally, just to fetch it again moments later when server store is up to date. The bug can be reproduced quite reliably with single node setup (configured with persistence). I suspect it's too edge-casey to occur in production cluster with multiple servers, but we may need to examine leader failover scenarios more closely. In this commit, we only remove and destroy allocs if the removal index is more recent than the alloc index. This seems like a cheap resiliency fix we already use for detecting alloc updates. A more proper fix would be to ensure that a nomad server only serves RPC calls when state store is fully restored or up to date in leadership transition cases.	2019-06-29 04:17:35 -05:00
Preetha Appan	3345ce3ba4	Infer content type in alloc fs stat endpoint	2019-06-28 20:31:28 -05:00
Danielle Lancashire	e1151f743b	appveyor: Run logmon tests	2019-06-28 16:01:41 +02:00
Danielle Lancashire	634ada671e	fifo: Require that fifos do not exist for create Although this operation is safe on linux, it is not safe on Windows when using the named pipe interface. To provide a ~reasonable common api abstraction, here we switch to returning File exists errors on the unix api.	2019-06-28 13:47:18 +02:00
Danielle Lancashire	0ff27cfc0f	vendor: Use dani fork of go-winio	2019-06-28 13:47:18 +02:00
Danielle Lancashire	514a2a6017	logmon: Refactor fifo access for windows safety On unix platforms, it is safe to re-open fifos for reading after the first creation if the file is already a fifo, however this is not possible on windows where this triggers a permissions error on the socket path, as you cannot recreate it. We can't transparently handle this in the CreateAndRead handle, because the Access Is Denied error is too generic to reliably be an IO error. Instead, we add an explicit API for opening a reader to an existing FIFO, and check to see if the fifo already exists inside the calling package (e.g. logmon)	2019-06-28 13:41:54 +02:00
Mahmood Ali	3d89ae0f1e	task runner to avoid running task if terminal This change fixes a bug where nomad would avoid running alloc tasks if the alloc is client terminal but the server copy on the client isn't marked as running. Here, we fix the case by having task runner use allocRunner.shouldRun() instead of only checking the server updated alloc. Here, we preserve much of the invariants such that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles. Fixes https://github.com/hashicorp/nomad/issues/5883	2019-06-27 11:27:34 +08:00
Danielle Lancashire	b9ac184e1f	tr: Fetch Wait channel before killTask in restart Currently, if killTask results in the termination of a process before calling WaitTask, Restart() will incorrectly return a TaskNotFound error when using the raw_exec driver on Windows.	2019-06-26 15:20:57 +02:00
Mahmood Ali	b209584dce	Merge pull request #5726 from hashicorp/b-plugins-via-init Use init() to handle plugin invocation	2019-06-18 21:09:03 -04:00
Mahmood Ali	ac64509c59	comment on use of init() for plugin handlers	2019-06-18 20:54:55 -04:00
Chris Baker	f71114f5b8	cleanup test	2019-06-18 14:15:25 +00:00
Chris Baker	a2dc351fd0	formatting and clarity	2019-06-18 14:00:57 +00:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Mahmood Ali	962921f86c	Use init to handle plugin invocation Currently, nomad "plugin" processes (e.g. executor, logmon, docker_logger) are started as CLI commands to be handled by command CLI framework. Plugin launchers use `discover.NomadBinary()` to identify the binary and start it. This has a few downsides: The trivial one is that when running tests, one must re-compile the nomad binary as the tests need to invoke the nomad executable to start plugin. This is frequently overlooked, resulting in puzzlement. The more significant issue with `executor` in particular is in relation to external driver: * Plugin must identify the path of invoking nomad binary, which is not trivial; `discover.NomadBinary()` now returns the path to the plugin rather than to nomad, preventing external drivers from launching executors. * The external driver may get a different version of executor than it expects (especially if we make a binary incompatible change in future). This commit addresses both downsides by having the plugin invocation handled through an `init()` call, similar to how libcontainer init handler is done in [1] and recommended by libcontainer [2]. `init()` will be invoked and handled properly in tests and external drivers. For external drivers, this change will cause external drivers to launch the executor that's compiled against. There are a couple of downsides to this approach: * These specific packages (i.e. executor, logmon, and dockerlog) need to be careful in use of `init()`, package initializers. Must avoid having command execution rely on any other init in the package. I prefixed files with `z_` (golang processes files in lexical order), but ensured we don't depend on order. * The command handling is spread in multiple packages making it a bit less obvious how plugin starts are handled. [1] drivers/shared/executor/libcontainer_nsenter_linux.go [2] `eb4aeed24f/libcontainer (using-libcontainer)`	2019-06-13 16:48:01 -04:00
Jasmine Dahilig	ed9740db10	Merge pull request #5664 from hashicorp/f-http-hcl-region backfill region from hcl for jobUpdate and jobPlan	2019-06-13 12:25:01 -07:00
Jasmine Dahilig	51e141be7a	backfill region from job hcl in jobUpdate and jobPlan endpoints - updated region in job metadata that gets persisted to nomad datastore - fixed many unrelated unit tests that used an invalid region value (they previously passed because hcl wasn't getting picked up and the job would default to global region)	2019-06-13 08:03:16 -07:00
Mahmood Ali	e31159bf1f	Prepare for 0.9.4 dev cycle	2019-06-12 18:47:50 +00:00
Nomad Release bot	4803215109	Generate files for 0.9.3 release	2019-06-12 16:11:16 +00:00
Danielle	f923b568e0	Merge pull request #5821 from hashicorp/dani/b-5770 trhooks: Add TaskStopHook interface to services	2019-06-12 17:30:49 +02:00
Danielle Lancashire	c326344b57	trt: Fix test	2019-06-12 17:06:11 +02:00
Danielle Lancashire	13d76e35fd	trhooks: Add TaskStopHook interface to services We currently only run cleanup Service Hooks when a task is either Killed, or Exited. However, due to the implementation of a task runner, tasks are only Exited if they ever correctly started running, which is not true when you receive an error early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770	2019-06-12 16:00:21 +02:00
Mahmood Ali	2acf30fdd3	Fallback to `alloc.TaskResources` for old allocs When a client is running against an old server (e.g. running 0.8), `alloc.AllocatedResources` may be nil, and we need to check the deprecated `alloc.TaskResources` instead. Fixes https://github.com/hashicorp/nomad/issues/5810	2019-06-11 10:32:53 -04:00
Mahmood Ali	7a4900aaa4	client/allocrunner: depend on internal task state Alloc runner already tracks tasks associated with alloc. Here, we become defensive by relying on the alloc runner tracked tasks, rather than depend on server never updating the job unexpectedly.	2019-06-10 18:42:51 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	935ee86e92	Merge pull request #5737 from fwkz/fix-restart-attempts Fix restart attempts of `restart` stanza in `delay` mode.	2019-06-05 19:05:07 -04:00
Mahmood Ali	97957fbf75	Prepare for 0.9.3 dev cycle	2019-06-05 14:54:00 +00:00
Nomad Release bot	43bfbf3fcc	Generate files for 0.9.2 release	2019-06-05 11:59:27 +00:00
Mahmood Ali	a9f81f2daa	client config flag to disable remote exec This exposes a client flag to disable nomad remote exec support in environments where access to tasks ought to be restricted. I used `disable_remote_exec` client flag that defaults to allowing remote exec. Opted for a client config that can be used to disable remote exec globally, or to a subset of the cluster if necessary.	2019-06-03 15:31:39 -04:00
Mahmood Ali	a4ead8ff79	remove 0.9.2-rc1 generated code	2019-05-23 11:14:24 -04:00
Nomad Release bot	6d6bc59732	Generate files for 0.9.2-rc1 release	2019-05-22 19:29:30 +00:00
Michael Schurter	a54511b304	Merge pull request #5731 from hashicorp/b-ignore-dc client: drop unused DC field from servers list	2019-05-22 08:42:15 -07:00
Mahmood Ali	84419f08ce	client: synchronize client.invalidAllocs access invalidAllocs may be accessed and manipulated from different goroutines, so must be locked.	2019-05-22 09:37:49 -04:00
Danielle Lancashire	27583ed8c1	client: Pass servers contacted ch to allocrunner This fixes an issue where batch and service workloads would never be restarted due to indefinitely blocking on a nil channel. It also raises the restoration logging message to `Info` to simplify log analysis.	2019-05-22 13:47:35 +02:00
Mahmood Ali	9df1e00f35	tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal Given that Signal may be called multiple times, blocking for `SignalCh` isn't sufficient to synchronize access to Signals field.	2019-05-21 13:56:58 -04:00
Mahmood Ali	b06e585713	Merge pull request #5739 from hashicorp/r-rm-logmon-syslog-deadcode logmon: remove syslog server deadcode	2019-05-21 11:46:48 -04:00
Mahmood Ali	eca23bf9c4	Merge pull request #5742 from hashicorp/b-test-fixes-20190520 Grab bag of (primarily race) test fixes	2019-05-21 11:46:36 -04:00
Mahmood Ali	e88bb61488	Merge pull request #5740 from hashicorp/b-nomad-exec-term-race exec: allow drivers to handle stream termination	2019-05-21 11:24:12 -04:00
Mahmood Ali	b475ccbe3e	client: synchronize access to ar.alloc `allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's use `allocRunner.Alloc()` helper function to access it.	2019-05-21 09:55:05 -04:00
Mahmood Ali	2a7b073167	tests: fix fifo lib race Accidentally accessed outer `err` variable inside a goroutine	2019-05-21 09:49:56 -04:00
Mahmood Ali	296bd41c9e	tests: fix data race in client TestDriverManager_Fingerprint_Periodic	2019-05-21 09:49:56 -04:00
Mahmood Ali	d9e59eece0	tests: fix client TestFS_Stream data race Close is invoked in a different goroutine from test	2019-05-21 09:49:56 -04:00
Mahmood Ali	75e0a3f405	exec: allow drivers to handle stream termination Without this change, alloc_endpoint cancels the context passed to handler when we detect EOF. This races with the driver setting the exit code, and we run into a case where the exec process terminates cleanly yet we attempt to mark it as failed with context error. Here, we rely on the driver to handle errors returned from Stream, without racing to set an error.	2019-05-21 09:40:25 -04:00
Mahmood Ali	974bcbecc9	logmon: remove syslog server deadcode Remove unused syslog server related code that got replaced by the docker logger in Nomad 0.9	2019-05-21 09:36:43 -04:00
fwkz	8b84bec95a	Fix restart attempts of `restart` stanza. Number of restarts during 2nd interval is off by one.	2019-05-21 13:27:19 +02:00
Michael Schurter	d41abda957	client: drop unused DC field from servers list See #5730 for details.	2019-05-20 14:19:15 -07:00
Michael Schurter	2fe0768f3b	docs: changelog entry for #5669 and fix comment	2019-05-14 10:54:00 -07:00
Michael Schurter	af9096c8ba	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	e07f73bfe0	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	d7e5ace1ed	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	3b1f8991a1	client: log when server list changes Stop logging in the happy path when nothing has changed.	2019-05-13 15:42:55 -07:00
Michael Schurter	48db8135da	Merge pull request #5492 from hashicorp/f-allocated-mem client: expose allocated memory per task	2019-05-13 13:31:22 -07:00
Lang Martin	1d03a43ce2	Merge pull request #5642 from hashicorp/b-network-fingerprinting-ipv4 network fingerprinting multiple IPs on the configured network device	2019-05-13 11:46:53 -04:00
Michael Schurter	1c4e585fa7	client: expose allocated memory per task Related to #4280 This PR adds `client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge in bytes to metrics to ease calculating how close a task is to OOMing. ``` 'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000 'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000 'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000 'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000 'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000 ```	2019-05-10 11:12:12 -07:00
Lang Martin	f6bc45dd23	client improve a comment in updateNetworks	2019-05-10 11:25:04 -04:00
Mahmood Ali	919827f2df	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	ab2cae0625	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Preetha	1d02886bb6	Merge pull request #5654 from hashicorp/b-hearbeat-lockfix Remove unnecessary locking and serverlist syncing in heartbeats	2019-05-08 13:36:39 -05:00
Preetha Appan	3289e7f4a0	fix typo and add one more test scenario	2019-05-08 10:54:22 -05:00
Preetha Appan	db6b291a5a	code review feedback	2019-05-07 16:23:32 -05:00
Chris Baker	93ec1293be	stale allocation data leads to incorrect (and even negative) metrics (#5637 ) * client: was not using up-to-date client state in determining which alloc count towards allocated resources * Update client/client.go Co-Authored-By: cgbaker <cgbaker@hashicorp.com>	2019-05-07 15:54:36 -04:00
Preetha Appan	b063fc81a4	Remove unnecessary locking and serverlist syncing in heartbeats This removes an unnecessary shared lock between discovery and heartbeating which was causing heartbeats to be missed upon retries when a single server fails. Also made a drive by fix to call the periodic server shuffler goroutine.	2019-05-06 14:44:55 -05:00
Michael Schurter	8c7b3ff45a	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:01:30 -05:00
Michael Schurter	e19fa33f9c	Remove unnecessary boolean clause Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:00:17 -05:00
Preetha Appan	b99a204582	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Lang Martin	c32cce51f0	client fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Lang Martin	94f23016a2	client_test new test fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Mahmood Ali	7a32d3f3aa	client: handle 0.8 server network resources Fixes https://github.com/hashicorp/nomad/issues/5587 When a nomad 0.9 client is handling an alloc generated by a nomad 0.8 server, we should check the alloc.TaskResources for networking details rather than task.Resources. We check alloc.TaskResources for networking for other tasks in the task group [1], so it's a bit odd that we used the task.Resources struct here. TaskRunner also uses `alloc.TaskResources`[2]. The task.Resources struct in 0.8 was sparsely populated, resulting in 0 being stored in port mapping env vars: ``` vagrant@nomad-server-01:~$ nomad version Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES) vagrant@nomad-server-01:~$ nomad server members Name Address Port Status Leader Protocol Build Datacenter Region nomad-server-01.global 10.199.0.11 4648 alive true 2 0.8.7 dc1 global vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.Job.TaskGroups[0].Tasks[0].Resources.Networks' [ { "CIDR": "", "Device": "", "DynamicPorts": [ { "Label": "db", "Value": 0 } ], "IP": "", "MBits": 10, "ReservedPorts": null } ] vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.TaskResources' { "redis": { "CPU": 500, "DiskMB": 0, "IOPS": 0, "MemoryMB": 256, "Networks": [ { "CIDR": "", "Device": "eth1", "DynamicPorts": [ { "Label": "db", "Value": 21722 } ], "IP": "10.199.0.21", "MBits": 10, "ReservedPorts": null } ] } } ``` Also, updated the test values to mimic how Nomad 0.8 structs are represented, and made its result match the non compact values in `TestEnvironment_AsList`. [1] `24e9040b18/client/taskenv/env.go (L624-L639)` [2] https://github.com/hashicorp/nomad/blob/master/client/allocrunner/taskrunner/task_runner.go#L287-L303	2019-05-02 12:08:38 -04:00
Mahmood Ali	446f06721d	aux: helper method that returns token as well as ACL policy This helper returns the token as well as the ACL policy, to be used in a later commit for logging the token info associated with nomad exec invocation.	2019-04-30 10:23:56 -04:00
Lang Martin	371014b781	Merge pull request #5553 from hashicorp/b-fingerprinter-manual-config client fingerprinter doesn't overwrite manual configuration	2019-04-26 12:55:34 -04:00
Danielle	79515496cb	Merge pull request #5515 from hashicorp/dani/f-alloc-signal allocs: Add nomad alloc signal command	2019-04-26 14:21:05 +02:00
Danielle Lancashire	a8880f9643	alloc_signal: Add autocompletion and cmd tests	2019-04-26 12:47:53 +02:00
Mahmood Ali	bf0a09e270	retry grpc unavailable errors even if not shutting down	2019-04-25 18:39:17 -04:00
Mahmood Ali	81841e8528	try checking process status	2019-04-25 18:16:13 -04:00
Mahmood Ali	fc78521f29	add logging about attempts	2019-04-25 18:09:36 -04:00
Mahmood Ali	e6ca8641a8	try sleeping for stop signal to take effect	2019-04-25 17:16:29 -04:00
Mahmood Ali	ff3a095015	add a test that simulates logmon dying during Start() call	2019-04-25 16:41:17 -04:00
Mahmood Ali	bbac73883c	logmon: retry starting logmon if it exits Retry if we detect shutting down during Start() api call is started, locally.	2019-04-25 15:10:16 -04:00
Mahmood Ali	b51f00a7f3	logmon client to handle grpc closing errors	2019-04-25 14:32:24 -04:00
Danielle Lancashire	3409e0be89	allocs: Add nomad alloc signal command This command will be used to send a signal to either a single task within an allocation, or all of the tasks if <task-name> is omitted. If the sent signal terminates the allocation, it will be treated as if the allocation has crashed, rather than as if it was operator-terminated. Signal validation is currently handled by the driver itself and nomad does not attempt to restrict or validate them.	2019-04-25 12:43:32 +02:00
Chris Baker	91c4e1eabb	Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics client/metrics: fixed stale metrics	2019-04-22 15:07:30 -04:00
Mahmood Ali	f515b93b5e	Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable logging: Attempt to recover logmon failures	2019-04-22 14:40:24 -04:00
Michael Schurter	61f17a1043	tweak logging level for failed log line Co-Authored-By: notnoop <mahmood@notnoop.com>	2019-04-22 14:40:17 -04:00
Chris Baker	0b1a4dd206	client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy	2019-04-22 18:31:45 +00:00
Lang Martin	eba4e29440	client fingerprinter doesn't overwrite manual configuration Revert "Revert accidental merge of pr #5482" This reverts commit c45652ab8c113487b9d4fbfb107782cbcf8a85b0.	2019-04-19 15:23:48 -04:00
Michael Schurter	26f3bdbf8f	Merge pull request #5583 from ygersie/fingerprint_nilpointer fix nil pointer in fingerprinting AWS env leading to crash	2019-04-19 08:08:59 -07:00
Mahmood Ali	902eed4bf9	clarify cryptic log line	2019-04-19 09:31:43 -04:00
Mahmood Ali	f74d60439f	client: log detected driver health state Noticed that `detected drivers` log line was misleading - when a driver doesn't fingerprint before timeout, their health status is empty string `""` which we would mark as detected. Now, we log all drivers along with their state to ease driver fingerprint debugging.	2019-04-19 09:15:25 -04:00
Mahmood Ali	6bdc9860b7	client: avoid registering node twice right away I noticed that `watchNodeUpdates()` almost immediately after `registerAndHeartbeat()` calls `retryRegisterNode()`, well after 5 seconds. This call is unnecessary and made debugging a bit harder. So here, we ensure that we only re-register node for new node events, not for initial registration.	2019-04-19 09:12:50 -04:00
Mahmood Ali	f82ea8824f	client: wait for batched driver updated Here we retain 0.8.7 behavior of waiting for driver fingerprints before registering a node, with some timeout. This is needed for system jobs, as system job scheduling for a node occurs at node registration, and the race might mean that a system job may not get placed on the node because of missing drivers. The timeout isn't strictly necessary, but raising it to 1 minute as it's closer to indefinitely blocked than 1 second. We need to keep the value high enough to capture as many drivers/devices as possible, but low enough that it doesn't risk blocking too long due to a misbehaving plugin. Fixes https://github.com/hashicorp/nomad/issues/5579	2019-04-19 09:00:24 -04:00
Yorick Gersie	95f81f3eeb	fix nil pointer in fingerprinting AWS env leading to crash HTTP Client returns a nil response if an error has occurred. We first need to check for an error before being able to check the HTTP response code.	2019-04-19 11:07:13 +02:00
Danielle Lancashire	c31966fc71	logging: Attempt to recover logmon failures Currently, when logmon fails to reattach, we will retry reattachment to the same pid until the task restart specification is exhausted. Because we cannot clear hook state during error conditions, it is not possible for us to signal to a future restart that it _shouldn't_ attempt to reattach to the plugin. Here we revert to explicitly detecting reattachment separately from a launch of a new logmon, so we can recover from scenarios where a logmon plugin has failed. This is a net improvement over the current hard failure situation, as it means in the most common case (the pid has gone away), we can recover. Other reattachment failure modes where the plugin may still be running could potentially cause a duplicate process, or a subsequent failure to launch a new plugin. If there was a duplicate process, it could potentially cause duplicate logging. This is better than a production workload outage. If there was a subsequent failure to launch a new plugin, it would fail in the same way (retry until restarts are exhausted) as the current failure mode.	2019-04-18 13:41:56 +02:00
Michael Schurter	a85e7b7cc9	vault: fix data races	2019-04-16 11:22:44 -07:00
Michael Schurter	0aeb3dbd86	vault: fix renewal time Renewal time was being calculated as 10s+Intn(lease-10s), so the renewal time could be very rapid or within 1s of the deadline: [10s, lease) This commit fixes the renewal time by calculating it as: (lease/2) +/- 10s For a lease of 60s this means the renewal will occur in [20s, 40s).	2019-04-16 11:22:44 -07:00
Michael Schurter	f7a7acc345	Merge pull request #5518 from hashicorp/f-simplify-kill client: simplify kill logic	2019-04-15 14:11:58 -07:00
Chris Baker	6848591914	vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation	2019-04-12 15:06:34 +00:00
Lang Martin	a2a1e7829d	Revert accidental merge of pr #5482 Revert "fingerprint Constraints and Affinities have Equals, as set" This reverts commit 596f16fb5f1a4a6766a57b3311af806d22382609. Revert "client tests assert the independent handling of interface and speed" This reverts commit 7857ac5993a578474d0570819f99b7b6e027de40. Revert "structs missed applying a style change from the review" This reverts commit 658916e3274efa438beadc2535f47109d0c2f0f2. Revert "client, structs comments" This reverts commit be2838d6baa9d382a5013fa80ea016856f28ade2. Revert "client fingerprint updateNetworks preserves the network configuration" This reverts commit fc309cb430e62d8e66267a724f006ae9abe1c63c. Revert "client_test cleanup comments from review" This reverts commit bc0bf4efb9114e699bc662f50c8f12319b6b3445. Revert "client Networks Equals is set equality" This reverts commit f8d432345b54b1953a4a4c719b9269f845e3e573. Revert "struct cleanup indentation in RequestedDevice Equals" This reverts commit f4746411cab328215def6508955b160a53452da3. Revert "struct Equals checks for identity before value checking" This reverts commit 0767a4665ed30ab8d9586a59a74db75d51fd9226. Revert "fix client-test, avoid hardwired platform dependecy on lo0" This reverts commit e89dbb2ab182b6368507dbcd33c3342223eb0ae7. Revert "refactor error in client fingerprint to include the offending data" This reverts commit a7fed726c6e0264d42a58410d840adde780a30f5. Revert "add client updateNodeResources to merge but preserve manual config" This reverts commit 84bd433c7e1d030193e054ec23474380ff3b9032. Revert "refactor struts.RequestedDevice to have its own Equals" This reverts commit 689782524090e51183474516715aa2f34908b8e6. Revert "refactor structs.Resource.Networks to have its own Equals" This reverts commit 49e2e6c77bb3eaa4577772b36c62205061c92fa1. Revert "refactor structs.Resource.Devices to have its own Equals" This reverts commit 4ede9226bb971ae42cc203560ed0029897aec2c9. 
Revert "add COMPAT(0.10): Remove in 0.10 notes to impl for structs.Resources" This reverts commit 49fbaace5298d5ccf031eb7ebec93906e1d468b5. Revert "add structs.Resources Equals" This reverts commit 8528a2a2a6450e4462a1d02741571b5efcb45f0b. Revert "test that fingerprint resources are updated, net not clobbered" This reverts commit 8ee02ddd23bafc87b9fce52b60c6026335bb722d.	2019-04-11 10:29:40 -04:00
Lang Martin	5d3596eb7e	client tests assert the independent handling of interface and speed	2019-04-11 09:56:22 -04:00
Lang Martin	7258a13c72	client, structs comments	2019-04-11 09:56:22 -04:00
Lang Martin	22d87e4538	client fingerprint updateNetworks preserves the network configuration	2019-04-11 09:56:22 -04:00
Lang Martin	8fe9699e51	client_test cleanup comments from review	2019-04-11 09:56:22 -04:00
Lang Martin	63c993c8ae	fix client-test, avoid hardwired platform dependecy on lo0	2019-04-11 09:56:22 -04:00
Lang Martin	a9db848974	refactor error in client fingerprint to include the offending data	2019-04-11 09:56:22 -04:00
Lang Martin	f211500cea	add client updateNodeResources to merge but preserve manual config	2019-04-11 09:56:22 -04:00
Lang Martin	a4b59130d2	test that fingerprint resources are updated, net not clobbered	2019-04-11 09:56:21 -04:00
Danielle Lancashire	e135876493	allocs: Add nomad alloc restart This adds a `nomad alloc restart` command and api that allows a job operator with the alloc-lifecycle acl to perform an in-place restart of a Nomad allocation, or a given subtask.	2019-04-11 14:25:49 +02:00
Chris Baker	829a972693	vault client test: minor formatting vendor: using upstream circonus-gometrics	2019-04-10 10:34:10 -05:00
Chris Baker	c0a7aee610	vault e2e: pass vault version into setup instead of having to infer it from test name	2019-04-10 10:34:10 -05:00
Chris Baker	f0c184fc29	taskrunner: removed some unnecessary config from a test	2019-04-10 10:34:10 -05:00
Chris Baker	a26d4fe1e5	docs: -vault-namespace, VAULT_NAMESPACE, and config agent: added VAULT_NAMESPACE env-based configuration	2019-04-10 10:34:10 -05:00
Chris Baker	170f5239c8	client: gofmt	2019-04-10 10:34:10 -05:00
Chris Baker	a1d7971b2e	taskrunner: pass configured Vault namespace into TaskTemplateConfig	2019-04-10 10:34:10 -05:00
Chris Baker	0eaeef872f	config/docs: added `namespace` to vault config server/client: process `namespace` config, setting on the instantiated vault client	2019-04-10 10:34:10 -05:00
Michael Schurter	45b4827ad7	Bump to 0.9.1-dev	2019-04-09 09:01:48 -07:00
Nomad Release bot	e307734e4a	Generate files for 0.9.0 release	2019-04-09 01:56:00 +00:00
Michael Schurter	f7d4428855	client: simplify kill logic Remove runLaunched tracking as Run is always called for killable TaskRunners. TaskRunners which fail before Run can be called (during NewTaskRunner or Restore) are not killable as they're never added to the client's alloc map.	2019-04-04 15:18:33 -07:00
Michael Schurter	3af602b633	Remove 0.9.0-rc2 generated files	2019-04-03 07:41:09 -07:00
Nomad Release bot	16b4336ccf	Generate files for 0.9.0-rc2 release	2019-04-03 01:54:29 +00:00
Michael Schurter	923cd91850	Merge pull request #5504 from hashicorp/b-exec-path executor/linux: make chroot binary paths absolute	2019-04-02 14:09:50 -07:00
Michael Schurter	1d569a27dc	Revert "executor/linux: add defensive checks to binary path" This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.	2019-04-02 11:17:12 -07:00
Michael Schurter	fc5487dbbc	executor/linux: add defensive checks to binary path	2019-04-02 09:40:53 -07:00
Michael Schurter	7d49bc4c71	executor/linux: make chroot binary paths absolute Avoid libcontainer.Process trying to lookup the binary via $PATH as the executor has already found where the binary is located.	2019-04-01 15:45:31 -07:00
Mahmood Ali	81f4f07ed7	rename fifo methods for clarity	2019-04-01 16:52:58 -04:00
Mahmood Ali	e87afe465b	clarify closeDone blocking and field name	2019-04-01 16:10:34 -04:00
Mahmood Ali	9d647713c0	no requires in a test goroutine	2019-04-01 15:38:39 -04:00
Mahmood Ali	2b1f858e1b	log when fifo fails to open	2019-04-01 13:18:03 -04:00
Mahmood Ali	967452a3f0	fifo: Use plain fifo file in Unix This PR switches to using plain fifo files instead of golang structs managed by containerd/fifo library. The library main benefit is management of opening fifo files. In Linux, a reader `open()` request would block until a writer opens the file (and vice-versa). The library uses goroutines so that it's the first IO operation that blocks. This benefit isn't really useful for us: Given that logmon simply streams output in a separate process, blocking of opening or first read is effectively the same. The library additionally makes further complications for managing state and tracking read/write permission that seems overhead for our use, compared to using a file directly. Looking here, I made the following incidental changes: * document that we do handle if fifo files are already created, as we rely on that behavior for logmon restarts * use type system to lock read vs write: currently, fifo library returns `io.ReadWriteCloser` even if fifo is opened for writing only!	2019-04-01 13:18:03 -04:00
Michael Schurter	a4572919cd	Merge pull request #5456 from hashicorp/test-taskenv tests: port pre-0.9 task env tests	2019-03-25 10:41:38 -07:00
Michael Schurter	8efad12538	tests: port pre-0.9 task env tests I chose to make them more of integration tests since there's a lot more plumbing involved. The internal implementation details of how we craft task envs can now change and these tests will still properly assert the task runtime environment is setup properly.	2019-03-25 09:46:53 -07:00
Michael Schurter	9afbc45cff	Bump to dev post-0.9.0-rc1 release	2019-03-22 08:26:30 -07:00
Nomad Release bot	3ab3dd4105	Generate files for 0.9.0-rc1 release	2019-03-21 19:06:13 +00:00
Mahmood Ali	b08a2744f8	Merge pull request #5428 from hashicorp/b-dropped-logs-on-task-restart client/logmon: restart log collection correctly when a task is restarted	2019-03-21 14:02:08 -04:00

4092 Commits