open-nomad

Author	SHA1	Message	Date
Drew Bailey	ae145c9a37	allow only positive shutdown delay more explicit test case, remove select statement	2019-12-16 11:38:30 -05:00
Drew Bailey	24929776a2	shutdown delay for task groups copy struct values ensure groupserviceHook implements RunnerPreKillhook run deregister first test that shutdown times are delayed move magic number into variable	2019-12-16 11:38:16 -05:00
Danielle	b006be623d	Update client/fingerprint/env_aws.go Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2019-12-16 14:48:52 +01:00
Danielle Lancashire	5a87b3ab4b	env_aws: Disable Retries and set Session cfg Previously, Nomad used hand rolled HTTP requests to interact with the EC2 metadata API. Recently however, we switched to using the AWS SDK for this fingerprinting. The default behaviour of the AWS SDK is to perform retries with exponential backoff when a request fails. This is problematic for Nomad, because interacting with the EC2 API is in our client start path. Here we revert to our pre-existing behaviour of not performing retries in the fast path, as if the metadata service is unavailable, it's likely that nomad is not running in AWS.	2019-12-16 10:56:32 +01:00
Mahmood Ali	4a1cc67f58	Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob driver: allow disabling log collection	2019-12-13 11:41:20 -05:00
Mahmood Ali	a7361612b6	Merge pull request #6556 from hashicorp/c-vendor-multierror-20191025 Update go-multierror library	2019-12-13 11:32:42 -05:00
Mahmood Ali	46bc3b57e6	address review comments	2019-12-13 11:21:00 -05:00
Mahmood Ali	b3a1e571e5	tests: fix error format assertion multierror library changed formatting slightly.	2019-12-13 11:01:20 -05:00
Chris Dickson	4d8ba272d1	client: expose allocated CPU per task (#6784)	2019-12-09 15:40:22 -05:00
Seth Hoenig	f0c3dca49c	tests: swap lib/freeport for tweaked helper/freeport Copy the updated version of freeport (sdk/freeport), and tweak it for use in Nomad tests. This means staying below port 10000 to avoid conflicts with the lib/freeport that is still transitively used by the old version of consul that we vendor. Also provide implementations to find ephemeral ports of macOS and Windows environments. Ports acquired through freeport are supposed to be returned to freeport, which this change now also introduces. Many tests are modified to include calls to a cleanup function for Server objects. This should help quite a bit with some flakey tests, but not all of them. Our port problems will not go away completely until we upgrade our vendor version of consul. With Go modules, we'll probably do a 'replace' to swap out other copies of freeport with the one now in 'nomad/helper/freeport'.	2019-12-09 08:37:32 -06:00
Mahmood Ali	0b7085ba3a	driver: allow disabling log collection Operators commonly have docker logs aggregated using various tools and don't need nomad to manage their docker logs. Worse, Nomad uses a somewhat heavy docker api call to collect them and it seems to cause problems when a client runs hundreds of log collections. Here we add a knob to disable log aggregation completely for nomad. When log collection is disabled, we avoid running logmon and docker_logger for the docker tasks in this implementation. The downside here is once disabled, `nomad logs ...` commands and API no longer return logs and operators must correlate alloc-ids with their aggregated log info. This is meant as a stop gap measure. Ideally, we'd follow up with at least two changes: First, we should optimize behavior when we can such that operators don't need to disable docker log collection. Potentially by reverting to using pre-0.9 syslog aggregation in linux environments, though with different trade-offs. Second, when/if logs are disabled, nomad logs endpoints should lookup docker logs api on demand. This ensures that the cost of log collection is paid sparingly.	2019-12-08 14:15:03 -05:00
Mahmood Ali	ded2a725db	Merge pull request #6788 from hashicorp/b-timeout-logmon-stop logmon: add timeout to RPC operations	2019-12-06 19:12:06 -05:00
Danielle Lancashire	d2075ebae9	spellcheck: Fix spelling of retrieve	2019-12-05 18:59:47 -06:00
Mahmood Ali	b2ae27863e	Merge pull request #6779 from hashicorp/r-aws-fingerprint-via-library Use AWS SDK to access EC2 Metadata	2019-12-02 13:30:51 -05:00
Mahmood Ali	83089feff5	logmon: add timeout to RPC operations Add an RPC timeout for logmon. In https://github.com/hashicorp/nomad/issues/6461#issuecomment-559747758 , `logmonClient.Stop` locked up and indefinitely blocked the task runner destroy operation. This is an incremental improvement. We still need to follow up to understand how we got to that state, and the full impact of locked-up Stop and its link to pending allocations on restart.	2019-12-02 10:33:05 -05:00
Mahmood Ali	293276a457	fingerprint code refactor Some code cleanup: * Use a field for setting EC2 metadata instead of env-vars in testing; but keep environment variables for backward compatibility reasons * Update tests to use testify	2019-11-26 10:51:28 -05:00
Mahmood Ali	1e48f8e20d	fingerprint: avoid api query if config overrides it	2019-11-26 10:51:28 -05:00
Mahmood Ali	5bb9089431	fingerprint: use ec2metadata package	2019-11-26 10:51:27 -05:00
Lars Lehtonen	0d344e8578	client: fix use of T.Fatal inside TestFS_logsImpl_NoFollow() goroutine.	2019-11-25 23:51:28 -08:00
Mahmood Ali	e89108fb01	fixup! tests: don't assume eth0 network is available	2019-11-21 08:28:20 -05:00
Mahmood Ali	443804b5c7	tests: don't assume eth0 network is available TestClient_UpdateNodeFromFingerprintKeepsConfig checks a test node network interface, which is hardcoded to `eth0` and is updated asynchronously. This causes flakiness when eth0 isn't available. Here, we hardcode the value to an arbitrary network interface.	2019-11-20 20:37:30 -05:00
Mahmood Ali	ed3f1957e7	tests: run TestClient_WatchAllocs in non-linux environments	2019-11-20 20:37:29 -05:00
Mahmood Ali	521f51a929	tests: fix TestClient_RestoreError When spinning a second client, ensure that it uses new driver instances, rather than reuse the already shutdown unhealthy drivers from first instance. This speeds up tests significantly, by cutting ~50 seconds or so, the timeout in NewClient until drivers fingerprint. They never do because drivers were shutdown already.	2019-11-20 20:37:28 -05:00
Mahmood Ali	4efb71cf0c	tests: remove TestClient_RestoreError test TestClient_RestoreError is very slow, taking ~81 seconds. It has a few problematic patterns. It's unclear what it tests, it simulates a failure condition where all state db lookups fail and asserts that alloc fails. Though starting from https://github.com/hashicorp/nomad/pull/6216 , we don't fail allocs in that condition but rather restart them. Also, the drivers used in second client `c2` are the same singleton instances used in `c1` and already shutdown. We ought to start healthy new driver instances.	2019-11-20 20:37:27 -05:00
Preetha	be4a51d5b8	Merge pull request #6349 from hashicorp/b-host-stats client: Return empty values when host stats fail	2019-11-20 10:13:02 -06:00
Lang Martin	aa985ebe21	getter: allow the gcs download scheme (#6692)	2019-11-19 09:10:56 -05:00
Nick Ethier	bd454a4c6f	client: improve group service stanza interpolation and check_re… (#6586 ) * client: improve group service stanza interpolation and check_restart support Interpolation can now be done on group service stanzas. Note that some task runtime specific information that was previously available when the service was registered poststart of a task is no longer available. The check_restart stanza for checks defined on group services will now properly restart the allocation upon check failures if configured.	2019-11-18 13:04:01 -05:00
Drew Bailey	b644e1f47d	add server-id to monitor specific server	2019-11-14 09:53:41 -05:00
Drew Bailey	f4a7e3dc75	coordinate closing of doneCh, use interface to simplify callers comments	2019-11-05 11:44:26 -05:00
Drew Bailey	fe542680dc	log-json -> json fix typo command/agent/monitor/monitor.go Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com> Update command/agent/monitor/monitor.go Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com> address feedback, lock to prevent send on closed channel fix lock/unlock for dropped messages	2019-11-05 09:51:59 -05:00
Drew Bailey	ddfa20b993	address feedback, fix gauge metric name	2019-11-05 09:51:57 -05:00
Drew Bailey	298b8358a9	move forwarded monitor request into helper	2019-11-05 09:51:56 -05:00
Drew Bailey	318b6c91bf	monitor command takes no args rm extra new line fix lint errors return after close fix, simplify test	2019-11-05 09:51:55 -05:00
Drew Bailey	c7b633b6c1	lock in sub select rm redundant lock wip to use framing wip switch to stream frames	2019-11-05 09:51:54 -05:00
Drew Bailey	17d876d5ef	rename function, initialize log level better underscores instead of dashes for query params	2019-11-05 09:51:53 -05:00
Drew Bailey	8178beecf0	address feedback, use agent_endpoint instead of monitor	2019-11-05 09:51:53 -05:00
Drew Bailey	db65b1f4a5	agent:read acl policy for monitor	2019-11-05 09:51:52 -05:00
Drew Bailey	2533617888	rpc acl tests for both monitor endpoints	2019-11-05 09:51:51 -05:00
Drew Bailey	a45ae1cd58	enable json formatting, use queryoptions	2019-11-05 09:51:49 -05:00
Drew Bailey	786989dbe3	New monitor pkg for shared monitor functionality Adds new package that can be used by client and server RPC endpoints to facilitate monitoring based off of a logger clean up old code small comment about write rm old comment about minsize rename to Monitor Removes connection logic from monitor command Keep connection logic in endpoints, use a channel to send results from monitoring use new multisink logger and interfaces small test for dropped messages update go-hclogger and update sink/intercept logger interfaces	2019-11-05 09:51:49 -05:00
Drew Bailey	e076204820	get local rpc endpoint working	2019-11-05 09:51:48 -05:00
Drew Bailey	976c43157c	remove log_writer prefix output with proper spacing update gzip handler, adjust first byte flow to allow gzip handler bypass wip, first stab at wiring up rpc endpoint	2019-11-05 09:51:48 -05:00
Michael Schurter	9fed8d1bed	client: fix panic from 0.8 -> 0.10 upgrade makeAllocTaskServices did not do a nil check on AllocatedResources which causes a panic when upgrading directly from 0.8 to 0.10. While skipping 0.9 is not supported we intend to fix serious crashers caused by such upgrades to prevent cluster outages. I did a quick audit of the client package and everywhere else that accesses AllocatedResources appears to be properly guarded by a nil check.	2019-11-01 07:47:03 -07:00
Lars Lehtonen	4ed9427c77	client/allocwatcher: fix dropped test error (#6592)	2019-10-31 08:29:25 -04:00
Michael Schurter	eba4d4cd6f	vault: remove dead lease code	2019-10-25 15:08:35 -07:00
Michael Schurter	39437a5c5b	Merge branch 'master' into release-0100	2019-10-22 08:17:57 -07:00
Michael Schurter	b6bb561854	cleanup post 0.10.0 release	2019-10-22 07:48:09 -07:00
Nomad Release bot	3e6c9dd40e	Generate files for 0.10.0 release	2019-10-22 12:34:56 +00:00
Mahmood Ali	262dcb0842	Revert "lint: ignore generated windows syscall wrappers" This reverts commit 482862e6ab0f8db748367bb1eefc2efd11fbe11a.	2019-10-22 08:23:44 -04:00
Michael Schurter	460bd63db0	client: expose group network ports in env vars Fixes #6375 Intentionally omitted IPs prior to 0.10.0 release to minimize changes and risk.	2019-10-21 13:28:35 -07:00
Michael Schurter	8634533e82	client: expose group network ports in env vars Fixes #6375 Intentionally omitted IPs prior to 0.10.0 release to minimize changes and risk.	2019-10-21 12:31:13 -07:00
Michael Schurter	bb82f365ff	connect: upgrade to envoy 1.11.2 and add sha Append the Docker image sha to the Envoy image to ensure users default to using the version that Nomad was tested against.	2019-10-18 10:16:58 -07:00
Michael Schurter	ee5ea3ecc7	connect: upgrade to envoy 1.11.2 and add sha Append the Docker image sha to the Envoy image to ensure users default to using the version that Nomad was tested against.	2019-10-18 07:46:53 -07:00
Mahmood Ali	4e4a9b252c	Merge pull request #6290 from hashicorp/r-generated-code-refactor dev: avoid codecgen code in downstream projects	2019-10-15 08:22:31 -04:00
Michael Schurter	2992cb80b0	Remove 0.10.0-rc1 generated files	2019-10-10 13:31:42 -07:00
Nomad Release bot	3007f1662e	Generate files for 0.10.0-rc1 release	2019-10-10 19:08:23 +00:00
Michael Schurter	f54f1cb321	Revert "Revert "Use joint context to cancel prestart hooks""	2019-10-08 11:34:09 -07:00
Michael Schurter	81a30ae106	Revert "Use joint context to cancel prestart hooks"	2019-10-08 11:27:08 -07:00
Mahmood Ali	4b2ba62e35	acl: check ACL against object namespace Fix a bug where a malicious user can access or manipulate an alloc in a namespace they don't have access to. The allocation endpoints perform ACL checks against the request namespace, not the allocation namespace, and perform the allocation lookup independently from namespaces. Here, we check that the requester can access the alloc namespace regardless of the declared request namespace. Ideally, we'd enforce that the declared request namespace matches the actual allocation namespace. Unfortunately, we haven't documented alloc endpoints as namespaced functions; we suspect starting to enforce this will be very disruptive and inappropriate for a nomad point release. As such, we maintain current behavior that doesn't require passing the proper namespace in request. A future major release may start enforcing checking declared namespace.	2019-10-08 12:59:22 -04:00
Drew Bailey	69eebcd241	simplify logic to check for vault read event defer shutdown to cleanup after failed run Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> update comment to include ctx note for shutdown	2019-09-30 11:02:14 -07:00
Drew Bailey	7565b8a8d9	Use joint context to cancel prestart hooks fixes https://github.com/hashicorp/nomad/issues/6382 The prestart hook for templates blocks while it resolves vault secrets. If the secret is not found it continues to retry. If a task is shutdown during this time, the prestart hook currently does not receive shutdownCtxCancel, causing it to hang. This PR joins the two contexts so either killCtx or shutdownCtx cancel and stop the task.	2019-09-30 10:48:01 -07:00
Mahmood Ali	e29ee4c400	nomad: defensive check for namespaces in job registration call In a job registration request, ensure that the request namespace "header" and job namespace field match. This should be the case already in prod, as http handlers ensures that the values match [1]. This mitigates bugs that exploit bugs where we may check a value but act on another, resulting into bypassing ACL system. [1] https://github.com/hashicorp/nomad/blob/v0.9.5/command/agent/job_endpoint.go#L415-L418	2019-09-26 17:02:47 -04:00
Tim Gross	a6aadb3714	connect: remove proxy socket for restarted client	2019-09-25 14:58:17 -04:00
Tim Gross	e43d33aa50	client: don't run alloc postrun during shutdown	2019-09-25 14:58:17 -04:00
Tim Gross	d965a15490	driver/networking: don't recreate existing network namespaces	2019-09-25 14:58:17 -04:00
Jasmine Dahilig	0780adfa7f	timeout after 5 seconds when client opens a data directory (#6348)	2019-09-24 16:28:21 -07:00
Danielle Lancashire	c9bcb7e76a	client_stats: Always emit client stats	2019-09-19 01:22:08 +02:00
Danielle Lancashire	4f2343e1c0	client: Return empty values when host stats fail Currently, there is an issue when running on Windows whereby under some circumstances the Windows stats APIs will begin to return errors (such as internal timeouts) when a client is under high load, and potentially other forms of resource contention / system states (and other unknown cases). When an error occurs during this collection, we then short circuit further metrics emission from the client until the next interval. This can be problematic if it happens for a sustained number of intervals, as our metrics aggregator will begin to age out older metrics, and we will eventually stop emitting various types of metrics including `nomad.client.unallocated.*` metrics. However, when metrics collection fails on Linux, gopsutil will in many cases (e.g. cpu.Times) silently return 0 values, rather than an error. Here, we switch to returning empty metrics in these failures, and logging the error at the source. This brings the behaviour into line with Linux/Unix platforms, and although making aggregation a little sadder on intermittent failures, will result in more desirable overall behaviour of keeping metrics available for further investigation if things look unusual.	2019-09-19 01:22:07 +02:00
Danielle	f8b64ee1ab	Merge pull request #6330 from hashicorp/f-host-vols-fail-startup client: Fail startup if host volumes do not exist	2019-09-17 10:55:30 -07:00
Danielle	ec3ecdecfc	Merge pull request #6321 from hashicorp/dani/remove-config Hoist Volume.Config.Source into Volume.Source	2019-09-16 10:12:58 -07:00
Danielle Lancashire	e3796e9d60	client: Fail startup if host volumes do not exist Some drivers will automatically create directories when trying to mount a path into a container, but some will not. To unify this behaviour, this commit requires that host volumes already exist, and can be stat'd by the Nomad Agent during client startup.	2019-09-13 23:28:10 +02:00
Ruslan Usifov	b3c72d1729	close file handle when FileRotator object is closed. Fixes https://github.com/hashicorp/nomad/issues/6309 (#6323)	2019-09-13 10:31:13 -04:00
Tim Gross	a6ef8c5d42	client/networking: wrap error message from CNI plugin (#6316)	2019-09-13 08:20:05 -04:00
Danielle Lancashire	78b61de45f	config: Hoist volume.config.source into volume Currently, using a Volume in a job uses the following configuration: ``` volume "alias-name" { type = "volume-type" read_only = true config { source = "host_volume_name" } } ``` This commit migrates to the following: ``` volume "alias-name" { type = "volume-type" source = "host_volume_name" read_only = true } ``` The original design was chosen because we were uncertain about the future of storage plugins, and to allow maximum flexibility. However, this causes a few issues, namely: - We frequently need to parse this configuration during submission, scheduling, and mounting - It complicates the configuration from an end user's perspective - It complicates the ability to do validation As we understand the problem space of CSI a little more, it has become clear that we won't need the `source` to be in config, as it will be used in the majority of cases: - Host Volumes: Always need a source - Preallocated CSI Volumes: Always needs a source from a volume or claim name - Dynamic Persistent CSI Volumes: Always needs a source to attach the volumes to for managing upgrades and to avoid dangling. - Dynamic Ephemeral CSI Volumes: Less thought out, but `source` will probably point to the plugin name, and a `config` block will allow you to pass meta to the plugin. Or will point to a pre-configured ephemeral config. *If implemented The new design simplifies this by merging the source into the volume stanza to solve the above issues with usability, performance, and error handling.	2019-09-13 04:37:59 +02:00
Mahmood Ali	78f62d3670	Merge pull request #6080 from lchayoun/bug-6079 Allow dash in environment variable names	2019-09-11 11:17:24 -07:00
Mahmood Ali	b6bf7f9a6c	Merge pull request #6260 from hashicorp/c-circleci-tweak-20190903 ci: ignore nested pkgs in GOTEST_PKGS_EXCLUDE	2019-09-11 11:17:10 -07:00
Tim Gross	b05cd4c430	test: expand symlink for temp dir for macOS compatibility (#6303 ) On macOS, `os.TempDir` returns a symlinked path under `/var` which is outside of the directories shared into the VM used for Docker, and that fails tests using Docker that need that mount. If we expand the symlink to get the real path in `/private`, we're in the shared folders and can safely mount them.	2019-09-10 12:20:09 -04:00
Mahmood Ali	4b8280e51d	remove generated code	2019-09-06 19:24:15 +00:00
Nomad Release bot	dc7d728a82	Generate files for 0.10.0-beta1 release	2019-09-06 18:47:09 +00:00
Tim Gross	3fa4bca4a0	script checks: Update needs to update Alloc as well (#6291)	2019-09-06 11:18:00 -04:00
Mahmood Ali	01f42053e4	dev: avoid codecgen code in downstream projects This is an attempt to ease dependency management for external driver plugins, by avoiding requiring them to compile ugorji/go generated files. Plugin developers reported some pain with the brittleness of ugorji/go dependency in particular, especially when using go mod, the default go mod manager in golang 1.13. Context -------- Nomad uses msgpack to persist and serialize internal structs, using ugorji/go library. As an optimization, we use ugorji/go code generation to speed up the process and avoid the reflection-based slow path. We commit these generated files in repository when we cut and tag the release to ease reproducibility and debugging old releases. Thus, downstream projects that depend on release tag, indirectly depend on ugorji/go generated code. Sadly, the generated code is brittle and specific to the version of ugorji/go being used. When go mod picks another version of ugorji/go than nomad (go mod by default uses release according to semver), downstream projects face compilation errors. Interestingly, downstream projects don't commonly serialize nomad internal structs. Drivers and device plugins use grpc instead of msgpack for the most part. In the few cases where they use msgpack (e.g. decoding task config), they do without codegen path as they run on driver specific structs not the nomad internal structs. Also, the ugorji/go serialization through reflection is generally backward compatible (mod some ugorji/go regression bugs that get introduced every now and then :( ). Proposal --------- The proposal here is to keep committing ugorji/go codec generated files for releases but to use a go tag for them. All nomad development through the makefile, including releasing, CI and dev flow, has the tag enabled. Downstream plugin projects, by default, will skip these files and life proceeds as normal for them. The downside is that nomad developers who use generated code but avoid using make must start passing additional go tag argument. Though this is not a blessed configuration.	2019-09-06 09:22:00 -04:00
Tim Gross	8ce201854a	client: recreate script checks on Update (#6265 ) Splitting the immutable and mutable components of the scriptCheck led to a bug where the environment interpolation wasn't being incorporated into the check's ID, which caused the UpdateTTL to update for a check ID that Consul didn't have (because our Consul client creates the ID from the structs.ServiceCheck each time we update). Task group services don't have access to a task environment at creation, so their checks get registered before the check can be interpolated. Use the original check ID so they can be updated.	2019-09-05 11:43:23 -04:00
Michael Schurter	ee06c36345	Merge pull request #6254 from hashicorp/test-connect-e2e-demo e2e: test demo job for connect	2019-09-04 14:33:26 -07:00
Nick Ethier	e440ba80f1	ar: refactor network bridge config to use go-cni lib (#6255 ) * ar: refactor network bridge config to use go-cni lib * ar: use eth as the iface prefix for bridged network namespaces * vendor: update containerd/go-cni package * ar: update network hook to use TODO contexts when calling configurator * unnecessary conversion	2019-09-04 16:33:25 -04:00
Michael Schurter	93b47f4ddc	client: reword error message	2019-09-04 12:40:09 -07:00
Mahmood Ali	b76de72943	Merge pull request #6251 from hashicorp/b-port-map-regression NOMAD_PORT_<label> regression	2019-09-04 11:54:09 -04:00
Danielle	2564a1dabd	Merge pull request #6239 from hashicorp/b-32bitmem Fix memory fingerprinting on 32bit	2019-09-04 17:39:07 +02:00
Danielle	aa5605fce1	fingerprint: Add backwards compatibility comment Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-09-04 17:38:35 +02:00
Mahmood Ali	87f0457973	fix qemu and update docker with tests	2019-09-04 11:27:51 -04:00
Jasmine Dahilig	5b6e39b37c	fix portmap envvars in docker driver	2019-09-04 11:26:13 -04:00
Danielle Lancashire	67715d846e	fingerprint: Restore support for legacy memory fingerprint	2019-09-04 17:10:28 +02:00
Tim Gross	0f29dcc935	support script checks for task group services (#6197 ) In Nomad prior to Consul Connect, all Consul checks work the same except for Script checks. Because the Task being checked is running in its own container namespaces, the check is executed by Nomad in the Task's context. If the Script check passes, Nomad uses the TTL check feature of Consul to update the check status. This means in order to run a Script check, we need to know what Task to execute it in. To support Consul Connect, we need Group Services, and these need to be registered in Consul along with their checks. We could push the Service down into the Task, but this doesn't work if someone wants to associate a service with a task's ports, but do script checks in another task in the allocation. Because Nomad is handling the Script check and not Consul anyways, this moves the script check handling into the task runner so that the task runner can own the script check's configuration and lifecycle. This will allow us to pass the group service check configuration down into a task without associating the service itself with the task. When tasks are checked for script checks, we walk back through their task group to see if there are script checks associated with the task. If so, we'll spin off script check tasklets for them. The group-level service and any restart behaviors it needs are entirely encapsulated within the group service hook.	2019-09-03 15:09:04 -04:00
Pete Woods	49b7d23cea	Add node "status" and "scheduling eligibility" tags to client metrics (#6130 ) When summing up the capability of your Nomad fleet for scaling purposes, it's important to exclude draining nodes, as they won't accept new jobs.	2019-09-03 12:11:11 -04:00
Michael Schurter	5957030d18	connect: add unix socket to proxy grpc for envoy (#6232 ) * connect: add unix socket to proxy grpc for envoy Fixes #6124 Implement a L4 proxy from a unix socket inside a network namespace to Consul's gRPC endpoint on the host. This allows Envoy to connect to Consul's xDS configuration API. * connect: pointer receiver on structs with mutexes * connect: warn on all proxy errors	2019-09-03 08:43:38 -07:00
Mahmood Ali	0f401e6b8b	lint: ignore generated windows syscall wrappers	2019-09-03 10:59:58 -04:00
Jasmine Dahilig	4edebe389a	add default update stanza and max_parallel=0 disables deployments (#6191)	2019-09-02 10:30:09 -07:00
Evan Ercolano	fcf66918d0	Remove unused canary param from MakeTaskServiceID	2019-08-31 16:53:23 -04:00
Danielle Lancashire	4fcb7394e9	client: Fix memory fingerprinting on 32bit Also introduce regression ci for 32 bit fingerprinting	2019-08-31 18:33:59 +02:00
Mahmood Ali	0bd2eee87f	Merge pull request #6216 from hashicorp/b-recognize-pending-allocs alloc_runner: wait when starting suspicious allocs	2019-08-28 14:46:09 -04:00
Mahmood Ali	e0da3c5d0e	rename to hasLocalState, and ignore clientstate The ClientState being pending isn't a good criteria; as an alloc may have been updated in-place before it was completed. Also, updated the logic so we only check for task states. If an alloc has deployment state but no persisted tasks at all, restore will still fail.	2019-08-28 11:44:48 -04:00
Nick Ethier	cf014c7fd5	ar: ensure network forwarding is allowed for bridged allocs (#6196 ) * ar: ensure network forwarding is allowed in iptables for bridged allocs * ensure filter rule exists at setup time	2019-08-28 10:51:34 -04:00
Nick Ethier	cbb27e74bc	Add environment variables for connect upstreams (#6171 ) * taskenv: add connect upstream env vars + test * set taskenv upstreams instead of appending * Update client/taskenv/env.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-27 23:41:38 -04:00
Mahmood Ali	90c5eefbab	Alternative approach: avoid restoring This uses an alternative approach where we avoid restoring the alloc runner in the first place, if we suspect that the alloc may have been completed already.	2019-08-27 17:30:55 -04:00
Jasmine Dahilig	4078393bb6	expose nomad namespace as environment variable in allocation #5692 (#6192)	2019-08-27 08:38:07 -07:00
Mahmood Ali	647c1457cb	alloc_runner: wait when starting suspicious allocs This commit aims to help users running with clients susceptible to the destroyed alloc being restarted bug upgrade to latest. Without this, such users will have their tasks run unexpectedly on upgrade and only see the bug resolved after subsequent restart. If, on restore, the client sees a pending alloc without any other persisted info, then err on the side that it's a corrupt persisted state of an alloc instead of the client happening to be killed right when alloc is assigned to client. Few reasons motivate this behavior: Statistically speaking, corruption being the cause is more likely. A long running client will have higher chance of having allocs persisted incorrectly with pending state. Being killed right when an alloc is about to start is relatively unlikely. Also, delaying starting an alloc that hasn't started (by hopefully seconds) is not as severe as launching too many allocs that may bring client down. More importantly, this helps customers upgrade their clients without risking taking their clients down and destabilizing their cluster. We don't want existing users to force triggering the bug while they upgrade and restart cluster.	2019-08-26 22:05:31 -04:00
Mahmood Ali	dfdf0edd3b	Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun Don't persist allocs of destroyed alloc runners	2019-08-26 17:26:18 -04:00
Mahmood Ali	cc460d4804	Write to client store while holding lock Protect against a race where destroying and persist state goroutines race. The downside is that the database io operation will run while holding the lock and may run indefinitely. The risk of lock being long held is slow destruction, but slow io has bigger problems.	2019-08-26 13:45:58 -04:00
Mahmood Ali	1851820f20	logmon: log stat error to help debugging	2019-08-26 10:10:20 -04:00
Mahmood Ali	c132623ffc	Don't persist allocs of destroyed alloc runners This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information in non-transaction non-atomic concurrently while alloc runner is running and potentially changing state is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890	2019-08-25 11:21:28 -04:00
Mahmood Ali	6301725002	logmon: revert workaround for Windows go1.11 bug Revert e0126123ab1ba848f72458538bc6118c978245e6 now that we are running with Golang 1.12, and https://github.com/golang/go/issues/29119 is no longer relevant.	2019-08-24 08:19:44 -04:00
Mahmood Ali	df1f3eb9ee	Merge pull request #6201 from hashicorp/b-device-stats-interval initialize device manager stats interval	2019-08-24 08:16:03 -04:00
Mahmood Ali	07b5f4c530	Merge pull request #6146 from hashicorp/b-config-template-copy clientConfig.Copy() to copy template config too	2019-08-23 19:00:57 -04:00
Mahmood Ali	b98568774b	clientConfig.Copy() to copy template config too	2019-08-23 18:43:22 -04:00
Lang Martin	4f6493a301	taskrunner getter set Umask for go-getter, setuid test	2019-08-23 15:59:03 -04:00
Mahmood Ali	3890619100	initialize device manager stats interval Fixes a bug where the CPU is pegged at 100% due to collecting device statistics. The passed stats interval was ignored, and the default zero value causes a very tight loop of stats collection. FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a `g2.2xlarge` ec2 instance. The stats interval defaults to 1 second and is user configurable. I believe this is too frequent as a default, and I may advocate for reducing it to a value closer to 5s or 10s, but keeping it as is for now. Fixes https://github.com/hashicorp/nomad/issues/6057 .	2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet	cbdc1978bf	Consul service meta (#6193 ) * adds meta object to service in job spec, sends it to consul * adds tests for service meta * fix tests * adds docs * better hashing for service meta, use helper for copying meta when registering service * tried to be DRY, but looks like it would be more work to use the helper function	2019-08-23 12:49:02 -04:00
Nick Ethier	96d379071d	ar: fix bridge networking port mapping when port.To is unset (#6190 )	2019-08-22 21:53:52 -04:00
Michael Schurter	59e0b67c7f	connect: task hook for bootstrapping envoy sidecar Fixes #6041 Unlike all other Consul operations, bootstrapping requires Consul be available. This PR tries Consul 3 times with a backoff to account for the group services being asynchronously registered with Consul.	2019-08-22 08:15:32 -07:00
Michael Schurter	b008fd1724	connect: register group services with Consul Fixes #6042 Add new task group service hook for registering group services like Connect-enabled services. Does not yet support checks.	2019-08-20 12:25:10 -07:00
lchayoun	2307c9d1d2	allow dash in non generated environment variable names - should only clean generated environment variables	2019-08-16 11:11:47 +03:00
Nick Ethier	965f00b2fc	Builtin Admission Controller Framework (#6116 ) * nomad: add admission controller framework * nomad: add admission controller framework and Consul Connect hooks * run admission controllers before checking permissions * client: add default node meta for connect configurables * nomad: remove validateJob func since it has been moved to admission controller * nomad: use new TaskKind type * client: use consts for connect sidecar image and log level * Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad: add job register test with connect sidecar * Update nomad/job_endpoint_hooks.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-15 11:22:37 -04:00
lchayoun	c5a38a045a	allow dash in non generated environment variable names - should only clean generated environment variables	2019-08-13 19:23:13 +03:00
Tim Gross	03433f35d4	client/template: configuration for function blacklist and sandboxing When rendering a task template, the `plugin` function is no longer permitted by default and will raise an error. An operator can opt-in to permitting this function with the new `template.function_blacklist` field in the client configuration. When rendering a task template, path parameters for the `file` function will be treated as relative to the task directory by default. Relative paths or symlinks that point outside the task directory will raise an error. An operator can opt-out of this protection with the new `template.disable_file_sandbox` field in the client configuration.	2019-08-12 16:34:48 -04:00
Danielle Lancashire	7e6c8e5ac1	Copy documentation to api/tasks	2019-08-12 16:22:27 +02:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	e132a30899	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle Lancashire	6ef8d5233e	client: Add volume_hook for mounting volumes	2019-08-12 15:39:08 +02:00
Danielle Lancashire	063e4240c1	client: Add parsing and registration of HostVolume configuration	2019-08-12 15:39:08 +02:00
lchayoun	ca892163b2	allow dash in non generated environment variable names	2019-08-11 12:51:42 +03:00
Nick Ethier	7806f4c597	Revert "client: add autofetch for CNI plugins" This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.	2019-08-08 15:10:19 -04:00
Nick Ethier	7d28ece8de	Revert "client: remove debugging lines" This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.	2019-08-08 14:52:52 -04:00
Liel Chayoun	24dcb2379c	Update env_test.go	2019-08-06 11:59:31 +03:00
Mahmood Ali	b17bac5101	Render consul templates using task env only (#6055 ) When rendering a task consul template, ensure that only task environment variables are used. Currently, `consul-template` always falls back to host process environment variables when key isn't a task env var[1]. Thus, we add an empty entry for each host process env-var not found in task env-vars. [1] `bfa5d0e133/template/funcs.go (L61-L75)`	2019-08-05 16:30:47 -04:00
Mahmood Ali	f66169cd6a	Merge pull request #6065 from hashicorp/b-nil-driver-exec Check if driver handle is nil before execing	2019-08-02 09:48:28 -05:00
Mahmood Ali	a4670db9b7	Check if driver handle is nil before execing Defend against tr.getDriverHandle being nil. Exec handler checks if task is running, but it may be stopped between check and driver handler fetching.	2019-08-02 10:07:41 +08:00
Nick Ethier	7de0bec8ab	client/cni: updated comments and simplified logic to auto download plugins	2019-07-31 01:04:10 -04:00
Nick Ethier	b16640c50d	Apply suggestions from code review Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2019-07-31 01:04:10 -04:00
Nick Ethier	321d10a041	client: remove debugging lines	2019-07-31 01:04:09 -04:00
Nick Ethier	af6b191963	client: add autofetch for CNI plugins	2019-07-31 01:04:09 -04:00
Nick Ethier	1e9dd1b193	remove unused file	2019-07-31 01:04:09 -04:00
Nick Ethier	09a4cfd8d7	fix failing tests	2019-07-31 01:04:07 -04:00
Nick Ethier	ef83f0831b	ar: plumb client config for networking into the network hook	2019-07-31 01:04:06 -04:00
Nick Ethier	af66a35924	networking: Add new bridge networking mode implementation	2019-07-31 01:04:06 -04:00
Michael Schurter	fb487358fb	connect: add group.service stanza support	2019-07-31 01:04:05 -04:00
Nick Ethier	63c5504d56	ar: fix lint errors	2019-07-31 01:03:19 -04:00
Nick Ethier	e312201d18	ar: rearrange network hook to support building on windows	2019-07-31 01:03:19 -04:00
Nick Ethier	370533c9c7	ar: fix test that failed due to error renaming	2019-07-31 01:03:19 -04:00
Nick Ethier	2d60ef64d9	plugins/driver: make DriverNetworkManager interface optional	2019-07-31 01:03:19 -04:00
Nick Ethier	f87e7e9c9a	ar: plumb error handling into alloc runner hook initialization	2019-07-31 01:03:18 -04:00
Nick Ethier	ef1795b344	ar: add tests for network hook	2019-07-31 01:03:18 -04:00
Nick Ethier	15989bba8e	ar: cleanup lint errors	2019-07-31 01:03:18 -04:00
Nick Ethier	220cba3e7e	ar: move linux specific code to its own file and add tests	2019-07-31 01:03:18 -04:00
Nick Ethier	548f78ef15	ar: initial driver based network management	2019-07-31 01:03:17 -04:00
Nick Ethier	66c514a388	Add network lifecycle management Adds new Prerun and Postrun hooks to manage set up of network namespaces on linux. Work still needs to be done to make the code platform agnostic and support Docker style network initialization.	2019-07-31 01:03:17 -04:00
Preetha Appan	d048029b5a	remove generated code and change version to 0.10.0	2019-07-30 15:56:05 -05:00
Nomad Release bot	e39fb11531	Generate files for 0.9.4 release	2019-07-30 19:05:18 +00:00
Preetha Appan	6b4c40f5a8	remove generated code	2019-07-23 12:07:49 -05:00
Nomad Release bot	04187c8b86	Generate files for 0.9.4-rc1 release	2019-07-22 21:42:36 +00:00
Michael Schurter	d90680021e	logmon: fix comment formatting	2019-07-22 13:05:01 -07:00
Michael Schurter	e37bc3513c	logmon: ensure errors are still handled properly ...and add a comment to switch back to the old error handling once we switch to Go 1.12.	2019-07-22 12:49:48 -07:00
Danielle Lancashire	1bcbbbfbe6	logmon: Workaround golang/go#29119 There's a bug in go1.11 that causes some io operations on windows to return incorrect errors for some cases when Stat-ing files. To avoid upgrading to go1.12 in a point release, here we loosen up the cases where we will attempt to create fifos, and add some logging of underlying stat errors to help with debugging.	2019-07-22 18:28:12 +02:00
Jasmine Dahilig	2157f6ddf1	add formatting for hcl parsing error messages (#5972 )	2019-07-19 10:04:39 -07:00
Mahmood Ali	cd6f1d3102	Update consul-template dependency to latest To pick up the fix in https://github.com/hashicorp/consul-template/pull/1231 .	2019-07-18 07:32:03 +07:00
Mahmood Ali	8a82260319	log unrecoverable errors	2019-07-17 11:01:59 +07:00
Mahmood Ali	1a299c7b28	client/taskrunner: fix stats retry logic Previously, if a channel is closed, we retry the Stats call. But, if that call fails, we go in a backoff loop without calling Stats ever again. Here, we use a utility function for calling driverHandle.Stats call that retries as one expects. I aimed to preserve the logging formats but made small improvements as I saw fit.	2019-07-11 13:58:07 +08:00
Preetha Appan	7d645c5ad9	Test file for detect content type that satisfies linter and encoding	2019-07-10 11:42:04 -05:00
Preetha Appan	ef9a71c68b	code review feedback	2019-07-10 10:41:06 -05:00
Preetha Appan	990e468edc	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Preetha Appan	108a292cc0	fix linting failure in test case file	2019-07-08 11:29:12 -05:00
Michael Lange	b2e9570075	Use consistent casing in the JSON representation of the AllocFileInfo struct	2019-07-02 17:27:31 -07:00
Preetha Appan	8495fb9055	Added additional test cases and fixed go test case	2019-07-02 13:25:29 -05:00
Mahmood Ali	a97d451ac7	Merge pull request #5905 from hashicorp/b-ar-failed-prestart Fail alloc if alloc runner prestart hooks fail	2019-07-02 20:25:53 +08:00
Danielle	c6872cdf12	Merge pull request #5864 from hashicorp/dani/win-pipe-cleaner windows: Fix restarts using the raw_exec driver	2019-07-02 13:58:56 +02:00
Danielle Lancashire	e20300313f	fifo: Safer access to Conn	2019-07-02 13:12:54 +02:00
Mahmood Ali	f10201c102	run post-run/post-stop task runner hooks Handle when prestart failed while restoring a task, to prevent accidentally leaking consul/logmon processes.	2019-07-02 18:38:32 +08:00
Mahmood Ali	4afd7835e3	Fail alloc if alloc runner prestart hooks fail When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are: * Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861 * Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed * Alloc not being restarted/rescheduled to another node (as it's still in pending state) * Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time! Here, we treat all tasks to have failed if alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node. While it's desirable to retry alloc runner in such failures, I opted to treat it out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that's better handled in a follow up PR. This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840 .	2019-07-02 18:35:47 +08:00
Mahmood Ali	7614b8f09e	Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2 task runner to avoid running task if terminal	2019-07-02 15:31:17 +08:00
Mahmood Ali	7bfad051b9	address review comments	2019-07-02 14:53:50 +08:00
Mahmood Ali	c0c00ecc07	Merge pull request #5906 from hashicorp/b-alloc-stale-updates client: defensive against getting stale alloc updates	2019-07-02 12:40:17 +08:00
Preetha Appan	c09342903b	Improve test cases for detecting content type	2019-07-01 16:24:48 -05:00
Danielle Lancashire	688f82f07d	fifo: Close connections and cleanup lock handling	2019-07-01 14:14:29 +02:00
Danielle Lancashire	2c7d1f1b99	logmon: Add windows compatibility test	2019-07-01 14:14:06 +02:00
Mahmood Ali	c5f5a1fcb9	client: defensive against getting stale alloc updates When fetching node alloc assignments, be defensive against a stale read before killing local nodes allocs. The bug is when both client and servers are restarting and the client requests the node allocation for the node, it might get stale data as server hasn't finished applying all the restored raft transaction to store. Consequently, client would kill and destroy the alloc locally, just to fetch it again moments later when server store is up to date. The bug can be reproduced quite reliably with single node setup (configured with persistence). I suspect it's too edge-casey to occur in production cluster with multiple servers, but we may need to examine leader failover scenarios more closely. In this commit, we only remove and destroy allocs if the removal index is more recent than the alloc index. This seems like a cheap resiliency fix we already use for detecting alloc updates. A more proper fix would be to ensure that a nomad server only serves RPC calls when state store is fully restored or up to date in leadership transition cases.	2019-06-29 04:17:35 -05:00
Preetha Appan	3345ce3ba4	Infer content type in alloc fs stat endpoint	2019-06-28 20:31:28 -05:00
Danielle Lancashire	e1151f743b	appveyor: Run logmon tests	2019-06-28 16:01:41 +02:00
Danielle Lancashire	634ada671e	fifo: Require that fifos do not exist for create Although this operation is safe on linux, it is not safe on Windows when using the named pipe interface. To provide a ~reasonable common api abstraction, here we switch to returning File exists errors on the unix api.	2019-06-28 13:47:18 +02:00
Danielle Lancashire	0ff27cfc0f	vendor: Use dani fork of go-winio	2019-06-28 13:47:18 +02:00
Danielle Lancashire	514a2a6017	logmon: Refactor fifo access for windows safety On unix platforms, it is safe to re-open fifos for reading after the first creation if the file is already a fifo, however this is not possible on windows where this triggers a permissions error on the socket path, as you cannot recreate it. We can't transparently handle this in the CreateAndRead handle, because the Access Is Denied error is too generic to reliably be an IO error. Instead, we add an explicit API for opening a reader to an existing FIFO, and check to see if the fifo already exists inside the calling package (e.g. logmon)	2019-06-28 13:41:54 +02:00
Mahmood Ali	3d89ae0f1e	task runner to avoid running task if terminal This change fixes a bug where nomad would avoid running alloc tasks if the alloc is client terminal but the server copy on the client isn't marked as running. Here, we fix the case by having task runner use the allocRunner.shouldRun() instead of only checking the server updated alloc. Here, we preserve much of the invariants such that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles. Fixes https://github.com/hashicorp/nomad/issues/5883	2019-06-27 11:27:34 +08:00
Danielle Lancashire	b9ac184e1f	tr: Fetch Wait channel before killTask in restart Currently, if killTask results in the termination of a process before calling WaitTask, Restart() will incorrectly return a TaskNotFound error when using the raw_exec driver on Windows.	2019-06-26 15:20:57 +02:00
Mahmood Ali	b209584dce	Merge pull request #5726 from hashicorp/b-plugins-via-init Use init() to handle plugin invocation	2019-06-18 21:09:03 -04:00
Mahmood Ali	ac64509c59	comment on use of init() for plugin handlers	2019-06-18 20:54:55 -04:00
Chris Baker	f71114f5b8	cleanup test	2019-06-18 14:15:25 +00:00
Chris Baker	a2dc351fd0	formatting and clarity	2019-06-18 14:00:57 +00:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Mahmood Ali	962921f86c	Use init to handle plugin invocation Currently, nomad "plugin" processes (e.g. executor, logmon, docker_logger) are started as CLI commands to be handled by command CLI framework. Plugin launchers use `discover.NomadBinary()` to identify the binary and start it. This has a few downsides: The trivial one is that when running tests, one must re-compile the nomad binary as the tests need to invoke the nomad executable to start plugin. This is frequently overlooked, resulting in puzzlement. The more significant issue with `executor` in particular is in relation to external driver: * Plugin must identify the path of invoking nomad binary, which is not trivial; `discover.NomadBinary()` now returns the path to the plugin rather than to nomad, preventing external drivers from launching executors. * The external driver may get a different version of executor than it expects (especially if we make a binary incompatible change in future). This commit addresses both downsides by having the plugin invocation handling through an `init()` call, similar to how libcontainer init handler is done in [1] and recommended by libcontainer [2]. `init()` will be invoked and handled properly in tests and external drivers. For external drivers, this change will cause external drivers to launch the executor that they are compiled against. There are a couple of downsides to this approach: * These specific packages (i.e. executor, logmon, and dockerlog) need to be careful in use of `init()`, package initializers. Must avoid having command execution rely on any other init in the package. I prefixed files with `z_` (golang processes files in lexical order), but ensured we don't depend on order. * The command handling is spread in multiple packages making it a bit less obvious how plugin starts are handled. [1] drivers/shared/executor/libcontainer_nsenter_linux.go [2] `eb4aeed24f/libcontainer (using-libcontainer)`	2019-06-13 16:48:01 -04:00
Jasmine Dahilig	ed9740db10	Merge pull request #5664 from hashicorp/f-http-hcl-region backfill region from hcl for jobUpdate and jobPlan	2019-06-13 12:25:01 -07:00
Jasmine Dahilig	51e141be7a	backfill region from job hcl in jobUpdate and jobPlan endpoints - updated region in job metadata that gets persisted to nomad datastore - fixed many unrelated unit tests that used an invalid region value (they previously passed because hcl wasn't getting picked up and the job would default to global region)	2019-06-13 08:03:16 -07:00
Mahmood Ali	e31159bf1f	Prepare for 0.9.4 dev cycle	2019-06-12 18:47:50 +00:00
Nomad Release bot	4803215109	Generate files for 0.9.3 release	2019-06-12 16:11:16 +00:00
Danielle	f923b568e0	Merge pull request #5821 from hashicorp/dani/b-5770 trhooks: Add TaskStopHook interface to services	2019-06-12 17:30:49 +02:00
Danielle Lancashire	c326344b57	trt: Fix test	2019-06-12 17:06:11 +02:00
Danielle Lancashire	13d76e35fd	trhooks: Add TaskStopHook interface to services We currently only run cleanup Service Hooks when a task is either Killed, or Exited. However, due to the implementation of a task runner, tasks are only Exited if they ever correctly started running, which is not true when you receive an error early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770	2019-06-12 16:00:21 +02:00
Mahmood Ali	2acf30fdd3	Fallback to `alloc.TaskResources` for old allocs When a client is running against an old server (e.g. running 0.8), `alloc.AllocatedResources` may be nil, and we need to check the deprecated `alloc.TaskResources` instead. Fixes https://github.com/hashicorp/nomad/issues/5810	2019-06-11 10:32:53 -04:00
Mahmood Ali	7a4900aaa4	client/allocrunner: depend on internal task state Alloc runner already tracks tasks associated with alloc. Here, we become defensive by relying on the alloc runner tracked tasks, rather than depend on server never updating the job unexpectedly.	2019-06-10 18:42:51 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	935ee86e92	Merge pull request #5737 from fwkz/fix-restart-attempts Fix restart attempts of `restart` stanza in `delay` mode.	2019-06-05 19:05:07 -04:00
Mahmood Ali	97957fbf75	Prepare for 0.9.3 dev cycle	2019-06-05 14:54:00 +00:00
Nomad Release bot	43bfbf3fcc	Generate files for 0.9.2 release	2019-06-05 11:59:27 +00:00
Mahmood Ali	a9f81f2daa	client config flag to disable remote exec This exposes a client flag to disable nomad remote exec support in environments where access to tasks ought to be restricted. I used `disable_remote_exec` client flag that defaults to allowing remote exec. Opted for a client config that can be used to disable remote exec globally, or to a subset of the cluster if necessary.	2019-06-03 15:31:39 -04:00
Mahmood Ali	a4ead8ff79	remove 0.9.2-rc1 generated code	2019-05-23 11:14:24 -04:00
Nomad Release bot	6d6bc59732	Generate files for 0.9.2-rc1 release	2019-05-22 19:29:30 +00:00
Michael Schurter	a54511b304	Merge pull request #5731 from hashicorp/b-ignore-dc client: drop unused DC field from servers list	2019-05-22 08:42:15 -07:00
Mahmood Ali	84419f08ce	client: synchronize client.invalidAllocs access invalidAllocs may be accessed and manipulated from different goroutines, so must be locked.	2019-05-22 09:37:49 -04:00
Danielle Lancashire	27583ed8c1	client: Pass servers contacted ch to allocrunner This fixes an issue where batch and service workloads would never be restarted due to indefinitely blocking on a nil channel. It also raises the restoration logging message to `Info` to simplify log analysis.	2019-05-22 13:47:35 +02:00
Mahmood Ali	9df1e00f35	tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal Given that Signal may be called multiple times, blocking for `SignalCh` isn't sufficient to synchronize access to Signals field.	2019-05-21 13:56:58 -04:00
Mahmood Ali	b06e585713	Merge pull request #5739 from hashicorp/r-rm-logmon-syslog-deadcode logmon: remove syslog server deadcode	2019-05-21 11:46:48 -04:00
Mahmood Ali	eca23bf9c4	Merge pull request #5742 from hashicorp/b-test-fixes-20190520 Grab bag of (primarily race) test fixes	2019-05-21 11:46:36 -04:00
Mahmood Ali	e88bb61488	Merge pull request #5740 from hashicorp/b-nomad-exec-term-race exec: allow drivers to handle stream termination	2019-05-21 11:24:12 -04:00
Mahmood Ali	b475ccbe3e	client: synchronize access to ar.alloc `allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's use `allocRunner.Alloc()` helper function to access it.	2019-05-21 09:55:05 -04:00
Mahmood Ali	2a7b073167	tests: fix fifo lib race Accidentally accessed outer `err` variable inside a goroutine	2019-05-21 09:49:56 -04:00
Mahmood Ali	296bd41c9e	tests: fix data race in client TestDriverManager_Fingerprint_Periodic	2019-05-21 09:49:56 -04:00
Mahmood Ali	d9e59eece0	tests: fix client TestFS_Stream data race Close is invoked in a different goroutine from test	2019-05-21 09:49:56 -04:00
Mahmood Ali	75e0a3f405	exec: allow drivers to handle stream termination Without this change, alloc_endpoint cancels the context passed to handler when we detect EOF. This races driver in setting exit code; and we run into a case where the exec process terminates cleanly yet we attempt to mark it as failed with context error. Here, we rely on the driver to handle errors returned from Stream and without racing to set an error.	2019-05-21 09:40:25 -04:00
Mahmood Ali	974bcbecc9	logmon: remove syslog server deadcode Remove unused syslog server related code that got replaced by the docker logger in Nomad 0.9	2019-05-21 09:36:43 -04:00
fwkz	8b84bec95a	Fix restart attempts of `restart` stanza. Number of restarts during 2nd interval is off by one.	2019-05-21 13:27:19 +02:00
Michael Schurter	d41abda957	client: drop unused DC field from servers list See #5730 for details.	2019-05-20 14:19:15 -07:00
Michael Schurter	2fe0768f3b	docs: changelog entry for #5669 and fix comment	2019-05-14 10:54:00 -07:00
Michael Schurter	af9096c8ba	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	e07f73bfe0	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	d7e5ace1ed	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	3b1f8991a1	client: log when server list changes Stop logging in the happy path when nothing has changed.	2019-05-13 15:42:55 -07:00
Michael Schurter	48db8135da	Merge pull request #5492 from hashicorp/f-allocated-mem client: expose allocated memory per task	2019-05-13 13:31:22 -07:00
Lang Martin	1d03a43ce2	Merge pull request #5642 from hashicorp/b-network-fingerprinting-ipv4 network fingerprinting multiple IPs on the configured network device	2019-05-13 11:46:53 -04:00
Michael Schurter	1c4e585fa7	client: expose allocated memory per task Related to #4280 This PR adds `client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge in bytes to metrics to ease calculating how close a task is to OOMing. ``` 'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000 'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000 'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000 'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000 'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000 ```	2019-05-10 11:12:12 -07:00
Lang Martin	f6bc45dd23	client improve a comment in updateNetworks	2019-05-10 11:25:04 -04:00
Mahmood Ali	919827f2df	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	ab2cae0625	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Preetha	1d02886bb6	Merge pull request #5654 from hashicorp/b-hearbeat-lockfix Remove unnecessary locking and serverlist syncing in heartbeats	2019-05-08 13:36:39 -05:00
Preetha Appan	3289e7f4a0	fix typo and add one more test scenario	2019-05-08 10:54:22 -05:00
Preetha Appan	db6b291a5a	code review feedback	2019-05-07 16:23:32 -05:00
Chris Baker	93ec1293be	stale allocation data leads to incorrect (and even negative) metrics (#5637 ) * client: was not using up-to-date client state in determining which alloc count towards allocated resources * Update client/client.go Co-Authored-By: cgbaker <cgbaker@hashicorp.com>	2019-05-07 15:54:36 -04:00
Preetha Appan	b063fc81a4	Remove unnecessary locking and serverlist syncing in heartbeats This removes an unnecessary shared lock between discovery and heartbeating which was causing heartbeats to be missed upon retries when a single server fails. Also made a drive by fix to call the periodic server shuffler goroutine.	2019-05-06 14:44:55 -05:00
Michael Schurter	8c7b3ff45a	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:01:30 -05:00
Michael Schurter	e19fa33f9c	Remove unnecessary boolean clause Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:00:17 -05:00
Preetha Appan	b99a204582	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Lang Martin	c32cce51f0	client fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Lang Martin	94f23016a2	client_test new test fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Mahmood Ali	7a32d3f3aa	client: handle 0.8 server network resources Fixes https://github.com/hashicorp/nomad/issues/5587 When a nomad 0.9 client is handling an alloc generated by a nomad 0.8 server, we should check the alloc.TaskResources for networking details rather than task.Resources. We check alloc.TaskResources for networking for other tasks in the task group [1], so it's a bit odd that we used the task.Resources struct here. TaskRunner also uses `alloc.TaskResources`[2]. The task.Resources struct in 0.8 was sparsely populated, resulting in 0 being stored in port mapping env vars: ``` vagrant@nomad-server-01:~$ nomad version Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES) vagrant@nomad-server-01:~$ nomad server members Name Address Port Status Leader Protocol Build Datacenter Region nomad-server-01.global 10.199.0.11 4648 alive true 2 0.8.7 dc1 global vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b \| jq '.Job.TaskGroups[0].Tasks[0].Resources.Networks' [ { "CIDR": "", "Device": "", "DynamicPorts": [ { "Label": "db", "Value": 0 } ], "IP": "", "MBits": 10, "ReservedPorts": null } ] vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b \| jq '.TaskResources' { "redis": { "CPU": 500, "DiskMB": 0, "IOPS": 0, "MemoryMB": 256, "Networks": [ { "CIDR": "", "Device": "eth1", "DynamicPorts": [ { "Label": "db", "Value": 21722 } ], "IP": "10.199.0.21", "MBits": 10, "ReservedPorts": null } ] } } ``` Also, updated the test values to mimic how Nomad 0.8 structs are represented, and made its result match the non compact values in `TestEnvironment_AsList`. [1] `24e9040b18/client/taskenv/env.go (L624-L639)` [2] https://github.com/hashicorp/nomad/blob/master/client/allocrunner/taskrunner/task_runner.go#L287-L303	2019-05-02 12:08:38 -04:00
Mahmood Ali	446f06721d	aux: helper method that returns token as well as ACL policy This helper returns the token as well as the ACL policy, to be used in a later commit for logging the token info associated with nomad exec invocation.	2019-04-30 10:23:56 -04:00

4155 commits