open-nomad

Author	SHA1	Message	Date
Tim Gross	bb7437ab9b	changelog entry for #10756	2021-06-16 22:02:10 -04:00
Seth Hoenig	532b898b07	consul/connect: in-place update service definition when connect upstreams are modified This PR fixes a bug where modifying the upstreams of a Connect sidecar proxy would not result in Consul applying the changes, unless an additional change to the job would trigger a task replacement (thus replacing the service definition). The fix is to check if upstreams have been modified between Nomad's view of the sidecar service definition, and the service definition for the sidecar that is actually registered in Consul. Fixes #8754	2021-06-16 16:48:26 -05:00
Tim Gross	7bd61bbf43	docker: generate /etc/hosts file for bridge network mode (#10766) When `network.mode = "bridge"`, we create a pause container in Docker with no networking so that we have a process to hold the network namespace we create in Nomad. The default `/etc/hosts` file of that pause container is then used for all the Docker tasks that share that network namespace. Some applications rely on this file being populated. This changeset generates a `/etc/hosts` file and bind-mounts it to the container when Nomad owns the network, so that the container's hostname has an IP in the file as expected. The hosts file will include the entries added by the Docker driver's `extra_hosts` field. In this changeset, only the Docker task driver will take advantage of this option, as the `exec`/`java` drivers currently copy the host's `/etc/hosts` file and this can't be changed without breaking backwards compatibility. But the fields are available in the task driver protobuf for community task drivers to use if they'd like.	2021-06-16 14:55:22 -04:00
dependabot[bot]	7a5a8ac8c1	build(deps): bump postcss from 7.0.35 to 7.0.36 in /website (#10772) Bumps [postcss](https://github.com/postcss/postcss) from 7.0.35 to 7.0.36. - [Release notes](https://github.com/postcss/postcss/releases) - [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md) - [Commits](https://github.com/postcss/postcss/compare/7.0.35...7.0.36) --- updated-dependencies: - dependency-name: postcss dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2021-06-16 12:18:43 -04:00
dependabot[bot]	56ba7ab2cc	build(deps): bump ws from 7.3.1 to 7.4.6 in /scripts/screenshots/src (#10671) Bumps [ws](https://github.com/websockets/ws) from 7.3.1 to 7.4.6. - [Release notes](https://github.com/websockets/ws/releases) - [Commits](https://github.com/websockets/ws/compare/7.3.1...7.4.6) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2021-06-16 11:09:34 -04:00
Seth Hoenig	c7826699da	Merge pull request #10765 from hashicorp/b-java-fp-version client/fingerprint/java: improve java version string regex matching	2021-06-15 17:14:13 -05:00
Seth Hoenig	8f493cfa89	client/fingerprint/java: improve java version string regex matching This PR improves the regular expression used for matching the java version string, which varies a lot depending on the java vendor and version. These are the example strings we now test for: java version "1.7.0_80" openjdk version "11.0.1" 2018-10-16 openjdk version "11.0.1" 2018-10-16 java version "1.6.0_36" openjdk version "1.8.0_192" openjdk 11.0.11 2021-04-20 LTS The last one is a new test added on behalf of #6081, which is still broken on today's CentOS 7 default JDK package. openjdk 11.0.11 2021-04-20 LTS OpenJDK Runtime Environment 18.9 (build 11.0.11+9-LTS) OpenJDK 64-Bit Server VM 18.9 (build 11.0.11+9-LTS, mixed mode, sharing) ==> Evaluation "21c6caf7" finished with status "complete" but failed to place all allocations: Task Group "example" (failed to place 1 allocation): * Constraint "${driver.java.version} >= 11.0.0": 1 nodes excluded by filter Evaluation "2b737d48" waiting for additional capacity to place remainder Fixes #6081	2021-06-15 14:15:01 -05:00
Seth Hoenig	a470b2e3c8	Merge pull request #10764 from hashicorp/b-passfail-lost consul: make failures_before_critical and success_before_passing work with group services	2021-06-15 12:41:04 -05:00
Seth Hoenig	4e3b543450	docs: add bugfix note to 1.0.8	2021-06-15 12:40:44 -05:00
Seth Hoenig	d75669da4a	consul: make failures_before_critical and success_before_passing work with group services This PR fixes some job submission plumbing to make sure the Consul Check parameters failures_before_critical and success_before_passing work with group-level services. They already work with task-level services.	2021-06-15 11:20:40 -05:00
Seth Hoenig	8a6686251b	Merge pull request #10762 from hashicorp/docs-update-cl-2 docs: update changelog	2021-06-15 09:25:51 -05:00
Seth Hoenig	e701dd4bea	docs: update changelog	2021-06-15 09:17:06 -05:00
James Rasell	ac6e72c2be	Merge pull request #10758 from hashicorp/b-fix-test-datarace-plugins plugins: fix test data race.	2021-06-15 14:33:53 +02:00
James Rasell	87413ff0cd	plugins: fix test data race.	2021-06-15 09:31:08 +02:00
Isabel Suchanek	e3cde4f4b3	cli: check deployment exists before monitoring (#10757) System and batch jobs don't create deployments, which means nomad tries to monitor a non-existent deployment when it runs a job and outputs an error message. This adds a check to make sure a deployment exists before monitoring. Also fixes some formatting. Co-authored-by: Tim Gross <tgross@hashicorp.com>	2021-06-14 16:42:38 -07:00
Mahmood Ali	33dfe98770	deployment watcher: Reuse allocsCh if allocIndex remains the same (#10756) Fix deployment watchers to avoid creating unnecessary deployment watcher goroutines and blocking queries. `deploymentWatcher.getAllocsCh` creates a new goroutine that makes a blocking query to fetch updates of deployment allocs. ## Background When operators submit a new or updated service job, Nomad creates a new deployment by default. The deployment object controls how fast to place the allocations through [`max_parallel`](https://www.nomadproject.io/docs/job-specification/update#max_parallel) and health check configurations. The `scheduler` and `deploymentwatcher` packages collaborate to achieve deployment logic: The scheduler only places the canaries and `max_parallel` allocations for a new deployment; the `deploymentwatcher` monitors for alloc progress and then enqueues a new evaluation whenever the scheduler should reprocess a job and place the next `max_parallel` round of allocations. The `deploymentwatcher` package makes blocking queries against the state store, to fetch all deployments and the relevant allocs for each running deployment. If `deploymentwatcher` fails or is hindered from fetching the state, the deployments fail to make progress. `Deploymentwatcher` logic only runs on the leader. ## Why unnecessary deployment watchers can halt cluster progress Previously, `getAllocsCh` was called on every for-loop iteration in the `deploymentWatcher.watch()` function. However, the for-loop may iterate many times before the allocs get updated. In fact, whenever a new deployment is created/updated/deleted, all `deploymentWatcher`s get notified through `w.deploymentUpdateCh`. The `getAllocsCh` goroutines and blocking queries spike significantly and grow quadratically with respect to the number of running deployments. The growth leads to two adverse outcomes: 1. it spikes the CPU/Memory usage, potentially leading to OOM or very slow processing 2. it activates the [query rate limiter](`abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L896-L898)`), so later the watcher fails to get updates and consequently fails to make progress towards placing new allocations for the deployment! So the cluster fails to catch up and fails to make progress in almost all deployments. The cluster recovers after a leader transition: the deposed leader stops all watchers and frees up goroutines and blocking queries; the new leader recreates the watchers without the quadratic growth and remains under the rate limiter. Well, until a spike of deployments is created, triggering the condition again. ### Relevant Code References Path for deployment monitoring: * [`Watcher.watchDeployments`](`abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L164-L192)`) loops waiting for deployment updates. * On every deployment update, [`w.getDeploys`](`abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L194-L229)`) returns all deployments in the system * `watchDeployments` calls `w.add(d)` on every active deployment * which in turn [updates the existing watcher if one is found](`abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L251-L255)`). * The deployment watcher [updates the local deployment field and triggers the `deploymentUpdateCh` channel](`abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L136-L147)`) * The [deployment watcher `deploymentUpdateCh` selector is activated](`abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L455-L489)`). Most of the time the selector clause is a no-op, because the flow was triggered due to another deployment update * The `watch` for-loop iterates again and in the previous code we create yet another goroutine and blocking call that risks being rate limited. Co-authored-by: Tim Gross <tgross@hashicorp.com>	2021-06-14 16:01:01 -04:00
Jai Bhagat	b92ab047ff	edit the computed agent version property This PR edits the computed agent version that is returned upon hitting the agent self request endpoint. We believe that the Agent Member Tag property sometimes returns null because we may have cases where there are only clients and no servers, and only servers are included in the Serf Gossip Protocol. There may be other cases where we do in fact have servers but the node is erased for some reason. We are unsure how to replicate that issue, however. edit mirage config This commit updates the Mirage Config because our acceptance tests depend on the Mirage Config, while we rely on Mirage Factories to populate fixture data for us to use when running the Nomad UI locally Revert "update the open-button disability functionality depending on a job's state" This reverts commit 5190b308a51d55a7b0617854164c155d36d7e513.	2021-06-14 13:22:36 -04:00
Seth Hoenig	6eeaefa59f	Merge pull request #10754 from hashicorp/b-client-connect-constraint consul/connect: remove unnecessary connect constraint on clients	2021-06-14 09:41:25 -05:00
James Rasell	2154bd93ab	Merge pull request #10752 from hashicorp/b-fix-test-datarace-volumewatcher volumewatcher: fix test data race.	2021-06-14 16:30:34 +02:00
Tim Gross	38a0057715	quotas: evaluate quota feasibility last in scheduler (#10753) The `QuotaIterator` is used as the source of nodes passed into feasibility checking for constraints. Every node that passes the quota check counts the allocation resources against the quota, and as a result we count nodes which will be later filtered out by constraints. Therefore for jobs with constraints, nodes that are feasibility checked but fail have been counted against quotas. This failure mode is order dependent; if all the unfiltered nodes happen to be quota checked first, everything works as expected. This changeset moves the `QuotaIterator` to happen last among all feasibility checkers (but before ranking). The `QuotaIterator` will never receive filtered nodes so it will calculate quotas correctly.	2021-06-14 10:11:40 -04:00
Seth Hoenig	7b8e15159b	consul/connect: remove unnecessary connect constraint on clients PR https://github.com/hashicorp/nomad/pull/10702 added 2 new constraints for connect jobs - one for Consul gRPC listener, and one for Connect being enabled on Clients. Connect does not need to be enabled on clients, only on Consul servers. Remove the extra constraint. Discuss: https://discuss.hashicorp.com/t/nomad-1-1-1-and-consul-connect-enabled-on-consul-clients/25295	2021-06-14 08:01:45 -05:00
James Rasell	0cccf7c2b8	volumewatcher: fix test data race.	2021-06-14 12:11:35 +02:00
Brandon Romano	f919ddef55	Merge pull request #10750 from hashicorp/br.quote-image Fix headshot image 404	2021-06-11 15:38:09 -07:00
Brandon Romano	c934102507	Fix headshot image 404	2021-06-11 15:31:05 -07:00
Luiz Aoqui	98e0e952a6	fix agent-info help message formatting (#10747)	2021-06-11 15:39:28 -04:00
James Rasell	a99fcfb4c8	Merge pull request #10745 from hashicorp/b-fix-test-datarace-deploymentwatcher deploymentwatcher: fix test data race.	2021-06-11 17:23:03 +02:00
James Rasell	939b23936a	Merge pull request #10744 from hashicorp/b-remove-duplicate-imports chore: remove duplicate import statements	2021-06-11 16:42:34 +02:00
Mahmood Ali	74efd3626e	Merge pull request #10742 from hashicorp/deflake-tests-20210608 Deflaking Test 2021 June edition	2021-06-11 09:14:40 -04:00
James Rasell	c168108bb7	Merge pull request #10739 from hashicorp/f-remove-unused-types-pkg core: remove unused types pkg and PeriodicCallback type.	2021-06-11 13:27:22 +02:00
James Rasell	ff75b4da09	deploymentwatcher: fix test data race.	2021-06-11 11:55:21 +02:00
James Rasell	492e308846	tests: remove duplicate import statements.	2021-06-11 09:39:22 +02:00
James Rasell	3b16e4d0b3	jobspec2: remove duplicate imports statements.	2021-06-11 09:38:47 +02:00
James Rasell	050b5408c7	drivers: remove duplicate import statements.	2021-06-11 09:38:09 +02:00
James Rasell	2898e5d379	e2e: remove duplicate import statements.	2021-06-11 09:37:23 +02:00
Mahmood Ali	9b35bf1858	deflake TestNomad_BootstrapExpect and other leader tests The test fails reliably locally on my machine. The test uses non-dev mode where Raft actions get committed to disk, causing operations to exceed the tight 50ms Raft deadlines. So, here we ensure that non-dev servers use default Raft config files with longer timeouts. Also noticed that the test queries a server that may be a follower with stale state. I've updated the test to ensure we query the leader for its state. The Barrier call ensures that the leader is a "stable" leader with committed entries. Protects against a window where a new leader reports the previous term before it commits a raft log entry.	2021-06-10 22:04:10 -04:00
Mahmood Ali	ff73cc279e	tests: deflake TestAgentProfile_RemoteClient TestAgentProfile_RemoteClient test must wait for the client node to be registered in raft state store, and not merely that the server has a network connection from the client. In https://app.circleci.com/pipelines/github/hashicorp/nomad/15539/workflows/8dcbc3f3-946b-4da0-b089-9093788bc0c9/jobs/147919, notice how the `node registration complete` log line occurred after the test had already failed. This is another case of flakiness due to not waiting for client registration.	2021-06-10 22:00:15 -04:00
Mahmood Ali	8009d9837c	tests: deflake TestMonitor_Monitor_RemoteServer and cross-region tests Ensure that all servers are joined to each other before the test proceeds, instead of just joining them to the first server and relying on background serf propagation. Relying on background serf propagation is a cause of flakiness, especially for tests with multiple regions. The server receiving the RPC may not be aware of the region and fail to forward the RPC accordingly. For example, consider the `TestMonitor_Monitor_RemoteServer` failure in https://app.circleci.com/pipelines/github/hashicorp/nomad/16402/workflows/7f327235-7d0c-40ba-9757-600522afca51/jobs/158045 where you can observe: * `nomad-117` is joined to `nomad-118` and `nomad-119` * `nomad-119` is the foreign region * `nomad-117` gains leadership in the default region, `nomad-118` is the non-leader * search logs for `nomad: adding server` and notice that `nomad-118` only added `nomad-117` and `nomad-118`, but not `nomad-119`! * so the query to the non-leader in the test fails to be forwarded to the appropriate region.	2021-06-10 21:27:55 -04:00
Mahmood Ali	071c556b3d	tests: deflake CSI forwarding tests This updates `client.Ready()` so it returns once the client node has been registered at the servers. Previously, it returned when the fingerprinters' first batch completed, without ensuring that the node is stored in the Raft data. The tests may then fail later with unknown node errors. `client.Ready()` seems to be called only in CSI and some client stats tests now. This class of bug, assuming the client is registered without checking, is a source of flakiness elsewhere. Other tests use other mechanisms for checking node readiness, though not consistently.	2021-06-10 21:26:34 -04:00
Isabel Suchanek	785eb40985	Merge pull request #10740 from hashicorp/docs-deploy-monitor docs: add deployment monitor to docs, changelog	2021-06-10 13:53:03 -07:00
Isabel Suchanek	c6c52bc53e	docs: add deployment monitor to docs, changelog Updates the deployment status and job run docs	2021-06-10 10:51:33 -07:00
James Rasell	25883eca43	core: remove unused types pkg and PeriodicCallback type.	2021-06-10 15:57:13 +02:00
Mahmood Ali	b372a1d2b4	update release to 1.1.1 (#10735)	2021-06-10 08:57:30 -04:00
Mahmood Ali	448282ff84	prepare for 1.1.2 dev cycle	2021-06-10 08:04:25 -04:00
Nomad Release Bot	4fe52bc753	remove generated files	2021-06-10 08:04:25 -04:00
Nomad Release bot	7cc7389afd	Generate files for 1.1.1 release	2021-06-10 08:04:25 -04:00
Mahmood Ali	fd15ac1821	prepare changelog for 1.1.1/1.0.7 release	2021-06-10 08:04:25 -04:00
Shishir Mahajan	f50f10504f	Update containerd task driver options. - hostname - auth Signed-off-by: Shishir Mahajan <smahajan@roblox.com>	2021-06-10 08:03:49 -04:00
Mahmood Ali	aa77c2731b	tests: use standard library testing.TB Glint pulled in an updated version of mitchellh/go-testing-interface which broke some existing tests because the update added a Parallel() method to testing.T. This switches to the standard library testing.TB which doesn't have a Parallel() method.	2021-06-09 16:18:45 -07:00
Isabel Suchanek	dfaef2468c	cli: add monitor flag to deployment status Adding '-verbose' will print out the allocation information for the deployment. This also changes the job run command so that it now blocks until deployment is complete and adds timestamps to the output so that it's more in line with the output of node drain. This uses glint to print in place when running in a tty. Because glint doesn't yet support cmd/powershell, Windows workflows use a different library to print in place, which results in slightly different formatting: 1) different margins, and 2) no spinner indicating deployment in progress.	2021-06-09 16:18:45 -07:00
Mahmood Ali	baacc3c19b	golang: update to 1.16.5 (#10733) 1.16.5 contains some security fixes for zip that are notable to get: https://github.com/golang/go/issues?q=milestone%3AGo1.16.5+label%3ACherryPickApproved	2021-06-09 11:51:41 -04:00
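
Commit 8f493cfa89 in the log above describes matching Java version strings whose shape varies by vendor and release. The idea can be sketched in Go as follows. This is only an illustration: the pattern and the names `javaVersionRe` and `parseJavaVersion` are hypothetical and are not taken from Nomad's fingerprinter, whose actual regular expression is not shown in this log.

```go
package main

import (
	"fmt"
	"regexp"
)

// javaVersionRe is a hypothetical pattern (an assumption for this sketch,
// not Nomad's real regex). It accepts an optional "version" keyword and
// optional double quotes around the version number.
var javaVersionRe = regexp.MustCompile(
	`^(?:java|openjdk)(?: version)? "?(\d+(?:\.\d+)*(?:_\d+)?)"?`)

// parseJavaVersion extracts the version number from the first line of
// `java -version` output. It reports false when the line does not match.
func parseJavaVersion(line string) (string, bool) {
	m := javaVersionRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Sample first lines copied from the commit message.
	samples := []string{
		`java version "1.7.0_80"`,
		`openjdk version "11.0.1" 2018-10-16`,
		`java version "1.6.0_36"`,
		`openjdk version "1.8.0_192"`,
		`openjdk 11.0.11 2021-04-20 LTS`,
	}
	for _, s := range samples {
		v, ok := parseJavaVersion(s)
		fmt.Printf("%q -> %q (matched: %v)\n", s, v, ok)
	}
}
```

The last sample is the unquoted form, without the `version` keyword, that the commit message ties to #6081. Making both the keyword and the quotes optional is one way to cover it alongside the older formats.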


21777 commits