open-nomad

Author	SHA1	Message	Date
Seth Hoenig	fa0dc05b7a	tests: wait on client in a couple of tests These tend to fail on GHA, where I believe the client is not starting up fast enough before making requests. So wait on the client agent first. ``` === RUN TestDebug_CapturedFiles operator_debug_test.go:422: serverName: TestDebug_CapturedFiles.global, clientID, 1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4 operator_debug_test.go:492: Error Trace: operator_debug_test.go:492 Error: Should be empty, but was No node(s) with prefix "1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4" found Failed to retrieve clients, 0 nodes found in list: 1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4 Test: TestDebug_CapturedFiles --- FAIL: TestDebug_CapturedFiles (0.08s) ```	2022-03-30 08:48:23 -05:00
Seth Hoenig	61bf8022df	Merge pull request #12405 from hashicorp/ci-format-release-metadata-file ci: hcl format release metadata file	2022-03-30 08:13:15 -05:00
Seth Hoenig	d672bc46fd	ci: add trailing newline to release metadata	2022-03-30 08:12:55 -05:00
Tim Gross	3030f954a2	E2E disconnected clients test refactor (#12402 ) * Wait longer for node to go down in disconnected clients test. The existing helper only waits 10s, but there's a jitter on heartbeats that we need to account for. Wait for 30s for node to go down to give us plenty of room * Port disconnected clients to stdlib-style test	2022-03-30 09:12:44 -04:00
Seth Hoenig	0f20bb0e8c	ci: hcl format release metadata file	2022-03-30 08:02:55 -05:00
Michele Degges	f474ed6f51	[RelAPI Onboarding] Add release API metadata file (#12353 )	2022-03-29 15:38:50 -07:00
Michael Schurter	cae69ba8ce	Merge pull request #12312 from hashicorp/f-writeToFile template: disallow `writeToFile` by default	2022-03-29 13:41:59 -07:00
Tim Gross	03c1904112	csi: allow `namespace` field to be passed in volume spec (#12400 ) Use the volume spec's `namespace` field to override the value of the `-namespace` and `NOMAD_NAMESPACE` field, just as we do with job spec.	2022-03-29 14:46:39 -04:00
Michael Schurter	33fe04ff6a	template: fix comments and docs Review notes from @lgfa29 Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-03-29 09:25:23 -07:00
Tim Gross	19703e3316	E2E: test exercising node drain behavior for CSI volumes (#12384 )	2022-03-29 11:19:23 -04:00
dependabot[bot]	2df08852bf	build(deps): bump github.com/mitchellh/hashstructure from 1.0.0 to 1.1.0 (#12399 ) Bumps [github.com/mitchellh/hashstructure](https://github.com/mitchellh/hashstructure) from 1.0.0 to 1.1.0. - [Release notes](https://github.com/mitchellh/hashstructure/releases) - [Commits](https://github.com/mitchellh/hashstructure/compare/v1.0.0...v1.1.0) --- updated-dependencies: - dependency-name: github.com/mitchellh/hashstructure dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-03-29 11:17:09 -04:00
Tim Gross	a6652bffad	CSI: reorder controller volume detachment (#12387 ) In #12112 and #12113 we solved for the problem of races in releasing volume claims, but there was a case that we missed. During a node drain with a controller attach/detach, we can hit a race where we call controller publish before the unpublish has completed. This is discouraged in the spec but plugins are supposed to handle it safely. But if the storage provider's API is slow enough and the plugin doesn't handle the case safely, the volume can get "locked" into a state where the provider's API won't detach it cleanly. Check the claim before making any external controller publish RPC calls so that Nomad is responsible for the canonical information about whether a volume is currently claimed. This has a couple side-effects that also had to get fixed here: * Changing the order means that the volume will have a past claim without a valid external node ID because it came from the client, and this uncovered a separate bug where we didn't assert the external node ID was valid before returning it. Fallthrough to getting the ID from the plugins in the state store in this case. We avoided this originally because of concerns around plugins getting lost during node drain but now that we've fixed that we may want to revisit it in future work. * We should make sure we're handling `FailedPrecondition` cases from the controller plugin the same way we handle other retryable cases. * Several tests had to be updated because they were assuming we fail in a particular order that we're no longer doing.	2022-03-29 09:44:00 -04:00
Michael Schurter	7a28fcb8af	template: disallow `writeToFile` by default Resolves #12095 by WONTFIXing it. This approach disables `writeToFile` as it allows arbitrary host filesystem writes and is only a small quality of life improvement over multiple `template` stanzas. This approach has the significant downside of leaving people who have altered their `template.function_denylist` still vulnerable! I added an upgrade note, but we should have implemented the denylist as a `map[string]bool` so that new funcs could be denied without overriding custom configurations. This PR also includes a bug fix that broke enabling all consul-template funcs. We repeatedly failed to differentiate between a nil (unset) denylist and an empty (allow all) one.	2022-03-28 17:05:42 -07:00
Ryo Nakao	e11894a0cb	Ensure to close StreamFrame channel (#12248 )	2022-03-28 10:28:23 -04:00
Tim Gross	bc455fc69c	docs: changelog entry (#12393 )	2022-03-28 09:44:58 -04:00
Shishir	afcce3eea5	Display OS name in nomad node status command. (#12388 ) Signed-off-by: Shishir Mahajan <smahajan@roblox.com>	2022-03-28 09:28:14 -04:00
Seth Hoenig	e3c8a86e2e	Merge pull request #12381 from hashicorp/ci-gha-off ci: set test log level off in gha	2022-03-25 15:13:42 -05:00
Tim Gross	5c7f2bad0b	E2E: namespace HCP vault and consul policies to avoid collisions (#12386 ) Concurrent E2E runs can collide when provisioning policies on HCP Consul and HCP Vault. Namespace these by the test run name, as we do for most everything else.	2022-03-25 16:05:59 -04:00
Tim Gross	3c15236fd5	E2E: move example test to use golang's stdlib test runner (#12383 ) Our E2E "framework" has a bunch of features around test discovery and standing up infra that were never completed or fully used, and we ended up building out a large test suite that ignored all that in lieu of Terraform-provided infrastructure for the last couple years. This changeset is a proposal (and demonstration) for gradually migrating our E2E tests off the framework code so that developers can write fairly ordinary golang stdlib testing tests.	2022-03-25 14:44:16 -04:00
Seth Hoenig	e256afdfee	ci: set test log level off in gha	2022-03-25 13:43:33 -05:00
Seth Hoenig	4b895a436a	ci: set count to bypass caching	2022-03-25 13:43:33 -05:00
James Rasell	67b467983e	Merge pull request #12368 from hashicorp/f-1.3-boogie-nights service discovery: add initial MVP implementation	2022-03-25 18:04:47 +01:00
Tim Gross	67b87e46f1	e2e: test for allocations replacement on disconnected clients (#12375 ) This test exercises the behavior of clients that become disconnected and have their allocations replaced. Future test cases will exercise the `max_client_disconnect` field on the job spec.	2022-03-25 12:26:43 -04:00
Luiz Aoqui	c387e2d97e	ci: fix semgrep rule for RPC authentication	2022-03-25 12:00:48 -04:00
Hunter Morris	dcaf99dcc1	client: Add AWS EC2 instance-life-cycle from metadata to client fingerprint (#12371 )	2022-03-25 11:50:52 -04:00
James Rasell	9449e1c3e2	Merge branch 'main' into f-1.3-boogie-nights	2022-03-25 16:40:32 +01:00
Luiz Aoqui	848a3b271f	docs: fix link and add note about Nomad v1.3.0 on raft v3 upgrade (#12378 )	2022-03-25 10:11:46 -04:00
Seth Hoenig	42ccdf6db3	Merge pull request #12380 from hashicorp/ci-gha-verbose ci: cleanup verbose mode and enable for gha	2022-03-25 08:01:57 -05:00
James Rasell	1604f46026	Merge pull request #12357 from hashicorp/f-update-cli-namespace-wildcard-support-wording cli: update namespace wildcard help to be non-specific.	2022-03-25 11:24:39 +01:00
dgotlieb	f53f61c6ce	Add grpc and http2 listeners to gateway docs (#12367 ) Starting at Nomad version 1.2.0, `grpc` and `http2` [protocols are supported](https://github.com/hashicorp/nomad/pull/11187)	2022-03-24 17:09:19 -04:00
Karthick Ramachandran	122115c0ba	make stop job message clearer (#12252 )	2022-03-24 16:38:43 -04:00
Seth Hoenig	e85fbaf0ac	ci: cleanup verbose mode and enable for gha test_checks.sh was removed in 2019 and now just breaks if VERBOSE is set when running tests via make targets in GHA, use verbose mode to display what tests are running	2022-03-24 15:15:05 -05:00
Luiz Aoqui	e7382a6b45	tests: fix rpc limit tests (#12364 )	2022-03-24 15:31:47 -04:00
Seth Hoenig	987dda3092	Merge pull request #12274 from hashicorp/f-cgroupsv2 client: enable cpuset support for cgroups.v2	2022-03-24 14:22:54 -05:00
Michael Schurter	654d458960	core: add deprecated mvn tag to serf (#12327 ) Revert a small part of #11600 after @lgfa29 discovered it would break compatibility with Nomad <= v1.2! Nomad <= v1.2 expects the `vsn` tag to exist in Serf. It has always been `1`. It has no functional purpose. However it causes a parsing error if it is not set: https://github.com/hashicorp/nomad/blob/v1.2.6/nomad/util.go#L103-L108 This means Nomad servers at version v1.2 or older will not allow servers without this tag to join. The `mvn` minor version tag is also checked, but soft fails. I'm not setting that because I want as much of this cruft gone as possible.	2022-03-24 14:44:21 -04:00
Luiz Aoqui	64b558c14c	core: store and check for Raft version changes (#12362 ) Downgrading the Raft version protocol is not a supported operation. Checking for a downgrade is hard since this information is not stored in any persistent place. When a server re-joins a cluster with a prior Raft version, the Serf tag is updated so Nomad can't tell that the version changed. Mixed version clusters must be supported to allow for zero-downtime rolling upgrades. During this it's expected that the cluster will have mixed Raft versions. Enforcing strong version consistency would disrupt this flow. The approach taken here is to store the Raft version on disk. When the server starts the `raft_protocol` value is written to the file `data_dir/raft/version`. If that file already exists, its content is checked against the current `raft_protocol` value to detect downgrades and prevent the server from starting. Any other types of errors are ignored to prevent disruptions that are outside the control of operators. The only option in cases of an invalid or corrupt file would be to delete it, making this check useless. So just overwrite its content with the new version and provide guidance on how to check that their cluster is in an expected state.	2022-03-24 14:42:00 -04:00
Seth Hoenig	113b7eb727	client: cgroups v2 code review followup	2022-03-24 13:40:42 -05:00
Tim Gross	ff1bed38cd	csi: add `-secret` and `-parameter` flag to `volume snapshot create` (#12360 ) Pass-through the `-secret` and `-parameter` flags to allow setting parameters for the snapshot and overriding the secrets we've stored on the CSI volume in the state store.	2022-03-24 10:29:50 -04:00
Seth Hoenig	65c950baf4	Merge pull request #12369 from hashicorp/b-peers-perms core: write peers.json file with correct permissions	2022-03-24 09:18:24 -05:00
Seth Hoenig	a6c905616d	core: write peers.json file with correct permissions	2022-03-24 08:26:31 -05:00
James Rasell	16b1f19ffe	api: move serviceregistration client to services to match CLI. The service registration client name was used to provide a distinction between the service block and the service client. This however creates new wording to understand and does not match the CLI, therefore this change fixes that so we have a Services client. Consul specific objects within the service file have been moved to the consul location to create a clearer separation.	2022-03-24 09:08:45 +01:00
James Rasell	96d8512c85	test: move remaining tests to use ci.Parallel.	2022-03-24 08:45:13 +01:00
dependabot[bot]	92021045b6	build(deps): bump github.com/stretchr/testify from 1.7.0 to 1.7.1 (#12306 )	2022-03-23 19:12:51 -04:00
Seth Hoenig	2e5c6de820	client: enable support for cgroups v2 This PR introduces support for using Nomad on systems with cgroups v2 [1] enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems for Nomad users. Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer, but not so for managing cpuset cgroups. Before, Nomad has been making use of a feature in v1 where a PID could be a member of more than one cgroup. In v2 this is no longer possible, and so the logic around computing cpuset values must be modified. When Nomad detects v2, it manages cpuset values in-process, rather than making use of cgroup hierarchy inheritance via shared/reserved parents. Nomad will only activate the v2 logic when it detects cgroups2 is mounted at /sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2 mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to use the v1 logic, and should operate as before. Systems that do not support cgroups v2 are also not affected. When v2 is activated, Nomad will create a parent called nomad.slice (unless otherwise configured in Client config), and create cgroups for tasks using naming convention <allocID>-<task>.scope. These follow the naming convention set by systemd and also used by Docker when cgroups v2 is detected. Client nodes now export a new fingerprint attribute, unique.cgroups.version which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by Nomad. The new cpuset management strategy fixes #11705, where docker tasks that spawned processes on startup would "leak". In cgroups v2, the PIDs are started in the cgroup they will always live in, and thus the cause of the leak is eliminated. [1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html Closes #11289 Fixes #11705 #11773 #11933	2022-03-23 11:35:27 -05:00
Tim Gross	5c91bc877c	csi: set gRPC authority header for unix domain socket (#12359 ) The go-grpc library used by most CSI plugins doesn't require the authority header to be set, which violates the HTTP2 spec but doesn't impact Nomad because both sides of the connection are using the same library. But plugins written in other languages (`democratic-csi` for example) may have more strictly conforming gRPC server libraries and we need to set the authority header manually.	2022-03-23 12:01:08 -04:00
James Rasell	cd42572f0d	Merge pull request #12330 from hashicorp/f-gh-263 cli: add service commands for list, info, and delete.	2022-03-23 15:49:29 +01:00
Tim Gross	1743648901	CSI: fix timestamp from volume snapshot responses (#12352 ) Listing snapshots was incorrectly returning nanoseconds instead of seconds, and formatting of timestamps for both list and create snapshot was treating the timestamp as though it were nanoseconds instead of seconds. This resulted in create timestamps always being displayed as zero values. Fix the unit conversion error in the command line and the incorrect extraction in the CSI plugin client code. Beef up the unit tests to make sure this code is actually exercised.	2022-03-23 10:39:28 -04:00
Tim Gross	b7075f04fd	CSI: enforce single access mode at validation time (#12337 ) A volume that has single-use access mode is feasibility checked during scheduling to ensure that only a single reader or writer claim exists. However, because feasibility checking is done one alloc at a time before the plan is written, a job that's misconfigured to have count > 1 that mounts one of these volumes will pass feasibility checking. Enforce the check at validation time instead to prevent us from even trying to evaluate a job that's misconfigured this way.	2022-03-23 09:21:26 -04:00
James Rasell	0a9f54c525	cli: update namespace wildcard help to be non-specific. A number of commands support namespace wildcard querying, so it should be up to the sub-command to detail support, rather than keeping this list up to date.	2022-03-23 12:56:48 +01:00
James Rasell	bb8514fc75	core: remove node service registrations when node is down. When a node fails its heartbeating a number of actions are taken to ensure state is cleaned. Service registrations are loosely tied to nodes, therefore we should remove these from state when a node is considered terminally down.	2022-03-23 09:42:46 +01:00
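
Note on 64b558c14c (core: store and check for Raft version changes): the on-disk check that commit message describes can be sketched roughly as follows. This is an illustrative sketch in Python, not Nomad's Go implementation. The function name `check_and_store_raft_version`, the `RaftDowngradeError` type, and the plain-decimal file format are all assumptions made for the example; only the `data_dir/raft/version` path and the overall behavior come from the commit message.

```python
import os
import tempfile


class RaftDowngradeError(Exception):
    """Hypothetical error: configured Raft protocol is older than the stored one."""


def check_and_store_raft_version(data_dir: str, raft_protocol: int) -> None:
    """Refuse a Raft protocol downgrade, otherwise record the current version.

    Assumption: the version file holds a single decimal integer.
    """
    version_path = os.path.join(data_dir, "raft", "version")

    stored = None
    try:
        with open(version_path, "r", encoding="utf-8") as f:
            stored = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt file: ignore it and overwrite below,
        # since the only remedy available to an operator would be deleting it.
        stored = None

    if stored is not None and raft_protocol < stored:
        raise RaftDowngradeError(
            f"raft_protocol {raft_protocol} is lower than stored version {stored}"
        )

    try:
        os.makedirs(os.path.dirname(version_path), exist_ok=True)
        with open(version_path, "w", encoding="utf-8") as f:
            f.write(str(raft_protocol))
    except OSError:
        # Write failures are also treated as non-fatal in this sketch.
        pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        check_and_store_raft_version(data_dir, 3)  # first start: file is created
        try:
            check_and_store_raft_version(data_dir, 2)  # simulated downgrade
        except RaftDowngradeError as err:
            print("refusing to start:", err)
```

The sketch keeps the downgrade comparison as the only fatal path, matching the commit's stated goal of not disrupting operators over errors they cannot control.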

22913 commits