Commit graph

574 commits

Author SHA1 Message Date
Seth Hoenig 51a2212d3d
client: sandbox go-getter subprocess with landlock (#15328)
* client: sandbox go-getter subprocess with landlock

This PR re-implements the getter package for artifact downloads as a subprocess.

Key changes include

On all platforms, run getter as a child process of the Nomad agent.
On Linux platforms running as root, run the child process as the nobody user.
On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock).
On all platforms, restrict environment variables of the child process to a static set.
notably TMP/TEMP now points within the allocation's task directory
kernel.landlock attribute is fingerprinted (version number or unavailable)
These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector.

Adds new e2e/artifact suite for ensuring artifact downloading works.

TODO: Windows git test (need to modify the image, etc... followup PR)

* landlock: fixup items from cr

* cr: fixup tests and go.mod file
2022-12-07 16:02:25 -06:00
Seth Hoenig dfc3b067ea
e2e: fix 1 of 4 client disconnect tests (#15357)
This PR modifies the disconnect helper job to run as root, which is necesary
for manipulating iptables as it does. Also re-organizes the final test logic
to wait for client re-connect before looking for the replacement (3rd) allocation
in case that client was needed to run the alloc (also giving the sheduler more
time to do its thing).

Skips the other 3 tests, which fail and I cannot yet figure out what is going on.
2022-11-22 08:51:53 -06:00
Seth Hoenig 2c7c6334c0
e2e: fixup oversubscription test case for jammy (#15347)
* e2e: fixup oversubscription test case for jammy

jammy uses cgroups v2, need to lookup the max memory limit from the
unified heirarchy format

* e2e: set constraint to require cgroups v2 on oversub docker test
2022-11-21 12:41:55 -06:00
Seth Hoenig eaf842b226
e2e: jammy image needs latest java lts (#15323) 2022-11-18 14:36:36 -06:00
Seth Hoenig 845ff10281
e2e: disable systemd stub dns in jammy image (#15286) 2022-11-17 09:50:44 -06:00
Seth Hoenig 45ff0765c7
e2e: swap bionic image for jammy (#15220) 2022-11-16 10:37:18 -06:00
Derek Strickland 80b6f27efd
api: remove mapstructure tags fromPort struct (#12916)
This PR solves a defect in the deserialization of api.Port structs when returning structs from theEventStream.

Previously, the api.Port struct's fields were decorated with both mapstructure and hcl tags to support the network.port stanza's use of the keyword static when posting a static port value. This works fine when posting a job and when retrieving any struct that has an embedded api.Port instance as long as the value is deserialized using JSON decoding. The EventStream, however, uses mapstructure to decode event payloads in the api package. mapstructure expects an underlying field named static which does not exist. The result was that the Port.Value field would always be set to 0.

Upon further inspection, a few things became apparent.

The struct already has hcl tags that support the indirection during job submission.
Serialization/deserialization with both the json and hcl packages produce the desired result.
The use of of the mapstructure tags provided no value as the Port struct contains only fields with primitive types.
This PR:

Removes the mapstructure tags from the api.Port structs
Updates the job parsing logic to use hcl instead of mapstructure when decoding Port instances.
Closes #11044

Co-authored-by: DerekStrickland <dstrickland@hashicorp.com>
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2022-11-08 11:26:28 +01:00
Seth Hoenig d7aa37a5c9
e2e: explicitly wait on task status in chroot download exec test (#15145)
Also add some debug log lines for this test, because it doesn't make sense
for the allocation to be complete yet a task in the allocation to be not
started yet, which is what the test failures are implying.
2022-11-04 09:50:11 -05:00
Michael Schurter 9cac60dbed
test: use port collision instead of cpu exhaustion (#14994)
Originally this test relied on Job 1 blocking Job 2 until Job 1 had a
terminal *ClientStatus.* Job 2 ensured it would get blocked using 2
mechanisms:

1. A constraint requiring it is placed on the same node as Job 1.
2. Job 2 would require all unreserved CPU on the node to ensure it would
   be blocked until Job 1's resources were free.

That 2nd assertion breaks if *any previous job is still running on the
target node!* That seems very likely to happen in the flaky world of our
e2e tests. In fact there may be some jobs we intentionally want running
throughout; in hindsight it was never safe to assume my test would be
the only thing scheduled when it ran.

*Ports to the rescue!* Reserving a static port means that both Job 2
will now block on Job 1 being terminal. It will only conflict with other
tests if those tests use that port *on every node.* I ensured no
existing tests were using the port I chose.

Other changes:
- Gave job a bit more breathing room resource-wise.
- Tightened timings a bit since previous failure ran into the `go test`
  time limit.
- Cleaned up the DumpEvals output. It's quite nice and handy now!
2022-10-21 07:53:26 -07:00
Seth Hoenig e66c9ede24
e2e: convert flaky exec download in chroot unit test into e2e test (#14949)
Similar to https://github.com/hashicorp/nomad/pull/14710, convert flaky
test into e2e test.
2022-10-19 08:22:32 -05:00
Michael Schurter 01d90d18f6
test: expand timing and debugging for overlap test (#14920)
attempt #9000
2022-10-18 13:02:18 -07:00
Michael Schurter 21eced0a4e
test: extend timing and output of overlap e2e test (#14894)
Keeps failing in the nightly e2e test with unhelpful output like:
```
Failed
=== RUN   TestOverlap
    overlap_test.go:92: Followup job overlap93ee1d2b blocked. Sleeping for the rest of overlap48c26c39's shutdown_delay (9.2/10s)
    overlap_test.go:105: 1500/2000 retries reached for github.com/hashicorp/nomad/e2e/overlap.TestOverlap (err=timed out before an allocation was found for overlap93ee1d2b)
    overlap_test.go:105: timeout: timed out before an allocation was found for overlap93ee1d2b
--- FAIL: TestOverlap (38.96s)
```

I have not been able to replicate it in my own e2e cluster, so I added
the EvalDump helper to add detailed eval information like:

```
=== RUN   TestOverlap
1/1 Job overlap7b0e90ec Eval c38c9919-a4f0-5baf-45f7-0702383c682a
  Type:         service
  TriggeredBy:  job-register
  Deployment:
  Status:       pending ()
  NextEval:
  PrevEval:
  BlockedEval:
   -- No placement failures --
  QueuedAllocs:
  SnapshotIdx:  0
  CreateIndex:  96
  ModifyIndex:  96

...
```

Hopefully helpful when debugging other tests as well!
2022-10-14 14:15:07 -07:00
Michael Schurter bdb639b3e2
test: simplify overlap job placement logic (#14811)
* test: simplify overlap job placement logic

Trying to fix #14806

Both the previous approach as well as this one worked on e2e clusters I
spun up.

* simplify code flow
2022-10-12 11:21:28 -07:00
Giovani Avelar a625de2062
Allow specification of a custom job name/prefix for parameterized jobs (#14631) 2022-10-06 16:21:40 -04:00
James Rasell 0187240e7c
e2e: fixes the ordering on greater than checks within spread test. (#14818) 2022-10-06 15:27:36 +02:00
James Rasell 67e8f85360
e2e: fix incorrect must function usage in namespace suite. (#14805) 2022-10-05 15:50:56 +02:00
Michael Schurter ed3218c3dd
Fixing flaky TestOverlap test (#14780)
* test: ensure feasible node selected in overlap test

* test: warn when getting close to retry limit
2022-10-03 14:35:02 -07:00
Seth Hoenig 7235d9988b
e2e: convert chroot env unit tests into e2e tests (#14710)
This PR translates two of our most flakey unit tests into
e2e tests where they are fit much more naturally.
2022-09-26 15:40:29 -05:00
Michael Schurter 6161b417f3
test: add e2e for non-overlapping placements (#14646)
* test: add e2e for non-overlapping placements

Followup to #10446

Fails (as expected) against 1.3.x at the wait for blocked eval (because
the allocs are allowed to overlap).

Passes against 1.4.0-beta.1 (as expected).

* Update e2e/overlap/overlap_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2022-09-22 13:06:17 -07:00
Seth Hoenig 2088ca3345
cleanup more helper updates (#14638)
* cleanup: refactor MapStringStringSliceValueSet to be cleaner

* cleanup: replace SliceStringToSet with actual set

* cleanup: replace SliceStringSubset with real set

* cleanup: replace SliceStringContains with slices.Contains

* cleanup: remove unused function SliceStringHasPrefix

* cleanup: fixup StringHasPrefixInSlice doc string

* cleanup: refactor SliceSetDisjoint to use real set

* cleanup: replace CompareSliceSetString with SliceSetEq

* cleanup: replace CompareMapStringString with maps.Equal

* cleanup: replace CopyMapStringString with CopyMap

* cleanup: replace CopyMapStringInterface with CopyMap

* cleanup: fixup more CopyMapStringString and CopyMapStringInt

* cleanup: replace CopySliceString with slices.Clone

* cleanup: remove unused CopySliceInt

* cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice

* cleanup: replace CopyMap with maps.Clone

* cleanup: run go mod tidy
2022-09-21 14:53:25 -05:00
James Rasell 3f78a51fa5
e2e: use unique names for Connect ACL Consul policy names. (#14604)
In the event a single test fails to clear up properly after
itself, all other tests will fail as they attempt to create ACL
policies with the same names. This change ensures they use unique
ACL names, so when a single test fails, it is easy to identify it
is a problem with the test rather than the suite.
2022-09-16 13:35:40 +02:00
James Rasell 90d0b9157f
e2e: rewrite spread suite to use new e2e style. (#14598)
The rewrite refactors the suite to use the new style along with
other recent testing improvements. In order to ensure the spread
tests do not impact each other, there is new cleanup functionality
to ensure both the job and allocations are removed from state
before the test exits completely.
2022-09-15 17:12:20 +02:00
James Rasell d65267c60c
e2e: do not assume clean cluster when checking return objects. (#14557) 2022-09-13 14:25:19 +02:00
James Rasell 1f877bac1c
acl: fix encoding expiration time in ACL token list API. (#14542) 2022-09-12 15:50:35 +02:00
James Rasell 6f790769bb
e2e: fixup service discovery and ACL expiration tests. (#14517)
The NSD checks tests were racey, whereby the check may not have
been triggered by the time it was queried. This change wraps the
check so it can account for this.

This removes the current ACL expiration GC section in order to get
the tests passing and allow more time to investigate the test. I
have full confidence the feature is working as expected and have
tested extensively locally.
2022-09-09 14:27:40 +02:00
James Rasell d14d6e051a
e2e: fixup token expiration test to account for longer forced GC. (#14491) 2022-09-08 14:43:04 +02:00
James Rasell e24de517fa
e2e: add test to exercise ACL tokens with role and policy links. (#14432) 2022-09-02 08:56:00 +02:00
James Rasell 5d0cc93939
e2e: add acl test for token expiration. (#14418)
In order to add an E2E test to cover token expiration, the server
config has been updated to include a low minimum allowed TTL
value. For ease of reading, the max value is also set.
2022-09-01 09:36:09 +02:00
James Rasell 5f3665230b
e2e: add ACL test suite with ACL Role test. (#14398)
This adds a new ACL test suite to the e2e framework which includes
an initial test for ACL roles. The ACL test includes a helper to
track and clean created Nomad resources which keeps the test
cluster clean no matter if the test fails early or not.
2022-08-31 10:11:28 +02:00
Seth Hoenig 38727b6ab9 e2e: add e2e tests for nomad service disco checks
This PR adds 2 e2e tests for ensuring nomad service discovery checks
get created and produce status results as expected.
2022-08-22 15:31:13 -05:00
Piotr Kazmierczak b63944b5c1
cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Seth Hoenig b3ea68948b build: run gofmt on all go source files
Go 1.19 will forecefully format all your doc strings. To get this
out of the way, here is one big commit with all the changes gofmt
wants to make.
2022-08-16 11:14:11 -05:00
Tim Gross 6c080e0b10
e2e: move namespaces test out of legacy framework (#13934)
This PR continues work we've started on other test suites to use the native
golang test runner instead of the custom framework.
2022-08-01 13:24:34 -04:00
Seth Hoenig 634d84edec e2e: add nsd simple load balancing test 2022-07-14 15:07:19 -05:00
James Rasell 17a467020c
e2e: add terraform init commands to readme doc. (#13655) 2022-07-08 16:52:35 +02:00
James Rasell 181b247384
core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new opertor scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Derek Strickland 34dea90d7a
docker: update images to reference hashicorpdev Docker organization (#12903)
docker: update images to reference hashicorpdev dockerhub organization
generate job_init.bindata_assetfs.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-06-08 15:06:00 -04:00
James Rasell c3c10d8c10
e2e: use longer wait in template update triggers to avoid flake. (#13271) 2022-06-07 14:49:03 +02:00
Tim Gross cc4a1f2ec4
e2e: upgrade playwright package and container image (#13080)
The nightly playwright tests are currently failing because of a
mismatch between the expected version of Chromium and what's in the
container image. Unfortunately the previous specific tag we were using
for the container image is no longer tagged on the registry. With some
testing, I was able to find an image tag that results in a good run.
2022-05-20 08:41:07 -04:00
Derek Strickland 90daed7c1d
e2e: Wait for deployment to finish before disconnect (#12795)
* Wait for deployment to finish
* Don't reschedule disconnect or restart-node jobs
2022-04-27 12:27:03 -04:00
Tim Gross c763c4cb96
remove pre-0.9 driver code and related E2E test (#12791)
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
2022-04-27 09:53:37 -04:00
Tim Gross cfd353207f
E2E: move volume mounts test to use golang's stdlib test runner (#12788)
Part of ongoing work to remove the old E2E framework code.
2022-04-26 14:28:20 -04:00
Tim Gross 83eb879d61
E2E: remove old CLI for driving provisioning (#12787)
We moved off the old provisioning process for nightly E2E to one driven
entirely by Terraform quite a while back now. We're in the slow
process of removing the framework code for this test-by-test, but this
chunk of code no longer has any callers.
2022-04-26 13:43:25 -04:00
Tim Gross f7d6841dd2
E2E: remove platform specific realpath code from UI run script (#12750)
We don't need the absolute path for any of the commands in this script
so long as we `cd` into the source directory path. Doing this removes
the need for weird platform-specific tricks we have to do with
realpath vs GNU realpath.
2022-04-22 10:10:18 -04:00
Tim Gross 7dd3910e51
E2E: fix debug logging on disconnected clients test (#12621) 2022-04-22 09:07:05 -04:00
Tim Gross d200a66509
E2E: make UIs runnable from any working directory (#12739)
The E2E test runner is running from the root of the Nomad
repository. Make this run independent of the working directory for
convenience of developers and the test runner.
2022-04-21 17:00:01 -04:00
Tim Gross dc013b5267
E2E: set longer timeout for CSI plugin alloc start (#12732)
The CSI plugin allocations take a while to be marked healthy,
sometimes causing E2E test flakes during the setup phase of the
tests. There's nothing CSI specific about marking plugin allocs
healthy, as the plugin supervisor hook does all the fingerprinting in
the postrun hook (the prestart hook just makes a couple of empty
directories). The timeouts we're seeing may be because of where we're
pulling the images from; most our jobs pull from a CDN-backed public
registry whereas these are pulling from ECR. Set a 1min timeout for
these to make sure we have enough time to pull the image and start the
task.
2022-04-21 11:11:43 -04:00
Tim Gross 2ad9f6bc5f
E2E: playwright configuration and smoke test (#12721)
Scripts for running playwright tests in a Docker container that has
chromium and webkit preinstalled. Includes a basic smoke test for
authentication so that we can be sure the test rig is working
end-to-end. Wiring this up in CI will be in an upcoming PR.
2022-04-21 09:13:10 -04:00
Tim Gross c4d92205b4
E2E: provide options for reverse proxy for web UI (#12671)
Our E2E test environment is deployed with mTLS, but it's impractical
for us to use mTLS in headless browsers for automated testing (or even
in manual testing). Provide certificates for proxying the web UI via
Nginx. This proxy uses client certs for proxying to the HTTP endpoint
and a self-signed cert for the browser-facing endpoint. We can accept
certificate errors in the automated tests we'll be adding in the next
step of this work.
2022-04-19 16:55:05 -04:00
Tim Gross 70c262eb95
E2E: terraform provisioner upgrades (#12652)
While working on infrastructure for testing the UI in E2E, we needed
to upgrade the certificate provider. Performing a provider upgrade via
the TF `init -upgrade` brought in updates for the file and AWS
providers as well. These updates include deprecating the use of
`sensitive_content` fields, removing CA algorithm parameters that can
be inferred from keys, and removing the requirement to manually
specify AWS assume role parameters in the provider config if they're
available in the calling environment's AWS config file (as they are
via doormat or our E2E environment).
2022-04-19 14:27:14 -04:00