When shutting down an allocation that ends up needing to be
force-killed, we're getting a spurious "OOM Killed (137)" message on
the task termination event. We introduced this as part of cgroups v2
support because the Docker daemon isn't detecting the container status
correctly. Although 137 is the exit code we get for OOM-killed
processes, that's only because an OOM kill is delivered as a `SIGKILL`,
so any process killed with `SIGKILL` gets that exit code.
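As a rough illustration of why the exit code alone can't identify an OOM
kill, here is a minimal sketch of the usual "128 + signal number"
convention. The helper name `exitCodeForSignal` is hypothetical and is not
part of the driver code.

```go
package main

import (
	"fmt"
	"syscall"
)

// exitCodeForSignal is a hypothetical helper that returns the conventional
// exit code reported for a process terminated by sig: 128 plus the signal
// number.
func exitCodeForSignal(sig syscall.Signal) int {
	return 128 + int(sig)
}

func main() {
	// SIGKILL is signal 9, so this prints 137 no matter who sent the
	// signal: the kernel OOM killer or a forced shutdown.
	fmt.Println(exitCodeForSignal(syscall.SIGKILL))
}
```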
The CSI plugin allocations take a while to be marked healthy,
sometimes causing E2E test flakes during the setup phase of the
tests. There's nothing CSI specific about marking plugin allocs
healthy, as the plugin supervisor hook does all the fingerprinting in
the postrun hook (the prestart hook just makes a couple of empty
directories). The timeouts we're seeing may be because of where we're
pulling the images from; most of our jobs pull from a CDN-backed public
registry whereas these are pulling from ECR. Set a 1min timeout for
these to make sure we have enough time to pull the image and start the
task.
Scripts for running playwright tests in a Docker container that has
chromium and webkit preinstalled. Includes a basic smoke test for
authentication so that we can be sure the test rig is working
end-to-end. Wiring this up in CI will be in an upcoming PR.
* add concurrent download support - resolves #11244
* format imports
* mark `wg.Done()` via `defer`
* added tests for successful and failure cases and resolved some goleak
* docs: add changelog for #11531
* test typo fixes and improvements
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
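The `defer wg.Done()` pattern mentioned above can be sketched roughly as
follows. This is not the actual download code; `fetchAll` and `failOn` are
hypothetical names, and the fetch function is a stand-in for a real
download.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetchAll runs fetch for every source concurrently and returns one error
// slot per source, in the same order as sources.
func fetchAll(sources []string, fetch func(string) error) []error {
	errs := make([]error, len(sources))
	var wg sync.WaitGroup
	for i, src := range sources {
		wg.Add(1)
		go func(i int, src string) {
			// Deferring Done means the WaitGroup is released even if
			// fetch panics or gains early returns later on.
			defer wg.Done()
			// Each goroutine writes only to its own index, so no lock
			// is needed around errs.
			errs[i] = fetch(src)
		}(i, src)
	}
	wg.Wait()
	return errs
}

// failOn returns a stand-in fetch function that fails for one source.
func failOn(bad string) func(string) error {
	return func(src string) error {
		if src == bad {
			return errors.New("download failed: " + src)
		}
		return nil
	}
}

func main() {
	errs := fetchAll([]string{"a", "bad", "c"}, failOn("bad"))
	for i, err := range errs {
		fmt.Println(i, err)
	}
}
```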
Our E2E test environment is deployed with mTLS, but it's impractical
for us to use mTLS in headless browsers for automated testing (or even
in manual testing). Provide certificates for proxying the web UI via
Nginx. This proxy uses client certs for proxying to the HTTP endpoint
and a self-signed cert for the browser-facing endpoint. We can accept
certificate errors in the automated tests we'll be adding in the next
step of this work.
While working on infrastructure for testing the UI in E2E, we needed
to upgrade the certificate provider. Performing a provider upgrade via
the TF `init -upgrade` brought in updates for the file and AWS
providers as well. These updates include deprecating the use of
`sensitive_content` fields, removing CA algorithm parameters that can
be inferred from keys, and removing the requirement to manually
specify AWS assume role parameters in the provider config if they're
available in the calling environment's AWS config file (as they are
via doormat or our E2E environment).
This PR has 2 fixes for the flaky TestTaskRunner_TaskEnv_Chroot and
TestTaskRunner_Download_ChrootExec tests.
- Use TinyChroot to stop copying gigabytes of junk, which causes GHA
to fail to create the environment in time.
- Pre-create cgroups on V2 systems. Normally the cgroup directory is
managed by the cpuset manager, but that is not active in taskrunner tests,
so create it by hand in the test framework.
This test checks for behavior when asking for logs of a docker task
configured with a log driver that does not support streaming logs.
Previously this was using the 'gelf' log driver, but it seems that no
longer returns an error as expected. Instead we can just use the 'none'
log driver, which has the desired effect:
2022-04-19T10:23:19.129-0500 [ERROR] docklog/docker_logger.go:133: log streaming ended with terminal error: error="API error (501): configured logging driver does not support reading"
Build binaries for every code change, not just backend code
changes. This means that we'll have up-to-date compiled assets for
every commit available in CircleCI artifacts.
This PR updates the changelog, adds notes to the 1.3 upgrade guide, and
updates the connect integration docs with documentation about the new
requirement on Consul ACL policies of Consul agent default anonymous ACL
tokens.
* Add os to NodeListStub struct.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add test: os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
The CSI HTTP API has to transform the CSI volume to redact secrets,
remove the claims fields, and to consolidate the allocation stubs into
a single slice of alloc stubs. This was done manually in #8590, but
that is a large amount of code that has proven very bug prone
(see #8659, #8666, #8699, #8735, and #12150) and requires updating
lots of code every time we add a field to volumes or plugins.
In #10202 we introduced encoding improvements for the `Node` struct
that allow a more minimal transformation. Apply this same approach to
serializing `structs.CSIVolume` to API responses.
Also, the original reasoning behind #8590 for plugins no longer holds
because the counts are now denormalized within the state store, so we
can simply remove this transformation entirely.
The API for `CSIVolume.List` sorts by created index and not by ID,
which breaks the logic for prefix matching in the `volume status`
output when the prefix is also an exact match. Ensure that we're
handling this case correctly.
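A minimal sketch of the intended matching rule, assuming the list can
arrive in any order: an exact match wins wherever it appears, and otherwise
the prefix must match exactly one ID. The function name `resolveID` and the
error wording are hypothetical, not the CLI's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveID picks the volume ID that prefix refers to. It does not assume
// ids is sorted by ID, so an exact match is honored at any position.
func resolveID(ids []string, prefix string) (string, error) {
	var matches []string
	for _, id := range ids {
		if id == prefix {
			return id, nil
		}
		if strings.HasPrefix(id, prefix) {
			matches = append(matches, id)
		}
	}
	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no volume matches prefix %q", prefix)
	case 1:
		return matches[0], nil
	default:
		return "", fmt.Errorf("prefix %q matches %d volumes", prefix, len(matches))
	}
}

func main() {
	// "vol-10" is listed before "vol-1", as could happen when the list is
	// ordered by created index rather than by ID.
	ids := []string{"vol-10", "vol-1"}
	fmt.Println(resolveID(ids, "vol-1"))
}
```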
The Nomad client's `csi_hook` interpolates the alloc suffix with the
volume request's name for CSI volumes with `per_alloc = true`, turning
`example` into `example[1]`. We need to apply this same behavior in the
`alloc status` output so that we show the correct volume.
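The naming rule can be sketched as below. This is only an illustration:
the helper `volumeName` is hypothetical, and how the alloc index is obtained
from the allocation is assumed rather than shown.

```go
package main

import "fmt"

// volumeName is a hypothetical helper that appends the alloc suffix to the
// volume request's name when perAlloc is set, and returns the name
// unchanged otherwise.
func volumeName(name string, perAlloc bool, allocIndex int) string {
	if perAlloc {
		return fmt.Sprintf("%s[%d]", name, allocIndex)
	}
	return name
}

func main() {
	fmt.Println(volumeName("example", true, 1))  // example[1]
	fmt.Println(volumeName("example", false, 1)) // example
}
```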
- Moved federation docs to the bottom since *everyone* is potentially
affected by the other sections on the page, but only users of
federation are affected by it.
- Added section on the plan for node rejected bug since it is fairly
easy to diagnose and removing affected nodes is a fairly reliable
workaround.
- Mention 5s cliff for wait_for_index.
- Remove the lie that we do not have job status metrics! How old was
that?!
- Reinforce the importance of monitoring basic system resources
This test has a failure that's happening only occasionally and not
very reproducibly. Print out the allocation status on test failure so
that we can do some post-mortem debugging of the test on nightly.
These tests have a data race where the test assertion is reading a
value that's being set in the `listenFunc` goroutines that are
subscribing to registry update events. Move the assertion into the
subscribing goroutine to remove the race. This bug was discovered
in #12098 but does not impact production Nomad code.
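The shape of the fix can be sketched as follows, under the assumption that
updates arrive on a channel. The name `assertInSubscriber` is hypothetical
and the real tests use the registry's own subscription API; the point is
only that the goroutine that receives the value also checks it, and hands
back just the verdict.

```go
package main

import "fmt"

// assertInSubscriber starts a subscriber goroutine that reads one update,
// checks it itself, and reports only the result over a channel. The caller
// never reads a variable the goroutine writes, so there is nothing to race
// on.
func assertInSubscriber(updates <-chan string, want string) error {
	result := make(chan error, 1)
	go func() {
		got := <-updates
		if got != want {
			result <- fmt.Errorf("got update %q, want %q", got, want)
			return
		}
		result <- nil
	}()
	return <-result
}

func main() {
	updates := make(chan string, 1)
	updates <- "registered"
	fmt.Println(assertInSubscriber(updates, "registered"))
}
```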
This PR expands on the work done in #12543 to:
- prefix the tag, so it is now "nomad.alloc_id" to be more consistent with Consul tags
- merge into pre-existing envoy_stats_tags fields
- update the upgrade guide docs
- update changelog