Commit graph

20622 commits

Author SHA1 Message Date
Buck Doyle 528b13f69d
Fix audit workflow action versions (#9877)
This fixes the version reference error seen in this workflow failure:
https://github.com/hashicorp/nomad/actions/runs/504695096

I’ve also included an update to the sticky comment action version to address this warning in the above link:

marocchino/sticky-pull-request-comment@33a6cfb looks like the shortened version of a commit SHA. Referencing actions by the short SHA will be disabled soon. Please see https://docs.github.com/en/actions/learn-github-actions/security-hardening-for-github-actions#using-third-party-actions.

We were previously pinned to 33a6cfb because, after the maintainer merged my PR allowing the comment to be read from a file, there was no released version containing that change; it’s now included in v2.0.0.
2021-01-26 09:06:22 -06:00
Mahmood Ali 1ac8b32e08 e2e: Disable Connect tests
The connect tests are very disruptive: they restart the consul/nomad agents with
new tokens. The test also seems particularly flaky, failing 32 times out of 73 in
my sample.

The tests are especially problematic because their disruption affects other
tests. On failure, the nomad or consul agent on the client can get into a
wedged state, so health/deployment info in subsequent tests may be wrong. In
some cases, the node will be marked as failed, and subsequent tests may then
fail when the node is deemed lost and the test allocations get migrated unexpectedly.
2021-01-26 10:01:14 -05:00
Mahmood Ali 36ce1e73eb e2e: deflake nodedrain test
The nodedrain deadline test asserts that all allocations are migrated by the
deadline. However, when the deadline is short (e.g. 10s), the test may fail
because of scheduler/client-propagation delays.

In one failing test, it took ~15s from the RPC call to the moment the
scheduler issued the migration update, and then 3 seconds for the alloc to be
stopped.

Here, I increase the timeouts to avoid such false positives.
2021-01-26 10:01:14 -05:00
Mahmood Ali cf8f6f07d7 e2e: vault increase timeout
Increase the timeout for vaultsecrets. As the default interval is 0.1s, 10
retries means the test only retries for one second, a very short time for some
waiting scenarios in the test (e.g. starting allocs).
2021-01-26 10:01:14 -05:00
Mahmood Ali 94ad40907c e2e: prefer testutil.WaitForResultRetries
Prefer testutil.WaitForResultRetries, which emits more descriptive errors on
failure. `require.Eventually` fails with an opaque "Condition never satisfied"
error message.
2021-01-26 10:01:14 -05:00
Mahmood Ali f3f8f15b7b e2e: special case "Unexpected EOF" errors
This is an attempt at deflaking the e2e exec tests, and a way to improve
messages.

e2e tests occasionally fail with "unexpected EOF" even though the exec output
matches expectations. I suspect there is a race in EOF handling in the
server/HTTP code.

Here, we special-case this error and ensure we report all failures,
to help debug the case better.
2021-01-26 10:01:14 -05:00
Mahmood Ali 925d9ce952 e2e: tweak failure messages
Tweak the error messages for the flakiest tests, so that we get more output
when a test fails.
2021-01-26 09:16:48 -05:00
Mahmood Ali 6aa3dec6cc e2e: use testify requires instead of t.Fatal
testify's require assertions produce better error messages that are easier to
notice in a wall of build output.
2021-01-26 09:14:47 -05:00
Mahmood Ali 236b4055a7 e2e: deflake consul/CheckRestart test
Ensure we pass the alloc ID to status. Otherwise, the test may fail if
another spurious allocation from a different test is running.
2021-01-26 09:12:20 -05:00
Mahmood Ali 0aafd9af64 e2e: Fix build script and pass shellcheck 2021-01-26 09:11:37 -05:00
James Rasell 7f8ebb5d10
Merge pull request #9888 from hashicorp/f-docs-gh-9842
docs: clarify where variables can be placed with HCLv2.
2021-01-26 14:33:18 +01:00
James Rasell 9c0c75b226
docs: clarify where variables can be placed with HCLv2. 2021-01-26 12:29:58 +01:00
Michael Lange 9395f16da6 Test coverage for the topology info panel.
This fixes a couple of bugs:

1. Overreporting resources reserved due to counting terminal allocs
2. Overreporting unique client placements due to uniquing on object refs
   instead of on client ID.
2021-01-25 19:01:11 -08:00
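The two bugs fixed in the topology commit above follow two counting rules: skip terminal allocations, and de-duplicate clients by ID rather than by object reference. The actual fix lives in the web UI. The Go sketch below only illustrates the rules, using a hypothetical trimmed-down `alloc` type and a hypothetical `summarize` function.

```go
package main

import "fmt"

// alloc is a hypothetical, trimmed-down allocation record.
type alloc struct {
	NodeID   string
	MemoryMB int
	Terminal bool // e.g. complete, failed, or lost
}

// summarize totals reserved memory and counts distinct client placements.
// Terminal allocs no longer reserve resources, so they are skipped, and
// clients are keyed by NodeID: two different objects describing the same
// client must count once, which keying by object reference would not do.
func summarize(allocs []*alloc) (reservedMB int, uniqueClients int) {
	seen := make(map[string]struct{})
	for _, a := range allocs {
		if a.Terminal {
			continue
		}
		reservedMB += a.MemoryMB
		seen[a.NodeID] = struct{}{}
	}
	return reservedMB, len(seen)
}

func main() {
	allocs := []*alloc{
		{NodeID: "node-1", MemoryMB: 256},
		{NodeID: "node-1", MemoryMB: 256}, // distinct object, same client
		{NodeID: "node-2", MemoryMB: 512, Terminal: true},
	}
	mem, clients := summarize(allocs)
	fmt.Println(mem, clients) // 512 1
}
```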
Michael Lange 7d998745ed Clamp widths at zero to prevent negative width warnings
This would only ever realistically happen with fixture data, but still
good to not have these warnings.
2021-01-25 18:59:55 -08:00
Michael Lange 93195f8e12 Only count the scheduled allocs on the topo viz node stats bar 2021-01-25 11:29:01 -08:00
Mahmood Ali 4397eda209
Merge pull request #9798 from hashicorp/e2e-terraform-tweaks-20200113
This PR makes two ergonomic changes, meant to make e2e builds more reproducible and easier to change.

### AMI Management

First, we pin the server AMIs to the commits associated with the build. No more using the latest AMI a developer built in a test branch, or accidentally using a stale AMI because we forgot to build one! Packer tags the AMI images with the commit SHA used to generate them, and Terraform then looks up only the AMIs associated with that SHA. To minimize churn, we use the SHA associated with the latest change to the Packer configurations, rather than the SHA of the latest commit overall.

This has a few benefits: reproducibility, and avoiding both accidental AMI changes and contamination of changes across branches. The change is also a stepping stone toward an e2e pipeline that builds new AMIs automatically when the Packer files change.

The downside is that new AMIs will be generated even for irrelevant changes (e.g. spelling, comments), but I suspect that's OK. Also, an engineer will be forced to build the AMI whenever they change Packer files while iterating on e2e scripts; this hasn't been an issue for me yet, and I'm open to iterating on that later if it proves to be one.

### Config Files and Packer

Second, this PR moves e2e config HCL management from Packer to Terraform. Currently, the config files live in `./terraform/config`, but they are baked into the servers by Packer, so changes to them are ignored. This behavior surprised me, as I spent a bit of time debugging why my config changes weren't applied. Having Terraform manage them eases engineers' iteration. It also makes Packer management more consistent (Packer only works in `e2e/terraform/packer`) and simplifies the logic for AMI change detection.

The config directory is very small (100KB), and having it as an upload step adds negligible time to `terraform apply`.
2021-01-25 13:20:28 -05:00
Seth Hoenig 7597f9afea
Merge pull request #9829 from hashicorp/f-terminating-gateway
consul/connect: Add support for Connect terminating gateways
2021-01-25 10:56:19 -06:00
Mahmood Ali 39da228964 update readme about profiles and packer build 2021-01-25 11:40:26 -05:00
Seth Hoenig 720780992c consul/connect: copy bind address map if empty
This parameter is now supposed to be non-nil even if
empty, and the Copy method should also maintain that
invariant.
2021-01-25 10:36:04 -06:00
Seth Hoenig 1ad219c441 consul/connect: remove debug line 2021-01-25 10:36:04 -06:00
Seth Hoenig 8b05efcf88 consul/connect: Add support for Connect terminating gateways
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.

https://www.consul.io/docs/connect/gateways/terminating-gateway

These gateways are declared as part of a task group level service
definition within the connect stanza.

service {
  connect {
    gateway {
      proxy {
        // envoy proxy configuration
      }
      terminating {
        // terminating-gateway configuration entry
      }
    }
  }
}

Currently Envoy is the only supported gateway implementation in
Consul. The gateway task can be customized by configuring the
connect.sidecar_task block.

When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.

Gateways require Consul 1.8.0+, checked by a node constraint.

Closes #9445
2021-01-25 10:36:04 -06:00
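The terminating-gateway commit above says gateways require Consul 1.8.0+ and that a node constraint enforces this. The Go sketch below only illustrates the kind of version comparison such a constraint performs. It does not show how Nomad evaluates constraints. `atLeast` is a hypothetical helper that deliberately simplifies semver.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether a "major.minor.patch" version is at least min.
// Simplification: any "-pre" or "+build" suffix is stripped, so
// "1.8.0-beta1" is treated as 1.8.0, which is looser than strict semver.
func atLeast(version string, min [3]int) (bool, error) {
	if i := strings.IndexAny(version, "-+"); i >= 0 {
		version = version[:i]
	}
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return false, fmt.Errorf("malformed version %q", version)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return false, fmt.Errorf("malformed version %q: %w", version, err)
		}
		if n != min[i] {
			// The first differing component decides the ordering.
			return n > min[i], nil
		}
	}
	return true, nil // all components equal
}

func main() {
	for _, v := range []string{"1.7.4", "1.8.0", "1.9.1+ent"} {
		ok, err := atLeast(v, [3]int{1, 8, 0})
		fmt.Println(v, ok, err)
	}
}
```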
Drew Bailey 007158ee75
ignore setting job summary when oldstatus == newstatus (#9884) 2021-01-25 10:34:27 -05:00
Steven Collins e9f91c1d56
Adds community USB plugin to documentation site 2021-01-25 10:15:36 -05:00
zzhai 899334f2f0 Update syntax.mdx
"one label"
should be the singular form.
2021-01-25 08:42:59 -05:00
Jeff Escalante 5859ecd806
update dependencies, remove unused dependency (#9878) 2021-01-22 21:24:26 -05:00
Michael Lange 5243428343
Merge pull request #9876 from hashicorp/b-ui/default-namespace-casing
UI: Use the same prefix pattern for both the region switcher and the namespace switcher
2021-01-22 13:43:54 -08:00
Michael Lange 886b1b4384 Clip long namespace names but make sure to keep the full name in the title attribute 2021-01-22 13:18:15 -08:00
Michael Lange 875de74503 Use the same prefix pattern from the region switcher for the namespace switcher 2021-01-22 13:18:15 -08:00
Tim Gross 45a45ebb3f changelog entry 2021-01-22 13:41:28 -05:00
Tim Gross 987cdb3a69 prefer TrimPrefix to checking HasPrefix first 2021-01-22 13:41:28 -05:00
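The TrimPrefix commit above relies on the fact that `strings.TrimPrefix` already returns its input unchanged when the prefix is absent. A preceding `strings.HasPrefix` check is therefore redundant. The sketch compares the two forms. The prefix and image strings are illustrative only and are not taken from that change.

```go
package main

import (
	"fmt"
	"strings"
)

// stripPrefixVerbose is the pattern being replaced: check, then slice.
func stripPrefixVerbose(s, prefix string) string {
	if strings.HasPrefix(s, prefix) {
		return s[len(prefix):]
	}
	return s
}

// stripPrefix is the equivalent one-liner; TrimPrefix is a no-op when s
// does not start with prefix.
func stripPrefix(s, prefix string) string {
	return strings.TrimPrefix(s, prefix)
}

func main() {
	fmt.Println(stripPrefix("https://registry.example.com/pause", "https://"))
	fmt.Println(stripPrefix("registry.example.com/pause", "https://"))
}
```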
Huan Wang ba8b2297b1 fix inconsistent handling between the infra image and normal task images 2021-01-22 13:41:28 -05:00
Tim Gross 555d031283 docs: check_restart is now supported for group services 2021-01-22 10:55:40 -05:00
Tim Gross 0b49e3da12 e2e: added tests for check restart behavior 2021-01-22 10:55:40 -05:00
Tim Gross 64449cddc1 implement alloc runner task restart hook
Most allocation hooks don't need to know when a single task within the
allocation is restarted. The check watcher for group services triggers the
alloc runner to restart all tasks, but the alloc runner's `Restart` method
doesn't trigger any of the alloc hooks, including the group service hook. The
result is that after the first time a check triggers a restart, we'll never
restart the tasks of an allocation again.

This commit adds a `RunnerTaskRestartHook` interface so that alloc runner
hooks can act if a task within the alloc is restarted. The only implementation
is in the group service hook, which will force a re-registration of the
alloc's services and fix check restarts.
2021-01-22 10:55:40 -05:00
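The alloc runner commit above uses an optional-interface pattern: only hooks that care about task restarts implement the extra interface, and the runner discovers them with a type assertion. The Go sketch below shows the pattern. The method name `PreTaskRestart` and all type names are assumptions for illustration, not copied from Nomad's source.

```go
package main

import "fmt"

// runnerHook is a minimal, hypothetical base interface for alloc runner hooks.
type runnerHook interface {
	Name() string
}

// runnerTaskRestartHook mirrors the idea of RunnerTaskRestartHook: an
// optional interface implemented only by hooks that must act on restarts.
type runnerTaskRestartHook interface {
	runnerHook
	PreTaskRestart() error
}

// groupServiceHook stands in for the group service hook.
type groupServiceHook struct{ registrations int }

func (h *groupServiceHook) Name() string { return "group_services" }

// PreTaskRestart counts calls as a stand-in for forcing re-registration
// of the alloc's services (and thus re-arming check restarts).
func (h *groupServiceHook) PreTaskRestart() error {
	h.registrations++
	return nil
}

// otherHook ignores restarts, so it implements only runnerHook.
type otherHook struct{}

func (otherHook) Name() string { return "other" }

// restartTasks notifies every hook that opted in before restarting tasks.
func restartTasks(hooks []runnerHook) error {
	for _, h := range hooks {
		if rh, ok := h.(runnerTaskRestartHook); ok {
			if err := rh.PreTaskRestart(); err != nil {
				return fmt.Errorf("hook %s: %w", h.Name(), err)
			}
		}
	}
	// ... the tasks themselves would be restarted here ...
	return nil
}

func main() {
	svc := &groupServiceHook{}
	hooks := []runnerHook{svc, otherHook{}}
	for i := 0; i < 2; i++ {
		_ = restartTasks(hooks)
	}
	fmt.Println(svc.registrations) // 2
}
```

Because the hook runs on every restart, not only the first one, services are re-registered each time. Per the commit message, that is what keeps check restarts working after the first restart.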
Drew Bailey bae0c6cd20
changelog entry for 9768 (#9873) 2021-01-22 09:22:02 -05:00
Drew Bailey 630babb886
prevent double job status update (#9768)
* Prevent Job Statuses from being calculated twice

https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion with job (de-)registration. This change removes a now-obsolete
guard which checked whether the index was equal to job.CreateIndex, which
would empty the status. Now that eval insertion is atomic with job
registration, this check is no longer necessary to set the job statuses
correctly.

* test to ensure only single job event for job register

* periodic e2e

* separate job update summary step

* fix updatejobstability to use copy instead of modified reference of job

* update envoygatewaybindaddresses copy to prevent job diff on null vs empty

* set ConsulGatewayBindAddress to empty map instead of nil

fix nil assertions for empty map

rm unnecessary guard
2021-01-22 09:18:17 -05:00
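One bullet in the commit above changes the stability update to work on a copy rather than modify the stored job through a shared reference. The Go sketch shows the difference using a hypothetical `job` type and a hypothetical `updateJobStability` function. Nomad's real state-store code is more involved and is not reproduced here.

```go
package main

import "fmt"

// job is a hypothetical, trimmed-down job record as held in state.
type job struct {
	ID          string
	Version     uint64
	Stable      bool
	ModifyIndex uint64
}

// Copy returns a shallow copy, which suffices because this job type has no
// pointer, slice, or map fields.
func (j *job) Copy() *job {
	c := *j
	return &c
}

// updateJobStability returns an updated copy instead of mutating current,
// so any reader already holding the stored object never sees it change.
func updateJobStability(current *job, stable bool, index uint64) *job {
	updated := current.Copy()
	updated.Stable = stable
	updated.ModifyIndex = index
	return updated
}

func main() {
	stored := &job{ID: "example", Version: 3}
	updated := updateJobStability(stored, true, 42)
	fmt.Println(stored.Stable, updated.Stable) // false true
}
```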
Kris Hicks 8f9e47a8e7
Clean up Task Validation tests (#9833)
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
2021-01-21 11:53:02 -08:00
Kris Hicks 7694a66414
Don't prepend https to docker cred helper call (#9852)
Some credential helpers, like the ECR helper, will strip the protocol if
given. Others, like the linux "pass" helper, do not.
2021-01-21 11:46:59 -08:00
Charlie Voiselle 4f4d6e6c37
Enable network namespaces for QEMU driver (#9861)
* Enable network namespaces for QEMU driver
* Add CHANGELOG entry
2021-01-21 14:05:46 -05:00
Mahmood Ali 1c6f4549ff
Merge pull request #9868 from hashicorp/e2e-tests-20200121
Deflake some e2e tests
2021-01-21 12:02:52 -05:00
Mahmood Ali 9dcdafe4cf e2e: show command output on failure
When a command fails, it's nice to have the full output, as it contains
diagnostic information. The status code isn't sufficient for debugging.
2021-01-21 10:32:16 -05:00
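The "show command output on failure" commit above can be illustrated with a small Go helper. It attaches the combined stdout and stderr to the error instead of reporting only the exit status. `run` is a hypothetical helper, not the e2e framework's function. It uses only `os/exec` from the standard library.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes a command and, on failure, returns an error that includes
// the combined stdout/stderr, since the exit status alone rarely explains
// what went wrong.
func run(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return string(out), fmt.Errorf("command %q failed: %w\noutput:\n%s", name, err, out)
	}
	return string(out), nil
}

func main() {
	if _, err := run("sh", "-c", "echo 'diagnostic detail' >&2; exit 3"); err != nil {
		fmt.Println(err)
	}
}
```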
Mahmood Ali f03e67712a
docs: remove timestamp hcl2 function (#9867)
timestamp isn't actually implemented
2021-01-21 10:29:50 -05:00
Mahmood Ali 923725bf3d e2e: deflake TestVolumeMounts
After submitting an update, the test ought to wait until the new
allocations are placed. Previously, we'd use the original to-be-stopped
allocations, and the test failed when attempting to exec.
2021-01-21 10:28:41 -05:00
Mahmood Ali 95b7fc80b8 e2e deflake namespaces: only check namespace jobs
Deflake the namespace e2e test by only asserting on jobs related to the
namespace tests. During our e2e runs, some leftover jobs (e.g.
prometheus) are still running while being shut down and cause the test to
fail.
2021-01-21 10:26:24 -05:00
Mahmood Ali 2e8bcac261 e2e: deflake events
Handle streamCh channel being closed.
2021-01-21 10:25:42 -05:00
Buck Doyle 27f73f2b7b
Change to fork of audit to log flaky tests (#9518)
This will report the names of flaky tests instead of just counting them.
2021-01-21 08:25:16 -06:00
Dennis Schön 3eaf1432aa
validate connect block allowed only within group.service 2021-01-20 14:34:23 -05:00
Drew Bailey 3099eb0c73
fix changelog date (#9862) 2021-01-20 13:14:21 -05:00
Seth Hoenig 08e323b753
Merge pull request #9849 from hashicorp/b-cc-ig-id
consul/connect: Enable running multiple ingress gateways per Nomad agent
2021-01-20 10:08:14 -06:00
Seth Hoenig 53218716b3
docs: fix typo in changelog
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2021-01-20 09:50:59 -06:00