395 commits

Author SHA1 Message Date
Daniel Rossbach 8c52c03c8c
qemu driver: Add option to configure drive_interface (#11864) 2022-06-10 10:03:51 -04:00
Luiz Aoqui e8b788b372
changelog: add entry for #12961 (#13318) 2022-06-10 09:04:00 -04:00
Tim Gross 9d5523a72d
CSI: skip node unpublish on GC'd or down nodes (#13301)
If the node has been GC'd or is down, we can't send it a node
unpublish. The CSI spec requires that we don't send the controller
unpublish before the node unpublish, but in the case where a node is
gone we can't know the final fate of the node unpublish step.

The `csi_hook` on the client will unpublish if the allocation has
stopped, and if the host is terminated there's no mount for the volume
anyway. So we'll now assume that the node has unpublished at its
end. If it hasn't, any controller unpublish will potentially hang or
error and need to be retried.
2022-06-09 11:33:22 -04:00
phreakocious 94a78597d2
Add guest_agent config option for QEMU driver (#12800)
Add boolean 'guest_agent' config option for QEMU driver, which will
create the socket file for the QEMU Guest Agent in the task dir when
enabled.
2022-06-09 09:21:38 -04:00
Derek Strickland 13ea5ae87a
consul-template: Add fault tolerant defaults (#13041)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-06-08 14:08:25 -04:00
Luiz Aoqui 2e0bffba90
changelog: add entry for #12925 (#13250) 2022-06-08 10:14:33 -04:00
Tim Gross 8ff5ea1bee
CSI: no early return when feasibility check fails on eligible nodes (#13274)
As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.

We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.

Fix the bug by not returning early in the fast path and instead jumping
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.
2022-06-07 13:31:10 -04:00
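The early-return bug described in #13274 above can be illustrated with a minimal Go sketch. The `node` type, its fields, and both functions are hypothetical stand-ins written for this illustration; they are not Nomad's scheduler code and only show the difference between returning and continuing inside the loop.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a scheduler node (not Nomad's type).
type node struct {
	name      string
	classOK   bool // already marked eligible on the basis of its class
	available bool // result of the per-node "available" check (e.g. CSI)
}

// firstFeasibleBuggy mimics the bug: when an eligible node fails the
// "available" check, it returns and never looks at the remaining nodes.
func firstFeasibleBuggy(nodes []node) (string, bool) {
	for _, n := range nodes {
		if !n.classOK {
			continue
		}
		if !n.available {
			return "", false // bug: gives up on every remaining node
		}
		return n.name, true
	}
	return "", false
}

// firstFeasibleFixed skips the failing node and keeps iterating.
func firstFeasibleFixed(nodes []node) (string, bool) {
	for _, n := range nodes {
		if !n.classOK {
			continue
		}
		if !n.available {
			continue // fix: jump back to the top of the loop
		}
		return n.name, true
	}
	return "", false
}

func main() {
	nodes := []node{
		{name: "node-a", classOK: true, available: false},
		{name: "node-b", classOK: true, available: true},
	}
	_, ok := firstFeasibleBuggy(nodes)
	fmt.Println("buggy found a node:", ok)
	name, ok := firstFeasibleFixed(nodes)
	fmt.Println("fixed found a node:", ok, name)
}
```

With the buggy variant, one unavailable node hides every later candidate, which is how a whole task group could end up rejected.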
Derek Strickland 12f3ee46ea
alloc_runner: stop sidecar tasks last (#13055)
2022-06-07 11:35:19 -04:00
Tim Gross 81c70f4973
changelog entry for #12534 (#13260) 2022-06-06 16:19:17 -04:00
Conor Evans 86116a7607
add filebase64 function (#11791)
Signed-off-by: Conor Evans <coevans@tcd.ie>
2022-06-06 11:58:17 -04:00
Lance Haig 4bf27d743d
Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
Huan Wang 7d15157635
adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
Seth Hoenig 45e8748658
Merge pull request #13205 from hashicorp/b-batch-preempt2
core: reschedule evicted batch job when resources become available
2022-06-02 16:32:01 -05:00
Shantanu Gadgil 6cb8c95534
fingerprint kernel architecture name (#13182) 2022-06-02 15:51:00 -04:00
Seth Hoenig 0692190e12 core: reschedule evicted batch job when resources become available
This PR fixes a bug where an evicted batch job would not be rescheduled
once resources become available.

Closes #9890
2022-06-02 14:04:13 -05:00
Seth Hoenig 54efec5dfe docs: add docs and tests for tagged_addresses 2022-05-31 13:02:48 -05:00
Seth Hoenig 4631045d83 connect: enable setting connect upstream destination namespace 2022-05-26 09:39:36 -05:00
Seth Hoenig f7c0e078a9 build: update golang version to 1.18.2
This PR updates to Go 1.18.2. It also updates the versions of hclfmt
and go-hclogfmt, which include newer dependencies necessary for dealing
with go1.18.

The hcl v2 branch is now 'nomad-v2.9.1+tweaks2', to include a fix for
newer macOS versions: 8927e75e82
2022-05-25 10:04:04 -05:00
Luiz Aoqui 769ff1dcc3
Merge pull request #13109 from hashicorp/merge-release-1.3.1-branch
Merge release 1.3.1 branch
2022-05-25 10:45:09 -04:00
Seth Hoenig 20b6bf3c22
Merge pull request #13104 from hashicorp/b-blocked-eval-math
core: fix blocked eval math
2022-05-24 16:23:06 -05:00
Michael Schurter 2965dc6a1a
artifact: fix numerous go-getter security issues
Fix numerous go-getter security issues:

- Add timeouts to http, git, and hg operations to prevent DoS
- Add size limit to http to prevent resource exhaustion
- Disable following symlinks in both artifacts and `job run`
- Stop performing initial HEAD request to avoid file corruption on
  retries and DoS opportunities.

**Approach**

Since Nomad has no ability to differentiate a DoS-via-large-artifact vs
a legitimate workload, all of the new limits are configurable at the
client agent level.

The max size of HTTP downloads is also exposed as a node attribute so
that if some workloads have large artifacts they can specify a high
limit in their jobspecs.

In the future all of this plumbing could be extended to enable/disable
specific getters or artifact downloading entirely on a per-node basis.
2022-05-24 16:29:39 -04:00
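As a rough illustration of two of the mitigations listed in the go-getter commit above (a timeout and a size limit on HTTP downloads), here is a minimal Go sketch using only the standard library. It is not the go-getter or Nomad implementation; `fetchLimited`, `newTestServer`, `defaultTimeout`, and `errTooLarge` are names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// Hypothetical values and names, chosen only for this sketch.
const defaultTimeout = 5 * time.Second

var errTooLarge = errors.New("artifact exceeds size limit")

// fetchLimited downloads url with an overall timeout and a maximum body size.
func fetchLimited(url string, timeout time.Duration, maxBytes int64) ([]byte, error) {
	client := &http.Client{Timeout: timeout} // bounds the whole request
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	// Read one byte past the limit so an oversized body can be detected
	// without buffering the entire response.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBytes+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > maxBytes {
		return nil, errTooLarge
	}
	return body, nil
}

// newTestServer starts a local server that returns size bytes of payload.
func newTestServer(size int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, strings.Repeat("x", size))
	}))
}

func main() {
	srv := newTestServer(1024)
	defer srv.Close()

	_, err := fetchLimited(srv.URL, defaultTimeout, 100)
	fmt.Println("limit 100:", err)
	body, err := fetchLimited(srv.URL, defaultTimeout, 4096)
	fmt.Println("limit 4096:", len(body), err)
}
```

Making both the timeout and the size limit parameters mirrors the commit's point that such limits need to be configurable, since a large artifact may be a legitimate workload.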
Seth Hoenig 83bab8ed64
Merge pull request #13058 from hashicorp/b-cgroupsv1-docker-cgparent
drivers/docker: do not set cgroup parent in v1 mode
2022-05-24 14:07:40 -05:00
Seth Hoenig c6c3ae020d drivers/docker: do not set cgroup parent in v1 mode
This PR fixes a bug where the CgroupParent on the docker
HostConfig struct was accidentally being set when running in
cgroups v1 mode.
2022-05-24 11:22:50 -05:00
Seth Hoenig 27d0c0dc9f docs: add changelog 2022-05-24 09:13:15 -05:00
Will Jordan d515e5c3b0
Don't buffer json logs on agent startup (#13076)
There's no reason to buffer json logs on agent startup
since logs in this format already aren't reordered.
2022-05-19 15:40:30 -04:00
Seth Hoenig fc58f4972c cli: correctly use and validate job with vault token set
This PR fixes `job validate` to respect '-vault-token', '$VAULT_TOKEN',
'-vault-namespace' if set.
2022-05-19 12:13:34 -05:00
Tim Gross b72ff42ada
api: include Consul token in job revert API (#13065) 2022-05-19 11:30:29 -04:00
Seth Hoenig 29d3da6dfd cl: update changelog 2022-05-17 10:35:08 -05:00
Seth Hoenig 26b5c01431
Merge pull request #12817 from twunderlich-grapl/fix-network-interpolation
Fix network.dns interpolation
2022-05-17 09:31:32 -05:00
Seth Hoenig 08becb117c cl: add changelog note for network interpolation 2022-05-17 09:14:55 -05:00
Phil Renaud 45dc1cfd58
12986 UI fails to load job when there is an "@" in job name in nomad 130 (#13012)
* LastIndexOf and always append a namespace on job links

* Confirmed the volume equivalent and simplified idWIthNamespace logic

* Changelog added

* PR comments addressed

* Drop the redirect for the time being

* Tests updated to reflect namespace on links

* Task detail test default namespace link for test
2022-05-13 17:01:27 -04:00
Tim Gross faeb3fcd44
scheduler: volume updates should always be destructive (#13008) 2022-05-13 11:34:04 -04:00
James Rasell 636b647a30
agent: fix panic when logging about protocol version config use. (#12962)
The log line comes before the agent logger has been set up,
therefore we need to use the UI logging to avoid a panic.
2022-05-13 09:28:43 +02:00
Phil Renaud dd824ac3f8
Changelog for visual diff tests (#12909) 2022-05-06 11:29:10 -04:00
Phil Renaud 6a8f98723e
Chronological most-recent evals by default (#12847)
* Chronological most-recent evals by default

* Adding reverse: true to the list of expected queryparams in test

* changelog
2022-05-05 16:11:27 -04:00
Jai 316daf581e
fix broken link to task-group in Recent Allocation table in jobs.job.index (#12765)
* chore: run prettier on hbs files

* ui: ensure to pass a real job object to task-group link

* chore: add changelog entry

* chore: prettify template

* ui: template helper for formatting jobId in LinkTo component

* ui: handle async relationship

* ui: pass in job id to model arg instead of job model

* update test for serialized namespace

* ui: defend against null in tests

* ui: prettified template added whitespace

* ui: rollback ember-data to 3.24 because watcher returns undefined on abort

* ui: use format-job-helper instead of job model via alloc

* ui: fix whitespace in template caused by prettier using template helper

* ui: update test for new namespace

* ui: revert prettier change

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-04-28 14:02:15 -04:00
Dave May 97cf204c00
debug: add version constraint to avoid pprof panic (#12807) 2022-04-28 13:18:55 -04:00
Tim Gross c763c4cb96
remove pre-0.9 driver code and related E2E test (#12791)
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
2022-04-27 09:53:37 -04:00
Michael Schurter e2544dd089
client: fix waiting on preempted alloc (#12779)
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops *after* the scheduling decision was made to
preempt it, but *before* the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high-throughput,
batch-scheduling-heavy systems.

However the RPC error made no sense! `previous_alloc` in the logs was a
valid 36 character UUID!

**The fix**

The fix is:

```
-		prevAllocID:  c.Alloc.PreviousAllocation,
+		prevAllocID:  watchedAllocID,
```

The alloc watcher new func used for preemption improperly referenced
Alloc.PreviousAllocation instead of the passed in watchedAllocID. When
multiple allocs are preempted, a watcher is created for each with
watchedAllocID set properly by the caller. In this case
Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters`
error was coming from! Sadly we were properly referencing
watchedAllocID in the log, so it made the error make no sense!

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl)
and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl)
for ease of repro.

First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad),
then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad)
that evicts 2 low priority jobs. Everything worked as expected.

However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147),
it reproduces the bug because the watcher references PreviousAllocation
instead of watchedAllocID.
2022-04-26 13:14:43 -07:00
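The one-line fix in #12779 above can be sketched in isolation. The types and functions below are hypothetical stand-ins rather than the real `allocwatcher` package, and the UUIDs are made up; the sketch only shows why a constructor that reads the alloc's own `PreviousAllocation` field (empty for a preemptor) ends up failing a 36-character check even though the caller passed a valid ID.

```go
package main

import (
	"errors"
	"fmt"
)

// alloc and watcher are hypothetical stand-ins, not Nomad's real types.
type alloc struct {
	ID                 string
	PreviousAllocation string // empty when the alloc is a preemptor
}

type watcher struct{ prevAllocID string }

// newWatcherBuggy ignores the ID it was asked to watch.
func newWatcherBuggy(a alloc, watchedAllocID string) watcher {
	return watcher{prevAllocID: a.PreviousAllocation}
}

// newWatcherFixed uses the ID passed in by the caller.
func newWatcherFixed(a alloc, watchedAllocID string) watcher {
	return watcher{prevAllocID: watchedAllocID}
}

// lookup imitates a server-side check on the ID length.
func lookup(id string) error {
	if len(id) != 36 {
		return errors.New("UUID must be 36 characters")
	}
	return nil
}

func main() {
	// Made-up IDs for illustration only.
	preemptor := alloc{ID: "28000000-0000-0000-0000-000000000000"}
	preempted := []string{
		"8e000000-0000-0000-0000-000000000001",
		"8e000000-0000-0000-0000-000000000002",
	}
	// One watcher per preempted alloc, as described in the commit message.
	for _, id := range preempted {
		buggy := newWatcherBuggy(preemptor, id)
		fixed := newWatcherFixed(preemptor, id)
		fmt.Printf("watched=%s buggy=%v fixed=%v\n",
			id, lookup(buggy.prevAllocID), lookup(fixed.prevAllocID))
	}
}
```

Logging `watchedAllocID` while querying with the empty field is what made the original error message look inconsistent.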
Michael Schurter 6449ba8d41
api: add ParseHCLOpts helper method (#12777)
The existing ParseHCL func didn't allow setting HCLv1=true.
2022-04-25 11:51:52 -07:00
Tim Gross b2e4841747
CSI: plugin config updates should always be destructive (#12774) 2022-04-25 12:59:25 -04:00
Tim Gross 766025cde7
CSI: plugin supervisor prestart should not mark itself done (#12752)
The task runner hook `Prestart` response object includes a `Done`
field that's intended to tell the client not to run the hook
again. The plugin supervisor creates mount points for the task during
prestart and saves these mounts in the hook resources. But if a client
restarts the hook resources will not be populated. If the plugin task
restarts at any time after the client restarts, it will fail to have
the correct mounts and crash loop until restart attempts run out.

Fix this by not returning `Done` in the response, just as we do for
the `volume_mount_hook`.
2022-04-22 13:07:47 -04:00
James Rasell 24b499791d
deps: update consul-template to v0.29.0 (#12747)
* deps: update consul-template to v0.29.0

* changelog: add entry for #12747
2022-04-22 09:58:54 -07:00
Phil Renaud ab557b15e0
Adding changelog note (#12753) 2022-04-22 12:38:49 -04:00
Luiz Aoqui a8cc633156
vault: revert support for entity aliases (#12723)
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading as another Vault
entity.
2022-04-22 10:46:34 -04:00
Seth Hoenig 3fcac242c6 services: enable setting arbitrary address value in service registrations
This PR introduces the `address` field in the `service` block so that Nomad
or Consul services can be registered with a custom `.Address.` to advertise.

The address can be an IP address or domain name. If the `address` field is
set, the `service.address_mode` must be set to `auto`.
2022-04-22 09:14:29 -05:00
Michael Schurter 5db3a671db
cli: add -json flag to support job commands (#12591)
* cli: add -json flag to support job commands

While the CLI has always supported running JSON jobs, its support has
been via HCLv2's JSON parsing. I have no idea what format it expects the
job to be in, but it's absolutely not in the same format as the API
expects.

So I ignored that and added a new -json flag to explicitly support *API*
style JSON jobspecs.

The jobspecs can even have the wrapping {"Job": {...}} envelope or not!

* docs: fix example for `nomad job validate`

We haven't been able to validate inside driver config stanzas ever since
the move to task driver plugins. 😭
2022-04-21 13:20:36 -07:00
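The note in #12591 above that API-style JSON jobspecs may or may not carry the wrapping {"Job": {...}} envelope can be handled in several ways. The following Go sketch shows one possible approach and is not taken from the Nomad CLI; the `job` struct is a hypothetical, heavily trimmed stand-in and `decodeJob` is a name invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// job is a hypothetical, heavily trimmed stand-in for an API jobspec.
type job struct {
	ID   string
	Name string
}

// decodeJob accepts either {"Job": {...}} or a bare {...} object.
func decodeJob(data []byte) (*job, error) {
	// First try the enveloped form.
	var envelope struct {
		Job *job
	}
	if err := json.Unmarshal(data, &envelope); err != nil {
		return nil, err
	}
	if envelope.Job != nil {
		return envelope.Job, nil
	}
	// No "Job" key present: treat the whole document as the job.
	var bare job
	if err := json.Unmarshal(data, &bare); err != nil {
		return nil, err
	}
	return &bare, nil
}

func main() {
	inputs := []string{
		`{"Job": {"ID": "example", "Name": "example"}}`,
		`{"ID": "example", "Name": "example"}`,
	}
	for _, in := range inputs {
		j, err := decodeJob([]byte(in))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ID=%s Name=%s\n", j.ID, j.Name)
	}
}
```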
Phil Renaud a5bef3ce72
[ui, bugfix] Link fix for volumes where per_alloc=true (#12713)
* Allocation page linkfix

* fix added to task page and computed prop moved to allocation model

* Fallback query added to task group when specific volume isn't knowable

* Delog

* link text reflects alloc suffix

* Helper instead of in-template conditionals

* formatVolumeName unit test

* Removing unused helper import
2022-04-21 13:57:18 -04:00
James Rasell 716b8e658b
api: Add support for filtering and pagination to the node list endpoint (#12727) 2022-04-21 17:04:33 +02:00
Gowtham 1ff8b5f759
Add Concurrent Download Support for artifacts (#11531)
* add concurrent download support - resolves #11244

* format imports

* mark `wg.Done()` via `defer`

* added tests for successful and failure cases and resolved some goleak

* docs: add changelog for #11531

* test typo fixes and improvements

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-04-20 10:15:56 -07:00
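The pattern mentioned in #11531 above, marking `wg.Done()` via `defer` while downloading concurrently, looks roughly like the Go sketch below. It is a generic illustration, not the code from that PR; `fetchAll`, `errFetch`, and the source names are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errFetch is a hypothetical error used by the fake fetch function below.
var errFetch = errors.New("fetch failed")

// fetchAll runs fetch for every source concurrently and returns one error
// slot per source, in the same order as the input.
func fetchAll(sources []string, fetch func(string) error) []error {
	var wg sync.WaitGroup
	errs := make([]error, len(sources))
	for i, src := range sources {
		wg.Add(1)
		go func(i int, src string) {
			// Deferring Done guarantees the counter is decremented on
			// every return path of the goroutine.
			defer wg.Done()
			errs[i] = fetch(src) // each goroutine writes only its own slot
		}(i, src)
	}
	wg.Wait() // block until every download has finished
	return errs
}

func main() {
	sources := []string{"artifact-1", "bad", "artifact-3"}
	fetch := func(src string) error {
		if src == "bad" {
			return errFetch
		}
		return nil
	}
	for i, err := range fetchAll(sources, fetch) {
		fmt.Printf("%s: %v\n", sources[i], err)
	}
}
```

Writing each result into its own slice index avoids the need for a mutex around the error collection.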