Commit graph

24622 commits

Author SHA1 Message Date
Tim Gross fe29cf8b7b
logs: fix logs.disabled on Windows (#17199)
On Windows the executor returns an error when trying to open the `NUL` device
when we pass it `os.DevNull` for the stdout/stderr paths. Instead of opening the
device, use the discard pipe so that we have platform-specific behavior from the
executor itself.

Fixes: #17148
2023-05-18 09:14:39 -04:00
James Rasell 96f7c84e4e
variable: fixup metadata copy comment and remove unrequired type. (#17234) 2023-05-18 13:49:41 +01:00
Phil Renaud 552dc06b1d
[ui] Latest Deployment component removed (#17192)
* Latest Deployment component removed

* Integration test selector update
2023-05-17 16:59:42 -04:00
Mike Nomitch 6df2160e69
docs: add documentation on ephemeral disk and logs (#15829) 2023-05-17 16:58:11 -04:00
Roman Zipp edf83f432a
docs: remove unneeded brackets from job specification template docs (#17219) 2023-05-17 16:45:00 -04:00
Phil Renaud 50a35143c9
[ui, deployments] Fix a bug where watchers on a parent (periodic) job would continue on a child route (#17214)
* Treated same-route as sub-route and didnt cancel watchers

* Changelog
2023-05-17 16:36:15 -04:00
Tim Gross c64efa0776
build: upgrade deprecated actions syntax (#17222)
Missed these in the previous pass.
2023-05-17 11:39:55 -04:00
hashicorp-tsccr[bot] aec3b16085
build: trusted workflow pinning (#16992)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-05-17 10:38:10 -04:00
Tim Gross 5fc63ace0b
scheduler: count implicit spread targets as a single target (#17195)
When calculating the score in the `SpreadIterator`, the score boost is
proportional to the difference between the current and desired count. But when
there are implicit spread targets, the current count is the sum of the possible
implicit targets, which results in incorrect scoring unless there's only one
implicit target.

This changeset updates the `propertySet` struct to accept a set of explicit
target values so it can detect when a property value falls into the implicit set
and should be combined with other implicit values.

Fixes: #11823
2023-05-17 10:25:00 -04:00
Tim Gross 710afecf61
build: update deprecated GitHub Actions (#17218)
Many of the GitHub Actions from the build pipeline are written in a truly
ancient version of NodeJS. Upgrade to more recent versions.

Remove RelEng from codeowners
2023-05-17 08:57:28 -04:00
Phil Renaud 817b2cab9b
[ui] Remove unnecessary subnav for parent jobs (#17190)
* Nest job subnav in a parent-check

* Move the evaluation tab test to withn an else of the children condition
2023-05-16 16:07:12 -04:00
Tim Gross 2426aae832
scheduler: prevent -Inf in spread scoring (#17198)
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
2023-05-16 16:01:32 -04:00
dependabot[bot] 7a92c7b5ac
build(deps-dev): bump prettier from 2.2.1 to 2.8.8 in /website (#16965)
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-16 12:12:53 -05:00
Seth Hoenig e04ff0d935
client: ignore restart issued to terminal allocations (#17175)
* client: ignore restart issued to terminal allocations

This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

* e2e: add e2e test for alloc restart zombies

* cl: tweak text

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-05-16 10:19:41 -05:00
Tim Gross 6814e8e6d9
drivers: make internal DisableLogCollection capability public (#17196)
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.

This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.

Fixes: #14636 #15686
2023-05-16 09:16:03 -04:00
Piotr Kazmierczak fe272c3686
refactor acl.UpsertTokens to avoid unnecessary RPC calls. (#17194)
New RPC endpoints introduced during OIDC and JWT auth perform unnecessary many RPC calls when they upsert generated ACL tokens, as pointed out by @tgross.

This PR moves the common logic from acl.UpsertTokens method into a helper method that contains common logic, and sidesteps authentication, metrics, etc.
2023-05-16 09:31:51 +02:00
Michael Schurter ab11a96181
remove unused helper/fields package (#17197)
Has been unused since we switched task drivers to plugins and used
hclschema for config in #4936 (v0.9.0-beta1)
2023-05-15 12:10:11 -07:00
dependabot[bot] d980e0a815
build(deps-dev): bump @hashicorp/platform-content-conformance (#17030)
Bumps @hashicorp/platform-content-conformance from 0.0.10 to 0.0.11.

---
updated-dependencies:
- dependency-name: "@hashicorp/platform-content-conformance"
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-15 11:28:03 -04:00
Luiz Aoqui 389212bfda
node pool: initial base work (#17163)
Implementation of the base work for the new node pools feature. It includes a new `NodePool` struct and its corresponding state store table.

Upon start the state store is populated with two built-in node pools that cannot be modified nor deleted:

  * `all` is a node pool that always includes all nodes in the cluster.
  * `default` is the node pool where nodes that don't specify a node pool in their configuration are placed.
2023-05-15 10:49:08 -04:00
dependabot[bot] f49eb3278b
build(deps-dev): bump next from 12.3.1 to 13.4.2 in /website (#17177)
Bumps [next](https://github.com/vercel/next.js) from 12.3.1 to 13.4.2.
- [Release notes](https://github.com/vercel/next.js/releases)
- [Changelog](https://github.com/vercel/next.js/blob/canary/release.js)
- [Commits](https://github.com/vercel/next.js/compare/v12.3.1...v13.4.2)

---
updated-dependencies:
- dependency-name: next
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-15 10:46:40 -04:00
Mark Lewis 1729c955d2
Update delete.mdx (#17184)
Fix typo
2023-05-15 13:31:52 +01:00
Phil Renaud a36af08ea6
[ui, deployments] Deployment history on steady state (#17167)
* Deployment history on steady state

* de-hds after chat
2023-05-12 16:49:40 -04:00
Tim Gross d018fcbff7
allocrunner: provide factory function so we can build mock ARs (#17161)
Tools like `nomad-nodesim` are unable to implement a minimal implementation of
an allocrunner so that we can test the client communication without having to
lug around the entire allocrunner/taskrunner code base. The allocrunner was
implemented with an interface specifically for this purpose, but there were
circular imports that made it challenging to use in practice.

Move the AllocRunner interface into an inner package and provide a factory
function type. Provide a minimal test that exercises the new function so that
consumers have some idea of what the minimum implementation required is.
2023-05-12 13:29:44 -04:00
Phil Renaud 9a5d67d475
[ui] Keyboard shortcuts to switch regions (#17169)
* Regions keynav

* Dont show if you only have a single region (global by default)
2023-05-12 11:46:00 -04:00
Jai e55edf58ff
chore: add percy tests (#17157) 2023-05-12 09:57:22 -04:00
Jai 27f0d104e5
16664/upgrade (#17158)
* chore:  upgrade

Upgrade @babel/helper-string-parserprop-types

* chore: add resolution

* chore: update component API for breaking changes

* chore: update arguments

* api: forgive user for pass wrong args

* chore:  update tests

* chore: update yarn lock

* chore: upgrade to Glimmer component

* styling: add properties to component invocation

* chore: add inset styles
2023-05-12 09:54:13 -04:00
Jai ce29e55b7a
chore: write js doc (#17156)
* chore:  write jsdoc comments

* chore: update comments
2023-05-11 15:30:54 -04:00
Seth Hoenig 81e36b3650
core: eliminate second index on job_submissions table (#17146)
* core: eliminate second index on job_submissions table

This PR refactors the job_submissions state store code to eliminate the
use of a second index formerly used for purging all versions of a given
job. In practice we ended up with duplicate entries on the table. Instead,
use index prefix scanning on the primary index and tidy up any potential
for creating (or removing) duplicates.

* core: pr comments followup
2023-05-11 09:51:08 -05:00
Phil Renaud a910e4be1c
[ui, deployments] Denominator based on completed allocations for batch jobs (#17147)
* Denominator based on completed allocations for batch jobs

* Test for denominatored batch job change
2023-05-11 10:23:23 -04:00
Tim Gross 9ed75e1f72
client: de-duplicate alloc updates and gate during restore (#17074)
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:

* The `allocSync` method that updates the servers isn't gated on first contact
  with the servers. This means that if a server updates the desired state while
  the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
  RPC. This will typically result in the client updating the server with "running"
  and then immediately thereafter "complete".

* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
  if it's possible to assert that the server has definitely seen the client
  state. The allocrunner may queue-up updates even if we gate sending them. So
  then we end up with a race between the allocrunner updating its internal state
  to overwrite the previous update and `allocSync` sending the bogus or duplicate
  update.

This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.

The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.

Fixes #16381
2023-05-11 09:05:24 -04:00
Seth Hoenig 4abb3e03ca
cli: upload var file(s) content on job submission (#17128)
This PR makes it so that the content of any -var-file files is uploaded
to Nomad on job run.
2023-05-11 08:04:33 -05:00
Jai 24afd86cc5
ui: add sign-in link on err page (#17140) 2023-05-11 08:24:58 -04:00
Luiz Aoqui d800dc3367
deps: update go-metrics to prevent panic (#17133)
nomad#15861 describes intermitent panics caused by go-metrics Prometheus
client. We have not been able to further debug this problem due to the
lack of information when the panic happens.

go-metrics#146 prevents the panic from happening and also logs
additional information that can help us understand the root cause of the
problem.

This commits pins the go-metric dependency to this branch until we can
better debug the issue.
2023-05-10 21:33:15 -04:00
Phil Renaud 25b66e249d
[ui] Batch jobs, aside from child jobs, get the new status panel (#17118)
* Batch jobs, aside from child jobs, get the new status panel

* Clean up the imported jobAllocStatuses

* Note for mirage that batch jobs now have a historical status panel

* Batch job test for complete status

* Parameterized and periodic child jobs get the panel treatment

* Undo parameterized and periodic child test situations
2023-05-10 16:59:33 -04:00
Phil Renaud 681ea73913
Versions returned as an array rather than an object now, pending changed to unknown, and sorted (#17150) 2023-05-10 15:40:54 -04:00
James Rasell c60c5ace60
api: update Go mod go version to 1.20 to match main mod. (#17137) 2023-05-10 16:29:06 +01:00
James Rasell 48357c609b
test: fix flakey broker notifier test. (#16994) 2023-05-10 13:40:25 +01:00
Jai 08d97a19ca
feat: visualize HCL Job Specification in the Nomad UI jobs.job.definition view (#16669)
* ui:  Toggle for `read-only` view (#16279)

* ui: model update for specification

* style: add styling for select

* style: add styling for select

* refact: add spec to view

* refact: update component API

* test: refactor for new UI state

* refact: clean conditional

* refact: update component API for prop

* chore: correct naming

* chore:  remove `fn` helper

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

* update `default` Mirage scenario (#16496)

* chore: update mirage scenario:

* ui: conditionally render toggle button (#16497)

* chore: update css variable name (#16498)

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

* ui:  Display JSON view of variables associated to job specification (#16570)

* chore: move fixture to util

* chore: update tests:

* ui: display variables table

* chore: add mirage fixture (#16572)

* ui: regex for job spec parse (#16668)

* ui: remove variable table (#16670)

* ui:  notify user if specification has variables (#16671)

* ui: regex for job spec parse

* chore: deprecate variable references

* chore: update mirage

* ui: add notification

* test: add test coverage for parse method (#16590)

* refact: `JobEditor` reactive query parameters (#16710)

* refact: add query parameter

* refact: move toggle action to controller

* ui:  remove toggle behavior in `JobEditor` (#16711)

* refact: rename logic for select

* chore: instantiate qp in route

* refact:  uniform alerts (#16715)

* style: buffer between alert and header

* refact: extract alerts into a component

* chore: update tests for qp

* chore: defensive logic for app controller

* refact:  move `edit` state to controller (#16725)

* refact: move edit state to controller

* refact: handle edit state (#16731)

* refact: handle edit state

* ui:  warning message (#16732)

* ui: warning message

* ui:  enable editing of HCL vars in the UI (#16734)

* enable editing of HCL vars

* refact: default qp logic

* refact: alert condition

* refact: Pass `variables` as string (#16849)

* ui:  Toggle for `read-only` view (#16279)

* ui: model update for specification

* style: add styling for select

* style: add styling for select

* refact: add spec to view

* refact: update component API

* test: refactor for new UI state

* refact: clean conditional

* refact: update component API for prop

* chore: correct naming

* chore:  remove `fn` helper

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

* update `default` Mirage scenario (#16496)

* chore: update mirage scenario:

* ui: conditionally render toggle button (#16497)

* chore: update css variable name (#16498)

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

* refact: `JobEditor` reactive query parameters (#16710)

* refact: add query parameter

* refact: move toggle action to controller

* ui:  remove toggle behavior in `JobEditor` (#16711)

* refact: rename logic for select

* chore: instantiate qp in route

* refact:  uniform alerts (#16715)

* style: buffer between alert and header

* refact: extract alerts into a component

* chore: update tests for qp

* chore: defensive logic for app controller

* refact:  move `edit` state to controller (#16725)

* refact: move edit state to controller

* refact: handle edit state (#16731)

* refact: handle edit state

* ui:  warning message (#16732)

* ui: warning message

* ui:  enable editing of HCL vars in the UI (#16734)

* enable editing of HCL vars

* refact: default qp logic

* refact: alert condition

* refact: variables as string

* style: revert styling change

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

* bug: correctly edit variables (#16989)

* ui: visualize variables (#16987)

* ui: fetchRawSpecification

* refact: integrate new model method

* test: fetchRaw unit

* styling: enable height on cm

* chore: update copy

* feat: visual variables

* chore: conditional render info txt

* refact: add mirage endpoint

* refact: update test for new schema

* refact: job submit flow (#17015)

* refact: job update logic

* chore: remove dead code

* bug:  update `job.run` and `job.update` adapter methods (#17055)

* refact:  update adapter

* chore: update api usage

* styling:  UX requests (#17064)

* refact:  update adapter

* chore: update api usage

* styling: disable toggle w text

* styling: stick button

* style: space out alerts

* chore:  autofocus on first editor

* bug: dismiss alert

* chore:  add jsdoc and assertion check

* chore:  update mirage for Vercel (#17054)

* chore: mirage logic for vercel deploy

* chore:  update test for mirage change

* refact:  API refactoring (#17083)

* refact: udpate for req schema

* refact: update for variable flags and literal

* bug: visualize job model not derived state

* chore:  update copy

* chore: fix incorrect copy

* chore: deprecate variables derived state

* chore: update copy

* feat: enable toggle on edit

* chore: prettify

* refact: move conditional

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
2023-05-09 11:03:52 -04:00
Seth Hoenig 6f4992ef29
client: unveil /etc/ssh/ssh_known_hosts for artifact downloads (#17122)
This PR fixes a bug where nodes configured with populated
/etc/ssh/ssh_known_hosts files would be unable to read them during
artifact downloading.

Fixes #17086
2023-05-09 09:43:52 -05:00
Seth Hoenig 74714272cc
api: set the job submission during job reversion (#17097)
* api: set the job submission during job reversion

This PR fixes a bug where the job submission would always be nil when
a job goes through a reversion to a previous version. Basically we need
to detect when this happens, lookup the submission of the job version
being reverted to, and set that as the submission of the new job being
created.

* e2e: add e2e test for job submissions during reversion

This e2e test ensures a reverted job inherits the job submission
associated with the version of the job being reverted to.
2023-05-08 14:18:34 -05:00
Daniel Bennett a7ed6f5c53
full task cleanup when alloc prerun hook fails (#17104)
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.

now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.

removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
 * set task state as dead
 * save task runner local state
 * task stop hooks

also done in tr.Run() now that it's not skipped:
 * handleKill() to kill tasks while respecting
   their shutdown delay, and retrying as needed
   * also includes task preKill hooks
 * clearDriverHandle() to destroy the task
   and associated resources
 * task exited hooks
2023-05-08 13:17:10 -05:00
Luiz Aoqui 53020c0941
Revert "ci: use BACKPORT_MERGE_COMMIT option (#16730)" (#17116)
This reverts commit 1721e687c0832bea3d9b7eec5dcd3c4e7a924d71.

The change was expected to solve the sporadic problems we were having
with Backport Assistant, but it end up creating even more failures.
2023-05-08 13:30:43 -04:00
Tim Gross ba269eaf3f
docs: add note to upgrade guide about yanked version (#17115)
Nomad 1.5.4 shipped with a logmon bug that we rolled out a fix for in Nomad
1.5.5. Unfortunately we can't yank the release but we should leave a note in the
upgrade guide telling users to avoid it.
2023-05-08 13:28:45 -04:00
stswidwinski 9c1c2cb5d2
Correct the status description and modify time of canceled evals. (#17071)
Fix for #17070. Corrected the status description and modify time of evals which are canceled due to another eval having completed in the meantime.
2023-05-08 08:50:36 -04:00
Phil Renaud 2fbbac5dd8
[ui, deployments] Job status for System Jobs (#17046)
* System jobs get a panel and lost status reinstated

* Leveraging nodes and not worrying about rescheds for system jobs

* Consistency w restarted as well

* Text shadow removed and early return where possible

* System jobs added to the Historical Click list

* System alloc and client summary panels removed

* Bones of some new system jobs tests

* [ui, deployments] handle node read permissions for system job panel (#17073)

* Do the next-best-thing when we cant read nodes for system jobs

* Whitespace control handlebars expr

* Simplifies system jobs to not attempt to show a desired count, since it is a particularly complex number depending on constraints, number of nodes, etc.

* [ui, deployments] Fix order in which allocations are ascribed to the status chart (#17063)

* Discovery of alloc.isOld

* Correct sorting and better types

* A more honest walk-back that prioritizes running and pending allocs first

* Test scenario for descending-order allocs to show

* isOld mandates that we set a job version for our created job. Could also do this in the factory but maybe side-effecty

* Type simplification

* Fixed up a test that needed system job summary to be updated

* Tests for modifications to the job summary

* Explicitly mark the service jobs in test as not-deploying
2023-05-05 16:25:21 -04:00
Shantanu Gadgil 2cf27389ad
minor typo; 1.3.x not 1.13.x (#17101) 2023-05-05 13:51:05 -04:00
Tim Gross 5f3ff346ea
post release 1.5.5 (#17098)
* changelog entries for 1.5.5 and missing merge of changelog for 1.5.4, 1.4.9,
  and 1.3.14
* note on deprecation of `logs.enabled` field
2023-05-05 11:46:08 -04:00
Seth Hoenig fff2eec625
connect: use heuristic to detect sidecar task driver (#17065)
* connect: use heuristic to detect sidecar task driver

This PR adds a heuristic to detect whether to use the podman task driver
for the connect sidecar proxy. The podman driver will be selected if there
is at least one task in the task group configured to use podman, and there
are zero tasks in the group configured to use docker. In all other cases
the task driver defaults to docker.

After this change, we should be able to run typical Connect jobspecs
(e.g. nomad job init [-short] -connect) on Clusters configured with the
podman task driver, without modification to the job files.

Closes #17042

* golf: cleanup driver detection logic
2023-05-05 10:19:30 -05:00
James Rasell 6ec4a69f47
scale: fixed a bug where evals could be created with wrong type. (#17092)
The job scale RPC endpoint hard-coded the eval creation to use the
type of service. This meant scaling events triggered on jobs of
type batch would create evaluations with the wrong type, which
does not seem to cause any problems, just confusion when
correlating the two.
2023-05-05 14:46:10 +01:00
Tim Gross 17bd930ca9
logs: fix missing allocation logs after update to Nomad 1.5.4 (#17087)
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.

This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.

Fixes #17076
2023-05-04 16:01:18 -04:00