Commit graph

683 commits

Author SHA1 Message Date
Seth Hoenig 590ae08752
main: remove deprecated uses of rand.Seed (#16074)
* main: remove deprecated uses of rand.Seed

go1.20 deprecates rand.Seed, and seeds the rand package
automatically. Remove cases where we seed the random package,
and cleanup the one case where we intentionally create a
known random source.

* cl: update cl

* mod: update go mod
2023-02-07 09:19:38 -06:00
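The commit above can be illustrated with a minimal sketch. Since go1.20 the top-level `math/rand` functions are seeded automatically, so an explicit `rand.Seed` call is only needed when a reproducible sequence is wanted, and a dedicated source covers that case. The helper name `newKnownSource` is hypothetical and is not taken from the Nomad code base.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newKnownSource returns a generator with a fixed seed so that the
// sequence it produces is reproducible (hypothetical helper).
func newKnownSource(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

func main() {
	// No rand.Seed call: as of go1.20 the global source is seeded automatically.
	fmt.Println(rand.Intn(100))

	// Two generators built from the same seed yield the same values.
	a, b := newKnownSource(42), newKnownSource(42)
	fmt.Println(a.Intn(100) == b.Intn(100))
}
```

Keeping the known source local to the code that needs it leaves the global generator untouched for every other caller.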
Tim Gross 8a7d6b0cde
cli: remove deprecated keyring and keygen commands (#16068)
These commands were marked as deprecated in 1.4.0 with intent to remove in
1.5.0. Remove them and clean up the docs.
2023-02-07 09:49:52 -05:00
Dao Thanh Tung ae720fe28d
Add -json and -t flag for nomad acl token create command (#16055)
Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
2023-02-07 12:05:41 +01:00
Seth Hoenig 68894bdc62
docker: disable driver when running as non-root on cgroups v2 hosts (#16063)
* docker: disable driver when running as non-root on cgroups v2 hosts

This PR modifies the docker driver to behave like exec when being run
as a non-root user on a host machine with cgroups v2 enabled. Because
of how cpu resources are managed by the Nomad client, the nomad agent
must be run as root to manage docker-created cgroups.

* cl: update cl
2023-02-06 14:09:19 -06:00
Michael Schurter 0a496c845e
Task API via Unix Domain Socket (#15864)
This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled.

This PR contains the core feature and tests but followup work is required for the following TODO items:

- Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
- Unit tests for auth middleware
- Caching for auth middleware
- Rate limiting on negative lookups for auth middleware

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-02-06 11:31:22 -08:00
Seth Hoenig 911700ffea
build: update to go1.20 (#16029)
* build: update to go1.20

* build: use stringy go1.20 in circle yaml

* tests: handle new x509 certificate error structure in go1.20

* cl: add cl entry
2023-02-03 08:14:53 -06:00
Phil Renaud d3c351d2d2
Label for the Web UI (#16006)
* Demoable state

* Demo mirage color

* Label as a block with foreground and background colours

* Test mock updates

* Go test updated

* Documentation update for label support
2023-02-02 16:29:04 -05:00
Tim Gross 19a2c065f4
System and sysbatch jobs always have zero index (#16030)
Service jobs should have unique allocation Names, derived from the
Job.ID. System jobs do not have unique allocation Names because the index is
intended to indicate the instance out of a desired count size. Because system
jobs do not have an explicit count but the results are based on the targeted
nodes, the index is less informative and this was intentionally omitted from the
original design.

Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for 
system/sysbatch jobs

Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs.
System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So
interpolation via `per_alloc` will not work as soon as there's more than one
allocation placed. Validate against this on job submission.
2023-02-02 16:18:01 -05:00
Daniel Bennett 335f0a5371
docs: how to troubleshoot consul connect envoy (#15908)
* largely a doc-ification of this commit message:
  d47678074bf8ae9ff2da3c91d0729bf03aee8446
  this doesn't spell out all the possible failure modes,
  but should be a good starting point for folks.

* connect: add doc link to envoy bootstrap error

* add Unwrap() to RecoverableError
  mainly for easier testing
2023-02-02 14:20:26 -06:00
Charlie Voiselle cc6f4719f1
Add option to expose workload token to task (#15755)
Add `identity` jobspec block to expose workload identity tokens to tasks.

---------

Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-02-02 10:59:14 -08:00
Daniel Bennett dc9c8d4e47
Change job init default to example.nomad.hcl and recommend in docs (#15997)
recommend .nomad.hcl for job files instead of .nomad (without .hcl)
* nomad job init -> example.nomad.hcl
* update docs
2023-02-02 11:47:47 -06:00
Tim Gross 971a286ea3
cli: Fix a panic in deployment status when scheduling is slow (#16011)
If a deployment fails, the `deployment status` command can get a nil deployment
when it checks for a rollback deployment if there isn't one (or at least not one
at the time of the query). Fix the panic.
2023-02-02 12:34:44 -05:00
Phil Renaud 3db9f11c37
[feat] Nomad Job Templates (#15746)
* Extend variables under the nomad path prefix to allow for job-templates (#15570)

* Extend variables under the nomad path prefix to allow for job-templates

* Add job-templates to error message hinting

* RadioCard component for Job Templates (#15582)

* chore: add

* test: component API

* ui: component template

* refact: remove  bc naming collision

* styles: remove SASS var causing conflicts

* Disallow specific variable at nomad/job-templates (#15681)

* Disallows variables at exactly nomad/job-templates

* idiomatic refactor

* Expanding nomad job init to accept a template flag (#15571)

* Adding a string flag for templates on job init

* data-down actions-up version of a custom template editor within variable

* Dont force grid on job template editor

* list-templates flag started

* Correctly slice from end of path name

* Pre-review cleanup

* Variable form acceptance test for job template editing

* Some review cleanup

* List Job templates test

* Example from template test

* Using must.assertions instead of require etc

* ui: add choose template button (#15596)

* ui: add new routes

* chore: update file directory

* ui: add choose template button

* test: button and page navigation

* refact: update var name

* ui: use `Button` component from `HDS` (#15607)

* ui: integrate  buttons

* refact: remove  helper

* ui: remove icons on non-tertiary buttons

* refact: update normalize method for key/value pairs (#15612)

* `revert`: `onCancel` for `JobDefinition`

The `onCancel` method isn't included in the component API for `JobEditor` and the primary cancel behavior exists outside of the component. With the exception of the `JobDefinition` page where we include this button in the top right of the component instead of next to the `Plan` button.

* style: increase button size

* style: keep lime green

* ui: select template (#15613)

* ui: deprecate unused component

* ui: deprecate tests

* ui: jobs.run.templates.index

* ui: update logic to handle templates

* refact: revert key/value changes

* style: padding for cards + buttons

* temp: fixtures for mirage testing

* Revert "refact: revert key/value changes"

This reverts commit 124e95d12140be38fc921f7e15243034092c4063.

* ui: guard template for unsaved job

* ui: handle reading template variable

* Revert "refact: update normalize method for key/value pairs (#15612)"

This reverts commit 6f5ffc9b610702aee7c47fbff742cc81f819ab74.

* revert: remove test fixtures

* revert: prettier problems

* refact: test doesnt need filter expression

* styling: button sizes and responsive cards

* refact: remove route guarding

* ui: update variable adapter

* refact: remove model editing behavior

* refact: model should query variables to populate editor

* ui: clear qp on exit

* refact: cleanup deprecated API

* refact: query all namespaces

* refact: deprecate action

* ui: rely on  collection

* refact: patch deprecate transition API

* refact: patch test to expect namespace qp

* styling: padding, conditionals

* ui: flashMessage on 404

* test: update for o(n+1) query

* ui: create new job template (#15744)

* refact: remove unused code

* refact: add type safety

* test: select template flow

* test: add data-test attrs

* chore: remove dead code

* test: create new job flow

* ui: add create button

* ui: create job template

* refact: no need for wildcard

* refact:  record instead of delete

* styling: spacing

* ui: add error handling and form validation to job create template (#15767)

* ui: handle server side errors

* ui: show error to prevent duplicate

* refact: conditional namespace

* ui: save as template flow (#15787)

* bug:  patches failing tests associated with `pretender` (#15812)

* refact: update assertion

* refact: test set-up

* ui: job templates manager view (#15815)

* ui: manager list view

* test: edit flow

* refact: deprecate column-helper

* ui: template edit and delete flow (#15823)

* ui: manager list view

* refact: update title

* refact: update permissions

* ui: template edit page

* bug: typo

* refact: update toast messages

* bug:  clear selections on exit (#15827)

* bug:  clear controllers on exit

* test: mirage config changes (#15828)

* refact: deprecate column-helper

* style: update z-index for HDS

* Revert "style: update z-index for HDS"

This reverts commit d3d87ceab6d083f7164941587448607838944fc1.

* refact: update delete button

* refact: edit redirect

* refact: patch reactivity issues

* styling: fixed width

* refact: override defaults

* styling: edit text causing overflow

* styling:  add inline text

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

* bug: edit `text` to `template`

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>

* test:  delete flow job templates (#15896)

* refact: edit names

* bug:  set correct ref to store

* chore: trim whitespace:

* test: delete flow

* bug: reactively update view (#15904)

* Initialized default jobs (#15856)

* Initialized default jobs

* More jobs scaffolded

* Better commenting on a couple example job specs

* Adapter doing the work

* fall back to epic config

* Label format helper and custom serialization logic

* Test updates to account for a never-empty state

* Test suite uses settled and maintain RecordArray in adapter return

* Updates to hello-world and variables example jobspecs

* Parameterized job gets optional payload output

* Formatting changes for param and service discovery job templates

* Multi-group service discovery job

* Basic test for default templates (#15965)

* Basic test for default templates

* Percy snapshot for manage page

* Some late-breaking design changes

* Some copy edits to the header paragraphs for job templates (#15967)

* Added some init options for job templates (#15994)

* Async method for populating default job templates from the variable adapter

---------

Co-authored-by: Jai <41024828+ChaiWithJai@users.noreply.github.com>
2023-02-02 10:37:40 -05:00
Charlie Voiselle 4caac1a92f
client: Add option to enable hairpinMode on Nomad bridge (#15961)
* Add `bridge_network_hairpin_mode` client config setting
* Add node attribute: `nomad.bridge.hairpin_mode`
* Changed format string to use `%q` to escape user provided data
* Add test to validate template JSON for developer safety

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-02-02 10:12:15 -05:00
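As a rough illustration of the `%q` point in the commit above: `%q` makes `fmt` emit a double-quoted string with embedded quotes and control characters escaped, so user-provided text cannot break out of the surrounding template. The function name `renderBridgeName` and the JSON fragment are assumptions made for this sketch.

```go
package main

import "fmt"

// renderBridgeName embeds a user-provided name in a JSON fragment
// (hypothetical helper). %q emits a double-quoted Go string literal
// with quotes and control characters escaped.
func renderBridgeName(name string) string {
	return fmt.Sprintf(`{"name": %q}`, name)
}

func main() {
	fmt.Println(renderBridgeName(`nomad"bridge`))
}
```

Go's quoting rules are close to JSON's but not identical, so validating the rendered template as JSON in a test, as the commit does, remains a sensible safeguard.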
jmwilkinson 37834dffda
Allow wildcard datacenters to be specified in job file (#11170)
Also allows for default value of `datacenters = ["*"]`
2023-02-02 09:57:45 -05:00
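A minimal sketch of how wildcard datacenter matching could work is shown here. It assumes glob semantics as implemented by the standard library's `path.Match`; the matching used by Nomad itself may differ, and `datacenterMatches` is a hypothetical name.

```go
package main

import (
	"fmt"
	"path"
)

// datacenterMatches reports whether the node's datacenter matches any of
// the job's datacenter patterns. Glob matching via path.Match is an
// assumption of this sketch.
func datacenterMatches(patterns []string, nodeDC string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, nodeDC); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(datacenterMatches([]string{"*"}, "dc1"))
	fmt.Println(datacenterMatches([]string{"us-*"}, "us-east-1"))
	fmt.Println(datacenterMatches([]string{"us-*"}, "eu-west-1"))
}
```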
Luiz Aoqui 7c47b576cd
changelog: fix entries for #15522 and #15819 (#15998) 2023-02-01 18:03:39 -05:00
Tim Gross 0abf0b948b
job parsing: fix panic when variable validation is missing condition (#16018) 2023-02-01 16:41:03 -05:00
Tristan Pemble 5440965260
fix(#13844): canonicalize job to avoid nil pointer dereference (#13845) 2023-02-01 16:01:28 -05:00
Seth Hoenig ca7ead191e
consul: restore consul token when reverting a job (#15996)
* consul: reset consul token on job during registration of a reversion

* e2e: add test for reverting a job with a consul service

* cl: fixup cl entry
2023-02-01 14:02:45 -06:00
James Rasell 9e8325d63c
acl: fix a bug in token creation when parsing expiration TTLs. (#15999)
The ACL token decoding was not correctly handling time duration
syntax such as "1h" which forced people to use the nanosecond
representation via the HTTP API.

The change adds an unmarshal function which allows this syntax to
be used, along with other styles correctly.
2023-02-01 17:43:41 +01:00
James Rasell 67acfd9f6b
acl: return 400 not 404 code when creating an invalid policy. (#16000) 2023-02-01 17:40:15 +01:00
Mike Nomitch 80848b202e
Increases max variable size to 64KiB from 16KiB (#15983) 2023-01-31 13:32:36 -05:00
stswidwinski 16eefbbf4d
GC: ensure no leakage of evaluations for batch jobs. (#15097)
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs, so the comparison trivially always results in false, which leads to unreclaimable memory.

Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
2023-01-31 13:32:14 -05:00
Seth Hoenig 139f2c0b0f
docker: set force=true on remove image to handle images referenced by multiple tags (#15962)
* docker: set force=true on remove image to handle images referenced by multiple tags

This PR changes our call of docker client RemoveImage() to RemoveImageExtended with
the Force=true option set. This fixes a bug where an image referenced by more than
one tag could never be garbage collected by Nomad. The Force option only applies to
stopped containers; it does not affect running workloads.

* docker: add note about image_delay and multiple tags
2023-01-31 07:53:18 -06:00
Yorick Gersie d94f22bee2
Ensure infra_image gets proper label used for reconciliation (#15898)
* Ensure infra_image gets proper label used for reconciliation

Currently infra containers are not cleaned up as part of the dangling container
cleanup routine. The reason is that Nomad checks if a container is a Nomad owned
container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label.

Ensure we set this label on the infra container as well.

* fix unit test

* changelog: add entry

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-01-30 09:46:45 -06:00
Jorge Marey d1c9aad762
Rename fields on proxyConfig (#15541)
* Change api Fields for expose and paths

* Add changelog entry

* changelog: add deprecation notes about connect fields

* api: minor style tweaks

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-01-30 09:31:16 -06:00
dependabot[bot] bb79824a20
build(deps): bump github.com/docker/docker from 20.10.21+incompatible to 20.10.23+incompatible (#15848)
* build(deps): bump github.com/docker/docker

Bumps [github.com/docker/docker](https://github.com/docker/docker) from 20.10.21+incompatible to 20.10.23+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v20.10.21...v20.10.23)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* changelog: add entry for docker/docker

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-01-30 09:10:06 -06:00
舍我其谁 3abb453bd0
volume: Add the missing option propagation_mode (#15626) 2023-01-30 09:32:07 -05:00
Seth Hoenig 074b76e3bf
consul: check for acceptable service identity on consul tokens (#15928)
When registering a job with a service and 'consul.allow_unauthenticated=false',
we scan the given Consul token for an acceptable policy or role with an
acceptable policy, but did not scan for an acceptable service identity (which
is backed by an acceptable virtual policy). This PR updates our consul token
validation to also accept a matching service identity when registering a service
into Consul.

Fixes #15902
2023-01-27 18:15:51 -06:00
Seth Hoenig 0fac4e19b3
client: always run alloc cleanup hooks on final update (#15855)
* client: run alloc pre-kill hooks on last pass despite no live tasks

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

* client: do not run ar cleanup hooks if client is shutting down
2023-01-27 09:59:31 -06:00
Luiz Aoqui de87cdc816
template: restore driver handle on update (#15915)
When the template hook Update() method is called it may recreate the
template manager if the Nomad or Vault token has been updated.

As a result, the new template manager did not have a driver handle
because this was only being set on the Poststart hook, which is not
called for inplace updates.
2023-01-27 10:55:59 -05:00
Luiz Aoqui 09fc054c82
ui: fix alloc memory stats to match CLI output (#15909) 2023-01-26 17:08:13 -05:00
Luiz Aoqui bb323ef3de
ui: fix navigation for namespaced jobs in search and job version (#15906) 2023-01-26 16:03:07 -05:00
Seth Hoenig 7375fd40fc
nsd: block on removal of services (#15862)
* nsd: block on removal of services

This PR uses a WaitGroup to ensure workload removals are complete
before returning from ServiceRegistrationHandler.RemoveWorkload of
the nomad service provider. The de-registration of individual services
still occurs asynchronously, but we must block on the parent removal
call so that we do not race with further operations on the same set
of services - e.g. in the case of a task restart where we de-register
and then re-register the services in quick succession.

Fixes #15032

* nsd: add e2e test for initial failing check and restart
2023-01-26 08:17:57 -06:00
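The WaitGroup pattern described in the commit above can be sketched as follows: each service is still de-registered in its own goroutine, but the parent call returns only once all of them are done. The `registry` type and its methods are hypothetical and much simpler than the real service registration handler.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical in-memory service registry.
type registry struct {
	mu       sync.Mutex
	services map[string]bool
}

// deregister removes a single service under the lock.
func (r *registry) deregister(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.services, id)
}

// removeWorkload de-registers each service in its own goroutine but
// returns only after all of them have finished.
func (r *registry) removeWorkload(ids []string) {
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			r.deregister(id)
		}(id)
	}
	wg.Wait() // block so a following re-registration cannot race with removal
}

func main() {
	r := &registry{services: map[string]bool{"web": true, "db": true}}
	r.removeWorkload([]string{"web", "db"})
	fmt.Println(len(r.services))
}
```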
Yorick Gersie 2a5c423ae0
Allow per_alloc to be used with host volumes (#15780)
Disallowing per_alloc for host volumes in some cases makes the life of a Nomad user much harder.
When we rely on the NOMAD_ALLOC_INDEX for any configuration that needs to be re-used across
restarts we need to make sure allocation placement is consistent. With CSI volumes we can
use the `per_alloc` feature but for some reason this is explicitly disabled for host volumes.

Ensure host volumes understand the concept of per_alloc
2023-01-26 09:14:47 -05:00
Tim Gross 6677a103c2
metrics: measure rate of RPC requests that serve API (#15876)
This changeset configures the RPC rate metrics that were added in #15515 to all
the RPCs that support authenticated HTTP API requests. These endpoints were already
configured with pre-forwarding authentication in #15870, and a handful of others
were done already as part of the proof-of-concept work. So this changeset is
entirely copy-and-pasting one method call into a whole mess of handlers.

Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining
set of RPCs that have no API consumers or aren't authenticated, in smaller
chunks that can be more thoughtfully reviewed.
2023-01-25 16:37:24 -05:00
Luiz Aoqui 3479e2231f
core: enforce strict steps for clients reconnect (#15808)
When a Nomad client that is running an allocation with
`max_client_disconnect` set misses a heartbeat the Nomad server will
update its status to `disconnected`.

Upon reconnecting, the client will make three main RPC calls:

- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about
  allocations, such as their `ClientStatus`, task states etc.
- `Node.Register` is used to upsert the entire node information,
  including its status.

These calls are made concurrently and are also running in parallel with
the scheduler. Depending on the order they run the scheduler may end up
with incomplete data when reconciling allocations.

For example, a client disconnects and its replacement allocation cannot
be placed anywhere else, so there's a pending eval waiting for
resources.

When this client comes back the order of events may be:

1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc to
   the client. The client is now assigned two allocations: the original
   alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to
   `running`.
4. Scheduler notices too many allocs and stops the replacement.

This creates unnecessary placements or, in a different order of events,
may leave the job without any allocations running until the whole state
is updated and reconciled.

To avoid problems like this clients must update _all_ of their relevant
information before they can be considered `ready` and available for
scheduling.

To achieve this goal the RPC endpoints mentioned above have been
modified to enforce strict steps for nodes reconnecting:

- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing`
  status until it successfully calls `Node.UpdateAlloc`.

These changes are done server-side to avoid the need of additional
coordination between clients and servers. Clients are kept oblivious of
these changes and will keep making these calls as they normally would.

The verification of whether allocations have been updated is done by
storing and comparing the Raft index of the last time the client missed
a heartbeat and the last time it updated its allocations.
2023-01-25 15:53:59 -05:00
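The reconnect ordering described in the commit above can be sketched as a small state model: a reconnecting node stays `initializing` until its allocation update carries a Raft index newer than the one recorded at its last missed heartbeat. The `node` type, its field names, and its methods are hypothetical simplifications of the server-side logic.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of the server-side node record.
type node struct {
	status                   string
	lastMissedHeartbeatIndex uint64 // Raft index when a heartbeat was last missed
	lastAllocUpdateIndex     uint64 // Raft index of the last allocation update
}

// missHeartbeat marks the node as disconnected at the given Raft index.
func (n *node) missHeartbeat(index uint64) {
	n.status = "disconnected"
	n.lastMissedHeartbeatIndex = index
}

// updateAlloc records the Raft index of an allocation update.
func (n *node) updateAlloc(index uint64) {
	n.lastAllocUpdateIndex = index
}

// updateStatus handles the client's request to become ready. A
// reconnecting node is held in "initializing" until its allocations
// have been updated after the missed heartbeat.
func (n *node) updateStatus() {
	reconnecting := n.status == "disconnected" || n.status == "initializing"
	if reconnecting && n.lastAllocUpdateIndex <= n.lastMissedHeartbeatIndex {
		n.status = "initializing"
		return
	}
	n.status = "ready"
}

func main() {
	n := &node{status: "ready"}
	n.missHeartbeat(10)
	n.updateStatus()
	fmt.Println(n.status) // allocations not yet updated

	n.updateAlloc(12)
	n.updateStatus()
	fmt.Println(n.status) // allocations updated after the missed heartbeat
}
```

Because the check lives entirely on the server side in this model, the client keeps calling the same endpoints in any order, which matches the intent stated in the commit message.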
Tim Gross f3f64af821
WI: allow workloads to use RPCs associated with HTTP API (#15870)
This changeset allows Workload Identities to authenticate to all the RPCs that
support HTTP API endpoints, for use with PR #15864.

* Extends the work done for pre-forwarding authentication to all RPCs that
  support a HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node
  endpoints that are currently used to support both tokens and client secrets.

Intentionally excluded from this changeset:
* The Variables endpoint still has custom handling because of the implicit
  policies. Ideally we'll figure out an efficient way to resolve those into real
  policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't
  support HTTP endpoints) have not been updated with the new pre-forwarding auth.
  We'll be doing this under a separate PR to support RPC rate metrics.
2023-01-25 14:33:06 -05:00
Nick Wales 825af1f62a
docker: add option for Windows isolation modes (#15819) 2023-01-24 16:31:48 -05:00
Karl Johann Schubert b773a1b77f
client: add disk_total_mb and disk_free_mb config options (#15852) 2023-01-24 09:14:22 -05:00
Michael Schurter 92c7d96e0a
Add INFO task event log line and make logmon less noisy (#15842)
* client: log task events at INFO level

Fixes #15840

Example INFO level client logs with this enabled:

```
[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stdout.fifo @module=logmon timestamp=2023-01-20T11:19:34.275-0800
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy @module=logmon path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stderr.fifo timestamp=2023-01-20T11:19:34.275-0800
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Started msg="Task started by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
[INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy path=/home/schmichael/go/bin/nomad pid=27668
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Terminated msg="Exit Code: 130, Signal: 2" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killed msg="Task successfully killed" failed=false
[INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/home/schmichael/go/bin/nomad pid=27653
[INFO]  client.gc: marking allocation for GC: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51
```

So task events will approximately *double* the number of per-task log
lines, but I think they add a lot of value.

* client: drop logmon 'opening' from info->debug

Cannot imagine why users care and removes 2 log lines per task
invocation.

```

[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false
<<< 2 "opening fifo" lines elided here >>>
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Started msg="Task started by client" failed=false
```

* docs: add changelog for #15842
2023-01-20 14:35:00 -08:00
Tim Gross a51149736d
Rename nomad.broker.total_blocked metric (#15835)
This changeset fixes a long-standing point of confusion in metrics emitted by
the eval broker. The eval broker has a queue of "blocked" evals that are waiting
for an in-flight ("unacked") eval of the same job to be completed. But this
"blocked" state is not the same as the `blocked` status that we write to raft
and expose in the Nomad API to end users. There's a second metric
`nomad.blocked_eval.total_blocked` that refers to evaluations in that
state. This has caused ongoing confusion in major customer incidents and even in
our own documentation! (Fixed in this PR.)

There's little functional change in this PR aside from the name of the metric
emitted, but there's a bit of refactoring to clean up the names in `eval_broker.go`
so that there aren't name collisions and multiple names for the same
state. Changes included are:
* Everything that was previously called "pending" referred to entities that were
  associated with the "ready" metric. These are all now called "ready" to match
  the metric.
* Everything named "blocked" in `eval_broker.go` is now named "pending", except
  for a couple of comments that actually refer to blocked RPCs.
* Added a note to the upgrade guide docs for 1.5.0.
* Fixed the scheduling performance metrics docs because the description for
  `nomad.broker.total_blocked` was actually the description for
  `nomad.blocked_eval.total_blocked`.
2023-01-20 14:23:56 -05:00
Charlie Voiselle 5ea1d8a970
Add raft snapshot configuration options (#15522)
* Add config elements
* Wire in snapshot configuration to raft
* Add hot reload of raft config
* Add documentation for new raft settings
* Add changelog
2023-01-20 14:21:51 -05:00
Seth Hoenig d2d8ebbeba
consul: correctly interpret missing consul checks as unhealthy (#15822)
* consul: correctly understand missing consul checks as unhealthy

This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, the
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.

In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.

Fixes #15536

* consul: minor CR refactor using maps not sets

* consul: observe transition from healthy to unhealthy checks

* consul: spell healthy correctly
2023-01-19 14:01:12 -06:00
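The rule stated in the commit above, treating Nomad's own list of expected checks as the source of truth, can be sketched as follows. The function `allChecksHealthy`, the map-based response, and the `"passing"` status string are simplifying assumptions for illustration.

```go
package main

import "fmt"

// allChecksHealthy reports healthy only if every expected check is present
// in the reported statuses and is passing. A missing check counts as
// unhealthy, which covers the case where the response omits checks.
func allChecksHealthy(expected []string, reported map[string]string) bool {
	for _, name := range expected {
		status, ok := reported[name]
		if !ok || status != "passing" {
			return false
		}
	}
	return true
}

func main() {
	expected := []string{"http-alive", "tcp-ready"}

	// The response has not caught up yet: no checks are reported at all.
	fmt.Println(allChecksHealthy(expected, map[string]string{}))

	// Every expected check is present and passing.
	fmt.Println(allChecksHealthy(expected, map[string]string{
		"http-alive": "passing",
		"tcp-ready":  "passing",
	}))
}
```

Scanning only the response for failing checks would report the first case as healthy, which is the bug the commit describes.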
James Rasell 94aba987c6
changelog: add feature entry for SSO OIDC (#15821) 2023-01-19 16:48:04 +01:00
Dao Thanh Tung e2ae6d62e1
fix bug in nomad fmt -check does not return error code (#15797) 2023-01-17 09:15:34 -05:00
Benjamin Buzbee 13cc30ebeb
Return buffered text from log endpoint if decoding fails (#15558)
To see why I think this is a good change, let's look at why I am making it.

My disk was full, which means GC was happening aggressively. So by the
time I called the logging endpoint from the SDK, the logs were GC'd

The error I was getting before was:
```
invalid character 'i' in literal false (expecting 'l')
```

Now the error I get is:
```
failed to decode log endpoint response as JSON: "failed to list entries: open /tmp/nomad.data.4219353875/alloc/f11fee50-2b66-a7a2-d3ec-8442cb3d557a/alloc/logs: no such file or directory"
```

Still not super descriptive but much more debuggable
2023-01-16 10:39:56 +01:00
Phil Renaud d588aabca6
[ui] Fixes logger height issue when sidebar has events (#15759)
* Fixes logger height issue when sidebar has events

* Much simpler grid method for height calc
2023-01-13 12:16:02 -05:00
Seth Hoenig 8cd77c14a2
env/aws: update ec2 cpu info data (#15770) 2023-01-13 09:58:23 -06:00
Seth Hoenig a8d40ce26b
build: update to go 1.19.5 (#15769) 2023-01-13 09:57:32 -06:00