144 commits

Author SHA1 Message Date
Tim Gross fc1d4814d9
qemu: add args_allowlist to sandbox VM command line inputs
The QEMU driver allows arbitrary command line options, but many of
these options give access to host resources that operators may not
want to expose such as devices. Add an optional allowlist to the
plugin configuration so that operators can limit the resources for
QEMU.
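
As a rough illustration of the idea, a driver could compare each user-supplied flag against the operator's allowlist before launching QEMU. This is only a sketch, not the plugin's code: `validateArgs` is a hypothetical helper, and the rules that only tokens starting with `-` are checked and that an empty allowlist permits everything are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// validateArgs is a hypothetical helper. It rejects any flag (a token that
// starts with "-") that is not present in the allowlist. Flag values are
// not inspected. Assumption: an empty allowlist permits everything, since
// the allowlist is described as optional.
func validateArgs(allowlist, args []string) error {
	if len(allowlist) == 0 {
		return nil
	}
	allowed := make(map[string]struct{}, len(allowlist))
	for _, a := range allowlist {
		allowed[a] = struct{}{}
	}
	for _, arg := range args {
		if !strings.HasPrefix(arg, "-") {
			continue // treated as the value of the preceding flag
		}
		if _, ok := allowed[arg]; !ok {
			return fmt.Errorf("argument %q is not in args_allowlist", arg)
		}
	}
	return nil
}

func main() {
	allowlist := []string{"-drive", "-nographic"}
	fmt.Println(validateArgs(allowlist, []string{"-nographic"}))
	fmt.Println(validateArgs(allowlist, []string{"-usbdevice", "host:1234:5678"}))
}
```

Checking flags by exact name keeps the rule easy to audit; a real implementation might also need to validate flag values.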
2021-11-19 11:11:52 -05:00
Dave May 3c04d7927b
cli: refactor operator debug capture (#11466)
* debug: refactor Consul API collection
* debug: refactor Vault API collection
* debug: cleanup test timing
* debug: extend test to multiregion
* debug: save cmdline flags in bundle
* debug: add cli version to output
* Add changelog entry
2021-11-05 19:43:10 -04:00
Tim Gross 73e3b15305
build: bump go version to 1.17.3 (#11461) 2021-11-05 15:34:24 -04:00
James Rasell 99955eb80f
Merge pull request #11426 from hashicorp/b-set-dereg-eval-priority-correctly
rpc: set the deregistration eval priority to the job priority.
2021-11-05 15:53:10 +01:00
James Rasell 2cc661c523
Merge pull request #11429 from hashicorp/b-set-scale-eval-priority-correctly
rpc: set the job scale eval priority to the job priority.
2021-11-05 15:52:31 +01:00
Alessandro De Blasis 07c670fdc0
cli: show host_network in nomad status (#11432)
Enhance the CLI in order to return the host network in two flavors 
(default, verbose) of the `node status` command.

Fixes: #11223.
Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
2021-11-05 09:02:46 -04:00
Florian Apolloner ef88795af3
Added a -hcl2-strict flag to allow for lenient hcl variable parsing. (#11284)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2021-11-04 16:33:09 +01:00
James Rasell 674761436e
Merge pull request #11165 from hashicorp/b-gh-11149
jobspec2: ensure consistent error handling between var-file & var.
2021-11-04 16:24:00 +01:00
James Rasell 4125e13698
changelog: add entry for #11165 2021-11-04 15:35:02 +01:00
James Rasell 2b866b1d34
changelog: fixup entry extension for #11167 2021-11-04 15:28:34 +01:00
Michael Schurter 3718557041
Merge pull request #11416 from hashicorp/f-rejected-info
core: bump rejected plans from debug -> info
2021-11-03 16:49:28 -07:00
Michael Schurter ef3fc79225
Merge pull request #11334 from hashicorp/f-chroot-skip-allocdir
client: never embed alloc_dir in chroot
2021-11-03 16:48:09 -07:00
Charlie Voiselle 71643263a6
Parse job > group > consul block in HCL1 (#11423) 2021-11-03 13:49:32 -04:00
Luiz Aoqui 5d204c8ced
Revert "Return SchedulerConfig instead of SchedulerConfigResponse struct (#10799)" (#11433) 2021-11-02 17:42:52 -04:00
James Rasell a2176474a5
changelog: add entry for #11429 2021-11-02 12:58:10 +01:00
James Rasell 4803eb9d88
changelog: add entry for #11426 2021-11-02 11:43:13 +01:00
James Rasell c071efbd6b
Merge pull request #11411 from hashicorp/f-gh-11406
cli: add json and template flag opts to acl bootstrap command.
2021-11-02 09:48:25 +01:00
Charlie Voiselle 29e7d46dd9
Making RPC Upgrade mode reloadable. (#11144)
- Making RPC Upgrade mode reloadable.
- Add suggestions from code review
- remove spurious comment
- switch to require(t,...) form for test.
- Add to changelog
2021-11-01 16:30:53 -04:00
Luiz Aoqui 655ac2719f
Allow using specific object ID on diff (#11400) 2021-11-01 15:16:31 -04:00
Michael Schurter efe5714840 core: bump rejected plans from debug -> info
As we have continued to see reports of #9506 we need to elevate this log
line as it is the only way to detect when plans are being *erroneously*
rejected.

Users who see this log line repeatedly should drain and restart the node
in the log line. This seems to work around the issue.

Please post any details on #9506!
2021-10-31 12:51:42 -07:00
James Rasell 30ad7985b2
changelog: add entry for #11411. 2021-10-29 09:08:10 +02:00
Dave May 509c74ce19
debug: update default node-id and docs (#11398)
* debug: default node-id to all
* debug: align cli help and website documentation
2021-10-27 13:43:56 -04:00
Mahmood Ali cdddd64a42
logging: Log the cause behind agent startup failure (#11353)
Log the failure error when the agent fails to start. Previously, the
agent startup failure error would be emitted to the command UI but not
logged. So it doesn't get emitted to syslog or `log_file` if they are
set, and it makes debugging much harder. Also, logging the error again
before exit makes the error more visible: previously, the operator
needed to scroll to the top to find the error.

On a sample failure, the output will look like:
```
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from sample-configs/config-bad
==> Starting Nomad agent...
==> Error starting agent: setting up server node ID failed: mkdir /path-without-permission: read-only file system
    2021-10-20T14:38:51.179-0400 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/path-without-permission/plugins
    2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
    2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [ERROR] agent: error starting agent: error="setting up server node ID failed: mkdir /path-without-permission: read-only file system"
```

This change adds the final `ERROR` message. It's easy to miss the `==>
Error starting agent` above.
2021-10-27 10:41:17 -07:00
Mahmood Ali daf20f9788
vault: set JobID in Vault metadata (#11397)
Closes: #11395 .
2021-10-27 07:20:29 -07:00
Mahmood Ali e06ff1d613
scheduler: stop allocs in unrelated nodes (#11391)
The system scheduler should leave allocs on draining nodes as-is, but
stop allocs on nodes that are no longer part of the job
datacenters.

Previously, the scheduler did not make the distinction and left system
job allocs intact if they are already running.

I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 .

Fixes https://github.com/hashicorp/nomad/issues/11373
2021-10-27 07:04:13 -07:00
Mahmood Ali f03d65062d
Fix arm64 panics by updating google/snappy library to latest, 0.0.4 (#11396)
Pick up https://github.com/golang/snappy/pull/56 to handle arm64 architectures to fix panics. TL;DR: Golang 1.16 changed the `memmove` implementation for arm64, requiring additional CPU registers that snappy wasn't preserving in its assembly implementation.

Other projects have experienced this issue as well; searching for `encode_arm64.s:666` on your favorite search engine will reveal some. Vault updated the dependency earlier this August: https://github.com/hashicorp/vault/pull/12371 .

I believe this issue affects Nomad 1.2.x and 1.1.x. Nomad 1.0.x uses Golang 1.15 and isn't affected. However, backporting the change to 1.0.x should be harmless.

Fixed https://github.com/hashicorp/nomad/issues/11385 .
2021-10-27 06:39:16 -07:00
Luiz Aoqui b463715a98
prevent active log from being overwritten when agent starts (#11386) 2021-10-26 20:57:07 -04:00
Luiz Aoqui 3c22fc79a5
add dispatch idempotency token support in the CLI (#10930) 2021-10-22 12:39:05 -04:00
Luiz Aoqui 2c7bfb7000
ui: persist node drain settings (#11368) 2021-10-22 10:51:31 -04:00
Luiz Aoqui dc5222f6e5
ui: display Nomad version in the Clients and Servers table (#11366) 2021-10-22 10:33:06 -04:00
Luiz Aoqui b73ecf684b
ui: update favicon (#11371) 2021-10-22 09:40:38 -04:00
Luiz Aoqui 6853bf9632
cli: allow setting namespace and region in the nomad ui command (#11364) 2021-10-21 16:24:39 -04:00
Luiz Aoqui 362c8c54f4
ui: set * as the default namespace selector (#11357) 2021-10-21 10:24:07 -04:00
Luiz Aoqui dceeccfc5d
ui: add client name tooltip when displaying client ID in tables (#11358) 2021-10-21 10:23:06 -04:00
Mahmood Ali e992ebf58d
document GH-11346 fix (#11350) 2021-10-20 22:03:19 -04:00
Michael Schurter 081cfb85d7 docs: add #11331 to changelog 2021-10-19 16:30:06 -07:00
Michael Schurter d25b60a82d docs: add #11334 to changelog 2021-10-18 09:22:01 -07:00
Luiz Aoqui 1bd9db3df0
changelog: add entry for #10796 (#11312) 2021-10-14 09:01:43 -04:00
James Rasell 444d25db07
Merge pull request #11280 from benbuzbee/log-err
Log error if there are no event handlers registered
2021-10-14 14:49:22 +02:00
Mahmood Ali d5e136b82b
executor: set CpuWeight in cgroup-v2 (#11287)
Cgroup-v2 uses `cpu.weight` property instead of cpu shares:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#cpu-interface-files
. And it uses a different range (i.e. `[1, 10000]`) from cpu.shares
(i.e. `[2, 262144]`) to make things more interesting.

Luckily, libcontainer provides a helper function to perform the
conversion
[`ConvertCPUSharesToCgroupV2Value`](https://pkg.go.dev/github.com/opencontainers/runc@v1.0.2/libcontainer/cgroups#ConvertCPUSharesToCgroupV2Value).

I have confirmed that docker/libcontainer performs the conversion as
well in
https://github.com/opencontainers/runc/blob/v1.0.2/libcontainer/specconv/spec_linux.go#L536-L541
, and that CpuShares is ignored by libcontainer in
https://github.com/opencontainers/runc/blob/v1.0.2/libcontainer/cgroups/fs2/cpu.go#L24-L29
.
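
The mapping between the two ranges quoted above can be sketched as a linear scale. The function below is an illustrative reimplementation under that assumption, not the libcontainer source; `sharesToWeight` is a hypothetical name, and clamping out-of-range input is a choice made for the example.

```go
package main

import "fmt"

// sharesToWeight maps a cgroup-v1 cpu.shares value in [2, 262144] onto the
// cgroup-v2 cpu.weight range [1, 10000] using a linear scale. Zero is
// passed through to mean "unset". Out-of-range input is clamped, which is
// an assumption of this sketch.
func sharesToWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0
	}
	if shares < 2 {
		shares = 2
	}
	if shares > 262144 {
		shares = 262144
	}
	// 262142 is the width of the shares range, 9999 the width of the
	// weight range.
	return 1 + ((shares-2)*9999)/262142
}

func main() {
	for _, s := range []uint64{2, 1024, 262144} {
		fmt.Printf("cpu.shares=%d -> cpu.weight=%d\n", s, sharesToWeight(s))
	}
}
```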
2021-10-14 08:46:07 -04:00
Luiz Aoqui 536a5751ff
changelog: add entries for #9160 and #11078 (#11290) 2021-10-14 08:43:36 -04:00
Charlie Voiselle cb8e52b5df
Return SchedulerConfig instead of SchedulerConfigResponse struct (#10799) 2021-10-13 21:23:13 -04:00
Michael Schurter 59fda1894e
Merge pull request #11167 from a-zagaevskiy/master
Support configurable dynamic port range
2021-10-13 16:47:38 -07:00
Dave May c37a6ed583
cli: rename paths in debug bundle for clarity (#11307)
* Rename folders to reflect purpose
* Improve captured files test coverage
* Rename CSI plugins output file
* Add changelog entry
* fix test and make changelog message more explicit

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2021-10-13 18:00:55 -04:00
Dave May 305e8e98bf
cli: Improved autocomplete support for job dispatch and operator debug (#11270)
* Add autocomplete to nomad job dispatch
* Add autocomplete to nomad operator debug
* Update incorrect comment
* Update test to verify autocomplete
* Add changelog
* Apply lint suggestions
* Create dynamic slices instead of specific length
* Align style across predictors
2021-10-12 20:01:54 -04:00
Dave May 2d14c54fa0
debug: Improve namespace and region support (#11269)
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
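
For readers unfamiliar with the two helpers named in the list above, here is a minimal sketch of what such functions might look like. The names come from the commit message; the signatures and behavior are assumptions and may not match the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// SliceStringHasPrefix reports whether any element of list starts with
// prefix. (Assumed semantics.)
func SliceStringHasPrefix(list []string, prefix string) bool {
	for _, s := range list {
		if strings.HasPrefix(s, prefix) {
			return true
		}
	}
	return false
}

// StringHasPrefixInSlice reports whether s starts with any of the given
// prefixes. (Assumed semantics.)
func StringHasPrefixInSlice(s string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(s, p) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical server member names, used only for illustration.
	members := []string{"server1.global", "server2.eu"}
	fmt.Println(SliceStringHasPrefix(members, "server2"))
	fmt.Println(StringHasPrefixInSlice("server1.global", []string{"client", "server"}))
}
```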
2021-10-12 16:58:41 -04:00
Florian Apolloner 511cae92b4
Fixed plan diffing to handle non-unique service names. (#10965) 2021-10-12 16:42:39 -04:00
Dave May 76b05f3cd2
cli: Add nomad job allocs command (#11242) 2021-10-12 16:30:36 -04:00
Luiz Aoqui 3e0bad5a41
wrap log messages with hclog (#11291) 2021-10-12 14:38:44 -04:00
Ben Buzbee 573fb840fa Log error if there are no event handlers registered
We see this error all the time
```
no handler registered for event
event.Message=, event.Annotations=, event.Timestamp=0001-01-01T00:00:00Z, event.TaskName=, event.AllocID=, event.TaskID=,
```

So we're handling an event with all default fields. I noted that this can
happen if only err is set as in

```
func (d *driverPluginClient) handleTaskEvents(reqCtx context.Context, ch chan *TaskEvent, stream proto.Driver_TaskEventsClient) {
	defer close(ch)
	for {
		ev, err := stream.Recv()
		if err != nil {
			if err != io.EOF {
				ch <- &TaskEvent{
					Err: grpcutils.HandleReqCtxGrpcErr(err, reqCtx, d.doneCtx),
				}
			}
```

In this case Err fails to be serialized by the logger, see this test

```

	ev := &drivers.TaskEvent{
		Err: fmt.Errorf("errz"),
	}
	i.logger.Warn("ben test", "event", ev)
	i.logger.Warn("ben test2", "event err str", ev.Err.Error())
	i.logger.Warn("ben test3", "event err", ev.Err)
	ev.Err = nil
	i.logger.Warn("ben test4", "nil error", ev.Err)

2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.643900Z","driver":"mock_driver","event":{"TaskID":"","TaskName":"","AllocID":"","Timestamp":"0001-01-01T00:00:00Z","Message":"","Annotations":null,"Err":{}}}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test2","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644226Z","driver":"mock_driver","event err str":"errz"}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test3","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644240Z","driver":"mock_driver","event err":"errz"}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test4","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644252Z","driver":"mock_driver","nil error":null}
```

Note in the first example err is set to an empty object and the error is
lost.

What we want is the last two examples, which call out the err field
explicitly so we can see what it is in this case.
2021-10-11 19:44:52 +00:00