The CSI Specification defines various gRPC Errors and how they may be retried. After auditing all our CSI RPC calls in #6863, this changeset:
* adds retries and backoffs to the where they were needed but not implemented
* annotates those CSI RPCs that do not need retries so that we don't wonder whether it's been left off accidentally
* added a timeout and cancellation context to the `Probe` call, which didn't have one.
Issue #7523 documents the Consul ACLs used in each Consul interface
used by Nomad. Minimize the policies used in e2e tests so that we
are setting a good example.
The `csi_plugins` and `csi_volumes` tables were missing support for
snapshot persist and restore. This means restoring a snapshot would
result in missing information for CSI.
In tests where the logger is a test logger, emitting a trace log in a
background thread while it's shutting down may trigger a panic. Thus
avoid logging Trace if err != nil. Note that we already log an error
when err isn't a trace.
This fixes cases where tests panic with a trace like:
```
panic: Log in goroutine after TestAllocGarbageCollector_MakeRoomFor_MaxAllocs has completed
goroutine 30 [running]:
testing.(*common).logDepth(0xc000aa9e60, 0xc000c4a000, 0xab, 0x3)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:680 +0x4d3
testing.(*common).log(...)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:662
testing.(*common).Logf(0xc000aa9e60, 0x690b941, 0x4, 0xc001366c00, 0x2, 0x2)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:701 +0x7e
github.com/hashicorp/nomad/helper/testlog.(*writer).Write(0xc000a82a60, 0xc0000b48c0, 0xab, 0x13f, 0x0, 0x0, 0x0)
/Users/notnoop/go/src/github.com/hashicorp/nomad/helper/testlog/testlog.go:34 +0x106
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*writer).Flush(0xc000a80900, 0xbf9870f000000001, 0x20a87556e, 0x8b12bc0)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/writer.go:29 +0x14f
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*intLogger).log(0xc000e2c180, 0xc0003b6880, 0x17, 0x1, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:139 +0x15d
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*intLogger).Trace(0xc000e2c180, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:446 +0x7a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*interceptLogger).Trace(0xc0002f1ad0, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/interceptlogger.go:48 +0x9c
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc0002f2380)
/Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:147 +0x1125
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
/Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:89 +0x1e3
FAIL github.com/hashicorp/nomad/client 10.605s
FAIL
```
The test inserts an alloc in the server state, but expect the client to
start the alloc runner for it almost immediately.
Here, we add a retry loop to check that the client start all expected
alloc runners eventually.
Pick up a panic fix of f1bd0923b8
Some CirleCI builds fail somewhat mysteriously, e.g. https://circleci.com/gh/hashicorp/nomad/52676 , due to this bug. The test json file emit the following:
```
{"Time":"2020-03-28T12:03:08.602544522Z","Action":"output","Package":"github.com/hashicorp/nomad/client/allocrunner","Test":"TestGroupServiceHook_Update08Alloc","Output":"panic: send on closed channel\n"}
{"Time":"2020-03-28T12:03:08.602576075Z","Action":"output","Package":"github.com/hashicorp/nomad/client/allocrunner","Test":"TestGroupServiceHook_Update08Alloc","Output":"\n"}
{"Time":"2020-03-28T12:03:08.602584429Z","Action":"output","Package":"github.com/hashicorp/nomad/client/allocrunner","Test":"TestGroupServiceHook_Update08Alloc","Output":"goroutine 403 [running]:\n"}
{"Time":"2020-03-28T12:03:08.602590561Z","Action":"output","Package":"github.com/hashicorp/nomad/client/allocrunner","Test":"TestGroupServiceHook_Update08Alloc","Output":"github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc000a78000, 0xc0009cf160)\n"}
{"Time":"2020-03-28T12:03:08.602598464Z","Action":"output","Package":"github.com/hashicorp/nomad/client/allocrunner","Test":"TestGroupServiceHook_Update08Alloc","Output":"\t/home/circleci/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x47\n"}
{"Time":"2020-03-28T12:03:08.602604952Z","Action":"output","Package":"github.com/hashicorp/nomad/client/allocrunner","Test":"TestGroupServiceHook_Update08Alloc","Output":"created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually\n"}
```
* added method to retrieve all scaling policies for use in snapshotting, plus test
* better testing for ScalingPoliciesByNamespace
* added scaling policy snapshot persist and restore (and test of restore)
manually tested snapshot restore.
resolves#7539
Looks like the latest `github.com/docker/docker/registry.ResolveAuthConfig` expect
`github.com/docker/docker/api/types.AuthConfig` rather than
`github.com/docker/cli/cli/config/types.AuthConfig`. The two types are
identical but live in different packages.
Here, we embed `registry.ResolveAuthConfig` from upstream repo, but with
the signature we need.