This is an attempt to ease dependency management for external driver
plugins, by avoiding requiring them to compile ugorji/go generated
files. Plugin developers reported some pain with the brittleness of
ugorji/go dependency in particular, specially when using go mod, the
default go mod manager in golang 1.13.
Context
--------
Nomad uses msgpack to persist and serialize internal structs, using
ugorji/go library. As an optimization, we use ugorji/go code generation
to speedup process and aovid the relection-based slow path.
We commit these generated files in repository when we cut and tag the
release to ease reproducability and debugging old releases. Thus,
downstream projects that depend on release tag, indirectly depends on
ugorji/go generated code.
Sadly, the generated code is brittle and specific to the version of
ugorji/go being used. When go mod picks another version of ugorji/go
then nomad (go mod by default uses release according to semver),
downstream projects face compilation errors.
Interestingly, downstream projects don't commonly serialize nomad
internal structs. Drivers and device plugins use grpc instead of
msgpack for the most part. In the few cases where they use msgpag (e.g.
decoding task config), they do without codegen path as they run on
driver specific structs not the nomad internal structs. Also, the
ugorji/go serialization through reflection is generally backward
compatible (mod some ugorji/go regression bugs that get introduced every
now and then :( ).
Proposal
---------
The proposal here is to keep committing ugorji/go codec generated files
for releases but to use a go tag for them.
All nomad development through the makefile, including releasing, CI and
dev flow, has the tag enabled.
Downstream plugin projects, by default, will skip these files and life
proceed as normal for them.
The downside is that nomad developers who use generated code but avoid
using make must start passing additional go tag argument. Though this
is not a blessed configuration.
We currently use an container image for `test-devices` job only; while
all other jobs use machine executor.
This allows us to switch golang and protoc verions easily without
manually managing Docker images (which requires building them manually
on a dev machines, etc). All that while, we install dependencies on
every build in all other jobs..
`test-devices` now is one of the fastest jobs and isn't a constraint or
a bottleneck, so increasing its overhead by few seconds doesn't hurt the
overall developer iteration.
If we split tests effectively later, we can revisit.
Allow honoring `GO_TAGS` environment variable if set. Currently, users
must set variable as a makefile argument e.g. `make GO_TAGS=ui dev`, and
this allows us to use env-var syntax (e.g. `GO_TAGS=ui make dev`) and
make it convenient to set GO_TAGS globally.
This commit provides an initial migration of general testing CI
infrastructure to CircleCI.
It uses CircleCI 2.1 paramereterised jobs to provide two base
configurations: a vm based `test-machine`, and docker based
`test-container`.
Jobs that require root, docker, or other similar features require the
machine based jobs, but others should be ran using the `test-container` package
as they are both cheaper and faster to run.
Our build scripts pass `$(GO_TAGS)` to `-tags` go compile flags, except
for `make dev`, where `$(NOMAD_UI_TAG)` is used. This change ensures
`make dev` is inline with the rest of makefile targets.
I use the flag primarily to enable the nomad ui using the committed
compiled assets without regenerating them, as I find using stale ui
satisfactory most of the time.
`make check` runs very intensive linters that slow and seem to behave
differently on different machines.
Linting is still a part of our CI and we shouldn't be cutting a release
when CI isn't green anyway.
Noticed that the protobuf files are out of sync with ones generated by 1.2.0 protoc go plugin.
The cause for these files seem to be related to release processes, e.g. [0.9.0-beta1 preperation](ecec3d38de (diff-da4da188ee496377d456025c2eab4e87)), and [0.9.0-beta3 preperation](b849d84f2f).
This restores the changes to that of the pinned protoc version and fails build if protobuf files are out of sync. Sample failing Travis job is that of the first commit change: https://travis-ci.org/hashicorp/nomad/jobs/506285085
`gotestsum` has user friendlier output that emits final summary, also it can emit junit xml file for
automated analysis instead of current format that should significantly
ease automated analysis of CI.
workaround a regression in 1.11.3
> We are aware of a functionality regression in "go get" when executed in GOPATH mode on an import path pattern containing "..." (e.g., "go get github.com/golang/pkg/..."), when downloading packages not already present in the GOPATH workspace. This is issue golang.org/issue/29241. It will be resolved in the next minor patch releases, Go 1.11.4 and Go 1.10.7, which we plan to release soon. We apologize for any disruption.
Lowering the runtime here to pre 7ca535aa90748caff1522468cc0c4ab672a74abb expectations.
The longest package at the time `client/driver` shrunk significantly,
and now the longest packages take less than 5 minutes.
We do have some long running timed out projects due to a stuck shutdown,
but in completed jobs (though they failed), the longest packages took
less than 5 minutes. The longest running packages in
https://travis-ci.org/hashicorp/nomad/jobs/464640776 were:
```
FAIL github.com/hashicorp/nomad/nomad 268.089s
ok github.com/hashicorp/nomad/drivers/docker 203.903s coverage: 68.8% of statements
ok github.com/hashicorp/nomad/drivers/rkt 132.104s coverage: 65.0% of statements
ok github.com/hashicorp/nomad/api 123.193s coverage: 62.9% of statements
ok github.com/hashicorp/nomad/command/agent 74.657s coverage: 72.3% of statements
ok github.com/hashicorp/nomad/command 63.592s coverage: 42.7% of statements
```
nomad/client take very long and exceed 15m sometimes:
In https://travis-ci.org/hashicorp/nomad/jobs/452990197 :
```
panic: test timed out after 15m0s
goroutine 4739 [running]:
testing.(*M).startAlarm.func1()
/home/travis/.gimme/versions/go1.11.2.linux.amd64/src/testing/testing.go:1296 +0xfd
....
goroutine 4665 [select]:
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.newClientStream.func5(0xc0003dd500, 0xc000420120, 0x2b3f86295588, 0xc000496810)
/home/travis/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:287 +0xd7
created by github.com/hashicorp/nomad/vendor/google.golang.org/grpc.newClientStream
/home/travis/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:286 +0x842
FAIL github.com/hashicorp/nomad/client/driver 900.036s
```
Introduce a device manager that manages the lifecycle of device plugins
on the client. It fingerprints, collects stats, and forwards Reserve
requests to the correct plugin. The manager, also handles device plugins
failing and validates their output.
Remove the NOMAD_TEST_RKT flag as a guard for rkt tests. Still require
Linux, root, and rkt to be installed. Only check for rkt installation
once in hopes of speeding up rkt tests a bit.
We make a HashiCorp hard fork of the jteeuwen/go-bindata hard fork that
was replaced and diffed the code against a Dec 1, 2015 copy of the
original repository we had as a cross-check of that hard fork.
This replaces references to jteeuwen/go-bindata to point to the
HashiCorp fork.
This commit reworks the Vagrantfile for Nomad in order to support
straightforward testing on more than one operating system, whilst
retaining the ability to stand up a test cluster running Ubuntu.
The following changes are made:
- Scripts have been extracted from the Vagrantfile into their own shell
script files, in order that editors lint them.
- All scripts have been edited to lint with no warnings or errors for
their respective shells.
- Scripts are named according to the operating system and privilege
level which they run. We prefer to run a whole shell script as root
versus prefixing (essentially) every command with `sudo` or an
equivalent.
- The Linux development box has been separated from the test cluster,
removing some of the more gnarly (and less portable) logic. The Linux
development box is still primary and autostarts.
- A FreeBSD target has been added. The base box works for both
Virtualbox and VMWare Fusion.
- A target is added to the GNUmakefile to stand up a test cluster, using
the default provider, or overriding the provider by setting the PROVIDER
variable in make:
- `make testcluster`
- `make testcluster PROVIDER=vmware_fusion`
- Machines in the test cluster have Avahi configured for zeroconf
discovery. Each machine can ping each other machine at `hostname.local`
- for example `nomad-server02.local`, `nomad-client03.local`.
This commit replaces the shell script-driven build process for Nomad
with one based around GNU Make (note we _do_ use GNU-specific
constructs), requiring no additional scripts for common cases of
development. The following targets are implemented:
Per-OS/arch combinations:
Binaries (Host - Mac OS X):
pkg/darwin_amd64/nomad
Binaries (Host - Linux):
pkg/linux_386/nomad
pkg/linux_amd64/nomad
pkg/linux_amd64-lxc/nomad
pkg/linux_arm/nomad
pkg/linux_arm64/nomad
pkg/windows_386/nomad
pkg/windows_amd64/nomad
Packages (Host - Mac OS X):
pkg/darwin_amd64.zip
Packages (Host - Linux):
pkg/linux_386.zip
pkg/linux_amd64.zip
pkg/linux_amd64-lxc.zip
pkg/linux_arm.zip
pkg/linux_arm64.zip
pkg/windows_386.zip
pkg/windows_amd64.zip
Phony targets:
dev - Builds for the current host GOOS/GOARCH (unless overriden
in the environment)
release - Builds all appropriate release packages for the
current host GOOS/GOARCH (i.e. Windows and Linux
packages on a Linux host, Darwin packages on an OSX
host)
generate - Generate code for the current host architecture using
`go generate`.
test - Runs the Nomad test suite
clean - Removes build artifacts
travis - Runs `make test` with the wrapper script to prevent
Travis CI from timing out.
help - Displays usage information about commonly used targets.
Note that there are some semantic differences from the previous version.
1. `generate` is no longer a dependency of `dev` builds. This is because
it causes a rebuild every time, even when no code has changed, since
`go generate` does not appear to leave file timestamps alone.
Regardless, it is insufficient to generate on one host OS - it needs
to be run on each target to ensure everything is generated correctly.
2. `gofmt` is no longer checked. This should be enabled as a linter once
the `gofmt -s` refactoring will pass on the whole code base, in order
to avoid special cased checks versus using go-metalinter.
Example Usages:
Make a development build for the current GOOS/GOARCH:
make dev
Make release build packages appropriate for the host OS:
make release
Update generated code for the host OS:
make generate
Run linting checks:
make check
Build a specific alternative GOOS/GOARCH/tags combination:
make pkg/linux_amd64-pkg/nomad
make pkg/linux_amd64-pkg.zip
Vault is required for the fingerprinting tests but is not currently
installed by the build process. This commit adds a new category of
external tools for test dependencies and `go get`'s them during the
bootstrap.
We also fix the syntax of the Makefile to use tabs throughout.
This PR fixes our vet script and fixes all the missed vet changes.
It also fixes pointers being printed in `nomad stop <job>` and `nomad
node-status <node>`.
The dev build is far simpler than the release build, so move it to its
own shell script. This simplifies the release build script slightly as
well at the cost of duplicating the version/tag logic.
Also don't even try to check for LXC if not running on Linux. I don't
think we want to try to support cross-compiling LXC from non-Linux
hosts.
We recently ran into an issue on a small percentage of nomad-clients
where the nomad-client was running successfully, but due to a race
condition, could not correctly bind to the docker socket. This caused
all of our nomad jobs to be allocated to a single nomad-client instead
of being spread evenly across our clients. The only way to discover this
was to run `nomad node-status <node>` and count each job allocation per
node.
This can lead to a fairly long debugging process if there are several
nomad-clients. Including the number of allocations for each node in the
`node-status` command would save a large amount of debug time.
```
jake@biscuits [12:08:41] [~]
-> % nomad node-status
ID Datacenter Name Class Drain Status Allocations
2b0aabc5 dc1 biscuits <none> false ready 0
```
```
jake@biscuits [12:08:55] [~]
-> % nomad node-status
ID Datacenter Name Class Drain Status Allocations
2b0aabc5 dc1 biscuits <none> false ready 1
```