The way the copying was happening on the alloc_runner was by temporarily
setting the alloc.Job to nil, copying and then restoring it. This
created an issue in which when the alloc was shared (which it is in
server/client mode and between alloc_runner/task_runner) there were race
conditions that could create a panic.
Fixes https://github.com/hashicorp/nomad/issues/2605
Errors here only occur if Consul is not running when Nomad is restarted.
Errors here are only an issue if:
* Consul is being used but is down or misbehaving
* The executor is old (<0.6)
* The task has services
* The services hit a pre-0.6 consul.Syncer bug
If all of those conditions are met the pre-0.6 bugs will persist for
this task until Nomad is restarted.
This PR removes deepcopying of the job attached to the allocation in the
alloc runner. This operation is called very often so removing reflect
from the code path and the potentially large number of mallocs need to
create a job reduced memory and cpu pressure.
Previously was interpolating the original task's services again.
Fixes#2180
Also fixes a slight memory leak in the new consul agent. Script check
handles weren't being deleted after cancellation.
Fixes#2478#2474#1995#2294
The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.
The Nomad client agent now handles all Consul operations instead of the
executor handling task related operations. When upgrading from an
earlier version of Nomad existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.
Drivers - other than qemu - now support an Exec method for executing
abritrary commands in a task's environment. This is used to implement
script checks.
Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul related behavior.
This change ensures that the client syncs allocation state with the
servers before entering its wait loop for the allocation to be
destroyed.
Fixes https://github.com/hashicorp/nomad/issues/2563
This PR takes the host ID and runs it through a hash so that it is well
distributed. This makes it so that machines that report similar host IDs
are easily distinguished.
Instances of similar IDs occur on EC2 where the ID is prefixed and on
motherboards created in the same batch.
Fixes https://github.com/hashicorp/nomad/issues/2534
Fixes#2525
We used to be checking a RequireTLS field that was never set. Instead we
can just check the TLSConfig.EnableRPC field and require TLS if it's
enabled.
Added a few unfortunately slow integration tests to assert the intended
behavior of misconfigured RPC TLS.
Also disable a lot of noisy test logging when -v isn't specified.
Fixes an issue where the Ruby runtime expects the sticky bit to be set
on the temp directory. The sticky bit is commonly set on the temp
directory since it is usually shared by many users. This change brings
ours in line with that assumption.
This PR adds tracking to when a task starts and finishes and the logs
API takes advantage of this and returns better errors when asking for
logs that do not exist.
This PR fixes token revocation and adds tests to make sure it is
working. The 0.5.6 RC1's token revocation does not work becasue the
token's value is captured at the instantiation of the deferred
stoprenewal statement rather than its exectution.
This PR allows accessing the Node's attributes and metadata as in a
template.
```
template {
data = "{{ env \"attr.unique.network.ip-address\" }}"
destination = "local/out"
}
```
This PR fixes an issue that is hit when running templates with restart
mode in which the client could panic when the handle is not running.
Fixes https://github.com/hashicorp/nomad/issues/2479
This PR fixes two issues:
* Folder permissions in -dev mode were incorrect and not suitable for
running as a particular user.
* Was not setting the group membership properly for the launched
process.
Fixes https://github.com/hashicorp/nomad/issues/2160
This PR:
* Uses Go 1.8 executable lookup
* Stores any err message from stats init method
* Allows overriding of Cpu Compute for hosts where it can't be detected
This PR introduces a parallelism limit during garbage collection. This
is used to avoid large resource usage spikes if garbage collecting many
allocations at once.
This PR adds the following metrics to the client:
client.allocations.migrating
client.allocations.blocked
client.allocations.pending
client.allocations.running
client.allocations.terminal
Also adds some missing fields to the API version of the evaluation.
This commit adds Solaris versions of the following functions:
- `linkDir`
- `unlinkDir`
- `createSecretDir`
- `removeSecretDir`
I believe this requires Go 1.8 in order to compile, as the unlink
syscall was previously missing.
Previously, this value was guarded against running on Windows
because it called the `uname` command which is unlikely to
be there.
This change now sets the value from gopsutil, which might
well be an empty string.
Signed-off-by: Dave Walker (Daviey) <email@daviey.com>
Previously with client fingerprinting, sys/exec's Command
function was being used to execute `uname -r` and the return
string processed into the kernel.version node attribute.
This change uses gopsutil/host KernelVersion function
instead. This means we can drop the os/exec, strings and
fmt imports... and not execute an external binary.
Signed-off-by: Dave Walker (Daviey) <email@daviey.com>
This PR fixes two issues:
1) A close of a nil stopCollection channel when restoring and prestart
fails. The failure will cause the killCh to be triggered which will
close collection before it has been initialized.
2) Fixes a deadlock in which the handleWaitCh is never triggered since
it is not initialized when there is an error in prestart and the killCh
is triggered.
Both fixes are by maintaining the loop invariant that the two channels
are valid after there is a handle.
This PR fixes our vet script and fixes all the missed vet changes.
It also fixes pointers being printed in `nomad stop <job>` and `nomad
node-status <node>`.
This PR introduces a coordinator for doing CRUD on a Docker image. It
should fix racy deletion of images. The issue before was images would be
deleted between prestart and start causing an error.
A change in the behavior of `os.Rename` in Go 1.8 brought to light a
difference in the logic between `{Alloc,Task}Runner` and this test:
AllocRunner builds the alloc dir, moves dirs if necessary, and then lets
TaskRunner call TaskDir.Build().
This test called `TaskDir.Build` *before* `AllocDir.Move`, so in Go 1.8
it failed to `os.Rename over` the empty {data,local} dirs.
I updated the test to behave like the real code, but I defensively added
`os.Remove` calls as a subtle change in call order shouldn't break this
code. `os.Remove` won't remove a non-empty directory, so it's still
safe.
This *might* be a fix for #2295 -- I've been unable to reproduce the
bug. However, this guard seems wise regardless. I should never be
overwriting an intialized created resources with a nil.
This PR fixes a race condition in which the client was not locked while
deriving Vault tokens. This allowed the token to be set which would
cause subsequent Vault requests to fail with permission denied because
the incorrect Vault token was being used.
Further this PR makes the unsetting and unlocking of the client atomic
to avoid an even harder to hit race condition (not sure it was ever hit
but was still incorrect).