This uses an alternative approach: we avoid restoring the alloc runner in the
first place if we suspect that the alloc may already have been completed.
This commit aims to help users whose clients are susceptible to the
destroyed-alloc-gets-restarted bug upgrade to the latest version. Without this,
such users would have their tasks run unexpectedly on upgrade and would only
see the bug resolved after a subsequent restart.
If, on restore, the client sees a pending alloc without any other persisted
info, it errs on the side of treating it as corrupt persisted alloc state
rather than assuming the client happened to be killed right when the alloc was
assigned to it.
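Roughly, that check looks like the following sketch (the names and types are
illustrative stand-ins, not the actual client code):

```go
package main

import "fmt"

// restoredAlloc is a hypothetical stand-in for what the client reads back
// from its state DB for a single allocation.
type restoredAlloc struct {
	ClientStatus string            // e.g. "pending", "running", "complete"
	TaskStates   map[string]string // persisted per-task state, if any
}

// suspectCorrupt reports whether a restored alloc has the suspicious
// "pending with nothing else persisted" shape described above, in which
// case the client errs on the side of not restoring the alloc runner.
func suspectCorrupt(a restoredAlloc) bool {
	return a.ClientStatus == "pending" && len(a.TaskStates) == 0
}

func main() {
	fmt.Println(suspectCorrupt(restoredAlloc{ClientStatus: "pending"})) // true: skip restore
	fmt.Println(suspectCorrupt(restoredAlloc{
		ClientStatus: "pending",
		TaskStates:   map[string]string{"web": "running"},
	})) // false: restore normally
}
```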
A few reasons motivate this behavior:
Statistically speaking, corruption is the more likely cause: a long-running
client has a higher chance of having allocs incorrectly persisted in the
pending state, while being killed right when an alloc is about to start is
relatively unlikely.
Also, delaying the start of an alloc that hasn't started yet (hopefully by
seconds) is not as severe as launching too many allocs that may bring the
client down.
More importantly, this helps customers upgrade their clients without risking
taking the clients down and destabilizing their cluster. We don't want
existing users to be forced into triggering the bug while they upgrade and
restart their cluster.
Protect against a race between the destroy and persist-state goroutines.
The downside is that the database IO operation runs while the lock is held and
may take arbitrarily long. The risk of holding the lock that long is slow
destruction, but slow IO has bigger problems anyway.
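A minimal sketch of that locking discipline, with hypothetical names and a
sleep standing in for the state DB write:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// allocRunner is a hypothetical stand-in; only the locking pattern matters.
type allocRunner struct {
	stateLock sync.Mutex
}

// persistState performs the state DB write while holding stateLock, so it
// cannot interleave with destroy. The cost is that the lock is held for the
// full duration of the IO.
func (ar *allocRunner) persistState() error {
	ar.stateLock.Lock()
	defer ar.stateLock.Unlock()
	time.Sleep(10 * time.Millisecond) // stand-in for the state DB write
	return nil
}

// destroy serializes with persistState by taking the same lock, so the two
// goroutines never interleave.
func (ar *allocRunner) destroy() {
	ar.stateLock.Lock()
	defer ar.stateLock.Unlock()
	fmt.Println("runner destroyed")
}

func main() {
	ar := &allocRunner{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = ar.persistState()
	}()
	ar.destroy()
	wg.Wait()
}
```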
The nodes API documentation is fairly out of date. Here I've updated the
entire example response based on a local dev agent, rather than explicitly
adding new fields, to bring us up to the current API shape.
This fixes a bug where allocs that have been GCed get re-run after the client
is restarted. A heavily-used client may launch thousands of allocs on startup
and get killed.
The bug is that an alloc runner that gets destroyed due to GC remains in the
client's alloc runner set. It periodically gets persisted until the alloc is
GCed by the server. During that time, the client DB contains the alloc but
neither its individual task statuses nor its completed state. On client
restart, the client assumes the alloc is in the pending state and re-runs it.
Here, we fix it by ensuring that destroyed alloc runners don't persist their
alloc to the state DB.
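A sketch of the guard (again with illustrative names; the real runner tracks
far more state than shown):

```go
package main

import "fmt"

// allocRunner is a hypothetical stand-in for the client's per-alloc runner.
type allocRunner struct {
	allocID   string
	destroyed bool // set once the alloc has been GCed and the runner torn down
}

// saveAllocState is what the client would call periodically for every runner
// it still tracks. The fix is the early return: a destroyed runner must never
// write its alloc back to the state DB, otherwise a restart sees a bare
// "pending" alloc and re-runs it. (The real code also synchronizes this check
// with destruction, per the locking note above.)
func (ar *allocRunner) saveAllocState(writeToDB func(allocID string) error) error {
	if ar.destroyed {
		return nil
	}
	return writeToDB(ar.allocID)
}

func main() {
	write := func(id string) error { fmt.Println("persisting", id); return nil }
	live := &allocRunner{allocID: "alloc-1"}
	gone := &allocRunner{allocID: "alloc-2", destroyed: true}
	_ = live.saveAllocState(write) // persisting alloc-1
	_ = gone.saveAllocState(write) // no DB write for the destroyed runner
}
```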
This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information non-transactionally and
non-atomically, concurrently with an alloc runner that is running and
potentially changing state, is a recipe for bugs.
Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
We currently use a container image for the `test-devices` job only, while all
other jobs use the machine executor.
This allows us to switch Go and protoc versions easily without manually
managing Docker images (which requires building them by hand on a dev machine,
etc.). Meanwhile, we already install dependencies on every build in all other
jobs.
`test-devices` is now one of the fastest jobs and isn't a constraint or a
bottleneck, so increasing its overhead by a few seconds doesn't hurt overall
developer iteration.
If we split tests effectively later, we can revisit.
Fixes a bug where CPU is pegged at 100% while collecting device statistics.
The passed stats interval was ignored, and the default zero value caused a
very tight stats collection loop.
FWIW, in my testing it took 2.5-3ms to collect NVIDIA GPU stats on a
`g2.2xlarge` EC2 instance.
The stats interval defaults to 1 second and is user configurable. I believe
this is too frequent as a default, and I may advocate for reducing it to a
value closer to 5s or 10s, but I'm keeping it as is for now.
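For illustration, the intended loop shape looks roughly like this (names are
made up, not the actual device plugin code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// defaultStatsInterval matches the documented 1s default.
const defaultStatsInterval = 1 * time.Second

// collectDeviceStats loops collecting device stats. The important part is
// honoring the caller-provided interval and falling back to the default when
// it is zero; with a zero interval the loop spins and pins a CPU at 100%.
func collectDeviceStats(ctx context.Context, interval time.Duration, collect func()) {
	if interval <= 0 {
		interval = defaultStatsInterval
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			collect() // e.g. query the nvidia GPU stats (~2.5-3ms per call in my testing)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	collectDeviceStats(ctx, 0, func() { fmt.Println("collecting device stats") })
}
```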
Fixes https://github.com/hashicorp/nomad/issues/6057 .
The dev mode flag for Connect was binding to the default interface's IP, but
this makes for a bad user experience with the CLI, which defaults to
127.0.0.1. If we bind to 0.0.0.0 instead, the CLI works without further
configuration by the user.
* adds meta object to service in job spec, sends it to consul
* adds tests for service meta
* fix tests
* adds docs
* better hashing for service meta; use a helper for copying meta when registering the service (see the sketch below)
* tried to be DRY, but it looks like it would be more work to use the
helper function
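For illustration only, copying the meta and folding it into the service hash
can look roughly like this (not Nomad's actual helpers):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// copyMeta defensively copies a service's meta map before handing it to the
// Consul registration, so later mutations don't leak into the registration.
func copyMeta(m map[string]string) map[string]string {
	if m == nil {
		return nil
	}
	c := make(map[string]string, len(m))
	for k, v := range m {
		c[k] = v
	}
	return c
}

// hashMeta folds the meta map into a hash in deterministic (sorted-key)
// order, so identical meta always hashes the same and any meta change is
// detected as a change to the registered service.
func hashMeta(meta map[string]string) uint64 {
	h := fnv.New64a()
	keys := make([]string, 0, len(meta))
	for k := range meta {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(meta[k]))
	}
	return h.Sum64()
}

func main() {
	meta := map[string]string{"version": "1.2.3", "team": "platform"}
	fmt.Println(hashMeta(copyMeta(meta)))
}
```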
This removes the in-repository Netlify configuration. There are now two
sites backed by the repository, so we must use the web UI to
control the build settings, as having the configuration in-repository
overrides the web UI settings.
The build settings for the two sites are below, as of this commit. Note the
extra step in the nomad-ui site's build command that copies the _redirects
file to the correct destination so requests are properly forwarded when you
visit the deployment.
nomad-ui:
base directory: ui
build command: ember build && mkdir -p ui-dist/ui && mv dist/* ui-dist/ui/ && cp ../.netlify/ui-redirects ui-dist/_redirects
publish directory: ui/ui-dist
nomad-website:
base directory: website
build command: bundle exec middleman build
publish directory: website/build
Fixes #6041
Unlike all other Consul operations, bootstrapping requires Consul to be
available. This PR retries the Consul operation 3 times with a backoff to
account for group services being registered with Consul asynchronously.
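A minimal sketch of the retry shape (the 3 attempts come from the description
above; the function names and backoff schedule are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// bootstrapWithRetry tries the Consul-dependent bootstrap operation a few
// times with a growing backoff, to give asynchronously registered group
// services time to show up in Consul.
func bootstrapWithRetry(bootstrap func() error) error {
	const attempts = 3
	backoff := 250 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		if err = bootstrap(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("consul bootstrap failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := bootstrapWithRetry(func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("group service not yet registered")
		}
		return nil
	})
	fmt.Printf("succeeded after %d calls, err=%v\n", calls, err)
}
```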
This fixes a frequent failure in `test-rkt` jobs where dpkg installation
fails.
The currently used image, circleci/classic:201808-01, accidentally has
unattended upgrades enabled, which run on every build. This means that tools
get modified unexpectedly during builds, and apt-get commands may fail because
the unattended upgrade is holding the package database lock.
This updates the `test-rkt` job only because the new image breaks the
`test-docker` job (e.g. https://circleci.com/gh/hashicorp/nomad/2641 ), and I
punted on investigating test-docker to another day.