Protect against a race between the destroy and persist-state
goroutines.
The downside is that the database I/O operation runs while holding
the lock and may run indefinitely. The risk of holding the lock for a
long time is slow destruction, but slow I/O causes bigger problems anyway.
The nodes API documentation is fairly out of date. Here I've updated the
entire response based on a local dev agent, rather than explicitly
adding new fields to bring us up to the current API shape.
This fixes a bug where allocs that have been GCed get re-run after the client
is restarted. A heavily used client may launch thousands of allocs on startup
and get killed.
The bug is that an alloc runner that gets destroyed due to GC remains in
the client's alloc runner set. Such runners keep getting persisted
periodically until the alloc is GCed by the server. During that time, the
client DB contains the alloc but neither its individual task statuses nor
its completed state. On client restart, the client assumes the alloc is in
the pending state and re-runs it.
Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.
This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information non-transactionally and
non-atomically, concurrently with an alloc runner that is running and
potentially changing state, is a recipe for bugs.
Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
We currently use a container image for the `test-devices` job only, while
all other jobs use the machine executor.
Using the machine executor for `test-devices` as well lets us switch golang
and protoc versions easily without managing Docker images by hand (which
requires building them manually on a dev machine, etc.). Meanwhile, all
other jobs already install dependencies on every build.
`test-devices` is now one of the fastest jobs and isn't a constraint or
a bottleneck, so increasing its overhead by a few seconds doesn't hurt
overall developer iteration.
If we split tests effectively later, we can revisit.
Fixes a bug where the CPU is pegged at 100% due to collecting device
statistics. The passed stats interval was ignored, and the default zero
value caused a very tight stats collection loop.
FWIW, in my testing, it took 2.5-3ms to collect NVIDIA GPU stats on a
`g2.2xlarge` EC2 instance.
The stats interval defaults to 1 second and is user configurable. I
believe this is too frequent as a default, and I may advocate for
collecting less often, with an interval closer to 5s or 10s, but I'm
keeping it as is for now.
Fixes https://github.com/hashicorp/nomad/issues/6057 .
The dev mode flag for Connect was binding to the default interface's
IP, but this makes for a bad user experience with the CLI, which
defaults to 127.0.0.1. If we bind to 0.0.0.0 instead, the CLI works
without further configuration by the user.
* adds meta object to service in job spec, sends it to Consul
* adds tests for service meta
* fix tests
* adds docs
* better hashing for service meta, use helper for copying meta when registering service
* tried to be DRY, but it looks like it would be more work to use the
  helper function
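As an illustration of why hashing a meta map needs care: Go map iteration order is random, so a stable hash has to visit keys in sorted order. The sketch below uses a hypothetical `hashMeta` helper with length-prefixed fields; it is not the hashing code from this change:

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"sort"
)

// hashMeta returns a hex SHA-1 digest of a string map that does not depend
// on map iteration order. Keys and values are length-prefixed so that
// {"ab": "c"} and {"a": "bc"} cannot produce the same byte stream.
func hashMeta(meta map[string]string) string {
	keys := make([]string, 0, len(meta))
	for k := range meta {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha1.New()
	for _, k := range keys {
		v := meta[k]
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(v), v)
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := map[string]string{"version": "1", "team": "infra"}
	b := map[string]string{"team": "infra", "version": "1"}
	c := map[string]string{"team": "infra", "version": "2"}

	fmt.Println(hashMeta(a) == hashMeta(b)) // true
	fmt.Println(hashMeta(a) == hashMeta(c)) // false
}
```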
This removes the in-repository Netlify configuration. There are now two
sites backed by the repository, so we must use the web UI to
control the build settings, as having the configuration in-repository
overrides the web UI settings.
The build settings for the two sites are below, as of this commit. Note
the extra step in the nomad-ui site’s build command that copies the
_redirects file to the correct destination so that requests are properly
forwarded when you visit the deployment.
nomad-ui:
  base directory: ui
  build command: ember build && mkdir -p ui-dist/ui && mv dist/* ui-dist/ui/ && cp ../.netlify/ui-redirects ui-dist/_redirects
  publish directory: ui/ui-dist

nomad-website:
  base directory: website
  build command: bundle exec middleman build
  publish directory: website/build
Fixes #6041
Unlike all other Consul operations, bootstrapping requires that Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
This fixes a frequent failure in `test-rkt` jobs where dpkg installation
fails.
The image currently used, circleci/classic:201808-01, accidentally has
unattended upgrades enabled, and they run on every build. This means
that tools get modified unexpectedly during builds, and apt-get commands
may fail because the unattended upgrade is holding the package database lock.
This updates the `test-rkt` job only, because the new image breaks the
`test-docker` job (e.g. https://circleci.com/gh/hashicorp/nomad/2641 ),
and I punted on investigating test-docker for another day.
When a client declares a volume as ReadOnly, we should only use it to
satisfy requests for ReadOnly volumes. With this change, if a host
exposes a read-only volume, we validate that all group-level requests
for that volume are read-only before placing the group on that host.
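The check can be sketched roughly as follows. The types `hostVolume` and `volumeRequest` and the function `feasible` are simplified, hypothetical stand-ins and not the scheduler's actual structures:

```go
package main

import "fmt"

// hostVolume is a hypothetical view of a volume exposed by a client.
type hostVolume struct {
	ReadOnly bool
}

// volumeRequest is a hypothetical group-level request for a host volume.
type volumeRequest struct {
	Source   string // name of the host volume being requested
	ReadOnly bool
}

// feasible reports whether a host can satisfy every request. A read-only
// host volume only satisfies read-only requests; a writable host volume
// satisfies both kinds.
func feasible(host map[string]hostVolume, reqs []volumeRequest) bool {
	for _, req := range reqs {
		vol, ok := host[req.Source]
		if !ok {
			return false // host does not expose the volume at all
		}
		if vol.ReadOnly && !req.ReadOnly {
			return false // write access requested on a read-only volume
		}
	}
	return true
}

func main() {
	host := map[string]hostVolume{
		"certs": {ReadOnly: true},
		"data":  {ReadOnly: false},
	}
	fmt.Println(feasible(host, []volumeRequest{{Source: "certs", ReadOnly: true}}))  // true
	fmt.Println(feasible(host, []volumeRequest{{Source: "certs", ReadOnly: false}})) // false
	fmt.Println(feasible(host, []volumeRequest{{Source: "data", ReadOnly: false}}))  // true
}
```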
This adds a job to test the UI on CircleCI, including the sort of branch
pattern-matching from #5839, so that .-ui/ branches run only that job
and not the non-UI ones.
I considered having an entire workflow for UI, which could have separate
jobs for linting vs Ember tests, but the lint commands take so little time
that it didn’t seem worth it.
There’s no use of nvm to change the Node version as the Docker image
is what controls that. It’s annoying to have to update the version in multiple
places, but probably infrequent.
Adds a check for differences in `job.Diff` so that task group networks
and services, including new Consul connect stanzas, show up in the job
plan outputs.