Nick Ethier
77af17efbc
client/driver: add seperate handler for emitting pull progress
2018-05-07 12:17:34 -04:00
Nick Ethier
0bdd976b7d
client/driver: remove pull timeout due to race condition that can lead to unexpected timeouts
...
If two jobs are pulling the same image simultaneously, which ever starts the pull first will set the pull timeout.
This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer timeout
causing the pull to potentially timeout much sooner than expected by the second job.
2018-05-07 12:18:11 -04:00
Nick Ethier
7c5821d7c6
client/driver: do accounting on layer pull progress
2018-05-07 12:17:53 -04:00
Nick Ethier
8efda7dc6c
client/driver: emit progress to all allocs pulling same image
2018-05-07 12:17:34 -04:00
Nick Ethier
e35948ab91
client/driver: add image pull progress monitoring
2018-05-07 12:17:38 -04:00
Michael Schurter
0d534d30d6
Merge pull request #4251 from hashicorp/f-grpc-checks
...
Support Consul gRPC Health Checks
2018-05-04 14:55:16 -07:00
Michael Schurter
f6a4713141
consul: make grpc checks more like http checks
2018-05-04 11:08:11 -07:00
Michael Schurter
382caec1e1
consul: initial grpc implementation
...
Needs to be more like http.
2018-05-04 11:08:11 -07:00
Jesus Vazquez
08a390448b
Update counter driver.docker.oom labels
2018-05-04 14:02:34 +08:00
Jesus Vazquez
4f6db56283
Initialize dockerhandle with jobname, taskgroupname, taskname and allocid
2018-05-04 14:02:19 +08:00
Jesus Vazquez
127b764dfb
Add Job, taskgroupname, taskname, and allocid to the DockerHandle struct
2018-05-04 14:01:26 +08:00
Jesus Vazquez
fd1ff1a0cf
Run goimports
2018-05-04 13:46:36 +08:00
Jesus Vazquez
5dd4059527
Add driver.docker counter metric for OOM Killer events
2018-05-04 13:46:36 +08:00
Michael Schurter
526af6a246
framer: fix early exit/truncation in framer
2018-05-02 10:46:16 -07:00
Michael Schurter
f1a6aa103a
framer: fix race and remove unused error var
...
In the old code `sending` in the `send()` method shared the Data slice's
underlying backing array with its caller. Clearing StreamFrame.Data
didn't break the reference from the sent frame to the StreamFramer's
data slice.
2018-05-02 10:46:16 -07:00
Michael Schurter
7360fe3a6d
client: squelch errors on cleanly closed pipes
2018-05-02 10:46:16 -07:00
Michael Schurter
ffff97e25f
client: don't spin on read errors
2018-05-02 10:46:16 -07:00
Michael Schurter
5ef0a82e6e
client: reset encoders between uses
...
According to go/codec's docs, Reset(...) should be called on
Decoders/Encoders before reuse:
https://godoc.org/github.com/ugorji/go/codec
I could find no evidence that *not* calling Reset() caused bugs, but
might as well do what the docs say?
2018-05-02 10:46:16 -07:00
Alex Dadgar
de4af37249
version bump and remove generated
2018-04-27 11:10:00 -07:00
Alex Dadgar
845a43864a
generated files
2018-04-27 10:45:40 -07:00
Alex Dadgar
35e06ddb31
Remove generated and version bump
2018-04-26 16:49:19 -07:00
Alex Dadgar
43192cefae
generated files
2018-04-26 16:28:58 -07:00
Michael Schurter
0e602d4779
Merge pull request #4188 from hashicorp/f-rkt-stats
...
rkt: create parent cgroup to enable stats
2018-04-24 14:54:36 -07:00
Michael Schurter
d687761ebf
rkt: test Stats() and always run tests
...
Remove the NOMAD_TEST_RKT flag as a guard for rkt tests. Still require
Linux, root, and rkt to be installed. Only check for rkt installation
once in hopes of speeding up rkt tests a bit.
2018-04-24 11:05:42 -07:00
Javier Palomo Almena
3e6c01ffa1
docker tests: Fix usage of NewDriverContext
2018-04-23 22:51:06 +02:00
Javier Palomo Almena
74d3c5df07
DriverContext: Add the TaskGroup and the Job name
...
Adding this fields to the DriverContext object, will allow us to pass
them to the drivers.
An use case for this, will be to emit tagged metrics in the drivers,
which contain all relevant information:
- Job
- TaskGroup
- Task
- ...
Ref: https://github.com/hashicorp/nomad/pull/4185
2018-04-23 00:15:29 +02:00
Michael Schurter
4cee6cca6c
rkt: create parent cgroup to enable stats
...
Having the Nomad executor create parent cgroups that rkt is launched
within allows the stats collection code used for the exec driver to Just
Work. The only downside is that now the Nomad executor's resource
utilization counts against the cgroups resource limits just as it does
for the exec driver.
2018-04-19 15:14:56 -07:00
Michael Schurter
1a85d0c990
run goimports
2018-04-19 11:16:28 -07:00
Michael Schurter
d77c265d1f
Merge pull request #4168 from ninoles/b-2117-windows-group-process
...
B 2117 windows group process
2018-04-19 11:10:51 -07:00
Michael Schurter
fdbcbd4e5b
Merge pull request #4058 from hashicorp/f-mock-by-default
...
[Post-0.8] test: build with mock_driver by default
2018-04-18 15:57:00 -07:00
Michael Schurter
d3650fb2cd
test: build with mock_driver by default
...
`make release` and `make prerelease` set a `release` tag to disable
enabling the `mock_driver`
2018-04-18 14:45:33 -07:00
Michael Schurter
a991923389
tests: fix race in alloc_runner_test.go
...
I could not reproduce the failure locally even with `stress -cpu ...`
eating all the cpu it could on my machine.
But I think the race was in one of two places:
* The task could restart which could create new events
* I think there could be a race between the updater's version of events
and alloc runners as updates are async
I fixed both. Here's hoping that fixes this flaky test.
2018-04-17 17:14:59 -07:00
Fabien Ninoles
c81bec48c9
Merge branch 'master' into b-2117-windows-group-process
2018-04-17 13:47:25 -04:00
Fabien Ninoles
35cf641416
Update based on PR request.
2018-04-17 13:43:04 -04:00
Alex Dadgar
c4ad76091d
Merge pull request #4166 from hashicorp/b-panic-fix-update
...
Fixes races accessing node and updating it during fingerprinting
2018-04-17 10:02:19 -07:00
Chelsea Holland Komlo
9b8a079558
fix up comments
2018-04-17 11:53:08 -04:00
Alex Dadgar
9d612c8cb0
Cleanup
2018-04-16 15:48:34 -07:00
Alex Dadgar
32adaf9dfc
Copy the config given to the alloc runner
2018-04-16 15:45:52 -07:00
Alex Dadgar
3ff2d4d795
fix race node access
2018-04-16 15:45:51 -07:00
Alex Dadgar
4f2a7b6949
Fix copying drivers
2018-04-16 15:45:51 -07:00
Alex Dadgar
0b799822ff
Operate on copy
2018-04-16 15:45:49 -07:00
Fabien Ninoles
27cf4995ce
- Clean up for windows compilation.
...
- Set CREATE_NEW_PROCESS_GROUP for Windows subprocess.
- Ensure we only kill actual process that need to.
2018-04-14 13:58:42 -04:00
Michael Schurter
3836b8a335
Merge pull request #3572 from emate/master
...
Create new process group on process startup.
2018-04-13 11:56:38 -07:00
Alex Dadgar
adaf4fa7e0
Remove generated structs
2018-04-12 16:35:31 -07:00
Alex Dadgar
663c4d0433
Version bump and generated files
2018-04-12 16:21:50 -07:00
Alex Dadgar
ff1a1a63e8
Move where attribute for driver detection is set
2018-04-12 15:50:25 -07:00
Chelsea Holland Komlo
5291788b40
delete driver name from only health check attributes
2018-04-12 18:24:41 -04:00
Alex Dadgar
3d53d380f7
Fix tests
2018-04-12 14:29:30 -07:00
Alex Dadgar
f24ce2c50c
Driver health detection cleanups
...
This PR does:
1. Health message based on detection has format "Driver XXX detected"
and "Driver XXX not detected"
2. Set initial health description based on detection status and don't
wait for the first health check.
3. Combine updating attributes on the node, fingerprint and health
checking update for drivers into a single call back.
4. Condensed driver info in `node status` only shows detected drivers
and make the output less wide by removing spaces.
2018-04-12 12:46:40 -07:00
Charlie Voiselle
ba88f00ccb
Changed "til" to "until"
...
Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.
2018-04-11 12:36:28 -05:00
Andrei Burd
502d17fa90
Added node class to tagged metrics
2018-04-11 12:20:59 +03:00
Chelsea Komlo
eb5aac16e6
Merge pull request #4111 from hashicorp/b-undetected-set-health-to-false
...
Immediately set driver health status to false when driver moves to undetected
2018-04-10 18:30:31 -04:00
Chelsea Holland Komlo
d58b3e473c
update comment for when the fingerprinter setting health status
2018-04-10 16:53:00 -04:00
Chelsea Holland Komlo
f7ef13cc64
fingerprinter should set health check status if health check is not periodic
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo
ede4f518bd
add setters for access to the fingerprint manager's node
...
refactor extracting driver info
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo
f479da19f5
guard against overwriting health status
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo
ece1618815
immediately set healthy to false when driver moves to undetected
2018-04-10 15:29:51 -04:00
Alex Dadgar
3d367d6fd7
Fix client uptime metric missing client prefix
2018-04-10 10:39:36 -07:00
Seth Vargo
df4fe7e76c
Set user-agent when talking to GCE metadata
2018-04-10 10:36:46 -04:00
Chelsea Komlo
d3bd8fb96e
Merge pull request #4109 from hashicorp/f-shorten-docker-health-timeout
...
Shorten docker health timeout
2018-04-09 15:38:39 -04:00
Chelsea Holland Komlo
ea4b65dd41
only initialize docker clients if they are nil
2018-04-09 14:13:07 -04:00
Chelsea Holland Komlo
288c7a33a1
refacotoring simplification from code review
2018-04-09 10:34:17 -04:00
Chelsea Holland Komlo
6e3b056c37
only run health check if driver moves from undetected to detected
2018-04-09 10:10:43 -04:00
Alex Dadgar
ae1f76477e
Start rebalance after discovering new servers
2018-04-05 15:41:59 -07:00
Alex Dadgar
929b6823a3
Merge pull request #4106 from hashicorp/b-servers
...
Improved Client handling of failed RPCs
2018-04-05 13:48:50 -07:00
Alex Dadgar
be2513e0f9
more jitter
2018-04-05 13:48:33 -07:00
Chelsea Holland Komlo
d3637825ef
group similar functions; update comments
...
health check timeout should be 1 minute
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo
e8743f1f7b
remove do once block when creating a new docker client
...
only set cached connections upon no error
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo
d0d793fc23
use client with shorter timeouts for health checks
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo
5d1b2b77cb
refactor docker clients method to be able to extend to creating new clients
2018-04-05 16:19:02 -04:00
Alex Dadgar
bd3345942c
Handle no leader and faster retries near limit
...
Handle the ErrNoLeader case and apply slower retries. Also when we have
missed the heartbeat retry aggressively, backing off after we have
missed for more than 30 seconds.
2018-04-05 11:22:47 -07:00
Alex Dadgar
279b5c22e5
Scale heartbeat retrying based on remaining heartbeat time
2018-04-05 10:58:13 -07:00
Alex Dadgar
7941f4eb2d
Fire retry only when consul discovers new servers
2018-04-05 10:40:17 -07:00
Preetha
6254d75eee
Merge pull request #4101 from hashicorp/b-rescheduling-edge-fixes
...
Fixes edge cases around timing/ task finish time being set more than once
2018-04-04 16:18:21 -05:00
Preetha Appan
12ba4c45da
remove outdated commented out test code
2018-04-04 15:03:24 -05:00
Preetha Appan
6363a6fb4d
Remove old comment
2018-04-04 15:01:48 -05:00
Preetha Appan
5e4525bd30
Moves setting finishedAt to the right place and adds two unit tests.
2018-04-04 14:38:15 -05:00
Alex Dadgar
86c32358d4
Spelling error
2018-04-03 18:30:01 -07:00
Alex Dadgar
01a6beafbf
RPC Retry Watcher
2018-04-03 18:05:28 -07:00
Preetha Appan
e6bbce3fa0
Add comment
2018-04-03 19:49:03 -05:00
Alex Dadgar
ec844f19d9
randomize servers
2018-04-03 17:46:13 -07:00
Preetha Appan
00537c739b
Fixes edge cases around timing and task finish time being set more than once
2018-04-03 16:34:59 -05:00
Alex Dadgar
58a3ec3fb2
Improve Vault error handling
2018-04-03 14:29:22 -07:00
Alex Dadgar
86f9044676
remove generated files
2018-03-30 16:52:49 -07:00
Alex Dadgar
af81349dbe
Generated files
2018-03-30 16:14:40 -07:00
Michael Schurter
257ba5937d
test: don't rely on alloc runner update count
...
We were incorrectly relying on the count of alloc updates in a number of
tests. Since alloc updates are async, their number is non-determinstic
and largely meaningless.
This should fix quite a few flaky tests in Travis and prevent future
mistaken assumptions in tests.
2018-03-30 09:34:33 -07:00
Michael Schurter
62e9553333
Merge pull request #4069 from hashicorp/f-hashealth
...
add HasHealth helper for nil checks
2018-03-29 17:03:20 -07:00
Alex Dadgar
beee130a6e
Always capture the finish time
2018-03-29 11:27:22 -07:00
Michael Schurter
91b5bb58d9
add HasHealth helper for nil checks
...
We performed the DeploymentStatus nil checks a couple different ways, so
hopefully this helper will consoldiate them and make it more clear what
the code is doing.
2018-03-29 09:29:19 -07:00
Chelsea Komlo
4338360da9
Merge pull request #4065 from hashicorp/emit-node-event-on-first-health-change
...
Emit first node event after initialization on health status change
2018-03-29 11:23:25 -04:00
Chelsea Holland Komlo
2174ede6b9
add clarifying comment
2018-03-29 10:58:39 -04:00
Michael Schurter
3a79c32677
Merge pull request #4059 from hashicorp/b-drain-health-svc-only
...
only service allocs should have health watched
2018-03-28 16:49:22 -07:00
Michael Schurter
5eb0cb7176
only service allocs should have health watched
2018-03-28 16:20:11 -07:00
Chelsea Holland Komlo
e3319afee1
emit first node event
2018-03-28 17:26:53 -04:00
Chelsea Komlo
7812ac5abf
Merge pull request #4057 from hashicorp/specify-docker-msg
...
Specify docker name in driver health messages
2018-03-28 13:32:36 -04:00
Preetha
177d2d6010
Merge pull request #4052 from hashicorp/f-specify-total-memory
...
Allow to specify total memory on agent configuration
2018-03-28 12:28:41 -05:00
Chelsea Holland Komlo
efc03e252c
specify driver health messages
2018-03-28 11:35:21 -04:00
Preetha Appan
329428b49f
Code review feedback and unit test
2018-03-28 10:07:15 -05:00
Charlie Voiselle
ea10588227
rkt: logging enhancements ( #4044 )
...
* Added extra debug logging; extended timeout; added jitter.
* small log changes
* increase timeout
* remove unneccessary uuid
2018-03-27 17:30:06 -07:00
Michael Schurter
fcaee471a0
client: always mark exited sys/svc allocs as failed
...
When restarts.attempts=0 was set in a jobspec a system or service alloc
that exited with 0 status would be marked as `completed` instead of
`failed`. Since system and service jobs are intended to run until
stopped or updated, they should always be marked as failed when they
exit even in cases where the exit code is 0.
2018-03-27 14:30:19 -07:00
Mildred Ki'Lya
1017cbe8ab
Allow to specify total memory on agent configuration
...
Allow to set the total memory of an agent in its configuration file. This
can be used in case the automatic detection doesn't work or in specific
environments when memory overcommit (using swap for example) can be
desirable.
2018-03-27 15:46:18 -05:00
Chelsea Holland Komlo
003bc209b9
use time.Time for node events for compatibility
2018-03-27 15:43:57 -04:00
Alex Dadgar
432784dae3
Fix alloc watcher snapshot streaming
2018-03-27 11:14:53 -07:00
Alex Dadgar
05449fea09
drop stats fetching log
2018-03-23 12:01:50 -07:00
Chelsea Komlo
5f0c382021
Merge pull request #4030 from hashicorp/health-check-ux
...
UX improvments to driver health checks
2018-03-23 09:46:50 -04:00
Alex Dadgar
da27fc3880
Driver Info output
2018-03-22 17:18:32 -07:00
Chelsea Holland Komlo
e9005d8cfb
ux improvments to driver health checks
2018-03-22 18:38:29 -04:00
Michael Schurter
a318684738
Merge pull request #4022 from hashicorp/f-more-executor-logging
...
executor: increase level for helpful log lines
2018-03-22 15:21:20 -07:00
Michael Schurter
a4f346abeb
remove spurious TODOs and FIXMEs
2018-03-21 16:55:22 -07:00
Michael Schurter
8b346c6176
test: try to prevent flakiness on travis
2018-03-21 16:51:45 -07:00
Michael Schurter
1b7ac447e9
alloc_runner: watch health for deployed batch jobs
2018-03-21 16:51:45 -07:00
Michael Schurter
62960ed7bd
client: don't monitor health of non-service jobs
...
Also fix system job draining; won't work without deadline fixes
2018-03-21 16:51:44 -07:00
Alex Dadgar
a37329189a
Improve DeadlineTime helper
2018-03-21 16:51:44 -07:00
Alex Dadgar
db4a634072
RPC, FSM, State Store for marking DesiredTransistion
...
fix build tag
2018-03-21 16:49:48 -07:00
Michael Schurter
bb0ff44fb4
mock_driver: improve Kill() logging
2018-03-21 16:49:48 -07:00
Michael Schurter
c0542474db
drain: initial drainv2 structs and impl
2018-03-21 16:49:48 -07:00
Chelsea Holland Komlo
f329e45e03
always set initial health status for every driver
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
bbaffe3eca
set driver to unhealthy once if it cannot be detected in periodic check
2018-03-21 15:15:26 -04:00
Alex Dadgar
5df4b3728d
Docker driver doesn't return errors but injects into the DriverInfo
2018-03-21 15:15:26 -04:00
Alex Dadgar
4365bb7f59
Only run health check if driver is detected
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
f801709a0a
fix issue when updating node events
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
285729aee2
function rename and re-arrange functions in fingerprint_manager
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
60f12d206f
improve comments; update watchDriver
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
739784736a
remove unused function
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
d92703617c
simplify logic
...
bump log level
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
86b7b3d2d9
fix up health check logic comparison; add node events to client driver checks
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
53a5bc2bb3
Code review feedback
2018-03-21 15:15:26 -04:00
Alex Dadgar
34dc58421c
notes from walk through
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
44b6951dda
improve tests
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
d740a6a46e
refresh driver information for non-health checking drivers periodically
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
d8f68e5ef8
fix up codereview feedback
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
d5f6c940c4
fix up racy tests
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
0425be8f48
updating comments; locking concurrent node access
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
c50d02ae93
go style; update comments
2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo
3aa726baab
fix scheduler driver name; create node structs file
2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo
3cba95e8a7
allow nomad to schedule based on the status of a client driver health check
...
Slight updates for go style
2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo
0bde357731
add concept of health checks to fingerprinters and nodes
...
fix up feedback from code review
add driver info for all drivers to node
2018-03-21 15:15:25 -04:00
Michael Schurter
1022170bf3
executor: increase level for helpful log lines
...
Should help with debugging issues like #3971
2018-03-21 11:53:58 -07:00
Marcin Matlaszek
6019a88824
Make raw_exec processes cleanup function more precise.
2018-03-20 13:40:21 +01:00
Marcin Matlaszek
bb36c122e2
Fix errors when trying to kill whole process group.
2018-03-20 13:40:21 +01:00
Marcin Matlaszek
86d650d7b0
Make starting & cleaning process group Windows compatible.
2018-03-20 13:40:21 +01:00
Marcin Matlaszek
79c139f2ef
Create new process group on process startup.
...
Clean up by sending SIGKILL to the whole process group.
2018-03-20 13:40:21 +01:00
Michael Schurter
1044bc0feb
Merge pull request #3984 from hashicorp/f-loosen-consul-skipverify
...
Replace Consul TLSSkipVerify handling
2018-03-16 11:21:28 -07:00
Michael Schurter
32ee5e0d53
Merge pull request #3990 from hashicorp/f-rkt-groups
...
rkt: allow specifying --group
2018-03-16 11:19:53 -07:00
Michael Schurter
bd78cfb039
rkt: allow specifying --group
2018-03-16 11:08:22 -07:00
Michael Schurter
fb10ec9c01
docker: make volume errors recoverable
...
The interface+mock just to test this one little error handling may seem
like overkill but there was just no other way to write an automated test
around this logic as there's no way to simluate this error with stock
Docker.
2018-03-15 17:52:43 -07:00
Michael Schurter
0971114f0c
Replace Consul TLSSkipVerify handling
...
Instead of checking Consul's version on startup to see if it supports
TLSSkipVerify, assume that it does and only log in the job service
handler if we discover Consul does not support TLSSkipVerify.
The old code would break TLSSkipVerify support if Nomad started before
Consul (such as on system boot) as TLSSkipVerify would default to false
if Consul wasn't running. Since TLSSkipVerify has been supported since
Consul 0.7.2, it's safe to relax our handling.
2018-03-14 17:43:06 -07:00
Preetha Appan
3c38eededd
Fix spelling in comment
2018-03-14 15:54:25 -05:00
Alex Dadgar
bef4a8ee09
fix clearing node events
2018-03-14 09:48:59 -07:00
Chelsea Komlo
810eedfa2a
Merge pull request #3945 from hashicorp/f-add-node-events
...
Add node events
2018-03-14 08:42:55 -04:00