Michael Schurter
195b8127fb
health_hook: fix panic and add tests
...
Still more testing to do, but I want to get this panic fixed ASAP.
All new tests pass with -race
2018-10-16 16:53:30 -07:00
Michael Schurter
64efc3d301
Emit events before long operations
...
Append when there's nothing blocking between appending and sending an
update to the server.
2018-10-16 16:53:30 -07:00
Michael Schurter
a2b696c4cf
Use a semaphore to block until watcher exits
2018-10-16 16:53:30 -07:00
Michael Schurter
a73162c977
ar: use multierror in update hook loop
...
Make it match TaskRunner update hook behavior
2018-10-16 16:53:30 -07:00
Michael Schurter
a7b427718c
tr: refactor EmitEvents into Emit+Append
...
* UpdateState: set state, append event, persist, update servers
* EmitEvent: append event, persist, update servers
* AppendEvent: append event, persist
AppendEvent may not even have to persist, but for the sake of
correctness I'm going with that for now.
2018-10-16 16:53:30 -07:00
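The three-way split described in a7b427718c can be sketched roughly as follows. This is only an illustration of the layering listed in the commit message; the `taskRunner` type, its fields, and the `persist`/`updateServers` helpers are hypothetical stand-ins, not Nomad's actual TaskRunner API.

```go
package main

import "fmt"

// taskRunner is a hypothetical stand-in for the real TaskRunner. The counters
// exist only so the example can show which side effects each method triggers.
type taskRunner struct {
	state     string
	events    []string
	persisted int // number of local persists
	synced    int // number of updates sent to servers
}

func (tr *taskRunner) persist()       { tr.persisted++ }
func (tr *taskRunner) updateServers() { tr.synced++ }

// AppendEvent: append event, persist.
func (tr *taskRunner) AppendEvent(ev string) {
	tr.events = append(tr.events, ev)
	tr.persist()
}

// EmitEvent: append event, persist, update servers.
func (tr *taskRunner) EmitEvent(ev string) {
	tr.AppendEvent(ev)
	tr.updateServers()
}

// UpdateState: set state, append event, persist, update servers.
func (tr *taskRunner) UpdateState(state, ev string) {
	tr.state = state
	tr.EmitEvent(ev)
}

func main() {
	tr := &taskRunner{}
	tr.AppendEvent("Downloading Artifacts")  // local only
	tr.UpdateState("running", "Started")     // local + servers
	fmt.Println(tr.state, len(tr.events), tr.persisted, tr.synced)
}
```

Building each method on the one below it keeps the "append, persist" step in a single place, so a later decision to drop persistence from AppendEvent would be a one-line change.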
Michael Schurter
93f3ac9ed6
ar: create health setting shim for health watcher
2018-10-16 16:53:30 -07:00
Michael Schurter
4d5aaac6d2
fix detection of task transitioning to running
2018-10-16 16:53:30 -07:00
Michael Schurter
4136e59f79
arv2: implement alloc health watching
...
Also remove initial alloc from broadcaster as it just caused useless
extra processing.
2018-10-16 16:53:30 -07:00
Michael Schurter
5c5c6dc41b
refactor ar hooks into their own files
...
minimize passed dependencies to ease testing
2018-10-16 16:53:30 -07:00
Michael Schurter
0bbf3a93ee
make AllocBroadcaster easier to use
...
And test thoroughly.
2018-10-16 16:53:30 -07:00
Michael Schurter
9d1ea3b228
client: hclog-ify most of the client
...
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Michael Schurter
e42154fc46
implement stopping, destroying, and disk migration
...
* Stopping an alloc is implemented via Updates but update hooks are
*not* run.
* Destroying an alloc is a best effort cleanup.
* AllocRunner destroy hooks implemented.
* Disk migration and blocking on a previous allocation exiting moved to
its own package to avoid cycles. Now only depends on alloc broadcaster
instead of also using a waitch.
* AllocBroadcaster now only drops stale allocations and always keeps the
latest version.
* Made AllocDir safe for concurrent use
Lots of internal contexts that are currently unused. Unsure if they
should be used or removed.
2018-10-16 16:53:30 -07:00
Michael Schurter
4236255686
lots of comment/log fixes
2018-10-16 16:53:30 -07:00
Michael Schurter
5749ede04e
keep forgetting lxc
2018-10-16 16:53:30 -07:00
Michael Schurter
357641c364
persist alloc state on changes, not periodically
...
Allow alloc and task runners to persist their own state when something
changes instead of periodically syncing all state.
2018-10-16 16:53:30 -07:00
Michael Schurter
820af27171
wrap boltdb in a write deduplicator
...
Saves a tiny bit of cpu and some IO. Sadly doesn't prevent all IO on
duplicate writes as the transactions are still created and committed.
$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/hashicorp/nomad/helper/boltdd
BenchmarkWriteDeduplication_On-4 500 4059591 ns/op 23736 B/op 56 allocs/op
BenchmarkWriteDeduplication_Off-4 300 4115319 ns/op 25942 B/op 55 allocs/op
2018-10-16 16:53:30 -07:00
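The idea behind 820af27171 can be illustrated with a small wrapper that remembers a digest of the last value written per key and skips identical writes. This is a sketch under assumptions: the `kvStore` interface, `countingStore`, and `dedupStore` are hypothetical names, an in-memory map stands in for boltdb, and the real `boltdd` package may track duplicates differently.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// kvStore is a hypothetical minimal interface standing in for a boltdb bucket.
type kvStore interface {
	Put(key string, val []byte) error
}

// countingStore records how many writes actually reach the backing store.
type countingStore struct {
	writes int
	data   map[string][]byte
}

func newCountingStore() *countingStore {
	return &countingStore{data: make(map[string][]byte)}
}

func (s *countingStore) Put(key string, val []byte) error {
	s.writes++
	s.data[key] = append([]byte(nil), val...)
	return nil
}

// dedupStore skips a Put when the value has the same digest as the last
// value successfully written under that key.
type dedupStore struct {
	next kvStore
	last map[string][sha256.Size]byte
}

func newDedupStore(next kvStore) *dedupStore {
	return &dedupStore{next: next, last: make(map[string][sha256.Size]byte)}
}

func (d *dedupStore) Put(key string, val []byte) error {
	sum := sha256.Sum256(val)
	if prev, ok := d.last[key]; ok && prev == sum {
		return nil // identical to the previous write: skip it
	}
	if err := d.next.Put(key, val); err != nil {
		return err
	}
	d.last[key] = sum
	return nil
}

func main() {
	backing := newCountingStore()
	db := newDedupStore(backing)
	_ = db.Put("alloc/1", []byte("running"))
	_ = db.Put("alloc/1", []byte("running")) // duplicate, skipped
	_ = db.Put("alloc/1", []byte("dead"))
	fmt.Println("writes reaching the store:", backing.writes)
}
```

As the commit message notes, a wrapper like this only avoids the write itself; any surrounding transaction is still created and committed.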
Michael Schurter
990228a6e2
wip wrap boltdb to get path information
...
finished but doesn't handle deleting deeply nested buckets
2018-10-16 16:53:30 -07:00
Michael Schurter
a3fe0510d1
Move all encoding and put deduping into state db
...
Still WIP as it does not handle deletions.
2018-10-16 16:53:30 -07:00
Michael Schurter
533bc93b3a
implement all boltdb interactions behind StateDB
2018-10-16 16:53:30 -07:00
Michael Schurter
d890de036a
tr: persist hook state whenever it changes
2018-10-16 16:53:30 -07:00
Michael Schurter
fae5e89a0e
artifacts: don't emit event when there's no artifacts
2018-10-16 16:53:30 -07:00
Michael Schurter
5383d20505
removing old restoration path before api change
2018-10-16 16:53:30 -07:00
Michael Schurter
a5d3e3fb0a
Implement alloc updates in arv2
...
Updates are applied asynchronously but sequentially
2018-10-16 16:53:30 -07:00
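One common way to get "asynchronous but sequential" behavior, as described in a5d3e3fb0a, is to queue updates on a channel drained by a single goroutine. The sketch below assumes that shape; the `runner` type and its methods are hypothetical, and the real alloc runner may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// runner is a hypothetical stand-in for an alloc runner. Updates are queued
// on a channel and applied by exactly one goroutine, so callers do not wait
// for an update to be applied, yet updates are applied in arrival order.
type runner struct {
	updates chan int
	wg      sync.WaitGroup
	applied []int // only touched by loop until Close returns
}

func newRunner() *runner {
	r := &runner{updates: make(chan int, 16)}
	r.wg.Add(1)
	go r.loop()
	return r
}

func (r *runner) loop() {
	defer r.wg.Done()
	for version := range r.updates {
		r.applied = append(r.applied, version) // apply one update at a time
	}
}

// Update queues an update. It blocks only if the queue is full.
func (r *runner) Update(version int) { r.updates <- version }

// Close stops accepting updates, waits for queued ones to be applied, and
// returns the versions in the order they were applied.
func (r *runner) Close() []int {
	close(r.updates)
	r.wg.Wait()
	return r.applied
}

func main() {
	r := newRunner()
	for v := 1; v <= 3; v++ {
		r.Update(v)
	}
	fmt.Println(r.Close())
}
```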
Michael Schurter
39b3f3a85b
call handle.Network() instead of storing it
2018-10-16 16:53:30 -07:00
Michael Schurter
7132b67c1e
Add Network method to Handle interface
...
Should probably be moved to an Inspect method in the Driver Plugin world
2018-10-16 16:53:30 -07:00
Michael Schurter
a4b4d7b266
consul service hook
...
Deregistration works but difficult to test due to terminal updates not
being fully implemented in the new client/ar/tr.
2018-10-16 16:53:29 -07:00
Michael Schurter
5be982e674
restore vault client
2018-10-16 16:53:29 -07:00
Michael Schurter
ce04915c9f
log before killing tasks
2018-10-16 16:53:29 -07:00
Michael Schurter
a2bf851805
no need for TaskStateUpdated to return an error
...
also updated comments
2018-10-16 16:53:29 -07:00
Alex Dadgar
fd3bc1bd39
Update state with server
2018-10-16 16:53:29 -07:00
Alex Dadgar
bc905cc61d
Define and thread through state updating interface
2018-10-16 16:53:29 -07:00
Michael Schurter
9a63d6103d
tr: add validate task hook
2018-10-16 16:53:29 -07:00
Michael Schurter
7f4ec50906
missed locking around c.allocs access
2018-10-16 16:53:29 -07:00
Alex Dadgar
c93cfc89c0
wip
2018-10-16 16:53:29 -07:00
Alex Dadgar
7ddc0eb65c
Fix deadlock
2018-10-16 16:53:29 -07:00
Alex Dadgar
3779077052
Remove SetState from interface
2018-10-16 16:53:29 -07:00
Alex Dadgar
e1ba73b515
compile
2018-10-16 16:53:29 -07:00
Michael Schurter
6ebdf532ea
wip split event emitting and state transitions
2018-10-16 16:53:29 -07:00
Michael Schurter
516d641db0
client: implement all-or-nothing alloc restoration
...
Restoring calls NewAR -> Restore -> Run
NewAR now calls NewTR
AR.Restore calls TR.Restore
AR.Run calls TR.Run
2018-10-16 16:53:29 -07:00
Alex Dadgar
e401c660e7
Implement lifecycle hooks on the task runner
2018-10-16 16:53:29 -07:00
Alex Dadgar
89b4ba9cc8
comments
2018-10-16 16:53:29 -07:00
Alex Dadgar
86e81947b4
Hook renames
2018-10-16 16:53:29 -07:00
Alex Dadgar
2599cf9d74
remove comment
2018-10-16 16:53:29 -07:00
Alex Dadgar
88aa0299a9
Template hook
2018-10-16 16:53:29 -07:00
Alex Dadgar
c9765deff1
address comments
2018-10-16 16:53:29 -07:00
Alex Dadgar
80f6ce50c0
vault hook
2018-10-16 16:53:29 -07:00
Michael Schurter
30d377eba4
tr: improve skip log line
2018-10-16 16:53:29 -07:00
Michael Schurter
ef213b864b
tr: pass context to hooks
2018-10-16 16:53:29 -07:00
Michael Schurter
3a4f387fd3
tr: fix setting done in existing hooks
2018-10-16 16:53:29 -07:00
Michael Schurter
b360f6f96e
fix hclog level
2018-10-16 16:53:29 -07:00
Michael Schurter
ae89b7da95
reimplement success state for tr hooks and state persistence
...
splits apart local and remote persistence
removes some locking *for now*
2018-10-16 16:53:29 -07:00
Michael Schurter
4f43ff5c51
pass statedb into allocrunnerv2
2018-10-16 16:53:29 -07:00
Michael Schurter
582c76a420
remove unused allocrunner shim
2018-10-16 16:53:29 -07:00
Michael Schurter
c5504bd939
tr: cleanup main loop and shutdown hook impl
2018-10-16 16:53:29 -07:00
Michael Schurter
561260d6fe
tr: skip error/success saving
...
All hooks only need to be run once.
Since only one hook can fail per run there's no need to
track errors on a per hook basis.
2018-10-16 16:53:29 -07:00
Michael Schurter
67874e761f
tr: don't lock for immutable fields
2018-10-16 16:53:29 -07:00
Michael Schurter
f473cd03d6
tr: start update/shutdown logic
2018-10-16 16:53:29 -07:00
Michael Schurter
637ef264ae
Copy TR.Config vals to TR
...
I think I like this pattern better as some Config vals are mutable
(Alloc) and some aren't and some are used to derive other values and
never used directly.
Promoting them onto the TR struct is a little more work but is hopefully
more clear as to how each value is used.
2018-10-16 16:53:29 -07:00
Michael Schurter
0f7dcfdc9a
example redis job "runs" on arv2! see below
...
Tons left to do and lots of churn:
1. No state saving
2. No shutdown or gc
3. Removed AR factory *for now*
4. Made all "Config" structs local to the package they configure
5. Added allocID to GC to avoid a lookup
Really hating how many things use *structs.Allocation. It's not bad
without state saving, but if AllocRunner starts updating its copy things
get racy fast.
2018-10-16 16:53:29 -07:00
Michael Schurter
9a6aa38b0f
begin adding AllocRunner.Update
2018-10-16 16:53:29 -07:00
Michael Schurter
eae54e2954
artifact task hook
2018-10-16 16:53:29 -07:00
Alex Dadgar
b9bed81e6e
Initial V2 alloc runner
2018-10-16 16:53:28 -07:00
Alex Dadgar
a78cefec18
use int64
2018-10-16 15:34:32 -07:00
Preetha Appan
7c0d8c646c
Change CPU/Disk/MemoryMB to int everywhere in new resource structs
2018-10-16 16:21:42 -05:00
Christian Winther
0c5154100c
fix: increase log rotator line scan limit
...
When gelf/json logging is used, it's fairly easy to exceed the 16k limit, which cuts the json output into multiple strings.
The result is invalid json lines, which can cause all kinds of problems in the logging server.
This fixes https://github.com/hashicorp/nomad/issues/4699
Signed-off-by: Christian Winther <jippignu@gmail.com>
2018-10-09 18:57:18 +02:00
Alex Dadgar
01f8e5b95f
renames
2018-10-04 14:57:25 -07:00
Alex Dadgar
52f9cd7637
fixing tests
2018-10-04 14:26:19 -07:00
Alex Dadgar
bac5cb1e8b
Scheduler uses allocated resources
2018-10-02 17:08:25 -07:00
Alex Dadgar
5c8697667e
Node reserved resources
2018-09-29 18:44:55 -07:00
Alex Dadgar
3183153315
Node resources on client
2018-09-29 17:23:41 -07:00
Alex Dadgar
9971b3393f
yamux
2018-09-17 14:22:40 -07:00
Alex Dadgar
ca28afa3b2
small fixes
2018-09-15 16:42:38 -07:00
Alex Dadgar
7739ef51ce
agent + consul
2018-09-13 10:43:40 -07:00
Michael Schurter
08862fc177
fix race around error handling
2018-09-05 17:34:17 -07:00
Michael Schurter
6def5bc4f9
client: set host name when migrating over tls
...
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names ephemeral dir migrations were broken when TLS was enabled.
Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
2018-09-05 17:24:17 -07:00
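The fix in 6def5bc4f9 relies on a standard Go behavior: when `tls.Config.ServerName` is empty, the HTTP client verifies the certificate against the host in the request URL, so setting it explicitly lets a connection to an arbitrary address verify against a `${role}.${region}.nomad` name. A minimal sketch, in which the `serverName` and `migrationTLSConfig` helpers and the `client`/`global` values are assumptions for illustration rather than Nomad's code:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// serverName builds the ${role}.${region}.nomad name mentioned in the commit.
func serverName(role, region string) string {
	return fmt.Sprintf("%s.%s.nomad", role, region)
}

// migrationTLSConfig copies base (if any) and pins ServerName so certificate
// verification uses the Nomad name instead of the dialed host.
func migrationTLSConfig(base *tls.Config, role, region string) *tls.Config {
	cfg := &tls.Config{}
	if base != nil {
		cfg = base.Clone()
	}
	cfg.ServerName = serverName(role, region)
	return cfg
}

func main() {
	cfg := migrationTLSConfig(nil, "client", "global")
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: cfg},
	}
	_ = client // requests made with this client verify against cfg.ServerName
	fmt.Println(cfg.ServerName)
}
```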
Alex Dadgar
c6576ddac1
Fix make check errors
2018-09-04 16:03:52 -07:00
Alex Dadgar
089b533047
Fix kill timeout exceeding 5m on Docker driver
...
Fixes an issue where the Docker API client would time out before the kill
timeout was hit.
2018-08-17 16:01:09 -07:00
Alex Dadgar
49a1ba9297
Merge pull request #4535 from hashicorp/f-keep-docker-container-0.8.4
...
Option to prevent removal of container on exit
2018-07-26 11:11:22 -07:00
Charlie Voiselle
f319a149cd
Option to prevent removal of container on exit
2018-07-26 11:10:48 -07:00
Michael Schurter
ddf948001e
Merge pull request #4462 from omame/omame/cpu_cfs_period
...
Add support for specifying cpu_cfs_period in the Docker driver
2018-07-25 09:34:38 -07:00
Daniele Valeriani
b0a14caca2
Add test for cpu_cfs_period
2018-07-16 22:43:34 +02:00
Michael Schurter
91588cb861
rkt: revert to redis 3.2 to favor stability
2018-07-09 16:15:32 -07:00
Michael Schurter
c56f899ee9
rkt: speed up tests
...
Disable networking when it's not needed and improve failure message for
UserGroup test by including the full ps output on failure.
2018-07-09 14:02:27 -07:00
Michael Schurter
a1d4f77ce0
rkt: skip retrieving network information when net=none
...
Even when net=none we would attempt to retrieve network information from
rkt which would spew useless log lines such as:
```
testlog.go:30: 20:37:31.409209 [DEBUG] driver.rkt: failed getting network info for pod UUID 8303cfe6-0c10-4288-84f5-cb79ad6dbf1c attempt 2: no networks found. Sleeping for 970ms
```
It would also delay tests for ~60s during the network information retry
period.
So skip this when net=none. It's unlikely anyone actually uses net=none
outside of tests, so I doubt anyone will notice this change.
Official docs:
https://coreos.com/rkt/docs/latest/networking/overview.html#no-loopback-only-networking
2018-07-09 13:44:43 -07:00
Michael Schurter
0fbc84b81d
tests: make alloc id consistent in helper
...
It worked, but the old code used a different alloc id for the path than
the actual alloc! Use the same alloc id everywhere to prevent confusing
test output.
2018-07-09 13:37:35 -07:00
Michael Schurter
f3b8815c96
rkt: fix failing TestRktDriver_UserGroup test
...
Started failing due to the docker redis image switching from Debian
jessie to stretch:
53f8680550 (diff-acff46b161a3b7d6ed01ba79a032acc9)
Switched from Debian based image to Alpine to get a working `ps` command
again (albeit busybox's stripped down implementation)
2018-07-09 12:19:02 -07:00
Daniele Valeriani
748f6afd89
Validate the value of cpu_cfs_period
2018-07-02 22:30:22 +02:00
Daniele Valeriani
9364446a03
Remove an unnecessary conversion
2018-07-02 17:47:23 +02:00
Daniele Valeriani
906952a2c8
Add support for specifying cpu_cfs_period in the Docker driver
2018-07-02 16:37:04 +02:00
Preetha
b567750824
Merge pull request #4392 from burdandrei/telemetry-parametrized-jobs
...
Parametrized/periodic jobs per child tagged metric emission
2018-06-21 17:13:36 -05:00
Preetha
043f4c208b
Merge pull request #3882 from burdandrei/telemetry-add-node-class-tag
...
Added node class to tagged metrics
2018-06-21 17:04:35 -05:00
Andrei Burd
444ee45aff
Parametrized/periodic jobs per child tagged metric emission
2018-06-21 10:40:56 +03:00
James Rasell
75f95ccf09
Merge branch 'master' into f_gh_4381
2018-06-19 17:51:57 +02:00
Alex Dadgar
b61051b3cd
Merge pull request #4409 from hashicorp/r-client-packages
...
Refactor client packages
2018-06-13 17:32:25 -07:00
Alex Dadgar
22757d964e
lint
2018-06-13 16:06:39 -07:00
Alex Dadgar
af558df94c
Fix test using a lot of memory
2018-06-13 15:52:25 -07:00
Alex Dadgar
300b1a7a15
Tests only use testlog package logger
2018-06-13 15:40:56 -07:00
Chelsea Komlo
03075b603a
Merge pull request #4399 from hashicorp/r-reload-refactor
...
Refactor logic for dynamic reloading
2018-06-13 13:35:12 -04:00
Alex Dadgar
9bab9edf27
test fixes
2018-06-12 17:45:39 -07:00
Alex Dadgar
90c2108bfb
Fix gc tests + parallel destroy + small test fixes
2018-06-12 10:23:45 -07:00
Alex Dadgar
f5ff509fa5
Refactor - wip
2018-06-12 10:23:45 -07:00
Alex Dadgar
ff2ab8f58e
Fix vault template test
2018-06-12 09:57:28 -07:00
Alex Dadgar
d0043691fb
remove structs + bump version
2018-06-11 13:52:19 -07:00
Alex Dadgar
af5753d2cd
bump version + generated files
2018-06-11 13:39:42 -07:00
Nick Ethier
f36eb14360
Merge pull request #4403 from hashicorp/b-fix-dispatched-optional-meta
...
Fix dispatched optional meta correctly
2018-06-11 16:17:14 -04:00
Nick Ethier
e75e3ae665
nomad: use require pkg for tests
2018-06-11 13:50:50 -04:00
Nick Ethier
3aa6241b5c
client/driver/env: fix optional meta test
2018-06-11 12:29:13 -04:00
Nick Ethier
c65882cafd
client/driver/env: use 'job.Dispatch' to trigger optional meta logic
2018-06-11 12:15:19 -04:00
Nick Ethier
ccb5372813
Revert "Revert "client/driver/env: interpolate empty optional meta params as empty strings""
...
This reverts commit c17e0fc9dc5fd288935ab2b68fb441b4d25ac189.
2018-06-11 11:59:23 -04:00
Michael Schurter
c198cfd8ea
executor: fix log line formatting
2018-06-08 14:55:39 -07:00
Michael Schurter
d1a60e700e
executor: fix Windows blocking on pipe close
...
Sending the Ctrl-Break signal to PowerShell <6 causes it to drop into
debug mode. Closing its output pipe at that point will block
indefinitely and prevent the process from being killed by Nomad.
See the upstream powershell issue for details:
https://github.com/PowerShell/PowerShell/issues/4254
2018-06-08 14:48:05 -07:00
Chelsea Holland Komlo
f74e74b22d
add client logic to determine whether TLS RPC connections should reload
2018-06-08 14:38:58 -04:00
James Rasell
b9009c419c
Add 'nomad.advertise.address' to client meta via NomadFingerPrint
...
This change removes the addition of the advertise address to the
exported task env vars and instead moves this work into the
NomadFingerprint.Fingerprint which adds this value to the client
attrs. This can then be used within a Nomad job like
${attr.nomad.advertise.address}.
2018-06-08 09:44:10 +02:00
Alex Dadgar
d9b35fab52
Revert "client/driver/env: interpolate empty optional meta params as empty strings"
...
This reverts commit 84926f759a63a90be7bbcf0fad78deb3f02af23d.
2018-06-07 16:27:47 -07:00
Nick Ethier
b3c767fae0
client/driver: drop docker pull progress estimate if it's < 0
2018-06-07 15:23:31 -04:00
James Rasell
367a8b5152
Add the local client's advertise address to interpolation env vars
...
This commit adds the Nomad local client advertise address in the
form host:port to the environment variables passed to each task.
2018-06-07 09:45:15 +02:00
Alex Dadgar
98705824ed
Merge pull request #4185 from jesusvazquez/add-counter-metric-for-oom-killer-events
...
Add driver.docker counter metric for OOM Killer events
2018-06-04 15:12:51 -07:00
Alex Dadgar
23cd56dc78
remove generated structs
2018-06-01 16:11:28 -07:00
Alex Dadgar
bf5b5747ab
fix test message
2018-06-01 15:51:54 -07:00
Alex Dadgar
3e3d3c7445
Disable Exec on non-linux platforms
...
This PR disables exec on non-linux platforms
2018-06-01 15:48:14 -07:00
Alex Dadgar
c0386819b3
bump version/lint/generated files
2018-06-01 15:23:10 -07:00
Preetha Appan
ce6d4a8d7a
Fix tests and move isClient to constructor
2018-06-01 15:59:53 -05:00
Alex Dadgar
a62dd2aadb
Merge pull request #4350 from hashicorp/b-raw-exec-cgroups
...
Raw exec can use cgroups to manage PIDs
2018-06-01 17:37:49 +00:00
Alex Dadgar
8da42940c9
wait for result
2018-06-01 10:14:53 -07:00
Alex Dadgar
40fec81315
Merge pull request #4277 from hashicorp/f-retry-join-clients
...
Add go-discover support to Nomad clients
2018-06-01 16:57:40 +00:00
Alex Dadgar
460ecb8705
Comments
2018-05-31 18:05:03 -07:00
Alex Dadgar
de98774f2c
Add test and docs
2018-05-31 18:05:03 -07:00
Alex Dadgar
ff28b04c46
Use more appropriate name than cgroup
2018-05-31 18:05:03 -07:00
Alex Dadgar
37e900b1d3
Only use freezer/devices when in the basic cgroup only
2018-05-31 18:05:03 -07:00
Alex Dadgar
ffd9270f2f
Use cgroup when possible
2018-05-31 18:05:03 -07:00
Alex Dadgar
0ff0ed290d
Fix TestDockerDriver_StartNVersions
2018-05-31 17:14:59 -07:00
Alex Dadgar
7e6dd498c9
Remove debug logging
2018-05-31 15:52:42 -07:00
Alex Dadgar
b1b908527f
spelling
2018-05-31 15:29:55 -07:00
Alex Dadgar
a3b29553a5
Force close stdout/stderr after grace
...
This commit changes the force closing of the stdout/stderr file
descriptor from closing immediately to being closed after a grace
period. This allows the created process to close its own file and allows
copying of the data.
2018-05-31 15:21:36 -07:00
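The grace-period behavior described in a3b29553a5 can be sketched as below. This is an illustration under assumptions, not the executor's implementation: `closeAfterGrace` and `drainWithGrace` are hypothetical helpers, and an `os.Pipe` stands in for a task's stdout.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

// closeAfterGrace waits for exited to be closed, gives the process up to
// grace to close its own end, then force closes c.
func closeAfterGrace(c io.Closer, exited <-chan struct{}, grace time.Duration) {
	go func() {
		<-exited
		time.Sleep(grace)
		_ = c.Close() // an error here only means it was already closed
	}()
}

// drainWithGrace writes payload to a pipe, pretends the writer's process has
// exited, and copies the data out. The read finishes once the write end is
// closed, which happens after the grace period rather than immediately.
func drainWithGrace(payload string, grace time.Duration) (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	if _, err := w.WriteString(payload); err != nil {
		return "", err
	}
	exited := make(chan struct{})
	close(exited) // the process has exited
	closeAfterGrace(w, exited, grace)
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	out, err := drainWithGrace("last log line\n", 50*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("copied %d bytes\n", len(out))
}
```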
Alex Dadgar
5e787e2d72
test build
2018-05-31 12:22:31 -07:00
Alex Dadgar
ead1b7f423
Log more info for TestExecutor_IsolationAndConstraints
2018-05-31 11:57:44 -07:00
Alex Dadgar
b05740ad13
Merge pull request #4341 from hashicorp/f-docker-pids
...
Support Docker Pids Limit
2018-05-31 17:59:29 +00:00
Chelsea Holland Komlo
064b5481e0
add server join info to server and client
2018-05-31 10:50:03 -07:00
Alex Dadgar
f4d4bbdc97
test pid limit
2018-05-30 12:55:24 -07:00
Chelsea Holland Komlo
94d510e969
Support Docker Pids Limit
2018-05-25 19:54:14 -04:00
Alex Dadgar
1685c8ebe4
cleanup
2018-05-24 16:25:20 -07:00
Alex Dadgar
2eacdb6bd6
Force closing of pipe to child process
2018-05-24 16:03:48 -07:00
Chelsea Holland Komlo
38f611a7f2
refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing
...
add missing fields to TLS merge method
2018-05-23 18:35:30 -04:00
Preetha
9084bb025e
Merge pull request #4303 from hashicorp/b-docker-client-nil-panic
...
Add nil check before setting timeout on docker client
2018-05-21 19:34:44 -07:00
Jesus Vazquez
23d959e42c
Add job, task, taskgroup to open method
2018-05-21 20:37:18 +02:00
Jesus Vazquez
0a062a04c7
Remove allocID from dockerhandle struct
2018-05-21 20:33:01 +02:00
Jesus Vazquez
e5a81815bb
Rename labels job, task_group and task
2018-05-21 20:32:50 +02:00
Jesus Vazquez
ffe1b1a1b6
Remove allocid label from driver.docker.oom counter metric
2018-05-21 20:30:56 +02:00
Alex Dadgar
38762d9bde
Merge pull request #4282 from hashicorp/f-rotator
...
Avoid splitting log line across two files
2018-05-21 17:52:13 +00:00
Alex Dadgar
d95698e2c5
Merge pull request #4298 from justenwalker/docker-driver-digest-tags
...
driver/docker: pull image with digest
2018-05-21 17:46:14 +00:00
Nick Ethier
6392009dd6
client/driver: use correct repo address when using docker-credential helper (#4266)
2018-05-15 17:39:48 -04:00
Justen Walker
a8989f33bb
driver/docker: add test for dockerImageRef
2018-05-14 14:24:03 -04:00
Justen Walker
194b2231d6
driver/docker: fix up TestParseDockerImage
2018-05-14 14:23:48 -04:00
Justen Walker
25b2807ce3
driver/docker: fix TestDockerDriver_ForcePull_RepoDigest
2018-05-14 14:23:02 -04:00
Nick Ethier
c4d07a2200
client/driver: guard authHelper test from running on windows
2018-05-14 13:46:57 -04:00
Justen Walker
b23ca7574c
driver/docker: cleanup parseDockerImage
2018-05-14 11:11:51 -04:00
Justen Walker
60f7f1aa08
driver/docker: pull image with digest
...
GH #4290
Add digest support to the docker driver image config. This commit
factors out some common code to print the repo:tag (dockerImageRef) for
events/logs as well as parsing the image to retrieve the repo,tag
(parseDockerImage) so that the results are consistent/sane for both
repo:tag and repo@sha256:... references.
When pulling an image with a digest, the tag is blank and the repo
contains the digest. See:
https://github.com/fsouza/go-dockerclient/blob/master/image_test.go#L471
2018-05-14 10:42:58 -04:00
Preetha Appan
de66ec7394
Add nil check before setting timeout on docker client
2018-05-11 17:09:26 -05:00
Alex Dadgar
7ad5c76734
Add new line test
2018-05-11 10:52:09 -07:00
Alex Dadgar
3671ed139d
Avoid splitting log line across two files
...
We attempt to avoid splitting a log line between two files by detecting
if we are near the file size limit and scanning for new lines and only
flushing those.
BenchmarkRotator/1KB-8 300000 5613 ns/op
BenchmarkRotator/2KB-8 200000 8384 ns/op
BenchmarkRotator/4KB-8 100000 14604 ns/op
BenchmarkRotator/8KB-8 50000 25002 ns/op
BenchmarkRotator/16KB-8 30000 47572 ns/op
BenchmarkRotator/32KB-8 20000 92080 ns/op
BenchmarkRotator/64KB-8 10000 165883 ns/op
BenchmarkRotator/128KB-8 5000 294405 ns/op
BenchmarkRotator/256KB-8 2000 572374 ns/op
2018-05-10 15:11:01 -07:00
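The approach described in 3671ed139d, flushing only up to the last newline when a write would cross the file size limit, might look roughly like this. The `splitForRotation` function is a hypothetical helper written for illustration; in particular, the fallback for a line longer than the remaining space is an assumption and the real rotator may handle that case differently.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitForRotation decides how much of buf to write to the current file when
// only remaining bytes are left before the size limit (remaining >= 0).
// head goes to the current file; tail is carried over to the next file.
func splitForRotation(buf []byte, remaining int) (head, tail []byte) {
	if len(buf) <= remaining {
		return buf, nil // everything fits
	}
	// Flush only complete lines: cut after the last newline that still fits.
	if i := bytes.LastIndexByte(buf[:remaining], '\n'); i >= 0 {
		return buf[:i+1], buf[i+1:]
	}
	// Assumed fallback: no newline fits, so split at the limit.
	return buf[:remaining], buf[remaining:]
}

func main() {
	head, tail := splitForRotation([]byte("a\nbb\nccc\n"), 6)
	fmt.Printf("current file: %q\nnext file:    %q\n", head, tail)
}
```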
Alex Dadgar
f5d91b5338
Benchmark for rotator
...
BenchmarkRotator/1KB-8 200000 5572 ns/op
BenchmarkRotator/2KB-8 200000 8338 ns/op
BenchmarkRotator/4KB-8 100000 14246 ns/op
BenchmarkRotator/8KB-8 50000 25279 ns/op
BenchmarkRotator/16KB-8 30000 48602 ns/op
BenchmarkRotator/32KB-8 20000 92159 ns/op
BenchmarkRotator/64KB-8 10000 154766 ns/op
BenchmarkRotator/128KB-8 5000 296872 ns/op
BenchmarkRotator/256KB-8 3000 551793 ns/op
2018-05-10 14:15:15 -07:00
Nick Ethier
91603a377e
client/driver: parse repo instead of attempting to pull repo info
2018-05-09 22:34:25 -04:00
Nick Ethier
38a33f9c75
client/driver: add test for docker auth helper
2018-05-09 22:33:56 -04:00
Alex Dadgar
e067a9ae06
naming of constants
2018-05-09 16:46:52 -07:00
Chelsea Holland Komlo
796bae6f1b
allow configurable cipher suites
...
disallow 3DES and RC4 ciphers
add documentation for tls_cipher_suites
2018-05-09 17:15:31 -04:00
Alex Dadgar
0e79e1a46e
Keep stream and logs in sync for detecting closed pipe
2018-05-09 11:22:52 -07:00
Preetha
e7ae6e98d9
Merge pull request #4259 from hashicorp/f-deployment-improvements
2018-05-08 16:37:10 -05:00
Nick Ethier
3598925ca4
client/driver: use correct repo address when using docker-credential helper
2018-05-08 15:17:28 -04:00
Nick Ethier
54c86a0292
client/driver/env: interpolate empty optional meta params as empty strings
2018-05-07 20:19:51 -04:00
Nick Ethier
016ab7a105
client/driver: remove unused const 'dockerPullProgressEmitInterval'
2018-05-07 16:24:48 -04:00
Michael Schurter
f1d13683e6
consul: remove services with/without canary tags
...
Guard against Canary being set to false at the same time as an
allocation is being stopped: this could cause RemoveTask to be called
with the wrong Canary value and leak a service.
Deleting both Canary values is the safest route.
2018-05-07 14:55:01 -05:00
Michael Schurter
50e04c976e
consul: support canary tags for services
...
Also refactor Consul ServiceClient to take a struct instead of a massive
set of arguments. Meant updating a lot of code but it should be far
easier to extend in the future as you will only need to update a single
struct instead of every single call site.
Adds an e2e test for canary tags.
2018-05-07 14:55:01 -05:00
Alex Dadgar
df8fce4347
Ensure canary tags are interpolated
2018-05-07 14:50:01 -05:00
Alex Dadgar
552604451c
rework where time gets set
2018-05-07 14:50:01 -05:00
Alex Dadgar
ee50789c22
Initial implementation
2018-05-07 14:50:01 -05:00
Nick Ethier
d8de354dbf
client/driver: add waiting layer status count to pull progress status msg
2018-05-07 12:18:20 -04:00
Nick Ethier
77af17efbc
client/driver: add separate handler for emitting pull progress
2018-05-07 12:17:34 -04:00
Nick Ethier
0bdd976b7d
client/driver: remove pull timeout due to race condition that can lead to unexpected timeouts
...
If two jobs are pulling the same image simultaneously, whichever starts the pull first will set the pull timeout.
This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer timeout,
causing the pull to potentially time out much sooner than expected by the second job.
2018-05-07 12:18:11 -04:00
Nick Ethier
7c5821d7c6
client/driver: do accounting on layer pull progress
2018-05-07 12:17:53 -04:00
Nick Ethier
8efda7dc6c
client/driver: emit progress to all allocs pulling same image
2018-05-07 12:17:34 -04:00
Nick Ethier
e35948ab91
client/driver: add image pull progress monitoring
2018-05-07 12:17:38 -04:00
Michael Schurter
0d534d30d6
Merge pull request #4251 from hashicorp/f-grpc-checks
...
Support Consul gRPC Health Checks
2018-05-04 14:55:16 -07:00
Michael Schurter
f6a4713141
consul: make grpc checks more like http checks
2018-05-04 11:08:11 -07:00
Michael Schurter
382caec1e1
consul: initial grpc implementation
...
Needs to be more like http.
2018-05-04 11:08:11 -07:00
Jesus Vazquez
08a390448b
Update counter driver.docker.oom labels
2018-05-04 14:02:34 +08:00
Jesus Vazquez
4f6db56283
Initialize dockerhandle with jobname, taskgroupname, taskname and allocid
2018-05-04 14:02:19 +08:00
Jesus Vazquez
127b764dfb
Add Job, taskgroupname, taskname, and allocid to the DockerHandle struct
2018-05-04 14:01:26 +08:00
Jesus Vazquez
fd1ff1a0cf
Run goimports
2018-05-04 13:46:36 +08:00
Jesus Vazquez
5dd4059527
Add driver.docker counter metric for OOM Killer events
2018-05-04 13:46:36 +08:00
Michael Schurter
526af6a246
framer: fix early exit/truncation in framer
2018-05-02 10:46:16 -07:00
Michael Schurter
f1a6aa103a
framer: fix race and remove unused error var
...
In the old code `sending` in the `send()` method shared the Data slice's
underlying backing array with its caller. Clearing StreamFrame.Data
didn't break the reference from the sent frame to the StreamFramer's
data slice.
2018-05-02 10:46:16 -07:00
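The aliasing problem described in f1a6aa103a is a general property of Go slices: copying a slice header does not copy the backing array, so reusing the original buffer changes what the "sent" frame sees. The sketch below shows the effect in a single goroutine; with concurrent access the same sharing becomes a data race. The `frame` type and both helper functions are hypothetical and simplified, not the StreamFramer code.

```go
package main

import "fmt"

// frame is a simplified stand-in for a stream frame carrying a payload.
type frame struct{ Data []byte }

// sharedFrame keeps a reference to the caller's backing array.
func sharedFrame(buf []byte) frame { return frame{Data: buf} }

// copiedFrame gives the frame its own backing array.
func copiedFrame(buf []byte) frame {
	cp := make([]byte, len(buf))
	copy(cp, buf)
	return frame{Data: cp}
}

func main() {
	buf := []byte("hello")
	shared := sharedFrame(buf)
	copied := copiedFrame(buf)

	// Truncating and reusing the buffer overwrites the shared array in place.
	buf = append(buf[:0], "WORLD"...)

	fmt.Printf("shared=%s copied=%s\n", shared.Data, copied.Data)
	// prints "shared=WORLD copied=hello"
}
```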
Michael Schurter
7360fe3a6d
client: squelch errors on cleanly closed pipes
2018-05-02 10:46:16 -07:00
Michael Schurter
ffff97e25f
client: don't spin on read errors
2018-05-02 10:46:16 -07:00
Michael Schurter
5ef0a82e6e
client: reset encoders between uses
...
According to go/codec's docs, Reset(...) should be called on
Decoders/Encoders before reuse:
https://godoc.org/github.com/ugorji/go/codec
I could find no evidence that *not* calling Reset() caused bugs, but
might as well do what the docs say?
2018-05-02 10:46:16 -07:00
Alex Dadgar
de4af37249
version bump and remove generated
2018-04-27 11:10:00 -07:00
Alex Dadgar
845a43864a
generated files
2018-04-27 10:45:40 -07:00
Alex Dadgar
35e06ddb31
Remove generated and version bump
2018-04-26 16:49:19 -07:00
Alex Dadgar
43192cefae
generated files
2018-04-26 16:28:58 -07:00
Michael Schurter
0e602d4779
Merge pull request #4188 from hashicorp/f-rkt-stats
...
rkt: create parent cgroup to enable stats
2018-04-24 14:54:36 -07:00
Michael Schurter
d687761ebf
rkt: test Stats() and always run tests
...
Remove the NOMAD_TEST_RKT flag as a guard for rkt tests. Still require
Linux, root, and rkt to be installed. Only check for rkt installation
once in hopes of speeding up rkt tests a bit.
2018-04-24 11:05:42 -07:00
Javier Palomo Almena
3e6c01ffa1
docker tests: Fix usage of NewDriverContext
2018-04-23 22:51:06 +02:00
Javier Palomo Almena
74d3c5df07
DriverContext: Add the TaskGroup and the Job name
...
Adding these fields to the DriverContext object will allow us to pass
them to the drivers.
A use case for this will be to emit tagged metrics in the drivers,
which contain all relevant information:
- Job
- TaskGroup
- Task
- ...
Ref: https://github.com/hashicorp/nomad/pull/4185
2018-04-23 00:15:29 +02:00
Michael Schurter
4cee6cca6c
rkt: create parent cgroup to enable stats
...
Having the Nomad executor create parent cgroups that rkt is launched
within allows the stats collection code used for the exec driver to Just
Work. The only downside is that now the Nomad executor's resource
utilization counts against the cgroups resource limits just as it does
for the exec driver.
2018-04-19 15:14:56 -07:00
Michael Schurter
1a85d0c990
run goimports
2018-04-19 11:16:28 -07:00
Michael Schurter
d77c265d1f
Merge pull request #4168 from ninoles/b-2117-windows-group-process
...
B 2117 windows group process
2018-04-19 11:10:51 -07:00
Michael Schurter
fdbcbd4e5b
Merge pull request #4058 from hashicorp/f-mock-by-default
...
[Post-0.8] test: build with mock_driver by default
2018-04-18 15:57:00 -07:00
Michael Schurter
d3650fb2cd
test: build with mock_driver by default
...
`make release` and `make prerelease` set a `release` tag to disable
enabling the `mock_driver`
2018-04-18 14:45:33 -07:00
Michael Schurter
a991923389
tests: fix race in alloc_runner_test.go
...
I could not reproduce the failure locally even with `stress -cpu ...`
eating all the cpu it could on my machine.
But I think the race was in one of two places:
* The task could restart which could create new events
* I think there could be a race between the updater's version of events
and alloc runners as updates are async
I fixed both. Here's hoping that fixes this flaky test.
2018-04-17 17:14:59 -07:00
Fabien Ninoles
c81bec48c9
Merge branch 'master' into b-2117-windows-group-process
2018-04-17 13:47:25 -04:00
Fabien Ninoles
35cf641416
Update based on PR request.
2018-04-17 13:43:04 -04:00
Alex Dadgar
c4ad76091d
Merge pull request #4166 from hashicorp/b-panic-fix-update
...
Fixes races accessing node and updating it during fingerprinting
2018-04-17 10:02:19 -07:00
Chelsea Holland Komlo
9b8a079558
fix up comments
2018-04-17 11:53:08 -04:00
Alex Dadgar
9d612c8cb0
Cleanup
2018-04-16 15:48:34 -07:00
Alex Dadgar
32adaf9dfc
Copy the config given to the alloc runner
2018-04-16 15:45:52 -07:00
Alex Dadgar
3ff2d4d795
fix race node access
2018-04-16 15:45:51 -07:00
Alex Dadgar
4f2a7b6949
Fix copying drivers
2018-04-16 15:45:51 -07:00
Alex Dadgar
0b799822ff
Operate on copy
2018-04-16 15:45:49 -07:00
Fabien Ninoles
27cf4995ce
- Clean up for windows compilation.
...
- Set CREATE_NEW_PROCESS_GROUP for Windows subprocess.
- Ensure we only kill the processes that actually need to be killed.
2018-04-14 13:58:42 -04:00
Michael Schurter
3836b8a335
Merge pull request #3572 from emate/master
...
Create new process group on process startup.
2018-04-13 11:56:38 -07:00
Alex Dadgar
adaf4fa7e0
Remove generated structs
2018-04-12 16:35:31 -07:00
Alex Dadgar
663c4d0433
Version bump and generated files
2018-04-12 16:21:50 -07:00
Alex Dadgar
ff1a1a63e8
Move where attribute for driver detection is set
2018-04-12 15:50:25 -07:00
Chelsea Holland Komlo
5291788b40
delete driver name from only health check attributes
2018-04-12 18:24:41 -04:00
Alex Dadgar
3d53d380f7
Fix tests
2018-04-12 14:29:30 -07:00
Alex Dadgar
f24ce2c50c
Driver health detection cleanups
...
This PR does:
1. Health message based on detection has format "Driver XXX detected"
and "Driver XXX not detected"
2. Set initial health description based on detection status and don't
wait for the first health check.
3. Combine updating attributes on the node, fingerprint and health
checking update for drivers into a single call back.
4. Condensed driver info in `node status` only shows detected drivers
and make the output less wide by removing spaces.
2018-04-12 12:46:40 -07:00
Charlie Voiselle
ba88f00ccb
Changed "til" to "until"
...
Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.
2018-04-11 12:36:28 -05:00
Andrei Burd
502d17fa90
Added node class to tagged metrics
2018-04-11 12:20:59 +03:00
Chelsea Komlo
eb5aac16e6
Merge pull request #4111 from hashicorp/b-undetected-set-health-to-false
...
Immediately set driver health status to false when driver moves to undetected
2018-04-10 18:30:31 -04:00
Chelsea Holland Komlo
d58b3e473c
update comment for when the fingerprinter sets health status
2018-04-10 16:53:00 -04:00
Chelsea Holland Komlo
f7ef13cc64
fingerprinter should set health check status if health check is not periodic
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo
ede4f518bd
add setters for access to the fingerprint manager's node
...
refactor extracting driver info
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo
f479da19f5
guard against overwriting health status
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo
ece1618815
immediately set healthy to false when driver moves to undetected
2018-04-10 15:29:51 -04:00
Alex Dadgar
3d367d6fd7
Fix client uptime metric missing client prefix
2018-04-10 10:39:36 -07:00
Seth Vargo
df4fe7e76c
Set user-agent when talking to GCE metadata
2018-04-10 10:36:46 -04:00
Chelsea Komlo
d3bd8fb96e
Merge pull request #4109 from hashicorp/f-shorten-docker-health-timeout
...
Shorten docker health timeout
2018-04-09 15:38:39 -04:00
Chelsea Holland Komlo
ea4b65dd41
only initialize docker clients if they are nil
2018-04-09 14:13:07 -04:00
Chelsea Holland Komlo
288c7a33a1
refactoring simplification from code review
2018-04-09 10:34:17 -04:00
Chelsea Holland Komlo
6e3b056c37
only run health check if driver moves from undetected to detected
2018-04-09 10:10:43 -04:00
Alex Dadgar
ae1f76477e
Start rebalance after discovering new servers
2018-04-05 15:41:59 -07:00
Alex Dadgar
929b6823a3
Merge pull request #4106 from hashicorp/b-servers
...
Improved Client handling of failed RPCs
2018-04-05 13:48:50 -07:00
Alex Dadgar
be2513e0f9
more jitter
2018-04-05 13:48:33 -07:00
Chelsea Holland Komlo
d3637825ef
group similar functions; update comments
...
health check timeout should be 1 minute
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo
e8743f1f7b
remove do once block when creating a new docker client
...
only set cached connections upon no error
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo
d0d793fc23
use client with shorter timeouts for health checks
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo
5d1b2b77cb
refactor docker clients method to be able to extend to creating new clients
2018-04-05 16:19:02 -04:00
Alex Dadgar
bd3345942c
Handle no leader and faster retries near limit
...
Handle the ErrNoLeader case and apply slower retries. Also when we have
missed the heartbeat retry aggressively, backing off after we have
missed for more than 30 seconds.
2018-04-05 11:22:47 -07:00
Alex Dadgar
279b5c22e5
Scale heartbeat retrying based on remaining heartbeat time
2018-04-05 10:58:13 -07:00
Alex Dadgar
7941f4eb2d
Fire retry only when consul discovers new servers
2018-04-05 10:40:17 -07:00
Preetha
6254d75eee
Merge pull request #4101 from hashicorp/b-rescheduling-edge-fixes
...
Fixes edge cases around timing/ task finish time being set more than once
2018-04-04 16:18:21 -05:00
Preetha Appan
12ba4c45da
remove outdated commented out test code
2018-04-04 15:03:24 -05:00
Preetha Appan
6363a6fb4d
Remove old comment
2018-04-04 15:01:48 -05:00
Preetha Appan
5e4525bd30
Moves setting finishedAt to the right place and adds two unit tests.
2018-04-04 14:38:15 -05:00
Alex Dadgar
86c32358d4
Spelling error
2018-04-03 18:30:01 -07:00
Alex Dadgar
01a6beafbf
RPC Retry Watcher
2018-04-03 18:05:28 -07:00
Preetha Appan
e6bbce3fa0
Add comment
2018-04-03 19:49:03 -05:00
Alex Dadgar
ec844f19d9
randomize servers
2018-04-03 17:46:13 -07:00
Preetha Appan
00537c739b
Fixes edge cases around timing and task finish time being set more than once
2018-04-03 16:34:59 -05:00
Alex Dadgar
58a3ec3fb2
Improve Vault error handling
2018-04-03 14:29:22 -07:00
Alex Dadgar
86f9044676
remove generated files
2018-03-30 16:52:49 -07:00
Alex Dadgar
af81349dbe
Generated files
2018-03-30 16:14:40 -07:00
Michael Schurter
257ba5937d
test: don't rely on alloc runner update count
...
We were incorrectly relying on the count of alloc updates in a number of
tests. Since alloc updates are async, their number is non-deterministic
and largely meaningless.
This should fix quite a few flaky tests in Travis and prevent future
mistaken assumptions in tests.
2018-03-30 09:34:33 -07:00
Michael Schurter
62e9553333
Merge pull request #4069 from hashicorp/f-hashealth
...
add HasHealth helper for nil checks
2018-03-29 17:03:20 -07:00
Alex Dadgar
beee130a6e
Always capture the finish time
2018-03-29 11:27:22 -07:00
Michael Schurter
91b5bb58d9
add HasHealth helper for nil checks
...
We performed the DeploymentStatus nil checks a couple different ways, so
hopefully this helper will consolidate them and make it more clear what
the code is doing.
2018-03-29 09:29:19 -07:00
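A minimal sketch of the kind of nil-check helper the commit above describes. The `DeploymentStatus` type and its `Healthy` field here are hypothetical stand-ins for illustration and are not taken from the commit itself; the point is only that a method with a pointer receiver can fold both nil checks into one call.

```go
package main

import "fmt"

// DeploymentStatus is a hypothetical stand-in type for illustration.
// Healthy is a pointer so that "not yet determined" (nil) can be told
// apart from an explicit true or false.
type DeploymentStatus struct {
	Healthy *bool
}

// HasHealth reports whether a health value has been set. It is safe to
// call on a nil *DeploymentStatus because the receiver is checked first.
func (d *DeploymentStatus) HasHealth() bool {
	return d != nil && d.Healthy != nil
}

func main() {
	var unset *DeploymentStatus
	healthy := true

	fmt.Println(unset.HasHealth())                                  // false
	fmt.Println((&DeploymentStatus{}).HasHealth())                  // false
	fmt.Println((&DeploymentStatus{Healthy: &healthy}).HasHealth()) // true
}
```

Callers can then write `if status.HasHealth() { ... }` instead of repeating `status != nil && status.Healthy != nil` in slightly different forms.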
Chelsea Komlo
4338360da9
Merge pull request #4065 from hashicorp/emit-node-event-on-first-health-change
...
Emit first node event after initialization on health status change
2018-03-29 11:23:25 -04:00
Chelsea Holland Komlo
2174ede6b9
add clarifying comment
2018-03-29 10:58:39 -04:00
Michael Schurter
3a79c32677
Merge pull request #4059 from hashicorp/b-drain-health-svc-only
...
only service allocs should have health watched
2018-03-28 16:49:22 -07:00
Michael Schurter
5eb0cb7176
only service allocs should have health watched
2018-03-28 16:20:11 -07:00
Chelsea Holland Komlo
e3319afee1
emit first node event
2018-03-28 17:26:53 -04:00
Chelsea Komlo
7812ac5abf
Merge pull request #4057 from hashicorp/specify-docker-msg
...
Specify docker name in driver health messages
2018-03-28 13:32:36 -04:00
Preetha
177d2d6010
Merge pull request #4052 from hashicorp/f-specify-total-memory
...
Allow to specify total memory on agent configuration
2018-03-28 12:28:41 -05:00
Chelsea Holland Komlo
efc03e252c
specify driver health messages
2018-03-28 11:35:21 -04:00
Preetha Appan
329428b49f
Code review feedback and unit test
2018-03-28 10:07:15 -05:00
Charlie Voiselle
ea10588227
rkt: logging enhancements (#4044)
...
* Added extra debug logging; extended timeout; added jitter.
* small log changes
* increase timeout
* remove unnecessary uuid
2018-03-27 17:30:06 -07:00
Michael Schurter
fcaee471a0
client: always mark exited sys/svc allocs as failed
...
When restarts.attempts=0 was set in a jobspec a system or service alloc
that exited with 0 status would be marked as `completed` instead of
`failed`. Since system and service jobs are intended to run until
stopped or updated, they should always be marked as failed when they
exit even in cases where the exit code is 0.
2018-03-27 14:30:19 -07:00
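The rule in the commit above can be sketched as a small decision function. This is an illustration under stated assumptions, not the client's actual code: the function name, the job type constants, and the handling of other job types (failed only on a non-zero exit code) are all hypothetical.

```go
package main

import "fmt"

// Hypothetical job type names used only for this sketch.
const (
	JobTypeService = "service"
	JobTypeSystem  = "system"
	JobTypeBatch   = "batch"
)

// exitedTaskFailed decides whether a task that has exited, and will not
// be restarted, should count as failed. Service and system jobs are meant
// to run until stopped or updated, so any exit counts as a failure, even
// with exit code 0. For other job types this sketch assumes that only a
// non-zero exit code is a failure.
func exitedTaskFailed(jobType string, exitCode int) bool {
	switch jobType {
	case JobTypeService, JobTypeSystem:
		return true
	default:
		return exitCode != 0
	}
}

func main() {
	fmt.Println(exitedTaskFailed(JobTypeService, 0)) // true
	fmt.Println(exitedTaskFailed(JobTypeSystem, 0))  // true
	fmt.Println(exitedTaskFailed(JobTypeBatch, 0))   // false
	fmt.Println(exitedTaskFailed(JobTypeBatch, 1))   // true
}
```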
Mildred Ki'Lya
1017cbe8ab
Allow to specify total memory on agent configuration
...
Allow setting the total memory of an agent in its configuration file. This
can be used in case the automatic detection doesn't work or in specific
environments when memory overcommit (using swap for example) can be
desirable.
2018-03-27 15:46:18 -05:00
Chelsea Holland Komlo
003bc209b9
use time.Time for node events for compatibility
2018-03-27 15:43:57 -04:00
Alex Dadgar
432784dae3
Fix alloc watcher snapshot streaming
2018-03-27 11:14:53 -07:00
Alex Dadgar
05449fea09
drop stats fetching log
2018-03-23 12:01:50 -07:00
Chelsea Komlo
5f0c382021
Merge pull request #4030 from hashicorp/health-check-ux
...
UX improvements to driver health checks
2018-03-23 09:46:50 -04:00
Alex Dadgar
da27fc3880
Driver Info output
2018-03-22 17:18:32 -07:00
Chelsea Holland Komlo
e9005d8cfb
ux improvements to driver health checks
2018-03-22 18:38:29 -04:00
Michael Schurter
a318684738
Merge pull request #4022 from hashicorp/f-more-executor-logging
...
executor: increase level for helpful log lines
2018-03-22 15:21:20 -07:00
Michael Schurter
a4f346abeb
remove spurious TODOs and FIXMEs
2018-03-21 16:55:22 -07:00
Michael Schurter
8b346c6176
test: try to prevent flakiness on travis
2018-03-21 16:51:45 -07:00
Michael Schurter
1b7ac447e9
alloc_runner: watch health for deployed batch jobs
2018-03-21 16:51:45 -07:00
Michael Schurter
62960ed7bd
client: don't monitor health of non-service jobs
...
Also fix system job draining; won't work without deadline fixes
2018-03-21 16:51:44 -07:00
Alex Dadgar
a37329189a
Improve DeadlineTime helper
2018-03-21 16:51:44 -07:00
Alex Dadgar
db4a634072
RPC, FSM, State Store for marking DesiredTransition
...
fix build tag
2018-03-21 16:49:48 -07:00
Michael Schurter
bb0ff44fb4
mock_driver: improve Kill() logging
2018-03-21 16:49:48 -07:00
Michael Schurter
c0542474db
drain: initial drainv2 structs and impl
2018-03-21 16:49:48 -07:00
Chelsea Holland Komlo
f329e45e03
always set initial health status for every driver
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
bbaffe3eca
set driver to unhealthy once if it cannot be detected in periodic check
2018-03-21 15:15:26 -04:00
Alex Dadgar
5df4b3728d
Docker driver doesn't return errors but injects into the DriverInfo
2018-03-21 15:15:26 -04:00
Alex Dadgar
4365bb7f59
Only run health check if driver is detected
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
f801709a0a
fix issue when updating node events
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
285729aee2
function rename and re-arrange functions in fingerprint_manager
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
60f12d206f
improve comments; update watchDriver
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo
739784736a
remove unused function
2018-03-21 15:15:26 -04:00