Commit Graph

3470 Commits

Author SHA1 Message Date
Michael Schurter 64efc3d301 Emit events before long operations
Append when there's nothing blocking between appending and sending an
update to the server.
2018-10-16 16:53:30 -07:00
Michael Schurter a2b696c4cf Use a semaphore to block until watcher exits 2018-10-16 16:53:30 -07:00
Michael Schurter a73162c977 ar: use multierror in update hook loop
Make it match TaskRunner update hook behavior
2018-10-16 16:53:30 -07:00
Michael Schurter a7b427718c tr: refactor EmitEvents into Emit+Append
* UpdateState: set state, append event, persist, update servers
* EmitEvent: append event, persist, update servers
* AppendEvent: append event, persist

AppendEvent may not even have to persist, but for the sake of
correctness I'm going with that for now.
2018-10-16 16:53:30 -07:00
Michael Schurter 93f3ac9ed6 ar: create health setting shim for health watcher 2018-10-16 16:53:30 -07:00
Michael Schurter 4d5aaac6d2 fix detection of task transitioning to running 2018-10-16 16:53:30 -07:00
Michael Schurter 4136e59f79 arv2: implement alloc health watching
Also remove initial alloc from broadcaster as it just caused useless
extra processing.
2018-10-16 16:53:30 -07:00
Michael Schurter 5c5c6dc41b refactor ar hooks into their own files
minimize passed dependencies to ease testing
2018-10-16 16:53:30 -07:00
Michael Schurter 0bbf3a93ee make AllocBroadcaster easier to use
And test thoroughly.
2018-10-16 16:53:30 -07:00
Michael Schurter 9d1ea3b228 client: hclog-ify most of the client
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Michael Schurter e42154fc46 implement stopping, destroying, and disk migration
* Stopping an alloc is implemented via Updates but update hooks are
  *not* run.
* Destroying an alloc is a best effort cleanup.
* AllocRunner destroy hooks implemented.
* Disk migration and blocking on a previous allocation exiting moved to
  its own package to avoid cycles. Now only depends on alloc broadcaster
  instead of also using a waitch.
* AllocBroadcaster now only drops stale allocations and always keeps the
  latest version.
* Made AllocDir safe for concurrent use

Lots of internal contexts that are currently unused. Unsure if they
should be used or removed.
2018-10-16 16:53:30 -07:00
Michael Schurter 4236255686 lots of comment/log fixes 2018-10-16 16:53:30 -07:00
Michael Schurter 5749ede04e keep forgetting lxc 2018-10-16 16:53:30 -07:00
Michael Schurter 357641c364 persist alloc state on changes, not periodically
Allow alloc and task runners to persist their own state when something
changes instead of periodically syncing all state.
2018-10-16 16:53:30 -07:00
Michael Schurter 820af27171 wrap boltdb in a write deduplicator
Saves a tiny bit of cpu and some IO. Sadly doesn't prevent all IO on
duplicate writes as the transactions are still created and committed.

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/hashicorp/nomad/helper/boltdd
BenchmarkWriteDeduplication_On-4             500           4059591 ns/op           23736 B/op         56 allocs/op
BenchmarkWriteDeduplication_Off-4            300           4115319 ns/op           25942 B/op         55 allocs/op
2018-10-16 16:53:30 -07:00
Michael Schurter 990228a6e2 wip wrap boltdb to get path information
finished but doesn't handle deleting deeply nested buckets
2018-10-16 16:53:30 -07:00
Michael Schurter a3fe0510d1 Move all encoding and put deduping into state db
Still WIP as it does not handle deletions.
2018-10-16 16:53:30 -07:00
Michael Schurter 533bc93b3a implement all boltdb interactions behind StateDB 2018-10-16 16:53:30 -07:00
Michael Schurter d890de036a tr: persist hook state whenever it changes 2018-10-16 16:53:30 -07:00
Michael Schurter fae5e89a0e artifacts: don't emit event when there's no artifacts 2018-10-16 16:53:30 -07:00
Michael Schurter 5383d20505 removing old restoration path before api change 2018-10-16 16:53:30 -07:00
Michael Schurter a5d3e3fb0a Implement alloc updates in arv2
Updates are applied asynchronously but sequentially
2018-10-16 16:53:30 -07:00
Michael Schurter 39b3f3a85b call handle.Network() instead of storing it 2018-10-16 16:53:30 -07:00
Michael Schurter 7132b67c1e Add Network method to Handle interface
Should probably be moved to an Inspect method in the Driver Plugin world
2018-10-16 16:53:30 -07:00
Michael Schurter a4b4d7b266 consul service hook
Deregistration works but difficult to test due to terminal updates not
being fully implemented in the new client/ar/tr.
2018-10-16 16:53:29 -07:00
Michael Schurter 5be982e674 restore vault client 2018-10-16 16:53:29 -07:00
Michael Schurter ce04915c9f log before killing tasks 2018-10-16 16:53:29 -07:00
Michael Schurter a2bf851805 no need to TaskStateUpdated to return an error
also updated comments
2018-10-16 16:53:29 -07:00
Alex Dadgar fd3bc1bd39 Update state with server 2018-10-16 16:53:29 -07:00
Alex Dadgar bc905cc61d Define and thread through state updating interface 2018-10-16 16:53:29 -07:00
Michael Schurter 9a63d6103d tr: add validate task hook 2018-10-16 16:53:29 -07:00
Michael Schurter 7f4ec50906 missed locking around c.allocs access 2018-10-16 16:53:29 -07:00
Alex Dadgar c93cfc89c0 wip 2018-10-16 16:53:29 -07:00
Alex Dadgar 7ddc0eb65c Fix deadlock 2018-10-16 16:53:29 -07:00
Alex Dadgar 3779077052 Remove SetState from interface 2018-10-16 16:53:29 -07:00
Alex Dadgar e1ba73b515 compile 2018-10-16 16:53:29 -07:00
Michael Schurter 6ebdf532ea wip split event emitting and state transitions 2018-10-16 16:53:29 -07:00
Michael Schurter 516d641db0 client: implement all-or-nothing alloc restoration
Restoring calls NewAR -> Restore -> Run

NewAR now calls NewTR
AR.Restore calls TR.Restore
AR.Run calls TR.Run
2018-10-16 16:53:29 -07:00
Alex Dadgar e401c660e7 Implement lifecycle hooks on the task runner 2018-10-16 16:53:29 -07:00
Alex Dadgar 89b4ba9cc8 comments 2018-10-16 16:53:29 -07:00
Alex Dadgar 86e81947b4 Hook renames 2018-10-16 16:53:29 -07:00
Alex Dadgar 2599cf9d74 remove comment 2018-10-16 16:53:29 -07:00
Alex Dadgar 88aa0299a9 Template hook 2018-10-16 16:53:29 -07:00
Alex Dadgar c9765deff1 address comments 2018-10-16 16:53:29 -07:00
Alex Dadgar 80f6ce50c0 vault hook 2018-10-16 16:53:29 -07:00
Michael Schurter 30d377eba4 tr: improve skip log line 2018-10-16 16:53:29 -07:00
Michael Schurter ef213b864b tr: pass context to hooks 2018-10-16 16:53:29 -07:00
Michael Schurter 3a4f387fd3 tr: fix setting done in existing hooks 2018-10-16 16:53:29 -07:00
Michael Schurter b360f6f96e fix hclog level 2018-10-16 16:53:29 -07:00
Michael Schurter ae89b7da95 reimplement success state for tr hooks and state persistence
splits apart local and remote persistence

removes some locking *for now*
2018-10-16 16:53:29 -07:00
Michael Schurter 4f43ff5c51 pass statedb into allocrunnerv2 2018-10-16 16:53:29 -07:00
Michael Schurter 582c76a420 remove unused allocrunner shim 2018-10-16 16:53:29 -07:00
Michael Schurter c5504bd939 tr: cleanup main loop and shutdown hook impl 2018-10-16 16:53:29 -07:00
Michael Schurter 561260d6fe tr: skip error/success saving
All hooks only need to be run once.
Since only one hook can fail per run there's no need to
track errors on a per hook basis.
2018-10-16 16:53:29 -07:00
Michael Schurter 67874e761f tr: don't lock for immutable fields 2018-10-16 16:53:29 -07:00
Michael Schurter f473cd03d6 tr: start update/shutdown logic 2018-10-16 16:53:29 -07:00
Michael Schurter 637ef264ae Copy TR.Config vals to TR
I think I like this pattern better as some Config vals are mutable
(Alloc) and some aren't and some are used to derive other values and
never used directly.

Promoting them onto the TR struct is a little more work but is hopefully
more clear as to how each value is used.
2018-10-16 16:53:29 -07:00
Michael Schurter 0f7dcfdc9a example redis job "runs" on arv2! see below
Tons left to do and lots of churn:
1. No state saving
2. No shutdown or gc
3. Removed AR factory *for now*
4. Made all "Config" structs local to the package they configure
5. Added allocID to GC to avoid a lookup

Really hating how many things use *structs.Allocation. It's not bad
without state saving, but if AllocRunner starts updating its copy things
get racy fast.
2018-10-16 16:53:29 -07:00
Michael Schurter 9a6aa38b0f begin adding AllocRunner.Update 2018-10-16 16:53:29 -07:00
Michael Schurter eae54e2954 artifact task hook 2018-10-16 16:53:29 -07:00
Alex Dadgar b9bed81e6e Initial V2 alloc runner 2018-10-16 16:53:28 -07:00
Alex Dadgar a78cefec18 use int64 2018-10-16 15:34:32 -07:00
Preetha Appan 7c0d8c646c
Change CPU/Disk/MemoryMB to int everywhere in new resource structs 2018-10-16 16:21:42 -05:00
Christian Winther 0c5154100c
fix: increase log rotator line scan limit
In case where gelf/json logging is used, its fairly easy to exceed the 16k limit, resulting in json output being cut up into multiple strings

the result is invalid json lines which can create all kind of badness in the logging server

This fixes https://github.com/hashicorp/nomad/issues/4699

Signed-off-by: Christian Winther <jippignu@gmail.com>
2018-10-09 18:57:18 +02:00
Alex Dadgar 01f8e5b95f renames 2018-10-04 14:57:25 -07:00
Alex Dadgar 52f9cd7637 fixing tests 2018-10-04 14:26:19 -07:00
Alex Dadgar bac5cb1e8b Scheduler uses allocated resources 2018-10-02 17:08:25 -07:00
Alex Dadgar 5c8697667e Node reserved resources 2018-09-29 18:44:55 -07:00
Alex Dadgar 3183153315 Node resources on client 2018-09-29 17:23:41 -07:00
Alex Dadgar 9971b3393f yamux 2018-09-17 14:22:40 -07:00
Alex Dadgar ca28afa3b2 small fixes 2018-09-15 16:42:38 -07:00
Alex Dadgar 7739ef51ce agent + consul 2018-09-13 10:43:40 -07:00
Michael Schurter 08862fc177 fix race around error handling 2018-09-05 17:34:17 -07:00
Michael Schurter 6def5bc4f9 client: set host name when migrating over tls
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names ephemeral dir migrations were broken when TLS was enabled.

Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
2018-09-05 17:24:17 -07:00
Alex Dadgar c6576ddac1 Fix make check errors 2018-09-04 16:03:52 -07:00
Alex Dadgar 089b533047 Fix kill timeout exceeding 5m on Docker driver
Fixes an issue where the Docker API client would timeout before the kill
timeout was hit.
2018-08-17 16:01:09 -07:00
Alex Dadgar 49a1ba9297
Merge pull request #4535 from hashicorp/f-keep-docker-container-0.8.4
Option to prevent removal of container on exit
2018-07-26 11:11:22 -07:00
Charlie Voiselle f319a149cd Option to prevent removal of container on exit 2018-07-26 11:10:48 -07:00
Michael Schurter ddf948001e
Merge pull request #4462 from omame/omame/cpu_cfs_period
Add support for specifying cpu_cfs_period in the Docker driver
2018-07-25 09:34:38 -07:00
Daniele Valeriani b0a14caca2 Add test for cpu_cfs_period 2018-07-16 22:43:34 +02:00
Michael Schurter 91588cb861 rkt: revert to redis 3.2 to favor stability 2018-07-09 16:15:32 -07:00
Michael Schurter c56f899ee9 rkt: speed up tests
Disable networking when it's not needed and improve failure message for
UserGroup test by including the full ps output on failure.
2018-07-09 14:02:27 -07:00
Michael Schurter a1d4f77ce0 rkt: skip retrieving network information when net=none
Even when net=none we would attempt to retrieve network information from
rkt which would spew useless log lines such as:

```
testlog.go:30: 20:37:31.409209 [DEBUG] driver.rkt: failed getting network info for pod UUID 8303cfe6-0c10-4288-84f5-cb79ad6dbf1c attempt 2: no networks found. Sleeping for 970ms
```

It would also delay tests for ~60s during the network information retry
period.

So skip this when net=none. It's unlikely anyone actually uses net=none
outside of tests, so I doubt anyone will notice this change.

Official docs:
https://coreos.com/rkt/docs/latest/networking/overview.html#no-loopback-only-networking
2018-07-09 13:44:43 -07:00
Michael Schurter 0fbc84b81d tests: make alloc id consistent in helper
It worked, but the old code used a different alloc id for the path than
the actual alloc! Use the same alloc id everywhere to prevent confusing
test output.
2018-07-09 13:37:35 -07:00
Michael Schurter f3b8815c96 rkt: fix failing TestRktDriver_UserGroup test
Started failing due to the docker redis image switching from Debian
jessie to stretch:
53f8680550 (diff-acff46b161a3b7d6ed01ba79a032acc9)

Switched from Debian based image to Alpine to get a working `ps` command
again (albeit busybox's stripped down implementation)
2018-07-09 12:19:02 -07:00
Daniele Valeriani 748f6afd89 Validate the value of cpu_cfs_period 2018-07-02 22:30:22 +02:00
Daniele Valeriani 9364446a03 Remove an unnecessary conversion 2018-07-02 17:47:23 +02:00
Daniele Valeriani 906952a2c8 Add support for specifying cpu_cfs_period in the Docker driver 2018-07-02 16:37:04 +02:00
Preetha b567750824
Merge pull request #4392 from burdandrei/telemetry-parametrized-jobs
Parametrized/periodic jobs per child tagged metric emmision
2018-06-21 17:13:36 -05:00
Preetha 043f4c208b
Merge pull request #3882 from burdandrei/telemetry-add-node-class-tag
Added node class to tagged metrics
2018-06-21 17:04:35 -05:00
Andrei Burd 444ee45aff Parametrized/periodic jobs per child tagged metric emmision 2018-06-21 10:40:56 +03:00
James Rasell 75f95ccf09
Merge branch 'master' into f_gh_4381 2018-06-19 17:51:57 +02:00
Alex Dadgar b61051b3cd
Merge pull request #4409 from hashicorp/r-client-packages
Refactor client packages
2018-06-13 17:32:25 -07:00
Alex Dadgar 22757d964e lint 2018-06-13 16:06:39 -07:00
Alex Dadgar af558df94c Fix test using a lot of memory 2018-06-13 15:52:25 -07:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Chelsea Komlo 03075b603a
Merge pull request #4399 from hashicorp/r-reload-refactor
Refactor logic for dynamic reloading
2018-06-13 13:35:12 -04:00
Alex Dadgar 9bab9edf27 test fixes 2018-06-12 17:45:39 -07:00
Alex Dadgar 90c2108bfb Fix gc tests + parallel destroy + small test fixes 2018-06-12 10:23:45 -07:00
Alex Dadgar f5ff509fa5 Refactor - wip 2018-06-12 10:23:45 -07:00
Alex Dadgar ff2ab8f58e Fix vault template test 2018-06-12 09:57:28 -07:00
Alex Dadgar d0043691fb remove structs + bump version 2018-06-11 13:52:19 -07:00
Alex Dadgar af5753d2cd bump version + generated files 2018-06-11 13:39:42 -07:00
Nick Ethier f36eb14360
Merge pull request #4403 from hashicorp/b-fix-dispatched-optional-meta
Fix dispatched optional meta correctly
2018-06-11 16:17:14 -04:00
Nick Ethier e75e3ae665
nomad: use require pkg for tests 2018-06-11 13:50:50 -04:00
Nick Ethier 3aa6241b5c
client/driver/env: fix optional meta test 2018-06-11 12:29:13 -04:00
Nick Ethier c65882cafd
client/driver/env: use 'job.Dispatch' to trigger optional meta logic 2018-06-11 12:15:19 -04:00
Nick Ethier ccb5372813
Revert "Revert "client/driver/env: interpolate empty optional meta params as empty strings""
This reverts commit c17e0fc9dc5fd288935ab2b68fb441b4d25ac189.
2018-06-11 11:59:23 -04:00
Michael Schurter c198cfd8ea executor: fix log line formatting 2018-06-08 14:55:39 -07:00
Michael Schurter d1a60e700e executor: fix Windows blocking on pipe close
Sending the Ctrl-Break signal to PowerShell <6 causes it to drop into
debug mode. Closing its output pipe at that point will block
indefinitely and prevent the process from being killed by Nomad.

See the upstream powershell issue for details:
https://github.com/PowerShell/PowerShell/issues/4254
2018-06-08 14:48:05 -07:00
Chelsea Holland Komlo f74e74b22d add client logic to determine whether TLS RPC connections should reload 2018-06-08 14:38:58 -04:00
James Rasell b9009c419c
Add 'nomad.advertise.address' to client meta via NomadFingerPrint
This change removes the addition of the advertise address to the
exported task env vars and instead moves this work into the
NomadFingerprint.Fingerprint which adds this value to the client
attrs. This can then be used within a Nomad job like
${attr.nomad.advertise.address}.
2018-06-08 09:44:10 +02:00
Alex Dadgar d9b35fab52 Revert "client/driver/env: interpolate empty optional meta params as empty strings"
This reverts commit 84926f759a63a90be7bbcf0fad78deb3f02af23d.
2018-06-07 16:27:47 -07:00
Nick Ethier b3c767fae0
client/driver: drop docker pull progress estimate if its < 0 2018-06-07 15:23:31 -04:00
James Rasell 367a8b5152
Add the local clients advertise address to interpolation env vars
This commit adds the Nomad local client advertise address in the
form host:port to the environment variables passed to each task.
2018-06-07 09:45:15 +02:00
Alex Dadgar 98705824ed
Merge pull request #4185 from jesusvazquez/add-counter-metric-for-oom-killer-events
Add driver.docker counter metric for OOM Killer events
2018-06-04 15:12:51 -07:00
Alex Dadgar 23cd56dc78 remove generated structs 2018-06-01 16:11:28 -07:00
Alex Dadgar bf5b5747ab fix test message 2018-06-01 15:51:54 -07:00
Alex Dadgar 3e3d3c7445 Disable Exec on non-linux platforms
This PR disables exec on non-linux platforms
2018-06-01 15:48:14 -07:00
Alex Dadgar c0386819b3 bump version/lint/generated files 2018-06-01 15:23:10 -07:00
Preetha Appan ce6d4a8d7a
Fix tests and move isClient to constructor 2018-06-01 15:59:53 -05:00
Alex Dadgar a62dd2aadb
Merge pull request #4350 from hashicorp/b-raw-exec-cgroups
Raw exec can use cgroups to manage PIDs
2018-06-01 17:37:49 +00:00
Alex Dadgar 8da42940c9 wait for result 2018-06-01 10:14:53 -07:00
Alex Dadgar 40fec81315
Merge pull request #4277 from hashicorp/f-retry-join-clients
Add go-discover support to Nomad clients
2018-06-01 16:57:40 +00:00
Alex Dadgar 460ecb8705 Comments 2018-05-31 18:05:03 -07:00
Alex Dadgar de98774f2c Add test and docs 2018-05-31 18:05:03 -07:00
Alex Dadgar ff28b04c46 Use more appropriate name than cgroup 2018-05-31 18:05:03 -07:00
Alex Dadgar 37e900b1d3 Only use freezer/devices when in the basic cgroup only 2018-05-31 18:05:03 -07:00
Alex Dadgar ffd9270f2f Use cgroup when possible 2018-05-31 18:05:03 -07:00
Alex Dadgar 0ff0ed290d Fix TestDockerDriver_StartNVersions 2018-05-31 17:14:59 -07:00
Alex Dadgar 7e6dd498c9 Remove debug logging 2018-05-31 15:52:42 -07:00
Alex Dadgar b1b908527f spelling 2018-05-31 15:29:55 -07:00
Alex Dadgar a3b29553a5 Force close stdout/stderr after grace
This commit changes the force closing of the stdout/stderr file
descriptor from closing immediately to being closed after a grace
period. This allows the created process to close its own file and allows
copying of the data.
2018-05-31 15:21:36 -07:00
Alex Dadgar 5e787e2d72 test build 2018-05-31 12:22:31 -07:00
Alex Dadgar ead1b7f423 Log more info for TestExecutor_IsolationAndConstraints 2018-05-31 11:57:44 -07:00
Alex Dadgar b05740ad13
Merge pull request #4341 from hashicorp/f-docker-pids
Support Docker Pids Limit
2018-05-31 17:59:29 +00:00
Chelsea Holland Komlo 064b5481e0 add server join info to server and client 2018-05-31 10:50:03 -07:00
Alex Dadgar f4d4bbdc97 test pid limit 2018-05-30 12:55:24 -07:00
Chelsea Holland Komlo 94d510e969 Support Docker Pids Limit 2018-05-25 19:54:14 -04:00
Alex Dadgar 1685c8ebe4 cleanup 2018-05-24 16:25:20 -07:00
Alex Dadgar 2eacdb6bd6 Force closing of pipe to child process 2018-05-24 16:03:48 -07:00
Chelsea Holland Komlo 38f611a7f2 refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing
add missing fields to TLS merge method
2018-05-23 18:35:30 -04:00
Preetha 9084bb025e
Merge pull request #4303 from hashicorp/b-docker-client-nil-panic
Add nil check before setting timeout on docker client
2018-05-21 19:34:44 -07:00
Jesus Vazquez 23d959e42c Add job, task, taskgroup to open method 2018-05-21 20:37:18 +02:00
Jesus Vazquez 0a062a04c7 Remove allocID from dockerhandle struct 2018-05-21 20:33:01 +02:00
Jesus Vazquez e5a81815bb Rename labels job, task_group and task 2018-05-21 20:32:50 +02:00
Jesus Vazquez ffe1b1a1b6 Remove allocid label from driver.docker.oom counter metric 2018-05-21 20:30:56 +02:00
Alex Dadgar 38762d9bde
Merge pull request #4282 from hashicorp/f-rotator
Avoid splitting log line across two files
2018-05-21 17:52:13 +00:00
Alex Dadgar d95698e2c5
Merge pull request #4298 from justenwalker/docker-driver-digest-tags
driver/docker: pull image with digest
2018-05-21 17:46:14 +00:00
Nick Ethier 6392009dd6
client/driver: use correct repo address when using docker-credential helper (#4266) 2018-05-15 17:39:48 -04:00
Justen Walker a8989f33bb driver/docker: add test for dockerImageRef 2018-05-14 14:24:03 -04:00
Justen Walker 194b2231d6 driver/docker: fix up TestParseDockerImage 2018-05-14 14:23:48 -04:00
Justen Walker 25b2807ce3 driver/docker: fix TestDockerDriver_ForcePull_RepoDigest 2018-05-14 14:23:02 -04:00
Nick Ethier c4d07a2200
client/driver: gaurd authHelper test from running on windows 2018-05-14 13:46:57 -04:00
Justen Walker b23ca7574c driver/docker: cleanup parseDockerImage 2018-05-14 11:11:51 -04:00
Justen Walker 60f7f1aa08 driver/docker: pull image with digest
GH #4290

Add digest support to the docker driver image config. This commit
factors out some common code to print the repo:tag (dockerImageRef) for
events/logs as well as parsing the image to retreive the repo,tag
(parseDockerImage) so that the results are consistent/sane for both
repo:tag and repo@sha256:... references.

When pulling an image with a digest, the tag is blank and the repo
contains the digest. See:
https://github.com/fsouza/go-dockerclient/blob/master/image_test.go#L471
2018-05-14 10:42:58 -04:00
Preetha Appan de66ec7394
Add nil check before setting timeout on docker client 2018-05-11 17:09:26 -05:00
Alex Dadgar 7ad5c76734 Add new line test 2018-05-11 10:52:09 -07:00
Alex Dadgar 3671ed139d Avoid splitting log line across two files
We attempt to avoid splitting a log line between two files by detecting
if we are near the file size limit and scanning for new lines and only
flushing those.

BenchmarkRotator/1KB-8            300000              5613 ns/op
BenchmarkRotator/2KB-8            200000              8384 ns/op
BenchmarkRotator/4KB-8            100000             14604 ns/op
BenchmarkRotator/8KB-8             50000             25002 ns/op
BenchmarkRotator/16KB-8            30000             47572 ns/op
BenchmarkRotator/32KB-8            20000             92080 ns/op
BenchmarkRotator/64KB-8            10000            165883 ns/op
BenchmarkRotator/128KB-8            5000            294405 ns/op
BenchmarkRotator/256KB-8            2000            572374 ns/op
2018-05-10 15:11:01 -07:00
Alex Dadgar f5d91b5338 Benchmark for rotator
BenchmarkRotator/1KB-8            200000              5572 ns/op
BenchmarkRotator/2KB-8            200000              8338 ns/op
BenchmarkRotator/4KB-8            100000             14246 ns/op
BenchmarkRotator/8KB-8             50000             25279 ns/op
BenchmarkRotator/16KB-8            30000             48602 ns/op
BenchmarkRotator/32KB-8            20000             92159 ns/op
BenchmarkRotator/64KB-8            10000            154766 ns/op
BenchmarkRotator/128KB-8            5000            296872 ns/op
BenchmarkRotator/256KB-8            3000            551793 ns/op
2018-05-10 14:15:15 -07:00
Nick Ethier 91603a377e
client/driver: parse repo instead of attempting to pull repo info 2018-05-09 22:34:25 -04:00
Nick Ethier 38a33f9c75
client/driver: add test for docker auth helper 2018-05-09 22:33:56 -04:00
Alex Dadgar e067a9ae06 naming of constants 2018-05-09 16:46:52 -07:00
Chelsea Holland Komlo 796bae6f1b allow configurable cipher suites
disallow 3DES and RC4 ciphers

add documentation for tls_cipher_suites
2018-05-09 17:15:31 -04:00
Alex Dadgar 0e79e1a46e Keep stream and logs in sync for detecting closed pipe 2018-05-09 11:22:52 -07:00
Preetha e7ae6e98d9
Merge pull request #4259 from hashicorp/f-deployment-improvements 2018-05-08 16:37:10 -05:00
Nick Ethier 3598925ca4
client/driver: use correct repo address when using docker-credential helper 2018-05-08 15:17:28 -04:00
Nick Ethier 54c86a0292
client/driver/env: interpolate empty optional meta params as empty strings 2018-05-07 20:19:51 -04:00
Nick Ethier 016ab7a105
client/driver: remove unused const 'dockerPullProgressEmitInterval' 2018-05-07 16:24:48 -04:00
Michael Schurter f1d13683e6
consul: remove services with/without canary tags
Guard against Canary being set to false at the same time as an
allocation is being stopped: this could cause RemoveTask to be called
with the wrong Canary value and leaking a service.

Deleting both Canary values is the safest route.
2018-05-07 14:55:01 -05:00
Michael Schurter 50e04c976e
consul: support canary tags for services
Also refactor Consul ServiceClient to take a struct instead of a massive
set of arguments. Meant updating a lot of code but it should be far
easier to extend in the future as you will only need to update a single
struct instead of every single call site.

Adds an e2e test for canary tags.
2018-05-07 14:55:01 -05:00
Alex Dadgar df8fce4347
Ensure canaries tags are interpolated 2018-05-07 14:50:01 -05:00
Alex Dadgar 552604451c
rework where time gets set 2018-05-07 14:50:01 -05:00
Alex Dadgar ee50789c22
Initial implementation 2018-05-07 14:50:01 -05:00
Nick Ethier d8de354dbf
client/driver: add waiting layer status count to pull progress status msg 2018-05-07 12:18:20 -04:00
Nick Ethier 77af17efbc
client/driver: add seperate handler for emitting pull progress 2018-05-07 12:17:34 -04:00
Nick Ethier 0bdd976b7d
client/driver: remove pull timeout due to race condition that can lead to unexpected timeouts
If two jobs are pulling the same image simultaneously, which ever starts the pull first will set the pull timeout.
This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer timeout
causing the pull to potentially timeout much sooner than expected by the second job.
2018-05-07 12:18:11 -04:00
Nick Ethier 7c5821d7c6
client/driver: do accounting on layer pull progress 2018-05-07 12:17:53 -04:00
Nick Ethier 8efda7dc6c
client/driver: emit progress to all allocs pulling same image 2018-05-07 12:17:34 -04:00
Nick Ethier e35948ab91
client/driver: add image pull progress monitoring 2018-05-07 12:17:38 -04:00
Michael Schurter 0d534d30d6
Merge pull request #4251 from hashicorp/f-grpc-checks
Support Consul gRPC Health Checks
2018-05-04 14:55:16 -07:00
Michael Schurter f6a4713141 consul: make grpc checks more like http checks 2018-05-04 11:08:11 -07:00
Michael Schurter 382caec1e1 consul: initial grpc implementation
Needs to be more like http.
2018-05-04 11:08:11 -07:00
Jesus Vazquez 08a390448b Update counter driver.docker.oom labels 2018-05-04 14:02:34 +08:00
Jesus Vazquez 4f6db56283 Initialize dockerhandle with jobname, taskgroupname, taskname and allocid 2018-05-04 14:02:19 +08:00
Jesus Vazquez 127b764dfb Add Job, taskgroupname, taskname, and allocid to the DockerHandle struct 2018-05-04 14:01:26 +08:00
Jesus Vazquez fd1ff1a0cf Run goimports 2018-05-04 13:46:36 +08:00
Jesus Vazquez 5dd4059527 Add driver.docker counter metric for OOM Killer events 2018-05-04 13:46:36 +08:00
Michael Schurter 526af6a246 framer: fix early exit/truncation in framer 2018-05-02 10:46:16 -07:00
Michael Schurter f1a6aa103a framer: fix race and remove unused error var
In the old code `sending` in the `send()` method shared the Data slice's
underlying backing array with its caller. Clearing StreamFrame.Data
didn't break the reference from the sent frame to the StreamFramer's
data slice.
2018-05-02 10:46:16 -07:00
Michael Schurter 7360fe3a6d client: squelch errors on cleanly closed pipes 2018-05-02 10:46:16 -07:00
Michael Schurter ffff97e25f client: don't spin on read errors 2018-05-02 10:46:16 -07:00
Michael Schurter 5ef0a82e6e client: reset encoders between uses
According to go/codec's docs, Reset(...) should be called on
Decoders/Encoders before reuse:

https://godoc.org/github.com/ugorji/go/codec

I could find no evidence that *not* calling Reset() caused bugs, but
might as well do what the docs say?
2018-05-02 10:46:16 -07:00
Alex Dadgar de4af37249 version bump and remove generated 2018-04-27 11:10:00 -07:00
Alex Dadgar 845a43864a generated files 2018-04-27 10:45:40 -07:00
Alex Dadgar 35e06ddb31 Remove generated and version bump 2018-04-26 16:49:19 -07:00
Alex Dadgar 43192cefae generated files 2018-04-26 16:28:58 -07:00
Michael Schurter 0e602d4779
Merge pull request #4188 from hashicorp/f-rkt-stats
rkt: create parent cgroup to enable stats
2018-04-24 14:54:36 -07:00
Michael Schurter d687761ebf rkt: test Stats() and always run tests
Remove the NOMAD_TEST_RKT flag as a guard for rkt tests. Still require
Linux, root, and rkt to be installed. Only check for rkt installation
once in hopes of speeding up rkt tests a bit.
2018-04-24 11:05:42 -07:00
Javier Palomo Almena 3e6c01ffa1 docker tests: Fix usage of NewDriverContext 2018-04-23 22:51:06 +02:00
Javier Palomo Almena 74d3c5df07 DriverContext: Add the TaskGroup and the Job name
Adding this fields to the DriverContext object, will allow us to pass
them to the drivers.

An use case for this, will be to emit tagged metrics in the drivers,
which contain all relevant information:
- Job
- TaskGroup
- Task
- ...

Ref: https://github.com/hashicorp/nomad/pull/4185
2018-04-23 00:15:29 +02:00
Michael Schurter 4cee6cca6c rkt: create parent cgroup to enable stats
Having the Nomad executor create parent cgroups that rkt is launched
within allows the stats collection code used for the exec driver to Just
Work. The only downside is that now the Nomad executor's resource
utilization counts against the cgroups resource limits just as it does
for the exec driver.
2018-04-19 15:14:56 -07:00
Michael Schurter 1a85d0c990 run goimports 2018-04-19 11:16:28 -07:00
Michael Schurter d77c265d1f
Merge pull request #4168 from ninoles/b-2117-windows-group-process
B 2117 windows group process
2018-04-19 11:10:51 -07:00
Michael Schurter fdbcbd4e5b
Merge pull request #4058 from hashicorp/f-mock-by-default
[Post-0.8] test: build with mock_driver by default
2018-04-18 15:57:00 -07:00
Michael Schurter d3650fb2cd test: build with mock_driver by default
`make release` and `make prerelease` set a `release` tag to disable
enabling the `mock_driver`
2018-04-18 14:45:33 -07:00
Michael Schurter a991923389 tests: fix race in alloc_runner_test.go
I could not reproduce the failure locally even with `stress -cpu ...`
eating all the cpu it could on my machine.

But I think the race was in one of two places:
* The task could restart which could create new events
* I think there could be a race between the updater's version of events
  and alloc runners as updates are async

I fixed both. Here's hoping that fixes this flaky test.
2018-04-17 17:14:59 -07:00
Fabien Ninoles c81bec48c9 Merge branch 'master' into b-2117-windows-group-process 2018-04-17 13:47:25 -04:00
Fabien Ninoles 35cf641416 Update based on PR request. 2018-04-17 13:43:04 -04:00
Alex Dadgar c4ad76091d
Merge pull request #4166 from hashicorp/b-panic-fix-update
Fixes races accessing node and updating it during fingerprinting
2018-04-17 10:02:19 -07:00
Chelsea Holland Komlo 9b8a079558 fix up comments 2018-04-17 11:53:08 -04:00
Alex Dadgar 9d612c8cb0 Cleanup 2018-04-16 15:48:34 -07:00
Alex Dadgar 32adaf9dfc Copy the config given to the alloc runner 2018-04-16 15:45:52 -07:00
Alex Dadgar 3ff2d4d795 fix race node access 2018-04-16 15:45:51 -07:00
Alex Dadgar 4f2a7b6949 Fix copying drivers 2018-04-16 15:45:51 -07:00
Alex Dadgar 0b799822ff Operate on copy 2018-04-16 15:45:49 -07:00
Fabien Ninoles 27cf4995ce - Clean up for windows compilation.
- Set CREATE_NEW_PROCESS_GROUP for Windows subprocess.
- Ensure we only kill actual process that need to.
2018-04-14 13:58:42 -04:00
Michael Schurter 3836b8a335
Merge pull request #3572 from emate/master
Create new process group on process startup.
2018-04-13 11:56:38 -07:00
Alex Dadgar adaf4fa7e0 Remove generated structs 2018-04-12 16:35:31 -07:00
Alex Dadgar 663c4d0433 Version bump and generated files 2018-04-12 16:21:50 -07:00
Alex Dadgar ff1a1a63e8 Move where attribute for driver detection is set 2018-04-12 15:50:25 -07:00
Chelsea Holland Komlo 5291788b40 delete driver name from only health check attributes 2018-04-12 18:24:41 -04:00
Alex Dadgar 3d53d380f7 Fix tests 2018-04-12 14:29:30 -07:00
Alex Dadgar f24ce2c50c Driver health detection cleanups
This PR does:

1. Health message based on detection has format "Driver XXX detected"
and "Driver XXX not detected"
2. Set initial health description based on detection status and don't
wait for the first health check.
3. Combine updating attributes on the node, fingerprint and health
checking update for drivers into a single call back.
4. Condensed driver info in `node status` only shows detected drivers
and make the output less wide by removing spaces.
2018-04-12 12:46:40 -07:00
Charlie Voiselle ba88f00ccb Changed "til" to "until"
Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.
2018-04-11 12:36:28 -05:00
Andrei Burd 502d17fa90 Added node class to tagged metrics 2018-04-11 12:20:59 +03:00
Chelsea Komlo eb5aac16e6
Merge pull request #4111 from hashicorp/b-undetected-set-health-to-false
Immediately set driver health status to false when driver moves to undetected
2018-04-10 18:30:31 -04:00
Chelsea Holland Komlo d58b3e473c update comment for when the fingerprinter setting health status 2018-04-10 16:53:00 -04:00
Chelsea Holland Komlo f7ef13cc64 fingerprinter should set health check status if health check is not periodic 2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo ede4f518bd add setters for access to the fingerprint manager's node
refactor extracting driver info
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo f479da19f5 guard against overwriting health status 2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo ece1618815 immediately set healthy to false when driver moves to undetected 2018-04-10 15:29:51 -04:00
Alex Dadgar 3d367d6fd7 Fix client uptime metric missing client prefix 2018-04-10 10:39:36 -07:00
Seth Vargo df4fe7e76c
Set user-agent when talking to GCE metadata 2018-04-10 10:36:46 -04:00
Chelsea Komlo d3bd8fb96e
Merge pull request #4109 from hashicorp/f-shorten-docker-health-timeout
Shorten docker health timeout
2018-04-09 15:38:39 -04:00
Chelsea Holland Komlo ea4b65dd41 only initialize docker clients if they are nil 2018-04-09 14:13:07 -04:00
Chelsea Holland Komlo 288c7a33a1 refacotoring simplification from code review 2018-04-09 10:34:17 -04:00
Chelsea Holland Komlo 6e3b056c37 only run health check if driver moves from undetected to detected 2018-04-09 10:10:43 -04:00
Alex Dadgar ae1f76477e Start rebalance after discovering new servers 2018-04-05 15:41:59 -07:00
Alex Dadgar 929b6823a3
Merge pull request #4106 from hashicorp/b-servers
Improved Client handling of failed RPCs
2018-04-05 13:48:50 -07:00
Alex Dadgar be2513e0f9 more jitter 2018-04-05 13:48:33 -07:00
Chelsea Holland Komlo d3637825ef group similar functions; update comments
health check timeout should be 1 minute
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo e8743f1f7b remove do once block when creating a new docker client
only set cached connections upon no error
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo d0d793fc23 use client with shorter timeouts for health checks 2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo 5d1b2b77cb refactor docker clients method to be able to extend to creating new clients 2018-04-05 16:19:02 -04:00
Alex Dadgar bd3345942c Handle no leader and faster retries near limit
Handle the ErrNoLeader case and apply slower retries. Also when we have
missed the heartbeat retry aggressively, backing off after we have
missed for more than 30 seconds.
2018-04-05 11:22:47 -07:00
Alex Dadgar 279b5c22e5 Scale heartbeat retrying based on remaining heartbeat time 2018-04-05 10:58:13 -07:00
Alex Dadgar 7941f4eb2d Fire retry only when consul discovers new servers 2018-04-05 10:40:17 -07:00
Preetha 6254d75eee
Merge pull request #4101 from hashicorp/b-rescheduling-edge-fixes
Fixes edge cases around timing/ task finish time being set more than once
2018-04-04 16:18:21 -05:00
Preetha Appan 12ba4c45da
remove outdated commented out test code 2018-04-04 15:03:24 -05:00
Preetha Appan 6363a6fb4d
Remove old comment 2018-04-04 15:01:48 -05:00
Preetha Appan 5e4525bd30
Moves setting finishedAt to the right place and adds two unit tests. 2018-04-04 14:38:15 -05:00
Alex Dadgar 86c32358d4
Spelling error 2018-04-03 18:30:01 -07:00
Alex Dadgar 01a6beafbf RPC Retry Watcher 2018-04-03 18:05:28 -07:00
Preetha Appan e6bbce3fa0
Add comment 2018-04-03 19:49:03 -05:00
Alex Dadgar ec844f19d9 randomize servers 2018-04-03 17:46:13 -07:00
Preetha Appan 00537c739b
Fixes edge cases around timing and task finish time being set more than once 2018-04-03 16:34:59 -05:00
Alex Dadgar 58a3ec3fb2 Improve Vault error handling 2018-04-03 14:29:22 -07:00
Alex Dadgar 86f9044676 remove generated files 2018-03-30 16:52:49 -07:00
Alex Dadgar af81349dbe Generated files 2018-03-30 16:14:40 -07:00
Michael Schurter 257ba5937d test: don't rely on alloc runner update count
We were incorrectly relying on the count of alloc updates in a number of
tests. Since alloc updates are async, their number is non-determinstic
and largely meaningless.

This should fix quite a few flaky tests in Travis and prevent future
mistaken assumptions in tests.
2018-03-30 09:34:33 -07:00
Michael Schurter 62e9553333
Merge pull request #4069 from hashicorp/f-hashealth
add HasHealth helper for nil checks
2018-03-29 17:03:20 -07:00
Alex Dadgar beee130a6e Always capture the finish time 2018-03-29 11:27:22 -07:00
Michael Schurter 91b5bb58d9 add HasHealth helper for nil checks
We performed the DeploymentStatus nil checks a couple different ways, so
hopefully this helper will consoldiate them and make it more clear what
the code is doing.
2018-03-29 09:29:19 -07:00
Chelsea Komlo 4338360da9
Merge pull request #4065 from hashicorp/emit-node-event-on-first-health-change
Emit first node event after initialization on health status change
2018-03-29 11:23:25 -04:00
Chelsea Holland Komlo 2174ede6b9 add clarifying comment 2018-03-29 10:58:39 -04:00
Michael Schurter 3a79c32677
Merge pull request #4059 from hashicorp/b-drain-health-svc-only
only service allocs should have health watched
2018-03-28 16:49:22 -07:00
Michael Schurter 5eb0cb7176 only service allocs should have health watched 2018-03-28 16:20:11 -07:00
Chelsea Holland Komlo e3319afee1 emit first node event 2018-03-28 17:26:53 -04:00
Chelsea Komlo 7812ac5abf
Merge pull request #4057 from hashicorp/specify-docker-msg
Specify docker name in driver health messages
2018-03-28 13:32:36 -04:00
Preetha 177d2d6010
Merge pull request #4052 from hashicorp/f-specify-total-memory
Allow to specify total memory on agent configuration
2018-03-28 12:28:41 -05:00
Chelsea Holland Komlo efc03e252c specify driver health messages 2018-03-28 11:35:21 -04:00
Preetha Appan 329428b49f
Code review feedback and unit test 2018-03-28 10:07:15 -05:00
Charlie Voiselle ea10588227 rkt: logging enhancements (#4044)
* Added extra debug logging; extended timeout; added jitter.

* small log changes

* increase timeout

* remove unneccessary uuid
2018-03-27 17:30:06 -07:00
Michael Schurter fcaee471a0 client: always mark exited sys/svc allocs as failed
When restarts.attempts=0 was set in a jobspec a system or service alloc
that exited with 0 status would be marked as `completed` instead of
`failed`. Since system and service jobs are intended to run until
stopped or updated, they should always be marked as failed when they
exit even in cases where the exit code is 0.
2018-03-27 14:30:19 -07:00
Mildred Ki'Lya 1017cbe8ab
Allow to specify total memory on agent configuration
Allow to set the total memory of an agent in its configuration file. This
can be used in case the automatic detection doesn't work or in specific
environments when memory overcommit (using swap for example) can be
desirable.
2018-03-27 15:46:18 -05:00
Chelsea Holland Komlo 003bc209b9 use time.Time for node events for compatibility 2018-03-27 15:43:57 -04:00
Alex Dadgar 432784dae3 Fix alloc watcher snapshot streaming 2018-03-27 11:14:53 -07:00
Alex Dadgar 05449fea09 drop stats fetching log 2018-03-23 12:01:50 -07:00
Chelsea Komlo 5f0c382021
Merge pull request #4030 from hashicorp/health-check-ux
UX improvments to driver health checks
2018-03-23 09:46:50 -04:00
Alex Dadgar da27fc3880 Driver Info output 2018-03-22 17:18:32 -07:00
Chelsea Holland Komlo e9005d8cfb ux improvments to driver health checks 2018-03-22 18:38:29 -04:00
Michael Schurter a318684738
Merge pull request #4022 from hashicorp/f-more-executor-logging
executor: increase level for helpful log lines
2018-03-22 15:21:20 -07:00
Michael Schurter a4f346abeb remove spurious TODOs and FIXMEs 2018-03-21 16:55:22 -07:00
Michael Schurter 8b346c6176 test: try to prevent flakiness on travis 2018-03-21 16:51:45 -07:00
Michael Schurter 1b7ac447e9 alloc_runner: watch health for deployed batch jobs 2018-03-21 16:51:45 -07:00
Michael Schurter 62960ed7bd client: don't monitor health of non-service jobs
Also fix system job draining; won't work without deadline fixes
2018-03-21 16:51:44 -07:00
Alex Dadgar a37329189a Improve DeadlineTime helper 2018-03-21 16:51:44 -07:00
Alex Dadgar db4a634072 RPC, FSM, State Store for marking DesiredTransistion
fix build tag
2018-03-21 16:49:48 -07:00
Michael Schurter bb0ff44fb4 mock_driver: improve Kill() logging 2018-03-21 16:49:48 -07:00
Michael Schurter c0542474db drain: initial drainv2 structs and impl 2018-03-21 16:49:48 -07:00
Chelsea Holland Komlo f329e45e03 always set initial health status for every driver 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo bbaffe3eca set driver to unhealthy once if it cannot be detected in periodic check 2018-03-21 15:15:26 -04:00
Alex Dadgar 5df4b3728d Docker driver doesn't return errors but injects into the DriverInfo 2018-03-21 15:15:26 -04:00
Alex Dadgar 4365bb7f59 Only run health check if driver is detected 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo f801709a0a fix issue when updating node events 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 285729aee2 function rename and re-arrange functions in fingerprint_manager 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 60f12d206f improve comments; update watchDriver 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 739784736a remove unused function 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d92703617c simplify logic
bump log level
2018-03-21 15:15:26 -04:00