Commit graph

706 commits

Author SHA1 Message Date
Michael Schurter dbf4c3a3c8
Apply suggestions from code review
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-01-12 10:38:20 -06:00
Preetha Appan 7bd1440710
REfactor statedb factory config to set it directly in client config 2019-01-12 10:38:20 -06:00
Preetha Appan e237f19b38
Remove invalid allocs 2019-01-12 10:38:20 -06:00
Preetha Appan f059ef8a47
Modified destroy failure handling to rely on allocrunner's destroy method
Added a unit test with custom statedb implementation that errors, to
use to verify destroy errors
2019-01-12 10:37:12 -06:00
Preetha Appan 6c95da8f67
Add back code to mark alloc as failed when restore fails
Also modify restore such that any handled errors don't propagate
back to the client
2019-01-12 10:37:12 -06:00
Preetha Appan 5fde0b0f5c
Revert code that made an alloc update when restore fails
Restore currently shuts down the client so the alloc update cant
always make it to the server
2019-01-12 10:37:12 -06:00
Preetha Appan 41bfdd764b
Handle client initialization errors when adding allocs or restoring allocs
We mark the alloc as failed and track failed allocs so that we don't send
updates after the first time
2019-01-12 10:37:12 -06:00
Danielle Tomlinson 3e586e93da client: Cleanup allocrunner access 2019-01-11 18:39:18 +01:00
Alex Dadgar c9825a9c36 recover 2019-01-07 14:49:40 -08:00
Nick Ethier a96afb6c91
fix tests that fail as a result of async client startup 2018-12-20 00:53:44 -05:00
Michael Schurter d9ea8252a7 client/state: support upgrading from 0.8->0.9
Also persist and load DeploymentStatus to avoid rechecking health after
client restarts.
2018-12-19 10:39:27 -08:00
Nick Ethier a02308ee6a
drivermanager: attempt to reattach and shutdown driver plugin if blocked by allow/block lists 2018-12-18 23:01:57 -05:00
Nick Ethier ce1a5cba0e
drivermanager: use allocID and task name to route task events 2018-12-18 23:01:51 -05:00
Nick Ethier d8a0265e68
client: batch initial fingerprinting in plugin manangers
drivermanager: fix pr comments/feedback
2018-12-18 22:56:19 -05:00
Nick Ethier 7d23cbf448
client/drivermananger: fixup issues from rebase and address PR comments 2018-12-18 22:55:38 -05:00
Nick Ethier 82175d1328
client/drivermananger: add driver manager
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.

Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
2018-12-18 22:55:18 -05:00
Danielle Tomlinson cb78a90f40 client: Async API for shutdown/destroy allocrunners 2018-12-18 23:38:33 +01:00
Danielle Tomlinson d9174d8dcf
Merge pull request #4989 from hashicorp/dani/b-client-update-race-condition
client: Give a copy of clientconfig to allocrunner
2018-12-17 10:49:46 +01:00
Danielle Tomlinson 8b06e8d297
Merge pull request #4990 from hashicorp/dani/b-alloc-lock
client: updateAlloc release lock after read
2018-12-13 12:43:59 +01:00
Danielle Tomlinson 3823599da9 client: Give a copy of clientconfig to allocrunner
Currently, there is a race condition between creating a taskrunner, and
updating node attributes via fingerprinting.

This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.

This fixes that by providing a copy of the clientconfg to the
allocrunner inside the Read lock during config creation.
2018-12-13 12:42:15 +01:00
Danielle Tomlinson 4184eadaf4 client: updateAlloc release lock after read
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.

This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
2018-12-12 16:30:01 +01:00
Mahmood Ali 3d166e6e9c
Merge pull request #4984 from hashicorp/b-client-update-driver
client: update driver info on new driver fingerprint
2018-12-11 18:01:03 -05:00
Alex Dadgar 1531b6d534
Merge pull request #4970 from hashicorp/f-no-iops
Deprecate IOPS
2018-12-11 12:51:22 -08:00
Mahmood Ali ba515947c2 client: update driver info on new fingerprint
Fixes a bug where a driver health and attributes are never updated from
their initial status.  If a driver started unhealthy, it may never go
into a healthy status.
2018-12-11 14:25:10 -05:00
Danielle Tomlinson 805669ead4 client: Correctly pass a noop PrevAllocMigrator when restoring 2018-12-11 15:46:58 +01:00
Danielle Tomlinson 83720575de client: Unify handling of previous and preempted allocs 2018-12-11 13:12:35 +01:00
Danielle Tomlinson dff7093243 client: Wait for preempted allocs to terminate
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher, and then wait for the allocations to
terminate in the allocation PreRun hooks.

If there's no preempted allocations, then we simply provide a
NoopAllocWatcher.
2018-12-11 00:59:18 +01:00
Alex Dadgar 1e3c3cb287 Deprecate IOPS
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
2018-12-06 15:09:26 -08:00
Danielle Tomlinson 66c521ca17 client: Move fingerprint structs to pkg
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.

It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
2018-12-01 17:10:39 +01:00
Alex Dadgar 4ee603c382 Device hook and devices affect computed node class
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.

TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
2018-11-27 17:25:33 -08:00
Michael Schurter 1e4ef139dd
Merge pull request #4883 from hashicorp/f-graceful-shutdown
Support graceful shutdowns in agent
2018-11-27 15:55:15 -06:00
Michael Schurter 4f7e6f9464 client: fix races in use of goroutine group
The group utility struct does not support asynchronously launched
goroutines (goroutines-inside-of-goroutines), so switch those uses to a
normal go call.

This means watchNodeUpdates and watchNodeEvents may not be shutdown when
Shutdown() exits. During nomad agent shutdown this does not matter.

During tests this means a test may leak those goroutines or be unable to
know when those goroutines have exited.

Since there's no runtime impact and these goroutines do not affect alloc
state syncing it seems ok to risk leaking them.
2018-11-26 12:52:55 -08:00
Michael Schurter 9f43fb6d29 client: reuse group instead of diy'ing it 2018-11-26 12:52:31 -08:00
Michael Schurter 5bd744ac3d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
2018-11-19 16:39:30 -08:00
Mahmood Ali f139234372 address review comments 2018-11-16 17:13:01 -05:00
Mahmood Ali f72e599ee7 Populate alloc stats API with device stats
This change makes few compromises:

* Looks up the devices associated with tasks at look up time.  Given
that `nomad alloc status` is called rarely generally (compared to stats
telemetry and general job reporting), it seems fine.  However, the
lookup overhead grows bounded by number of `tasks x total-host-devices`,
which can be significant.

* `client.Client` performs the task devices->statistics lookup.  It
passes self to alloc/task runners so they can look up the device statistics
allocated to them.
  * Currently alloc/task runners are responsible for constructing the
entire RPC response for stats
  * The alternatives for making task runners device statistics aware
don't seem appealing (e.g. having task runners contain reference to hostStats)

* On the alloc aggregation resource usage, I did a naive merging of task device statistics.
  * Personally, I question the value of such aggregation, compared to
costs of struct duplication and bloating the response - but opted to be
consistent in the API.
  * With naive concatination, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.
2018-11-16 10:26:32 -05:00
Mahmood Ali 046f098bac Track Node Device attributes and serve them in API 2018-11-14 14:42:29 -05:00
Mahmood Ali b74ccc742c Expose Device Stats in /client/stats API endpoint 2018-11-14 14:41:19 -05:00
Alex Dadgar a7ca737fb6 review comments 2018-11-07 11:31:52 -08:00
Alex Dadgar 204ca8230c Device manager
Introduce a device manager that manages the lifecycle of device plugins
on the client. It fingerprints, collects stats, and forwards Reserve
requests to the correct plugin. The manager, also handles device plugins
failing and validates their output.
2018-11-07 10:43:15 -08:00
Michael Schurter b7a9d61a38 ar: initialize allocwatcher on restore
Fixes a panic. Left a comment on how the behavior could be improved, but
this is what releases <0.9.0 did.
2018-10-19 09:45:45 -07:00
Michael Schurter e060174130 ar: fix leader handling, state restoring, and destroying unrun ARs
* Migrated all of the old leader task tests and got them passing
* Refactor and consolidate task killing code in AR to always kill leader
  tasks first
* Fixed lots of issues with state restoring
* Fixed deadlock in AR.Destroy if AR.Run had never been called
* Added a new in memory statedb for testing
2018-10-19 09:45:45 -07:00
Nick Ethier 3183b33d24 client: review comments and fixup/skip tests 2018-10-16 16:56:56 -07:00
Nick Ethier f192c3752a client: refactor post allocrunnerv2 finalization 2018-10-16 16:56:56 -07:00
Nick Ethier 4a4c7dbbfc client: begin driver plugin integration
client: fingerprint driver plugins
2018-10-16 16:56:56 -07:00
Alex Dadgar 45e41cca03 allocrunnerv2 -> allocrunner 2018-10-16 16:56:56 -07:00
Alex Dadgar 6c9d9d5173 move files around 2018-10-16 16:56:55 -07:00
Michael Schurter 960f3be76c client: expose task state to client
The interesting decision in this commit was to expose AR's state and not
a fully materialized Allocation struct. AR.clientAlloc builds an Alloc
that contains the task state, so I considered simply memoizing and
exposing that method.

However, that would lead to AR having two awkwardly similar methods:
 - Alloc() - which returns the server-sent alloc
 - ClientAlloc() - which returns the fully materialized client alloc

Since ClientAlloc() could be memoized it would be just as cheap to call
as Alloc(), so why not replace Alloc() entirely?

Replacing Alloc() entirely would require Update() to immediately
materialize the task states on server-sent Allocs as there may have been
local task state changes since the server received an Alloc update.

This quickly becomes difficult to reason about: should Update hooks use
the TaskStates? Are state changes caused by TR Update hooks immediately
reflected in the Alloc? Should AR persist its copy of the Alloc? If so,
are its TaskStates canonical or the TaskStates on TR?

So! Forget that. Let's separate the static Allocation from the dynamic
AR & TR state!

 - AR.Alloc() is for static Allocation access (often for the Job)
 - AR.AllocState() is for the dynamic AR & TR runtime state (deployment
   status, task states, etc).

If code needs to know the status of a task: AllocState()
If code needs to know the names of tasks: Alloc()

It should be very easy for a developer to reason about which method they
should call and what they can do with the return values.
2018-10-16 16:56:55 -07:00
Michael Schurter 8d1419c62b client: fix accessing alloc runners
* GetClientAlloc() gains nothing from using allAllocs()
* getAllocatedResources was calling getAllocRunners() twice
2018-10-16 16:56:55 -07:00
Michael Schurter e6e2930a00 tr: implement stats collection hook
Tested except for the net/rpc specific error case which may need
changing in the gRPC world.
2018-10-16 16:53:31 -07:00
Alex Dadgar cebfead6bc add logger back 2018-10-16 16:53:30 -07:00
Alex Dadgar 8504505c0d client uses passed logger and fix fingerprinters 2018-10-16 16:53:30 -07:00
Michael Schurter 9d1ea3b228 client: hclog-ify most of the client
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Michael Schurter e42154fc46 implement stopping, destroying, and disk migration
* Stopping an alloc is implemented via Updates but update hooks are
  *not* run.
* Destroying an alloc is a best effort cleanup.
* AllocRunner destroy hooks implemented.
* Disk migration and blocking on a previous allocation exiting moved to
  its own package to avoid cycles. Now only depends on alloc broadcaster
  instead of also using a waitch.
* AllocBroadcaster now only drops stale allocations and always keeps the
  latest version.
* Made AllocDir safe for concurrent use

Lots of internal contexts that are currently unused. Unsure if they
should be used or removed.
2018-10-16 16:53:30 -07:00
Michael Schurter 4236255686 lots of comment/log fixes 2018-10-16 16:53:30 -07:00
Michael Schurter 357641c364 persist alloc state on changes, not periodically
Allow alloc and task runners to persist their own state when something
changes instead of periodically syncing all state.
2018-10-16 16:53:30 -07:00
Michael Schurter a3fe0510d1 Move all encoding and put deduping into state db
Still WIP as it does not handle deletions.
2018-10-16 16:53:30 -07:00
Michael Schurter 533bc93b3a implement all boltdb interactions behind StateDB 2018-10-16 16:53:30 -07:00
Michael Schurter a5d3e3fb0a Implement alloc updates in arv2
Updates are applied asynchronously but sequentially
2018-10-16 16:53:30 -07:00
Michael Schurter a4b4d7b266 consul service hook
Deregistration works but difficult to test due to terminal updates not
being fully implemented in the new client/ar/tr.
2018-10-16 16:53:29 -07:00
Michael Schurter 5be982e674 restore vault client 2018-10-16 16:53:29 -07:00
Alex Dadgar fd3bc1bd39 Update state with server 2018-10-16 16:53:29 -07:00
Michael Schurter 7f4ec50906 missed locking around c.allocs access 2018-10-16 16:53:29 -07:00
Michael Schurter 516d641db0 client: implement all-or-nothing alloc restoration
Restoring calls NewAR -> Restore -> Run

NewAR now calls NewTR
AR.Restore calls TR.Restore
AR.Run calls TR.Run
2018-10-16 16:53:29 -07:00
Alex Dadgar 80f6ce50c0 vault hook 2018-10-16 16:53:29 -07:00
Michael Schurter b360f6f96e fix hclog level 2018-10-16 16:53:29 -07:00
Michael Schurter 4f43ff5c51 pass statedb into allocrunnerv2 2018-10-16 16:53:29 -07:00
Michael Schurter 0f7dcfdc9a example redis job "runs" on arv2! see below
Tons left to do and lots of churn:
1. No state saving
2. No shutdown or gc
3. Removed AR factory *for now*
4. Made all "Config" structs local to the package they configure
5. Added allocID to GC to avoid a lookup

Really hating how many things use *structs.Allocation. It's not bad
without state saving, but if AllocRunner starts updating its copy things
get racy fast.
2018-10-16 16:53:29 -07:00
Alex Dadgar 01f8e5b95f renames 2018-10-04 14:57:25 -07:00
Alex Dadgar 52f9cd7637 fixing tests 2018-10-04 14:26:19 -07:00
Alex Dadgar 5c8697667e Node reserved resources 2018-09-29 18:44:55 -07:00
Alex Dadgar 3183153315 Node resources on client 2018-09-29 17:23:41 -07:00
Alex Dadgar 9971b3393f yamux 2018-09-17 14:22:40 -07:00
Alex Dadgar 7739ef51ce agent + consul 2018-09-13 10:43:40 -07:00
Michael Schurter 08862fc177 fix race around error handling 2018-09-05 17:34:17 -07:00
Preetha 043f4c208b
Merge pull request #3882 from burdandrei/telemetry-add-node-class-tag
Added node class to tagged metrics
2018-06-21 17:04:35 -05:00
Alex Dadgar b61051b3cd
Merge pull request #4409 from hashicorp/r-client-packages
Refactor client packages
2018-06-13 17:32:25 -07:00
Alex Dadgar 90c2108bfb Fix gc tests + parallel destroy + small test fixes 2018-06-12 10:23:45 -07:00
Alex Dadgar f5ff509fa5 Refactor - wip 2018-06-12 10:23:45 -07:00
Chelsea Holland Komlo f74e74b22d add client logic to determine whether TLS RPC connections should reload 2018-06-08 14:38:58 -04:00
Chelsea Holland Komlo 064b5481e0 add server join info to server and client 2018-05-31 10:50:03 -07:00
Chelsea Holland Komlo 38f611a7f2 refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing
add missing fields to TLS merge method
2018-05-23 18:35:30 -04:00
Chelsea Holland Komlo 796bae6f1b allow configurable cipher suites
disallow 3DES and RC4 ciphers

add documentation for tls_cipher_suites
2018-05-09 17:15:31 -04:00
Chelsea Holland Komlo 9b8a079558 fix up comments 2018-04-17 11:53:08 -04:00
Alex Dadgar 9d612c8cb0 Cleanup 2018-04-16 15:48:34 -07:00
Alex Dadgar 32adaf9dfc Copy the config given to the alloc runner 2018-04-16 15:45:52 -07:00
Alex Dadgar 4f2a7b6949 Fix copying drivers 2018-04-16 15:45:51 -07:00
Alex Dadgar 0b799822ff Operate on copy 2018-04-16 15:45:49 -07:00
Alex Dadgar ff1a1a63e8 Move where attribute for driver detection is set 2018-04-12 15:50:25 -07:00
Alex Dadgar f24ce2c50c Driver health detection cleanups
This PR does:

1. Health message based on detection has format "Driver XXX detected"
and "Driver XXX not detected"
2. Set initial health description based on detection status and don't
wait for the first health check.
3. Combine updating attributes on the node, fingerprint and health
checking update for drivers into a single call back.
4. Condensed driver info in `node status` only shows detected drivers
and make the output less wide by removing spaces.
2018-04-12 12:46:40 -07:00
Andrei Burd 502d17fa90 Added node class to tagged metrics 2018-04-11 12:20:59 +03:00
Alex Dadgar 3d367d6fd7 Fix client uptime metric missing client prefix 2018-04-10 10:39:36 -07:00
Alex Dadgar ae1f76477e Start rebalance after discovering new servers 2018-04-05 15:41:59 -07:00
Alex Dadgar be2513e0f9 more jitter 2018-04-05 13:48:33 -07:00
Alex Dadgar bd3345942c Handle no leader and faster retries near limit
Handle the ErrNoLeader case and apply slower retries. Also when we have
missed the heartbeat retry aggressively, backing off after we have
missed for more than 30 seconds.
2018-04-05 11:22:47 -07:00
Alex Dadgar 279b5c22e5 Scale heartbeat retrying based on remaining heartbeat time 2018-04-05 10:58:13 -07:00
Alex Dadgar 7941f4eb2d Fire retry only when consul discovers new servers 2018-04-05 10:40:17 -07:00
Alex Dadgar 86c32358d4
Spelling error 2018-04-03 18:30:01 -07:00
Alex Dadgar 01a6beafbf RPC Retry Watcher 2018-04-03 18:05:28 -07:00
Alex Dadgar 58a3ec3fb2 Improve Vault error handling 2018-04-03 14:29:22 -07:00
Chelsea Holland Komlo 2174ede6b9 add clarifying comment 2018-03-29 10:58:39 -04:00
Chelsea Holland Komlo e3319afee1 emit first node event 2018-03-28 17:26:53 -04:00
Chelsea Holland Komlo efc03e252c specify driver health messages 2018-03-28 11:35:21 -04:00
Chelsea Holland Komlo 003bc209b9 use time.Time for node events for compatibility 2018-03-27 15:43:57 -04:00
Chelsea Holland Komlo f801709a0a fix issue when updating node events 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 60f12d206f improve comments; update watchDriver 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 739784736a remove unused function 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d92703617c simplify logic
bump log level
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 86b7b3d2d9 fix up health check logic comparison; add node events to client driver checks 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 53a5bc2bb3 Code review feedback 2018-03-21 15:15:26 -04:00
Alex Dadgar 34dc58421c notes from walk through 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 44b6951dda improve tests 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 0425be8f48 updating comments; locking concurrent node access 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo c50d02ae93 go style; update comments 2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 3aa726baab fix scheduler driver name; create node structs file 2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 3cba95e8a7 allow nomad to schedule based on the status of a client driver health check
Slight updates for go style
2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 0bde357731 add concept of health checks to fingerprinters and nodes
fix up feedback from code review

add driver info for all drivers to node
2018-03-21 15:15:25 -04:00
Preetha Appan 3c38eededd
Fix spelling in comment 2018-03-14 15:54:25 -05:00
Alex Dadgar bef4a8ee09 fix clearing node events 2018-03-14 09:48:59 -07:00
Chelsea Komlo 810eedfa2a
Merge pull request #3945 from hashicorp/f-add-node-events
Add node events
2018-03-14 08:42:55 -04:00
Preetha 360d6e5a92
Merge pull request #3968 from hashicorp/f-nicer-vault-error
Make server side error messages from vault more clearer
2018-03-13 20:49:39 -05:00
Alex Dadgar de6ebb6e6c small cleanup 2018-03-13 18:08:22 -07:00
Chelsea Holland Komlo b41501e442 code review feedback 2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo 1488b076d1 code review feedback 2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo a8655320fd fix up go check warnings 2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo 0934769b04 add client side emitting of node events
Changelog
2018-03-13 18:08:21 -07:00
Preetha Appan 914eaed64f
Address some code review comments 2018-03-13 18:19:16 -05:00
Preetha Appan 09c231ce43
Return the err from server correctly 2018-03-13 18:10:14 -05:00
Preetha Appan 9618f52746
Remove error wrapping and make vault connection server side errors clearer. 2018-03-13 17:09:03 -05:00
Alex Dadgar 4844317cc2
Merge pull request #3890 from hashicorp/b-heartbeat
Heartbeat improvements and handling failures during establishing leadership
2018-03-12 14:41:59 -07:00
Josh Soref 173ce63fe9 spelling: transition 2018-03-11 19:06:05 +00:00
Josh Soref 782c704de6 spelling: thresholds 2018-03-11 19:03:47 +00:00
Josh Soref 8149694f3a spelling: server 2018-03-11 18:55:30 +00:00
Josh Soref 258d76ec13 spelling: registry 2018-03-11 18:41:13 +00:00
Josh Soref 3c1ce6d16d spelling: otherwise 2018-03-11 18:34:27 +00:00
Josh Soref 1ef6d6319e spelling: labels 2018-03-11 18:21:44 +00:00
Josh Soref 52b83328fc spelling: heartbeating 2018-03-11 18:12:19 +00:00
Josh Soref c9b86bbc2f spelling: controls 2018-03-11 17:50:39 +00:00
Josh Soref e78cf9c81a spelling: already 2018-03-11 17:39:04 +00:00
Josh Soref b8b46d3f74 spelling: allocation 2018-03-11 17:37:22 +00:00
Chelsea Holland Komlo 122d1c4e4a simplify retry logic 2018-03-01 09:48:26 -05:00
Chelsea Holland Komlo 355805db56 reset timer after updating node copy 2018-02-27 17:18:10 -05:00
Chelsea Holland Komlo a72aaaf47f add network resources equal method, use time ticker
remove impossible test case
2018-02-27 12:42:53 -05:00
Chelsea Holland Komlo e736e31820 use time ticker, update how network resources are compared 2018-02-26 18:47:11 -05:00
Chelsea Holland Komlo 5059065b52 improved testing; node networks comparison 2018-02-26 15:55:38 -05:00
Chelsea Holland Komlo 1f31b39fe8 code review fixups 2018-02-26 12:36:30 -05:00
Chelsea Holland Komlo ed8c8afbcd edge trigger node update
test update config copy trigger
2018-02-26 12:36:04 -05:00
Alex Dadgar 49a47483d1 Registering back to initializing
Fix a bug in which if the node attributes/meta changed, we would
re-register the node in status initializing. This would incorrectly
trigger the client to log that it missed its heartbeat.

It would change the status of the Node to initializing until the next
heartbeat occured.
2018-02-16 17:49:31 -08:00
Alex Dadgar eff4455c68 Fix original client server list behavior 2018-02-15 16:04:53 -08:00
Alex Dadgar f9cf642436 Client tls 2018-02-15 15:22:57 -08:00
Alex Dadgar e685211892 Code review feedback 2018-02-15 13:59:02 -08:00
Alex Dadgar 2c0ad26374 New RPC Modes and basic setup for streaming RPC handlers 2018-02-15 13:59:01 -08:00
Alex Dadgar 9bc75f0ad4 Fix manager tests and make testagent recover from port conflicts 2018-02-15 13:59:01 -08:00
Alex Dadgar 3f1f8604bb initial round of comment review 2018-02-15 13:59:01 -08:00
Alex Dadgar c8c1284bc3 SetServer command actually returns an error if given an invalid server 2018-02-15 13:59:01 -08:00
Alex Dadgar 3f786b904b use server manager 2018-02-15 13:59:01 -08:00
Alex Dadgar 6dd1c9f49d Refactor 2018-02-15 13:59:00 -08:00
Alex Dadgar 1472b943d6 Stats Endpoint 2018-02-15 13:59:00 -08:00
Chelsea Holland Komlo 4a26959825 code review feedback 2018-02-07 18:10:55 -05:00
Chelsea Holland Komlo d626d24488 remove dependency on client for fingerprint manager 2018-02-07 18:10:45 -05:00
Chelsea Holland Komlo e012e5ab8a add fingerprint manager 2018-02-07 18:10:33 -05:00
Chelsea Holland Komlo b21233fe23 update log message 2018-02-01 19:46:57 -05:00
Chelsea Holland Komlo 6f9c0ab361 req/resp should be within config locks; rename for detected fingerprints
changelog
2018-02-01 19:00:39 -05:00
Chelsea Holland Komlo b8e8064835 code review fixup 2018-01-31 18:34:03 -05:00
Chelsea Holland Komlo 7b53474a6e add applicable boolean to fingerprint response
public fields and remove getter functions
2018-01-31 13:21:45 -05:00
Chelsea Holland Komlo 9482c322b7 locks for fingerprint reads/writes 2018-01-30 11:32:45 -05:00
Chelsea Holland Komlo 7c19de797c create safe getters and setters for fingerprint response 2018-01-26 11:22:05 -05:00
Chelsea Holland Komlo 896d6f8058 fixups from code review 2018-01-26 07:04:32 -05:00
Chelsea Holland Komlo 9a8344333b refactor Fingerprint to request/response construct 2018-01-24 11:54:02 -05:00
Chelsea Holland Komlo 649f86f094 refactor creating a new tls configuration 2018-01-16 08:02:39 -05:00
Chelsea Holland Komlo 6c9f9c8ac3 adding additional test assertions; differentiate reloading agent and http server 2018-01-16 07:34:39 -05:00
Chelsea Holland Komlo 214d128eb9 reload raft transport layer
fix up linting
2018-01-08 14:52:28 -05:00
Chelsea Holland Komlo 0708d34135 call reload on agent, client, and server separately 2018-01-08 09:56:31 -05:00
Chelsea Holland Komlo 9741097406 reloading tls config should be atomic for clients/servers 2018-01-08 09:21:06 -05:00
Chelsea Holland Komlo ae7fc4695e fixups from code review
Revert "close raft long-lived connections"

This reverts commit 3ffda28206fcb3d63ad117fd1d27ae6f832b6625.

reload raft connections on changing tls
2018-01-08 09:21:06 -05:00
Chelsea Holland Komlo acd3d1b162 fix up downgrading client to plaintext
add locks around changing server configuration
2018-01-08 09:21:06 -05:00
Chelsea Holland Komlo c0ad9a4627 add ability to upgrade/downgrade nomad agents tls configurations via sighup 2018-01-08 09:21:06 -05:00
Alex Dadgar 91ffbbb517 Review feedback 2017-12-07 16:10:57 -08:00
Alex Dadgar 02baa6c52b Handle race between fingerprinters and registration 2017-12-07 13:09:37 -08:00
Alex Dadgar 4409fdacc0 Drop trace logging 2017-12-06 18:02:24 -08:00
Alex Dadgar cd9a7f14b8 Add logging around heartbeats 2017-12-06 17:57:50 -08:00
Chelsea Komlo 2dfda33703 Nomad agent reload TLS configuration on SIGHUP (#3479)
* Allow server TLS configuration to be reloaded via SIGHUP

* dynamic tls reloading for nomad agents

* code cleanup and refactoring

* ensure keyloader is initialized, add comments

* allow downgrading from TLS

* initalize keyloader if necessary

* integration test for tls reload

* fix up test to assert success on reloaded TLS configuration

* failure in loading a new TLS config should remain at current

Reload only the config if agent is already using TLS

* reload agent configuration before specific server/client

lock keyloader before loading/caching a new certificate

* introduce a get-or-set method for keyloader

* fixups from code review

* fix up linting errors

* fixups from code review

* add lock for config updates; improve copy of tls config

* GetCertificate only reloads certificates dynamically for the server

* config updates/copies should be on agent

* improve http integration test

* simplify agent reloading storing a local copy of config

* reuse the same keyloader when reloading

* Test that server and client get reloaded but keep keyloader

* Keyloader exposes GetClientCertificate as well for outgoing connections

* Fix spelling

* correct changelog style
2017-11-14 17:53:23 -08:00
Michael Schurter 1769db98b7 Fix regression by returning error on unknown alloc 2017-11-01 15:16:38 -05:00
Michael Schurter 73e9b57908 Trigger GCs after alloc changes
GC much more aggressively by triggering GCs when allocations become
terminal as well as after new allocations are added.
2017-11-01 15:16:38 -05:00
Michael Schurter 2a81160dcd Fix GC'd alloc tracking
The Client.allocs map now contains all AllocRunners again, not just
un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs
allocs.

Also stops logging "marked for GC" twice.
2017-11-01 15:16:38 -05:00
Alex Dadgar 4831380e57 Node access is done using locked Node copy
Fixes https://github.com/hashicorp/nomad/issues/3454

Reliably reproduced the data race before by having a fingerprinter
change the nodes attributes every millisecond and syncing at the same
rate. With fix, did not ever panic.
2017-10-27 13:27:24 -07:00
Michael Schurter 15b991e039 base64 migrate token
HTTP header values must be ASCII.

Also constant time compare tokens and test the generate and compare
helper functions.
2017-10-13 10:59:13 -07:00
Chelsea Holland Komlo e1c4701a43 fix up build warnings 2017-10-11 17:11:57 -07:00
Chelsea Holland Komlo b018ca4d46 fixing up code review comments 2017-10-11 17:09:20 -07:00
Chelsea Holland Komlo 410adaf726 Add functionality for authenticated volumes 2017-10-11 17:09:20 -07:00
Michael Schurter a66c53d45a Remove structs import from api
Goes a step further and removes structs import from api's tests as well
by moving GenerateUUID to its own package.
2017-09-29 10:36:08 -07:00
Alex Dadgar 4173834231 Enable more linters 2017-09-26 15:26:33 -07:00
Chelsea Holland Komlo b26454cf99 Move setGaugeForAllocationStats to emitClientMetrics 2017-09-25 16:05:49 +00:00
Alex Dadgar d306da846c changelog and feedback 2017-09-14 14:08:58 -07:00
Alex Dadgar 07ed83fdd5 Non-locked accessors to common Node fields
This PR removes locking around commonly accessed node attributes that do
not need to be locked. The locking could cause nodes to TTL as the
heartbeat code path was acquiring a lock that could be held for an
excessively long time. An example of this is when Vault is inaccessible,
since the fingerprint is run with a lock held but the Vault
fingerprinter makes the API calls with a large timeout.

Fixes https://github.com/hashicorp/nomad/issues/2689
2017-09-14 14:08:26 -07:00
Chelsea Holland Komlo 848af92183 fix panic in emitting tagged metrics 2017-09-11 15:32:37 +00:00
Chelsea Holland Komlo 0ef43c3c5f final code review fixups 2017-09-05 18:47:44 +00:00
Chelsea Holland Komlo a8cbd0b559 fixups from code review 2017-09-05 14:13:34 +00:00
Chelsea Holland Komlo f72e4aad13 labels depend on full setup of client beforehand 2017-09-05 14:13:34 +00:00
Chelsea Holland Komlo 87a814397d refactor to use baseLabels 2017-09-05 14:13:34 +00:00