Commit Graph

3961 Commits

Author SHA1 Message Date
Drew Bailey fc37448683
warn when enabled debug is on when registering
m -> a receiver name

return codederrors, fix query
2020-01-09 15:15:04 -05:00
Drew Bailey 50288461c9
Server request forwarding for Agent.Profile
Return rpc errors for profile requests, set up remote forwarding to
target leader or server id for profile requests.

server forwarding, endpoint tests
2020-01-09 15:15:03 -05:00
Drew Bailey 49ad5fbc85
agent pprof endpoints
wip, agent endpoint and client endpoint for pprof profiles

agent endpoint test
2020-01-09 15:15:02 -05:00
Tim Gross fa4da93578
interpolate environment for services in script checks (#6916)
In 0.10.2 (specifically 387b016) we added interpolation to group
service blocks and centralized the logic for task environment
interpolation. This wasn't also added to script checks, which caused a
regression where the IDs for script checks for services w/
interpolated fields (ex. the service name) didn't match the service ID
that was registered with Consul.

This changeset calls the same taskenv interpolation logic during
`script_check` configuration, and adds tests to reduce the risk of
future regressions by comparing the IDs of service hook and the check hook.
2020-01-09 08:12:54 -05:00
Tim Gross e9bac50c76
client: fix trace log message in alloc hook update (#6881) 2019-12-19 16:44:04 -05:00
Drew Bailey d9e41d2880
docs for shutdown delay
update docs, address pr comments

ensure pointer is not nil

use pointer for diff tests, set vs unset
2019-12-16 11:38:35 -05:00
Drew Bailey ae145c9a37
allow only positive shutdown delay
more explicit test case, remove select statement
2019-12-16 11:38:30 -05:00
Drew Bailey 24929776a2
shutdown delay for task groups
copy struct values

ensure groupserviceHook implements RunnerPreKillhook

run deregister first

test that shutdown times are delayed

move magic number into variable
2019-12-16 11:38:16 -05:00
Danielle b006be623d
Update client/fingerprint/env_aws.go
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-12-16 14:48:52 +01:00
Danielle Lancashire 5a87b3ab4b
env_aws: Disable Retries and set Session cfg
Previously, Nomad used hand rolled HTTP requests to interact with the
EC2 metadata API. Recently however, we switched to using the AWS SDK for
this fingerprinting.

The default behaviour of the AWS SDK is to perform retries with
exponential backoff when a request fails. This is problematic for Nomad,
because interacting with the EC2 API is in our client start path.

Here we revert to our pre-existing behaviour of not performing retries
in the fast path, as if the metadata service is unavailable, it's likely
that nomad is not running in AWS.
2019-12-16 10:56:32 +01:00
Mahmood Ali 4a1cc67f58
Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob
driver: allow disabling log collection
2019-12-13 11:41:20 -05:00
Mahmood Ali a7361612b6
Merge pull request #6556 from hashicorp/c-vendor-multierror-20191025
Update go-multierror library
2019-12-13 11:32:42 -05:00
Mahmood Ali 46bc3b57e6 address review comments 2019-12-13 11:21:00 -05:00
Mahmood Ali b3a1e571e5 tests: fix error format assertion
multierror library changed formatting slightly.
2019-12-13 11:01:20 -05:00
Chris Dickson 4d8ba272d1 client: expose allocated CPU per task (#6784) 2019-12-09 15:40:22 -05:00
Seth Hoenig f0c3dca49c tests: swap lib/freeport for tweaked helper/freeport
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.

Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.

This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
2019-12-09 08:37:32 -06:00
Mahmood Ali 0b7085ba3a driver: allow disabling log collection
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs.  Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.

Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.

The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.

This is meant as a stop gap measure.  Ideally, we'd follow up with at
least two changes:

First, we should optimize behavior when we can such that operators don't
need to disable docker log collection.  Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.

Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand.  This ensures that the cost of log collection
is paid sparingly.
2019-12-08 14:15:03 -05:00
Mahmood Ali ded2a725db
Merge pull request #6788 from hashicorp/b-timeout-logmon-stop
logmon: add timeout to RPC operations
2019-12-06 19:12:06 -05:00
Danielle Lancashire d2075ebae9 spellcheck: Fix spelling of retrieve 2019-12-05 18:59:47 -06:00
Mahmood Ali b2ae27863e
Merge pull request #6779 from hashicorp/r-aws-fingerprint-via-library
Use AWS SDK to access EC2 Metadata
2019-12-02 13:30:51 -05:00
Mahmood Ali 83089feff5 logmon: add timeout to RPC operations
Add an RPC timeout for logmon.  In
https://github.com/hashicorp/nomad/issues/6461#issuecomment-559747758 ,
`logmonClient.Stop` locked up and indefinitely blocked the task runner
destroy operation.

This is an incremental improvement.  We still need to follow up to
understand how we got to that state, and the full impact of locked-up
Stop and its link to pending allocations on restart.
2019-12-02 10:33:05 -05:00
Mahmood Ali 293276a457 fingerprint code refactor
Some code cleanup:

* Use a field for setting EC2 metadata instead of env-vars in testing;
but keep environment variables for backward compatibility reasons

* Update tests to use testify
2019-11-26 10:51:28 -05:00
Mahmood Ali 1e48f8e20d fingerprint: avoid api query if config overrides it 2019-11-26 10:51:28 -05:00
Mahmood Ali 5bb9089431 fingerprint: use ec2metadata package 2019-11-26 10:51:27 -05:00
Lars Lehtonen 0d344e8578
client: fix use of T.Fatal inside TestFS_logsImpl_NoFollow() goroutine. 2019-11-25 23:51:28 -08:00
Mahmood Ali e89108fb01 fixup! tests: don't assume eth0 network is available 2019-11-21 08:28:20 -05:00
Mahmood Ali 443804b5c7 tests: don't assume eth0 network is available
TestClient_UpdateNodeFromFingerprintKeepsConfig checks a test node
network interface, which is hardcoded to `eth0` and is updated
asynchronously.  This causes flakiness when eth0 isn't available.

Here, we hardcode the value to an arbitrary network interface.
2019-11-20 20:37:30 -05:00
Mahmood Ali ed3f1957e7 tests: run TestClient_WatchAllocs in non-linux environments 2019-11-20 20:37:29 -05:00
Mahmood Ali 521f51a929 testS: fix TestClient_RestoreError
When spinning a second client, ensure that it uses new driver
instances, rather than reuse the already shutdown unhealthy drivers from
first instance.

This speeds up tests significantly, but cutting ~50 seconds or so, the
timeout in NewClient until drivers fingerprints.  They never do because
drivers were shutdown already.
2019-11-20 20:37:28 -05:00
Mahmood Ali 4efb71cf0c tests: remove TestClient_RestoreError test
TestClient_RestoreError is very slow, taking ~81 seconds.

It has few problematic patterns.  It's unclear what it tests, it
simulates a failure condition where all state db lookup fails and
asserts that alloc fails.  Though starting from
https://github.com/hashicorp/nomad/pull/6216 , we don't fail allocs in
that condition but rather restart them.

Also, the drivers used in second client `c2` are the same singleton
instances used in `c1` and already shutdown.  We ought to start healthy
new driver instances.
2019-11-20 20:37:27 -05:00
Preetha be4a51d5b8
Merge pull request #6349 from hashicorp/b-host-stats
client: Return empty values when host stats fail
2019-11-20 10:13:02 -06:00
Lang Martin aa985ebe21
getter: allow the gcs download scheme (#6692) 2019-11-19 09:10:56 -05:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Drew Bailey b644e1f47d
add server-id to monitor specific server 2019-11-14 09:53:41 -05:00
Drew Bailey f4a7e3dc75
coordinate closing of doneCh, use interface to simplify callers
comments
2019-11-05 11:44:26 -05:00
Drew Bailey fe542680dc
log-json -> json
fix typo command/agent/monitor/monitor.go

Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com>

Update command/agent/monitor/monitor.go

Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com>

address feedback, lock to prevent send on closed channel

fix lock/unlock for dropped messages
2019-11-05 09:51:59 -05:00
Drew Bailey ddfa20b993
address feedback, fix gauge metric name 2019-11-05 09:51:57 -05:00
Drew Bailey 298b8358a9
move forwarded monitor request into helper 2019-11-05 09:51:56 -05:00
Drew Bailey 318b6c91bf
monitor command takes no args
rm extra new line

fix lint errors

return after close

fix, simplify test
2019-11-05 09:51:55 -05:00
Drew Bailey c7b633b6c1
lock in sub select
rm redundant lock

wip to use framing

wip switch to stream frames
2019-11-05 09:51:54 -05:00
Drew Bailey 17d876d5ef
rename function, initialize log level better
underscores instead of dashes for query params
2019-11-05 09:51:53 -05:00
Drew Bailey 8178beecf0
address feedback, use agent_endpoint instead of monitor 2019-11-05 09:51:53 -05:00
Drew Bailey db65b1f4a5
agent:read acl policy for monitor 2019-11-05 09:51:52 -05:00
Drew Bailey 2533617888
rpc acl tests for both monitor endpoints 2019-11-05 09:51:51 -05:00
Drew Bailey a45ae1cd58
enable json formatting, use queryoptions 2019-11-05 09:51:49 -05:00
Drew Bailey 786989dbe3
New monitor pkg for shared monitor functionality
Adds new package that can be used by client and server RPC endpoints to
facilitate monitoring based off of a logger

clean up old code

small comment about write

rm old comment about minsize

rename to Monitor

Removes connection logic from monitor command

Keep connection logic in endpoints, use a channel to send results from
monitoring

use new multisink logger and interfaces

small test for dropped messages

update go-hclogger and update sink/intercept logger interfaces
2019-11-05 09:51:49 -05:00
Drew Bailey e076204820
get local rpc endpoint working 2019-11-05 09:51:48 -05:00
Drew Bailey 976c43157c
remove log_writer
prefix output with proper spacing

update gzip handler, adjust first byte flow to allow gzip handler bypass

wip, first stab at wiring up rpc endpoint
2019-11-05 09:51:48 -05:00
Michael Schurter 9fed8d1bed client: fix panic from 0.8 -> 0.10 upgrade
makeAllocTaskServices did not do a nil check on AllocatedResources
which causes a panic when upgrading directly from 0.8 to 0.10. While
skipping 0.9 is not supported we intend to fix serious crashers caused
by such upgrades to prevent cluster outages.

I did a quick audit of the client package and everywhere else that
accesses AllocatedResources appears to be properly guarded by a nil
check.
2019-11-01 07:47:03 -07:00
Lars Lehtonen 4ed9427c77 client/allocwatcher: fix dropped test error (#6592) 2019-10-31 08:29:25 -04:00