open-nomad

Author	SHA1	Message	Date
Alessandro De Blasis	07c670fdc0	cli: show `host_network` in `nomad status` (#11432 ) Enhance the CLI in order to return the host network in two flavors (default, verbose) of the `node status` command. Fixes: #11223. Signed-off-by: Alessandro De Blasis <alex@deblasis.net>	2021-11-05 09:02:46 -04:00
Michael Schurter	fd68bbc342	test: update tests to properly use AllocDir Also use t.TempDir when possible.	2021-10-19 10:49:07 -07:00
Michael Schurter	10c3bad652	client: never embed alloc_dir in chroot Fixes #2522 Skip embedding client.alloc_dir when building chroot. If a user configures a Nomad client agent so that the chroot_env will embed the client.alloc_dir, Nomad will happily infinitely recurse while building the chroot until something horrible happens. The best case scenario is the filesystem's path length limit is hit. The worst case scenario is disk space is exhausted. A bad agent configuration will look something like this: ```hcl data_dir = "/tmp/nomad-badagent" client { enabled = true chroot_env { # Note that the source matches the data_dir "/tmp/nomad-badagent" = "/ohno" # ... } } ``` Note that `/ohno/client` (the state_dir) will still be created but not `/ohno/alloc` (the alloc_dir). While I cannot think of a good reason why someone would want to embed Nomad's client (and possibly server) directories in chroots, there should be no cause for harm. chroots are only built when Nomad runs as root, and Nomad disables running exec jobs as root by default. Therefore even if client state is copied into chroots, it will be inaccessible to tasks. Skipping the `data_dir` and `{client,server}.state_dir` is possible, but this PR attempts to implement the minimum viable solution to reduce risk of unintended side effects or bugs. When running tests as root in a vm without the fix, the following error occurs: ``` === RUN TestAllocDir_SkipAllocDir alloc_dir_test.go:520: Error Trace: alloc_dir_test.go:520 Error: Received unexpected error: Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long Test: TestAllocDir_SkipAllocDir --- FAIL: TestAllocDir_SkipAllocDir (22.76s) ``` Also removed unused Copy methods on AllocDir and TaskDir structs. Thanks to @eveld for not letting me forget about this!	2021-10-18 09:22:01 -07:00
James Rasell	444d25db07	Merge pull request #11280 from benbuzbee/log-err Log error if there are no event handlers registered	2021-10-14 14:49:22 +02:00
Michael Schurter	59fda1894e	Merge pull request #11167 from a-zagaevskiy/master Support configurable dynamic port range	2021-10-13 16:47:38 -07:00
Ben Buzbee	573fb840fa	Log error if there are no event handlers registered We see this error all the time ``` no handler registered for event event.Message=, event.Annotations=, event.Timestamp=0001-01-01T00:00:00Z, event.TaskName=, event.AllocID=, event.TaskID=, ``` So we're handling an even with all default fields. I noted that this can happen if only err is set as in ``` func (d driverPluginClient) handleTaskEvents(reqCtx context.Context, ch chan TaskEvent, stream proto.Driver_TaskEventsClient) { defer close(ch) for { ev, err := stream.Recv() if err != nil { if err != io.EOF { ch <- &TaskEvent{ Err: grpcutils.HandleReqCtxGrpcErr(err, reqCtx, d.doneCtx), } } ``` In this case Err fails to be serialized by the logger, see this test ``` ev := &drivers.TaskEvent{ Err: fmt.Errorf("errz"), } i.logger.Warn("ben test", "event", ev) i.logger.Warn("ben test2", "event err str", ev.Err.Error()) i.logger.Warn("ben test3", "event err", ev.Err) ev.Err = nil i.logger.Warn("ben test4", "nil error", ev.Err) 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.643900Z","driver":"mock_driver","event":{"TaskID":"","TaskName":"","AllocID":"","Timestamp":"0001-01-01T00:00:00Z","Message":"","Annotations":null,"Err":{}}} 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test2","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644226Z","driver":"mock_driver","event err str":"errz"} 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test3","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644240Z","driver":"mock_driver","event err":"errz"} 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test4","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644252Z","driver":"mock_driver","nil error":null} ``` Note in the first example err is set to an empty object and the error is lost. What we want is the last two examples which call out the err field explicitly so we can see what it is in this case	2021-10-11 19:44:52 +00:00
Florian Apolloner	709c1a2947	Fixed creation of ControllerCreateVolumeRequest. (#11238 )	2021-10-06 10:17:39 -04:00
Mahmood Ali	c86cff02f9	logmon: Fix a memory leak on task restart Fix a logmon leak causing high goroutine and memory usage when a task restarts. Logmon `FileRotator` buffers the task stdout/stderr streams and periodically flushing them to log files. Logmon creates a new FileRotator for each stream for each task run. However, the `flushPeriodically` goroutine is leaked when a task restarts, holding a reference to a no-longer-needed `FileRotator` instance along with its 64kb buffer. The cause is that the code assumed `time.Ticker.Stop()` closes the ticker channel, thereby terminating the goroutine, but the documentation says otherwise: > Stop turns off a ticker. After Stop, no more ticks will be sent. Stop does not close the channel, to prevent a concurrent goroutine reading from the channel from seeing an erroneous "tick". https://pkg.go.dev/time#Ticker.Stop	2021-10-05 12:11:53 -04:00
Mahmood Ali	9668245c4c	logmon: add a test for leaked goroutines	2021-10-05 12:11:42 -04:00
Mahmood Ali	614ade1bb6	logmon: refactor Logging tests Mostly to use testify assertions and close open resources	2021-10-05 12:10:58 -04:00
Michael Schurter	7071425af3	client: defensively log reserved ports - Fix test broken due to being improperly setup. - Include min/max ports in default client config.	2021-10-04 15:43:35 -07:00
Mahmood Ali	4d90afb425	gofmt all the files mostly to handle build directives in 1.17.	2021-10-01 10:14:28 -04:00
Michael Schurter	c6e72b6818	client: output reserved ports with min/max ports Also add a little more min/max port testing and add the consts back that had been removed: but unexported and as defaults.	2021-09-30 17:05:46 -07:00
Luiz Aoqui	a7698dedba	Disable PowerShell profile and simplify fingerprinting link speed on Windows (#11183 )	2021-09-22 11:17:47 -04:00
Michael Schurter	4ad0c258b9	client: add NOMAD_LICENSE to default env deny list By default we should not expose the NOMAD_LICENSE environment variable to tasks. Also refactor where the DefaultEnvDenyList lives so we don't have to maintain 2 copies of it. Since client/config is the most obvious location, keep a reference there to its unfortunate home buried deep in command/agent/host. Since the agent uses this list as well for the /agent/host endpoint the list must be accessible from both command/agent and client.	2021-09-21 13:51:17 -07:00
James Rasell	2c49063f85	client: task env vars should take precendece over host env vars.	2021-09-20 09:15:28 +01:00
Luiz Aoqui	edd32ba571	Log network device name during fingerprinting (#11184 )	2021-09-16 10:48:31 -04:00
James Rasell	0e926ef3fd	allow configuration of Docker hostnames in bridge mode (#11173 ) Add a new hostname string parameter to the network block which allows operators to specify the hostname of the network namespace. Changing this causes a destructive update to the allocation and it is omitted if empty from API responses. This parameter also supports interpolation. In order to have a hostname passed as a configuration param when creating an allocation network, the CreateNetwork func of the DriverNetworkManager interface needs to be updated. In order to minimize the disruption of future changes, rather than add another string func arg, the function now accepts a request struct along with the allocID param. The struct has the hostname as a field. The in-tree implementations of DriverNetworkManager.CreateNetwork have been modified to account for the function signature change. In updating for the change, the enhancement of adding hostnames to network namespaces has also been added to the Docker driver, whilst the default Linux manager does not current implement it.	2021-09-16 08:13:09 +02:00
Aleksandr Zagaevskiy	ebb87e65fe	Support configurable dynamic port range	2021-09-10 11:52:47 +03:00
James Rasell	d4a333e9b5	lint: mark false positive or fix gocritic append lint errors.	2021-09-06 10:49:44 +02:00
James Rasell	b6813f1221	chore: fix incorrect docstring formatting.	2021-08-30 11:08:12 +02:00
Mahmood Ali	c37339a8c8	Merge pull request #9160 from hashicorp/f-sysbatch core: implement system batch scheduler	2021-08-16 09:30:24 -04:00
Michael Schurter	a7aae6fa0c	Merge pull request #10848 from ggriffiths/listsnapshot_secrets CSI Listsnapshot secrets support	2021-08-10 15:59:33 -07:00
Mahmood Ali	efcc8bf082	Speed up client startup and registration (#11005 ) Speed up client startup, by retrying more until the servers are known. Currently, if client fingerprinting is fast and finishes before the client connect to a server, node registration may be delayed by 15 seconds or so! Ideally, we'd wait until the client discovers the servers and then retry immediately, but that requires significant code changes. Here, we simply retry the node registration request every second. That's basically the equivalent of check if the client discovered servers every second. Should be a cheap operation. When testing this change on my local computer and where both servers and clients are co-located, the time from startup till node registration dropped from 34 seconds to 8 seconds!	2021-08-10 17:06:18 -04:00
James Rasell	a9a04141a3	consul/connect: avoid warn messages on connect proxy errors When creating a TCP proxy bridge for Connect tasks, we are at the mercy of either end for managing the connection state. For long lived gRPC connections the proxy could reasonably expect to stay open until the context was cancelled. For the HTTP connections used by connect native tasks, we experience connection disconnects. The proxy gets recreated as needed on follow up requests, however we also emit a WARN log when the connection is broken. This PR lowers the WARN to a TRACE, because these disconnects are to be expected. Ideally we would be able to proxy at the HTTP layer, however Consul or the connect native task could be configured to expect mTLS, preventing Nomad from MiTM the requests. We also can't mange the proxy lifecycle more intelligently, because we have no control over the HTTP client or server and how they wish to manage connection state. What we have now works, it's just noisy. Fixes #10933	2021-08-05 11:27:35 +02:00
Seth Hoenig	3371214431	core: implement system batch scheduler This PR implements a new "System Batch" scheduler type. Jobs can make use of this new scheduler by setting their type to 'sysbatch'. Like the name implies, sysbatch can be thought of as a hybrid between system and batch jobs - it is for running short lived jobs intended to run on every compatible node in the cluster. As with batch jobs, sysbatch jobs can also be periodic and/or parameterized dispatch jobs. A sysbatch job is considered complete when it has been run on all compatible nodes until reaching a terminal state (success or failed on retries). Feasibility and preemption are governed the same as with system jobs. In this PR, the update stanza is not yet supported. The update stanza is sill limited in functionality for the underlying system scheduler, and is not useful yet for sysbatch jobs. Further work in #4740 will improve support for the update stanza and deployments. Closes #2527	2021-08-03 10:30:47 -04:00
Grant Griffiths	fecbbaee22	CSI ListSnapshots secrets implementation Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>	2021-07-28 11:30:29 -07:00
Seth Hoenig	f71d1755a6	env/aws: update ec2 cpu data using tools/ec2info ``` $ go run . ```	2021-07-22 09:32:46 -05:00
Mahmood Ali	8df9b1fd0f	client: avoid acting on stale data after launch (#10907 ) When the client launches, use a consistent read to fetch its own allocs, but allow stale read afterwards as long as reads don't revert into older state. This change addresses an edge case affecting restarting client. When a client restarts, it may fetch a stale data concerning its allocs: allocs that have completed prior to the client shutdown may still have "run/running" desired/client status, and have the client attempt to re-run again. An alternative approach is to track the indices such that the client set MinQueryIndex on the maximum index the client ever saw, or compare received allocs against locally restored client state. Garbage collection complicates this approach (local knowledge is not complete), and the approach still risks starting "dead" allocations (e.g. the allocation may have been placed when client just restarted and have already been reschuled by the time the client started. This approach here is effective against all kinds of stalness problems with small overhead.	2021-07-20 15:13:28 -04:00
Michael Schurter	efe8ea2c2c	Merge pull request #10849 from benbuzbee/benbuz/fix-destroy Don't treat a failed recover + successful destroy as a successful recover	2021-07-19 10:49:31 -07:00
Seth Hoenig	f9d3fedca2	consul/connect: add missing import statements	2021-07-12 09:28:16 -05:00
Seth Hoenig	5540dfc17f	consul/connect: use join host port Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2021-07-12 09:04:54 -05:00
Seth Hoenig	f80ae067a8	consul/connect: fix bug causing high cpu with multiple connect sidecars in group This PR fixes a bug where the underlying Envoy process of a Connect gateway would consume a full core of CPU if there is more than one sidecar or gateway in a group. The utilization was being caused by Consul injecting an envoy_ready_listener on 127.0.0.1:8443, of which only one of the Envoys would be able to bind to. The others would spin in a hot loop trying to bind the listener. As a workaround, we now specify -address during the Envoy bootstrap config step, which is how Consul maps this ready listener. Because there is already the envoy_admin_listener, and we need to continue supporting running gateways in host networking mode, and in those case we want to use the same port value coming from the service.port field, we now bind the admin listener to the 127.0.0.2 loop-back interface, and the ready listener takes 127.0.0.1. This shouldn't make a difference in the 99.999% use case where envoy is being run in its official docker container. Advanced users can reference ${NOMAD_ENVOY_ADMIN_ADDR_<service>} (as they 'ought to) if needed, as well as the new variable ${NOMAD_ENVOY_READY_ADDR_<service>} for the envoy_ready_listener.	2021-07-09 14:34:44 -05:00
Tim Gross	5937f54fc3	client: interpolate meta blocks with task environment (#10876 ) Adds missing interpolation step to the `meta` blocks when building the task environment. Also fixes incorrect parameter order in the test assertion and adds diagnostics to the test.	2021-07-08 16:03:15 -04:00
Seth Hoenig	e47ea462fb	client: fix logline in group shutdown hook Fixes #10844	2021-07-08 11:14:37 -05:00
Seth Hoenig	c8260c3940	consul: avoid triggering unnecessary sync when removing workload There are bits of logic in callers of RemoveWorkload on group/task cleanup hooks which call RemoveWorkload with the "Canary" version of the workload, in case the alloc is marked as a Canary. This logic triggers an extra sync with Consul, and also doesn't do the intended behavior - for which no special casing is necessary anyway. When the workload is marked for removal, all associated services and checks will be removed regardless of the Canary status, because the service and check IDs do not incorporate the canary-ness in the first place. The only place where canary-ness matters is when updating a workload, where we need to compute the hash of the services and checks to determine whether they have been modified, the Canary flag of which is a part of that. Fixes #10842	2021-07-06 14:08:42 -05:00
Ben Buzbee	e247f8806b	Don't treat a failed recover + successful destroy as a successful recover This code just seems incorrect. As it stands today it reports a successful restore if RecoverTask fails and then DestroyTask succeeds. This can result in a really annoying bug where it then calls RecoverTask again, whereby it will probably get ErrTaskNotFound and call DestroyTask once more. I think the only reason this has not been noticed so far is because most drivers like Docker will return Success, then nomad will call RecoverTask, get an error (not found) and call DestroyTask again, and get a ErrTasksNotFound err.	2021-07-03 01:46:36 +00:00
Seth Hoenig	5aa657c6bd	consul/connect: automatically set consul tls sni name for connect native tasks This PR makes it so that Nomad will automatically set the CONSUL_TLS_SERVER_NAME environment variable for Connect native tasks running in bridge networking mode where Consul has TLS enabled. Because of the use of a unix domain socket for communicating with Consul when in bridge networking mode, the server name is a file name instead of something compatible with the mTLS certificate Consul will authenticate against. "localhost" is by default a compatible name, so Nomad will set the environment variable to that. Fixes #10804	2021-06-28 08:36:53 -05:00
Tim Gross	59c1237fc9	tests: allocrunner CNI tests are Linux-only (#10783 ) Running the `client/allocrunner` tests fail to compile on macOS because the CNI test file depends on the CNI network configurator, which is in a Linux-only file.	2021-06-18 11:34:31 -04:00
Tim Gross	7bd61bbf43	docker: generate /etc/hosts file for bridge network mode (#10766 ) When `network.mode = "bridge"`, we create a pause container in Docker with no networking so that we have a process to hold the network namespace we create in Nomad. The default `/etc/hosts` file of that pause container is then used for all the Docker tasks that share that network namespace. Some applications rely on this file being populated. This changeset generates a `/etc/hosts` file and bind-mounts it to the container when Nomad owns the network, so that the container's hostname has an IP in the file as expected. The hosts file will include the entries added by the Docker driver's `extra_hosts` field. In this changeset, only the Docker task driver will take advantage of this option, as the `exec`/`java` drivers currently copy the host's `/etc/hosts` file and this can't be changed without breaking backwards compatibility. But the fields are available in the task driver protobuf for community task drivers to use if they'd like.	2021-06-16 14:55:22 -04:00
James Rasell	939b23936a	Merge pull request #10744 from hashicorp/b-remove-duplicate-imports chore: remove duplicate import statements	2021-06-11 16:42:34 +02:00
James Rasell	492e308846	tests: remove duplicate import statements.	2021-06-11 09:39:22 +02:00
Mahmood Ali	071c556b3d	tests: deflake CSI forwarding tests This updates `client.Ready()` so it returns once the client node got registered at the servers. Previously, it returns when the fingerprinters first batch completes, wtihout ensuring that the node is stored in the Raft data. The tests may fail later when it with unknown node errors later. `client.Reedy()` seem to be only called in CSI and some client stats now. This class of bug, assuming client is registered without checking, is a source of flakiness elsewhere. Other tests use other mechanisms for checking node readiness, though not consistently.	2021-06-10 21:26:34 -04:00
Nomad Release Bot	4fe52bc753	remove generated files	2021-06-10 08:04:25 -04:00
Nomad Release bot	7cc7389afd	Generate files for 1.1.1 release	2021-06-10 08:04:25 -04:00
Seth Hoenig	c13bf8b917	Merge pull request #10715 from hashicorp/f-cns-attrs consul: probe consul namespace feature before using namespace api	2021-06-07 16:11:17 -05:00
Seth Hoenig	209e2d6d81	consul: pr cleanup namespace probe function signatures	2021-06-07 15:41:01 -05:00
Seth Hoenig	519429a2de	consul: probe consul namespace feature before using namespace api This PR changes Nomad's wrapper around the Consul NamespaceAPI so that it will detect if the Consul Namespaces feature is enabled before making a request to the Namespaces API. Namespaces are not enabled in Consul OSS, and require a suitable license to be used with Consul ENT. Previously Nomad would check for a 404 status code when makeing a request to the Namespaces API to "detect" if Consul OSS was being used. This does not work for Consul ENT with Namespaces disabled, which returns a 500. Now we avoid requesting the namespace API altogether if Consul is detected to be the OSS sku, or if the Namespaces feature is not licensed. Since Consul can be upgraded from OSS to ENT, or a new license applied, we cache the value for 1 minute, refreshing on demand if expired. Fixes https://github.com/hashicorp/nomad-enterprise/issues/575 Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688. This turns out not to be possible due to a chicken-egg situation between bootstrapping the agent and setting up the consul client. Also fun: the Consul fingerprinter creates its own Consul client, because there is no [currently] no way to pass the agent's client through the fingerprint factory.	2021-06-07 12:19:25 -05:00
Mahmood Ali	2c73552b4d	pool: track usage of incoming streams (#10710 ) Track usage of incoming streams on a connection. Connections without reference counts get marked as unused and reaped in a periodic job. This fixes a bug where `alloc exec` and `alloc fs` sessions get terminated unexpectedly. Previously, when a client heartbeats switches between servers, the pool connection reaper eventually identifies the connection as unused and closes it even if it has an active exec/fs sessions. Fixes #10579	2021-06-07 10:22:37 -04:00
Seth Hoenig	d026ff1f66	consul/connect: add support for connect mesh gateways This PR implements first-class support for Nomad running Consul Connect Mesh Gateways. Mesh gateways enable services in the Connect mesh to make cross-DC connections via gateways, where each datacenter may not have full node interconnectivity. Consul docs with more information: https://www.consul.io/docs/connect/gateways/mesh-gateway The following group level service block can be used to establish a Connect mesh gateway. service { connect { gateway { mesh { // no configuration } } } } Services can make use of a mesh gateway by configuring so in their upstream blocks, e.g. service { connect { sidecar_service { proxy { upstreams { destination_name = "<service>" local_bind_port = <port> datacenter = "<datacenter>" mesh_gateway { mode = "<mode>" } } } } } } Typical use of a mesh gateway is to create a bridge between datacenters. A mesh gateway should then be configured with a service port that is mapped from a host_network configured on a WAN interface in Nomad agent config, e.g. client { host_network "public" { interface = "eth1" } } Create a port mapping in the group.network block for use by the mesh gateway service from the public host_network, e.g. network { mode = "bridge" port "mesh_wan" { host_network = "public" } } Use this port label for the service.port of the mesh gateway, e.g. service { name = "mesh-gateway" port = "mesh_wan" connect { gateway { mesh {} } } } Currently Envoy is the only supported gateway implementation in Consul. By default Nomad client will run the latest official Envoy docker image supported by the local Consul agent. The Envoy task can be customized by setting `meta.connect.gateway_image` in agent config or by setting the `connect.sidecar_task` block. Gateways require Consul 1.8.0+, enforced by the Nomad scheduler. Closes #9446	2021-06-04 08:24:49 -05:00

1 2 3 4 5 ...

4433 commits