open-nomad

Author	SHA1	Message	Date
Michael Schurter	3b57df33e3	client: fix data races in config handling (#14139 ) Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners); both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked. This change takes the following approach to safely handling configs in the client: 1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex. 2. Since the mutex only protects the config pointer itself and not fields inside the Config struct, all config mutation must be done on a copy of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates. 3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion. 4. To facilitate deep copying I made an internally backward incompatible API change: our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"	2022-08-18 16:32:04 -07:00
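The copy-on-write config scheme in points 1 to 3, together with the nil-preserving copy rule in point 4, can be sketched in Go. This is a minimal illustration, not Nomad's code: the reduced `Config` type and the `copySlice`, `copyMap`, `GetConfig`, and `UpdateConfig` names are hypothetical; only `Client.config` and `Client.configLock` come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// copySlice keeps nil as nil, but turns a 0-element slice into a new
// 0-element slice instead of "downgrading" it to nil.
func copySlice(s []string) []string {
	if s == nil {
		return nil
	}
	out := make([]string, len(s))
	copy(out, s)
	return out
}

// copyMap applies the same nil vs. empty rule to maps.
func copyMap(m map[string]string) map[string]string {
	if m == nil {
		return nil
	}
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// Config is a hypothetical, heavily reduced stand-in for the client config.
type Config struct {
	Servers []string
	Meta    map[string]string
}

// Copy returns a deep copy that can be mutated without racing readers.
func (c *Config) Copy() *Config {
	if c == nil {
		return nil
	}
	return &Config{Servers: copySlice(c.Servers), Meta: copyMap(c.Meta)}
}

// Client guards only the config pointer, not the fields behind it.
type Client struct {
	configLock sync.Mutex
	config     *Config
}

// GetConfig returns the current pointer; callers must treat it as read-only.
func (c *Client) GetConfig() *Config {
	c.configLock.Lock()
	defer c.configLock.Unlock()
	return c.config
}

// UpdateConfig mutates a copy and then swaps the pointer under the lock, so
// goroutines holding the old pointer never observe a partial update.
func (c *Client) UpdateConfig(cb func(*Config)) *Config {
	c.configLock.Lock()
	defer c.configLock.Unlock()
	newConf := c.config.Copy()
	cb(newConf)
	c.config = newConf
	return newConf
}

func main() {
	c := &Client{config: &Config{Servers: []string{}, Meta: map[string]string{"a": "1"}}}
	old := c.GetConfig()
	c.UpdateConfig(func(cfg *Config) { cfg.Meta["a"] = "2" })
	fmt.Println(old.Meta["a"], c.GetConfig().Meta["a"]) // 1 2
}
```

The trade-off matches the commit: readers such as alloc runners keep a stable snapshot, at the cost of not seeing later updates.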
Michael Schurter	3e50f72fad	core: merge reserved_ports into host_networks (#13651 ) Fixes #13505. This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports). As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility: Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly. Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me. So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.	2022-07-12 14:40:25 -07:00
Karan Sharma	9426be01fc	feat: Warn if bootstrap_expect is even number (#12961 )	2022-06-06 15:22:59 +02:00
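The reason an even `bootstrap_expect` deserves a warning is Raft quorum arithmetic: a quorum is a strict majority, so adding a fourth server to three raises the quorum without raising fault tolerance. A small sketch of that arithmetic follows; the helper names and warning text are hypothetical, not Nomad's actual message.

```go
package main

import "fmt"

// quorum is the number of servers that must agree: a strict majority.
func quorum(servers int) int { return servers/2 + 1 }

// faultTolerance is how many servers can fail while a quorum survives.
func faultTolerance(servers int) int { return servers - quorum(servers) }

// evenBootstrapWarning returns an illustrative warning for even values.
func evenBootstrapWarning(expect int) string {
	if expect > 0 && expect%2 == 0 {
		return fmt.Sprintf("bootstrap_expect = %d is even: it tolerates %d failure(s), the same as %d servers",
			expect, faultTolerance(expect), expect-1)
	}
	return ""
}

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("servers=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
	fmt.Println(evenBootstrapWarning(4))
}
```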
Michael Schurter	2965dc6a1a	artifact: fix numerous go-getter security issues Fix numerous go-getter security issues: - Add timeouts to http, git, and hg operations to prevent DoS - Add size limit to http to prevent resource exhaustion - Disable following symlinks in both artifacts and `job run` - Stop performing initial HEAD request to avoid file corruption on retries and DoS opportunities. Approach Since Nomad has no ability to differentiate a DoS-via-large-artifact vs a legitimate workload, all of the new limits are configurable at the client agent level. The max size of HTTP downloads is also exposed as a node attribute so that if some workloads have large artifacts they can specify a high limit in their jobspecs. In the future all of this plumbing could be extended to enable/disable specific getters or artifact downloading entirely on a per-node basis.	2022-05-24 16:29:39 -04:00
Will Jordan	d515e5c3b0	Don't buffer json logs on agent startup (#13076 ) There's no reason to buffer json logs on agent startup since logs in this format already aren't reordered.	2022-05-19 15:40:30 -04:00
James Rasell	636b647a30	agent: fix panic when logging about protocol version config use. (#12962 ) The log line comes before the agent logger has been set up, so we need to use the UI logging to avoid a panic.	2022-05-13 09:28:43 +02:00
Yoan Blanc	5e8254beda	feat: remove dependency on consul/lib Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2022-04-09 13:22:44 +02:00
James Rasell	431c153cd9	client: add Nomad template service functionality to runner. (#12458 ) This change modifies the template task runner to utilise the new consul-template which includes Nomad service lookup template funcs. In order to provide security and auth to consul-template, we use a custom HTTP dialer which is passed to consul-template when setting up the runner. This method follows the Vault implementation. Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-04-06 19:17:05 +02:00
Michael Schurter	7494a0c4fd	core: remove all traces of unused protocol version Nomad inherited protocol version numbering configuration from Consul and Serf, but unlike those projects Nomad has never used it. Nomad's `protocol_version` has always been `1`. While the code is effectively unused and therefore poses no runtime risks to leave, I felt like removing it was best because: 1. Nomad's RPC subsystem has been able to evolve extensively without needing to increment the version number. 2. Nomad's HTTP API has evolved extensively without incrementing `API{Major,Minor}Version`. If we want to version the HTTP API in the future, I doubt this is the mechanism we would choose. 3. The presence of the `server.protocol_version` configuration parameter is confusing since `server.raft_protocol` is an important parameter for operators to consider. Even more confusing is that there is a distinct Serf protocol version which is included in `nomad server members` output under the heading `Protocol`. `raft_protocol` is the only protocol version relevant to Nomad developers and operators. The other protocol versions are either dead code or have never changed (Serf). 4. If we were to need to version the RPC, HTTP API, or Serf protocols, I don't think these configuration parameters and variables are the best choice. If we come to that point we should choose a versioning scheme based on the use case and modern best practices -- not this 6+ year old dead code.	2022-02-18 16:12:36 -08:00
Thomas Lefebvre	3b57f3af9d	Add config command and config validate subcommand to nomad CLI (#9198 )	2022-02-08 16:52:35 -05:00
James Rasell	82b168bf34	Merge pull request #11403 from hashicorp/f-gh-11059 agent/docs: add better clarification when top-level data dir needs setting	2022-01-13 16:41:35 +01:00
Luiz Aoqui	d48e50da9a	Fix log level parsing from lines that include a timestamp (#11838 )	2022-01-13 09:56:35 -05:00
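A line such as `2021-10-20T14:38:51.181-0400 [INFO] agent: ...` starts with a timestamp, so a parser that only inspects the first bracketed token at position 0 misses the level. The sketch below scans for the first bracketed token that is a known level; it is a hypothetical illustration of the idea, not the code from #11838.

```go
package main

import (
	"fmt"
	"strings"
)

// knownLevels lists the hclog-style levels this sketch recognizes.
var knownLevels = map[string]bool{
	"TRACE": true, "DEBUG": true, "INFO": true, "WARN": true, "ERROR": true,
}

// parseLevel returns the first bracketed token in line that is a known level,
// skipping any prefix such as a timestamp. It returns "" if none is found.
func parseLevel(line string) string {
	for {
		start := strings.IndexByte(line, '[')
		if start < 0 {
			return ""
		}
		rest := line[start+1:]
		end := strings.IndexByte(rest, ']')
		if end < 0 {
			return ""
		}
		if tok := rest[:end]; knownLevels[tok] {
			return tok
		}
		line = rest[end+1:]
	}
}

func main() {
	fmt.Println(parseLevel("2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin"))
}
```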
Michael Schurter	e6eff95769	agent: validate reserved_ports are valid Goal is to fix at least one of the causes that can cause a node to be ineligible to receive work: https://github.com/hashicorp/nomad/issues/9506#issuecomment-1002880600	2022-01-12 14:21:47 -08:00
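Validating `reserved_ports` up front means a bad value is rejected at agent startup instead of surfacing later as an ineligible node. The sketch assumes the comma-separated spec format with ranges (for example `"22,80,8000-8010"`) and treats 1 to 65535 as the valid port range; `parsePortSpec` is a hypothetical helper, not Nomad's parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortSpec parses a spec such as "22,80,8000-8010" into individual
// ports, rejecting non-numeric values, ports outside 1-65535, and ranges
// whose start exceeds their end.
func parsePortSpec(spec string) ([]int, error) {
	var ports []int
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		lo, hi := part, part
		if i := strings.IndexByte(part, '-'); i >= 0 {
			lo, hi = part[:i], part[i+1:]
		}
		start, err := strconv.Atoi(strings.TrimSpace(lo))
		if err != nil {
			return nil, fmt.Errorf("invalid port %q: %w", part, err)
		}
		end, err := strconv.Atoi(strings.TrimSpace(hi))
		if err != nil {
			return nil, fmt.Errorf("invalid port %q: %w", part, err)
		}
		if start < 1 || end > 65535 || start > end {
			return nil, fmt.Errorf("invalid port range %q", part)
		}
		for p := start; p <= end; p++ {
			ports = append(ports, p)
		}
	}
	return ports, nil
}

func main() {
	ports, err := parsePortSpec("22,80,8000-8002")
	fmt.Println(ports, err) // [22 80 8000 8001 8002] <nil>
}
```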
Kevin Schoonover	5d9a506bc0	agent: support multiple http address in addresses.http (#11582 )	2022-01-03 09:33:53 -05:00
James Rasell	4c92a77aac	agent: clarify error info when data dir needs setting.	2021-10-28 15:05:56 +02:00
Mahmood Ali	cdddd64a42	logging: Log the cause behind agent startup failure (#11353 ) Log the failure error when the agent fails to start. Previously, the agent startup failure error would be emitted to the command UI but not logged. So it doesn't get emitted to syslog or `log_file` if they are set, and it makes debugging much harder. Also, logging the error again before exit makes the error more visible: previously, the operator needed to scroll to the top to find the error. On a sample failure, the output will look like: ``` ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation. ==> Loaded configuration from sample-configs/config-bad ==> Starting Nomad agent... ==> Error starting agent: setting up server node ID failed: mkdir /path-without-permission: read-only file system 2021-10-20T14:38:51.179-0400 [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/path-without-permission/plugins 2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins 2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins 2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0 2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0 2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0 2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0 2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0 2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0 2021-10-20T14:38:51.181-0400 [ERROR] agent: error starting agent: error="setting up server node ID failed: mkdir /path-without-permission: read-only file system" ``` This change adds the final `ERROR` message. It's easy to miss the `==> Error starting agent` above.	2021-10-27 10:41:17 -07:00
Michael Schurter	59fda1894e	Merge pull request #11167 from a-zagaevskiy/master Support configurable dynamic port range	2021-10-13 16:47:38 -07:00
Michael Schurter	e14cd34392	client: improve errors & tests for dynamic ports	2021-10-13 16:25:25 -07:00
Luiz Aoqui	3e0bad5a41	wrap `log` messages with `hclog` (#11291 )	2021-10-12 14:38:44 -04:00
Aleksandr Zagaevskiy	d92666e6a7	fixup! Support configurable dynamic port range	2021-10-11 14:13:59 +03:00
Aleksandr Zagaevskiy	ebb87e65fe	Support configurable dynamic port range	2021-09-10 11:52:47 +03:00
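A configurable dynamic port range needs a sanity check on the two bounds before the scheduler relies on them. The sketch assumes the range is set through a minimum/maximum pair such as `min_dynamic_port`/`max_dynamic_port` and that an inclusive range with equal bounds is allowed; `validateDynamicPortRange` and the example values are hypothetical.

```go
package main

import "fmt"

// validateDynamicPortRange checks a hypothetical min/max dynamic port pair:
// both bounds must be valid ports and the minimum must not exceed the maximum.
func validateDynamicPortRange(minPort, maxPort int) error {
	if minPort < 1 || minPort > 65535 {
		return fmt.Errorf("min_dynamic_port %d must be between 1 and 65535", minPort)
	}
	if maxPort < 1 || maxPort > 65535 {
		return fmt.Errorf("max_dynamic_port %d must be between 1 and 65535", maxPort)
	}
	if minPort > maxPort {
		return fmt.Errorf("min_dynamic_port %d must be less than or equal to max_dynamic_port %d", minPort, maxPort)
	}
	return nil
}

func main() {
	fmt.Println(validateDynamicPortRange(20000, 32000))
	fmt.Println(validateDynamicPortRange(32000, 20000))
}
```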
James Rasell	b6813f1221	chore: fix incorrect docstring formatting.	2021-08-30 11:08:12 +02:00
Drew Bailey	74836b95b2	configuration and oss components for licensing (#10216 ) * configuration and oss components for licensing * vendor sync	2021-03-23 09:08:14 -04:00
Dennis Schön	c376545f49	don't prefix json logging	2021-01-20 09:09:31 -05:00
Seth Hoenig	0091325721	command: give flag-helpers a better name	2020-12-14 10:07:27 -06:00
Kris Hicks	0a3a748053	Add gosimple linter (#9590 )	2020-12-09 11:05:18 -08:00
James Rasell	e0734bed77	agent: fix enterprise config overlay merging.	2020-10-14 09:35:16 +02:00
Chris Baker	1d35578bed	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Chris Baker	7f701fddd0	updated docs and validation to further prohibit null chars in region, datacenter, and job name	2020-10-05 18:01:50 +00:00
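Rejecting null characters in identifiers such as region, datacenter, and job name reduces to one check per field. `validateNoNullChars` is a hypothetical helper sketched with the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// validateNoNullChars rejects values containing a NUL ('\x00') character.
func validateNoNullChars(field, value string) error {
	if strings.ContainsRune(value, '\x00') {
		return fmt.Errorf("%s must not contain null characters", field)
	}
	return nil
}

func main() {
	fmt.Println(validateNoNullChars("region", "global"))
	fmt.Println(validateNoNullChars("job name", "web\x00api"))
}
```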
Mahmood Ali	72ac33e4e7	Refactor setupLoggers	2020-07-17 11:05:57 -04:00
Mahmood Ali	9366181be6	always check `default_scheduler_config` config Also, avoid early return on validation to avoid masking some validation bugs in dev setup.	2020-05-14 14:16:12 -04:00
Mahmood Ali	2c963885b0	handle upgrade path and defaults Ensure that `""` Scheduler Algorithm gets explicitly set to binpack on upgrades or on API handling when the user omits the value. The scheduler already treats `""` value as binpack. This PR merely ensures that the operator API returns the effective value.	2020-05-09 12:34:08 -04:00
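Reporting the effective value means mapping an unset algorithm to the one the scheduler already uses for it. The constant names below mirror the "binpack"/"spread" values but are defined locally for this sketch; `effectiveAlgorithm` is hypothetical.

```go
package main

import "fmt"

// Locally defined constants for this sketch.
const (
	SchedulerAlgorithmBinpack = "binpack"
	SchedulerAlgorithmSpread  = "spread"
)

// effectiveAlgorithm returns the algorithm the scheduler actually applies:
// an unset ("") value behaves as binpack, so report binpack explicitly.
func effectiveAlgorithm(configured string) string {
	if configured == "" {
		return SchedulerAlgorithmBinpack
	}
	return configured
}

func main() {
	fmt.Println(effectiveAlgorithm(""), effectiveAlgorithm(SchedulerAlgorithmSpread))
}
```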
Mahmood Ali	b9e3cde865	tests and some clean up	2020-05-01 13:13:30 -04:00
Mahmood Ali	b78680eee7	agent: shutdown agent http server last Shutdown http server last, after nomad client/server components terminate. Before this change, if the agent is taking an unexpectedly long time to shut down, the operator cannot query the http server directly: they cannot access agent specific http endpoints and need to query another agent about the troublesome agent. Unexpectedly long shutdown can happen in normal cases, e.g. a client might hang if one of the allocs it is running has a long shutdown_delay. Here, we switch to ensuring that the http server is shut down last. I believe this doesn't require extra care in agent shutting down logic while operators may be able to submit write http requests. We already need to cope with operators submitting these http requests to another agent or by servers updating the client allocations.	2020-04-13 10:50:07 -04:00
Drew Bailey	b09abef332	Audit config, seams for enterprise audit features allow oss to parse sink duration clean up audit sink parsing ent eventer config reload fix typo SetEnabled to eventer interface client acl test rm dead code fix failing test	2020-03-23 13:47:42 -04:00
James Rasell	e3d14cc634	cli: fix indentation issue with -dev-connect agent help output.	2020-03-18 12:25:20 +01:00
Seth Hoenig	076cb4754e	agent: re-enable the server in dev mode	2020-01-31 19:04:19 -06:00
Seth Hoenig	9df33f622f	nomad: proxy requests for Service Identity tokens between Clients and Consul Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run & connect, etc. This changeset enables the Nomad Server to receive RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect using Tasks. Those tokens are then relayed back to the requesting Client, which then injects the tokens in the secrets directory of the Task.	2020-01-31 19:03:53 -06:00
Seth Hoenig	f030a22c7c	command, docs: create and document consul token configuration for connect acls (gh-6716) This change provides an initial pass at setting up the configuration necessary to enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert` commands (similar to Vault tokens). These values are not actually used yet in this changeset.	2020-01-31 19:02:53 -06:00
Charlie Voiselle	835831a3d8	Added service wrapper code (#6220 ) This is the basic code to add the Windows Service Manager hooks to Nomad. Includes vendoring golang.org/x/sys/windows/svc and added Docs: * guide for installing as a windows service. * configuration for logging to file from PR #6429	2019-11-11 15:16:07 -05:00
Drew Bailey	f46fd5b3e1	only look up rpchandler for node if we have nodeid fix some comments and nomad monitor -h output	2019-11-05 09:51:51 -05:00
Drew Bailey	786989dbe3	New monitor pkg for shared monitor functionality Adds new package that can be used by client and server RPC endpoints to facilitate monitoring based off of a logger clean up old code small comment about write rm old comment about minsize rename to Monitor Removes connection logic from monitor command Keep connection logic in endpoints, use a channel to send results from monitoring use new multisink logger and interfaces small test for dropped messages update go-hclogger and update sink/intercept logger interfaces	2019-11-05 09:51:49 -05:00
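Sending monitor output over a channel raises the question of what happens when the consumer falls behind. One common answer, sketched here, is a non-blocking send that drops and counts lines rather than stalling the logger. The `monitor` type and its methods are hypothetical; this is not the code in Nomad's monitor package.

```go
package main

import (
	"fmt"
	"sync"
)

// monitor is a hypothetical log sink that streams lines to a consumer over
// a buffered channel. Write never blocks the logger: when the buffer is
// full, the line is dropped and counted instead.
type monitor struct {
	logCh   chan []byte
	mu      sync.Mutex
	dropped int
}

func newMonitor(bufSize int) *monitor {
	return &monitor{logCh: make(chan []byte, bufSize)}
}

// Write has the io.Writer signature so the monitor can act as a log sink.
func (m *monitor) Write(p []byte) (int, error) {
	// Copy p: a logger may reuse its buffer after Write returns.
	line := make([]byte, len(p))
	copy(line, p)
	select {
	case m.logCh <- line:
	default:
		m.mu.Lock()
		m.dropped++
		m.mu.Unlock()
	}
	return len(p), nil
}

// Dropped reports how many lines were discarded because the consumer lagged.
func (m *monitor) Dropped() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.dropped
}

func main() {
	m := newMonitor(1)
	m.Write([]byte("first\n"))
	m.Write([]byte("second\n")) // buffer is full, so this line is dropped
	fmt.Printf("%q dropped=%d\n", <-m.logCh, m.Dropped())
}
```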
Drew Bailey	976c43157c	remove log_writer prefix output with proper spacing update gzip handler, adjust first byte flow to allow gzip handler bypass wip, first stab at wiring up rpc endpoint	2019-11-05 09:51:48 -05:00
Drew Bailey	0de94466b2	Display error when remote side ended monitor multisink logger remove usage of logwriter	2019-11-05 09:51:48 -05:00
Drew Bailey	b0184e2032	Adds AgentMonitor Endpoint AgentMonitor is an endpoint to stream logs for a given agent. It allows callers to pass in a supplied log level, which may be different than the agents config allowing for temporary debugging with lower log levels. Pass in logWriter when setting up Agent	2019-11-05 09:51:46 -05:00
Mahmood Ali	3f6e50617a	Merge pull request #6047 from hashicorp/b-ignore-server-if-disabled Only warn against BootstrapExpect set in CLI flag	2019-10-29 10:55:44 -04:00
Danielle Lancashire	9eaac48f25	agent: Refactor log setup to support log-to-file	2019-10-07 14:42:32 +02:00
Lang Martin	fb41dd86ba	default raft protocol v2	2019-09-24 14:37:55 -04:00
Mahmood Ali	6d73ca0cfb	Merge pull request #6250 from hashicorp/f-raft-protocol-v3 Update default raft protocol to version 3	2019-09-04 09:34:41 -04:00
Tim Gross	b79021adfd	cli: split -dev and -dev-connect flags	2019-08-30 09:33:30 -04:00


211 commits