open-nomad

Commit Graph

Author	SHA1	Message	Date
Seth Hoenig	1ca5ea3240	env_aws: run ec2info to update ec2 info Use `tools/ec2info` to update the generated table of instance types. `$ go run .`	2020-12-02 09:35:03 -06:00
Seth Hoenig	3b2b083cbf	Merge pull request #9487 from hashicorp/f-connect-sidecar-concurrency consul/connect: default envoy concurrency to 1	2020-12-01 15:51:41 -06:00
Seth Hoenig	bf857684d1	consul/connect: default envoy concurrency to 1 Previously, every Envoy Connect sidecar would spawn as many worker threads as logical CPU cores. That is Envoy's default behavior when `--concurrency` is not explicitly set. Nomad now sets the concurrency flag to 1, which is sensible for the default cpu = 250 Mhz resources allocated for sidecar proxies. The concurrency value can be configured in Client configuration by setting `meta.connect.proxy_concurrency`. Closes #9341	2020-12-01 13:12:45 -06:00
Michael Schurter	ea0e1789f4	Merge pull request #9435 from hashicorp/f-allocupdate-timer client: always wait 200ms before sending updates	2020-12-01 08:45:17 -08:00
Drew Bailey	9adca240f8	Event Stream: Track ACL changes, unsubscribe on invalidating changes (#9447 ) * upsertaclpolicies * delete acl policies msgtype * upsert acl policies msgtype * delete acl tokens msgtype * acl bootstrap msgtype wip unsubscribe on token delete test that subscriptions are closed after an ACL token has been deleted Start writing policyupdated test * update test to use before/after policy * add SubscribeWithACLCheck to run acl checks on subscribe * update rpc endpoint to use broker acl check * Add and use subscriptions.closeSubscriptionFunc This fixes the issue of not being able to defer unlocking the mutex on the event broker in the for loop. handle acl policy updates * rpc endpoint test for terminating acl change * add comments Co-authored-by: Kris Hicks <khicks@hashicorp.com>	2020-12-01 11:11:34 -05:00
Benjamin Buzbee	e0acbbfcc6	Fix RPC retry logic in nomad client's rpc.go for blocking queries (#9266 )	2020-11-30 15:11:10 -05:00
Roman Vynar	b957f87cd7	Add compute/zone to Azure fingerprinting	2020-11-26 13:26:51 +02:00
Michael Schurter	5ec065b180	client: always wait 200ms before sending updates Always wait 200ms before calling the Node.UpdateAlloc RPC to send allocation updates to servers. Prior to this change we only reset the update ticker when an error was encountered. This meant the 200ms ticker was running while the RPC was being performed. If the RPC was slow due to network latency or server load and took >=200ms, the ticker would tick during the RPC. Then on the next loop only the select would randomly choose between the two viable cases: receive an update or fire the RPC again. If the RPC case won it would immediately loop again due to there being no updates to send. When the update chan receive is selected a single update is added to the slice. The odds are then 50/50 that the subsequent loop will send the single update instead of receiving any more updates. This could cause a couple of problems: 1. Since only a small number of updates are sent, the chan buffer may fill, applying backpressure, and slowing down other client operations. 2. The small number of updates sent may already be stale and not represent the current state of the allocation locally. A risk here is that it's hard to reason about how this will interact with the 50ms batches on servers when the servers under load. A further improvement would be to completely remove the alloc update chan and instead use a mutex to build a map of alloc updates. I wanted to test the lowest risk possible change on loaded servers first before making more drastic changes.	2020-11-25 11:36:51 -08:00
Michael Schurter	15f2b8fe7c	client: skip broken test and fix assertion	2020-11-18 10:01:02 -08:00
Michael Schurter	ff91bba70e	client: fix interpolation in template source While Nomad v0.12.8 fixed `NOMAD_{ALLOC,TASK,SECRETS}_DIR` use in `template.destination`, interpolating these variables in `template.source` caused a path escape error. Why not apply the destination fix to source? The destination fix forces destination to always be relative to the task directory. This makes sense for the destination as a destination outside the task directory would be unreachable by the task. There's no reason to ever render a template outside the task directory. (Using `..` does allow destinations to escape the task directory if `template.disable_file_sandbox = true`. That's just awkward and unsafe enough I hope no one uses it.) There is a reason to source a template outside a task directory. At least if there weren't then I can't think of why we implemented `template.disable_file_sandbox`. So v0.12.8 left the behavior of `template.source` the more straightforward "Interpolate and validate." However, since outside of `raw_exec` every other driver uses absolute paths for `NOMAD__DIR` interpolation, this means those variables are unusable unless `disable_file_sandbox` is set. The Fix* The variables are now interpolated as relative paths only for the purpose of rendering templates. This is an unfortunate special case, but reflects the fact that the templates view of the filesystem is completely different (unconstrainted) vs the task's view (chrooted). Arguably the values of these variables should be context-specific. I think it's more reasonable to think of the "hack" as templating running uncontainerized than that giving templates different paths is a hack. TODO - [ ] E2E tests - [ ] Job validation may still be broken and prevent my fix from working? raw_exec `raw_exec` is actually broken _a different way_ as exercised by tests in this commit. I think we should probably remove these tests and fix that in a followup PR/release, but I wanted to leave them in for the initial review and discussion. Since non-containerized source paths are broken anyway, perhaps there's another solution to this entire problem I'm overlooking?	2020-11-17 22:03:04 -08:00
Wim	4e37897dd9	Use correct interface for netStatus CNI plugins can return multiple interfaces, eg the bridge plugin. We need the interface with the sandbox.	2020-11-14 22:29:30 +01:00
Seth Hoenig	4cc3c01d5b	Merge pull request #9352 from hashicorp/f-artifact-headers jobspec: add support for headers in artifact stanza	2020-11-13 14:04:27 -06:00
Seth Hoenig	bb8a5816a0	jobspec: add support for headers in artifact stanza This PR adds the ability to set HTTP headers when downloading an artifact from an `http` or `https` resource. The implementation in `go-getter` is such that a new `HTTPGetter` must be created for each artifact that sets headers (as opposed to conveniently setting headers per-request). This PR maintains the memoization of the default Getter objects, creating new ones only for artifacts where headers are set. Closes #9306	2020-11-13 12:03:54 -06:00
Jasmine Dahilig	d6110cbed4	lifecycle: add poststop hook (#8194 )	2020-11-12 08:01:42 -08:00
Chris Baker	48b1674335	Merge pull request #9311 from jeromegn/allow-empty-devices Don't ignore nil devices in plugin fingerprint	2020-11-11 13:54:03 -06:00
Tim Gross	60874ebe25	csi: Postrun hook should not change mode (#9323 ) The unpublish workflow requires that we know the mode (RW vs RO) if we want to unpublish the node. Update the hook and the Unpublish RPC so that we mark the claim for release in a new state but leave the mode alone. This fixes a bug where RO claims were failing node unpublish. The core job GC doesn't know the mode, but we don't need it for that workflow, so add a mode specifically for GC; the volumewatcher uses this as a sentinel to check whether claims (with their specific RW vs RO modes) need to be claimed.	2020-11-11 13:06:30 -05:00
Jerome Gravel-Niquet	d1f1dbd203	Don't ignore nil devices in plugin fingerprint Even if a plugin sends back an empty `[]device.DeviceGroup`, it's transformed to `nil` during the RPC. Our custom device plugin is returning empty `FingerprintResponse.Devices` very often. Our temporary fix is to send a dummy `DeviceGroup` if the slice is empty. This has the effect of never triggering the "first fingerprint" and therefore timing out after 50s. In turn, this made our node exceed its hearbeat grace period when restarting it, revoking all vault tokens for its allocations, causing a restart of all our allocations because the token couldn't be renewed. Removing the logic for `f.Devices == nil` does not appear to affect the functionality of the function.	2020-11-10 16:04:22 -05:00
Seth Hoenig	9960f96446	client/fingerprint: detect unloaded dynamic bridge kernel module In Nomad v0.12.0, the client added additional fingerprinting around the presense of the bridge kernel module. The fingerprinter only checked in `/proc/modules` which is a list of loaded modules. In some cases, the bridge kernel module is builtin rather than dynamically loaded. The fix for that case is in #8721. However we were still missing the case where the bridge module is dynamically loaded, but not yet loaded during the startup of the Nomad agent. In this case the fingerprinter would believe the bridge module was unavailable when really it gets loaded on demand. This PR now has the fingerprinter scan the kernel module dependency file, which will contain an entry for the bridge module even if it is not yet loaded. In summary, the client now looks for the bridge kernel module in - /proc/modules - /lib/modules/<kernel>/modules.builtin - /lib/modules/<kernel>/modules.dep Closes #8423	2020-11-09 13:56:14 -06:00
Nick Ethier	04f5c4ee5f	ar/groupservice: remove drivernetwork (#9233 ) * ar/groupservice: remove drivernetwork * consul: allow host address_mode to accept raw port numbers * consul: fix logic for blank address	2020-11-05 15:00:22 -05:00
Stefan Richter	484ef8a1e8	Add NOMAD_JOB_ID and NOMAD_JOB_PAERENT_ID env variables (#8967 ) Beforehand tasks and field replacements did not have access to the unique ID of their job or its parent. This adds this information as new environment variables.	2020-10-23 10:49:58 -04:00
Tim Gross	1fb1c9c5d4	artifact/template: make destination path absolute inside taskdir (#9149 ) Prior to Nomad 0.12.5, you could use `${NOMAD_SECRETS_DIR}/mysecret.txt` as the `artifact.destination` and `template.destination` because we would always append the destination to the task working directory. In the recent security patch we treated the `destination` absolute path as valid if it didn't escape the working directory, but this breaks backwards compatibility and interpolation of `destination` fields. This changeset partially reverts the behavior so that we always append the destination, but we also perform the escape check on that new destination after interpolation so the security hole is closed. Also, ConsulTemplate test should exercise interpolation	2020-10-22 15:47:49 -04:00
Tim Gross	6df36e4cdb	artifact/template: prevent file sandbox escapes Ensure that the client honors the client configuration for the `template.disable_file_sandbox` field when validating the jobspec's `template.source` parameter, and not just with consul-template's own `file` function. Prevent interpolated `template.source`, `template.destination`, and `artifact.destination` fields from escaping file sandbox.	2020-10-21 14:34:12 -04:00
Alexander Shtuchkin	90fd8bb85f	Implement 'batch mode' for persisting allocations on the client. (#9093 ) Fixes #9047, see problem details there. As a solution, we use BoltDB's 'Batch' mode that combines multiple parallel writes into small number of transactions. See https://github.com/boltdb/bolt#batch-read-write-transactions for more information.	2020-10-20 16:15:37 -04:00
Seth Hoenig	9cdb98f0e4	client: add tests around meta and canarymeta interpolation Expanding on #9096, add tests for making sure service.Meta and service.CanaryMeta are interpolated from environment variables.	2020-10-20 12:50:29 -05:00
Jorge Marey	8a0ef606a3	Add interpolation on service canarymeta	2020-10-20 12:45:36 -05:00
Drew Bailey	6c788fdccd	Events/msgtype cleanup (#9117 ) * use msgtype in upsert node adds message type to signature for upsert node, update tests, remove placeholder method * UpsertAllocs msg type test setup * use upsertallocs with msg type in signature update test usage of delete node delete placeholder msgtype method * add msgtype to upsert evals signature, update test call sites with test setup msg type handle snapshot upsert eval outside of FSM and ignore eval event remove placeholder upsertevalsmsgtype handle job plan rpc and prevent event creation for plan msgtype cleanup upsertnodeevents updatenodedrain msgtype msg type 0 is a node registration event, so set the default to the ignore type * fix named import * fix signature ordering on upsertnode to match	2020-10-19 09:30:15 -04:00
Nick Ethier	4903e5b114	Consul with CNI and host_network addresses (#9095 ) * consul: advertise cni and multi host interface addresses * structs: add service/check address_mode validation * ar/groupservices: fetch networkstatus at hook runtime * ar/groupservice: nil check network status getter before calling * consul: comment network status can be nil	2020-10-15 15:32:21 -04:00
Michael Schurter	9c3972937b	s/0.13/1.0/g 1.0 here we come!	2020-10-14 15:17:47 -07:00
Chris Baker	1d35578bed	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Seth Hoenig	ed13e5723f	consul/connect: dynamically select envoy sidecar at runtime As newer versions of Consul are released, the minimum version of Envoy it supports as a sidecar proxy also gets bumped. Starting with the upcoming Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the default implementation of Connect sidecar proxy. This PR introduces a change such that each Nomad Client will query its local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545) and then launch the Connect sidecar proxy task using the latest supported version of Envoy. If the `SupportedProxies` API component is not available from Consul, Nomad will fallback to the old version of Envoy supported by old versions of Consul. Setting the meta configuration option `meta.connect.sidecar_image` or setting the `connect.sidecar_task` stanza will take precedence as is the current behavior for sidecar proxies. Setting the meta configuration option `meta.connect.gateway_image` will take precedence as is the current behavior for connect gateways. `meta.connect.sidecar_image` and `meta.connect.gateway_image` may make use of the special `${NOMAD_envoy_version}` variable interpolation, which resolves to the newest version of Envoy supported by the Consul agent. Addresses #8585 #7665	2020-10-13 09:14:12 -05:00
Seth Hoenig	5a3748ca82	Merge pull request #9038 from hashicorp/f-ec2-table env_aws: get ec2 cpu perf data from AWS API	2020-10-12 18:55:33 -05:00
Nick Ethier	d45be0b5a6	client: add NetworkStatus to Allocation (#8657 )	2020-10-12 13:43:04 -04:00
Yoan Blanc	891accb89a	use allow/deny instead of the colored alternatives (#9019 ) Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-10-12 08:47:05 -04:00
Tim Gross	b5abf4ec9d	csi: fix incorrect comment on csi_hook context lifetime	2020-10-09 11:03:51 -04:00
Seth Hoenig	9b555fe6d5	env_aws: fixup test case node attr detection	2020-10-08 12:59:07 -05:00
Seth Hoenig	e693d15a5b	env_aws: get ec2 cpu perf data from AWS API Previously, Nomad was using a hand-made lookup table for looking up EC2 CPU performance characteristics (core count + speed = ticks). This data was incomplete and incorrect depending on region. The AWS API has the correct data but requires API keys to use (i.e. should not be queried directly from Nomad). This change introduces a lookup table generated by a small command line tool in Nomad's tools module which uses the Amazon AWS API. Running the tool requires AWS_* environment variables set. $ # in nomad/tools/cpuinfo $ go run . Going forward, Nomad can incorporate regeneration of the lookup table somewhere in the CI pipeline so that we remain up-to-date on the latest offerings from EC2. Fixes #7830	2020-10-08 12:01:09 -05:00
Landan Cheruka	023a2d36b7	fingerprint: changed unique.platform.azure.hostname to unique.platform.azure.name (#9016 )	2020-10-02 16:50:12 -04:00
Javier Heredia	103ac0a37f	Add consul segment fingerprint (#7214 )	2020-10-02 15:15:59 -04:00
Fredrik Hoem Grelland	a015c52846	configure nomad cluster to use a Consul Namespace [Consul Enterprise] (#8849 )	2020-10-02 14:46:36 -04:00
Fredrik Hoem Grelland	953d4de8dd	update consul-template to v0.25.1 (#8988 )	2020-10-01 14:08:49 -04:00
Landan Cheruka	3df1802119	client: added azure fingerprinting support (#8979 )	2020-10-01 09:10:27 -04:00
Lars Lehtonen	03abe3c890	client: fix test umask (#8987 )	2020-09-30 08:09:41 -04:00
Mahmood Ali	2e9e8ccc24	Merge pull request #8982 from hashicorp/b-exec-dns-resolv drivers/exec: fix DNS resolution in systemd hosts	2020-09-29 11:39:43 -05:00
Mahmood Ali	7ddf4b2902	drivers/exec: fix DNS resolution in systemd hosts Host with systemd-resolved have `/etc/resolv.conf` is a symlink to `/run/systemd/resolve/stub-resolv.conf`. By bind-mounting /etc/resolv.conf only, the exec container DNS resolution fail very badly. This change fixes DNS resolution by binding /run/systemd/resolve as well. Note that this assumes that the systemd resolver (default to 127.0.0.53) is accessible within the container. This is the case here because exec containers share the same network namespace by default. Jobs with custom network dns configurations are not affected, and Nomad will continue to use the job dns settings rather than host one.	2020-09-29 11:33:51 -04:00
Seth Hoenig	af9543c997	consul: fix validation of task in group-level script-checks When defining a script-check in a group-level service, Nomad needs to know which task is associated with the check so that it can use the correct task driver to execute the check. This PR fixes two bugs: 1) validate service.task or service.check.task is configured 2) make service.check.task inherit service.task if it is itself unset Fixes #8952	2020-09-28 15:02:59 -05:00
Pete Woods	81fa2a01fc	Add node "status", "scheduling eligibility" to all client metrics (#8925 ) - We previously added these to the client host metrics, but it's useful to have them on all client metrics. - e.g. so you can exclude draining nodes from charts showing your fleet size.	2020-09-22 13:53:50 -04:00
Pierre Cauchois	e4b739cafd	RPC Timeout/Retries account for blocking requests (#8921 ) The current implementation measures RPC request timeout only against config.RPCHoldTimeout, which is fine for non-blocking requests but will almost surely be exceeded by long-poll requests that block for minutes at a time. This adds an HasTimedOut method on the RPCInfo interface that takes into account whether the request is blocking, its maximum wait time, and the RPCHoldTimeout.	2020-09-18 08:58:41 -04:00
Joel May	2adc5bdec7	fingerprinting: add AWS MAC and public-ipv6 (#8887 )	2020-09-17 09:03:01 -04:00
Lars Lehtonen	55f0302c46	client/allocrunner/taskrunner: client.Close after err check (#8825 )	2020-09-04 08:12:08 -04:00
Tim Gross	8ad90b4253	fix params for Agent.Host client RPC (#8795 ) The parameters for the receiving side of the Agent.Host client RPC did not take the arguments serialized at the server side. This results in a panic.	2020-08-31 17:14:26 -04:00

1 2 3 4 5 ...

4274 Commits