Commit graph

3777 commits

Author SHA1 Message Date
Alex Munda baae6d5546
Update comment about idempotency check 2021-06-30 12:30:44 -05:00
Alex Munda 01bcd9c41c
Make idempotency error user friendly
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2021-06-30 12:26:33 -05:00
Alex Munda ca86c7ba0c
Add idempotency token to dispatch request instead of special meta key 2021-06-29 15:59:23 -05:00
Alex Munda 122136b657
Always allow idempotency key meta. Tests for idempotent dispatch 2021-06-29 10:30:04 -05:00
Alex Munda 561cd9fc7f
Enforce idempotency of dispatched jobs using special meta key 2021-06-23 17:10:31 -05:00
Seth Hoenig ebaaaae88e consul/connect: Validate uniqueness of Connect upstreams within task group
This PR adds validation during job submission that Connect proxy upstreams
within a task group are using different listener addresses. Otherwise, a
duplicate envoy listener will be created and not be able to bind.

Closes #7833
2021-06-18 16:50:53 -05:00
Mahmood Ali 33dfe98770
deployment watcher: Reuse allocsCh if allocIndex remains the same (#10756)
Fix deployment watchers to avoid creating unnecessary deployment watcher goroutines and blocking queries. `deploymentWatcher.getAllocsCh` creates a new goroutine that makes a blocking query to fetch updates of deployment allocs.

## Background

When operators submit a new or updated service job, Nomad create a new deployment by default. The deployment object controls how fast to place the allocations through [`max_parallel`](https://www.nomadproject.io/docs/job-specification/update#max_parallel) and health checks configurations.

The `scheduler` and `deploymentwatcher` package collaborate to achieve deployment logic: The scheduler only places the canaries and `max_parallel` allocations for a new deployment; the `deploymentwatcher` monitors for alloc progress and then enqueues a new evaluation whenever the scheduler should reprocess a job and places the next `max_parallel` round of allocations.

The `deploymentwatcher` package makes blocking queries against the state store, to fetch all deployments and the relevant allocs for each running deployments. If `deploymentwatcher` fails or is hindered from fetching the state, the deployments fail to make progress.

`Deploymentwatcher` logic only runs on the leader.

## Why unnecessary deployment watchers can halt cluster progress
Previously, `getAllocsCh` is called on every for loop iteration in `deploymentWatcher.watch()` function. However, the for-loop may iterate many times before the allocs get updated. In fact, whenever a new deployment is created/updated/deleted, *all* `deploymentWatcher`s get notified through `w.deploymentUpdateCh`. The `getAllocsCh` goroutines and blocking queries spike significantly and grow quadratically with respect to the number of running deployments. The growth leads to two adverse outcomes:

1. it spikes the CPU/Memory usage resulting potentially leading to OOM or very slow processing
2. it activates the [query rate limiter](abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L896-L898)), so later the watcher fails to get updates and consequently fails to make progress towards placing new allocations for the deployment!

So the cluster fails to catch up and fails to make progress in almost all deployments. The cluster recovers after a leader transition: the deposed leader stops all watchers and free up goroutines and blocking queries; the new leader recreates the watchers without the quadratic growth and remaining under the rate limiter.  Well, until a spike of deployments are created triggering the condition again.

### Relevant Code References

Path for deployment monitoring:
* [`Watcher.watchDeployments`](abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L164-L192)) loops waiting for deployment updates.
* On every deployment update, [`w.getDeploys`](abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L194-L229)) returns all deployments in the system
* `watchDeployments` calls `w.add(d)` on every active deployment
* which in turns, [updates existing watcher if one is found](abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L251-L255)).
* The deployment watcher [updates local local deployment field and trigger `deploymentUpdateCh` channel]( abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L136-L147))
* The [deployment watcher `deploymentUpdateCh` selector is activated](abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L455-L489)). Most of the time the selector clause is a no-op, because the flow was triggered due to another deployment update
* The `watch` for-loop iterates again and in the previous code we create yet another goroutine and blocking call that risks being rate limited.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2021-06-14 16:01:01 -04:00
Seth Hoenig 6eeaefa59f
Merge pull request #10754 from hashicorp/b-client-connect-constraint
consul/connect: remove unnecessary connect constraint on clients
2021-06-14 09:41:25 -05:00
Seth Hoenig 7b8e15159b consul/connect: remove unnecessary connect constraint on clients
PR https://github.com/hashicorp/nomad/pull/10702 added 2 new constraints
for connect jobs - one for Consul gRPC listener, and one for Connect being
enabled on Clients. Connect does not need to be enabled on clients, only
on Consul servers. Remove the extra constraint.

Discuss:
https://discuss.hashicorp.com/t/nomad-1-1-1-and-consul-connect-enabled-on-consul-clients/25295
2021-06-14 08:01:45 -05:00
James Rasell 0cccf7c2b8
volumewatcher: fix test data race. 2021-06-14 12:11:35 +02:00
James Rasell a99fcfb4c8
Merge pull request #10745 from hashicorp/b-fix-test-datarace-deploymentwatcher
deploymentwatcher: fix test data race.
2021-06-11 17:23:03 +02:00
James Rasell 939b23936a
Merge pull request #10744 from hashicorp/b-remove-duplicate-imports
chore: remove duplicate import statements
2021-06-11 16:42:34 +02:00
Mahmood Ali 74efd3626e
Merge pull request #10742 from hashicorp/deflake-tests-20210608
Deflaking Test 2021 June edition
2021-06-11 09:14:40 -04:00
James Rasell c168108bb7
Merge pull request #10739 from hashicorp/f-remove-unused-types-pkg
core: remove unused types pkg and PeriodicCallback type.
2021-06-11 13:27:22 +02:00
James Rasell ff75b4da09
deploymentwatcher: fix test data race. 2021-06-11 11:55:21 +02:00
James Rasell 492e308846
tests: remove duplicate import statements. 2021-06-11 09:39:22 +02:00
Mahmood Ali 9b35bf1858 deflake TestNomad_BootstrapExpect and other leader tests
The test fails reliably locally on my machine. The test uses non-dev mode
where Raft actions get committed to disk, causing operations to exceed
the 50ms tight Raft deadlines.

So, here we ensure that non-dev servers use default Raft config
files with longer timeouts.

Also, noticed that the test queries a server, that may a follower with a
stale state.

I've updated the test to ensure we query the leader for its state. The
Barrier call ensures that the leader is a "stable" leader with committed
entries. Protects against a window where a new leader reports the
previous term before it commits a raft log entry.
2021-06-10 22:04:10 -04:00
Mahmood Ali ff73cc279e tests: deflake TestAgentProfile_RemoteClient
TestAgentProfile_RemoteClient test must wait for the client node to be
registered in raft state store, and not merely that the server has a
network connection from the client.

In https://app.circleci.com/pipelines/github/hashicorp/nomad/15539/workflows/8dcbc3f3-946b-4da0-b089-9093788bc0c9/jobs/147919, notice how `node registration complete` log line occured after the test already have failed.

This is another case of flakiness due to not waiting for client
registration.
2021-06-10 22:00:15 -04:00
Mahmood Ali 8009d9837c tests: deflake TestMonitor_Monitor_RemoteServer and cross-region tests
Ensure that all servers are joined to each other before test proceed,
instead of just joining them to the first server and relying on
background serf propagation.

Relying on backgorund serf propagation is a cause of flakiness,
specially for tests with multiple regions. The server receiving the RPC
may not be aware of the region and fail to forward RPC accordingly.

For example, consider `TestMonitor_Monitor_RemoteServer` failure in https://app.circleci.com/pipelines/github/hashicorp/nomad/16402/workflows/7f327235-7d0c-40ba-9757-600522afca51/jobs/158045  you can observe:
* `nomad-117` is joined to `nomad-118` and `nomad-119`
* `nomad-119` is the foreign region
* `nomad-117` gains leadership in the default region, `nomad-118` is the non-leader
*  search logs for `nomad: adding server` and notice that `nomad-118`
   only added `nomad-118` and `nomad-118`, but not `nomad-119`!
* so the query to the non-leader in the test fails to be forwarded to
  the appopriate region.
2021-06-10 21:27:55 -04:00
James Rasell 25883eca43
core: remove unused types pkg and PeriodicCallback type. 2021-06-10 15:57:13 +02:00
Nomad Release Bot 4fe52bc753 remove generated files 2021-06-10 08:04:25 -04:00
Nomad Release bot 7cc7389afd Generate files for 1.1.1 release 2021-06-10 08:04:25 -04:00
Mahmood Ali aa77c2731b tests: use standard library testing.TB
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
2021-06-09 16:18:45 -07:00
Seth Hoenig dbdc479970 consul: move consul acl tests into ent files
(cherry-pick ent back to oss)

This PR moves a lot of Consul ACL token validation tests into ent files,
so that we can verify correct behavior difference between OSS and ENT
Nomad versions.
2021-06-09 08:38:42 -05:00
Seth Hoenig 87be8c4c4b consul: correctly check consul acl token namespace when using consul oss
This PR fixes the Nomad Object Namespace <-> Consul ACL Token relationship
check when using Consul OSS (or Consul ENT without namespace support).

Nomad v1.1.0 introduced a regression where Nomad would fail the validation
when submitting Connect jobs and allow_unauthenticated set to true, with
Consul OSS - because it would do the namespace check against the Consul ACL
token assuming the "default" namespace, which does not work because Consul OSS
does not have namespaces.

Instead of making the bad assumption, expand the namespace check to handle
each special case explicitly.

Fixes #10718
2021-06-08 13:55:57 -05:00
Jasmine Dahilig ca4be6857e
deployment query rate limit (#10706) 2021-06-04 12:38:46 -07:00
Seth Hoenig d026ff1f66 consul/connect: add support for connect mesh gateways
This PR implements first-class support for Nomad running Consul
Connect Mesh Gateways. Mesh gateways enable services in the Connect
mesh to make cross-DC connections via gateways, where each datacenter
may not have full node interconnectivity.

Consul docs with more information:
https://www.consul.io/docs/connect/gateways/mesh-gateway

The following group level service block can be used to establish
a Connect mesh gateway.

service {
  connect {
    gateway {
      mesh {
        // no configuration
      }
    }
  }
}

Services can make use of a mesh gateway by configuring so in their
upstream blocks, e.g.

service {
  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "<service>"
          local_bind_port  = <port>
          datacenter       = "<datacenter>"
          mesh_gateway {
            mode = "<mode>"
          }
        }
      }
    }
  }
}

Typical use of a mesh gateway is to create a bridge between datacenters.
A mesh gateway should then be configured with a service port that is
mapped from a host_network configured on a WAN interface in Nomad agent
config, e.g.

client {
  host_network "public" {
    interface = "eth1"
  }
}

Create a port mapping in the group.network block for use by the mesh
gateway service from the public host_network, e.g.

network {
  mode = "bridge"
  port "mesh_wan" {
    host_network = "public"
  }
}

Use this port label for the service.port of the mesh gateway, e.g.

service {
  name = "mesh-gateway"
  port = "mesh_wan"
  connect {
    gateway {
      mesh {}
    }
  }
}

Currently Envoy is the only supported gateway implementation in Consul.
By default Nomad client will run the latest official Envoy docker image
supported by the local Consul agent. The Envoy task can be customized
by setting `meta.connect.gateway_image` in agent config or by setting
the `connect.sidecar_task` block.

Gateways require Consul 1.8.0+, enforced by the Nomad scheduler.

Closes #9446
2021-06-04 08:24:49 -05:00
Seth Hoenig 4c087efd59
Merge pull request #10702 from hashicorp/f-cc-constraints
consul/connect: use additional constraints in scheduling connect tasks
2021-06-04 08:11:21 -05:00
Tim Gross 8b2ecde5b4 csi: accept list of caps during validation in volume register
When `nomad volume create` was introduced in Nomad 1.1.0, we changed the
volume spec to take a list of capabilities rather than a single capability, to
meet the requirements of the CSI spec. When a volume is registered via `nomad
volume register`, we should be using the same fields to validate the volume
with the controller plugin.
2021-06-04 07:57:26 -04:00
Seth Hoenig d359eb6f3a consul/connect: use additional constraints in scheduling connect tasks
This PR adds two additional constraints on Connect sidecar and gateway tasks,
making sure Nomad schedules them only onto nodes where Connect is actually
enabled on the Consul agent.

Consul requires `connect.enabled = true` and `ports.grpc = <number>` to be
explicitly set on agent configuration before Connect APIs will work. Until
now, Nomad would only validate a minimum version of Consul, which would cause
confusion for users who try to run Connect tasks on nodes where Consul is not
yet sufficiently configured. These contstraints prevent job scheduling on nodes
where Connect is not actually use-able.

Closes #10700
2021-06-03 15:43:34 -05:00
Tim Gross c01d661c98 csi: validate volume block has attachment_mode and access_mode
The `attachment_mode` and `access_mode` fields are required for CSI
volumes. The `mount_options` block is only allowed for CSI volumes.
2021-06-03 16:07:19 -04:00
Tim Gross e9777a88ce plan applier: add trace-level log of plan
The plans generated by the scheduler produce high-level output of counts on each
evaluation, but when debugging scheduler issues it'd be nice to have a more
detailed view of the resulting plan. Emitting this log at trace minimizes the
overhead, and producing it in the plan applyer makes it easier to find as it
will always be on the leader.
2021-06-02 10:25:23 -04:00
Tim Gross 7a55a6af16 leader: call eval log formatting lazily
Arguments to our logger's various write methods are evaluated eagerly, so
method calls in log parameters will always be called, regardless of log
level. Move some logger messages to the logger's `Fmt` method so that
`GoString` is evaluated lazily instead.
2021-06-02 09:59:55 -04:00
Ryan Sundberg d43c5f98a5 CSI: Include MountOptions in capabilities sent to CSI for all RPCs
Include the VolumeCapability.MountVolume data in
ControllerPublishVolume, CreateVolume, and ValidateVolumeCapabilities
RPCs sent to the CSI controller. The previous behavior was to only
include the MountVolume capability in the NodeStageVolume request, which
on some CSI implementations would be rejected since the Volume was not
originally provisioned with the specific mount capabilities requested.
2021-05-24 10:59:54 -04:00
James Rasell db4e2541bd
events: fix event endpoint tests to ignore heartbeats.
seems when this PR was raised, the Nomad CI provider was having
availability issues meaning the test suite was not correctly run,
thus allowing broken tests into main. The PR itself exercised test
code which had not been hit before.

The particular problem is when identifying whether the event
received is a heartbeat; this was performed using standard Golang
conditionals. Unfortunately the operator == is not defined on byte
arrays, resulting in the check always returning false. To overcome
this issue the code now uses the bytes.Equal function to correctly
compare the data.
2021-05-24 10:28:19 +02:00
Szabolcs Gelencsér fc97bd6acf
events: fix slow client connection to empty event stream (#10637)
* events: fix slow client connection to empty event stream

* doc: fix changelog of event stream connection init
2021-05-21 13:17:07 -04:00
Chris Baker 263ddd567c
Node Drain Metadata (#10250) 2021-05-07 13:58:40 -04:00
Mahmood Ali 102763c979
Support disabling TCP checks for connect sidecar services 2021-05-07 12:10:26 -04:00
Seth Hoenig b024d85f48 connect: use deterministic injected dynamic exposed port
This PR uses the checksum of the check for which a dynamic exposed
port is being generated (instead of a UUID prefix) so that the
generated port label is deterministic.

This fixes 2 bugs:
 - 'job plan' output is now idempotent for jobs making use of injected ports
 - tasks will no longer be destructively updated when jobs making use of
   injected ports are re-run without changing any user specified part of
   job config.

Closes: https://github.com/hashicorp/nomad/issues/10099
2021-04-30 15:18:22 -06:00
Michael Schurter 547a718ef6
Merge pull request #10248 from hashicorp/f-remotetask-2021
core: propagate remote task handles
2021-04-30 08:57:26 -07:00
Mahmood Ali 52d881f567
Allow configuring memory oversubscription (#10466)
Cluster operators want to have better control over memory
oversubscription and may want to enable/disable it based on their
experience.

This PR adds a scheduler configuration field to control memory
oversubscription. It's additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler), or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config).

I opted to have the memory oversubscription be an opt-in, but happy to change it.  To enable it, operators should call the API with:
```json
{
  "MemoryOversubscriptionEnabled": true
}
```

If memory oversubscription is disabled, submitting jobs specifying `memory_max` will get a "Memory oversubscription is not
enabled" warnings, but the jobs will be accepted without them accessing
the additional memory.

The warning message is like:
```
$ nomad job run /tmp/j
Job Warnings:
1 warning(s):

* Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored

==> Monitoring evaluation "7c444157"
    Evaluation triggered by job "example"
==> Monitoring evaluation "7c444157"
    Evaluation within deployment: "9d826f13"
    Allocation "aa5c3cad" created: node "9272088e", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c444157" finished with status "complete"

# then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory:
$ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory'
{
  "MemoryMB": 256,
  "MemoryMaxMB": 0
}
```
2021-04-29 22:09:56 -04:00
Luiz Aoqui f1b9055d21
Add metrics for blocked eval resources (#10454)
* add metrics for blocked eval resources

* docs: add new blocked_evals metrics

* fix to call `pruneStats` instead of `stats.prune` directly
2021-04-29 15:03:45 -04:00
Seth Hoenig d54a606819
Merge pull request #10439 from hashicorp/pick-ent-acls-changes
e2e: add e2e tests for consul namespaces on ent with acls
2021-04-28 08:30:08 -06:00
Tim Gross 79f81d617e licensing: remove raft storage and sync
This changeset is the OSS portion of the work to remove the raft storage and
sync for Nomad Enterprise.
2021-04-28 10:28:23 -04:00
Michael Schurter e62795798d core: propagate remote task handles
Add a new driver capability: RemoteTasks.

When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.

This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.

See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
2021-04-27 15:07:03 -07:00
Seth Hoenig 09cd01a5f3 e2e: add e2e tests for consul namespaces on ent with acls
This PR adds e2e tests for Consul Namespaces for Nomad Enterprise
with Consul ACLs enabled.

Needed to add support for Consul ACL tokens with `namespace` and
`namespace_prefix` blocks, which Nomad parses and validates before
tossing the token. These bits will need to be picked back to OSS.
2021-04-27 14:45:54 -06:00
Seth Hoenig d76bcf0e12
Merge pull request #10457 from hashicorp/b-igce-wildcard
consul/connect: fix bug where ingress gateways could not use wildcard services
2021-04-27 14:41:47 -06:00
Seth Hoenig 865c7a5841 consul/connect: fix bug where ingress gateways could not use wildcard services
This PR fixes a bug where Nomad was more restrictive on Ingress Gateway Configuration
Entry definitions than Consul. Before, Nomad would not allow for declaring IGCEs with
http listeners with service name "*", which is a special feature allowable by Consul.

Note: to make http protocol work, a service-default must be defined setting the
protocol to http for each service.

Fixes: #9729
2021-04-27 13:42:26 -06:00
Seth Hoenig f47c6d34f7 consul/connect: check connect group and service names for uppercase characters
This PR adds job-submission validation that checks for the use of uppercase characters
in group and service names for services that make use of Consul Connect. This prevents
attempting to launch services that Consul will not validate correctly, which in turn
causes tasks to fail to launch in Nomad.

Underlying Consul issue: https://github.com/hashicorp/consul/issues/6765

Closes #7581 #10450
2021-04-27 11:26:37 -06:00
Mahmood Ali cf24a9eaaf
api: /v1/jobs always include namespaces (#10434)
Add Namespace as a top-level field in `/v1/jobs` stub.

The `/v1/jobs` endpoint already includes the namespace under `JobSummary`, though the API is odd, as typically the job ID and Namespace are in the same level, and the oddity complicates the UI frontend development.

The downside of adding it is redundant field, that makes the response body a bit bigger, specially for clusters with large jobs. Though, it should compress nicely and I expect the overhead to be small to overall response size. The benefit of a cleaner and more consistent API seem worth it.

Fixes #10431
2021-04-23 16:36:54 -04:00
Mahmood Ali d2fcce21f8
Migrate all allocs when draining a node (#10411)
This fixes a bug affecting drain nodes, where allocs may fail to be
migrated if they belong to different namespaces but share the same job
name.

The reason is that the helper function that creates the migration evals
indexed the allocs by job ID without accounting for the namespaces.
When job ids clash, only an eval is created for one and the rest of the
allocs remain intact.

Fixes #10172
2021-04-21 12:11:14 -04:00
Seth Hoenig f71dd3857e api: include ent fuzzy struct types in oss
Small change to pull in ent struct types in a switch
statement used by ent. They are benign in oss, this
is just to make sure OSS->ENT merges don't create a
diff.
2021-04-20 11:19:38 -06:00
Seth Hoenig 4e6dbaaec1
Merge pull request #10184 from hashicorp/f-fuzzy-search
api: implement fuzzy search API
2021-04-20 09:06:40 -06:00
Seth Hoenig 509490e5d2 e2e: consul namespace tests from nomad ent
(cherry-picked from ent without _ent things)

This is part 2/4 of e2e tests for Consul Namespaces. Took a
first pass at what the parameterized tests can look like, but
only on the ENT side for this PR. Will continue to refactor
in the next PRs.

Also fixes 2 bugs:
 - Config Entries registered by Nomad Server on job registration
   were not getting Namespace set
 - Group level script checks were not getting Namespace set

Those changes will need to be copied back to Nomad OSS.

Nomad OSS + no ACLs (previously, needs refactor)
Nomad ENT + no ACLs (this)
Nomad OSS + ACLs (todo)
Nomad ENT + ALCs (todo)
2021-04-19 15:35:31 -06:00
Seth Hoenig c34ef9eb78 api: fuzzy search results include job name with id in scope 2021-04-16 17:03:36 -06:00
Seth Hoenig 0b2114a7a5 api: make fuzzy searching case-agnostic 2021-04-16 16:56:10 -06:00
Seth Hoenig 1ee8d5ffc5 api: implement fuzzy search API
This PR introduces the /v1/search/fuzzy API endpoint, used for fuzzy
searching objects in Nomad. The fuzzy search endpoint routes requests
to the Nomad Server leader, which implements the Search.FuzzySearch RPC
method.

Requests to the fuzzy search API are based on the api.FuzzySearchRequest
object, e.g.

{
  "Text": "ed",
  "Context": "all"
}

Responses from the fuzzy search API are based on the api.FuzzySearchResponse
object, e.g.

{
  "Index": 27,
  "KnownLeader": true,
  "LastContact": 0,
  "Matches": {
    "tasks": [
      {
        "ID": "redis",
        "Scope": [
          "default",
          "example",
          "cache"
        ]
      }
    ],
    "evals": [],
    "deployment": [],
    "volumes": [],
    "scaling_policy": [],
    "images": [
      {
        "ID": "redis:3.2",
        "Scope": [
          "default",
          "example",
          "cache",
          "redis"
        ]
      }
    ]
  },
  "Truncations": {
    "volumes": false,
    "scaling_policy": false,
    "evals": false,
    "deployment": false
  }
}

The API is tunable using the new server.search stanza, e.g.

server {
  search {
    fuzzy_enabled   = true
    limit_query     = 200
    limit_results   = 1000
    min_term_length = 5
  }
}

These values can be increased or decreased, so as to provide more
search results or to reduce load on the Nomad Server. The fuzzy search
API can be disabled entirely by setting `fuzzy_enabled` to `false`.
2021-04-16 16:36:07 -06:00
Nick Spain 085b54bd0b Update TaskGroup services edited diff test to actually check Body 2021-04-13 09:15:35 -04:00
Nick Spain bfd4980e3f Hash Body field as part of ServiceCheck 2021-04-13 09:15:35 -04:00
Nick Spain 653d84ef68 Add a 'body' field to the check stanza
Consul allows specifying the HTTP body to send in a health check. Nomad
uses Consul for health checking so this just plumbs the value through to
where the Consul API is called.

There is no validation that `body` is not used with an incompatible
check method like GET.
2021-04-13 09:15:35 -04:00
Charlie Voiselle 8afb9eb05d
Fix parameterized <-> non-parameterized job error (#10357)
The error messages are reversed from tests performed above them.  The test uses the `validateJobUpdate()` function, but ignores the text of the error message itself.
2021-04-12 09:27:04 -04:00
Tim Gross 3113cced7b CSI: ensure page slices are within bounds
Plugins could potentially ignore the `max_entries` field and return a list of
entries that is greater, so we slice the return value in the server RPC to
enforce these value. But page sizes less than the number of entries for the
external CSI ListVolumes and ListSnapshots RPCs could cause a panic, so fix
the boundary checking.
2021-04-09 14:12:38 -04:00
Lars Lehtonen 61d3c3b480 nomad/structs: fix diff 2021-04-09 08:21:46 -04:00
Tim Gross 0892d34ff9 CSI: capability block is required for volume registration 2021-04-08 13:02:24 -04:00
Tim Gross d2e479505c CSI: capability check ListVolumes at RPC for nicer error messages
The plugin stub object does not include fine-grained capability checks, which
means `nomad volume status -verbose` will return ugly and verbose error
"Unimplemented" messages from the plugin if it does not support the CSI
`ListVolumes` RPC. Return a nicer error message from our RPC handler instead.
2021-04-07 12:00:22 -04:00
Tim Gross 276633673d CSI: use AccessMode/AttachmentMode from CSIVolumeClaim
Registration of Nomad volumes previously allowed for a single volume
capability (access mode + attachment mode pair). The recent `volume create`
command requires that we pass a list of requested capabilities, but the
existing workflow for claiming volumes and attaching them on the client
assumed that the volume's single capability was correct and unchanging.

Add `AccessMode` and `AttachmentMode` to `CSIVolumeClaim`, use these fields to
set the initial claim value, and add backwards compatibility logic to handle
the existing volumes that already have claims without these fields.
2021-04-07 11:24:09 -04:00
Tim Gross dbcc2694b0 refactor: move VolumeRequest validation to Validate method 2021-04-07 11:24:09 -04:00
Tim Gross 72c07f15fb refactor: internal claim methods should be private 2021-04-07 11:24:09 -04:00
Seth Hoenig fe8fce00d9 consul: minor CR cleanup 2021-04-05 10:10:16 -06:00
Seth Hoenig f17ba33f61 consul: plubming for specifying consul namespace in job/group
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
2021-04-05 10:03:19 -06:00
Chris Baker 21bc48ca29 json handles were moved to a new package in #10202
this was unecessary after refactoring, so this moves them back to their
original location in package structs
2021-04-02 13:31:10 +00:00
Chris Baker 436d46bd19
Merge branch 'main' into f-node-drain-api 2021-04-01 15:22:57 -05:00
Tim Gross 0856483115 CSI: fingerprint detailed node capabilities
In order to support new node RPCs, we need to fingerprint plugin capabilities
in more detail. This changeset mirrors recent work to fingerprint controller
capabilities, but is not yet in use by any Nomad RPC.
2021-04-01 16:00:58 -04:00
Tim Gross 466b620fa4
CSI: volume snapshot 2021-04-01 11:16:52 -04:00
Tim Gross 71b9daffb9 CSI: fix misleading HTTP test
The HTTP test to create CSI volumes depends on having a controller plugin to
talk to, but the test was using a node-only plugin, which allows it to
silently ignore the missing controller.
2021-03-31 16:37:09 -04:00
Tim Gross 9fc4cf1419 CSI: fingerprint detailed controller capabilities
In order to support new controller RPCs, we need to fingerprint volume
capabilities in more detail and perform controller RPCs only when the specific
capability is present. This fixes a bug in Ceph support where the plugin can
only suport create/delete but we assume that it also supports attach/detach.
2021-03-31 16:37:09 -04:00
Tim Gross f149abfa41 CSI: volume creation/registration should not validate attachment
The CSI specification requires that we validate a list of `Capability` (access
mode + accessibility) when we create volume, but the existing volume
registration workflow incorrectly validates a single capability. The
specific capability required by a volume claim is checked at the time we make
the claim, so remove the check for `AttachmentMode`/`AcccessMode`.
2021-03-31 16:37:09 -04:00
Tim Gross aec5337862 CSI: HTTP handlers for create/delete/list 2021-03-31 16:37:09 -04:00
Tim Gross d38008176e CSI: create/delete/list volume RPCs
This commit implements the RPC handlers on the client that talk to the CSI
plugins on that client for the Create/Delete/List RPC.
2021-03-31 16:37:09 -04:00
Tim Gross 43622680fa test infrastructure for mock client RPCs (#10193)
This commit includes a new test client that allows overriding the RPC
protocols. Only the RPCs that are passed in are registered, which lets you
implement a mock RPC in the server tests. This commit includes an example of
this for the ClientCSI RPC server.
2021-03-31 16:37:09 -04:00
Mahmood Ali e24dac2f39 fixup! oversubscription: Add MemoryMaxMB to internal structs 2021-03-31 09:26:26 -04:00
Mahmood Ali 4ec2d8f5e4 fixup! oversubscription: Add MemoryMaxMB to internal structs 2021-03-31 08:52:20 -04:00
Mahmood Ali 0c2551270a oversubscription: Add MemoryMaxMB to internal structs
Start tracking a new MemoryMaxMB field that represents the maximum memory a task
may use in the client. This allows tasks to specify a memory reservation (to be
used by scheduler when placing the task) but use excess memory used on the
client if the client has any.

This commit adds the server tracking for the value, and ensures that allocations
AllocatedResource fields include the value.
2021-03-30 16:55:58 -04:00
Nick Ethier daecfa61e6
Merge pull request #10203 from hashicorp/f-cpu-cores
Reserved Cores [1/4]: Structs and scheduler implementation
2021-03-29 14:05:54 -04:00
Chris Baker 16e37b986a reworked Node.Canonicalize() to enforce invariants, fixed a broken test 2021-03-26 18:58:38 +00:00
Chris Baker 99ef00d5da reinserted/expanded fsm node.canonicalize test that was still needed 2021-03-26 17:10:39 +00:00
Chris Baker 770c9cecb5 restored Node.Sanitize() for RPC endpoints
multiple other updates from code review
2021-03-26 17:03:15 +00:00
Mahmood Ali dbc3850358
Merge pull request #10145 from hashicorp/b-periodic-init-status
periodic: always reset periodic children status
2021-03-26 09:19:08 -04:00
Mahmood Ali 5d75705edd dispatched parameterized job should clear status too 2021-03-25 15:14:21 -04:00
Mahmood Ali e643742a38 Add a test for parameterized summary counts 2021-03-25 11:27:09 -04:00
Mahmood Ali b0e048bfa4 periodic: always reset periodic children status
Fixes a bug where Nomad reports negative or incorrect running children
counts for periodic jobs.

The periodic dispatcher derives a child job without reseting the status.
If the periodic job has a `running` status, the derived job will start
as `running` status and transition to `pending`.  Since this is
unexpected transition, the counting in StateStore.setJobSummary gets out of sync and
result in negative/incorrect values.

Note that this only affects periodic jobs after a leader transition.
During the first job registration, the job is added with `pending` or
`""` status. However, after a leader transition, the new leader
repopulates the dispatcher heap with `"running"` status and triggers the
bug.
2021-03-25 11:27:09 -04:00
Chris Baker 07646316e5 some comments on the new json extensions/encoding 2021-03-23 18:18:51 +00:00
Chris Baker a43b3a8736 change to fail-safe in json encoding 2021-03-23 18:13:10 +00:00
Chris Baker cb540ed691 added tests that the API doesn't leak Node.SecretID
added more documentation on JSON encoding to the contributing guide
2021-03-23 18:09:20 +00:00
Drew Bailey 74836b95b2
configuration and oss components for licensing (#10216)
* configuration and oss components for licensing

* vendor sync
2021-03-23 09:08:14 -04:00
Mahmood Ali 73b7b98018
testing: default nomad test nodes to 1.0.0 (#10213) 2021-03-23 08:24:26 -04:00
Chris Baker a186badf35 moved JSON handlers and extension code around a bit for proper order of
initialization
2021-03-22 14:12:42 +00:00
Chris Baker 9f7bc5a575 refactor? 2021-03-22 01:49:21 +00:00
Chris Baker dd291e69f4 removed deprecated fields from Drain structs and API
node drain: use msgtype on txn so that events are emitted
wip: encoding extension to add Node.Drain field back to API responses

new approach for hiding Node.SecretID in the API, using `json` tag
documented this approach in the contributing guide
refactored the JSON handlers with extensions
modified event stream encoding to use the go-msgpack encoders with the extensions
2021-03-21 15:30:11 +00:00
Nick Ethier 648ade63ad scheduler: implement scheduling of reserved cores 2021-03-19 00:29:07 -04:00