Commit graph

69 commits

Author SHA1 Message Date
Lang Martin 3621df1dbf csi: volume ids are only unique per namespace (#7358)
* nomad/state/schema: use the namespace compound index

* scheduler/scheduler: CSIVolumeByID interface signature namespace

* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace

* scheduler/feasible: pass the captured namespace to CSIVolumeByID

* nomad/state/state_store: use namespace in csi_volume index

* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim

* nomad/core_sched: pass the namespace in volumeClaimReap

* nomad/node_endpoint_test: namespaces in Claim testing

* nomad/csi_endpoint: pass RequestNamespace to state.*

* nomad/csi_endpoint_test: appropriately failed test

* command/alloc_status_test: appropriately failed test

* node_endpoint_test: avoid notTheNamespace for the job

* scheduler/feasible_test: call SetJob to capture the namespace

* nomad/csi_endpoint: ACL check the req namespace, query by namespace

* nomad/state/state_store: remove deregister namespace check

* nomad/state/state_store: remove unused CSIVolumes

* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace

* nomad/csi_endpoint: ACL check

* nomad/state/state_store_test: remove call to state.CSIVolumes

* nomad/core_sched_test: job namespace match so claim gc works
2020-03-23 13:59:25 -04:00
Tim Gross 22e9f679c3 csi: implement controller detach RPCs (#7356)
This changeset implements the remaining controller detach RPCs: server-to-client and client-to-controller. The tests also uncovered a bug in our RPC for claims which is fixed here; the volume claim RPC is used for both claiming and releasing a claim on a volume. We should only submit a controller publish RPC when the claim is new and not when it's being released.
2020-03-23 13:59:25 -04:00
Tim Gross 8bc5641438 csi: volume claim garbage collection (#7125)
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
2020-03-23 13:58:30 -04:00
Tim Gross 8673ea5cba csi: add empty CSI volume publication GC to scheduled core jobs (#7014)
This changeset adds a new core job `CoreJobCSIVolumePublicationGC` to
the leader's loop for scheduling core job evals. Right now this is an
empty method body without even a config file stanza. Later changesets
will implement the logic of volume publication GC.
2020-03-23 13:58:29 -04:00
Seth Hoenig f0c3dca49c tests: swap lib/freeport for tweaked helper/freeport
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.

Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.

This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
2019-12-09 08:37:32 -06:00
Alex Dadgar 14a61ea3ea Don't GC running but desired stop allocations
This PR fixes an edge case where we could GC an allocation that was in a
desired stop state but had not terminated yet. This can be hit if the
client hasn't shutdown the allocation yet or if the allocation is still
shutting down (long kill_timeout).

Fixes https://github.com/hashicorp/nomad/issues/4940
2018-12-05 13:01:12 -08:00
Preetha Appan 39072977d6
Use create index as trigger condition to gc old terminal allocs 2018-11-09 11:44:21 -06:00
Preetha Appan e586817ce7
batch jobs GC removes terminal allocs if job modifyindex is older than running job 2018-11-01 00:05:31 -05:00
Preetha Appan a9d63c0df3
Check allocation's desired state in GC eligibility logic in core scheduler 2018-05-21 13:28:31 -05:00
Preetha 0b6fbb8e16
Merge pull request #4131 from hashicorp/b-rescheduling-fix-gc
Update garbage collection logic to make sure allocs with pending evals are not GCed
2018-04-11 15:44:36 -05:00
Preetha Appan 1da4d88f3d
Make test descriptions better 2018-04-11 15:12:23 -05:00
Preetha Appan 688fd9ee37
Update alloc GC eligility logic to not rely on follow up evals 2018-04-11 13:58:02 -05:00
Charlie Voiselle ba88f00ccb Changed "til" to "until"
Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.
2018-04-11 12:36:28 -05:00
Preetha Appan 59cce1d620
Fix unit test for core scheduler GC 2018-04-10 17:12:06 -05:00
Preetha Appan 7040884002
Simplify and update allocation gc eligibility logic 2018-04-10 16:08:37 -05:00
Alex Dadgar 7545c0053e job gc uses batch endpoint 2018-03-16 10:53:03 -07:00
Josh Soref 4e40338cfa spelling: rescheduling 2018-03-11 18:40:32 +00:00
Josh Soref bf05f146cd spelling: deployment 2018-03-11 17:57:49 +00:00
Preetha Appan eaedffc7f7
Fix go vet errors 2018-02-28 12:21:27 -06:00
Alex Dadgar a6dfffa4fa Add testing interfaces 2018-02-15 13:59:00 -08:00
Preetha Appan 8ecb6ca91b
Code review feedback and more test cases 2018-01-31 09:58:05 -06:00
Preetha Appan 28d2439810
Consider dead job status and modify unit test setup for correctness 2018-01-31 09:58:05 -06:00
Preetha Appan 4fd2691323
Use next alloc id being set, move outside structs package and other code review feedback 2018-01-31 09:58:05 -06:00
Preetha Appan dd91a2f5be
Make garbage collection be aware of rescheduling info in allocations 2018-01-31 09:58:05 -06:00
Alex Dadgar d3e119f4d0 thread leader token through core gc and test 2017-10-23 15:04:00 -07:00
Alex Dadgar 84d06f6abe Sync namespace changes 2017-09-07 17:04:21 -07:00
Alex Dadgar 06eddf243c parallel nomad tests 2017-07-25 17:39:36 -07:00
Alex Dadgar 84c2f25e0a Deployment GC ensures no alloc references 2017-07-17 14:09:59 -07:00
Alex Dadgar 09dfa2fc10 Rename CreateDeployments and remove cancelling behavior in state_store 2017-07-07 12:10:04 -07:00
Alex Dadgar b64185a3f1 Deployment GC
This PR implements the garbage collector for deployments. Deployments
will by default be garbage collected after 1 hour.
2017-07-07 12:05:57 -07:00
Alex Dadgar 34332af70e GC and some fixes 2017-04-15 17:08:05 -07:00
Alex Dadgar 3825f7cf1f Eval GC will collect allocs from stopped batch job
This PR fixes a bug in which allocations from stopped batch jobs could
not be garbage collected.
2017-03-11 15:48:57 -08:00
Alex Dadgar 04862ca10e Tests compile 2017-02-07 21:30:57 -08:00
Alex Dadgar 7f9c6466d4 Disallow GC of parameterized jobs
This PR makes it so parameterized jobs do not get garbage collected and
adds a test.
2017-01-26 11:57:32 -08:00
Alex Dadgar 007a538515 Fix core scheduler tests 2016-08-11 14:36:22 -07:00
Alex Dadgar e33bda76bf test sched doesn't mark complete as lost + core_sched tests 2016-08-04 11:24:17 -07:00
Diptanu Choudhury 6193529040 Fixed more tests 2016-07-25 17:31:40 -07:00
Alex Dadgar e26f826189 fix job gc tests 2016-07-25 14:56:23 -07:00
Alex Dadgar 0db55c1dce Revert "Fix job gc tests"
This reverts commit 4be50ac8c78b09d603d9680064391d449b268436.
2016-07-25 14:53:07 -07:00
Alex Dadgar e61aa2484a Fix job gc tests 2016-07-25 14:49:57 -07:00
Diptanu Choudhury 487c66b84d Removing the queued state of Job Summary and alloc desired status false 2016-07-13 13:20:46 -06:00
Alex Dadgar 099cee067d comments 2016-06-28 10:02:06 -07:00
Alex Dadgar 3f0a47f9e4 Disallow EvalGC to reap batch jobs evals/allocs and make JobGC only oneshot GCs everything 2016-06-27 22:54:03 -07:00
Diptanu Choudhury e43c460534 Fixed name of a test 2016-06-22 13:04:54 -07:00
Diptanu Choudhury 0fe8746692 GC-ing dead batch jobs 2016-06-22 11:40:27 -07:00
Alex Dadgar 25decca3ca Worker waitForIndex uses StateStore index, not Raft Applied Index 2016-06-22 09:04:22 -07:00
Alex Dadgar 98bf249625 Partial GC allocations 2016-06-10 18:32:37 -07:00
Alex Dadgar cc95d5d332 GC Nodes even if they have terminal allocations 2016-06-03 16:24:41 -07:00
Alex Dadgar d94204554f Merge pull request #1012 from hashicorp/f-partition-gc
core: Limit GC size
2016-04-14 13:00:53 -07:00
Alex Dadgar b34ab80c93 Address comments 2016-04-14 11:41:04 -07:00