open-nomad

Commit Graph

Author	SHA1	Message	Date
Seth Hoenig	ba728f8f97	api: enable support for setting original job source (#16763 ) * api: enable support for setting original source alongside job This PR adds support for setting job source material along with the registration of a job. This includes a new HTTP endpoint and a new RPC endpoint for making queries for the original source of a job. The HTTP endpoint is /v1/job/<id>/submission?version=<version> and the RPC method is Job.GetJobSubmission. The job source (if submitted, and doing so is always optional), is stored in the job_submission memdb table, separately from the actual job. This way we do not incur overhead of reading the large string field throughout normal job operations. The server config now includes job_max_source_size for configuring the maximum size the job source may be, before the server simply drops the source material. This should help prevent Bad Things from happening when huge jobs are submitted. If the value is set to 0, all job source material will be dropped. * api: avoid writing var content to disk for parsing * api: move submission validation into RPC layer * api: return an error if updating a job submission without namespace or job id * api: be exact about the job index we associate a submission with (modify) * api: reword api docs scheduling * api: prune all but the last 6 job submissions * api: protect against nil job submission in job validation * api: set max job source size in test server * api: fixups from pr	2023-04-11 08:45:08 -05:00
hashicorp-copywrite[bot]	005636afa0	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Tim Gross	99d46e5a49	scheduling: prevent self-collision in dynamic port network offerings (#16401 ) When the scheduler tries to find a placement for a new allocation, it iterates over a subset of nodes. For each node, we populate a `NetworkIndex` bitmap with the ports of all existing allocations and any other allocations already proposed as part of this same evaluation via its `SetAllocs` method. Then we make an "ask" of the `NetworkIndex` in `AssignPorts` for any ports we need and receive an "offer" in return. The offer will include both static ports and any dynamic port assignments. The `AssignPorts` method was written to support group networks, and it shares code that selects dynamic ports with the original `AssignTaskNetwork` code. `AssignTaskNetwork` can request multiple ports from the bitmap at a time. But `AssignPorts` requests them one at a time and does not account for possible collisions, and doesn't return an error in that case. What happens next varies: 1. If the scheduler doesn't place the allocation on that node, the port conflict is thrown away and there's no problem. 2. If the node is picked and this is the only allocation (or last allocation), the plan applier will reject the plan when it calls `SetAllocs`, as we'd expect. 3. If the node is picked and there are additional allocations in the same eval that iterate over the same node, their call to `SetAllocs` will detect the impossible state and the node will be rejected. This can have the puzzling behavior where a second task group for the job without any networking at all can hit a port collision error! It looks like this bug has existed since we implemented group networks, but there are several factors that add up to making the issue rare for many users yet frustratingly frequent for others: * You're more likely to hit this bug the more tightly packed your range for dynamic ports is. With 12000 ports in the range by default, many clusters can avoid this for a long time. * You're more likely to hit case (3) for jobs with lots of allocations or if a scheduler has to iterate over a large number of nodes, such as with system jobs, jobs with `spread` blocks, or (sometimes) jobs using `unique` constraints. For unlucky combinations of these factors, it's possible that case (3) happens repeatedly, preventing scheduling of a given job until a client state change (ex. restarting the agent so all its allocations are rescheduled elsewhere) re-opens the range of dynamic ports available. This changeset: * Fixes the bug by accounting for collisions in dynamic port selection in `AssignPorts`. * Adds test coverage for `AssignPorts`, expands coverage of this case for the deprecated `AssignTaskNetwork`, and tightens the dynamic port range in a scheduler test for spread scheduling to more easily detect this kind of problem in the future. * Adds a `String()` method to `Bitmap` so that any future "screaming" log lines have a human-readable list of used ports.	2023-03-09 10:09:54 -05:00
Tim Gross	19a2c065f4	System and sysbatch jobs always have zero index (#16030 ) Service jobs should have unique allocation Names, derived from the Job.ID. System jobs do not have unique allocation Names because the index is intended to indicated the instance out of a desired count size. Because system jobs do not have an explicit count but the results are based on the targeted nodes, the index is less informative and this was intentionally omitted from the original design. Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for system/sysbatch jobs Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs. System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So interpolation via `per_alloc` will not work as soon as there's more than one allocation placed. Validate against this on job submission.	2023-02-02 16:18:01 -05:00
jmwilkinson	37834dffda	Allow wildcard datacenters to be specified in job file (#11170 ) Also allows for default value of `datacenters = ["*"]`	2023-02-02 09:57:45 -05:00
Seth Hoenig	2088ca3345	cleanup more helper updates (#14638 ) * cleanup: refactor MapStringStringSliceValueSet to be cleaner * cleanup: replace SliceStringToSet with actual set * cleanup: replace SliceStringSubset with real set * cleanup: replace SliceStringContains with slices.Contains * cleanup: remove unused function SliceStringHasPrefix * cleanup: fixup StringHasPrefixInSlice doc string * cleanup: refactor SliceSetDisjoint to use real set * cleanup: replace CompareSliceSetString with SliceSetEq * cleanup: replace CompareMapStringString with maps.Equal * cleanup: replace CopyMapStringString with CopyMap * cleanup: replace CopyMapStringInterface with CopyMap * cleanup: fixup more CopyMapStringString and CopyMapStringInt * cleanup: replace CopySliceString with slices.Clone * cleanup: remove unused CopySliceInt * cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice * cleanup: replace CopyMap with maps.Clone * cleanup: run go mod tidy	2022-09-21 14:53:25 -05:00
Mahmood Ali	a9d5e4c510	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Piotr Kazmierczak	b63944b5c1	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 ) Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.	2022-08-17 18:26:34 +02:00
Abirdcfly	d66943d4f7	fix minor unreachable code caused by t.Fatal Signed-off-by: Abirdcfly <fp544037857@gmail.com>	2022-08-08 23:50:11 +08:00
Tim Gross	8ff5ea1bee	CSI: no early return when feasibility check fails on eligible nodes (#13274 ) As a performance optimization in the scheduler, feasibility checks that apply to an entire class are only checked once for all nodes of that class. Other feasibility checks are "available" checks because they rely on more ephemeral characteristics and don't contribute to the hash for the node class. This currently includes only CSI. We have a separate fast path for "available" checks when the node has already been marked eligible on the basis of class. This fast path has a bug where it returns early rather than continuing the loop. This causes the entire task group to be rejected. Fix the bug by not returning early in the fast path and instead jump to the top of the loop like all the other code paths in this method. Includes a new test exercising topology at whole-scheduler level and a fix for an existing test that should've caught this previously.	2022-06-07 13:31:10 -04:00
Derek Strickland	b128769e19	reconciler: support disconnected clients (#12058 ) * Add merge helper for string maps * structs: add statuses, MaxClientDisconnect, and helper funcs * taintedNodes: Include disconnected nodes * upsertAllocsImpl: don't use existing ClientStatus when upserting unknown * allocSet: update filterByTainted and add delayByMaxClientDisconnect * allocReconciler: support disconnecting and reconnecting allocs * GenericScheduler: upsert unknown and queue reconnecting Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-04-05 17:10:37 -04:00
Seth Hoenig	2631659551	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
Tim Gross	2dafe46fe3	CSI: allow updates to volumes on re-registration (#12167 ) CSI `CreateVolume` RPC is idempotent given that the topology, capabilities, and parameters are unchanged. CSI volumes have many user-defined fields that are immutable once set, and many fields that are not user-settable. Update the `Register` RPC so that updating a volume via the API merges onto any existing volume without touching Nomad-controlled fields, while validating it with the same strict requirements expected for idempotent `CreateVolume` RPCs. Also, clarify that this state store method is used for everything, not just for the `Register` RPC.	2022-03-07 11:06:59 -05:00
Luiz Aoqui	b1753d0568	scheduler: detect and log unexpected scheduling collisions (#11793 )	2022-01-14 20:09:14 -05:00
Mahmood Ali	4d90afb425	gofmt all the files mostly to handle build directives in 1.17.	2021-10-01 10:14:28 -04:00
Tim Gross	417ec91317	scheduler: datacenter updates should be destructive Updates to the datacenter field should be destructive for any allocation that is on a node no longer in the list of datacenters, but inplace for any allocation on a node that is still in the list. Add a check for this change to the system and generic schedulers after we've checked the task definition for updates and obtained the node for each current allocation.	2021-07-07 11:18:30 -04:00
Michael Schurter	547a718ef6	Merge pull request #10248 from hashicorp/f-remotetask-2021 core: propagate remote task handles	2021-04-30 08:57:26 -07:00
Mahmood Ali	52d881f567	Allow configuring memory oversubscription (#10466 ) Cluster operators want to have better control over memory oversubscription and may want to enable/disable it based on their experience. This PR adds a scheduler configuration field to control memory oversubscription. It's additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler), or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config). I opted to have the memory oversubscription be an opt-in, but happy to change it. To enable it, operators should call the API with: ```json { "MemoryOversubscriptionEnabled": true } ``` If memory oversubscription is disabled, submitting jobs specifying `memory_max` will get a "Memory oversubscription is not enabled" warnings, but the jobs will be accepted without them accessing the additional memory. The warning message is like: ``` $ nomad job run /tmp/j Job Warnings: 1 warning(s): * Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored ==> Monitoring evaluation "7c444157" Evaluation triggered by job "example" ==> Monitoring evaluation "7c444157" Evaluation within deployment: "9d826f13" Allocation "aa5c3cad" created: node "9272088e", group "cache" Evaluation status changed: "pending" -> "complete" ==> Evaluation "7c444157" finished with status "complete" # then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory: $ nomad alloc status -json aa5c3cad \| jq '.AllocatedResources.Tasks["redis"].Memory' { "MemoryMB": 256, "MemoryMaxMB": 0 } ```	2021-04-29 22:09:56 -04:00
Michael Schurter	e62795798d	core: propagate remote task handles Add a new driver capability: RemoteTasks. When a task is run by a driver with RemoteTasks set, its TaskHandle will be propagated to the server in its allocation's TaskState. If the task is replaced due to a down node or draining, its TaskHandle will be propagated to its replacement allocation. This allows tasks to be scheduled in remote systems whose lifecycles are disconnected from the Nomad node's lifecycle. See https://github.com/hashicorp/nomad-driver-ecs for an example ECS remote task driver.	2021-04-27 15:07:03 -07:00
Chris Baker	436d46bd19	Merge branch 'main' into f-node-drain-api	2021-04-01 15:22:57 -05:00
Mahmood Ali	0c2551270a	oversubscription: Add MemoryMaxMB to internal structs Start tracking a new MemoryMaxMB field that represents the maximum memory a task may use in the client. This allows tasks to specify a memory reservation (to be used by scheduler when placing the task) but use excess memory used on the client if the client has any. This commit adds the server tracking for the value, and ensures that allocations AllocatedResource fields include the value.	2021-03-30 16:55:58 -04:00
Chris Baker	dd291e69f4	removed deprecated fields from Drain structs and API node drain: use msgtype on txn so that events are emitted wip: encoding extension to add Node.Drain field back to API responses new approach for hiding Node.SecretID in the API, using `json` tag documented this approach in the contributing guide refactored the JSON handlers with extensions modified event stream encoding to use the go-msgpack encoders with the extensions	2021-03-21 15:30:11 +00:00
Tim Gross	fa25e048b2	CSI: unique volume per allocation Add a `PerAlloc` field to volume requests that directs the scheduler to test feasibility for volumes with a source ID that includes the allocation index suffix (ex. `[0]`), rather than the exact source ID. Read the `PerAlloc` field when making the volume claim at the client to determine if the allocation index suffix (ex. `[0]`) should be added to the volume source ID.	2021-03-18 15:35:11 -04:00
Drew Bailey	009b8d5363	Persist shared allocated ports for inplace update (#9830 ) * Persist shared allocated ports for inplace update Ports were not copied over when performing inplace updates in the generic scheduler * changelog * drop spew	2021-01-15 12:45:12 -05:00
Kris Hicks	62972cc839	scheduler: Fix always-false sort func (#9547 ) Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>	2020-12-08 09:57:47 -08:00
Nick Ethier	d21cbeb30f	command: remove task network usage from init examples	2020-11-23 10:25:11 -06:00
Drew Bailey	6c788fdccd	Events/msgtype cleanup (#9117 ) * use msgtype in upsert node adds message type to signature for upsert node, update tests, remove placeholder method * UpsertAllocs msg type test setup * use upsertallocs with msg type in signature update test usage of delete node delete placeholder msgtype method * add msgtype to upsert evals signature, update test call sites with test setup msg type handle snapshot upsert eval outside of FSM and ignore eval event remove placeholder upsertevalsmsgtype handle job plan rpc and prevent event creation for plan msgtype cleanup upsertnodeevents updatenodedrain msgtype msg type 0 is a node registration event, so set the default to the ignore type * fix named import * fix signature ordering on upsertnode to match	2020-10-19 09:30:15 -04:00
Drew Bailey	b4c135358d	use Events to wrap index and events, store in events table	2020-10-14 12:44:39 -04:00
Drew Bailey	9d48818eb8	writetxn can return error, add alloc and job generic events. Add events table for durability	2020-10-14 12:44:39 -04:00
Drew Bailey	400455d302	Events/eval alloc events (#9012 ) * generic eval update event first pass at alloc client update events * api/event client	2020-10-14 12:44:37 -04:00
Mahmood Ali	6a0dd8bc87	Merge pull request #8867 from hashicorp/b-canary-substitution scheduler: Revert requireCanary logic	2020-09-15 12:58:55 -05:00
Mahmood Ali	98de2d2278	add a test when .NextAllocation is set but alloc is still running	2020-09-14 17:12:53 -04:00
Mahmood Ali	fd54cfce6e	Revert the `requireCanary` check introduced in https://github.com/hashicorp/nomad/pull/8691/files#diff-1801138ac4d10f2064ba6f2e434ac9b4L430-R431 . The change was intended to fix a case where a canary alloc may fail to be rescheduled if all the other allocs fail as well (e.g. if all allocs happen to be placed on a node that died). However, it introduced some unintended side-effects. Reverting the change for now and will investigate further.	2020-09-10 14:59:02 -04:00
Mahmood Ali	c6e1d22697	test for rescheduling non-canaries	2020-09-10 14:59:02 -04:00
Mahmood Ali	8a342926b7	Respect alloc job version for lost/failed allocs This change fixes a bug where lost/failed allocations are replaced by allocations with the latest versions, even if the version hasn't been promoted yet. Now, when generating a plan for lost/failed allocations, the scheduler first checks if the current deployment is in Canary stage, and if so, it ensures that any lost/failed allocations is replaced one with the latest promoted version instead.	2020-08-19 09:52:48 -04:00
Tim Gross	d3341a2019	refactor: make it clear where we're accessing dstate The field name `Deployment.TaskGroups` contains a map of `DeploymentState`, which makes it a little harder to follow state updates when combined with inconsistent naming conventions, particularly when we also have the state store or actual `TaskGroup`s in scope. This changeset changes all uses to `dstate` so as not to be confused with actual TaskGroups.	2020-07-20 11:25:53 -04:00
Nick Ethier	1e4ea699ad	fix test failures from rebase	2020-06-18 11:05:32 -07:00
Lang Martin	ac7c39d3d3	Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099 ) * client/heartbeatstop: reversed time condition for startup grace * scheduler/generic_sched: use `delayInstead` to avoid a loop Without protecting the loop that creates followUpEvals, a delayed eval is allowed to create an immediate subsequent delayed eval. For both `stop_after_client_disconnect` and the `reschedule` block, a delayed eval should always produce some immediate result (running or blocked) and then only after the outcome of that eval produce a second delayed eval. * scheduler/reconcile: lostLater are different than delayedReschedules Just slightly. `lostLater` allocs should be used to create batched evaluations, but `handleDelayedReschedules` assumes that the allocations are in the untainted set. When it creates the in-place updates to those allocations at the end, it causes the allocation to be treated as running over in the planner, which causes the initial `stop_after_client_disconnect` evaluation to be retried by the worker.	2020-06-03 09:48:38 -04:00
Mahmood Ali	d9792777d9	Open source Preemption code Nomad 0.12 OSS is to include preemption feature. This commit moves the private code for managing preemption to OSS repository.	2020-05-27 15:02:01 -04:00
Lang Martin	d3c4700cd3	server: stop after client disconnect (#7939 ) * jobspec, api: add stop_after_client_disconnect * nomad/state/state_store: error message typo * structs: alloc methods to support stop_after_client_disconnect 1. a global AllocStates to track status changes with timestamps. We need this to track the time at which the alloc became lost originally. 2. ShouldClientStop() and WaitClientStop() to actually do the math * scheduler/reconcile_util: delayByStopAfterClientDisconnect * scheduler/reconcile: use delayByStopAfterClientDisconnect * scheduler/util: updateNonTerminalAllocsToLost comments This was setup to only update allocs to lost if the DesiredStatus had already been set by the scheduler. It seems like the intention was to update the status from any non-terminal state, and not all lost allocs have been marked stop or evict by now * scheduler/testing: AssertEvalStatus just use require * scheduler/generic_sched: don't create a blocked eval if delayed * scheduler/generic_sched_test: several scheduling cases	2020-05-13 16:39:04 -04:00
Mahmood Ali	2ff2745374	test for allocated devices on job in-update update When an alloc is updated in-place, test that the allocated devices are preserved in new alloc struct.	2020-04-21 08:56:05 -04:00
Jasmine Dahilig	81d051d7e8	fix bug in lifecycle scheduler test mocks	2020-03-21 17:52:51 -04:00
Jasmine Dahilig	0cc9212a54	add test cases for scheduler alloc placement with lifecycle resources	2020-03-21 17:52:47 -04:00
Jasmine Dahilig	3e4e8f2b02	add allocfit test for lifecycles	2020-03-21 17:52:46 -04:00
Drew Bailey	9a65556211	add state store test to ensure PlacedCanaries is updated	2020-02-03 13:58:01 -05:00
Drew Bailey	ef175c0b31	Update Evicted allocations to lost when lost If an alloc is being preempted and marked as evict, but the underlying node is lost before the migration takes place, the allocation currently stays as desired evict, status running forever, or until the node comes back online. This commit updates updateNonTerminalAllocsToLost to check for a destired status of Evict as well as Stop when updating allocations on tainted nodes. switch to table test for lost node cases	2020-01-07 13:34:18 -05:00
Michael Schurter	7655e0cee4	Merge pull request #6792 from hashicorp/b-propose-panic scheduler: fix panic when preempting and evicting allocs	2019-12-03 10:40:19 -08:00
Tim Gross	c50057bf1f	scheduler: fix job update placement on prev node penalized (#6781 ) Fixes #5856 When the scheduler looks for a placement for an allocation that's replacing another allocation, it's supposed to penalize the previous node if the allocation had been rescheduled or failed. But we're currently always penalizing the node, which leads to unnecessary migrations on job update. This commit leaves in place the existing behavior where if the previous alloc was itself rescheduled, its previous nodes are also penalized. This is conservative but the right behavior especially on larger clusters where a group of hosts might be having correlated trouble (like an AZ failure). Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-12-03 06:14:49 -08:00
Michael Schurter	0374069f82	scheduler: update tests with modern error helper	2019-12-02 20:25:52 -08:00
Michael Schurter	c6bbe85f42	core: fix panic when AllocatedResources is nil Fix for #6540	2019-10-28 14:38:21 -07:00

1 2 3 4

170 Commits