open-nomad

Commit Graph

Author	SHA1	Message	Date
Luiz Aoqui	e012d9411e	Task lifecycle restart (#14127) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpart, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. * taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be cleared cleanly using `clearDriverHandle`; otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loop starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved confusing for developers trying to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting a taskRunner into two methods: - `Restart` will retain the current behaviour and will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Seth Hoenig	88a1353149	cli: display nomad service check status output in CLI commands This PR adds some NSD check status output to the CLI. 1. The 'nomad alloc status' command produces nsd check summary output (if present) 2. The 'nomad alloc checks' sub-command is added to produce complete nsd check output (if present)	2022-08-19 09:18:29 -05:00
Tim Gross	c38c052ef3	api: document warnings for setting `api.ClientConnTimeout` (#14122) HTTP API consumers that have network line-of-sight to client nodes can connect directly for a small number of APIs. But in environments where the consumer doesn't have line-of-sight, there's a long pause waiting for the `api.ClientConnTimeout` to expire. Warn about this in the API docs so that authors can avoid the extra timeout.	2022-08-15 16:06:02 -04:00
James Rasell	2c540b03c5	api: use errors.New not fmt.Errorf when error doesn't have format. (#14027) * api: use errors.New not fmt.Errorf when error doesn't have format. * semgrep: add rule to catch fmt.Errorf use without formatting.	2022-08-05 17:05:47 +02:00
James Rasell	bb5b510c9d	cli: do not import structs, use API package only. (#13938)	2022-08-02 16:33:08 +02:00
James Rasell	d61c683b19	api: add service registration HTTP API wrapper.	2022-03-03 12:14:00 +01:00
Mahmood Ali	2ebbffad12	exec: api: handle closing errors differently refactor the api handling of `nomad exec`, and ensure that we process all received events before handling websocket closing. The exit code should be the last message received, and we ought to ignore any websocket close error we receive afterwards. Previously, we used two channels: one for websocket frames and another for handling errors. This raised the possibility that we processed the error before processing the frames, resulting in an "unexpected EOF" error.	2021-05-25 11:19:42 -04:00
Luiz Aoqui	f1b9055d21	Add metrics for blocked eval resources (#10454) * add metrics for blocked eval resources * docs: add new blocked_evals metrics * fix to call `pruneStats` instead of `stats.prune` directly	2021-04-29 15:03:45 -04:00
Mahmood Ali	18b581656d	oversubscription: adds CLI and API support This commit updates the API to pass the MemoryMaxMB field, and the CLI to show the max set for the task. Also, start parsing the MemoryMaxMB in hcl2, as it's set by tags. A sample CLI output; note the additional `Max: ` for "task": ``` $ nomad alloc status 96fbeb0b ID = 96fbeb0b-a0b3-aa95-62bf-b8a39492fd5c [...] Task "cgroup-fetcher" is "running" Task Resources CPU Memory Disk Addresses 0/500 MHz 32 MiB/20 MiB 300 MiB Task Events: [...] Task "task" is "running" Task Resources CPU Memory Disk Addresses 0/500 MHz 176 KiB/20 MiB 300 MiB Max: 30 MiB Task Events: [...] ```	2021-03-30 16:55:58 -04:00
James Rasell	8dc2a9c6e1	api: add Allocation client and server terminal status funcs.	2021-03-25 08:52:59 +01:00
Michael Dwan	29b05929e8	Add devices to AllocatedTaskResources	2021-02-22 12:47:36 -07:00
Michael Schurter	8ccbd92cb6	api: add field filters to /v1/{allocations,nodes} Fixes #9017 The ?resources=true query parameter includes resources in the object stub listings. Specifically: - For `/v1/nodes?resources=true` both the `NodeResources` and `ReservedResources` fields are included. - For `/v1/allocations?resources=true` the `AllocatedResources` field is included. The ?task_states=false query parameter removes TaskStates from /v1/allocations responses. (By default TaskStates are included.)	2020-10-14 10:35:22 -07:00
Mahmood Ali	f5700611c0	api: target servers for allocation requests (#8897) Allocation requests should target servers, which then can forward the request to the appropriate clients. Contacting clients directly is fragile and prone to failures: e.g. clients may be firewalled and not accessible from the API client, or have some internal certificates not trusted by the API client. FWIW, in contexts where we anticipate lots of traffic (e.g. logs, or exec), the api package attempts contacting the client directly but then falls back to using the server. This approach seems excessive in these simple GET/PUT requests. Fixes #8894	2020-09-16 09:34:17 -04:00
Nick Ethier	89118016fc	command: correctly show host IP in ports output /w multi-host networks (#8289)	2020-06-25 15:16:01 -04:00
Lang Martin	80619137ab	csi: volumes listed in `nomad node status` (#7318) * api/allocations: GetTaskGroup finds the taskgroup struct * command/node_status: display CSI volume names * nomad/state/state_store: new CSIVolumesByNodeID * nomad/state/iterator: new SliceIterator type implements memdb.ResultIterator * nomad/csi_endpoint: deal with a slice of volumes * nomad/state/state_store: CSIVolumesByNodeID return a SliceIterator * nomad/structs/csi: CSIVolumeListRequest takes a NodeID * nomad/csi_endpoint: use the return iterator * command/agent/csi_endpoint: parse query params for CSIVolumes.List * api/nodes: new CSIVolumes to list volumes by node * command/node_status: use the new list endpoint to print volumes * nomad/state/state_store: error messages consider the operator * command/node_status: include the Provider	2020-03-23 13:58:30 -04:00
Lang Martin	887e1f28c9	csi: CLI for volume status, registration/deregistration and plugin status (#7193) * command/csi: csi, csi_plugin, csi_volume * helper/funcs: move ExtraKeys from parse_config to UnusedKeys * command/agent/config_parse: use helper.UnusedKeys * api/csi: annotate CSIVolumes with hcl fields * command/csi_plugin: add Synopsis * command/csi_volume_register: use hcl.Decode style parsing * command/csi_volume_list * command/csi_volume_status: list format, cleanup * command/csi_plugin_list * command/csi_plugin_status * command/csi_volume_deregister * command/csi_volume: add Synopsis * api/contexts/contexts: add csi search contexts to the constants * command/commands: register csi commands * api/csi: fix struct tag for linter * command/csi_plugin_list: unused struct vars * command/csi_plugin_status: unused struct vars * command/csi_volume_list: unused struct vars * api/csi: add allocs to CSIPlugin * command/csi_plugin_status: format the allocs * api/allocations: copy Allocation.Stub in from structs * nomad/client_rpc: add some error context with Errorf * api/csi: collapse read & write alloc maps to a stub list * command/csi_volume_status: cleanup allocation display * command/csi_volume_list: use Schedulable instead of Healthy * command/csi_volume_status: use Schedulable instead of Healthy * command/csi_volume_list: sprintf string * command/csi: delete csi.go, csi_plugin.go * command/plugin: refactor csi components to sub-command plugin status * command/plugin: remove csi * command/plugin_status: remove csi * command/volume: remove csi * command/volume_status: split out csi specific * helper/funcs: add RemoveEqualFold * command/agent/config_parse: use helper.RemoveEqualFold * api/csi: do ,unusedKeys right * command/volume: refactor csi components to `nomad volume` * command/volume_register: split out csi specific * command/commands: use the new top level commands * command/volume_deregister: hardwired type csi for now * command/volume_status: csiFormatVolumes rescued from volume_list * command/plugin_status: avoid a panic on no args * command/volume_status: avoid a panic on no args * command/plugin_status: predictVolumeType * command/volume_status: predictVolumeType * nomad/csi_endpoint_test: move CreateTestPlugin to testing * command/plugin_status_test: use CreateTestCSIPlugin * nomad/structs/structs: add CSIPlugins and CSIVolumes search consts * nomad/state/state_store: add CSIPlugins and CSIVolumesByIDPrefix * nomad/search_endpoint: add CSIPlugins and CSIVolumes * command/plugin_status: move the header to the csi specific * command/volume_status: move the header to the csi specific * nomad/state/state_store: CSIPluginByID prefix * command/status: rename the search context to just Plugins/Volumes * command/plugin,volume_status: test return ids now * command/status: rename the search context to just Plugins/Volumes * command/plugin_status: support -json and -t * command/volume_status: support -json and -t * command/plugin_status_csi: comments * command/_status: clean up text api/csi: fix stale comments * command/volume: make deregister sound less fearsome * command/plugin_status: set the id length * command/plugin_status_csi: more compact plugin health * command/volume: better error message, comment	2020-03-23 13:58:30 -04:00
Mahmood Ali	37e0598344	api: alloc exec recovers from bad client connection If alloc exec fails to connect to the nomad client associated with the alloc, fail over to using a server. The code attempted to special case `net.Error` for failover to rule out other permanent non-networking errors, by reusing a pattern in the logging handling. But this pattern does not apply here. `net/http.Http` wraps all errors as `*url.Error` that is net.Error. The websocket doesn't, and instead returns the raw error. If the raw error isn't a `net.Error`, like in the case of TLS handshake errors, the api package would fail immediately rather than fail over.	2020-03-04 17:43:00 -05:00
Mahmood Ali	b77fd8654b	cli: recover from client ACL lookup failures This fixes a bug in the CLI handling of node lookup failures when querying allocation and FS endpoints. Allocation and FS endpoints are handled by the client; one can query the relevant client directly, or query a server to have it forwarded transparently to the relevant client. Querying the client directly is beneficial to avoid loading servers with IO. As an optimization, the CLI attempts to query the client directly, but then falls back to using the server forwarding path if it encounters network or connection errors (e.g. clients are locked down or in a separate inaccessible network). Here, we fix a bug where if the CLI fails to look up the client details because it lacks ACL capability or for other unexpected reasons, the CLI will not go through the fallback path.	2019-10-04 11:23:59 -04:00
Michael Schurter	d220e630c0	api: add missing Networks field to alloc resources	2019-07-31 01:04:06 -04:00
Chris Baker	83ee50d5ab	api: removed unused AllocID from AllocSignalRequest	2019-06-21 21:44:38 +00:00
Mahmood Ali	09931bcdce	add api support for nomad exec Adds nomad exec support in our API, by hitting the websocket endpoint. We introduce API structs that correspond to the drivers streaming exec structs. For creating the websocket connection, we reuse the transport setting from api http client.	2019-05-09 16:49:08 -04:00
Mahmood Ali	f920efb962	divest /api from nomad/structs The API package needs to be independent from rest of nomad packages, to avoid leaking internal packages and dependencies (e.g. raft, ugorji, etc)	2019-04-28 13:32:26 -04:00
Danielle Lancashire	3409e0be89	allocs: Add nomad alloc signal command This command will be used to send a signal to either a single task within an allocation, or all of the tasks if <task-name> is omitted. If the sent signal terminates the allocation, it will be treated as if the allocation has crashed, rather than as if it was operator-terminated. Signal validation is currently handled by the driver itself and nomad does not attempt to restrict or validate them.	2019-04-25 12:43:32 +02:00
Danielle	198a838b61	Merge pull request #5512 from hashicorp/dani/f-alloc-stop alloc-lifecycle: nomad alloc stop	2019-04-23 13:05:08 +02:00
Danielle Lancashire	832f607433	allocs: Add nomad alloc stop This adds a `nomad alloc stop` command that can be used to stop and force migrate an allocation to a different node. This is built on top of the AllocUpdateDesiredTransitionRequest and explicitly limits the scope of access to that transition to expose it under the alloc-lifecycle ACL. The API returns the follow up eval that can be used as part of monitoring in the CLI or parsed and used in an external tool.	2019-04-23 12:50:23 +02:00
Preetha Appan	22109d1e20	Add preemption related fields to AllocationListStub	2019-04-18 10:36:44 -05:00
Danielle Lancashire	e135876493	allocs: Add nomad alloc restart This adds a `nomad alloc restart` command and api that allows a job operator with the alloc-lifecycle acl to perform an in-place restart of a Nomad allocation, or a given subtask.	2019-04-11 14:25:49 +02:00
James Rasell	9470507cf4	Add NodeName to the alloc/job status outputs. Currently when operators need to log onto a machine where an alloc is running they will need to perform both an alloc/job status call and then a call to discover the node name from the node list. This updates both the job status and alloc status output to include the node name within the information to make operator use easier. Closes #2359 Closes #1180	2019-04-10 10:34:10 -05:00
Mahmood Ali	7bdd43f3e0	api: avoid codegen for syncing Given that the values will rarely change, especially considering that any change would be backward incompatible, it's simpler to keep syncing manually on the rare occasion and avoid the syncing code overhead.	2019-01-18 18:52:31 -05:00
Preetha Appan	5f0a9d2cfd	Show preemption output in plan CLI	2018-11-08 09:48:43 -06:00
Preetha Appan	5b3bfb63eb	structs and API changes to plan and alloc structs needed for preemption	2018-10-30 11:06:32 -05:00
Alex Dadgar	a78cefec18	use int64	2018-10-16 15:34:32 -07:00
Preetha Appan	7c0d8c646c	Change CPU/Disk/MemoryMB to int everywhere in new resource structs	2018-10-16 16:21:42 -05:00
Alex Dadgar	bac5cb1e8b	Scheduler uses allocated resources	2018-10-02 17:08:25 -07:00
Preetha Appan	751c0eb5a5	code review feedback	2018-09-04 16:10:11 -05:00
Preetha Appan	9bc0962527	Track top k nodes by norm score rather than top k nodes per scorer	2018-09-04 16:10:11 -05:00
Preetha Appan	6ed527c636	Use heap to store top K scoring nodes. Scoring metadata is now aggregated by scorer type to make it easier to parse when reading it in the CLI.	2018-09-04 16:10:11 -05:00
Alex Dadgar	f95ab4ade8	Mark canaries on creation, and unmark on promotion	2018-05-07 14:50:01 -05:00
Alex Dadgar	8a81038cdb	Set Reschedule from deployment watcher	2018-05-07 14:50:01 -05:00
Preetha Appan	274bed1892	Add RescheduleTracker to allocs list stub struct	2018-05-01 14:53:47 -05:00
Michael Schurter	2832853bfa	Add DesiredTransition.ShouldMigrate to api pkg	2018-03-21 16:51:45 -07:00
Michael Schurter	d1ec65d765	switch to new raft DesiredTransition message	2018-03-21 16:49:48 -07:00
Alex Dadgar	db4a634072	RPC, FSM, State Store for marking DesiredTransition fix build tag	2018-03-21 16:49:48 -07:00
Preetha Appan	342c3fb961	Added FollowupEvalID field and helper methods to calculate reschedule eligibility based on delay	2018-03-14 16:10:32 -05:00
Alex Dadgar	aa98f8ba7b	Enhance API pkg to utilize Server's Client Tunnel This PR enhances the API package by having client only RPCs route through the server when they are low cost, and by having filesystem access first attempt a direct connection to the node and then fall back to a server routed request.	2018-02-15 13:59:03 -08:00
Preetha Appan	9d15e0c05b	Code review feedback	2018-01-31 09:58:05 -06:00
Preetha Appan	5714a6b8bf	Add method on API alloc to calculate attempted and remaining reschedule events	2018-01-31 09:58:05 -06:00
Preetha Appan	e09ea8c0b0	Address code review comments	2018-01-31 09:58:05 -06:00
Preetha Appan	0c56a12a77	Add RescheduleTracker to allocations API struct	2018-01-31 09:56:53 -06:00
Preetha Appan	fd2fbefa4c	Add a field to track the next allocation during a replacement	2018-01-24 17:55:05 -06:00
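
Commit 6ed527c636 above mentions using a heap to store the top K scoring nodes. The following is a minimal sketch of that general technique using Go's `container/heap`, not the code from that commit. The names `scoredNode`, `minHeap`, and `topK` are hypothetical and chosen only for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// scoredNode is a hypothetical stand-in for a node and its score.
type scoredNode struct {
	ID    string
	Score float64
}

// minHeap keeps the lowest-scoring retained node at the root,
// so it is the first candidate to be evicted.
type minHeap []scoredNode

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(scoredNode)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// topK returns up to k highest-scoring nodes, best first.
func topK(nodes []scoredNode, k int) []scoredNode {
	if k <= 0 {
		return nil
	}
	h := &minHeap{}
	for _, n := range nodes {
		if h.Len() < k {
			heap.Push(h, n)
		} else if n.Score > (*h)[0].Score {
			// Replace the current minimum and restore heap order.
			(*h)[0] = n
			heap.Fix(h, 0)
		}
	}
	out := make([]scoredNode, h.Len())
	copy(out, *h)
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	nodes := []scoredNode{{"a", 0.2}, {"b", 0.9}, {"c", 0.5}, {"d", 0.7}}
	for _, n := range topK(nodes, 2) {
		fmt.Printf("%s %.1f\n", n.ID, n.Score)
	}
}
```

A bounded min-heap holds at most K entries, so memory stays proportional to K rather than to the number of nodes scored.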


84 Commits