Commit Graph

1 Commits

Author SHA1 Message Date
Tim Gross 464026c87b
scheduler: recover from panic (#12009)
If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.

Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking a downtime.
2022-02-07 11:47:53 -05:00