scheduler: recover from panic (#12009)

If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.

Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking downtime.
Tim Gross 2022-02-07 11:47:53 -05:00 committed by GitHub
parent 7a63a249ca
commit 464026c87b
3 changed files with 18 additions and 2 deletions

.changelog/12009.txt Normal file

@@ -0,0 +1,3 @@
+```release-note:improvement
+scheduler: recover scheduler goroutines on panic
+```


@@ -125,7 +125,14 @@ func NewBatchScheduler(logger log.Logger, eventsCh chan<- interface{}, state Sta
 }
 // Process is used to handle a single evaluation
-func (s *GenericScheduler) Process(eval *structs.Evaluation) error {
+func (s *GenericScheduler) Process(eval *structs.Evaluation) (err error) {
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("processing eval %q panicked scheduler - please report this as a bug! - %v", eval.ID, r)
+		}
+	}()
 	// Store the evaluation
 	s.eval = eval


@@ -72,7 +72,13 @@ func NewSysBatchScheduler(logger log.Logger, eventsCh chan<- interface{}, state
 }
 // Process is used to handle a single evaluation.
-func (s *SystemScheduler) Process(eval *structs.Evaluation) error {
+func (s *SystemScheduler) Process(eval *structs.Evaluation) (err error) {
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("processing eval %q panicked scheduler - please report this as a bug! - %v", eval.ID, r)
+		}
+	}()
 	// Store the evaluation
 	s.eval = eval
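
The signature change from `error` to `(err error)` in both hunks is what lets the deferred function report the panic: a deferred closure can assign to a named result after `recover()` returns. Below is a minimal standalone sketch of that pattern, not the scheduler code itself. The `Evaluation` type, the `process` function, and the `work` callback are hypothetical stand-ins introduced only for illustration.

```go
package main

import "fmt"

// Evaluation is a hypothetical stand-in for structs.Evaluation;
// only the ID field is needed for this sketch.
type Evaluation struct {
	ID string
}

// process mirrors the shape of the patched Process methods. The result
// is named (err) so the deferred function can overwrite it after a
// panic has been recovered. work is a hypothetical callback standing in
// for the scheduler's real evaluation logic.
func process(eval *Evaluation, work func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Convert the panic value into an ordinary error so the
			// caller can handle it instead of the process crashing.
			err = fmt.Errorf("processing eval %q panicked scheduler: %v", eval.ID, r)
		}
	}()
	work()
	return nil
}

func main() {
	// Normal path: no panic, so err stays nil.
	fmt.Println(process(&Evaluation{ID: "eval-1"}, func() {}))

	// Panic path: the panic is recovered and surfaced as an error.
	fmt.Println(process(&Evaluation{ID: "eval-2"}, func() { panic("boom") }))
}
```

With an unnamed `error` result, the deferred function would have no variable to assign to, and the recovered call would return the zero value (`nil`), hiding the failure from the caller.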