fbfe4ab1bd
This fixes a bug where jobs may get "stuck" unprocessed, which disproportionately affects periodic jobs around leadership transitions. When registering a job, the job registration and the eval to process it get applied to raft as two separate transactions; if the job registration succeeds but the eval application fails, the job may remain unprocessed. Operators may detect such a failure when submitting a job update returns a 500 error code, and they can retry; periodic job failures are more likely to go unnoticed, and no further periodic invocations will be processed until an operator forces an evaluation.

This fixes the issue by ensuring that the job registration and eval application get persisted and processed atomically in the same raft log entry. It also applies the same change to ensure atomicity in job deregistration.

Backward Compatibility

We must maintain compatibility in two scenarios: mixed clusters where a leader can handle atomic updates but followers cannot, and an upgraded cluster that processes old log entries from legacy or mixed-cluster mode.

To handle these constraints, the leader continues to emit the Evaluation log entry until all servers have upgraded. Also, when processing raft logs, the servers honor evaluations found in both spots: the Eval in the job (de-)registration entry and the eval update entries. When an updated server sees mixed-mode behavior where an eval is inserted into the raft log twice, it ignores the second instance.

I made one compromise in consistency in the mixed-mode scenario: servers may disagree on the eval.CreateIndex value. The leader and updated servers will report the job registration index, while old servers will report the index of the eval update log entry. This discrepancy doesn't seem to be material; it's the eval.JobModifyIndex that matters.
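The behavior described in this commit message can be illustrated with a minimal Go sketch. It is not the code from this change: `State`, `JobRegisterEntry`, `EvalUpdateEntry`, `ApplyJobRegister`, and `ApplyEvalUpdate` are hypothetical, simplified stand-ins, and the toy in-memory maps replace the real state store. The sketch assumes an updated server stores the embedded eval while applying the registration entry and skips an eval ID it has already seen, and that an old server simply ignores the embedded eval.

```go
package main

// Eval and Job are simplified, hypothetical stand-ins for the real structs.
type Eval struct {
	ID             string
	JobID          string
	CreateIndex    uint64
	JobModifyIndex uint64
}

type Job struct {
	ID          string
	ModifyIndex uint64
}

// JobRegisterEntry models one raft log entry that carries the job and,
// when written by an updated leader, the eval that processes it.
// Eval is nil when the entry is applied the way a legacy server would.
type JobRegisterEntry struct {
	Job  *Job
	Eval *Eval
}

// EvalUpdateEntry models the separate, legacy eval log entry that the
// leader keeps emitting until all servers have upgraded.
type EvalUpdateEntry struct {
	Evals []*Eval
}

// State is a toy in-memory state store (assumption: maps keyed by ID).
type State struct {
	Jobs  map[string]*Job
	Evals map[string]*Eval
}

func NewState() *State {
	return &State{Jobs: map[string]*Job{}, Evals: map[string]*Eval{}}
}

// upsertEval stores the eval at the given raft index unless an eval with
// the same ID already exists. It reports whether the eval was inserted.
func (s *State) upsertEval(index uint64, e *Eval) bool {
	if _, ok := s.Evals[e.ID]; ok {
		return false // second instance in mixed mode: ignore it
	}
	c := *e
	c.CreateIndex = index
	s.Evals[c.ID] = &c
	return true
}

// ApplyJobRegister applies the job and, if present, its eval as one step,
// so the eval cannot be lost after the job has been stored.
func (s *State) ApplyJobRegister(index uint64, entry JobRegisterEntry) {
	j := *entry.Job
	j.ModifyIndex = index
	s.Jobs[j.ID] = &j

	if entry.Eval != nil {
		e := *entry.Eval
		e.JobModifyIndex = index
		s.upsertEval(index, &e)
	}
}

// ApplyEvalUpdate applies the legacy eval entry, skipping evals that were
// already stored by ApplyJobRegister.
func (s *State) ApplyEvalUpdate(index uint64, entry EvalUpdateEntry) {
	for _, e := range entry.Evals {
		s.upsertEval(index, e)
	}
}
```

Replaying the same two log entries through an updated and a legacy-style state in this sketch yields one eval on each, with the same `JobModifyIndex` but a different `CreateIndex`, which mirrors the consistency compromise noted above.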
||
---|---|---|
config | ||
batch_future.go | ||
batch_future_test.go | ||
bitmap.go | ||
bitmap_test.go | ||
csi.go | ||
csi_test.go | ||
devices.go | ||
devices_test.go | ||
diff.go | ||
diff_test.go | ||
errors.go | ||
errors_test.go | ||
funcs.go | ||
funcs_test.go | ||
generate.sh | ||
network.go | ||
network_test.go | ||
node.go | ||
node_class.go | ||
node_class_test.go | ||
node_test.go | ||
operator.go | ||
service_identities.go | ||
services.go | ||
services_test.go | ||
streaming_rpc.go | ||
structs.go | ||
structs_codegen.go | ||
structs_periodic_test.go | ||
structs_test.go | ||
testing.go | ||
volumes.go |