Mahmood Ali e436d2701a Handle Nomad leadership flapping
Fixes a deadlock in leadership handling if leadership flapped.

Raft propagates leadership transitions to Nomad through the NotifyCh channel.
Raft blocks when writing to this channel, so the channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[2] and does not move on to executing
follower-related logic (in `raft.runFollower`).
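
As an illustration of "aggressively consumed", here is a minimal sketch of a
forwarder that always drains the notification channel and keeps only the most
recent leadership state. The helper name `latestOnly` and the plain `chan bool`
are assumptions made for this example; this is not necessarily how the fix in
this commit is implemented.

```go
package main

import "fmt"

// latestOnly is a hypothetical helper: it drains src as fast as values
// arrive and exposes only the most recent value on the returned channel,
// so the sender never waits on a slow consumer.
func latestOnly(src <-chan bool) <-chan bool {
	dst := make(chan bool, 1)
	go func() {
		defer close(dst)
		for v := range src {
			// Drop a stale value that nobody has read yet.
			select {
			case <-dst:
			default:
			}
			// dst is empty here and this goroutine is the only sender,
			// so this send cannot block.
			dst <- v
		}
	}()
	return dst
}

func main() {
	// Unbuffered on purpose: the worst case for a blocking sender.
	notify := make(chan bool)
	latest := latestOnly(notify)

	// Simulate quick leadership flapping while nothing reads `latest`.
	for _, isLeader := range []bool{true, false, true, false} {
		notify <- isLeader
	}
	close(notify)

	// Stale states may be skipped; the last value seen is the final state.
	last := true
	for v := range latest {
		last = v
	}
	fmt.Println("final leadership state:", last)
}
```

The sender is only ever blocked for as long as the forwarder takes to swap one
value, regardless of how slowly the leadership handler reacts.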

While the Raft `runLeader` defer function is blocked, Raft cannot process any
other Raft operations.  For example, the `run{Leader|Follower}` methods consume
`raft.applyCh`, so while the `runLeader` defer is blocked, all Raft log
applications and config lookups block indefinitely.

Sadly, `leaderLoop` and `establishLeader` make a few Raft calls!
`establishLeader` attempts to auto-create the autopilot/scheduler config [3], and
`leaderLoop` attempts to check the Raft configuration [4].  All of these calls
occur without a timeout.

Thus, if leadership flaps quickly while `leaderLoop/establishLeadership` is
running and hits any of these Raft calls, the Raft handler _deadlocks_ forever.

Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to end up in the following state:

* Agent metrics/stats HTTP and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and the server attempts to handle RPC calls
  (e.g. node/alloc updates) and these hang as well

As we create a goroutine per RPC call, the number of goroutines grows over time
and may trigger out-of-memory errors in addition to missed updates.

[1] d90d6d6bda/config.go (L190-L193)
[2] d90d6d6bda/raft.go (L425-L436)
[3] 2a89e47746/nomad/leader.go (L198-L202)
[4] 2a89e47746/nomad/leader.go (L877)
2020-01-22 13:08:34 -05:00
.circleci
.github
.netlify
acl
api
client
command
contributing
demo
dev
devices/gpu/nvidia
dist
drivers
e2e
helper
integrations
internal/testing/apitests
jobspec
lib
nomad
plugins
scheduler
scripts
terraform
testutil
ui
vendor
version
website
.gitattributes
.gitignore
.golangci.yml
CHANGELOG.md
GNUmakefile
LICENSE
README.md
Vagrantfile
appveyor.yml
build_linux_arm.go
main.go
main_test.go

README.md

Nomad

Overview

Nomad is an easy-to-use, flexible, and performant workload orchestrator.

Nomad enables developers to use declarative infrastructure-as-code for deploying their applications (jobs). Nomad uses bin packing to efficiently schedule jobs and optimize for resource utilization. Nomad is supported on macOS, Windows, and Linux.

Nomad is widely adopted and used in production by PagerDuty, Target, Citadel, Trivago, SAP, Pandora, Roblox, eBay, Deluxe Entertainment, and more.

  • Deploy Containers and Legacy Applications: Nomad's flexibility as an orchestrator enables an organization to run containers, legacy, and batch applications together on the same infrastructure. Through pluggable task drivers, Nomad brings core orchestration benefits to legacy applications without needing to containerize them.

  • Simple & Reliable: Nomad runs as a single 75MB binary and is entirely self-contained, combining resource management and scheduling into a single system. Nomad does not require any external services for storage or coordination. Nomad automatically handles application, node, and driver failures. Nomad is distributed and resilient, using leader election and state replication to provide high availability in the event of failures.

  • Device Plugins & GPU Support: Nomad offers built-in support for GPU workloads such as machine learning (ML) and artificial intelligence (AI). Nomad uses device plugins to automatically detect and utilize resources from hardware devices such as GPUs, FPGAs, and TPUs.

  • Federation for Multi-Region, Multi-Cloud: Nomad was designed to support infrastructure at a global scale. Nomad supports federation out-of-the-box and can deploy jobs across multiple regions and clouds.

  • Proven Scalability: Nomad is optimistically concurrent, which increases throughput and reduces latency for workloads. Nomad has been proven to scale to clusters of 10K+ nodes in real-world production environments.

  • HashiCorp Ecosystem: Nomad integrates seamlessly with Terraform, Consul, and Vault for provisioning, service discovery, and secrets management.

Getting Started

Get started with Nomad quickly in a sandbox environment on the public cloud or on your computer.

These methods are not meant for production.

Documentation & Guides

Documentation is available on the Nomad website here.

Resources

Who Uses Nomad

...and more!

Contributing to Nomad

If you wish to contribute to Nomad, you will need Go installed on your machine (version 1.12.13+ is required, and gcc-go is not supported).

See the contributing directory for more developer documentation.

Developing with Vagrant

There is an included Vagrantfile that can help bootstrap the process. The created virtual machine is based on Ubuntu 16 and installs several of the base libraries that can be used by Nomad.

To use this virtual machine, check out Nomad and run vagrant up from the root of the repository:

$ git clone https://github.com/hashicorp/nomad.git
$ cd nomad
$ vagrant up

The virtual machine will launch, and a provisioning script will install the needed dependencies.

Developing locally

For local development, first make sure Go is properly installed, including setting up a GOPATH. After setting up Go, clone this repository into $GOPATH/src/github.com/hashicorp/nomad. Then you can download the required build tools, such as vet, cover, godep, etc., by bootstrapping your environment.

$ make bootstrap
...

Nomad creates many file handles for communicating with tasks, log handlers, etc. In some development environments, particularly macOS, the default number of file descriptors is too small to run Nomad's test suite. You should set ulimit -n 1024 or higher in your shell. This setting is scoped to your current shell and doesn't affect other running shells or future shells.

Afterwards type make test. This will run the tests. If this exits with exit status 0, then everything is working!

$ make test
...

To compile a development version of Nomad, run make dev. This will put the Nomad binary in the bin and $GOPATH/bin folders:

$ make dev

Optionally run Consul to enable service discovery and health checks:

$ sudo consul agent -dev

And finally start the nomad agent:

$ sudo bin/nomad agent -dev

If the Nomad UI is desired in the development version, run make dev-ui. This will build the UI from source and compile it into the dev binary.

$ make dev-ui
...
$ bin/nomad
...

To compile protobuf files, protoc must be installed. See
https://github.com/google/protobuf for more information.

Note: Building the Nomad UI from source requires Node, Yarn, and Ember CLI. These tools are already in the Vagrant VM. Read the UI README for more info.

To cross-compile Nomad, run make prerelease and make release. This will generate all the static assets, compile Nomad for multiple platforms and place the resulting binaries into the ./pkg directory:

$ make prerelease
$ make release
...
$ ls ./pkg
...