--- layout: "docs" page_title: "Rolling Upgrades - Operating a Job" sidebar_current: "docs-operating-a-job-updating-rolling-upgrades" description: |- In order to update a service while reducing downtime, Nomad provides a built-in mechanism for rolling upgrades. Rolling upgrades incrementally transistions jobs between versions and using health check information to reduce downtime. --- # Rolling Upgrades Nomad supports rolling updates as a first class feature. To enable rolling updates a job or task group is annotated with a high-level description of the update strategy using the [`update` stanza][update]. Under the hood, Nomad handles limiting parallelism, interfacing with Consul to determine service health and even automatically reverting to an older, healthy job when a deployment fails. ## Enabling Rolling Updates Rolling updates are enabled by adding the [`update` stanza][update] to the job specification. The `update` stanza may be placed at the job level or in an individual task group. When placed at the job level, the update strategy is inherited by all task groups in the job. When placed at both the job and group level, the `update` stanzas are merged, with group stanzas taking precedence over job level stanzas. See the [`update` stanza documentation](/docs/job-specification/update.html#upgrade-stanza-inheritance) for an example. ```hcl job "geo-api-server" { # ... group "api-server" { count = 6 # Add an update stanza to enable rolling updates of the service update { max_parallel = 2 min_healthy_time = "30s" healthy_deadline = "10m" } task "server" { driver = "docker" config { image = "geo-api-server:0.1" } # ... } } } ``` In this example, by adding the simple `update` stanza to the "api-server" task group, we inform Nomad that updates to the group should be handled with a rolling update strategy. Thus when a change is made to the job file that requires new allocations to be made, Nomad will deploy 2 allocations at a time and require that the allocations be running in a healthy state for 30 seconds before deploying more versions of the new group. By default Nomad determines allocation health by ensuring that all tasks in the group are running and that any [service check](/docs/job-specification/service.html#check-parameters) the tasks register are passing. ## Planning Changes Suppose we make a change to a file to upgrade the version of a Docker container that is configured with the same rolling update strategy from above. ```diff @@ -2,6 +2,8 @@ job "geo-api-server" { group "api-server" { task "server" { driver = "docker" config { - image = "geo-api-server:0.1" + image = "geo-api-server:0.2" ``` The [`nomad plan` command](/docs/commands/plan.html) allows us to visualize the series of steps the scheduler would perform. We can analyze this output to confirm it is correct: ```text $ nomad plan geo-api-server.nomad ``` Here is some sample output: ```text +/- Job: "geo-api-server" +/- Task Group: "api-server" (2 create/destroy update, 4 ignore) +/- Task: "server" (forces create/destroy update) +/- Config { +/- image: "geo-api-server:0.1" => "geo-api-server:0.2" } Scheduler dry-run: - All tasks successfully allocated. Job Modify Index: 7 To submit the job with version verification run: nomad run -check-index 7 my-web.nomad When running the job with the check-index flag, the job will only be run if the server side version matches the job modify index returned. If the index has changed, another user has modified the job and the plan's results are potentially invalid. 
## Inspecting a Deployment

After running the plan, we can submit the updated job by simply running
`nomad run`. Once run, Nomad will begin the rolling upgrade of our service by
placing 2 allocations at a time of the new job and taking 2 of the old
allocations down.

We can inspect the current state of a rolling deployment using `nomad status`:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       4         0

Latest Deployment
ID          = c5b34665
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        4       2        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Here we can see that Nomad has created a deployment to conduct the rolling
upgrade from job version 0 to 1, placing 4 instances of the new job version
and stopping 4 of the old instances. Looking at the deployed allocations, we
can also see that Nomad has placed 4 instances of job version 1 but only
considers 2 of them healthy. This is because the 2 most recently placed
allocations haven't been healthy for the required 30 seconds yet.
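The deployment can also be inspected directly by its ID with the
`nomad deployment status` command, using the deployment ID reported under
"Latest Deployment" above; this prints the status, description, and
deployed-allocation summary scoped to that single deployment:

```text
$ nomad deployment status c5b34665
```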
If we wait for the deployment to complete and re-issue the command, we get the
following:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       6         0

Latest Deployment
ID          = c5b34665
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        6       6        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
d42a1656  f7b1ee08  api-server  1        run      running   07/26/17 18:10:10 UTC
401daaf9  f7b1ee08  api-server  1        run      running   07/26/17 18:10:00 UTC
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Nomad has successfully transitioned the group to running the updated job and
did so with no downtime to our service, by ensuring only two allocations were
changed at a time and that the newly placed allocations ran successfully. Had
any of the newly placed allocations failed their health check, Nomad would
have aborted the deployment and stopped placing new allocations. If
configured, Nomad can automatically revert to the old job definition when the
deployment fails.

## Auto Reverting on Failed Deployments

If we do a deployment in which the new allocations are unhealthy, Nomad fails
the deployment and stops placing new instances of the job. It optionally
supports automatically reverting to the last stable job version on deployment
failure. Nomad keeps a history of submitted jobs and whether each job version
was stable. A job version is considered stable if all its allocations are
healthy.

To enable this, we simply add the `auto_revert` parameter to the `update`
stanza:

```hcl
update {
  max_parallel     = 2
  min_healthy_time = "30s"
  healthy_deadline = "10m"

  # Enable automatically reverting to the last stable job on a failed
  # deployment.
  auto_revert = true
}
```
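The revert that `auto_revert` performs on failure can also be triggered by
hand with the `nomad job revert` command, which resubmits an old version of
the job as a new version. A minimal example, assuming we wanted to return the
job to version 1 ourselves:

```text
$ nomad job revert geo-api-server 1
```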
Now imagine we want to update our image to "geo-api-server:0.3" but instead
mistakenly submit the following and run the job:

```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"

       config {
-        image = "geo-api-server:0.2"
+        image = "geo-api-server:0.33"
```

If we run `nomad job deployments`, we can see that the deployment fails and
Nomad auto-reverts to the last stable job:

```text
$ nomad job deployments geo-api-server
ID        Job ID          Job Version  Status      Description
0c6f87a5  geo-api-server  3            successful  Deployment completed successfully
b1712b7f  geo-api-server  2            failed      Failed due to unhealthy allocations - rolling back to job version 1
3eee83ce  geo-api-server  1            successful  Deployment completed successfully
72813fcf  geo-api-server  0            successful  Deployment completed successfully
```

Nomad job versions increment monotonically, so even though Nomad reverted to
the job specification at version 1, it created a new job version. We can see
the differences between a job's versions and how Nomad auto-reverted the job
using the `job history` command:

```text
$ nomad job history -p geo-api-server
Version     = 3
Stable      = true
Submit Date = 07/26/17 18:46:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.33" => "geo-api-server:0.2"
        }

Version     = 2
Stable      = false
Submit Date = 07/26/17 18:45:21 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.2" => "geo-api-server:0.33"
        }

Version     = 1
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
        }

Version     = 0
Stable      = true
Submit Date = 07/26/17 18:43:43 UTC
```

We can see that Nomad considered the job versions running "geo-api-server:0.1"
and "geo-api-server:0.2" stable, but job version 2, which submitted the
incorrect image, is marked as unstable because its placed allocations failed
to start. Nomad detected that the deployment failed and, as such, created job
version 3, which reverted to the last healthy job.

[update]: /docs/job-specification/update.html "Nomad update Stanza"
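Finally, individual versions can be examined on their own. The `-version` flag
limits the `job history` output to a single entry; version 3 below refers to
the reverted version created in the walkthrough above:

```text
$ nomad job history -version=3 geo-api-server
```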