---
layout: "guides"
page_title: "Blue/Green & Canary Deployments - Operating a Job"
sidebar_current: "guides-operating-a-job-updating-blue-green-deployments"
description: |-
  Nomad has built-in support for doing blue/green and canary deployments to more
  safely update existing applications and services.
---

# Blue/Green & Canary Deployments

Sometimes [rolling
upgrades](/guides/operating-a-job/update-strategies/rolling-upgrades.html) do not
offer the required flexibility for updating an application in production. Often
organizations prefer to put a "canary" build into production or utilize a
technique known as a "blue/green" deployment to ensure a safe application
rollout while minimizing downtime.

## Blue/Green Deployments

Blue/Green deployments have several other names, including Red/Black and A/B,
but the concept is generally the same. In a blue/green deployment, there are two
application versions. Only one application version is active at a time, except
during the transition phase from one version to the next. The term "active"
tends to mean "receiving traffic" or "in service".

Imagine a hypothetical API server which has five instances deployed to
production at version 1.3, and we want to safely upgrade to version 1.4. We want
to create five new instances at version 1.4, and if they are operating correctly
we want to promote them and take down the five instances running 1.3. In the
event of failure, we can quickly roll back to 1.3.

To start, we examine our job which is running in production:

```hcl
job "docs" {
  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 5
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```

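For reference, here is the same `update` stanza with comments summarizing what
each parameter controls (a paraphrase of the stanza's documented behavior, not
new options):

```hcl
update {
  # Number of allocations to update simultaneously during a rolling upgrade.
  max_parallel = 1

  # Number of canary allocations to create when the job changes. Setting this
  # equal to the group's `count` is what gives us a full blue/green deployment.
  canary = 5

  # Minimum time an allocation must be healthy before it counts as healthy
  # for the deployment.
  min_healthy_time = "30s"

  # Deadline by which an allocation must become healthy, after which it is
  # marked unhealthy.
  healthy_deadline = "10m"

  # Automatically revert to the last stable job version if the deployment
  # fails.
  auto_revert = true
}
```
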
We see that its `update` stanza sets `canary` equal to the desired count. This
is what allows us to easily model blue/green deployments. When we change the job
to run the "api-server:1.4" image, Nomad will create 5 new allocations without
touching the original "api-server:1.3" allocations. Below we can see how this
works by changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
   group "api" {
     task "api-server" {
       config {
-        image = "api-server:1.3"
+        image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad job plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (5 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
          +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad job run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 5 canaries running
the "api-server:1.4" image and ignore all the allocations running the older
image. Now if we examine the status of the job, we can see that both the blue
("api-server:1.3") and green ("api-server:1.4") sets are running.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         10       0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
6d8eec42  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

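As an aside, if the deployment ID is not at hand, the `nomad deployment`
subcommands can list and inspect deployments directly (output omitted here; the
ID below is the deployment ID from the status output above):

```text
$ nomad deployment list
$ nomad deployment status 32a080c1
```
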
Now that we have the new set in production, we can route traffic to it and
validate that the new job version is working properly. Depending on whether the
new version is functioning properly, we will want to either promote or fail the
deployment.

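One way to route traffic to just the canaries is to register them under a
distinct service tag. Depending on your Nomad version, the `service` stanza
supports `canary_tags` for this. A minimal sketch, assuming a port labeled
`http` and a tag-aware load balancer in front of the service:

```hcl
task "api-server" {
  # ...

  service {
    name = "api-server"
    port = "http"

    # Tags applied to allocations from the stable job version.
    tags = ["live"]

    # Applied to canary allocations in place of `tags`, so the load balancer
    # can send test traffic to only the new version.
    canary_tags = ["canary"]
  }
}
```
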
### Promoting the Deployment

After deploying the new image alongside the old version, we have determined it
is functioning properly and we want to transition fully to the new version.
Doing so is as simple as promoting the deployment:

```text
$ nomad deployment promote 32a080c1
==> Monitoring evaluation "61ac2be5"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "61ac2be5" finished with status "complete"
```

If we look at the job's status, we see that after promotion, Nomad stopped the
older allocations and is running only the new ones. This completes our
blue/green deployment.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 32a080c1
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6d8eec42  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
```

### Failing the Deployment

After deploying the new image alongside the old version, we have determined it
is not functioning properly and we want to roll back to the old version. Doing
so is as simple as failing the deployment:

```text
$ nomad deployment fail 32a080c1
Deployment "32a080c1-de5a-a4e7-0218-521d8344c328" failed. Auto-reverted to job version 0.

==> Monitoring evaluation "6840f512"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Allocation "0ccb732f" modified: node "36e7a123", group "api"
    Allocation "64d4f282" modified: node "36e7a123", group "api"
    Allocation "664e33c7" modified: node "36e7a123", group "api"
    Allocation "a4cb6a4b" modified: node "36e7a123", group "api"
    Allocation "fdd73bdd" modified: node "36e7a123", group "api"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6840f512" finished with status "complete"
```

If we now look at the job's status, we can see that after failing the
deployment, Nomad stopped the new allocations, is running only the old ones, and
has reverted the working copy of the job back to the original specification
running "api-server:1.3".

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 6f3f84b3
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
api         true         5        5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
27dc2a42  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
5b7d34bb  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
983b487d  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d1cbf45a  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d6b46def  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
0ccb732f  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
64d4f282  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
664e33c7  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
a4cb6a4b  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
fdd73bdd  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC

$ nomad job deployments docs
ID        Job ID  Job Version  Status      Description
6f3f84b3  docs    2            successful  Deployment completed successfully
32a080c1  docs    1            failed      Deployment marked as failed - rolling back to job version 0
c4c16494  docs    0            successful  Deployment completed successfully
```

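Here, `auto_revert` rolled the job back for us. If it were disabled, we could
achieve the same result manually with `nomad job revert` (a sketch; the version
number to revert to depends on your job's history, which `nomad job history`
will show):

```text
$ nomad job history docs
$ nomad job revert docs 0
```
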
## Canary Deployments

Canary updates are a useful way to test a new version of a job before beginning
a rolling upgrade. The `update` stanza's `canary` parameter sets the number of
canaries the job operator would like Nomad to create when the job changes. When
the job specification is updated, Nomad creates the canaries without stopping
any allocations from the previous job.

This pattern allows operators to achieve higher confidence in the new job
version because they can route traffic, examine logs, etc., to determine that
the new application is performing properly.

```hcl
job "docs" {
  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 1
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```

In the example above, the `update` stanza tells Nomad to create a single canary
when the job specification is changed. Below we can see how this works by
changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
   group "api" {
     task "api-server" {
       config {
-        image = "api-server:1.3"
+        image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad job plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (1 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
          +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad job run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 1 canary that will
run the "api-server:1.4" image and ignore all the allocations running the older
image. If we inspect the status, we see that the canary is running alongside the
older version of the job:

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         6        0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        1         1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

Now if we promote the canary, this will trigger a rolling update to replace the
remaining allocations running the older image. The rolling update will happen at
a rate of `max_parallel`, so in this case one allocation at a time:

```text
$ nomad deployment promote ed28f6c2
==> Monitoring evaluation "37033151"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "ed28f6c2"
    Allocation "f5057465" created: node "f6646949", group "api"
    Allocation "f5057465" status changed: "pending" -> "running"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "37033151" finished with status "complete"

$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 20:28:59 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       2         0

Latest Deployment
ID          = ed28f6c2
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        1         2       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
f5057465  f6646949  api         1        run      running   07/26/17 20:29:23 UTC
b1c88d20  f6646949  api         1        run      running   07/26/17 20:28:59 UTC
1140bacf  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
1958a34a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
4bda385a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
62d96f06  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
f58abbb2  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
```

Alternatively, if the canary was not performing properly, we could abandon the
change using the `nomad deployment fail` command, similar to the blue/green
example.

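Promotion and failure can also be driven programmatically through Nomad's HTTP
API, which is useful for wiring these steps into a CI/CD pipeline. A minimal
sketch, assuming a local agent on the default port and the deployment ID from
the canary example above:

```text
# Promote all task groups in the deployment:
$ curl \
    --request POST \
    --data '{"DeploymentID": "ed28f6c2", "All": true}' \
    http://localhost:4646/v1/deployment/promote/ed28f6c2

# Or mark the deployment as failed instead:
$ curl --request POST http://localhost:4646/v1/deployment/fail/ed28f6c2
```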