There are two major features in Consul 1.4.0 that may impact upgrades: a [new ACL system](#acl-upgrade) and [multi-datacenter support for Connect](#connect-multi-datacenter) in the Enterprise version.
### ACL Upgrade
Consul 1.4.0 includes a [new ACL system](/docs/guides/acl.html) that is
designed to have a smooth upgrade path but requires care to upgrade components
in the right order.
**Note:** As with most major version upgrades, you cannot downgrade once the
upgrade to 1.4.0 is complete as it adds new state to the raft store. As always
it is _strongly_ recommended that you test the upgrade first outside of
production and ensure you take backup snapshots of all datacenters before
upgrading.
#### Primary Datacenter
The "ACL datacenter" in 1.3.x and earlier is now referred to as the "Primary
datacenter". All configuration is backwards compatible and shouldn't need to
change prior to upgrade although it's strongly recommended to migrate ACL
configuration to the new syntax soon after upgrade. This includes moving to
`primary_datacenter` rather than `acl_datacenter` and `acl_*` to the new [ACL
block](/docs/agent/options.html#acl).
Datacenters can be upgraded in any order although secondaries will remain in
[Legacy ACL mode](#legacy-acl-mode) until the primary datacenter is fully
ugraded.
Each datacenter should follow the [standard rolling upgrade
When a 1.4.0 server first starts, it runs in "Legacy ACL mode". In this mode,
bootstrap requests and new ACL APIs will not be functional yet and will return
an error. The server advertises it's ability to support 1.4.0 ACLs via gossip
and waits.
In the primary datacenter, the servers all wait in legacy ACL mode until they
see every server in the primary datacenter advertise 1.4.0 ACL support. Once
this happens, the leader will complete the transition out of "legacy ACL mode"
and write this into the state so future restarts don't need to go through the
same transition.
In a secondary datacenter, the same process happens except that servers
_additionally_ wait for all servers in the primary datacenter making it safe to
upgrade datacenters in any order.
It should be noted that even if you are not upgrading, starting a brand new
1.4.0 cluster will transition through legacy ACL mode so you may be unable to
bootstrap ACLs until all the expected servers are up and healthy.
#### Legacy Token Accessor Migration
As soon as all servers in the primary datacenter have been upgraded to 1.4.0,
the leader will begin the process of creating new accessor IDs for all existing
ACL tokens.
This process completes in the background and is rate limited to ensure it
doesn't overload the leader. It completes upgrades in batches of 128 tokens and
will not upgrade more than one batch per second so on a cluster with 10,000
tokens, this may take several minutes.
While this is happening both old and new ACLs will work correctly with the
caveat that new ACL [Token APIs](/api/acl/tokens.html) may not return an
accessor ID for legacy tokens that are not yet migrated.
#### Migrating Existing ACLs
New ACL policies have slightly different syntax designed to fix some
shortcomings in old ACL syntax. During and after the upgrade process, any old
ACL tokens will continue to work and grant exactly the same level of access.
After upgrade, it is still possible to create "legacy" tokens using the existing
API so existing integrations that create tokens (e.g. Vault) will continue to
work. The "legacy" tokens generated though will not be able to take advantage of
new policy features. It's recommended that you complete migration of all tokens
as soon as possible after upgrade, as well as updating any integrations to work
with the the new ACL [Token](/api/acl/tokens.html) and
[Policy](/api/acl/policies.html) APIs.
More complete details on how to upgrade "legacy" tokens is available [here](/docs/guides/acl-migrate-tokens.html).
### Connect Multi-datacenter
This only applies to users upgrading from an older version of Consul Enterprise to Consul Enterprise 1.4.0 (all license types).
In addition, this upgrade will only affect clusters where [Connect is enabled](/docs/connect/configuration.html) on your servers before the migration.
Connect multi-datacenter uses the same primary/secondary approach as ACLs and will use the same [primary_datacenter](#primary-datacenter). When a secondary datacenter server restarts with 1.4.0 it will detect it is not the primary and begin an automatic bootstrap of multi-datacenter CA federation.
Datacenters can be upgraded in either order; secondary datacenters will not switch into multi-datacenter mode until all servers in both the secondary and primary datacenter are detected to be running at least Consul 1.4.0. Secondary datacenters monitor this periodically (every few minutes) and will automatically upgrade Connect to use a federated Certificate Authority when they do.
In general, migrating a Consul cluster from OSS to Enterprise will update the CA to be federated automatically and without impact on Connect traffic. When upgrading Consul Enterprise 1.3.x to Consul Enterprise 1.4.0 upgrades the CA upgrade is seamless, however depending on the size of the cluster, _new_ connection attempts in the secondary datacenter might fail for a short window (typically seconds) while the update is propagated due to the 1.3.x Beta authorization endpoint validating originating cluster in a way that was not fully forwards compatible with migrating between cluster trust domains. That issue is fixed in 1.4.0 as part of General Availability.
Once migrated (typically a few seconds). Connect will use the primary datacenter's Certificate Authority as the root of trust for all other datacenters. CA migration or root key changes in the primary will now rotate automatically and without loss of connectivity throughout all datacenters and workloads.
For more information see [Connect Multi-datacenter](/docs/enterprise/connect-multi-datacenter/index.html).
This version added support for multiple tag filters in service discovery queries, however it introduced a subtle bug where API calls to `/catalog/service/:name?tag=<tag>` would ignore the tag filter _only during the upgrade_. It only occurs when clients are still running 1.2.3 or earlier but servers have been upgraded. The `/health/service/:name?tag=<tag>` endpoint and DNS interface were _not_ affected.
For this reason, we recommend you upgrade directly to 1.3.1 which includes only a fix for this issue.
#### Carefully Check and Remove Stale Servers During Rolling Upgrades
Consul 1.0 (and earlier versions of Consul when running with [Raft protocol 3](/docs/agent/options.html#_raft_protocol) had an issue where performing rolling updates of Consul servers could result in an outage from old servers remaining in the cluster. [Autopilot](/docs/guides/autopilot.html) would normally remove old servers when new ones come online, but it was also waiting to promote servers to voters in pairs to maintain an odd quorum size. The pairwise promotion feature was removed so that servers become voters as soon as they are stable, allowing Autopilot to remove old servers in a safer way.
When upgrading from Consul 1.0, you may need to manually [force-leave](/docs/commands/force-leave.html) old servers as part of a rolling update to Consul 1.0.1.
Consul 1.0 has several important breaking changes that are documented here. Please be sure to read over all the details here before upgrading.
#### Raft Protocol Now Defaults to 3
The [`-raft-protocol`](/docs/agent/options.html#_raft_protocol) default has been changed from 2 to 3, enabling all [Autopilot](/docs/guides/autopilot.html) features by default.
Raft protocol version 3 requires Consul running 0.8.0 or newer on all servers in order to work, so if you are upgrading with older servers in a cluster then you will need to set this back to 2 in order to upgrade. See [Raft Protocol Version Compatibility](/docs/upgrade-specific.html#raft-protocol-version-compatibility) for more details. Also the format of `peers.json` used for outage recovery is different when running with the latest Raft protocol. See [Manual Recovery Using peers.json](/docs/guides/outage.html#manual-recovery-using-peers-json) for a description of the required format.
Please note that the Raft protocol is different from Consul's internal protocol as described on the [Protocol Compatibility Promise](/docs/compatibility.html) page, and as is shown in commands like `consul members` and `consul version`. To see the version of the Raft protocol in use on each server, use the `consul operator raft list-peers` command.
The easiest way to upgrade servers is to have each server leave the cluster, upgrade its Consul version, and then add it back. Make sure the new server joins successfully and that the cluster is stable before rolling the upgrade forward to the next server. It's also possible to stand up a new set of servers, and then slowly stand down each of the older servers in a similar fashion.
When using Raft protocol version 3, servers are identified by their [`-node-id`](/docs/agent/options.html#_node_id) instead of their IP address when Consul makes changes to its internal Raft quorum configuration. This means that once a cluster has been upgraded with servers all running Raft protocol version 3, it will no longer allow servers running any older Raft protocol versions to be added. If running a single Consul server, restarting it in-place will result in that server not being able to elect itself as a leader. To avoid this, either set the Raft protocol back to 2, or use [Manual Recovery Using peers.json](/docs/guides/outage.html#manual-recovery-using-peers-json) to map the server to its node ID in the Raft quorum configuration.
#### Config Files Require an Extension
As part of supporting the [HCL](https://github.com/hashicorp/hcl#syntax) format for Consul's config files, an `.hcl` or `.json` extension is required for all config files loaded by Consul, even when using the [`-config-file`](/docs/agent/options.html#_config_file) argument to specify a file directly.
All of Consul's previously deprecated command line flags and config options have been removed, so these will need to be mapped to their equivalents before upgrading. Here's the complete list of removed options and their equivalents:
#### `statsite_prefix` Renamed to `metrics_prefix`
Since the `statsite_prefix` configuration option applied to all telemetry providers, `statsite_prefix` was renamed to [`metrics_prefix`](/docs/agent/options.html#telemetry-metrics_prefix). Configuration files will need to be updated when upgrading to this version of Consul.
#### `advertise_addrs` Removed
This configuration option was removed since it was redundant with `advertise_addr` and `advertise_addr_wan` in combination with `ports` and also wrongly stated that you could configure both host and port.
#### Escaping Behavior Changed for go-discover Configs
The format for [`-retry-join`](/docs/agent/options.html#retry-join) and [`-retry-join-wan`](/docs/agent/options.html#retry-join-wan) values that use [go-discover](https://github.com/hashicorp/go-discover) cloud auto joining has changed. Values in `key=val` sequences must no longer be URL encoded and can be provided as literals as long as they do not contain spaces, backslashes `\` or double quotes `"`. If values contain these characters then use double quotes as in `"some key"="some value"`. Special characters within a double quoted string can be escaped with a backslash `\`.
#### HTTP Verbs are Enforced in Many HTTP APIs
Many endpoints in the HTTP API that previously took any HTTP verb now check for specific HTTP verbs and enforce them. This may break clients relying on the old behavior. Here's the complete list of updated endpoints and required HTTP verbs:
| Endpoint | Required HTTP Verb |
| -------- | ------------------ |
| /v1/acl/info | GET |
| /v1/acl/list | GET |
| /v1/acl/replication | GET |
| /v1/agent/check/deregister | PUT |
| /v1/agent/check/fail | PUT |
| /v1/agent/check/pass | PUT |
| /v1/agent/check/register | PUT |
| /v1/agent/check/warn | PUT |
| /v1/agent/checks | GET |
| /v1/agent/force-leave | PUT |
| /v1/agent/join | PUT |
| /v1/agent/members | GET |
| /v1/agent/metrics | GET |
| /v1/agent/self | GET |
| /v1/agent/service/register | PUT |
| /v1/agent/service/deregister | PUT |
| /v1/agent/services | GET |
| /v1/catalog/datacenters | GET |
| /v1/catalog/deregister | PUT |
| /v1/catalog/node | GET |
| /v1/catalog/nodes | GET |
| /v1/catalog/register | PUT |
| /v1/catalog/service | GET |
| /v1/catalog/services | GET |
| /v1/coordinate/datacenters | GET |
| /v1/coordinate/nodes | GET |
| /v1/health/checks | GET |
| /v1/health/node | GET |
| /v1/health/service | GET |
| /v1/health/state | GET |
| /v1/internal/ui/node | GET |
| /v1/internal/ui/nodes | GET |
| /v1/internal/ui/services | GET |
| /v1/session/info | GET |
| /v1/session/list | GET |
| /v1/session/node | GET |
| /v1/status/leader | GET |
| /v1/status/peers | GET |
| /v1/operator/area/:uuid/members | GET |
| /v1/operator/area/:uuid/join | PUT |
#### Unauthorized KV Requests Return 403
When ACLs are enabled, reading a key with an unauthorized token returns a 403. This previously returned a 404 response.
#### Config Section of Agent Self Endpoint has Changed
The /v1/agent/self endpoint's `Config` section has often been in flux as it was directly returning one of Consul's internal data structures. This configuration structure has been moved under `DebugConfig`, and is documents as for debugging use and subject to change, and a small set of elements of `Config` have been maintained and documented. See [Read Configuration](/api/agent.html#read-configuration) endpoint documentation for details.
#### Deprecated `configtest` Command Removed
The `configtest` command was deprecated and has been superseded by the `validate` command.
#### Undocumented Flags in `validate` Command Removed
The `validate` command supported the `-config-file` and `-config-dir` command line flags but did not document them. This support has been removed since the flags are not required.
#### Metric Names Updated
Metric names no longer start with `consul.consul`. To help with transitioning dashboards and other metric consumers, the field `enable_deprecated_names` has been added to the telemetry section of the config, which will enable metrics with the old naming scheme to be sent alongside the new ones. The following prefixes were affected:
| Prefix |
| ------ |
| consul.consul.acl |
| consul.consul.autopilot |
| consul.consul.catalog |
| consul.consul.fsm |
| consul.consul.health |
| consul.consul.http |
| consul.consul.kvs |
| consul.consul.leader |
| consul.consul.prepared-query |
| consul.consul.rpc |
| consul.consul.session |
| consul.consul.session_ttl |
| consul.consul.txn |
#### Checks Validated On Agent Startup
Consul agents now validate health check definitions in their configuration and will fail at startup if any checks are invalid. In previous versions of Consul, invalid health checks would get skipped.
A new [`enable_script_checks`](/docs/agent/options.html#_enable_script_checks) configuration option was added, and defaults to `false`, meaning that in order to allow an agent to run health checks that execute scripts, this will need to be configured and set to `true`. This provides a safer out-of-the-box configuration for Consul where operators must opt-in to allow script-based health checks.
If your cluster uses script health checks please be sure to set this to `true` as part of upgrading agents. If this is set to `true`, you should also enable [ACLs](/docs/guides/acl.html) to provide control over which users are allowed to register health checks that could potentially execute scripts on the agent machines.
Consul releases will no longer include a `web_ui.zip` file with the compiled web assets. These have been built in to the Consul binary since the 0.7.x series and can be enabled with the [`-ui`](/docs/agent/options.html#_ui) configuration option. These built-in web assets have always been identical to the contents of the `web_ui.zip` file for each release. The [`-ui-dir`](/docs/agent/options.html#_ui_dir) option is still available for hosting customized versions of the web assets, but the vast majority of Consul users can just use the built in web assets.
The [`acl_enforce_version_8`](/docs/agent/options.html#acl_enforce_version_8) configuration now defaults to `true` to enable [full version 8 ACL support](/docs/guides/acl.html#version_8_acls) by default. If you are upgrading an existing cluster with ACLs enabled, you will need to set this to `false` during the upgrade on **both Consul agents and Consul servers**. Version 8 ACLs were also changed so that [`acl_datacenter`](/docs/agent/options.html#acl_datacenter) must be set on agents in order to enable the agent-side enforcement of ACLs. This makes for a smoother experience in clusters where ACLs aren't enabled at all, but where the agents would have to wait to contact a Consul server before learning that.
Child process reaping support has been removed, along with the `reap` configuration option. Reaping is also done via [dumb-init](https://github.com/Yelp/dumb-init) in the [Consul Docker image](https://github.com/hashicorp/docker-consul), so removing it from Consul itself simplifies the code and eases future maintenance for Consul. If you are running Consul as PID 1 in a container you will need to arrange for a wrapper process to reap child processes.
The default for [`max_stale`](/docs/agent/options.html#max_stale) has been increased from 5 seconds to a near-indefinite threshold (10 years) to allow DNS queries to continue to be served in the event of a long outage with no leader. A new telemetry counter was added at `consul.dns.stale_queries` to track when agents serve DNS queries that are stale by more than 5 seconds.