diff --git a/.changelog/9103.txt b/.changelog/9103.txt new file mode 100644 index 000000000..da18c8e8f --- /dev/null +++ b/.changelog/9103.txt @@ -0,0 +1,12 @@ +```release-note:bug +autopilot: **(Enterprise Only)** Previously, servers in other zones would not be promoted when every server in one zone had failed. Autopilot now matches the documented behavior and will promote a healthy non-voter from any zone to compensate for the failure of an entire zone. +``` +```release-note:feature +autopilot: A new `/v1/operator/autopilot/state` HTTP API was created to give greater visibility into what autopilot is doing and how it has classified all the servers it is tracking. +``` +```release-note:improvement +autopilot: **(Enterprise Only)** Autopilot now supports using both Redundancy Zones and Automated Upgrades together. +``` +```release-note:breaking-change +raft: Raft protocol v2 is no longer supported. If currently using protocol v2 then an intermediate upgrade to a version supporting both protocols will be necessary (1.0.0 - 1.8.x). +``` diff --git a/website/pages/api-docs/operator/autopilot.mdx b/website/pages/api-docs/operator/autopilot.mdx index 6d4260336..9995e1445 100644 --- a/website/pages/api-docs/operator/autopilot.mdx +++ b/website/pages/api-docs/operator/autopilot.mdx @@ -251,3 +251,257 @@ $ curl \ The HTTP status code will indicate the health of the cluster. If `Healthy` is true, then a status of 200 will be returned. If `Healthy` is false, then a status of 429 will be returned. + + +## Read the Autopilot State + +This endpoint queries the current state of autopilot, including how it has classified each server it is tracking. + +| Method | Path | Produces | +| ------ | ---------------------------- | ------------------ | +| `GET` | `/operator/autopilot/state` | `application/json` | + +The table below shows this endpoint's support for +[blocking queries](/api/features/blocking), +[consistency modes](/api/features/consistency), +[agent caching](/api/features/caching), and +[required ACLs](/api#authentication).
+ +| Blocking Queries | Consistency Modes | Agent Caching | ACL Required | +| ---------------- | ----------------- | ------------- | --------------- | +| `NO` | `none` | `none` | `operator:read` | + +### Parameters + +- `dc` `(string: "")` - Specifies the datacenter to query. This will default to + the datacenter of the agent being queried. This is specified as part of the + URL as a query string. + +### Sample Request + +```shell-session +$ curl \ + http://127.0.0.1:8500/v1/operator/autopilot/state +``` + +### Response Format + +```json +{ + "Healthy": true, + "FailureTolerance": 1, + "OptimisticFailureTolerance": 4, + "Servers": { + "5e26a3af-f4fc-4104-a8bb-4da9f19cb278": {}, + "10b71f14-4b08-4ae5-840c-f86d39e7d330": {}, + "1fd52e5e-2f72-47d3-8cfc-2af760a0c8c2": {}, + "63783741-abd7-48a9-895a-33d01bf7cb30": {}, + "6cf04fd0-7582-474f-b408-a830b5471285": {} + }, + "Leader": "5e26a3af-f4fc-4104-a8bb-4da9f19cb278", + "Voters": [ + "5e26a3af-f4fc-4104-a8bb-4da9f19cb278", + "10b71f14-4b08-4ae5-840c-f86d39e7d330", + "1fd52e5e-2f72-47d3-8cfc-2af760a0c8c2" + ], + "RedundancyZones": { + "az1": {}, + "az2": {}, + "az3": {} + }, + "ReadReplicas": [ + "63783741-abd7-48a9-895a-33d01bf7cb30", + "6cf04fd0-7582-474f-b408-a830b5471285" + ], + "Upgrade": {} +} +``` + +- `Healthy` indicates whether all the servers are currently healthy. + +- `FailureTolerance` is the number of redundant healthy servers that could + fail without causing an outage (this would be 2 in a healthy cluster of 5 + servers). + +- `OptimisticFailureTolerance` is the maximum number + of servers that could fail in the right order over the right period of time + without causing an outage. This value is only useful when using the [Redundancy + Zones feature](/docs/enterprise/redundancy) with autopilot. + +- `Servers` is a mapping of server ID to an object holding detailed information about that server. + The format of the detailed info is documented in its own section.
+ +- `Leader` is the server ID of the current leader. This value can be used as an index into the `Servers` object. + +- `Voters` is a list of server IDs that are voters. These values can be used as indexes into the `Servers` object. + +- `RedundancyZones` is a mapping of redundancy zone name to redundancy zone information. + The format of the redundancy zone information is documented in its own section. + +- `ReadReplicas` is a list of server IDs that autopilot has identified as read replicas. + These will never be promoted. These values can be used as indexes into the `Servers` map. + +- `Upgrade` is an object holding all the information about any ongoing automated upgrade. + The format of this object is detailed in its own section. + +### Server Response Format + +```json +{ + "ID": "1c3e3278-3f88-4a97-9f6a-1058584e8058", + "Name": "node1", + "Address": "198.18.0.1:8300", + "NodeStatus": "alive", + "Version": "1.9.0+ent", + "LastContact": "1.321ms", + "LastTerm": 4, + "LastIndex": 42, + "Healthy": true, + "StableSince": "2020-08-12T12:13:14Z", + "RedundancyZone": "az1", + "UpgradeVersion": "1.2.3", + "ReadReplica": false, + "Status": "voter", + "Meta": { + "build": "1.2.3", + "zone": "az1" + }, + "NodeType": "redundancy-zone-voter" +} +``` + +- `ID` is the Raft ID of the server. + +- `Name` is the node name of the server. + +- `Address` is the address of the server. + +- `NodeStatus` is the SerfHealth check status for the server. + +- `Version` is the Consul version of the server. + +- `LastContact` is the time elapsed since this server's last contact with the leader. + +- `LastTerm` is the server's last known Raft leader term. + +- `LastIndex` is the index of the server's last committed Raft log entry. + +- `Healthy` indicates whether the server is healthy according to the current Autopilot configuration. + +- `StableSince` is the time at which this server entered its current `Healthy` state. + +- `RedundancyZone` is the name of the redundancy zone this server is within.
+ +- `UpgradeVersion` is the version that will be used for automated upgrade calculations. + +- `ReadReplica` indicates whether this server is a read replica or not. + +- `Status` indicates the current Raft status of this server. Possible values are: + `leader`, `voter`, `non-voter`, or `staging`. + +- `Meta` is the node metadata of this server. Values within this map are used for determining a server's + redundancy zone and upgrade version. + +- `NodeType` is the desired type autopilot thinks this server should have. In Consul OSS the only possible + value is `voter` as all present servers should have voting rights. In Consul Enterprise the possible values also + include `read-replica`, `zone-voter`, `zone-standby` and `zone-extra-voter`. `zone-voter` indicates that autopilot + wants this server to be the voter for a particular redundancy zone. When a zone has no voter, all servers in it will be typed + as `zone-voter` until one is promoted. When that happens the other non-voters in the zone will be typed as `zone-standby`. + This indicates that they are currently desired to be standby servers in case the voter from the zone fails. Finally, + the `zone-extra-voter` type indicates that autopilot wants this server to be a voter due to a failure of all servers + in another zone and that when one of the servers in that failed zone is restored, this server will be demoted. + +### Redundancy Zone Response Format + +```json +{ + "Servers": [ + "10b71f14-4b08-4ae5-840c-f86d39e7d330", + "b007061c-6d15-4c90-b3d6-2fef276a0650" + ], + "Voters": [ + "b007061c-6d15-4c90-b3d6-2fef276a0650" + ], + "FailureTolerance": 1 +} +``` + +Each zone in the response's `RedundancyZones` mapping will have this structure. + +- `Servers` is a list of server IDs of all the servers in this zone. These values can be used as indexes + into the top level response's `Servers` mapping. + +- `Voters` is a list of server IDs of all servers in this zone that have voting rights.
Typically this will + be a list with 1 value but in some failure scenarios or upgrade scenarios the size could increase. These + values can be used as indexes into the top level response's `Servers` mapping. + +- `FailureTolerance` is the number of servers in this zone that could fail without causing a total zone failure + and subsequent promotion of a server from another zone as a fallback. + +### Upgrade Information Response Format + +```json +{ + "Status": "await-new-servers", + "TargetVersion": "1.9.1+ent", + "TargetVersionVoters": [ + "f0344689-3e1f-4125-b55d-e888d3abf514" + ], + "TargetVersionNonVoters": [ + "619a4ba6-1a0b-476e-8a1a-28aeee7735a2", + "fd683fe6-541f-4ebf-bc5a-6eae51571ddb" + ], + "TargetVersionReadReplicas": [ + "9f1e27ae-1129-45ef-97dd-6d8c3ec47e6a" + ], + "OtherVersionVoters": [ + "0cbdd493-235f-48f2-98d9-1bf2443b9d72", + "21812bd7-2f21-4565-9892-2fdd3d4e1a99", + "c654ba5c-cc76-4056-a5ca-6e78d95f27ad" + ], + "OtherVersionNonVoters": [ + "6d973f11-6bdb-4f7d-8a90-c1300066da4c", + "6241ab45-371e-4b2a-a0f1-d847c3b7b1b0" + ], + "OtherVersionReadReplicas": [ + "42d10fc3-581b-4403-832d-945b3a0d8841" + ] + } +``` + +- `Status` is the automated upgrade status. Possible values are: + + - `disabled` indicates that automated upgrades are disabled either from user configuration or due to being unlicensed. + + - `idle` indicates that there is no ongoing upgrade and that all servers are running the same Consul version. + + - `await-new-voters` indicates that a newer versioned server has been added but that autopilot is waiting for more servers + of that version to be added before proceeding with the upgrade. + + - `promoting` indicates that enough servers of the target version have been added and autopilot will now promote them + to voters. + + - `demoting` indicates that autopilot is currently demoting the servers not running the target version.
+ + - `leader-transfer` indicates that autopilot is in the process of transferring leadership to a server running + the target version. + + - `await-new-servers` indicates that the majority of the upgrade is complete but that more servers running the target + version need to be added to completely replace all of the previous servers. + + - `await-server-removal` indicates that the upgrade is complete and it is now safe to remove the previous servers. + +- `TargetVersion` is the version that Autopilot is upgrading to. This will be the maximum of the + `UpgradeVersion` fields across all servers in the top level `Servers` mapping. + +- `TargetVersionVoters` is a list of IDs of servers running the target version and that currently have voting rights. + +- `TargetVersionNonVoters` is a list of IDs of servers running the target version and that currently do not have voting rights. + +- `TargetVersionReadReplicas` is a list of IDs of servers running the target version and that are read replicas. + +- `OtherVersionVoters` is a list of IDs of servers not running the target version and that currently have voting rights. + +- `OtherVersionNonVoters` is a list of IDs of servers not running the target version and that currently do not have voting rights. + +- `OtherVersionReadReplicas` is a list of IDs of servers not running the target version and that are read replicas. \ No newline at end of file
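Reviewer note: the documented response uses server IDs in `Leader`, `Voters`, and `ReadReplicas` as indexes into the `Servers` mapping. The sketch below illustrates that lookup in Python on a trimmed, made-up payload shaped like the documented format. The `summarize` helper and the sample values are hypothetical and are not part of this patch or of any Consul client library.

```python
import json

# Trimmed, illustrative payload shaped like the documented response.
# Only a few per-server fields are included; all values are made up.
SAMPLE = json.loads("""
{
  "Healthy": false,
  "FailureTolerance": 0,
  "Leader": "5e26a3af-f4fc-4104-a8bb-4da9f19cb278",
  "Voters": [
    "5e26a3af-f4fc-4104-a8bb-4da9f19cb278",
    "10b71f14-4b08-4ae5-840c-f86d39e7d330"
  ],
  "ReadReplicas": ["63783741-abd7-48a9-895a-33d01bf7cb30"],
  "Servers": {
    "5e26a3af-f4fc-4104-a8bb-4da9f19cb278":
      {"Name": "node1", "Status": "leader", "Healthy": true},
    "10b71f14-4b08-4ae5-840c-f86d39e7d330":
      {"Name": "node2", "Status": "voter", "Healthy": true},
    "63783741-abd7-48a9-895a-33d01bf7cb30":
      {"Name": "node4", "Status": "non-voter", "Healthy": false}
  }
}
""")


def summarize(state):
    """Resolve the ID lists in an autopilot state dict to node names.

    Hypothetical helper: it relies only on the top-level keys described
    in the docs (`Servers`, `Leader`, `Voters`, `ReadReplicas`).
    """
    servers = state.get("Servers", {})

    def name_of(server_id):
        # Fall back to the raw ID if the server entry or name is missing.
        return servers.get(server_id, {}).get("Name", server_id)

    return {
        "leader": name_of(state.get("Leader", "")),
        "voters": [name_of(i) for i in state.get("Voters", [])],
        "read_replicas": [name_of(i) for i in state.get("ReadReplicas", [])],
        "unhealthy": sorted(
            info.get("Name", sid)
            for sid, info in servers.items()
            if not info.get("Healthy", False)
        ),
    }


if __name__ == "__main__":
    print(json.dumps(summarize(SAMPLE), indent=2))
```

In practice the input dictionary would be the decoded JSON body returned by `GET /v1/operator/autopilot/state`; the HTTP call is left out so the sketch stays self-contained.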