refresh of the Consul architecture overview page

This commit is contained in:
trujillo-adam 2022-09-26 11:21:08 -07:00
parent d9e42b0f1c
commit 3c482bdfeb
8 changed files with 96 additions and 66 deletions

@@ -5,83 +5,110 @@ description: >-
Consul datacenters consist of clusters of server agents (control plane) and client agents deployed alongside service instances (dataplane). Learn how these components and their different communication methods make Consul possible.
---
# Consul Internals Overview
# Consul Architecture
Consul is a complex system that has many different moving parts. To help
users and developers of Consul form a mental model of how it works, this
page documents the system architecture.
This topic provides an overview of the Consul architecture. We recommend reviewing the Consul [glossary](/docs/install/glossary) as a companion to this topic to help you become familiar with HashiCorp terms.
-> Before describing the architecture, we recommend reading the
[glossary](/docs/install/glossary) of terms to help
clarify what is being discussed.
> Refer to the [Reference Architecture tutorial](https://learn.hashicorp.com/tutorials/consul/reference-architecture) for hands-on guidance about deploying Consul in production.
The architecture concepts in this document can be used with the [Reference Architecture guide](https://learn.hashicorp.com/tutorials/consul/reference-architecture?in=consul/production-deploy#deployment-system-requirements) when deploying Consul in production.
## Introduction
## 10,000 foot view
Consul provides a control plane that enables you to register, access, and secure services deployed across your network. The _control plane_ is the part of the network infrastructure that maintains a central registry to track services and their respective IP addresses.
From a 10,000 foot altitude the architecture of Consul looks like this:
When using Consul's service mesh capabilities, Consul dynamically configures sidecar and gateway proxies in the request path, which enables you to authorize service-to-service connections, route requests to healthy service instances, and enforce mTLS encryption without modifying your services' code. This ensures that communication remains performant and reliable. Refer to [Service Mesh Proxy Overview](/docs/connect/proxies) for an overview of sidecar proxies.
[![Consul Architecture](/img/consul-arch.png)](/img/consul-arch.png)
![Diagram of the Consul control plane](/img/consul-arch/consul-arch-overview-control-plane.svg)
Let's break down this image and describe each piece. First of all, we can see
that there are two datacenters, labeled "one" and "two". Consul has first
class support for [multiple datacenters](https://learn.hashicorp.com/consul/security-networking/datacenters) and
expects this to be the common case.
## Datacenters
Within each datacenter, we have a mixture of clients and servers. It is expected
that there will be between three and five servers. This strikes a balance between
availability in the case of failure and performance, as consensus gets progressively
slower as more machines are added. However, there is no limit to the number of clients,
and they can easily scale into the thousands or tens of thousands.
The Consul control plane contains one or more _datacenters_. A datacenter is the smallest unit of Consul infrastructure that can perform basic Consul operations. A datacenter contains at least one [Consul server agent](#server-agents), but a real-world deployment contains three or five server agents and several [Consul client agents](#client-agents). You can create multiple datacenters and allow nodes in different datacenters to interact with each other. Refer to [Bootstrap a Datacenter](/docs/install/bootstrapping) for information about how to create a datacenter.
All the agents that are in a datacenter participate in a [gossip protocol](/docs/architecture/gossip).
This means there is a gossip pool that contains all the agents for a given datacenter. This serves
a few purposes: first, there is no need to configure clients with the addresses of servers;
discovery is done automatically. Second, the work of detecting agent failures
is not placed on the servers but is distributed. This makes failure detection much more
scalable than naive heartbeating schemes. It also provides failure detection for the nodes; if the agent is not reachable, then the node may have experienced a failure. Thirdly, it is used as a messaging layer to notify
when important events such as leader election take place.
### Clusters
The servers in each datacenter are all part of a single Raft peer set. This means that
they work together to elect a single leader, a selected server which has extra duties. The leader
is responsible for processing all queries and transactions. Transactions must also be replicated to
all peers as part of the [consensus protocol](/docs/architecture/consensus). Because of this
requirement, when a non-leader server receives an RPC request, it forwards it to the cluster leader.
A collection of Consul agents that are aware of each other is called a _cluster_. The terms _datacenter_ and _cluster_ are often used interchangeably. In some cases, however, _cluster_ refers only to Consul server agents, such as in [HCP Consul](https://cloud.hashicorp.com/consul). In other contexts, such as the [_admin partitions_](/docs/enterprise/admin-partitions) feature included with Consul Enterprise, a cluster may refer to a collection of client agents.
The server agents also operate as part of a WAN gossip pool. This pool is different from the LAN pool
as it is optimized for the higher latency of the internet and is expected to contain only
other Consul server agents. The purpose of this pool is to allow datacenters to discover each
other in a low-touch manner. Bringing a new datacenter online is as easy as joining the existing
WAN gossip pool. Because the servers are all operating in this pool, it also enables cross-datacenter
requests. When a server receives a request for a different datacenter, it forwards it to a random
server in the correct datacenter. That server may then forward to the local leader.
## Agents
This results in a very low coupling between datacenters, but because of failure detection,
connection caching and multiplexing, cross-datacenter requests are relatively fast and reliable.
You can run the Consul binary to start Consul _agents_, which are daemons that implement Consul control plane functionality. You can start agents as servers or clients. Refer to [Consul Agent](/docs/agent) for additional information.
In general, data is not replicated between different Consul datacenters. When a
request is made for a resource in another datacenter, the local Consul servers forward
an RPC request to the remote Consul servers for that resource and return the results.
If the remote datacenter is not available, then those resources will also not be
available, but that won't otherwise affect the local datacenter. There are some special
situations where a limited subset of data can be replicated, such as with Consul's built-in
[ACL replication](https://learn.hashicorp.com/tutorials/consul/access-control-replication-multiple-datacenters) capability, or
external tools like [consul-replicate](https://github.com/hashicorp/consul-replicate).
### Server agents
In some places, client agents may cache data from the servers to make it
available locally for performance and reliability. Examples include Connect
certificates and intentions which allow the client agent to make local decisions
about inbound connection requests without a round trip to the servers. Some API
endpoints also support optional result caching. This helps reliability because
the local agent can continue to respond to some queries like service-discovery
or Connect authorization from cache even if the connection to the servers is
disrupted or the servers are temporarily unavailable.
Consul server agents store all state information, including service and node IP addresses, health checks, and configuration. We recommend deploying three or five servers in a cluster. The more servers you deploy, the greater the resilience and availability in the event of a failure. More servers, however, slow down [consensus](#consensus-protocol), which is a critical server function that enables Consul to efficiently and effectively process information.
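The recommendation of three or five servers can be illustrated with a little arithmetic. The following is a minimal sketch, assuming the usual Raft rule that a majority of servers (a quorum) must be available; the function names are hypothetical and are not part of Consul.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of a peer set with `servers` voting members."""
    if servers < 1:
        raise ValueError("a datacenter needs at least one server agent")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Number of servers that can fail while a majority remains available."""
    return servers - quorum_size(servers)


for n in (1, 3, 4, 5, 7):
    print(f"servers={n} quorum={quorum_size(n)} tolerated_failures={failure_tolerance(n)}")
```

Under this assumption, a fourth server tolerates no more failures than three do, which is one reason odd cluster sizes are usually preferred.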
## Getting in depth
#### Consensus protocol
At this point we've covered the high level architecture of Consul, but there are many
more details for each of the subsystems. The [consensus protocol](/docs/architecture/consensus) is
documented in detail as is the [gossip protocol](/docs/architecture/gossip). The [documentation](/docs/security)
for the security model and protocols used are also available.
Consul clusters elect a single server to be the _leader_ through a process called _consensus_. The leader processes all queries and transactions, which prevents conflicting updates in clusters containing multiple servers.
For other details, either consult the code, ask in IRC, or reach out to the mailing list.
Servers that are not currently acting as the cluster leader are called _followers_. Followers forward requests from client agents to the cluster leader. The leader replicates the requests to all other servers in the cluster. Replication ensures that if the leader is unavailable, other servers in the cluster can elect another leader without losing any data.
Consul servers establish consensus using the Raft algorithm on port `8300`. Refer to [Consensus Protocol](/docs/architecture/consensus) for additional information.
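The leader and follower roles described above can be sketched as a toy model. This is an illustration only, not Consul's Raft implementation: `ToyCluster` and `Server` are hypothetical names, the leader is picked arbitrarily instead of being elected, and replication is modeled as a simple append with no quorum acknowledgment.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    log: list = field(default_factory=list)


class ToyCluster:
    """Hypothetical model: one leader, all other servers are followers."""

    def __init__(self, names):
        self.servers = {name: Server(name) for name in names}
        self.leader = names[0]  # stand-in for the result of a leader election

    def handle(self, received_by: str, write: str) -> str:
        # A follower forwards the request; the leader handles it directly.
        if received_by == self.leader:
            path = received_by
        else:
            path = f"{received_by} -> {self.leader}"
        # The leader replicates the entry to every server in the cluster.
        for server in self.servers.values():
            server.log.append(write)
        return path


cluster = ToyCluster(["server-1", "server-2", "server-3"])
print(cluster.handle("server-2", "register web"))
```

Because every server ends up with the same log in this model, any of them could take over as leader without losing the registered write.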
![Diagram of the Consul control plane consensus traffic](/img/consul-arch/consul-arch-overview-consensus.svg)
### Client agents
Consul clients report node and service health status to the Consul cluster. In a typical deployment, you must run client agents on every compute node in your datacenter. Clients use remote procedure calls (RPC) to interact with servers. By default, clients send RPC requests to the servers on port `8300`.
There are no limits to the number of client agents or services you can use with Consul, but production deployments should distribute services across multiple Consul datacenters. Using a multi-datacenter deployment enhances infrastructure resilience and limits control plane issues. We recommend deploying a maximum of 5,000 client agents per datacenter. Some large organizations have deployed tens of thousands of client agents and hundreds of thousands of service instances across a multi-datacenter deployment. Refer to [Cross-datacenter requests](#cross-datacenter-requests) for additional information.
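As a rough planning aid, the 5,000-client recommendation translates into a minimum datacenter count. The sketch below is only arithmetic on that figure; the function name is hypothetical, and real sizing depends on many other factors.

```python
import math

# Recommended maximum from this page, not a hard limit enforced by Consul.
MAX_CLIENTS_PER_DATACENTER = 5_000


def min_datacenters(client_agents: int) -> int:
    """Fewest datacenters that keep each one at or below the recommended maximum."""
    if client_agents < 0:
        raise ValueError("client_agents must be non-negative")
    return max(1, math.ceil(client_agents / MAX_CLIENTS_PER_DATACENTER))


print(min_datacenters(12_000))
```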
## LAN gossip pool
Client and server agents participate in a LAN gossip pool so that they can distribute and perform node [health checks](/docs/discovery/checks). Agents in the pool propagate the health check information across the cluster. Agent gossip communication occurs on port `8301` using UDP. Agent gossip falls back to TCP if UDP is not available. Refer to [Gossip Protocol](/docs/architecture/gossip) for additional information.
The following simplified diagram shows the interactions between servers and clients.
<Tabs>
<Tab heading="LAN gossip pool">
![Diagram of the Consul LAN gossip pool](/img/consul-arch/consul-arch-overview-lan-gossip-pool.svg)
</Tab>
<Tab heading="RPC">
![Diagram of RPC communication between Consul agents](/img/consul-arch/consul-arch-overview-rpc.svg)
</Tab>
</Tabs>
## Cross-datacenter requests
Each Consul datacenter maintains its own catalog of services and their health. By default, the information is not replicated across datacenters. WAN federation and cluster peering are two multi-datacenter deployment models that enable service connectivity across datacenters.
### WAN federation
WAN federation refers to designating a _primary datacenter_ that contains authoritative information about all datacenters, including service mesh configurations and access control list (ACL) resources.
In this model, when a client agent requests a resource in a remote secondary datacenter, a local Consul server forwards the RPC request to a remote Consul server that has access to the resource. A remote server sends the results to the local server. If the remote datacenter is unavailable, its resources are also unavailable. By default, WAN-federated servers send cross-datacenter requests over TCP on port `8300`.
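The forwarding behavior in this model can be sketched as follows. This is a simplified illustration, assuming each datacenter's catalog is a plain dictionary; `lookup`, `DatacenterUnavailable`, and the sample addresses are hypothetical and do not correspond to a Consul API.

```python
class DatacenterUnavailable(Exception):
    """Raised when the remote datacenter cannot be reached."""


def lookup(catalogs, reachable, local_dc, target_dc, service):
    """Resolve `service` in `target_dc` on behalf of a client in `local_dc`."""
    if target_dc == local_dc:
        return catalogs[local_dc].get(service, [])
    if target_dc not in reachable:
        # Catalogs are not replicated, so there is no local copy to fall back on.
        raise DatacenterUnavailable(target_dc)
    # Stand-in for the local server forwarding the RPC to a remote server.
    return catalogs[target_dc].get(service, [])


catalogs = {
    "dc1": {"web": ["10.0.1.10"]},
    "dc2": {"db": ["10.1.2.20"]},
}
print(lookup(catalogs, {"dc2"}, "dc1", "dc2", "db"))
```

Calling `lookup` with an empty `reachable` set raises `DatacenterUnavailable` for `dc2`, while lookups against `dc1` continue to work, mirroring the isolation between datacenters.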
You can configure control plane and data plane traffic to go through mesh gateways, which simplifies networking requirements.
> **Hands-on**: To enable services to communicate across datacenters when the ACL system is enabled, refer to the [ACL Replication for Multiple Datacenters](https://learn.hashicorp.com/tutorials/consul/access-control-replication-multiple-datacenters) tutorial.
#### WAN gossip pool
Servers may also participate in a WAN gossip pool, which is optimized for greater latency imposed by the Internet. The pool enables servers to exchange information, such as their addresses and health, and gracefully handle loss of connectivity in the event of a failure.
In the following diagram, the servers in each datacenter participate in a WAN gossip pool by sending data over TCP/UDP on port `8302`. Refer to [Gossip Protocol](/docs/architecture/gossip) for additional information.
<Tabs>
<Tab heading="WAN gossip pool">
![Diagram of the Consul WAN gossip pool](/img/consul-arch/consul-arch-overview-wan-gossip-cross-cluster.svg)
</Tab>
<Tab heading="Remote datacenter forwarding">
![Diagram of remote datacenter forwarding between Consul servers](/img/consul-arch/consul-arch-overview-remote-dc-forwarding-cross-cluster.svg)
</Tab>
</Tabs>
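For quick reference, the default ports mentioned in this topic can be collected in one place. The following sketch only restates those defaults as data; the dictionary layout and the `describe` helper are hypothetical, and the TCP transport for port `8300` is an assumption for the consensus and client RPC cases, since this topic states it only for cross-datacenter requests.

```python
# Default ports and transports as described in this topic.
DEFAULT_PORTS = {
    "server RPC and Raft consensus": (8300, ("tcp",)),
    "LAN gossip": (8301, ("udp", "tcp")),  # UDP first, TCP as the fallback
    "WAN gossip": (8302, ("tcp", "udp")),
}


def describe(ports=DEFAULT_PORTS):
    """Return one human-readable line per traffic type, ordered by port."""
    lines = []
    for purpose, (port, transports) in sorted(ports.items(), key=lambda kv: kv[1][0]):
        lines.append(f"{port}/{'+'.join(transports)}: {purpose}")
    return lines


for line in describe():
    print(line)
```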
### Peering clusters (beta)
You can create peering connections between two or more independent clusters so that services deployed to different datacenters or admin partitions can communicate. An [admin partition](/docs/enterprise/admin-partitions) is a feature in Consul Enterprise that enables you to define isolated network regions that use the same Consul servers. In the cluster peering model, you create a token in one of the datacenters or partitions and configure another datacenter or partition to present the token to establish the connection.
-> **Cluster peering is currently in beta:** Functionality associated with cluster peering is subject to change. You should never use the beta release in secure environments or production scenarios. Features in beta may have performance issues, scaling issues, and limited support.
Refer to [What is Cluster Peering?](/docs/connect/cluster-peering) for additional information.

BIN
website/public/img/consul-arch.png (Stored with Git LFS)

Binary file not shown.
