Add common blocking implementation details to docs (#5358)

* Add common blocking implementation details to docs

  These come up over and over again with blocking query loops in our own code and third parties'. #5333 is possibly a case (unconfirmed) where "badly behaved" blocking clients cause issues, however since we've never explicitly documented these things it's not reasonable for third-party clients to have guessed that they are needed! This hopefully gives us something to point to for the future. It's a little wordy - happy to consider breaking some of the blocking stuff out of this page if we think it's appropriate but just wanted to quickly plaster over this gap in our docs for now.

* Update index.html.md
* Apply suggestions from code review

  Co-Authored-By: banks <banks@banksco.de>

* Update index.html.md
* Update index.html.md
* Clarified monotonically
* Fixing formatting
2019-02-21 21:33:45 +00:00 · 2019-02-21 21:33:45 +00:00 · abc5478b51
parent 72218cafae
commit abc5478b51
1 changed files with 45 additions and 0 deletions
--- a/website/source/api/index.html.md
+++ b/website/source/api/index.html.md
@@ -77,6 +77,51 @@ to the supplied maximum `wait` time to spread out the wake up time of any
 concurrent requests. This adds up to `wait / 16` additional time to the maximum
 duration.

+### Implementation Details
+
+While the mechanism is relatively simple to work with, there are a few edge 
+cases that must be handled correctly.
+
+ * **Reset the index if it goes backwards**. While indexes in general are
+   monotonically increasing (i.e. they should only ever increase as time passes),
+   there are several real-world scenarios in
+   which they can go backwards for a given query. Implementations must check
+   whether a returned index is lower than the previous value,
+   and if it is, should reset the index to `0`, effectively restarting their blocking loop.
+   Failure to do so may cause the client to miss future updates for an unbounded
+   time, or to use an invalid index value that causes no blocking and increases
+   load on the servers. Cases where this can occur include:
+   * A Raft snapshot with an older version of the data is restored on the servers.
+   * A KV list operation removes the item with the highest index.
+   * A Consul upgrade changes the way watches work to optimize them with more
+     granular indexes.
+
+ * **Sanity check that the index is greater than zero**. After the initial request (or a
+   reset as above) the `X-Consul-Index` returned _should_ always be greater than zero. It
+   is a bug in Consul if it is not; however, this has happened a few times and can
+   still be triggered on some older Consul versions. It's especially bad because it
+   causes blocking clients that are not aware of it to enter a busy loop, using excessive
+   client CPU and causing high load on servers. It is _always_ safe to use an
+   index of `1` to wait for updates when the data being requested doesn't exist
+   yet, so clients _should_ sanity check that their index is at least 1 after
+   each blocking response is handled to be sure they actually block on the next
+   request.
+
+ * **Rate limit**. The blocking query mechanism is reasonably efficient when updates
+   are relatively rare (on the order of tens of seconds to minutes between updates). In cases
+   where a result is updated very frequently, however, possibly during an outage or incident
+   with a badly behaved client, blocking query loops degrade into busy loops that
+   consume excessive client CPU and cause high server load. While it's possible to just add a sleep
+   to every iteration of the loop, this is **not** recommended since it delays update
+   delivery in the happy case, and it can exacerbate the problem since
+   it increases the chance that the index has changed on the next request. Clients
+   _should_ instead rate limit the loop so that in the happy case they proceed without
+   waiting, but when values start to churn quickly they degrade into polling at a
+   reasonable rate (say every 15 seconds). Ideally this is done with an algorithm that
+   allows a couple of quick successive deliveries before it starts to limit the rate; a
+   [token bucket](https://en.wikipedia.org/wiki/Token_bucket) with a burst of 2 is a simple
+   way to achieve this.
+
 ### Hash-based Blocking Queries

 A limited number of agent endpoints also support blocking however because the
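
The three rules added in this hunk can be combined into one client loop. The following is a minimal Python sketch under stated assumptions, not Consul's own client code: `next_index`, `TokenBucket`, and `watch` are hypothetical names, `fetch(index)` stands in for whatever performs the blocking HTTP request and returns `(data, returned_index)` taken from `X-Consul-Index`, and the 15 second interval with a burst of 2 simply mirrors the figures suggested in the text.

```python
import time


def next_index(previous: int, returned: int) -> int:
    """Choose the index to send on the next blocking request."""
    if returned < previous:
        # The index went backwards: reset to 0 to restart the blocking loop.
        return 0
    # Sanity check: never carry an index below 1 into the next request.
    return max(returned, 1)


class TokenBucket:
    """Token bucket that reports how long to wait instead of sleeping itself."""

    def __init__(self, interval: float, burst: int, clock=time.monotonic):
        self.interval = interval  # seconds needed to refill one token
        self.burst = burst        # maximum number of stored tokens
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def delay(self) -> float:
        """Take one token and return the seconds the caller should sleep first."""
        now = self.clock()
        refill = (now - self.last) / self.interval
        self.tokens = min(float(self.burst), self.tokens + refill)
        self.last = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0  # a token was available: proceed without waiting
        # Bucket is in deficit: wait until the borrowed token has refilled.
        return -self.tokens * self.interval


def watch(fetch, handle, limiter, sleep=time.sleep, rounds=None):
    """Run the blocking loop; `rounds` bounds the iterations (None = forever)."""
    index = 0
    done = 0
    while rounds is None or done < rounds:
        sleep(limiter.delay())         # zero in the happy case
        data, returned = fetch(index)  # hypothetical blocking request
        handle(data)                   # may see unchanged data after a wait timeout
        index = next_index(index, returned)
        done += 1
    return index
```

A caller would wire it up as `watch(fetch, handle, TokenBucket(interval=15.0, burst=2))`. Because the bucket starts full, the first two responses are delivered without delay, and only sustained churn pushes the loop down to roughly one request every 15 seconds, unlike a fixed sleep on every iteration.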