Rework a very hot, very sensitive part of snappy to reduce the number
of instructions, the number of dynamic branches, and avoid a particular
loop structure that LLVM has a very hard time optimizing for this
particular case.

The code being changed is part of the hottest path for snappy
decompression. In the benchmarks for decompressing protocol buffers,
this has proven to be amazingly sensitive to the slightest changes in
code layout. For example, previously we added a '.p2align 5' assembly
directive to the code. This essentially padded the loop out to a 32-byte
boundary, separating it from the rest of the function. Merely by doing
this we saw significant performance improvements.
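
For reference, the alignment hack being removed looked like this (it
appears verbatim in the final hunk of the diff below):

    #if defined(__clang__) && defined(__x86_64__)
    asm volatile (".p2align 5");
    #endif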

As a consequence, several of the compiler's typically reasonable
optimizations can have surprisingly bad impacts. Loop unrolling is a
primary culprit, but in the next LLVM release we are seeing an issue due
primary culprit, but in the next LLVM release we are seeing an issue due
to loop rotation. While some of the problems caused by the newly
triggered loop rotation in LLVM can be mitigated with ongoing work on
LLVM's code layout optimizations (specifically, loop header cloning),
that is a fairly long-term project. And even minor fluctuations in how
that subsequent optimization is performed may prevent us from gaining
the performance back.

For now, we need some way to unblock the next LLVM release which
contains a generic improvement to the LLVM loop optimizer that enables
loop rotation in more places, but uncovers this sensitivity and weakness
in a particular case.

This CL restructures the loop to be much simpler. Specifically, we
eagerly test what the terminal condition of the loop will be and provide
two versions of the copy loop that each use a single loop predicate.

The comments in the source code and benchmarks indicate that only one of
these two cases is actually hot: we expect to generally have enough slop
in the buffer. That in turn allows us to generate a much simpler branch
and loop structure for the hot path (especially for the protocol buffer
decompression benchmark).
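
In outline, the restructured copy looks like the following sketch, which
is simplified from the actual change in the diff below (the real code
also carries unrolling pragmas and explanatory comments):

    if (SNAPPY_PREDICT_TRUE(op_limit <= buf_limit - 16)) {
      // Hot path: op_limit is the gating factor, so it is the only
      // predicate the loop needs.
      ptrdiff_t op_to_src = src - op;
      do {
        UnalignedCopy64(op + op_to_src, op);
        UnalignedCopy64(op + op_to_src + 8, op + 8);
        op += 16;
      } while (op < op_limit);
      return op_limit;
    }
    // Cold path: copy only as far as the slop in the buffer allows and
    // finish with the remaining, slower copies.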

However, structuring even this simple loop in a way that doesn't trigger
some other performance bubble (often a more severe one) is quite
challenging. We have to carefully manage the variables used in the loop
and the addressing pattern. We should teach LLVM how to do this
reliably, but that too is a *much* more significant undertaking, and it
is extremely rare for a single loop to matter to this degree. The
desired structure of the loop is shown below with IACA's analysis for
the Broadwell micro-architecture (HSW and SKX are similar):

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1-0x8]
|   2^   |           |     |           | 0.4       | 1.0 |     |     | 0.6 |    | mov qword ptr [rdi], rcx
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1]
|   2^   |           |     | 0.3       |           | 1.0 |     |     | 0.7 |    | mov qword ptr [rdi+0x8], rcx
|   1    | 0.5       |     |           |           |     | 0.5 |     |     |    | add rdi, 0x10
|   1    | 0.2       |     |           |           |     |     | 0.8 |     |    | cmp rdi, rax
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xffffffffffffffe9

Specifically, arranging the addressing modes for the stores so that
they micro-op fuse (indicated by the `^` on the `2` micro-op count) is
important for achieving good throughput in this loop.
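
For anyone reproducing the table: IACA analyzes the code between its
marker macros, so a minimal sketch (assuming Intel's iacaMarks.h header
with its IACA_START/IACA_END macros, and using memcpy as a stand-in for
snappy's UnalignedCopy64) looks roughly like this; a run such as
`iaca -arch BDW` on the compiled object then reports the table above:

    #include <cstddef>      // std::ptrdiff_t
    #include <cstring>      // std::memcpy
    #include "iacaMarks.h"  // assumed: Intel's IACA_START / IACA_END markers

    // Stand-in for snappy's UnalignedCopy64: an 8-byte unaligned copy.
    static inline void UnalignedCopy64(const void* src, void* dst) {
      std::memcpy(dst, src, 8);
    }

    // The hot copy loop, marked for analysis.
    char* AnalyzedCopy(char* op, char* const op_limit,
                       std::ptrdiff_t op_to_src) {
      do {
        IACA_START  // begin the analyzed region (inside the loop body)
        UnalignedCopy64(op + op_to_src, op);
        UnalignedCopy64(op + op_to_src + 8, op + 8);
        op += 16;
      } while (op < op_limit);
      IACA_END      // end the analyzed region
      return op_limit;
    }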

The other thing necessary to make this change effective is to remove our
previous hack using `.p2align 5` to pad out the main decompression loop,
and to forcibly disable loop unrolling for critical loops. Because this
change simplifies the loop structure, more unrolling opportunities show
up. Also, the next LLVM release's generic loop optimization improvements
allow unrolling in more places, requiring still more disabling of
unrolling in this change.  Perhaps most surprising of these is that we
must disable loop unrolling in the *slow* path. While unrolling there
seems pointless, it should also be harmless.  This cold code is laid out
very far away from all of the hot code. All the samples shown in a
profile of the benchmark occur before this loop in the function. And
yet, if the loop gets unrolled (which seems to only happen reliably with
the next LLVM release) we see a nearly 20% regression in decompressing
protocol buffers!
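
The mechanism used to forcibly disable the unrolling is the clang loop
pragma, guarded so that other compilers simply ignore it; this is how it
is applied to the slow-path byte copy in the first hunk of the diff:

    #ifdef __clang__
    #pragma clang loop unroll(disable)
    #endif
    while (op < op_limit) {
      *op++ = *src++;
    }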

With the current release of LLVM, we still observe some regression from
this source change, but it is fairly small (5% on decompressing protocol
buffers, less elsewhere). And with the next LLVM release it drops to
under 1% even in that case. Meanwhile, without this change, the next
release of LLVM will regress decompressing protocol buffers by more than
10%.
commit 4aba5426d4 (parent 26102a0c66)
Author: chandlerc, 2017-12-21 20:51:07 -08:00 (committed by Victor Costan)
1 changed file with 49 additions and 11 deletions

@@ -127,6 +127,11 @@ void UnalignedCopy128(const void* src, void* dst) {
 // Note that this does not match the semantics of either memcpy() or memmove().
 inline char* IncrementalCopySlow(const char* src, char* op,
                                  char* const op_limit) {
+  // TODO: Remove pragma when LLVM is aware this function is only called in
+  // cold regions and when cold regions don't get vectorized or unrolled.
+#ifdef __clang__
+#pragma clang loop unroll(disable)
+#endif
   while (op < op_limit) {
     *op++ = *src++;
   }
@@ -144,6 +149,7 @@ inline char* IncrementalCopy(const char* src, char* op, char* const op_limit,
   // pat = op - src
   // len = limit - op
   assert(src < op);
+  assert(op <= op_limit);
   assert(op_limit <= buf_limit);
   // NOTE: The compressor always emits 4 <= len <= 64. It is ok to assume that
   // to optimize this function but we have to also handle these cases in case
@@ -202,13 +208,52 @@ inline char* IncrementalCopy(const char* src, char* op, char* const op_limit,
   // UnalignedCopy128 might overwrite data in op. UnalignedCopy64 is safe
   // because expanding the pattern to at least 8 bytes guarantees that
   // op - src >= 8.
-  while (op <= buf_limit - 16) {
+  //
+  // Typically, the op_limit is the gating factor so try to simplify the loop
+  // based on that.
+  if (SNAPPY_PREDICT_TRUE(op_limit <= buf_limit - 16)) {
+    // Factor the displacement from op to the source into a variable. This helps
+    // simplify the loop below by only varying the op pointer which we need to
+    // test for the end. Note that this was done after carefully examining the
+    // generated code to allow the addressing modes in the loop below to
+    // maximize micro-op fusion where possible on modern Intel processors. The
+    // generated code should be checked carefully for new processors or with
+    // major changes to the compiler.
+    // TODO: Simplify this code when the compiler reliably produces the correct
+    // x86 instruction sequence.
+    ptrdiff_t op_to_src = src - op;
+
+    // The trip count of this loop is not large and so unrolling will only hurt
+    // code size without helping performance.
+    //
+    // TODO: Replace with loop trip count hint.
+#ifdef __clang__
+#pragma clang loop unroll(disable)
+#endif
+    do {
+      UnalignedCopy64(op + op_to_src, op);
+      UnalignedCopy64(op + op_to_src + 8, op + 8);
+      op += 16;
+    } while (op < op_limit);
+
+    return op_limit;
+  }
+
+  // Fall back to doing as much as we can with the available slop in the
+  // buffer. This code path is relatively cold however so we save code size by
+  // avoiding unrolling and vectorizing.
+  //
+  // TODO: Remove pragma when cold regions don't get vectorized or
+  // unrolled.
+#ifdef __clang__
+#pragma clang loop unroll(disable)
+#endif
+  for (char *op_end = buf_limit - 16; op < op_end; op += 16, src += 16) {
     UnalignedCopy64(src, op);
     UnalignedCopy64(src + 8, op + 8);
-    src += 16;
-    op += 16;
+    if (SNAPPY_PREDICT_TRUE(op >= op_limit)) return op_limit;
   }
-  if (op >= op_limit) return op_limit;
+
   // We only take this branch if we didn't have enough slop and we can do a
   // single 8 byte copy.
   if (SNAPPY_PREDICT_FALSE(op <= buf_limit - 8)) {
@@ -685,13 +730,6 @@ class SnappyDecompressor {
     }
     MAYBE_REFILL();
-    // Add loop alignment directive. Without this directive, we observed
-    // significant performance degradation on several intel architectures
-    // in snappy benchmark built with LLVM. The degradation was caused by
-    // increased branch miss prediction.
-#if defined(__clang__) && defined(__x86_64__)
-    asm volatile (".p2align 5");
-#endif
     for ( ;; ) {
       const unsigned char c = *(reinterpret_cast<const unsigned char*>(ip++));