Rework a very hot, very sensitive part of snappy to reduce the number
of instructions, the number of dynamic branches, and avoid a particular
loop structure that LLVM has a very hard time optimizing for this
particular case.

The code being changed is part of the hottest path for snappy
decompression. In the benchmarks for decompressing protocol buffers,
this has proven to be amazingly sensitive to the slightest changes in
code layout. For example, previously we added a '.p2align 5' assembly
directive to the code. This essentially padded the loop out to a 32-byte
boundary, separating it from the rest of the function. Merely by doing
this we saw significant performance improvements.
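
For reference, the alignment hack being removed looked like this (it
appears verbatim in the final hunk of the diff below):

    #if defined(__clang__) && defined(__x86_64__)
    asm volatile (".p2align 5");
    #endif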

As a consequence, several of the compiler's typically reasonable
optimizations can have surprisingly bad impacts. Loop unrolling is a
primary culprit, but in the next LLVM release we are seeing an issue due
primary culprit, but in the next LLVM release we are seeing an issue due
to loop rotation. While some of the problems caused by the newly
triggered loop rotation in LLVM can be mitigated with ongoing work on
LLVM's code layout optimizations (specifically, loop header cloning),
that is a fairly long-term project. And even minor fluctuations in how
that subsequent optimization is performed may prevent us from gaining
the performance back.

For now, we need some way to unblock the next LLVM release which
contains a generic improvement to the LLVM loop optimizer that enables
loop rotation in more places, but uncovers this sensitivity and weakness
in a particular case.

This CL restructures the loop to be much simpler. Specifically, we
eagerly test what the terminal condition of the loop will be and provide
two versions of the copy loop that each use a single loop predicate.

The comments in the source code and benchmarks indicate that only one of
these two cases is actually hot: we expect to generally have enough slop
in the buffer. That in turn allows us to generate a much simpler branch
and loop structure for the hot path (especially for the protocol buffer
decompression benchmark).
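
In outline, the restructured copy looks like the following sketch, which
is simplified from the actual change in the diff below (the real code
also carries unrolling pragmas and explanatory comments):

    if (SNAPPY_PREDICT_TRUE(op_limit <= buf_limit - 16)) {
      // Hot path: op_limit is the gating factor, so it is the only
      // predicate the loop needs.
      ptrdiff_t op_to_src = src - op;
      do {
        UnalignedCopy64(op + op_to_src, op);
        UnalignedCopy64(op + op_to_src + 8, op + 8);
        op += 16;
      } while (op < op_limit);
      return op_limit;
    }
    // Cold path: copy only as far as the slop in the buffer allows and
    // finish with the remaining, slower copies.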

However, structuring even this simple loop in a way that doesn't trigger
some other performance bubble (often a more severe one) is quite
challenging. We have to carefully manage the variables used in the loop
and the addressing pattern. We should teach LLVM how to do this
reliably, but that too is a *much* more significant undertaking, and it
is extremely rare for a single loop to matter to this degree. The
desired structure of the loop is shown below with IACA's analysis for
the Broadwell micro-architecture (HSW and SKX are similar):

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1-0x8]
|   2^   |           |     |           | 0.4       | 1.0 |     |     | 0.6 |    | mov qword ptr [rdi], rcx
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1]
|   2^   |           |     | 0.3       |           | 1.0 |     |     | 0.7 |    | mov qword ptr [rdi+0x8], rcx
|   1    | 0.5       |     |           |           |     | 0.5 |     |     |    | add rdi, 0x10
|   1    | 0.2       |     |           |           |     |     | 0.8 |     |    | cmp rdi, rax
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xffffffffffffffe9

Specifically, arranging the addressing modes for the stores so that
they micro-op fuse (indicated by the `^` on the `2` micro-op count) is
important for achieving good throughput in this loop.
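
For anyone reproducing the table: IACA analyzes the code between its
marker macros, so a minimal sketch (assuming Intel's iacaMarks.h header
with its IACA_START/IACA_END macros, and using memcpy as a stand-in for
snappy's UnalignedCopy64) looks roughly like this; a run such as
`iaca -arch BDW` on the compiled object then reports the table above:

    #include <cstddef>      // std::ptrdiff_t
    #include <cstring>      // std::memcpy
    #include "iacaMarks.h"  // assumed: Intel's IACA_START / IACA_END markers

    // Stand-in for snappy's UnalignedCopy64: an 8-byte unaligned copy.
    static inline void UnalignedCopy64(const void* src, void* dst) {
      std::memcpy(dst, src, 8);
    }

    // The hot copy loop, marked for analysis.
    char* AnalyzedCopy(char* op, char* const op_limit,
                       std::ptrdiff_t op_to_src) {
      do {
        IACA_START  // begin the analyzed region (inside the loop body)
        UnalignedCopy64(op + op_to_src, op);
        UnalignedCopy64(op + op_to_src + 8, op + 8);
        op += 16;
      } while (op < op_limit);
      IACA_END      // end the analyzed region
      return op_limit;
    }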

The other thing necessary to make this change effective is to remove our
previous hack using `.p2align 5` to pad out the main decompression loop,
and to forcibly disable loop unrolling for critical loops. Because this
change simplifies the loop structure, more unrolling opportunities show
up. Also, the next LLVM release's generic loop optimization improvements
allow unrolling in more places, requiring still more disabling of
unrolling in this change.  Perhaps most surprising of these is that we
must disable loop unrolling in the *slow* path. While unrolling there
seems pointless, it should also be harmless.  This cold code is laid out
very far away from all of the hot code. All the samples shown in a
profile of the benchmark occur before this loop in the function. And
yet, if the loop gets unrolled (which seems to only happen reliably with
the next LLVM release) we see a nearly 20% regression in decompressing
protocol buffers!
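
The mechanism used to forcibly disable the unrolling is the clang loop
pragma, guarded so that other compilers simply ignore it; this is how it
is applied to the slow-path byte copy in the first hunk of the diff:

    #ifdef __clang__
    #pragma clang loop unroll(disable)
    #endif
    while (op < op_limit) {
      *op++ = *src++;
    }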

With the current release of LLVM, we still observe some regression from
this source change, but it is fairly small (5% on decompressing protocol
buffers, less elsewhere). And with the next LLVM release it drops to
under 1% even in that case. Meanwhile, without this change, the next
release of LLVM will regress decompressing protocol buffers by more than
10%.
commit 4aba5426d4 (parent 26102a0c66)
Author: chandlerc, 2017-12-21 20:51:07 -08:00 (committed by Victor Costan)
1 changed file with 49 additions and 11 deletions

@@ -127,6 +127,11 @@ void UnalignedCopy128(const void* src, void* dst) {
 // Note that this does not match the semantics of either memcpy() or memmove().
 inline char* IncrementalCopySlow(const char* src, char* op,
                                  char* const op_limit) {
+  // TODO: Remove pragma when LLVM is aware this function is only called in
+  // cold regions and when cold regions don't get vectorized or unrolled.
+#ifdef __clang__
+#pragma clang loop unroll(disable)
+#endif
   while (op < op_limit) {
     *op++ = *src++;
   }
@@ -144,6 +149,7 @@ inline char* IncrementalCopy(const char* src, char* op, char* const op_limit,
   // pat = op - src
   // len = limit - op
   assert(src < op);
+  assert(op <= op_limit);
   assert(op_limit <= buf_limit);
   // NOTE: The compressor always emits 4 <= len <= 64. It is ok to assume that
   // to optimize this function but we have to also handle these cases in case
@@ -202,13 +208,52 @@ inline char* IncrementalCopy(const char* src, char* op, char* const op_limit,
   // UnalignedCopy128 might overwrite data in op. UnalignedCopy64 is safe
   // because expanding the pattern to at least 8 bytes guarantees that
   // op - src >= 8.
-  while (op <= buf_limit - 16) {
+  //
+  // Typically, the op_limit is the gating factor so try to simplify the loop
+  // based on that.
+  if (SNAPPY_PREDICT_TRUE(op_limit <= buf_limit - 16)) {
+    // Factor the displacement from op to the source into a variable. This helps
+    // simplify the loop below by only varying the op pointer which we need to
+    // test for the end. Note that this was done after carefully examining the
+    // generated code to allow the addressing modes in the loop below to
+    // maximize micro-op fusion where possible on modern Intel processors. The
+    // generated code should be checked carefully for new processors or with
+    // major changes to the compiler.
+    // TODO: Simplify this code when the compiler reliably produces the correct
+    // x86 instruction sequence.
+    ptrdiff_t op_to_src = src - op;
+
+    // The trip count of this loop is not large and so unrolling will only hurt
+    // code size without helping performance.
+    //
+    // TODO: Replace with loop trip count hint.
+#ifdef __clang__
+#pragma clang loop unroll(disable)
+#endif
+    do {
+      UnalignedCopy64(op + op_to_src, op);
+      UnalignedCopy64(op + op_to_src + 8, op + 8);
+      op += 16;
+    } while (op < op_limit);
+
+    return op_limit;
+  }
+
+  // Fall back to doing as much as we can with the available slop in the
+  // buffer. This code path is relatively cold however so we save code size by
+  // avoiding unrolling and vectorizing.
+  //
+  // TODO: Remove pragma when cold regions don't get vectorized or
+  // unrolled.
+#ifdef __clang__
+#pragma clang loop unroll(disable)
+#endif
+  for (char *op_end = buf_limit - 16; op < op_end; op += 16, src += 16) {
     UnalignedCopy64(src, op);
     UnalignedCopy64(src + 8, op + 8);
-    src += 16;
-    op += 16;
+    if (SNAPPY_PREDICT_TRUE(op >= op_limit)) return op_limit;
   }
-  if (op >= op_limit) return op_limit;
+
   // We only take this branch if we didn't have enough slop and we can do a
   // single 8 byte copy.
   if (SNAPPY_PREDICT_FALSE(op <= buf_limit - 8)) {
@@ -685,13 +730,6 @@ class SnappyDecompressor {
     }
     MAYBE_REFILL();
-    // Add loop alignment directive. Without this directive, we observed
-    // significant performance degradation on several intel architectures
-    // in snappy benchmark built with LLVM. The degradation was caused by
-    // increased branch miss prediction.
-#if defined(__clang__) && defined(__x86_64__)
-    asm volatile (".p2align 5");
-#endif
     for ( ;; ) {
       const unsigned char c = *(reinterpret_cast<const unsigned char*>(ip++));