First of all, I'm sorry about this ugly hack. I hope the following long
explanation is enough to justify it.
We have observed that, in some conditions, the results for dataset number 10
(pb) in the zippy benchmark can show a >20% regression on Skylake CPUs.
In order to diagnose this, we profiled the benchmark looking at hot functions
(99% of the time is spent on DecompressAllTags), then looked at the generated
code to see if there was any difference. To rule out a minor difference we
observed in register allocation, we replaced zippy.cc with a pre-built assembly
file so that it was identical in both variants, and we were still able to
reproduce the regression.
Having ruled out a regression caused by the compiler, we dug a bit further
and noticed that the alignment of the function in the final binary was
different. Both were aligned to a 16-byte boundary, but the slower one was also
(by chance) aligned to a 32-byte boundary. A regression caused by alignment
differences would explain why I could reproduce it consistently on the same CitC
client but not on others: slight differences in the sources can cause the
resulting binary to have a different layout.
Here are some detailed benchmark results before/after the fix. Note how fixing
the alignment makes the difference between baseline and experiment go away, but
regular 32-byte alignment puts both variants in the same ballpark as the
original regression:
Original (note BM_UCord/10 and BM_UDataBuffer/10, around the -24% mark;
columns are time, CPU time, iterations, and throughput):
BASELINE
BM_UCord/10 2938 2932 24194 3.767GB/s pb
BM_UDataBuffer/10 3008 3004 23316 3.677GB/s pb
EXPERIMENT
BM_UCord/10 3797 3789 18512 2.915GB/s pb
BM_UDataBuffer/10 4024 4016 17543 2.750GB/s pb
Aligning DecompressAllTags to a 32-byte boundary:
BASELINE
BM_UCord/10 3872 3862 18035 2.860GB/s pb
BM_UDataBuffer/10 4010 3998 17591 2.763GB/s pb
EXPERIMENT
BM_UCord/10 3884 3876 18126 2.850GB/s pb
BM_UDataBuffer/10 4037 4027 17199 2.743GB/s pb
Aligning DecompressAllTags to a 32-byte boundary + 16 bytes (this patch):
BASELINE
BM_UCord/10 3103 3095 22642 3.569GB/s pb
BM_UDataBuffer/10 3186 3177 21947 3.476GB/s pb
EXPERIMENT
BM_UCord/10 3104 3095 22632 3.569GB/s pb
BM_UDataBuffer/10 3167 3159 22076 3.496GB/s pb
This change forces the "good" alignment for DecompressAllTags, which, if
anything, should make benchmark results more stable (and maybe we'll improve
some unlucky application!).
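For reference, a quick way to check the alignment hypothesis is to look at the
low bits of the function's address; the sketch below is illustrative only
(HotLoop is a hypothetical stand-in, since the real DecompressAllTags is an
internal member function and cannot be taken by address this simply):

  #include <cstdint>
  #include <cstdio>

  // Hypothetical stand-in for the hot function whose placement we care about.
  extern "C" void HotLoop() {}

  int main() {
    const auto addr = reinterpret_cast<std::uintptr_t>(&HotLoop);
    // 32-byte aligned means the low 5 address bits are zero; the "good"
    // placement in this change is a 32-byte boundary plus 16 bytes,
    // i.e. addr % 32 == 16.
    std::printf("HotLoop at %#lx, addr %% 32 = %lu\n",
                static_cast<unsigned long>(addr),
                static_cast<unsigned long>(addr % 32));
    return 0;
  }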
This change reworks a very hot, very sensitive part of the code to reduce the
number of instructions, the number of dynamic branches, and to avoid a
particular loop structure that LLVM has a very hard time optimizing for this
particular case.
The code being changed is part of the hottest path for snappy
decompression. In the benchmarks for decompressing protocol buffers,
this has proven to be amazingly sensitive to the slightest changes in
code layout. For example, we previously added a '.p2align 5' assembly
directive to the code, which essentially padded the loop out from the
function. Merely by doing this we saw significant performance
improvements.
As a consequence, several of the compiler's typically reasonable
optimizations can have surprisingly bad impacts. Loop unrolling is a
primary culprit, but in the next LLVM release we are seeing an issue due
to loop rotation. While some of the problems caused by the newly
triggered loop rotation in LLVM can be mitigated with ongoing work on
LLVM's code layout optimizations (specifically, loop header cloning),
that is a fairly long term project. And even minor fluctuations in how
that subsequent optimization is performed may prevent gaining the
performance back.
For now, we need some way to unblock the next LLVM release which
contains a generic improvement to the LLVM loop optimizer that enables
loop rotation in more places, but uncovers this sensitivity and weakness
in a particular case.
This CL restructures the loop to have a simpler structure. Specifically,
we eagerly test what the terminal condition will be and provide two
versions of the copy loop that use a single loop predicate.
The comments in the source code and benchmarks indicate that only one of
these two cases is actually hot: we expect to generally have enough slop
in the buffer. That in turn allows us to generate a much simpler branch
and loop structure for the hot path (especially for the protocol buffer
decompression benchmark).
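As a rough illustration of the restructuring (a sketch only, not the actual
zippy source; the names, chunk size, and the offset >= 16 precondition are
assumptions made for the example), the idea is to decide up front which case we
are in, so that the hot path is a single simple loop with one predicate:

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  // Copies `len` > 0 bytes from `src` to `dst`, where `dst - src` is assumed
  // to be at least 16 (smaller offsets would need separate handling) and
  // `buf_end` marks the end of the writable buffer.
  inline char* IncrementalCopySketch(const char* src, char* dst,
                                     std::size_t len, char* buf_end) {
    char* const dst_end = dst + len;
    if (dst_end + 16 <= buf_end) {
      // Hot case: enough slop past dst_end to copy 16 bytes at a time with a
      // single loop predicate and no per-iteration bounds checks.
      do {
        std::uint64_t a, b;
        std::memcpy(&a, src, 8);
        std::memcpy(&b, src + 8, 8);
        std::memcpy(dst, &a, 8);
        std::memcpy(dst + 8, &b, 8);
        src += 16;
        dst += 16;
      } while (dst < dst_end);
    } else {
      // Cold case: near the end of the buffer, copy one byte at a time.
      while (dst < dst_end) *dst++ = *src++;
    }
    return dst_end;
  }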
However, structuring even this simple loop in a way that doesn't trigger
some other performance bubble (often a more severe one) is quite
challenging. We have to carefully manage the variables used in the loop
and the addressing pattern. We should teach LLVM how to do this
reliably, but that too is a *much* more significant undertaking, and it is
extremely rare for a loop to matter this much. The desired structure of the
loop, as shown by IACA's analysis for the Broadwell microarchitecture (HSW and
SKX are similar), is:
| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1-0x8]
|   2^   |           |     |           | 0.4       | 1.0 |     |     | 0.6 |    | mov qword ptr [rdi], rcx
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1]
|   2^   |           |     | 0.3       |           | 1.0 |     |     | 0.7 |    | mov qword ptr [rdi+0x8], rcx
|   1    | 0.5       |     |           |           |     | 0.5 |     |     |    | add rdi, 0x10
|   1    | 0.2       |     |           |           |     |     | 0.8 |     |    | cmp rdi, rax
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xffffffffffffffe9
Specifically, arranging the addressing modes for the stores so that micro-op
fusion occurs (indicated by the `^` on the `2` micro-op count) is important to
achieve good throughput for this loop.
The other thing necessary to make this change effective is to remove our
previous hack using `.p2align 5` to pad out the main decompression loop,
and to forcibly disable loop unrolling for critical loops. Because this
change simplifies the loop structure, more unrolling opportunities show
up. Also, the next LLVM release's generic loop optimization improvements
allow unrolling in more places, requiring still more disabling of
unrolling in this change. Perhaps most surprising of these is that we
must disable loop unrolling in the *slow* path. While unrolling there
seems pointless, it should also be harmless. This cold code is laid out
very far away from all of the hot code. All the samples shown in a
profile of the benchmark occur before this loop in the function. And
yet, if the loop gets unrolled (which seems to only happen reliably with
the next LLVM release) we see a nearly 20% regression in decompressing
protocol buffers!
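For concreteness, one way to forcibly disable unrolling for a specific loop
with Clang is a loop pragma; the snippet below is a sketch of the mechanism on
a made-up loop, not the actual zippy source:

  // Ask Clang not to unroll this particular loop. GCC does not understand the
  // clang pragma, so it is guarded.
  void CopyBytesSketch(char* dst, const char* src, int n) {
  #ifdef __clang__
  #pragma clang loop unroll(disable)
  #endif
    for (int i = 0; i < n; ++i) {
      dst[i] = src[i];
    }
  }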
With the current release of LLVM, we still observe some regression from
this source change, but it is fairly small (5% on decompressing protocol
buffers, less elsewhere). And with the next LLVM release it drops to
under 1% even in that case. Meanwhile, without this change, the next
release of LLVM will regress decompressing protocol buffers by more than
10%.
We found that the upstream LLVM change rL310792 degraded the zippy benchmark by
~3%. Performance analysis showed that the regression was caused by a side
effect: the incidental change in loop alignment (from 32 bytes to 16 bytes) led
to an increase in branch mispredictions and caused the regression. The
regression was reproducible on several Intel microarchitectures, including
Sandy Bridge, Haswell, and Skylake. Sadly, we still don't have a good
understanding of the internals of the Intel branch predictor and cannot explain
why branch mispredictions increase when the loop alignment changes, so we
cannot make a real fix here. The workaround in this patch is to add a directive
that aligns the hot loop to 32 bytes, which restores the performance. This is
needed in order to unblock flipping the default compiler to LLVM.
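A minimal sketch of this kind of directive-based workaround (assuming GCC/Clang
inline assembly on x86-64; the real patch may differ in placement and guards):
an empty asm statement can emit a .p2align directive just before the hot loop
so the assembler pads the loop head to a 32-byte boundary.

  // Emit an assembler directive at this point in the instruction stream so
  // the following loop head is padded to a 32-byte boundary.
  int SumSketch(const int* v, int n) {
    int total = 0;
  #if defined(__GNUC__) && defined(__x86_64__)
    asm volatile(".p2align 5");
  #endif
    for (int i = 0; i < n; ++i) {
      total += v[i];
    }
    return total;
  }

Note that the compiler is free to move code around the asm statement, so the
directive is not guaranteed to land exactly on the loop head; this is part of
why it is a workaround rather than a real fix.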
This is modeled after https://github.com/google/googletest/pull/1160.
The immediate benefit is fixing the library install paths on 64-bit
Linux distributions, which tend to support running 32-bit and 64-bit
code side by side by installing 32-bit libraries in /usr/lib and 64-bit
libraries in /usr/lib64.
getpagesize(), as well as its POSIX.1-2001 replacement
sysconf(_SC_PAGESIZE), is defined in <unistd.h>. On Linux and OS X,
including <sys/mman.h> is sufficient to get a definition for
getpagesize(). However, this is not true for the Android NDK. This CL
brings back the HAVE_UNISTD_H definition and its associated header
check.
This also adds a HAVE_FUNC_SYSCONF definition, which checks for the
presence of sysconf(). The definition can be used later to replace
getpagesize() with sysconf().
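A sketch of how the two macros could be used together (the macro names match
this CL; the function below is a hypothetical helper, not the actual test
code):

  #if HAVE_UNISTD_H
  #include <unistd.h>
  #endif

  // Prefer the POSIX.1-2001 interface when the build checks found it,
  // otherwise fall back to the legacy getpagesize().
  long GetSystemPageSizeSketch() {
  #if HAVE_FUNC_SYSCONF
    return sysconf(_SC_PAGESIZE);
  #else
    return getpagesize();
  #endif
  }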
The style was changed to match the official manual [1], the install
configuration was simplified and now matches the official packaging
guide [2], and the config files use the CMake-specific variable syntax
${VAR} instead of the autoconf-compatible syntax @VAR@, as documented in
[3]. The public header files are declared as such (for CMake 3.3+), and
the generated headers are included in the library target definition.
The tests are only built if SNAPPY_BUILD_TESTS (default ON) is true, so
zippy can be easily used in projects that add_subdirectory() its source
code directly, instead of using find_package().
[1] https://cmake.org/cmake/help/git-master/manual/cmake-language.7.html
[2] https://cmake.org/cmake/help/git-master/manual/cmake-packages.7.html
[3] https://cmake.org/cmake/help/git-master/command/configure_file.html
This CL fixes 64-bit Windows testing (), makes it possible to view the
test output in the Travis / AppVeyor CI console while the test is
running, and takes advantage of the new support for the .appveyor.yml
file name to make the CI configuration less obtrusive.
snappy-stubs-public.h defined the DISALLOW_COPY_AND_ASSIGN macro, so the
definition propagated to all translation units that included the open
source headers. The macro is now inlined, thus avoiding polluting the
macro environment of snappy users.
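Concretely, inlining the macro means spelling out its expansion directly in
each class instead of defining the macro in a public header; a schematic
example with a made-up class name, using the C++11 deleted-member spelling of
the same intent:

  class PackedByteSink {
   public:
    PackedByteSink() = default;

    // Previously: DISALLOW_COPY_AND_ASSIGN(PackedByteSink);
    PackedByteSink(const PackedByteSink&) = delete;
    PackedByteSink& operator=(const PackedByteSink&) = delete;
  };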
Unused macros: HAVE_DLFCN_H, HAVE_INTTYPES_H, HAVE_MEMORY_H,
HAVE_STDLIB_H, HAVE_STRINGS_H, HAVE_STRING_H, HAVE_SYS_BYTESWAP_H,
HAVE_SYS_STAT_H, HAVE_SYS_TYPES_H, HAVE_UNISTD_H.
Used but never set macros: HAVE_LIBLZF, HAVE_LIBQUICKLZ. These only gate
conditional includes. The code that takes advantage of them was removed.
Unused types: ssize_t.
The testing code uses HAVE_FUNC_MMAP, which was not wired in the CMake
build, causing a whole test to be skipped.
This change works around an optimization in LLVM that converts const stack
arrays to global arrays. It is a temporary change and should be reverted when
https://reviews.llvm.org/D30759 is fixed.
With PIE, accessing stack arrays is more efficient than accessing global
arrays, which is why wordmask was moved to the stack. However, the LLVM
compiler automatically converts stack arrays it detects as constant into global
arrays, and this transformation hurts PIE performance.
We are working to fix this in the LLVM compiler, via
https://reviews.llvm.org/D30759, to not do this conversion in PIE mode. Until
this patch is finished, please consider this source change a temporary
workaround to keep this array on the stack. This source change is important
to allow some projects to flip the default compiler from GCC to LLVM for
optimized builds.
This change works for the following reason: the LLVM compiler does not convert
non-const stack arrays to global arrays, and explicitly copying the elements is
enough to make the compiler treat this as a non-const array.
With GCC, this change does not affect code-gen in any significant way. The
array initialization code is slightly different as it copies the constants
directly to the stack.
With LLVM, this keeps the array on the stack.
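The mechanism looks roughly like the following sketch (simplified; the values
are the usual zippy wordmask constants, but the function and its signature are
made up for the example):

  #include <cstdint>

  // Writing `static const std::uint32_t wordmask[] = {...}` would let the
  // compiler keep the data as a global. Building a non-const local array by
  // copying the constants in keeps it on the stack, because LLVM does not
  // promote a non-const stack array to a global.
  std::uint32_t MaskLowBytesSketch(std::uint32_t value, int n_bytes /* 0..4 */) {
    std::uint32_t wordmask[5];
    wordmask[0] = 0u;
    wordmask[1] = 0xffu;
    wordmask[2] = 0xffffu;
    wordmask[3] = 0xffffffu;
    wordmask[4] = 0xffffffffu;
    return value & wordmask[n_bytes];
  }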
No change in performance with GCC (within noise range). With LLVM, ~0.7%
improvement in optimized mode (no FDO) and ~1.75% improvement in FDO
mode.
Also, force inlining of util::compression::Sample().
The inlining change is necessary. Without it, even with FDO+LIPO the call
doesn't get inlined and uses 4 registers to construct parameters (which
won't be used in the common case). In some of the more compute-bound
tests that causes extra spills and significant overhead (even if the
sampling interval is sufficiently long).
For example, with inlining:
BM_UFlat/0 32.7µs ± 1% 33.1µs ± 1% +1.41%
without:
BM_UFlat/0 32.7µs ± 1% 37.7µs ± 1% +15.29%
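A common way to request this kind of forced inlining with GCC/Clang is the
always_inline attribute; the sketch below only shows the mechanism (the body
and signature of the real util::compression::Sample() are not reproduced here):

  #include <cstddef>

  // Force-inlined sampling hook so the common "not sampled" case does not pay
  // for parameter setup and register spills at the call site.
  __attribute__((always_inline)) inline bool ShouldSampleSketch(
      std::size_t length, std::size_t* bytes_until_sample) {
    if (length < *bytes_until_sample) {
      *bytes_until_sample -= length;  // Hot path: no sample taken.
      return false;
    }
    return true;  // Cold path: the caller records a sample.
  }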
This change moves the wordmask array onto the stack to avoid the cost of PIE
global array access.
With PIE, accessing global arrays needs two instructions whereas it can be
done with a single instruction without PIE. See []
For example, without PIE the access looks like:
mov 0x400780(,%rdi,4),%eax // One instruction to access arr[i]
and with PIE the access looks like:
lea 0x149(%rip),%rax # 400780 <_ZL3arr>
mov (%rax,%rdi,4),%eax
This causes a slowdown in zippy, as it has two global arrays, wordmask and
char_table. With PIE there is no PC-relative addressing mode that can do this
in a single instruction.
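The two-instruction sequence above corresponds to an ordinary lookup in a
file-local constant array, roughly like this sketch (the array name and
contents are illustrative):

  #include <cstdint>

  // Without PIE the load can use an absolute base address in one instruction;
  // with PIE the compiler must first materialize the array address with a
  // RIP-relative lea (RIP-relative addressing cannot be combined with an index
  // register), giving the two-instruction sequence shown above.
  static const std::uint32_t kTable[8] = {0, 1, 2, 3, 4, 5, 6, 7};

  std::uint32_t LookupSketch(unsigned i) { return kTable[i & 7]; }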
The slow down can be seen as an increase in dynamic instruction count and
cycles with a similar IPC. We have seen this affect REDACTED recently and this
is causing a ~1% perf. slow down.
One mitigation technique for small arrays is to move the array onto the stack
and use the stack pointer to make the access a single instruction. The downside
is the extra instructions needed at function entry to move the array onto the
stack, which is why we only want to do this for small arrays. I tried moving
wordmask onto the stack since it is a small array. The performance numbers look
good overall. There is an improvement in the dynamic instruction count for
almost all BM_UFlat benchmarks. BM_UFlat/2 and BM_UFlat/3 are pretty noisy.
The only case where there is a regression is BM_UFlat/10. Here, the instruction
count does go down but the IPC also goes down affecting performance. This also
looks noisy but I do see a small IPC drop with this change. Otherwise, the
numbers look good and consistent. I measured this on a perflab ivybridge
machine multiple times. Numbers are given below. For Improv. (improvements),
positive is good.
Binaries built as: blaze build -c opt --dynamic_mode=off
Benchmark Base CPU(ns) Opt CPU(ns) Improv. Base Cycles Opt Cycles Improv. Base Insns Opt Insns Improv.
BM_UFlat/1 541711 537052 0.86% 46068129918 45442732684 1.36% 85113352848 83917656016 1.40%
BM_UFlat/2 6228 6388 -2.57% 582789808 583267855 -0.08% 1261517746 1261116553 0.03%
BM_UFlat/3 159 120 24.53% 61538641 58783800 4.48% 90008672 90980060 -1.08%
BM_UFlat/4 7878 7787 1.16% 710491888 703718556 0.95% 1914898283 1525060250 20.36%
BM_UFlat/5 208854 207673 0.57% 17640846255 17609530720 0.18% 36546983483 36008920788 1.47%
BM_UFlat/6 172595 167225 3.11% 14642082831 14232371166 2.80% 33647820489 33056659600 1.76%
BM_UFlat/7 152364 147901 2.93% 12904338645 12635220582 2.09% 28958390984 28457982504 1.73%
BM_UFlat/8 463764 448244 3.35% 39423576973 37917435891 3.82% 88350964483 86800265943 1.76%
BM_UFlat/9 639517 621811 2.77% 54275945823 52555988926 3.17% 119503172410 117432599704 1.73%
BM_UFlat/10 41929 42358 -1.02% 3593125535 3647231492 -1.51% 8559206066 8446526639 1.32%
BM_UFlat/11 174754 173936 0.47% 14885371426 14749410955 0.91% 36693421142 35987215897 1.92%
BM_UFlat/12 13388 13257 0.98% 1192648670 1179645044 1.09% 3506482177 3454962579 1.47%
BM_UFlat/13 6801 6588 3.13% 627960003 608367286 3.12% 1847877894 1818368400 1.60%
BM_UFlat/14 2057 1989 3.31% 229005588 217393157 5.07% 609686274 599419511 1.68%
BM_UFlat/15 831618 799881 3.82% 70440388955 67911853013 3.59% 167178603105 164653652416 1.51%
BM_UFlat/16 199 199 0.00% 70109081 68747579 1.94% 106263639 105569531 0.65%
BM_UFlat/17 279031 273890 1.84% 23361373312 23294246637 0.29% 40474834585 39981682217 1.22%
BM_UFlat/18 233 199 14.59% 74530664 67841101 8.98% 94305848 92271053 2.16%
BM_UFlat/19 26743 25309 5.36% 2327215133 2206712016 5.18% 6024314357 5935228694 1.48%
BM_UFlat/20 2731 2625 3.88% 282018757 276772813 1.86% 768382519 758277029 1.32%
Is this a reasonable workaround for the problem? Do you need more performance
measurements? haih@ is evaluating this change for [] and I will update those
numbers once we have them.
Tested:
Performance with zippy_unittest.