mirror of https://github.com/google/snappy.git
Speed up Zippy decompression in PIE mode by removing the penalty for
global array access.

With PIE, accessing global arrays takes two instructions, whereas it can be
done with a single instruction without PIE. See []

For example, without PIE the access looks like:

  mov 0x400780(,%rdi,4),%eax   // One instruction to access arr[i]

and with PIE the access looks like:

  lea 0x149(%rip),%rax         # 400780 <_ZL3arr>
  mov (%rax,%rdi,4),%eax

This slows down Zippy, as it has two global arrays, wordmask and char_table,
and with PIE there is no equivalent PC-relative instruction that can do the
access in one instruction. The slowdown shows up as an increase in dynamic
instruction count and cycles at a similar IPC. We have seen this affect
REDACTED recently, causing a ~1% performance slowdown.

One mitigation technique for small arrays is to copy the array onto the
stack and use the stack pointer to make each access a single instruction.
The downside is the extra instructions at function entry to copy the array
onto the stack, which is why we want to do this only for small arrays.

I tried moving wordmask onto the stack, since it is a small array. The
performance numbers look good overall: the dynamic instruction count
improves for almost all BM_UFlat benchmarks. BM_UFlat/2 and BM_UFlat/3 are
pretty noisy. The only regression is BM_UFlat/10: the instruction count
does go down, but the IPC also drops, hurting performance. This also looks
noisy, but I do see a small IPC drop with this change. Otherwise, the
numbers look good and consistent. I measured this on a perflab ivybridge
machine multiple times; numbers are given below. For Improv.
(improvements), positive is good.

Binaries built as: blaze build -c opt --dynamic_mode=off

Benchmark    Base CPU(ns)  Opt CPU(ns)  Improv.  Base Cycles   Opt Cycles    Improv.  Base Insns     Opt Insns      Improv.
BM_UFlat/1   541711        537052       0.86%    46068129918   45442732684   1.36%    85113352848    83917656016    1.40%
BM_UFlat/2   6228          6388         -2.57%   582789808     583267855     -0.08%   1261517746     1261116553     0.03%
BM_UFlat/3   159           120          24.53%   61538641      58783800      4.48%    90008672       90980060       -1.08%
BM_UFlat/4   7878          7787         1.16%    710491888     703718556     0.95%    1914898283     1525060250     20.36%
BM_UFlat/5   208854        207673       0.57%    17640846255   17609530720   0.18%    36546983483    36008920788    1.47%
BM_UFlat/6   172595        167225       3.11%    14642082831   14232371166   2.80%    33647820489    33056659600    1.76%
BM_UFlat/7   152364        147901       2.93%    12904338645   12635220582   2.09%    28958390984    28457982504    1.73%
BM_UFlat/8   463764        448244       3.35%    39423576973   37917435891   3.82%    88350964483    86800265943    1.76%
BM_UFlat/9   639517        621811       2.77%    54275945823   52555988926   3.17%    119503172410   117432599704   1.73%
BM_UFlat/10  41929         42358        -1.02%   3593125535    3647231492    -1.51%   8559206066     8446526639     1.32%
BM_UFlat/11  174754        173936       0.47%    14885371426   14749410955   0.91%    36693421142    35987215897    1.92%
BM_UFlat/12  13388         13257        0.98%    1192648670    1179645044    1.09%    3506482177     3454962579     1.47%
BM_UFlat/13  6801          6588         3.13%    627960003     608367286     3.12%    1847877894     1818368400     1.60%
BM_UFlat/14  2057          1989         3.31%    229005588     217393157     5.07%    609686274      599419511      1.68%
BM_UFlat/15  831618        799881       3.82%    70440388955   67911853013   3.59%    167178603105   164653652416   1.51%
BM_UFlat/16  199           199          0.00%    70109081      68747579      1.94%    106263639      105569531      0.65%
BM_UFlat/17  279031        273890       1.84%    23361373312   23294246637   0.29%    40474834585    39981682217    1.22%
BM_UFlat/18  233           199          14.59%   74530664      67841101      8.98%    94305848       92271053       2.16%
BM_UFlat/19  26743         25309        5.36%    2327215133    2206712016    5.18%    6024314357     5935228694     1.48%
BM_UFlat/20  2731          2625         3.88%    282018757     276772813     1.86%    768382519      758277029      1.32%

Is this a reasonable work-around for the problem? Do you need more
performance measurements? haih@ is evaluating this change for [] and I will
update those numbers once we have it.

Tested: Performance with zippy_unittest.
This commit is contained in:
parent
38a5ec5fca
commit
4a74094080
@@ -578,6 +578,10 @@ class SnappyDecompressor {
   template <class Writer>
   void DecompressAllTags(Writer* writer) {
     const char* ip = ip_;
+    // For position-independent executables, accessing global arrays can be
+    // slow. Move wordmask array onto the stack to mitigate this.
+    uint32 wordmask[sizeof(internal::wordmask)/sizeof(uint32)];
+    memcpy(wordmask, internal::wordmask, sizeof(wordmask));

     // We could have put this refill fragment only at the beginning of the loop.
     // However, duplicating it at the end of each branch gives the compiler more