Speed up Zippy decompression in PIE mode by removing the penalty for
global array access.

With PIE, accessing a global array needs two instructions, whereas it can be
done with a single instruction without PIE.  See [].
For example, without PIE the access looks like:
mov    0x400780(,%rdi,4),%eax  // One instruction to access arr[i]

and with PIE the access looks like:
lea    0x149(%rip),%rax        # 400780 <_ZL3arr>
mov    (%rax,%rdi,4),%eax
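
As a rough illustration (not code from the zippy sources; arr and Lookup are
made-up names), source like the following produces the two sequences above:

#include <cstdint>

// Illustrative internal-linkage table (shows up as a _ZL3arr-style symbol,
// as in the disassembly above).
static const uint32_t arr[256] = {1, 2, 3};

uint32_t Lookup(uint64_t i) {
  // Without PIE: one mov with the absolute address of arr folded in.
  // With PIE: a %rip-relative lea to materialize &arr, then the indexed mov.
  return arr[i];
}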

This causes a slowdown in zippy, which has two such global arrays, wordmask and
char_table.  With PIE there is no equivalent PC-relative instruction that can do
the access in a single instruction.

The slowdown shows up as an increase in dynamic instruction count and cycles at
a similar IPC.  We have seen this affect REDACTED recently, and it is causing a
~1% performance slowdown.

One mitigation technique for small arrays is to copy the array onto the stack
and use the stack pointer for the access, which makes it a single instruction
again.  The downside is the extra instructions at function entry to copy the
array onto the stack, which is why we only want to do this for small arrays.
I tried moving wordmask onto the stack since it is a small array.  The
performance numbers look good overall: there is an improvement in the dynamic
instruction count for almost all BM_UFlat benchmarks.  BM_UFlat/2 and
BM_UFlat/3 are pretty noisy.  The only regression is BM_UFlat/10, where the
instruction count goes down but the IPC also drops, hurting performance.  That
result also looks noisy, but I do see a small IPC drop with this change.
Otherwise the numbers look good and consistent.  I measured this multiple times
on a perflab IvyBridge machine.  Numbers are given below.  For Improv.
(improvement), positive is good.
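
For illustration only, here is a self-contained sketch of the stack-copy
pattern (the actual change is in the diff at the end of this message; the
wordmask contents and the DecodeMaskedWord name are assumptions made for the
sketch, not zippy code):

#include <cstdint>
#include <cstring>

namespace internal {
// Assumed to mirror zippy's table: masks for keeping 0..4 bytes of a word.
const uint32_t wordmask[] = { 0u, 0xffu, 0xffffu, 0xffffffu, 0xffffffffu };
}

uint32_t DecodeMaskedWord(const char* p, unsigned n_bytes) {
  // One-time copy at function entry; cheap because the table is tiny.
  uint32_t wordmask[sizeof(internal::wordmask) / sizeof(uint32_t)];
  std::memcpy(wordmask, internal::wordmask, sizeof(wordmask));

  uint32_t v;
  std::memcpy(&v, p, sizeof(v));   // unaligned 32-bit load; assumes >= 4 readable bytes
  return v & wordmask[n_bytes];    // indexed load off the stack pointer, one instruction
}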

Binaries built as: blaze build -c opt --dynamic_mode=off

Benchmark	Base CPU(ns)	Opt CPU(ns)	Improv.	Base Cycles	Opt Cycles	Improv.	Base Insns	Opt Insns	Improv.

BM_UFlat/1	541711		537052		0.86%	46068129918	45442732684	1.36%	85113352848	83917656016	1.40%
BM_UFlat/2	6228		6388		-2.57%	582789808	583267855	-0.08%	1261517746	1261116553	0.03%
BM_UFlat/3	159		120		24.53%	61538641	58783800	4.48%	90008672	90980060	-1.08%
BM_UFlat/4	7878		7787		1.16%	710491888	703718556	0.95%	1914898283	1525060250	20.36%
BM_UFlat/5	208854		207673		0.57%	17640846255	17609530720	0.18%	36546983483	36008920788	1.47%
BM_UFlat/6	172595		167225		3.11%	14642082831	14232371166	2.80%	33647820489	33056659600	1.76%
BM_UFlat/7	152364		147901		2.93%	12904338645	12635220582	2.09%	28958390984	28457982504	1.73%
BM_UFlat/8	463764		448244		3.35%	39423576973	37917435891	3.82%	88350964483	86800265943	1.76%
BM_UFlat/9	639517		621811		2.77%	54275945823	52555988926	3.17%	119503172410	117432599704	1.73%
BM_UFlat/10	41929		42358		-1.02%	3593125535	3647231492	-1.51%	8559206066	8446526639	1.32%
BM_UFlat/11	174754		173936		0.47%	14885371426	14749410955	0.91%	36693421142	35987215897	1.92%
BM_UFlat/12	13388		13257		0.98%	1192648670	1179645044	1.09%	3506482177	3454962579	1.47%
BM_UFlat/13	6801		6588		3.13%	627960003	608367286	3.12%	1847877894	1818368400	1.60%
BM_UFlat/14	2057		1989		3.31%	229005588	217393157	5.07%	609686274	599419511	1.68%
BM_UFlat/15	831618		799881		3.82%	70440388955	67911853013	3.59%	167178603105	164653652416	1.51%
BM_UFlat/16	199		199		0.00%	70109081	68747579	1.94%	106263639	105569531	0.65%
BM_UFlat/17	279031		273890		1.84%	23361373312	23294246637	0.29%	40474834585	39981682217	1.22%
BM_UFlat/18	233		199		14.59%	74530664	67841101	8.98%	94305848	92271053	2.16%
BM_UFlat/19	26743		25309		5.36%	2327215133	2206712016	5.18%	6024314357	5935228694	1.48%
BM_UFlat/20	2731		2625		3.88%	282018757	276772813	1.86%	768382519	758277029	1.32%

Is this a reasonable workaround for the problem?  Do you need more performance
measurements?  haih@ is evaluating this change for [] and I will update those
numbers once we have them.

Tested:
   Performance with zippy_unittest.
Sriraman Tallam 2016-06-29 10:08:46 -07:00 committed by Alkis Evlogimenos
parent 38a5ec5fca
commit 4a74094080
1 changed file with 4 additions and 0 deletions

@@ -578,6 +578,10 @@ class SnappyDecompressor {
   template <class Writer>
   void DecompressAllTags(Writer* writer) {
     const char* ip = ip_;
+    // For position-independent executables, accessing global arrays can be
+    // slow. Move wordmask array onto the stack to mitigate this.
+    uint32 wordmask[sizeof(internal::wordmask)/sizeof(uint32)];
+    memcpy(wordmask, internal::wordmask, sizeof(wordmask));
     // We could have put this refill fragment only at the beginning of the loop.
     // However, duplicating it at the end of each branch gives the compiler more