Commit Graph

134 Commits

Author SHA1 Message Date
Geoff Pike 27c5d86527 Re-work fast path for handling copies in zippy decompression.
This is a performance-tuning change that shouldn't change the behavior
of the library.

This adds some complexity but the performance gain might make that
worthwhile: With FDO on perflab/haswell, a 4.0% gain (geometric mean).

SAMPLE (before)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           36638      36552     100000 2.6GB/s  html
BM_UFlat/1          457153     455895       9173 1.4GB/s  urls
BM_UFlat/2            5850       5837     685481 19.6GB/s  jpg
BM_UFlat/3             122        122   34551988 1.5GB/s  jpg_200
BM_UFlat/4            6797       6781     620811 14.1GB/s  pdf
BM_UFlat/5          179485     179037      23471 2.1GB/s  html4
BM_UFlat/6          142734     142384      29525 1018.7MB/s  txt1
BM_UFlat/7          125233     124924      33709 955.6MB/s  txt2
BM_UFlat/8          382548     381533      10000 1066.7MB/s  txt3
BM_UFlat/9          525614     524297       8018 876.5MB/s  txt4
BM_UFlat/10          34946      34868     100000 3.2GB/s  pb
BM_UFlat/11         149548     149208      28063 1.2GB/s  gaviota
BM_UFlat/12          10684      10663     392580 2.1GB/s  cp
BM_UFlat/13           5494       5484     766584 1.9GB/s  c
BM_UFlat/14           1691       1688    2488784 2.1GB/s  lsp
BM_UFlat/15         676443     674726       6129 1.4GB/s  xls
BM_UFlat/16            156        156   26656909 1.2GB/s  xls_200
BM_UFlat/17         239911     239297      17558 2.0GB/s  bin
BM_UFlat/18            182        182   23072932 1047.9MB/s  bin_200
BM_UFlat/19          21544      21499     194484 1.7GB/s  sum
BM_UFlat/20           2236       2232    1877810 1.8GB/s  man
BM_UFlatSink/0       42266      42179      99732 2.3GB/s  html
BM_UFlatSink/1      461810     460633       9055 1.4GB/s  urls
BM_UFlatSink/2        5816       5804     632829 19.8GB/s  jpg
BM_UFlatSink/3         124        123   34351698 1.5GB/s  jpg_200
BM_UFlatSink/4        7173       7157     609929 13.3GB/s  pdf
BM_UFlatSink/5      184795     184302      22660 2.1GB/s  html4
BM_UFlatSink/6      143552     143223      29272 1012.7MB/s  txt1
BM_UFlatSink/7      127160     126890      33178 940.8MB/s  txt2
BM_UFlatSink/8      382219     381313      10000 1067.3MB/s  txt3
BM_UFlatSink/9      528042     526713       7988 872.5MB/s  txt4
BM_UFlatSink/10      41389      41305     100000 2.7GB/s  pb
BM_UFlatSink/11     147215     146877      28854 1.2GB/s  gaviota
BM_UFlatSink/12      12008      11984     348139 1.9GB/s  cp
BM_UFlatSink/13       5444       5433     775084 1.9GB/s  c
BM_UFlatSink/14       1647       1644    2552119 2.1GB/s  lsp
BM_UFlatSink/15     665011     663424       6320 1.4GB/s  xls
BM_UFlatSink/16        153        153   27571837 1.2GB/s  xls_200
BM_UFlatSink/17     239735     239169      17411 2.0GB/s  bin
BM_UFlatSink/18        183        182   23005573 1046.8MB/s  bin_200
BM_UFlatSink/19      22544      22498     187705 1.6GB/s  sum
BM_UFlatSink/20       2190       2186    1917894 1.8GB/s  man

SAMPLE (after)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           33940      33889     100000 2.8GB/s  html
BM_UFlat/1          440728     439944       9586 1.5GB/s  urls
BM_UFlat/2            5652       5641     744776 20.3GB/s  jpg
BM_UFlat/3             123        123   34647884 1.5GB/s  jpg_200
BM_UFlat/4            6628       6615     631892 14.4GB/s  pdf
BM_UFlat/5          169523     169227      24197 2.3GB/s  html4
BM_UFlat/6          144139     143892      29232 1008.0MB/s  txt1
BM_UFlat/7          127148     126915      33144 940.6MB/s  txt2
BM_UFlat/8          380267     379233      10000 1073.2MB/s  txt3
BM_UFlat/9          529495     528194       7957 870.0MB/s  txt4
BM_UFlat/10          31844      31784     100000 3.5GB/s  pb
BM_UFlat/11         146822     146476      28737 1.2GB/s  gaviota
BM_UFlat/12          10784      10762     392176 2.1GB/s  cp
BM_UFlat/13           5528       5518     760934 1.9GB/s  c
BM_UFlat/14           1721       1719    2449291 2.0GB/s  lsp
BM_UFlat/15         673304     671774       6255 1.4GB/s  xls
BM_UFlat/16            155        155   27092003 1.2GB/s  xls_200
BM_UFlat/17         230424     229902      18285 2.1GB/s  bin
BM_UFlat/18            185        184   22818199 1033.9MB/s  bin_200
BM_UFlat/19          21035      20996     200765 1.7GB/s  sum
BM_UFlat/20           2242       2238    1864380 1.8GB/s  man
BM_UFlatSink/0       33487      33405     100000 2.9GB/s  html
BM_UFlatSink/1      431108     430226       9764 1.5GB/s  urls
BM_UFlatSink/2        5927       5916     648112 19.4GB/s  jpg
BM_UFlatSink/3         123        122   34704423 1.5GB/s  jpg_200
BM_UFlatSink/4        6472       6461     653462 14.8GB/s  pdf
BM_UFlatSink/5      164309     163988      25567 2.3GB/s  html4
BM_UFlatSink/6      138274     138020      30311 1050.9MB/s  txt1
BM_UFlatSink/7      120844     120637      34708 989.6MB/s  txt2
BM_UFlatSink/8      371046     370366      10000 1098.9MB/s  txt3
BM_UFlatSink/9      510021     508982       8269 902.9MB/s  txt4
BM_UFlatSink/10      30889      30844     100000 3.6GB/s  pb
BM_UFlatSink/11     140752     140521      29903 1.2GB/s  gaviota
BM_UFlatSink/12      10162      10146     413600 2.3GB/s  cp
BM_UFlatSink/13       5264       5256     762398 2.0GB/s  c
BM_UFlatSink/14       1622       1619    2606069 2.1GB/s  lsp
BM_UFlatSink/15     646897     645756       6512 1.5GB/s  xls
BM_UFlatSink/16        150        150   28223595 1.2GB/s  xls_200
BM_UFlatSink/17     226096     225650      18629 2.1GB/s  bin
BM_UFlatSink/18        185        184   22907935 1035.3MB/s  bin_200
BM_UFlatSink/19      21369      21335     198881 1.7GB/s  sum
BM_UFlatSink/20       2139       2136    1953637 1.8GB/s  man
2017-01-26 21:42:26 +01:00
Sriraman Tallam 4a74094080 Speed up Zippy decompression in PIE mode by removing the penalty for
global array access.

With PIE, accessing global arrays needs two instructions whereas it can be
done with a single instruction without PIE.  See []
For example, without PIE the access looks like:
mov    0x400780(,%rdi,4),%eax  // One instruction to access arr[i]

and with PIE the access looks like:
lea    0x149(%rip),%rax        # 400780 <_ZL3arr>
mov    (%rax,%rdi,4),%eax

This causes a slow down in zippy as it has two global arrays, wordmask and
char_table.  There is no equivalent PC-relative insn. with PIE to do this in
one instruction.

The slow down can be seen as an increase in dynamic instruction count and
cycles with a similar IPC.  We have seen this affect REDACTED recently and this
is causing a ~1% perf. slow down.

One of the mitigation techniques for small arrays is to move it onto the stack,
use the stack pointer to make the access a single instruction.  The downside to
this is the extra instructions at function call to mov the array onto the stack
which is why we want to do this only for small arrays.  I tried moving
wordmask onto the stack since it is a small array. The performance numbers look
good overall. There is an improvement in the dynamic instruction count for
almost all BM_UFlat benchmarks.  BM_UFlat/2 and BM_UFlat/3 are pretty noisy.
The only case where there is a regression is BM_UFlat/10.  Here, the instruction
count does go down but the IPC also goes down affecting performance. This also
looks noisy but I do see a small IPC drop with this change.  Otherwise, the
numbers look good and consistent.  I measured this on a perflab ivybridge
machine multiple times.  Numbers are given below.  For Improv. (improvements),
positive is good.

Binaries built as: blaze build -c opt --dynamic_mode=off

Benchmark	Base CPU(ns)	Opt CPU(ns)	Improv.	Base Cycles	Opt Cycles	Improv.	Base Insns	Opt Insns	Improv.

BM_UFlat/1	541711		537052		0.86%	46068129918	45442732684	1.36%	85113352848	83917656016	1.40%
BM_UFlat/2	6228		6388		-2.57%	582789808	583267855	-0.08%	1261517746	1261116553	0.03%
BM_UFlat/3	159		120		24.53%	61538641	58783800	4.48%	90008672	90980060	-1.08%
BM_UFlat/4	7878		7787		1.16%	710491888	703718556	0.95%	1914898283	1525060250	20.36%
BM_UFlat/5	208854		207673		0.57%	17640846255	17609530720	0.18%	36546983483	36008920788	1.47%
BM_UFlat/6	172595		167225		3.11%	14642082831	14232371166	2.80%	33647820489	33056659600	1.76%
BM_UFlat/7	152364		147901		2.93%	12904338645	12635220582	2.09%	28958390984	28457982504	1.73%
BM_UFlat/8	463764		448244		3.35%	39423576973	37917435891	3.82%	88350964483	86800265943	1.76%
BM_UFlat/9	639517		621811		2.77%	54275945823	52555988926	3.17%	119503172410	117432599704	1.73%
BM_UFlat/10	41929		42358		-1.02%	3593125535	3647231492	-1.51%	8559206066	8446526639	1.32%
BM_UFlat/11	174754		173936		0.47%	14885371426	14749410955	0.91%	36693421142	35987215897	1.92%
BM_UFlat/12	13388		13257		0.98%	1192648670	1179645044	1.09%	3506482177	3454962579	1.47%
BM_UFlat/13	6801		6588		3.13%	627960003	608367286	3.12%	1847877894	1818368400	1.60%
BM_UFlat/14	2057		1989		3.31%	229005588	217393157	5.07%	609686274	599419511	1.68%
BM_UFlat/15	831618		799881		3.82%	70440388955	67911853013	3.59%	167178603105	164653652416	1.51%
BM_UFlat/16	199		199		0.00%	70109081	68747579	1.94%	106263639	105569531	0.65%
BM_UFlat/17	279031		273890		1.84%	23361373312	23294246637	0.29%	40474834585	39981682217	1.22%
BM_UFlat/18	233		199		14.59%	74530664	67841101	8.98%	94305848	92271053	2.16%
BM_UFlat/19	26743		25309		5.36%	2327215133	2206712016	5.18%	6024314357	5935228694	1.48%
BM_UFlat/20	2731		2625		3.88%	282018757	276772813	1.86%	768382519	758277029	1.32%

Is this a reasonable work-around for the problem?  Do you need more performance
measurements?  haih@ is evaluating this change for [] and I will update those
numbers once we have it.

Tested:
   Performance with zippy_unittest.
2017-01-26 21:42:11 +01:00
Geoff Pike 38a5ec5fca Re-work fast path that emits copies in zippy compression.
The primary motivation for the change is that FindMatchLength is
likely to discover a difference in the first 8 bytes it compares.
If that occurs then we know the length of the match is less than 12,
because FindMatchLength is invoked after a 4-byte match is found.
When emitting a copy, it is useful to know that the length is less
than 12 because the two-byte variant of an emitted copy requires that.

This is a performance-tuning change that should not affect the
library's behavior.

With FDO on perflab/Haswell the geometric mean for ZFlat/* went from
47,290ns to 45,741ns, an improvement of 3.4%.

SAMPLE (before)

BM_ZFlat/0      102824     102650      40691 951.4MB/s  html (22.31 %)
BM_ZFlat/1     1293512    1290442       3225 518.9MB/s  urls (47.78 %)
BM_ZFlat/2       10373      10353     417959 11.1GB/s  jpg (99.95 %)
BM_ZFlat/3         268        268   15745324 712.4MB/s  jpg_200 (73.00 %)
BM_ZFlat/4       12137      12113     342462 7.9GB/s  pdf (83.30 %)
BM_ZFlat/5      430672     429720       9724 909.0MB/s  html4 (22.52 %)
BM_ZFlat/6      420541     419636       9833 345.6MB/s  txt1 (57.88 %)
BM_ZFlat/7      373829     373158      10000 319.9MB/s  txt2 (61.91 %)
BM_ZFlat/8     1119014    1116604       3755 364.5MB/s  txt3 (54.99 %)
BM_ZFlat/9     1544203    1540657       2748 298.3MB/s  txt4 (66.26 %)
BM_ZFlat/10      91041      90866      46002 1.2GB/s  pb (19.68 %)
BM_ZFlat/11     332766     331990      10000 529.5MB/s  gaviota (37.72 %)
BM_ZFlat/12      39960      39886     100000 588.3MB/s  cp (48.12 %)
BM_ZFlat/13      14493      14465     287181 735.1MB/s  c (42.47 %)
BM_ZFlat/14       4447       4440     947927 799.3MB/s  lsp (48.37 %)
BM_ZFlat/15    1316362    1313350       3196 747.7MB/s  xls (41.23 %)
BM_ZFlat/16        312        311   10000000 613.0MB/s  xls_200 (78.00 %)
BM_ZFlat/17     388471     387502      10000 1.2GB/s  bin (18.11 %)
BM_ZFlat/18         65         64   64838208 2.9GB/s  bin_200 (7.50 %)
BM_ZFlat/19      65900      65787      63099 554.3MB/s  sum (48.96 %)
BM_ZFlat/20       6188       6177     681951 652.6MB/s  man (59.21 %)

SAMPLE (after)

Benchmark     Time(ns)    CPU(ns) Iterations
--------------------------------------------
BM_ZFlat/0       99259      99044      42428 986.0MB/s  html (22.31 %)
BM_ZFlat/1     1257039    1255276       3341 533.4MB/s  urls (47.78 %)
BM_ZFlat/2       10044      10030     405781 11.4GB/s  jpg (99.95 %)
BM_ZFlat/3         268        267   15732282 713.3MB/s  jpg_200 (73.00 %)
BM_ZFlat/4       11675      11657     358629 8.2GB/s  pdf (83.30 %)
BM_ZFlat/5      420951     419818       9739 930.5MB/s  html4 (22.52 %)
BM_ZFlat/6      415460     414632      10000 349.8MB/s  txt1 (57.88 %)
BM_ZFlat/7      367191     366436      10000 325.8MB/s  txt2 (61.91 %)
BM_ZFlat/8     1098345    1096036       3819 371.3MB/s  txt3 (54.99 %)
BM_ZFlat/9     1508701    1505306       2758 305.3MB/s  txt4 (66.26 %)
BM_ZFlat/10      87195      87031      47289 1.3GB/s  pb (19.68 %)
BM_ZFlat/11     322338     321637      10000 546.5MB/s  gaviota (37.72 %)
BM_ZFlat/12      36739      36668     100000 639.9MB/s  cp (48.12 %)
BM_ZFlat/13      13646      13618     304009 780.9MB/s  c (42.47 %)
BM_ZFlat/14       4249       4240     992456 837.0MB/s  lsp (48.37 %)
BM_ZFlat/15    1262925    1260012       3314 779.4MB/s  xls (41.23 %)
BM_ZFlat/16        308        308   10000000 619.8MB/s  xls_200 (78.00 %)
BM_ZFlat/17     379750     378944      10000 1.3GB/s  bin (18.11 %)
BM_ZFlat/18         62         62   67443280 3.0GB/s  bin_200 (7.50 %)
BM_ZFlat/19      61706      61587      67645 592.1MB/s  sum (48.96 %)
BM_ZFlat/20       5968       5958     698974 676.6MB/s  man (59.21 %)
2017-01-26 21:39:39 +01:00
ckennelly 094c67de88 Speed up the EmitLiteral fast path, +1.62% for ZFlat benchmarks.
This is inspired by the Go version in
//third_party/golang/snappy/encode_amd64.s (emitLiteralFastPath)

        Benchmark         Base:Reference   (1)
--------------------------------------------------
(BM_ZFlat_0 1/cputime_ns)        9.669e-06  +1.65%
(BM_ZFlat_1 1/cputime_ns)        7.643e-07  +2.53%
(BM_ZFlat_10 1/cputime_ns)       1.107e-05  -0.97%
(BM_ZFlat_11 1/cputime_ns)       3.002e-06  +0.71%
(BM_ZFlat_12 1/cputime_ns)       2.338e-05  +7.22%
(BM_ZFlat_13 1/cputime_ns)       6.386e-05  +9.18%
(BM_ZFlat_14 1/cputime_ns)       0.0002256  -0.05%
(BM_ZFlat_15 1/cputime_ns)       7.608e-07  -1.29%
(BM_ZFlat_16 1/cputime_ns)        0.003236  -1.28%
(BM_ZFlat_17 1/cputime_ns)        2.58e-06  +0.52%
(BM_ZFlat_18 1/cputime_ns)         0.01538  +0.00%
(BM_ZFlat_19 1/cputime_ns)       1.436e-05  +6.21%
(BM_ZFlat_2 1/cputime_ns)        0.0001044  +4.99%
(BM_ZFlat_20 1/cputime_ns)       0.0001608  -0.18%
(BM_ZFlat_3 1/cputime_ns)         0.003745  +0.38%
(BM_ZFlat_4 1/cputime_ns)        8.144e-05  +6.21%
(BM_ZFlat_5 1/cputime_ns)        2.328e-06  -1.60%
(BM_ZFlat_6 1/cputime_ns)        2.391e-06  +0.06%
(BM_ZFlat_7 1/cputime_ns)         2.68e-06  -0.61%
(BM_ZFlat_8 1/cputime_ns)        8.852e-07  +0.19%
(BM_ZFlat_9 1/cputime_ns)        6.441e-07  +1.06%

geometric mean                              +1.62%
2017-01-26 21:38:49 +01:00
Geoff Pike fce661fa8c Speed up zippy decompression by removing some zero-extensions.
This is a performance tuning change that should not affect
correctness.  On perflab with FDO on Haswell the performance gain is
21,776ns before vs 21,255ns after, about 2.4%.  (Using geometric means.)

SAMPLE PERFORMANCE with FDO on HASWELL (NEW)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           37366      37279     100000 2.6GB/s  html
BM_UFlat/1          471153     470204       8975 1.4GB/s  urls
BM_UFlat/2            6116       6105     639496 18.8GB/s  jpg
BM_UFlat/3             123        123   34709908 1.5GB/s  jpg_200
BM_UFlat/4            6724       6714     623318 14.2GB/s  pdf
BM_UFlat/5          183122     182722      23138 2.1GB/s  html4
BM_UFlat/6          144981     144689      29384 1002.5MB/s  txt1
BM_UFlat/7          125939     125691      33423 949.8MB/s  txt2
BM_UFlat/8          383101     382241      10000 1064.7MB/s  txt3
BM_UFlat/9          527824     526606       7958 872.6MB/s  txt4
BM_UFlat/10          34849      34790     100000 3.2GB/s  pb
BM_UFlat/11         150213     149937      28131 1.1GB/s  gaviota
BM_UFlat/12          10850      10830     393231 2.1GB/s  cp
BM_UFlat/13           5532       5523     735739 1.9GB/s  c
BM_UFlat/14           1698       1695    2478035 2.0GB/s  lsp
BM_UFlat/15         678396     676917       6200 1.4GB/s  xls
BM_UFlat/16            155        155   26909789 1.2GB/s  xls_200
BM_UFlat/17         241235     240698      17416 2.0GB/s  bin
BM_UFlat/18            183        183   23000841 1043.5MB/s  bin_200
BM_UFlat/19          21461      21424     193275 1.7GB/s  sum
BM_UFlat/20           2232       2228    1887191 1.8GB/s  man
BM_UFlatSink/0       42272      42199      98528 2.3GB/s  html
BM_UFlatSink/1      460814     459898       9092 1.4GB/s  urls
BM_UFlatSink/2        5558       5547     768629 20.7GB/s  jpg
BM_UFlatSink/3         124        123   33629141 1.5GB/s  jpg_200
BM_UFlatSink/4        6634       6621     629989 14.4GB/s  pdf
BM_UFlatSink/5      182883     182491      23030 2.1GB/s  html4
BM_UFlatSink/6      143269     142964      29410 1014.5MB/s  txt1
BM_UFlatSink/7      127041     126809      33136 941.4MB/s  txt2
BM_UFlatSink/8      384367     383577      10000 1061.0MB/s  txt3
BM_UFlatSink/9      529979     528890       7898 868.9MB/s  txt4
BM_UFlatSink/10      41154      41075     100000 2.7GB/s  pb
BM_UFlatSink/11     146446     146155      28742 1.2GB/s  gaviota
BM_UFlatSink/12      11939      11918     352663 1.9GB/s  cp
BM_UFlatSink/13       5430       5421     770451 1.9GB/s  c
BM_UFlatSink/14       1665       1662    2538921 2.1GB/s  lsp
BM_UFlatSink/15     666840     665617       6309 1.4GB/s  xls
BM_UFlatSink/16        152        152   27639460 1.2GB/s  xls_200
BM_UFlatSink/17     240076     239573      17643 2.0GB/s  bin
BM_UFlatSink/18        183        182   23128210 1046.0MB/s  bin_200
BM_UFlatSink/19      22570      22528     185839 1.6GB/s  sum
BM_UFlatSink/20       2183       2180    1899526 1.8GB/s  man

SAMPLE PERFORMANCE with FDO on HASWELL (OLD)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           37041      36990     100000 2.6GB/s  html
BM_UFlat/1          471384     470574       8930 1.4GB/s  urls
BM_UFlat/2            5997       5986     722354 19.2GB/s  jpg
BM_UFlat/3             124        123   34964717 1.5GB/s  jpg_200
BM_UFlat/4            6850       6838     621414 13.9GB/s  pdf
BM_UFlat/5          182578     182271      23001 2.1GB/s  html4
BM_UFlat/6          148338     147989      28132 980.1MB/s  txt1
BM_UFlat/7          130682     130471      32347 915.0MB/s  txt2
BM_UFlat/8          397420     396553      10000 1026.3MB/s  txt3
BM_UFlat/9          550126     548872       7736 837.2MB/s  txt4
BM_UFlat/10          35013      34958     100000 3.2GB/s  pb
BM_UFlat/11         152270     151889      27508 1.1GB/s  gaviota
BM_UFlat/12          11117      11096     379059 2.1GB/s  cp
BM_UFlat/13           5812       5801     725240 1.8GB/s  c
BM_UFlat/14           1780       1777    2383982 2.0GB/s  lsp
BM_UFlat/15         707871     706139       5946 1.4GB/s  xls
BM_UFlat/16            157        157   26889747 1.2GB/s  xls_200
BM_UFlat/17         239160     238556      17512 2.0GB/s  bin
BM_UFlat/18            181        180   23326040 1057.5MB/s  bin_200
BM_UFlat/19          22706      22656     186285 1.6GB/s  sum
BM_UFlat/20           2319       2315    1813186 1.7GB/s  man
BM_UFlatSink/0       42657      42574      99000 2.2GB/s  html
BM_UFlatSink/1      466316     465262       9036 1.4GB/s  urls
BM_UFlatSink/2        6873       6859     648525 16.7GB/s  jpg
BM_UFlatSink/3         124        124   34434643 1.5GB/s  jpg_200
BM_UFlatSink/4        6804       6790     624282 14.0GB/s  pdf
BM_UFlatSink/5      185468     185062      22746 2.1GB/s  html4
BM_UFlatSink/6      148511     148209      28284 978.6MB/s  txt1
BM_UFlatSink/7      130865     130607      32144 914.0MB/s  txt2
BM_UFlatSink/8      393931     392983      10000 1035.6MB/s  txt3
BM_UFlatSink/9      545548     544275       7740 844.3MB/s  txt4
BM_UFlatSink/10      41659      41584     100000 2.7GB/s  pb
BM_UFlatSink/11     152062     151721      27854 1.1GB/s  gaviota
BM_UFlatSink/12      11987      11968     350909 1.9GB/s  cp
BM_UFlatSink/13       5652       5641     743280 1.8GB/s  c
BM_UFlatSink/14       1728       1725    2446140 2.0GB/s  lsp
BM_UFlatSink/15     687879     686231       6138 1.4GB/s  xls
BM_UFlatSink/16        155        155   27254484 1.2GB/s  xls_200
BM_UFlatSink/17     240689     240083      17450 2.0GB/s  bin
BM_UFlatSink/18        183        182   22932858 1046.8MB/s  bin_200
BM_UFlatSink/19      22718      22674     185207 1.6GB/s  sum
BM_UFlatSink/20       2272       2268    1851664 1.7GB/s  man
2017-01-26 21:38:36 +01:00
ckennelly e788e527d3 Avoid calling memset when resizing the buffer.
This buffer will be initialized and then trimmed down to size during the
compression phase.
2017-01-26 21:35:55 +01:00
Steinar H. Gunderson d53de18799 Make heuristic match skipping more aggressive.
This causes compression to be much faster on incompressible inputs
(such as the jpeg and pdf tests), and is neutral or even positive on the other
tests. The test set shows only microscopic density regressions; I attempted to
construct a worst-case test set containing ~1500 different cases of mixed
plaintext + /dev/urandom, and even those seemed to be only 0.38 percentage
points less dense on average (the single worst case was 87.8% -> 89.0%), which
we can live with given that this is already an edge case.

The original idea is by Klaus Post; I only tweaked the implementation.
Ironically, the new implementation is almost more in line with the
comment that was there, so I've left that largely alone, albeit
with a small modification.

Microbenchmark results (opt mode, 64-bit, static linking):

Ivy Bridge:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   120284    115480  847.0MB/s  html (22.31 %)        +4.2%
BM_ZFlat/1                  1527911   1522242  440.7MB/s  urls (47.78 %)        +0.4%
BM_ZFlat/2                    17591     10582  10.9GB/s  jpg (99.95 %)         +66.2%
BM_ZFlat/3                      323       322  593.3MB/s  jpg_200 (73.00 %)     +0.3%
BM_ZFlat/4                    53691     14063  6.8GB/s  pdf (83.30 %)         +281.8%
BM_ZFlat/5                   495442    492347  794.8MB/s  html4 (22.52 %)       +0.6%
BM_ZFlat/6                   473523    473622  306.7MB/s  txt1 (57.88 %)        -0.0%
BM_ZFlat/7                   421406    420120  284.5MB/s  txt2 (61.91 %)        +0.3%
BM_ZFlat/8                  1265632   1270538  320.8MB/s  txt3 (54.99 %)        -0.4%
BM_ZFlat/9                  1742688   1737894  264.8MB/s  txt4 (66.26 %)        +0.3%
BM_ZFlat/10                  107950    103404  1095.1MB/s  pb (19.68 %)         +4.4%
BM_ZFlat/11                  372660    371818  473.5MB/s  gaviota (37.72 %)     +0.2%
BM_ZFlat/12                   53239     49528  474.4MB/s  cp (48.12 %)          +7.5%
BM_ZFlat/13                   18940     17349  613.9MB/s  c (42.47 %)           +9.2%
BM_ZFlat/14                    5155      5075  700.3MB/s  lsp (48.37 %)         +1.6%
BM_ZFlat/15                 1474757   1474471  667.2MB/s  xls (41.23 %)         +0.0%
BM_ZFlat/16                     363       362  528.0MB/s  xls_200 (78.00 %)     +0.3%
BM_ZFlat/17                  453849    456931  1073.2MB/s  bin (18.11 %)        -0.7%
BM_ZFlat/18                      90        87  2.1GB/s  bin_200 (7.50 %)        +3.4%
BM_ZFlat/19                   82163     80498  453.7MB/s  sum (48.96 %)         +2.1%
BM_ZFlat/20                    7174      7124  566.7MB/s  man (59.21 %)         +0.7%
Sum of all benchmarks       8694831   8623857                                   +0.8%

Sandy Bridge:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   117426    112649  868.2MB/s  html (22.31 %)        +4.2%
BM_ZFlat/1                  1517095   1498522  447.5MB/s  urls (47.78 %)        +1.2%
BM_ZFlat/2                    18601     10649  10.8GB/s  jpg (99.95 %)         +74.7%
BM_ZFlat/3                      359       356  536.0MB/s  jpg_200 (73.00 %)     +0.8%
BM_ZFlat/4                    60249     13832  6.9GB/s  pdf (83.30 %)         +335.6%
BM_ZFlat/5                   481246    475571  822.7MB/s  html4 (22.52 %)       +1.2%
BM_ZFlat/6                   460541    455693  318.8MB/s  txt1 (57.88 %)        +1.1%
BM_ZFlat/7                   407751    404147  295.8MB/s  txt2 (61.91 %)        +0.9%
BM_ZFlat/8                  1228255   1222519  333.4MB/s  txt3 (54.99 %)        +0.5%
BM_ZFlat/9                  1678299   1666379  276.2MB/s  txt4 (66.26 %)        +0.7%
BM_ZFlat/10                  106499    101715  1113.4MB/s  pb (19.68 %)         +4.7%
BM_ZFlat/11                  361913    360222  488.7MB/s  gaviota (37.72 %)     +0.5%
BM_ZFlat/12                   53137     49618  473.6MB/s  cp (48.12 %)          +7.1%
BM_ZFlat/13                   18801     17812  597.8MB/s  c (42.47 %)           +5.6%
BM_ZFlat/14                    5394      5383  660.2MB/s  lsp (48.37 %)         +0.2%
BM_ZFlat/15                 1435411   1432870  686.4MB/s  xls (41.23 %)         +0.2%
BM_ZFlat/16                     389       395  483.3MB/s  xls_200 (78.00 %)     -1.5%
BM_ZFlat/17                  447255    445510  1100.4MB/s  bin (18.11 %)        +0.4%
BM_ZFlat/18                      86        86  2.2GB/s  bin_200 (7.50 %)        +0.0%
BM_ZFlat/19                   82555     79512  459.3MB/s  sum (48.96 %)         +3.8%
BM_ZFlat/20                    7527      7553  534.5MB/s  man (59.21 %)         -0.3%
Sum of all benchmarks       8488789   8360993                                   +1.5%

Haswell:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   107512    105621  925.6MB/s  html (22.31 %)        +1.8%
BM_ZFlat/1                  1344306   1332479  503.1MB/s  urls (47.78 %)        +0.9%
BM_ZFlat/2                    14752      9471  12.1GB/s  jpg (99.95 %)         +55.8%
BM_ZFlat/3                      287       275  694.0MB/s  jpg_200 (73.00 %)     +4.4%
BM_ZFlat/4                    48810     12263  7.8GB/s  pdf (83.30 %)         +298.0%
BM_ZFlat/5                   443013    442064  884.6MB/s  html4 (22.52 %)       +0.2%
BM_ZFlat/6                   429239    432124  336.0MB/s  txt1 (57.88 %)        -0.7%
BM_ZFlat/7                   381765    383681  311.5MB/s  txt2 (61.91 %)        -0.5%
BM_ZFlat/8                  1136667   1154304  353.0MB/s  txt3 (54.99 %)        -1.5%
BM_ZFlat/9                  1579925   1592431  288.9MB/s  txt4 (66.26 %)        -0.8%
BM_ZFlat/10                   98345     92411  1.2GB/s  pb (19.68 %)            +6.4%
BM_ZFlat/11                  340397    340466  516.8MB/s  gaviota (37.72 %)     -0.0%
BM_ZFlat/12                   47076     43536  539.5MB/s  cp (48.12 %)          +8.1%
BM_ZFlat/13                   16680     15637  680.8MB/s  c (42.47 %)           +6.7%
BM_ZFlat/14                    4616      4539  782.6MB/s  lsp (48.37 %)         +1.7%
BM_ZFlat/15                 1331231   1334094  736.9MB/s  xls (41.23 %)         -0.2%
BM_ZFlat/16                     326       322  593.5MB/s  xls_200 (78.00 %)     +1.2%
BM_ZFlat/17                  404383    400326  1.2GB/s  bin (18.11 %)           +1.0%
BM_ZFlat/18                      69        69  2.7GB/s  bin_200 (7.50 %)        +0.0%
BM_ZFlat/19                   74771     71348  511.7MB/s  sum (48.96 %)         +4.8%
BM_ZFlat/20                    6461      6383  632.2MB/s  man (59.21 %)         +1.2%
Sum of all benchmarks       7810631   7773844                                   +0.5%

I've done a quick test that there are no performance regressions on external
GCC (4.9.2, Debian, Haswell, 64-bit), too.
2016-04-05 11:50:26 +02:00
Steinar H. Gunderson 7525a1600d Fix an issue where the ByteSource path (used for parsing std::string)
would incorrectly accept some invalid varints that the other path would not,
causing potential CHECK-failures if the unit test were run with
--write_uncompressed and a corrupted input file.

Found by the afl fuzzer.
2016-01-04 12:52:15 +01:00
Steinar H. Gunderson 0852af7606 Move the logic from ComputeTable into the unit test, which means it's run
automatically together with the other tests, and also removes the stray
function ComputeTable() (which was never referenced by anything else
in the open-source version, causing compiler warnings for some)
out of the core library.

Fixes public issue 96.

A=sesse
R=sanjay
2015-08-19 11:37:51 +02:00
Steinar H. Gunderson d80342922c Fix signed-vs.-unsigned comparison warnings.
These were found by compiling Chromium's external copy of this code with MSVC
with warning C4018 enabled.

A=pkasting
R=sanjay
2015-08-03 13:17:04 +02:00
Steinar H. Gunderson eb66d8176b Initialized members of SnappyArrayWriter and SnappyDecompressionValidator.
These members were almost surely initialized before use by other member
functions, but Coverity was warning about this. Eliminating these warnings
minimizes clutter in that report and the likelihood of overlooking a real bug.

A=cmumford
R=jeff
2015-07-06 14:21:16 +02:00
Steinar H. Gunderson b2312c4c25 Add support for Uncompress(source, sink). Various changes to allow
Uncompress(source, sink) to get the same performance as the different
variants of Uncompress to Cord/DataBuffer/String/FlatBuffer.

Changes to efficiently support Uncompress(source, sink)
--------

a) For strings - we add support to StringByteSink to do GetAppendBuffer so we
   can write to it without copying.
b) For flat array buffers, we do GetAppendBuffer and see if we can get a full buffer.

With the above changes we get performance with ByteSource/ByteSink
that is	very close to directly using flat arrays and strings.

We add various benchmark cases to demonstrate that.

Orthogonal change
------------------

Add support for TryFastAppend() for SnappyScatteredWriter.

Benchmark results are below

CPU: Intel Core2 dL1:32KB dL2:4096KB
Benchmark              Time(ns)    CPU(ns) Iterations
-----------------------------------------------------
BM_UFlat/0               109065     108996       6410 896.0MB/s  html
BM_UFlat/1              1012175    1012343        691 661.4MB/s  urls
BM_UFlat/2                26775      26771      26149 4.4GB/s  jpg
BM_UFlat/3                48947      48940      14363 1.8GB/s  pdf
BM_UFlat/4               441029     440835       1589 886.1MB/s  html4
BM_UFlat/5                39861      39880      17823 588.3MB/s  cp
BM_UFlat/6                18315      18300      38126 581.1MB/s  c
BM_UFlat/7                 5254       5254     100000 675.4MB/s  lsp
BM_UFlat/8              1568060    1567376        447 626.6MB/s  xls
BM_UFlat/9               337512     337734       2073 429.5MB/s  txt1
BM_UFlat/10              287269     287054       2434 415.9MB/s  txt2
BM_UFlat/11              890098     890219        787 457.2MB/s  txt3
BM_UFlat/12             1186593    1186863        590 387.2MB/s  txt4
BM_UFlat/13              573927     573318       1000 853.7MB/s  bin
BM_UFlat/14               64250      64294      10000 567.2MB/s  sum
BM_UFlat/15                7301       7300      96153 552.2MB/s  man
BM_UFlat/16              109617     109636       6375 1031.5MB/s  pb
BM_UFlat/17              364438     364497       1921 482.3MB/s  gaviota
BM_UFlatSink/0           108518     108465       6450 900.4MB/s  html
BM_UFlatSink/1           991952     991997        705 675.0MB/s  urls
BM_UFlatSink/2            26815      26798      26065 4.4GB/s  jpg
BM_UFlatSink/3            49127      49122      14255 1.8GB/s  pdf
BM_UFlatSink/4           436674     436731       1604 894.4MB/s  html4
BM_UFlatSink/5            39738      39733      17345 590.5MB/s  cp
BM_UFlatSink/6            18413      18416      37962 577.4MB/s  c
BM_UFlatSink/7             5677       5676     100000 625.2MB/s  lsp
BM_UFlatSink/8          1552175    1551026        451 633.2MB/s  xls
BM_UFlatSink/9           338526     338489       2065 428.5MB/s  txt1
BM_UFlatSink/10          289387     289307       2420 412.6MB/s  txt2
BM_UFlatSink/11          893803     893706        783 455.4MB/s  txt3
BM_UFlatSink/12         1195919    1195459        586 384.4MB/s  txt4
BM_UFlatSink/13          559637     559779       1000 874.3MB/s  bin
BM_UFlatSink/14           65073      65094      10000 560.2MB/s  sum
BM_UFlatSink/15            7618       7614      92823 529.5MB/s  man
BM_UFlatSink/16          110085     110121       6352 1027.0MB/s  pb
BM_UFlatSink/17          369196     368915       1896 476.5MB/s  gaviota
BM_UValidate/0            46954      46957      14899 2.0GB/s  html
BM_UValidate/1           500621     500868       1000 1.3GB/s  urls
BM_UValidate/2              283        283    2481447 417.2GB/s  jpg
BM_UValidate/3            16230      16228      43137 5.4GB/s  pdf
BM_UValidate/4           189129     189193       3701 2.0GB/s  html4

A=uday
R=sanjay
2015-07-06 14:21:00 +02:00
Steinar H. Gunderson 86eb8b152b Change a few branch annotations that profiling found to be wrong.
Overall performance is neutral or slightly positive.

Westmere (64-bit, opt):

Benchmark               Base (ns)  New (ns)                                Improvement
--------------------------------------------------------------------------------------
BM_UFlat/0                  73798     71464  1.3GB/s  html                    +3.3%
BM_UFlat/1                 715223    704318  953.5MB/s  urls                  +1.5%
BM_UFlat/2                   8137      8871  13.0GB/s  jpg                    -8.3%
BM_UFlat/3                    200       204  935.5MB/s  jpg_200               -2.0%
BM_UFlat/4                  21627     21281  4.5GB/s  pdf                     +1.6%
BM_UFlat/5                 302806    290350  1.3GB/s  html4                   +4.3%
BM_UFlat/6                 218920    219017  664.1MB/s  txt1                  -0.0%
BM_UFlat/7                 190437    191212  626.1MB/s  txt2                  -0.4%
BM_UFlat/8                 584192    580484  703.4MB/s  txt3                  +0.6%
BM_UFlat/9                 776537    779055  591.6MB/s  txt4                  -0.3%
BM_UFlat/10                 76056     72606  1.5GB/s  pb                      +4.8%
BM_UFlat/11                235962    239043  737.4MB/s  gaviota               -1.3%
BM_UFlat/12                 28049     28000  840.1MB/s  cp                    +0.2%
BM_UFlat/13                 12225     12021  886.9MB/s  c                     +1.7%
BM_UFlat/14                  3362      3544  1004.0MB/s  lsp                  -5.1%
BM_UFlat/15                937015    939206  1048.9MB/s  xls                  -0.2%
BM_UFlat/16                   236       233  823.1MB/s  xls_200               +1.3%
BM_UFlat/17                373170    361947  1.3GB/s  bin                     +3.1%
BM_UFlat/18                   264       264  725.5MB/s  bin_200               +0.0%
BM_UFlat/19                 42834     43577  839.2MB/s  sum                   -1.7%
BM_UFlat/20                  4770      4736  853.6MB/s  man                   +0.7%
BM_UValidate/0              39671     39944  2.4GB/s  html                    -0.7%
BM_UValidate/1             443391    443391  1.5GB/s  urls                    +0.0%
BM_UValidate/2                163       163  703.3GB/s  jpg                   +0.0%
BM_UValidate/3                113       112  1.7GB/s  jpg_200                 +0.9%
BM_UValidate/4               7555      7608  12.6GB/s  pdf                    -0.7%
BM_ZFlat/0                 157616    157568  621.5MB/s  html (22.31 %)        +0.0%
BM_ZFlat/1                1997290   2014486  333.4MB/s  urls (47.77 %)        -0.9%
BM_ZFlat/2                  23035     22237  5.2GB/s  jpg (99.95 %)           +3.6%
BM_ZFlat/3                    539       540  354.5MB/s  jpg_200 (73.00 %)     -0.2%
BM_ZFlat/4                  80709     81369  1.2GB/s  pdf (81.85 %)           -0.8%
BM_ZFlat/5                 639059    639220  613.0MB/s  html4 (22.51 %)       -0.0%
BM_ZFlat/6                 577203    583370  249.3MB/s  txt1 (57.87 %)        -1.1%
BM_ZFlat/7                 510887    516094  232.0MB/s  txt2 (61.93 %)        -1.0%
BM_ZFlat/8                1535843   1556973  262.2MB/s  txt3 (54.92 %)        -1.4%
BM_ZFlat/9                2070068   2102380  219.3MB/s  txt4 (66.22 %)        -1.5%
BM_ZFlat/10                152396    152148  745.5MB/s  pb (19.64 %)          +0.2%
BM_ZFlat/11                447367    445859  395.4MB/s  gaviota (37.72 %)     +0.3%
BM_ZFlat/12                 76375     76797  306.3MB/s  cp (48.12 %)          -0.5%
BM_ZFlat/13                 31518     31987  333.3MB/s  c (42.40 %)           -1.5%
BM_ZFlat/14                 10598     10827  328.6MB/s  lsp (48.37 %)         -2.1%
BM_ZFlat/15               1782243   1802728  546.5MB/s  xls (41.23 %)         -1.1%
BM_ZFlat/16                   526       539  355.0MB/s  xls_200 (78.00 %)     -2.4%
BM_ZFlat/17                598141    597311  822.1MB/s  bin (18.11 %)         +0.1%
BM_ZFlat/18                   121       120  1.6GB/s  bin_200 (7.50 %)        +0.8%
BM_ZFlat/19                109981    112173  326.0MB/s  sum (48.96 %)         -2.0%
BM_ZFlat/20                 14355     14575  277.4MB/s  man (59.36 %)         -1.5%
Sum of all benchmarks    33882722  33879325                                   +0.0%

Sandy Bridge (64-bit, opt):

Benchmark               Base (ns)  New (ns)                                Improvement
--------------------------------------------------------------------------------------
BM_UFlat/0                  43764     41600  2.3GB/s  html                    +5.2%
BM_UFlat/1                 517990    507058  1.3GB/s  urls                    +2.2%
BM_UFlat/2                   6625      5529  20.8GB/s  jpg                   +19.8%
BM_UFlat/3                    154       155  1.2GB/s  jpg_200                 -0.6%
BM_UFlat/4                  12795     11747  8.1GB/s  pdf                     +8.9%
BM_UFlat/5                 200335    193413  2.0GB/s  html4                   +3.6%
BM_UFlat/6                 156574    156426  929.2MB/s  txt1                  +0.1%
BM_UFlat/7                 137574    137464  870.4MB/s  txt2                  +0.1%
BM_UFlat/8                 422551    421603  967.4MB/s  txt3                  +0.2%
BM_UFlat/9                 577749    578985  795.6MB/s  txt4                  -0.2%
BM_UFlat/10                 42329     39362  2.8GB/s  pb                      +7.5%
BM_UFlat/11                170615    169751  1037.9MB/s  gaviota              +0.5%
BM_UFlat/12                 12800     12719  1.8GB/s  cp                      +0.6%
BM_UFlat/13                  6585      6579  1.6GB/s  c                       +0.1%
BM_UFlat/14                  2066      2044  1.7GB/s  lsp                     +1.1%
BM_UFlat/15                750861    746911  1.3GB/s  xls                     +0.5%
BM_UFlat/16                   188       192  996.0MB/s  xls_200               -2.1%
BM_UFlat/17                271622    264333  1.8GB/s  bin                     +2.8%
BM_UFlat/18                   208       207  923.6MB/s  bin_200               +0.5%
BM_UFlat/19                 24667     24845  1.4GB/s  sum                     -0.7%
BM_UFlat/20                  2663      2662  1.5GB/s  man                     +0.0%
BM_ZFlat/0                 115173    115624  846.5MB/s  html (22.31 %)        -0.4%
BM_ZFlat/1                1530331   1537769  436.5MB/s  urls (47.77 %)        -0.5%
BM_ZFlat/2                  17503     17013  6.8GB/s  jpg (99.95 %)           +2.9%
BM_ZFlat/3                    385       385  496.3MB/s  jpg_200 (73.00 %)     +0.0%
BM_ZFlat/4                  61753     61540  1.6GB/s  pdf (81.85 %)           +0.3%
BM_ZFlat/5                 484806    483356  810.1MB/s  html4 (22.51 %)       +0.3%
BM_ZFlat/6                 464143    467609  310.9MB/s  txt1 (57.87 %)        -0.7%
BM_ZFlat/7                 410315    413319  289.5MB/s  txt2 (61.93 %)        -0.7%
BM_ZFlat/8                1244082   1249381  326.5MB/s  txt3 (54.92 %)        -0.4%
BM_ZFlat/9                1696914   1709685  269.4MB/s  txt4 (66.22 %)        -0.7%
BM_ZFlat/10                104148    103372  1096.7MB/s  pb (19.64 %)         +0.8%
BM_ZFlat/11                363522    359722  489.8MB/s  gaviota (37.72 %)     +1.1%
BM_ZFlat/12                 47021     50095  469.3MB/s  cp (48.12 %)          -6.1%
BM_ZFlat/13                 16888     16985  627.4MB/s  c (42.40 %)           -0.6%
BM_ZFlat/14                  5496      5469  650.3MB/s  lsp (48.37 %)         +0.5%
BM_ZFlat/15               1460713   1448760  679.5MB/s  xls (41.23 %)         +0.8%
BM_ZFlat/16                   387       393  486.8MB/s  xls_200 (78.00 %)     -1.5%
BM_ZFlat/17                457654    451462  1086.6MB/s  bin (18.11 %)        +1.4%
BM_ZFlat/18                    97        87  2.1GB/s  bin_200 (7.50 %)       +11.5%
BM_ZFlat/19                 77904     80924  451.7MB/s  sum (48.96 %)         -3.7%
BM_ZFlat/20                  7648      7663  527.1MB/s  man (59.36 %)         -0.2%
Sum of all benchmarks    25493635  25482069                                   +0.0%

A=dehao
R=sesse
2015-06-22 16:09:56 +02:00
Steinar H. Gunderson 11ccdfb868 Sync with various Google-internal changes.
Should not mean much for the open-source version.
2015-06-22 16:08:38 +02:00
snappy.mirrorbot@gmail.com 7c3c01df77 When we compare the number of bytes produced with the offset for a
backreference, make the signedness of the bytes produced clear,
by sticking it into a size_t. This avoids a signed/unsigned compare
warning from MSVC (public issue 71), and also is slightly clearer.

Since the line is now so long the explanatory comment about the -1u
trick has to go somewhere else anyway, I used the opportunity to
explain it in slightly more detail.

This is a purely stylistic change; the emitted assembler from GCC
is identical.

R=jeff


git-svn-id: https://snappy.googlecode.com/svn/trunk@79 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2013-07-29 11:06:44 +00:00
snappy.mirrorbot@gmail.com 2f0aaf8631 In the fast path for decompressing literals, instead of checking
whether there's 16 bytes free and then checking right afterwards
(when having subtracted the literal size) that there are now 
5 bytes free, just check once for 21 bytes. This skips a compare
and a branch; although it is easily predictable, it is still
a few cycles on a fast path that we would like to get rid of.

Benchmarking this yields very confusing results. On open-source
GCC 4.8.1 on Haswell, we get exactly the expected results; the
benchmarks where we hit the fast path for literals (in particular
the two HTML benchmarks and the protobuf benchmark) give very nice
speedups, and the others are not really affected.

However, benchmarks with Google's GCC branch on other hardware
is much less clear. It seems that we have a weak loss in some cases
(and the win for the “typical” win cases are not nearly as clear),
but that it depends on microarchitecture and plain luck in how we run
the benchmark. Looking at the generated assembler, it seems that
the removal of the if causes other large-scale changes in how the
function is laid out, which makes it likely that this is just bad luck.

Thus, we should keep this change, even though its exact current impact is
unclear; it's a sensible change per se, and dropping it on the basis of
microoptimization for a given compiler (or even branch of a compiler)
would seem like a bad strategy in the long run.

Microbenchmark results (all in 64-bit, opt mode):

  Nehalem, Google GCC:

  Benchmark                Base (ns)  New (ns)                       Improvement
  ------------------------------------------------------------------------------
  BM_UFlat/0                   76747     75591  1.3GB/s  html           +1.5%
  BM_UFlat/1                  765756    757040  886.3MB/s  urls         +1.2%
  BM_UFlat/2                   10867     10893  10.9GB/s  jpg           -0.2%
  BM_UFlat/3                     124       131  1.4GB/s  jpg_200        -5.3%
  BM_UFlat/4                   31663     31596  2.8GB/s  pdf            +0.2%
  BM_UFlat/5                  314162    308176  1.2GB/s  html4          +1.9%
  BM_UFlat/6                   29668     29746  790.6MB/s  cp           -0.3%
  BM_UFlat/7                   12958     13386  796.4MB/s  c            -3.2%
  BM_UFlat/8                    3596      3682  966.0MB/s  lsp          -2.3%
  BM_UFlat/9                 1019193   1033493  953.3MB/s  xls          -1.4%
  BM_UFlat/10                    239       247  775.3MB/s  xls_200      -3.2%
  BM_UFlat/11                 236411    240271  606.9MB/s  txt1         -1.6%
  BM_UFlat/12                 206639    209768  571.2MB/s  txt2         -1.5%
  BM_UFlat/13                 627803    635722  641.4MB/s  txt3         -1.2%
  BM_UFlat/14                 845932    857816  538.2MB/s  txt4         -1.4%
  BM_UFlat/15                 402107    391670  1.2GB/s  bin            +2.7%
  BM_UFlat/16                    283       279  683.6MB/s  bin_200      +1.4%
  BM_UFlat/17                  46070     46815  781.5MB/s  sum          -1.6%
  BM_UFlat/18                   5053      5163  782.0MB/s  man          -2.1%
  BM_UFlat/19                  79721     76581  1.4GB/s  pb             +4.1%
  BM_UFlat/20                 251158    252330  697.5MB/s  gaviota      -0.5%
  Sum of all benchmarks      4966150   4980396                          -0.3%


  Sandy Bridge, Google GCC:
  
  Benchmark                Base (ns)  New (ns)                       Improvement
  ------------------------------------------------------------------------------
  BM_UFlat/0                   42850     42182  2.3GB/s  html           +1.6%
  BM_UFlat/1                  525660    515816  1.3GB/s  urls           +1.9%
  BM_UFlat/2                    7173      7283  16.3GB/s  jpg           -1.5%
  BM_UFlat/3                      92        91  2.1GB/s  jpg_200        +1.1%
  BM_UFlat/4                   15147     14872  5.9GB/s  pdf            +1.8%
  BM_UFlat/5                  199936    192116  2.0GB/s  html4          +4.1%
  BM_UFlat/6                   12796     12443  1.8GB/s  cp             +2.8%
  BM_UFlat/7                    6588      6400  1.6GB/s  c              +2.9%
  BM_UFlat/8                    2010      1951  1.8GB/s  lsp            +3.0%
  BM_UFlat/9                  761124    763049  1.3GB/s  xls            -0.3%
  BM_UFlat/10                    186       189  1016.1MB/s  xls_200     -1.6%
  BM_UFlat/11                 159354    158460  918.6MB/s  txt1         +0.6%
  BM_UFlat/12                 139732    139950  856.1MB/s  txt2         -0.2%
  BM_UFlat/13                 429917    425027  961.7MB/s  txt3         +1.2%
  BM_UFlat/14                 585255    587324  785.8MB/s  txt4         -0.4%
  BM_UFlat/15                 276186    266173  1.8GB/s  bin            +3.8%
  BM_UFlat/16                    205       207  925.5MB/s  bin_200      -1.0%
  BM_UFlat/17                  24925     24935  1.4GB/s  sum            -0.0%
  BM_UFlat/18                   2632      2576  1.5GB/s  man            +2.2%
  BM_UFlat/19                  40546     39108  2.8GB/s  pb             +3.7%
  BM_UFlat/20                 175803    168209  1048.9MB/s  gaviota     +4.5%
  Sum of all benchmarks      3408117   3368361                          +1.2%


  Haswell, upstream GCC 4.8.1:

  Benchmark                Base (ns)  New (ns)                       Improvement
  ------------------------------------------------------------------------------
  BM_UFlat/0                   46308     40641  2.3GB/s  html          +13.9%
  BM_UFlat/1                  513385    514706  1.3GB/s  urls           -0.3%
  BM_UFlat/2                    6197      6151  19.2GB/s  jpg           +0.7%
  BM_UFlat/3                      61        61  3.0GB/s  jpg_200        +0.0%
  BM_UFlat/4                   13551     13429  6.5GB/s  pdf            +0.9%
  BM_UFlat/5                  198317    190243  2.0GB/s  html4          +4.2%
  BM_UFlat/6                   14768     12560  1.8GB/s  cp            +17.6%
  BM_UFlat/7                    6453      6447  1.6GB/s  c              +0.1%
  BM_UFlat/8                    1991      1980  1.8GB/s  lsp            +0.6%
  BM_UFlat/9                  766947    770424  1.2GB/s  xls            -0.5%
  BM_UFlat/10                    170       169  1.1GB/s  xls_200        +0.6%
  BM_UFlat/11                 164350    163554  888.7MB/s  txt1         +0.5%
  BM_UFlat/12                 145444    143830  832.1MB/s  txt2         +1.1%
  BM_UFlat/13                 437849    438413  929.2MB/s  txt3         -0.1%
  BM_UFlat/14                 603587    605309  759.8MB/s  txt4         -0.3%
  BM_UFlat/15                 249799    248067  1.9GB/s  bin            +0.7%
  BM_UFlat/16                    191       188  1011.4MB/s  bin_200     +1.6%
  BM_UFlat/17                  26064     24778  1.4GB/s  sum            +5.2%
  BM_UFlat/18                   2620      2601  1.5GB/s  man            +0.7%
  BM_UFlat/19                  44551     37373  3.0GB/s  pb            +19.2%
  BM_UFlat/20                 165408    164584  1.0GB/s  gaviota        +0.5%
  Sum of all benchmarks      3408011   3385508                          +0.7%


git-svn-id: https://snappy.googlecode.com/svn/trunk@78 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2013-06-30 19:24:03 +00:00
snappy.mirrorbot@gmail.com 062bf544a6 Make the two IncrementalCopy* functions take in an ssize_t instead of a len,
in order to avoid having to do 32-to-64-bit signed conversions on a hot path
during decompression. (Also fixes some MSVC warnings, mentioned in public
issue 75, but more of those remain.) They cannot be size_t because we expect
them to go negative and test for that.

This saves a few movzwl instructions, yielding ~2% speedup in decompression.


Sandy Bridge:

Benchmark                          Base (ns)  New (ns)                                Improvement
-------------------------------------------------------------------------------------------------
BM_UFlat/0                             48009     41283  2.3GB/s  html                   +16.3%
BM_UFlat/1                            531274    513419  1.3GB/s  urls                    +3.5%
BM_UFlat/2                              7378      7062  16.8GB/s  jpg                    +4.5%
BM_UFlat/3                                92        92  2.0GB/s  jpg_200                 +0.0%
BM_UFlat/4                             15057     14974  5.9GB/s  pdf                     +0.6%
BM_UFlat/5                            204323    193140  2.0GB/s  html4                   +5.8%
BM_UFlat/6                             13282     12611  1.8GB/s  cp                      +5.3%
BM_UFlat/7                              6511      6504  1.6GB/s  c                       +0.1%
BM_UFlat/8                              2014      2030  1.7GB/s  lsp                     -0.8%
BM_UFlat/9                            775909    768336  1.3GB/s  xls                     +1.0%
BM_UFlat/10                              182       184  1043.2MB/s  xls_200              -1.1%
BM_UFlat/11                           167352    161630  901.2MB/s  txt1                  +3.5%
BM_UFlat/12                           147393    142246  842.8MB/s  txt2                  +3.6%
BM_UFlat/13                           449960    432853  944.4MB/s  txt3                  +4.0%
BM_UFlat/14                           620497    594845  775.9MB/s  txt4                  +4.3%
BM_UFlat/15                           265610    267356  1.8GB/s  bin                     -0.7%
BM_UFlat/16                              206       205  932.7MB/s  bin_200               +0.5%
BM_UFlat/17                            25561     24730  1.4GB/s  sum                     +3.4%
BM_UFlat/18                             2620      2644  1.5GB/s  man                     -0.9%
BM_UFlat/19                            45766     38589  2.9GB/s  pb                     +18.6%
BM_UFlat/20                           171107    169832  1039.5MB/s  gaviota              +0.8%
Sum of all benchmarks                3500103   3394565                                   +3.1%


Westmere:

Benchmark                          Base (ns)  New (ns)                                Improvement
-------------------------------------------------------------------------------------------------
BM_UFlat/0                             72624     71526  1.3GB/s  html                    +1.5%
BM_UFlat/1                            735821    722917  930.8MB/s  urls                  +1.8%
BM_UFlat/2                             10450     10172  11.7GB/s  jpg                    +2.7%
BM_UFlat/3                               117       117  1.6GB/s  jpg_200                 +0.0%
BM_UFlat/4                             29817     29648  3.0GB/s  pdf                     +0.6%
BM_UFlat/5                            297126    293073  1.3GB/s  html4                   +1.4%
BM_UFlat/6                             28252     27994  842.0MB/s  cp                    +0.9%
BM_UFlat/7                             12672     12391  862.1MB/s  c                     +2.3%
BM_UFlat/8                              3507      3425  1040.9MB/s  lsp                  +2.4%
BM_UFlat/9                           1004268    969395  1018.0MB/s  xls                  +3.6%
BM_UFlat/10                              233       227  844.8MB/s  xls_200               +2.6%
BM_UFlat/11                           230054    224981  647.8MB/s  txt1                  +2.3%
BM_UFlat/12                           201229    196447  610.5MB/s  txt2                  +2.4%
BM_UFlat/13                           609547    596761  685.3MB/s  txt3                  +2.1%
BM_UFlat/14                           824362    804821  573.8MB/s  txt4                  +2.4%
BM_UFlat/15                           371095    374899  1.3GB/s  bin                     -1.0%
BM_UFlat/16                              267       267  717.8MB/s  bin_200               +0.0%
BM_UFlat/17                            44623     43828  835.9MB/s  sum                   +1.8%
BM_UFlat/18                             5077      4815  841.0MB/s  man                   +5.4%
BM_UFlat/19                            74964     73210  1.5GB/s  pb                      +2.4%
BM_UFlat/20                           237987    236745  746.0MB/s  gaviota               +0.5%
Sum of all benchmarks                4794092   4697659                                   +2.1%


Istanbul:

Benchmark                          Base (ns)  New (ns)                                Improvement
-------------------------------------------------------------------------------------------------
BM_UFlat/0                             98614     96376  1020.4MB/s  html                 +2.3%
BM_UFlat/1                            963740    953241  707.2MB/s  urls                  +1.1%
BM_UFlat/2                             25042     24769  4.8GB/s  jpg                     +1.1%
BM_UFlat/3                               180       180  1065.6MB/s  jpg_200              +0.0%
BM_UFlat/4                             45942     45403  1.9GB/s  pdf                     +1.2%
BM_UFlat/5                            400135    390226  1008.2MB/s  html4                +2.5%
BM_UFlat/6                             37768     37392  631.9MB/s  cp                    +1.0%
BM_UFlat/7                             18585     18200  588.2MB/s  c                     +2.1%
BM_UFlat/8                              5751      5690  627.7MB/s  lsp                   +1.1%
BM_UFlat/9                           1543154   1542209  641.4MB/s  xls                   +0.1%
BM_UFlat/10                              381       388  494.6MB/s  xls_200               -1.8%
BM_UFlat/11                           339715    331973  440.1MB/s  txt1                  +2.3%
BM_UFlat/12                           294807    289418  415.4MB/s  txt2                  +1.9%
BM_UFlat/13                           906160    884094  463.3MB/s  txt3                  +2.5%
BM_UFlat/14                          1224221   1198435  386.1MB/s  txt4                  +2.2%
BM_UFlat/15                           516277    502923  979.5MB/s  bin                   +2.7%
BM_UFlat/16                              405       402  477.2MB/s  bin_200               +0.7%
BM_UFlat/17                            61640     60621  605.6MB/s  sum                   +1.7%
BM_UFlat/18                             7326      7383  549.5MB/s  man                   -0.8%
BM_UFlat/19                            94720     92653  1.2GB/s  pb                      +2.2%
BM_UFlat/20                           360435    346687  510.6MB/s  gaviota               +4.0%
Sum of all benchmarks                6944998   6828663                                   +1.7%


git-svn-id: https://snappy.googlecode.com/svn/trunk@77 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2013-06-14 21:42:26 +00:00
snappy.mirrorbot@gmail.com 328aafa198 Add support for uncompressing to iovecs (scatter I/O).
Windows does not have struct iovec defined anywhere,
so we define our own version that's equal to what UNIX
typically has.

The bulk of this patch was contributed by Mohit Aron.

R=jeff


git-svn-id: https://snappy.googlecode.com/svn/trunk@76 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2013-06-13 16:19:52 +00:00
snappy.mirrorbot@gmail.com cd92eb0852 Some code reorganization needed for an internal change.
R=fikes


git-svn-id: https://snappy.googlecode.com/svn/trunk@75 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2013-06-12 19:51:15 +00:00
snappy.mirrorbot@gmail.com 698af469b4 Change a few ORs to additions where they don't matter. This helps the compiler
use the LEA instruction more efficiently, since e.g. a + (b << 2) can be encoded
as one instruction. Even more importantly, it can constant-fold the
COPY_* enums together with the shifted negative constants, which also saves
some instructions. (We don't need it for LITERAL, since it happens to be 0.)

I am unsure why the compiler couldn't do this itself, but the theory is that
it cannot prove that len-1 and len-4 cannot underflow/wrap, and thus can't
do the optimization safely.

The gains are small but measurable; 0.5-1.0% over the BM_Z* benchmarks
(measured on Westmere, Sandy Bridge and Istanbul).

R=sanjay


git-svn-id: https://snappy.googlecode.com/svn/trunk@69 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2013-01-04 11:54:20 +00:00
snappy.mirrorbot@gmail.com 8b95464146 Snappy library no longer depends on iostream.
Achieved by moving logging macro definitions to a test-only
header file, and by changing non-test code to use assert,
fprintf, and abort instead of LOG/CHECK macros.

R=sesse


git-svn-id: https://snappy.googlecode.com/svn/trunk@62 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2012-05-22 09:32:50 +00:00
snappy.mirrorbot@gmail.com dc63e0ad96 For 32-bit platforms, do not try to accelerate multiple neighboring
32-bit loads with a 64-bit load during compression (it's not a win).

The main target for this optimization is ARM, but 32-bit x86 gets
a small gain, too, although there is noise in the microbenchmarks.
It's a no-op for 64-bit x86. It does not affect decompression.

Microbenchmark results on a Cortex-A9 1GHz, using g++ 4.6.2 (from
Ubuntu/Linaro), -O2 -DNDEBUG -Wa,-march=armv7a -mtune=cortex-a9
-mthumb-interwork, minimum 1000 iterations:

  Benchmark            Time(ns)    CPU(ns) Iterations
  ---------------------------------------------------
  BM_ZFlat/0            1158277    1160000       1000 84.2MB/s  html (23.57 %)    [ +4.3%]
  BM_ZFlat/1           14861782   14860000       1000 45.1MB/s  urls (50.89 %)    [ +1.1%]
  BM_ZFlat/2             393595     390000       1000 310.5MB/s  jpg (99.88 %)    [ +0.0%]
  BM_ZFlat/3             650583     650000       1000 138.4MB/s  pdf (82.13 %)    [ +3.1%]
  BM_ZFlat/4            4661480    4660000       1000 83.8MB/s  html4 (23.55 %)   [ +4.3%]
  BM_ZFlat/5             491973     490000       1000 47.9MB/s  cp (48.12 %)      [ +2.0%]
  BM_ZFlat/6             193575     192678       1038 55.2MB/s  c (42.40 %)       [ +9.0%]
  BM_ZFlat/7              62343      62754       3187 56.5MB/s  lsp (48.37 %)     [ +2.6%]
  BM_ZFlat/8           17708468   17710000       1000 55.5MB/s  xls (41.34 %)     [ -0.3%]
  BM_ZFlat/9            3755345    3760000       1000 38.6MB/s  txt1 (59.81 %)    [ +8.2%]
  BM_ZFlat/10           3324217    3320000       1000 36.0MB/s  txt2 (64.07 %)    [ +4.2%]
  BM_ZFlat/11          10139932   10140000       1000 40.1MB/s  txt3 (57.11 %)    [ +6.4%]
  BM_ZFlat/12          13532109   13530000       1000 34.0MB/s  txt4 (68.35 %)    [ +5.0%]
  BM_ZFlat/13           4690847    4690000       1000 104.4MB/s  bin (18.21 %)    [ +4.1%]
  BM_ZFlat/14            830682     830000       1000 43.9MB/s  sum (51.88 %)     [ +1.2%]
  BM_ZFlat/15             84784      85011       2235 47.4MB/s  man (59.36 %)     [ +1.1%]
  BM_ZFlat/16           1293254    1290000       1000 87.7MB/s  pb (23.15 %)      [ +2.3%]
  BM_ZFlat/17           2775155    2780000       1000 63.2MB/s  gaviota (38.27 %) [+12.2%]

Core i7 in 32-bit mode (only one run and 100 iterations, though, so noisy):

  Benchmark            Time(ns)    CPU(ns) Iterations
  ---------------------------------------------------
  BM_ZFlat/0             227582     223464       3043 437.0MB/s  html (23.57 %)    [ +7.4%]
  BM_ZFlat/1            2982430    2918455        233 229.4MB/s  urls (50.89 %)    [ +2.9%]
  BM_ZFlat/2              46967      46658      15217 2.5GB/s  jpg (99.88 %)       [ +0.0%]
  BM_ZFlat/3             115298     114864       5833 783.2MB/s  pdf (82.13 %)     [ +1.5%]
  BM_ZFlat/4             913440     899743        778 434.2MB/s  html4 (23.55 %)   [ +0.3%]
  BM_ZFlat/5             110302     108571       7000 216.1MB/s  cp (48.12 %)      [ +0.0%]
  BM_ZFlat/6              44409      43372      15909 245.2MB/s  c (42.40 %)       [ +0.8%]
  BM_ZFlat/7              15713      15643      46667 226.9MB/s  lsp (48.37 %)     [ +2.7%]
  BM_ZFlat/8            2625539    2602230        269 377.4MB/s  xls (41.34 %)     [ +1.4%]
  BM_ZFlat/9             808884     811429        875 178.8MB/s  txt1 (59.81 %)    [ -3.9%]
  BM_ZFlat/10            709532     700000       1000 170.5MB/s  txt2 (64.07 %)    [ +0.0%]
  BM_ZFlat/11           2177682    2162162        333 188.2MB/s  txt3 (57.11 %)    [ -1.4%]
  BM_ZFlat/12           2849640    2840000        250 161.8MB/s  txt4 (68.35 %)    [ -1.4%]
  BM_ZFlat/13            849760     835476        778 585.8MB/s  bin (18.21 %)     [ +1.2%]
  BM_ZFlat/14            165940     164571       4375 221.6MB/s  sum (51.88 %)     [ +1.4%]
  BM_ZFlat/15             20939      20571      35000 196.0MB/s  man (59.36 %)     [ +2.1%]
  BM_ZFlat/16            239209     236544       2917 478.1MB/s  pb (23.15 %)      [ +4.2%]
  BM_ZFlat/17            616206     610000       1000 288.2MB/s  gaviota (38.27 %) [ -1.6%]

R=sanjay


git-svn-id: https://snappy.googlecode.com/svn/trunk@60 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2012-02-23 17:00:36 +00:00
snappy.mirrorbot@gmail.com f8829ea39d Enable the use of unaligned loads and stores for ARM-based architectures
where they are available (ARMv7 and higher). This gives a significant 
speed boost on ARM, both for compression and decompression. 
It should not affect x86 at all. 
 
There are more changes possible to speed up ARM, but it might not be 
that easy to do without hurting x86 or making the code uglier. 
Also, we de not try to use NEON yet. 
 
Microbenchmark results on a Cortex-A9 1GHz, using g++ 4.6.2 (from Ubuntu/Linaro), 
-O2 -DNDEBUG -Wa,-march=armv7a -mtune=cortex-a9 -mthumb-interwork: 
 
Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0             524806     529100        378 184.6MB/s  html            [+33.6%]
BM_UFlat/1            5139790    5200000        100 128.8MB/s  urls            [+28.8%]
BM_UFlat/2              86540      84166       1901 1.4GB/s  jpg               [ +0.6%]
BM_UFlat/3             215351     210176        904 428.0MB/s  pdf             [+29.8%]
BM_UFlat/4            2144490    2100000        100 186.0MB/s  html4           [+33.3%]
BM_UFlat/5             194482     190000       1000 123.5MB/s  cp              [+36.2%]
BM_UFlat/6              91843      90175       2107 117.9MB/s  c               [+38.6%]
BM_UFlat/7              28535      28426       6684 124.8MB/s  lsp             [+34.7%]
BM_UFlat/8            9206600    9200000        100 106.7MB/s  xls             [+42.4%]
BM_UFlat/9            1865273    1886792        106 76.9MB/s  txt1             [+32.5%]
BM_UFlat/10           1576809    1587301        126 75.2MB/s  txt2             [+32.3%]
BM_UFlat/11           4968450    4900000        100 83.1MB/s  txt3             [+32.7%]
BM_UFlat/12           6673970    6700000        100 68.6MB/s  txt4             [+32.8%]
BM_UFlat/13           2391470    2400000        100 203.9MB/s  bin             [+29.2%]
BM_UFlat/14            334601     344827        522 105.8MB/s  sum             [+30.6%]
BM_UFlat/15             37404      38080       5252 105.9MB/s  man             [+33.8%]
BM_UFlat/16            535470     540540        370 209.2MB/s  pb              [+31.2%]
BM_UFlat/17           1875245    1886792        106 93.2MB/s  gaviota          [+37.8%]
BM_UValidate/0         178425     179533       1114 543.9MB/s  html            [ +2.7%]
BM_UValidate/1        2100450    2000000        100 334.8MB/s  urls            [ +5.0%]
BM_UValidate/2           1039       1044     172413 113.3GB/s  jpg             [ +3.4%]
BM_UValidate/3          59423      59470       3363 1.5GB/s  pdf               [ +7.8%]
BM_UValidate/4         760716     766283        261 509.8MB/s  html4           [ +6.5%]
BM_ZFlat/0            1204632    1204819        166 81.1MB/s  html (23.57 %)   [+32.8%]
BM_ZFlat/1           15656190   15600000        100 42.9MB/s  urls (50.89 %)   [+27.6%]
BM_ZFlat/2             403336     410677        487 294.8MB/s  jpg (99.88 %)   [+16.5%]
BM_ZFlat/3             664073     671140        298 134.0MB/s  pdf (82.13 %)   [+28.4%]
BM_ZFlat/4            4961940    4900000        100 79.7MB/s  html4 (23.55 %)  [+30.6%]
BM_ZFlat/5             500664     501253        399 46.8MB/s  cp (48.12 %)     [+33.4%]
BM_ZFlat/6             217276     215982        926 49.2MB/s  c (42.40 %)      [+25.0%]
BM_ZFlat/7              64122      65487       3054 54.2MB/s  lsp (48.37 %)    [+36.1%]
BM_ZFlat/8           18045730   18000000        100 54.6MB/s  xls (41.34 %)    [+34.4%]
BM_ZFlat/9            4051530    4000000        100 36.3MB/s  txt1 (59.81 %)   [+25.0%]
BM_ZFlat/10           3451800    3500000        100 34.1MB/s  txt2 (64.07 %)   [+25.7%]
BM_ZFlat/11          11052340   11100000        100 36.7MB/s  txt3 (57.11 %)   [+24.3%]
BM_ZFlat/12          14538690   14600000        100 31.5MB/s  txt4 (68.35 %)   [+24.7%]
BM_ZFlat/13           5041850    5000000        100 97.9MB/s  bin (18.21 %)    [+32.0%]
BM_ZFlat/14            908840     909090        220 40.1MB/s  sum (51.88 %)    [+22.2%]
BM_ZFlat/15             86921      86206       1972 46.8MB/s  man (59.36 %)    [+42.2%]
BM_ZFlat/16           1312315    1315789        152 86.0MB/s  pb (23.15 %)     [+34.5%]
BM_ZFlat/17           3173120    3200000        100 54.9MB/s  gaviota (38.27%) [+28.1%]


The move from 64-bit to 32-bit operations for the copies also affected 32-bit x86;
positive on the decompression side, and slightly negative on the compression side
(unless that is noise; I only ran once):

Benchmark              Time(ns)    CPU(ns) Iterations
-----------------------------------------------------
BM_UFlat/0                86279      86140       7778 1.1GB/s  html             [ +7.5%]
BM_UFlat/1               839265     822622        778 813.9MB/s  urls           [ +9.4%]
BM_UFlat/2                 9180       9143      87500 12.9GB/s  jpg             [ +1.2%]
BM_UFlat/3                35080      35000      20000 2.5GB/s  pdf              [+10.1%]
BM_UFlat/4               350318     345000       2000 1.1GB/s  html4            [ +7.0%]
BM_UFlat/5                33808      33472      21212 701.0MB/s  cp             [ +9.0%]
BM_UFlat/6                15201      15214      46667 698.9MB/s  c              [+14.9%]
BM_UFlat/7                 4652       4651     159091 762.9MB/s  lsp            [ +7.5%]
BM_UFlat/8              1285551    1282528        538 765.7MB/s  xls            [+10.7%]
BM_UFlat/9               282510     281690       2414 514.9MB/s  txt1           [+13.6%]
BM_UFlat/10              243494     239286       2800 498.9MB/s  txt2           [+14.4%]
BM_UFlat/11              743625     740000       1000 550.0MB/s  txt3           [+14.3%]
BM_UFlat/12              999441     989717        778 464.3MB/s  txt4           [+16.1%]
BM_UFlat/13              412402     410076       1707 1.2GB/s  bin              [ +7.3%]
BM_UFlat/14               54876      54000      10000 675.3MB/s  sum            [+13.0%]
BM_UFlat/15                6146       6100     100000 660.8MB/s  man            [+14.8%]
BM_UFlat/16               90496      90286       8750 1.2GB/s  pb               [ +4.0%]
BM_UFlat/17              292650     292000       2500 602.0MB/s  gaviota        [+18.1%]
BM_UValidate/0            49620      49699      14286 1.9GB/s  html             [ +0.0%]
BM_UValidate/1           501371     500000       1000 1.3GB/s  urls             [ +0.0%]
BM_UValidate/2              232        227    3043478 521.5GB/s  jpg            [ +1.3%]
BM_UValidate/3            17250      17143      43750 5.1GB/s  pdf              [ -1.3%]
BM_UValidate/4           198643     200000       3500 1.9GB/s  html4            [ -0.9%]
BM_ZFlat/0               227128     229415       3182 425.7MB/s  html (23.57 %) [ -1.4%]
BM_ZFlat/1              2970089    2960000        250 226.2MB/s  urls (50.89 %) [ -1.9%]
BM_ZFlat/2                45683      44999      15556 2.6GB/s  jpg (99.88 %)    [ +2.2%]
BM_ZFlat/3               114661     113136       6364 795.1MB/s  pdf (82.13 %)  [ -1.5%]
BM_ZFlat/4               919702     914286        875 427.2MB/s  html4 (23.55%) [ -1.3%]
BM_ZFlat/5               108189     108422       6364 216.4MB/s  cp (48.12 %)   [ -1.2%]
BM_ZFlat/6                44525      44000      15909 241.7MB/s  c (42.40 %)    [ -2.9%]
BM_ZFlat/7                15973      15857      46667 223.8MB/s  lsp (48.37 %)  [ +0.0%]
BM_ZFlat/8              2677888    2639405        269 372.1MB/s  xls (41.34 %)  [ -1.4%]
BM_ZFlat/9               800715     780000       1000 186.0MB/s  txt1 (59.81 %) [ -0.4%]
BM_ZFlat/10              700089     700000       1000 170.5MB/s  txt2 (64.07 %) [ -2.9%]
BM_ZFlat/11             2159356    2138365        318 190.3MB/s  txt3 (57.11 %) [ -0.3%]
BM_ZFlat/12             2796143    2779923        259 165.3MB/s  txt4 (68.35 %) [ -1.4%]
BM_ZFlat/13              856458     835476        778 585.8MB/s  bin (18.21 %)  [ -0.1%]
BM_ZFlat/14              166908     166857       4375 218.6MB/s  sum (51.88 %)  [ -1.4%]
BM_ZFlat/15               21181      20857      35000 193.3MB/s  man (59.36 %)  [ -0.8%]
BM_ZFlat/16              244009     239973       2917 471.3MB/s  pb (23.15 %)   [ -1.4%]
BM_ZFlat/17              596362     590000       1000 297.9MB/s  gaviota (38.27%) [ +0.0%]

R=sanjay


git-svn-id: https://snappy.googlecode.com/svn/trunk@59 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2012-02-21 17:02:17 +00:00
snappy.mirrorbot@gmail.com e750dc0f05 Minor refactoring to accomodate changes in Google's internal code tree.
git-svn-id: https://snappy.googlecode.com/svn/trunk@57 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2012-01-08 17:55:48 +00:00
snappy.mirrorbot@gmail.com d9068ee301 Fix public issue r57: Fix most warnings with -Wall, mostly signed/unsigned
warnings. There are still some in the unit test, but the main .cc file should
be clean. We haven't enabled -Wall for the default build, since the unit test
is still not clean.

This also fixes a real bug in the open-source implementation of
ReadFileToStringOrDie(); it would not detect errors correctly.

I had to go through some pains to avoid performance loss as the types
were changed; I think there might still be some with 32-bit if and only if LFS
is enabled (ie., size_t is 64-bit), but for regular 32-bit and 64-bit I can't
see any losses, and I've diffed the generated GCC assembler between the old and
new code without seeing any significant choices. If anything, it's ever so
slightly faster.

This may or may not enable compression of very large blocks (>2^32 bytes)
when size_t is 64-bit, but I haven't checked, and it is still not a supported
case.


git-svn-id: https://snappy.googlecode.com/svn/trunk@56 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2012-01-04 13:10:46 +00:00
snappy.mirrorbot@gmail.com d7eb2dc413 Speed up decompression by moving the refill check to the end of the loop.
This seems to work because in most of the branches, the compiler can evaluate
“ip_limit_ - ip” in a more efficient way than reloading ip_limit_ from memory
(either by already having the entire expression in a register, or reconstructing
it from “avail”, or something else). Memory loads, even from L1, are seemingly
costly in the big picture at the current decompression speeds.

Microbenchmarks (64-bit, opt mode):

Westmere (Intel Core i7):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0       74492      74491     187894 1.3GB/s  html      [ +5.9%]
  BM_UFlat/1      712268     712263      19644 940.0MB/s  urls    [ +3.8%]
  BM_UFlat/2       10591      10590    1000000 11.2GB/s  jpg      [ -6.8%]
  BM_UFlat/3       29643      29643     469915 3.0GB/s  pdf       [ +7.9%]
  BM_UFlat/4      304669     304667      45930 1.3GB/s  html4     [ +4.8%]
  BM_UFlat/5       28508      28507     490077 823.1MB/s  cp      [ +4.0%]
  BM_UFlat/6       12415      12415    1000000 856.5MB/s  c       [ +8.6%]
  BM_UFlat/7        3415       3415    4084723 1039.0MB/s  lsp    [+18.0%]
  BM_UFlat/8      979569     979563      14261 1002.5MB/s  xls    [ +5.8%]
  BM_UFlat/9      230150     230148      60934 630.2MB/s  txt1    [ +5.2%]
  BM_UFlat/10     197167     197166      71135 605.5MB/s  txt2    [ +4.7%]
  BM_UFlat/11     607394     607390      23041 670.1MB/s  txt3    [ +5.6%]
  BM_UFlat/12     808502     808496      17316 568.4MB/s  txt4    [ +5.0%]
  BM_UFlat/13     372791     372788      37564 1.3GB/s  bin       [ +3.3%]
  BM_UFlat/14      44541      44541     313969 818.8MB/s  sum     [ +5.7%]
  BM_UFlat/15       4833       4833    2898697 834.1MB/s  man     [ +4.8%]
  BM_UFlat/16      79855      79855     175356 1.4GB/s  pb        [ +4.8%]
  BM_UFlat/17     245845     245843      56838 715.0MB/s  gaviota [ +5.8%]

Clovertown (Intel Core 2):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0      107911     107890     100000 905.1MB/s  html    [ +2.2%]
  BM_UFlat/1     1011237    1011041      10000 662.3MB/s  urls    [ +2.5%]
  BM_UFlat/2       26775      26770     523089 4.4GB/s  jpg       [ +0.0%]
  BM_UFlat/3       48103      48095     290618 1.8GB/s  pdf       [ +3.4%]
  BM_UFlat/4      437724     437644      31937 892.6MB/s  html4   [ +2.1%]
  BM_UFlat/5       39607      39600     358284 592.5MB/s  cp      [ +2.4%]
  BM_UFlat/6       18227      18224     768191 583.5MB/s  c       [ +2.7%]
  BM_UFlat/7        5171       5170    2709437 686.4MB/s  lsp     [ +3.9%]
  BM_UFlat/8     1560291    1559989       8970 629.5MB/s  xls     [ +3.6%]
  BM_UFlat/9      335401     335343      41731 432.5MB/s  txt1    [ +3.0%]
  BM_UFlat/10     287014     286963      48758 416.0MB/s  txt2    [ +2.8%]
  BM_UFlat/11     888522     888356      15752 458.1MB/s  txt3    [ +2.9%]
  BM_UFlat/12    1186600    1186378      10000 387.3MB/s  txt4    [ +3.1%]
  BM_UFlat/13     572295     572188      24468 855.4MB/s  bin     [ +2.1%]
  BM_UFlat/14      64060      64049     218401 569.4MB/s  sum     [ +4.1%]
  BM_UFlat/15       7264       7263    1916168 555.0MB/s  man     [ +1.4%]
  BM_UFlat/16     108853     108836     100000 1039.1MB/s  pb     [ +1.7%]
  BM_UFlat/17     364289     364223      38419 482.6MB/s  gaviota [ +4.9%]

Barcelona (AMD Opteron):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0      103900     103871     100000 940.2MB/s  html    [ +8.3%]
  BM_UFlat/1     1000435    1000107      10000 669.5MB/s  urls    [ +6.6%]
  BM_UFlat/2       24659      24652     567362 4.8GB/s  jpg       [ +0.1%]
  BM_UFlat/3       48206      48193     291121 1.8GB/s  pdf       [ +5.0%]
  BM_UFlat/4      421980     421850      33174 926.0MB/s  html4   [ +7.3%]
  BM_UFlat/5       40368      40357     346994 581.4MB/s  cp      [ +8.7%]
  BM_UFlat/6       19836      19830     708695 536.2MB/s  c       [ +8.0%]
  BM_UFlat/7        6100       6098    2292774 581.9MB/s  lsp     [ +9.0%]
  BM_UFlat/8     1693093    1692514       8261 580.2MB/s  xls     [ +8.0%]
  BM_UFlat/9      365991     365886      38225 396.4MB/s  txt1    [ +7.1%]
  BM_UFlat/10     311330     311238      44950 383.6MB/s  txt2    [ +7.6%]
  BM_UFlat/11     975037     974737      14376 417.5MB/s  txt3    [ +6.9%]
  BM_UFlat/12    1303558    1303175      10000 352.6MB/s  txt4    [ +7.3%]
  BM_UFlat/13     517448     517290      27144 946.2MB/s  bin     [ +5.5%]
  BM_UFlat/14      66537      66518     210352 548.3MB/s  sum     [ +7.5%]
  BM_UFlat/15       7976       7974    1760383 505.6MB/s  man     [ +5.6%]
  BM_UFlat/16     103121     103092     100000 1097.0MB/s  pb     [ +8.7%]
  BM_UFlat/17     391431     391314      35733 449.2MB/s  gaviota [ +6.5%]

R=sanjay


git-svn-id: https://snappy.googlecode.com/svn/trunk@54 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-12-05 21:27:26 +00:00
snappy.mirrorbot@gmail.com 5ed51ce15f Speed up decompression by making the fast path for literals faster.
We do the fast-path step as soon as possible; in fact, as soon as we know the
literal length. Since we usually hit the fast path, we can then skip the checks
for long literals and available input space (beyond what the fast path check
already does).

Note that this changes the decompression Writer API; however, it does not
change the ABI, since writers are always templatized and as such never
cross compilation units. The new API is slightly more general, in that it
doesn't hard-code the value 16. Note that we also take care to check
for len <= 16 first, since the other two checks almost always succeed
(so we don't want to waste time checking for them until we have to).

The improvements are most marked on Nehalem, but are generally positive
on other platforms as well. All microbenchmarks are 64-bit, opt.

Clovertown (Core 2):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0      110226     110224     100000 886.0MB/s  html    [ +1.5%]
  BM_UFlat/1     1036523    1036508      10000 646.0MB/s  urls    [ -0.8%]
  BM_UFlat/2       26775      26775     522570 4.4GB/s  jpg       [ +0.0%]
  BM_UFlat/3       49738      49737     280974 1.8GB/s  pdf       [ +0.3%]
  BM_UFlat/4      446790     446792      31334 874.3MB/s  html4   [ +0.8%]
  BM_UFlat/5       40561      40562     350424 578.5MB/s  cp      [ +1.3%]
  BM_UFlat/6       18722      18722     746903 568.0MB/s  c       [ +1.4%]
  BM_UFlat/7        5373       5373    2608632 660.5MB/s  lsp     [ +8.3%]
  BM_UFlat/8     1615716    1615718       8670 607.8MB/s  xls     [ +2.0%]
  BM_UFlat/9      345278     345281      40481 420.1MB/s  txt1    [ +1.4%]
  BM_UFlat/10     294855     294855      47452 404.9MB/s  txt2    [ +1.6%]
  BM_UFlat/11     914263     914263      15316 445.2MB/s  txt3    [ +1.1%]
  BM_UFlat/12    1222694    1222691      10000 375.8MB/s  txt4    [ +1.4%]
  BM_UFlat/13     584495     584489      23954 837.4MB/s  bin     [ -0.6%]
  BM_UFlat/14      66662      66662     210123 547.1MB/s  sum     [ +1.2%]
  BM_UFlat/15       7368       7368    1881856 547.1MB/s  man     [ +4.0%]
  BM_UFlat/16     110727     110726     100000 1021.4MB/s  pb     [ +2.3%]
  BM_UFlat/17     382138     382141      36616 460.0MB/s  gaviota [ -0.7%]

Westmere (Core i7):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0       78861      78853     177703 1.2GB/s  html      [ +2.1%]
  BM_UFlat/1      739560     739491      18912 905.4MB/s  urls    [ +3.4%]
  BM_UFlat/2        9867       9866    1419014 12.0GB/s  jpg      [ +3.4%]
  BM_UFlat/3       31989      31986     438385 2.7GB/s  pdf       [ +0.2%]
  BM_UFlat/4      319406     319380      43771 1.2GB/s  html4     [ +1.9%]
  BM_UFlat/5       29639      29636     472862 791.7MB/s  cp      [ +5.2%]
  BM_UFlat/6       13478      13477    1000000 789.0MB/s  c       [ +2.3%]
  BM_UFlat/7        4030       4029    3475364 880.7MB/s  lsp     [ +8.7%]
  BM_UFlat/8     1036585    1036492      10000 947.5MB/s  xls     [ +6.9%]
  BM_UFlat/9      242127     242105      57838 599.1MB/s  txt1    [ +3.0%]
  BM_UFlat/10     206499     206480      67595 578.2MB/s  txt2    [ +3.4%]
  BM_UFlat/11     641635     641570      21811 634.4MB/s  txt3    [ +2.4%]
  BM_UFlat/12     848847     848769      16443 541.4MB/s  txt4    [ +3.1%]
  BM_UFlat/13     384968     384938      36366 1.2GB/s  bin       [ +0.3%]
  BM_UFlat/14      47106      47101     297770 774.3MB/s  sum     [ +4.4%]
  BM_UFlat/15       5063       5063    2772202 796.2MB/s  man     [ +7.7%]
  BM_UFlat/16      83663      83656     167697 1.3GB/s  pb        [ +1.8%]
  BM_UFlat/17     260224     260198      53823 675.6MB/s  gaviota [ -0.5%]

Barcelona (Opteron):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0      112490     112457     100000 868.4MB/s  html    [ -0.4%]
  BM_UFlat/1     1066719    1066339      10000 627.9MB/s  urls    [ +1.0%]
  BM_UFlat/2       24679      24672     563802 4.8GB/s  jpg       [ +0.7%]
  BM_UFlat/3       50603      50589     277285 1.7GB/s  pdf       [ +2.6%]
  BM_UFlat/4      452982     452849      30900 862.6MB/s  html4   [ -0.2%]
  BM_UFlat/5       43860      43848     319554 535.1MB/s  cp      [ +1.2%]
  BM_UFlat/6       21419      21413     653573 496.6MB/s  c       [ +1.0%]
  BM_UFlat/7        6646       6645    2105405 534.1MB/s  lsp     [ +0.3%]
  BM_UFlat/8     1828487    1827886       7658 537.3MB/s  xls     [ +2.6%]
  BM_UFlat/9      391824     391714      35708 370.3MB/s  txt1    [ +2.2%]
  BM_UFlat/10     334913     334816      41885 356.6MB/s  txt2    [ +1.7%]
  BM_UFlat/11    1042062    1041674      10000 390.7MB/s  txt3    [ +1.1%]
  BM_UFlat/12    1398902    1398456      10000 328.6MB/s  txt4    [ +1.7%]
  BM_UFlat/13     545706     545530      25669 897.2MB/s  bin     [ -0.4%]
  BM_UFlat/14      71512      71505     196035 510.0MB/s  sum     [ +1.4%]
  BM_UFlat/15       8422       8421    1665036 478.7MB/s  man     [ +2.6%]
  BM_UFlat/16     112053     112048     100000 1009.3MB/s  pb     [ -0.4%]
  BM_UFlat/17     416723     416713      33612 421.8MB/s  gaviota [ -2.0%]

R=sanjay


git-svn-id: https://snappy.googlecode.com/svn/trunk@53 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-11-23 11:14:17 +00:00
snappy.mirrorbot@gmail.com 57e7cd7255 Fix public issue #44: Make the definition and declaration of CompressFragment
identical, even regarding cv-qualifiers.

This is required to work around a bug in the Solaris Studio C++ compiler
(it does not properly disregard cv-qualifiers when doing name mangling).

R=sanjay


git-svn-id: https://snappy.googlecode.com/svn/trunk@44 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-06-28 11:40:25 +00:00
snappy.mirrorbot@gmail.com f540673740 Speed up decompression by removing a fast-path attempt.
Whenever we try to enter a copy fast-path, there is a certain cost in checking
that all the preconditions are in place, but it's normally offset by the fact
that we can usually take the cheaper path. However, in a certain path we've
already established that "avail < literal_length", which usually means that
either the available space is small, or the literal is big. Both will disqualify
us from taking the fast path, and thus we take the hit from the precondition
checking without gaining much from having a fast path. Thus, simply don't try
the fast path in this situation -- we're already on a slow path anyway
(one where we need to refill more data from the reader).

I'm a bit surprised at how much this gained; it could be that this path is
more common than I thought, or that the simpler structure somehow makes the
compiler happier. I haven't looked at the assembler, but it's a win across
the board on both Core 2, Core i7 and Opteron, at least for the cases we
typically care about. The gains seem to be the largest on Core i7, though.
Results from my Core i7 workstation:


  Benchmark            Time(ns)    CPU(ns) Iterations
  ---------------------------------------------------
  BM_UFlat/0              73337      73091     190996 1.3GB/s  html      [ +1.7%]
  BM_UFlat/1             696379     693501      20173 965.5MB/s  urls    [ +2.7%]
  BM_UFlat/2               9765       9734    1472135 12.1GB/s  jpg      [ +0.7%]
  BM_UFlat/3              29720      29621     472973 3.0GB/s  pdf       [ +1.8%]
  BM_UFlat/4             294636     293834      47782 1.3GB/s  html4     [ +2.3%]
  BM_UFlat/5              28399      28320     494700 828.5MB/s  cp      [ +3.5%]
  BM_UFlat/6              12795      12760    1000000 833.3MB/s  c       [ +1.2%]
  BM_UFlat/7               3984       3973    3526448 893.2MB/s  lsp     [ +5.7%]
  BM_UFlat/8             991996     989322      14141 992.6MB/s  xls     [ +3.3%]
  BM_UFlat/9             228620     227835      61404 636.6MB/s  txt1    [ +4.0%]
  BM_UFlat/10            197114     196494      72165 607.5MB/s  txt2    [ +3.5%]
  BM_UFlat/11            605240     603437      23217 674.4MB/s  txt3    [ +3.7%]
  BM_UFlat/12            804157     802016      17456 573.0MB/s  txt4    [ +3.9%]
  BM_UFlat/13            347860     346998      40346 1.4GB/s  bin       [ +1.2%]
  BM_UFlat/14             44684      44559     315315 818.4MB/s  sum     [ +2.3%]
  BM_UFlat/15              5120       5106    2739726 789.4MB/s  man     [ +3.3%]
  BM_UFlat/16             76591      76355     183486 1.4GB/s  pb        [ +2.8%]
  BM_UFlat/17            238564     237828      58824 739.1MB/s  gaviota [ +1.6%]
  BM_UValidate/0          42194      42060     333333 2.3GB/s  html      [ -0.1%]
  BM_UValidate/1         433182     432005      32407 1.5GB/s  urls      [ -0.1%]
  BM_UValidate/2            197        196   71428571 603.3GB/s  jpg     [ +0.5%]
  BM_UValidate/3          14494      14462     972222 6.1GB/s  pdf       [ +0.5%]
  BM_UValidate/4         168444     167836      83832 2.3GB/s  html4     [ +0.1%]
	
R=jeff

Revision created by MOE tool push_codebase.


git-svn-id: https://snappy.googlecode.com/svn/trunk@42 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-06-03 20:53:06 +00:00
snappy.mirrorbot@gmail.com 197f3ee9f9 Speed up decompression by not needing a lookup table for literal items.
Looking up into and decoding the values from char_table has long shown up as a
hotspot in the decompressor. While it turns out that it's hard to make a more
efficient decoder for the copy ops, the literals are simple enough that we can
decode them without needing a table lookup. (This means that 1/4 of the table
is now unused, although that in itself doesn't buy us anything.)

The gains are small, but definitely present; some tests win as much as 10%,
but 1-4% is more typical. These results are from Core i7, in 64-bit mode;
Core 2 and Opteron show similar results. (I've run with more iterations
than unusual to make sure the smaller gains don't drown entirely in noise.)

  Benchmark            Time(ns)    CPU(ns) Iterations
  ---------------------------------------------------
  BM_UFlat/0              74665      74428     182055 1.3GB/s  html      [ +3.1%]
  BM_UFlat/1             714106     711997      19663 940.4MB/s  urls    [ +4.4%]
  BM_UFlat/2               9820       9789    1427115 12.1GB/s  jpg      [ -1.2%]
  BM_UFlat/3              30461      30380     465116 2.9GB/s  pdf       [ +0.8%]
  BM_UFlat/4             301445     300568      46512 1.3GB/s  html4     [ +2.2%]
  BM_UFlat/5              29338      29263     479452 801.8MB/s  cp      [ +1.6%]
  BM_UFlat/6              13004      12970    1000000 819.9MB/s  c       [ +2.1%]
  BM_UFlat/7               4180       4168    3349282 851.4MB/s  lsp     [ +1.3%]
  BM_UFlat/8            1026149    1024000      10000 959.0MB/s  xls     [+10.7%]
  BM_UFlat/9             237441     236830      59072 612.4MB/s  txt1    [ +0.3%]
  BM_UFlat/10            203966     203298      69307 587.2MB/s  txt2    [ +0.8%]
  BM_UFlat/11            627230     625000      22400 651.2MB/s  txt3    [ +0.7%]
  BM_UFlat/12            836188     833979      16787 551.0MB/s  txt4    [ +1.3%]
  BM_UFlat/13            351904     350750      39886 1.4GB/s  bin       [ +3.8%]
  BM_UFlat/14             45685      45562     308370 800.4MB/s  sum     [ +5.9%]
  BM_UFlat/15              5286       5270    2656546 764.9MB/s  man     [ +1.5%]
  BM_UFlat/16             78774      78544     178117 1.4GB/s  pb        [ +4.3%]
  BM_UFlat/17            242270     241345      58091 728.3MB/s  gaviota [ +1.2%]
  BM_UValidate/0          42149      42000     333333 2.3GB/s  html      [ -3.0%]
  BM_UValidate/1         432741     431303      32483 1.5GB/s  urls      [ +7.8%]
  BM_UValidate/2            198        197   71428571 600.7GB/s  jpg     [+16.8%]
  BM_UValidate/3          14560      14521     965517 6.1GB/s  pdf       [ -4.1%]
  BM_UValidate/4         169065     168671      83832 2.3GB/s  html4     [ -2.9%]

R=jeff

Revision created by MOE tool push_codebase.


git-svn-id: https://snappy.googlecode.com/svn/trunk@41 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-06-03 20:47:14 +00:00
snappy.mirrorbot@gmail.com 2e12124bd8 Remove an unneeded goto in the decompressor; it turns out that the
state of ip_ after decompression (or attempted decompresion) is
completely irrelevant, so we don't need the trailer.

Performance is, as expected, mostly flat -- there's a curious ~3-5%
loss in the "lsp" test, but that test case is so short it is hard to say
anything definitive about why (most likely, it's some sort of
unrelated effect).

R=jeff


git-svn-id: https://snappy.googlecode.com/svn/trunk@39 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-06-02 18:06:54 +00:00
snappy.mirrorbot@gmail.com c266bbf321 Speed up decompression by caching ip_.
It is seemingly hard for the compiler to understand that ip_, the current input
pointer into the compressed data stream, can not alias on anything else, and
thus using it directly will incur memory traffic as it cannot be kept in a
register. The code already knew about this and cached it into a local
variable, but since Step() only decoded one tag, it had to move ip_ back into
place between every tag. This seems to have cost us a significant amount of
performance, so changing Step() into a function that decodes as much as it can
before it saves ip_ back and returns. (Note that Step() was already inlined,
so it is not the manual inlining that buys the performance here.)

The wins are about 3-6% for Core 2, 6-13% on Core i7 and 5-12% on Opteron
(for plain array-to-array decompression, in 64-bit opt mode).

There is a tiny difference in the behavior here; if an invalid literal is
encountered (ie., the writer refuses the Append() operation), ip_ will now
point to the byte past the tag byte, instead of where the literal was
originally thought to end. However, we don't use ip_ for anything after
DecompressAllTags() has returned, so this should not change external behavior
in any way.

Microbenchmark results for Core i7, 64-bit (Opteron results are similar):

Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0              79134      79110       8835 1.2GB/s  html      [ +6.2%]
BM_UFlat/1             786126     786096        891 851.8MB/s  urls    [+10.0%]
BM_UFlat/2               9948       9948      69125 11.9GB/s  jpg      [ -1.3%]
BM_UFlat/3              31999      31998      21898 2.7GB/s  pdf       [ +6.5%]
BM_UFlat/4             318909     318829       2204 1.2GB/s  html4     [ +6.5%]
BM_UFlat/5              31384      31390      22363 747.5MB/s  cp      [ +9.2%]
BM_UFlat/6              14037      14034      49858 757.7MB/s  c       [+10.6%]
BM_UFlat/7               4612       4612     151395 769.5MB/s  lsp     [ +9.5%]
BM_UFlat/8            1203174    1203007        582 816.3MB/s  xls     [+19.3%]
BM_UFlat/9             253869     253955       2757 571.1MB/s  txt1    [+11.4%]
BM_UFlat/10            219292     219290       3194 544.4MB/s  txt2    [+12.1%]
BM_UFlat/11            672135     672131       1000 605.5MB/s  txt3    [+11.2%]
BM_UFlat/12            902512     902492        776 509.2MB/s  txt4    [+12.5%]
BM_UFlat/13            372110     371998       1881 1.3GB/s  bin       [ +5.8%]
BM_UFlat/14             50407      50407      10000 723.5MB/s  sum     [+13.5%]
BM_UFlat/15              5699       5701     100000 707.2MB/s  man     [+12.4%]
BM_UFlat/16             83448      83424       8383 1.3GB/s  pb        [ +5.7%]
BM_UFlat/17            256958     256963       2723 684.1MB/s  gaviota [ +7.9%]
BM_UValidate/0          42795      42796      16351 2.2GB/s  html      [+25.8%]
BM_UValidate/1         490672     490622       1427 1.3GB/s  urls      [+22.7%]
BM_UValidate/2            237        237    2950297 499.0GB/s  jpg     [+24.9%]
BM_UValidate/3          14610      14611      47901 6.0GB/s  pdf       [+26.8%]
BM_UValidate/4         171973     171990       4071 2.2GB/s  html4     [+25.7%]




git-svn-id: https://snappy.googlecode.com/svn/trunk@38 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-06-02 17:59:40 +00:00
snappy.mirrorbot@gmail.com 7e8ca8f831 Change on 2011-03-25 19:18:00-07:00 by sesse
Replace the Apache 2.0 license header by the BSD-type license header;
	somehow a lot of the files were missed in the last round.

	R=dannyb,csilvers
	DELTA=147  (74 added, 2 deleted, 71 changed)

Change on 2011-03-25 19:25:07-07:00 by sesse

	Unbreak the build; the relicensing removed a bit too much (only comments
	were intended, but I also accidentially removed some of the top lines of
	the actual source).



Revision created by MOE tool push_codebase.
MOE_MIGRATION=1072


git-svn-id: https://snappy.googlecode.com/svn/trunk@21 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-03-26 02:34:34 +00:00
snappy.mirrorbot@gmail.com 28a6440239 Revision created by MOE tool push_codebase.
MOE_MIGRATION=


git-svn-id: https://snappy.googlecode.com/svn/trunk@2 03e5f5b5-db94-4691-08a0-1a8bf15f6143
2011-03-18 17:14:15 +00:00