snappy

mirror of https://github.com/google/snappy.git synced 2024-11-30 22:42:07 +00:00

Author	SHA1	Message	Date
costan	73c31e824c	Fix Visual Studio build. Commit `8f469d97e2` introduced SSSE3 fast paths that are gated by __SSE3__ macro checks and the <x86intrin.h> header, neither of which exists in Visual Studio. This commit adds logic for detecting SSSE3 compiler support that works for all compilers supported by the open source release. The commit also replaces the header with <tmmintrin.h>, which only defines intrinsics supported by SSSE3 and below. This should help flag any use of SIMD instructions that require more advanced SSE support, so the uses can be gated by checks that also work in the open source release. Last, this commit requires C++11 support for the open source build. This is needed by the alignas specifier, which was also introduced in commit `8f469d97e2`.	2018-08-08 22:25:14 -07:00
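A minimal sketch of the kind of compiler-independent gating the commit above describes. The macro name `EXAMPLE_HAVE_SSSE3` and the helper `ShuffleBytes16` are hypothetical, and treating MSVC's `__AVX__` as a stand-in for SSSE3 support is an assumption of this sketch, not necessarily the detection logic the commit uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical feature macro. GCC and Clang define __SSSE3__ when SSSE3 code
// generation is enabled; MSVC does not, so this sketch assumes /arch:AVX
// (which defines __AVX__) as a stand-in there.
#if defined(__SSSE3__) || (defined(_MSC_VER) && defined(__AVX__))
#define EXAMPLE_HAVE_SSSE3 1
#include <tmmintrin.h>  // Declares intrinsics up to and including SSSE3.
#else
#define EXAMPLE_HAVE_SSSE3 0
#endif

// Permutes 16 bytes: out[i] = in[mask[i]], for mask values in 0..15.
inline void ShuffleBytes16(const std::uint8_t* in, const std::uint8_t* mask,
                           std::uint8_t* out) {
  assert(in != nullptr && mask != nullptr && out != nullptr);
#if EXAMPLE_HAVE_SSSE3
  const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
  const __m128i m = _mm_loadu_si128(reinterpret_cast<const __m128i*>(mask));
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_shuffle_epi8(v, m));
#else
  // Portable fallback, used when SSSE3 is not known to be available.
  std::uint8_t tmp[16];
  for (int i = 0; i < 16; ++i) tmp[i] = in[mask[i] & 15];
  std::memcpy(out, tmp, 16);
#endif
}
```

Including only `<tmmintrin.h>` means that an intrinsic from a later instruction set fails to compile instead of slipping in unguarded.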
jefflim	27ff0af12a	Improve performance of zippy decompression to IOVecs by up to almost 50% 1) Simplify loop condition for small pattern IncrementalCopy 2) Use pointers rather than indices to track current iovec. 3) Use fast IncrementalCopy 4) Bypass Append check from within AppendFromSelf While this code greatly improves the performance of ZippyIOVecWriter, a bigger question is whether IOVec writing should be improved, or removed. Perf tests: name old speed new speed delta BM_UFlat/0 [html ] 2.13GB/s ± 0% 2.14GB/s ± 1% ~ BM_UFlat/1 [urls ] 1.22GB/s ± 0% 1.24GB/s ± 0% +1.87% BM_UFlat/2 [jpg ] 17.2GB/s ± 1% 17.1GB/s ± 0% ~ BM_UFlat/3 [jpg_200 ] 1.55GB/s ± 0% 1.53GB/s ± 2% ~ BM_UFlat/4 [pdf ] 12.8GB/s ± 1% 12.7GB/s ± 2% -0.36% BM_UFlat/5 [html4 ] 1.89GB/s ± 0% 1.90GB/s ± 1% ~ BM_UFlat/6 [txt1 ] 811MB/s ± 0% 829MB/s ± 1% +2.24% BM_UFlat/7 [txt2 ] 756MB/s ± 0% 774MB/s ± 1% +2.41% BM_UFlat/8 [txt3 ] 860MB/s ± 0% 879MB/s ± 1% +2.16% BM_UFlat/9 [txt4 ] 699MB/s ± 0% 715MB/s ± 1% +2.31% BM_UFlat/10 [pb ] 2.64GB/s ± 0% 2.65GB/s ± 1% ~ BM_UFlat/11 [gaviota ] 1.00GB/s ± 0% 0.99GB/s ± 2% ~ BM_UFlat/12 [cp ] 1.66GB/s ± 1% 1.66GB/s ± 2% ~ BM_UFlat/13 [c ] 1.53GB/s ± 0% 1.47GB/s ± 5% -3.97% BM_UFlat/14 [lsp ] 1.60GB/s ± 1% 1.55GB/s ± 5% -3.41% BM_UFlat/15 [xls ] 1.12GB/s ± 0% 1.15GB/s ± 0% +1.93% BM_UFlat/16 [xls_200 ] 918MB/s ± 2% 929MB/s ± 1% +1.15% BM_UFlat/17 [bin ] 1.86GB/s ± 0% 1.89GB/s ± 1% +1.61% BM_UFlat/18 [bin_200 ] 1.90GB/s ± 1% 1.97GB/s ± 1% +3.67% BM_UFlat/19 [sum ] 1.32GB/s ± 0% 1.33GB/s ± 1% ~ BM_UFlat/20 [man ] 1.39GB/s ± 0% 1.36GB/s ± 3% ~ BM_UValidate/0 [html ] 2.85GB/s ± 3% 2.90GB/s ± 0% ~ BM_UValidate/1 [urls ] 1.57GB/s ± 0% 1.56GB/s ± 0% -0.20% BM_UValidate/2 [jpg ] 824GB/s ± 0% 825GB/s ± 0% +0.11% BM_UValidate/3 [jpg_200 ] 2.01GB/s ± 0% 2.02GB/s ± 0% +0.10% BM_UValidate/4 [pdf ] 30.4GB/s ±11% 33.5GB/s ± 0% ~ BM_UIOVec/0 [html ] 604MB/s ± 0% 856MB/s ± 0% +41.70% BM_UIOVec/1 [urls ] 440MB/s ± 0% 660MB/s ± 0% +49.91% BM_UIOVec/2 [jpg ] 15.1GB/s ± 1% 15.3GB/s ± 
1% +1.22% BM_UIOVec/3 [jpg_200 ] 567MB/s ± 1% 629MB/s ± 0% +10.89% BM_UIOVec/4 [pdf ] 7.16GB/s ± 2% 8.56GB/s ± 1% +19.64% BM_UFlatSink/0 [html ] 2.13GB/s ± 0% 2.16GB/s ± 0% +1.47% BM_UFlatSink/1 [urls ] 1.22GB/s ± 0% 1.25GB/s ± 0% +2.18% BM_UFlatSink/2 [jpg ] 17.1GB/s ± 2% 17.1GB/s ± 2% ~ BM_UFlatSink/3 [jpg_200 ] 1.51GB/s ± 1% 1.53GB/s ± 2% +1.11% BM_UFlatSink/4 [pdf ] 12.7GB/s ± 2% 12.8GB/s ± 1% +0.67% BM_UFlatSink/5 [html4 ] 1.90GB/s ± 0% 1.92GB/s ± 0% +1.31% BM_UFlatSink/6 [txt1 ] 810MB/s ± 0% 835MB/s ± 0% +3.04% BM_UFlatSink/7 [txt2 ] 755MB/s ± 0% 779MB/s ± 0% +3.19% BM_UFlatSink/8 [txt3 ] 859MB/s ± 0% 884MB/s ± 0% +2.86% BM_UFlatSink/9 [txt4 ] 698MB/s ± 0% 718MB/s ± 0% +2.96% BM_UFlatSink/10 [pb ] 2.64GB/s ± 0% 2.67GB/s ± 0% +1.16% BM_UFlatSink/11 [gaviota ] 1.00GB/s ± 0% 1.01GB/s ± 0% +1.04% BM_UFlatSink/12 [cp ] 1.66GB/s ± 1% 1.68GB/s ± 1% +0.83% BM_UFlatSink/13 [c ] 1.52GB/s ± 1% 1.53GB/s ± 0% +0.38% BM_UFlatSink/14 [lsp ] 1.60GB/s ± 1% 1.61GB/s ± 0% +0.91% BM_UFlatSink/15 [xls ] 1.12GB/s ± 0% 1.15GB/s ± 0% +1.96% BM_UFlatSink/16 [xls_200 ] 906MB/s ± 3% 920MB/s ± 1% +1.55% BM_UFlatSink/17 [bin ] 1.86GB/s ± 0% 1.90GB/s ± 0% +2.15% BM_UFlatSink/18 [bin_200 ] 1.85GB/s ± 2% 1.92GB/s ± 2% +4.01% BM_UFlatSink/19 [sum ] 1.32GB/s ± 1% 1.35GB/s ± 0% +2.23% BM_UFlatSink/20 [man ] 1.39GB/s ± 1% 1.40GB/s ± 0% +1.12% BM_ZFlat/0 [html (22.31 %) ] 800MB/s ± 0% 793MB/s ± 0% -0.95% BM_ZFlat/1 [urls (47.78 %) ] 423MB/s ± 0% 424MB/s ± 0% +0.11% BM_ZFlat/2 [jpg (99.95 %) ] 12.0GB/s ± 2% 12.0GB/s ± 4% ~ BM_ZFlat/3 [jpg_200 (73.00 %)] 592MB/s ± 3% 594MB/s ± 2% ~ BM_ZFlat/4 [pdf (83.30 %) ] 7.26GB/s ± 1% 7.23GB/s ± 2% -0.49% BM_ZFlat/5 [html4 (22.52 %) ] 738MB/s ± 0% 739MB/s ± 0% +0.17% BM_ZFlat/6 [txt1 (57.88 %) ] 286MB/s ± 0% 285MB/s ± 0% -0.09% BM_ZFlat/7 [txt2 (61.91 %) ] 264MB/s ± 0% 264MB/s ± 0% +0.08% BM_ZFlat/8 [txt3 (54.99 %) ] 300MB/s ± 0% 300MB/s ± 0% ~ BM_ZFlat/9 [txt4 (66.26 %) ] 248MB/s ± 0% 247MB/s ± 0% -0.20% BM_ZFlat/10 [pb (19.68 %) ] 1.04GB/s ± 0% 1.03GB/s ± 
0% -1.17% BM_ZFlat/11 [gaviota (37.72 %)] 451MB/s ± 0% 450MB/s ± 0% -0.35% BM_ZFlat/12 [cp (48.12 %) ] 543MB/s ± 0% 538MB/s ± 0% -1.04% BM_ZFlat/13 [c (42.47 %) ] 638MB/s ± 1% 643MB/s ± 0% +0.68% BM_ZFlat/14 [lsp (48.37 %) ] 686MB/s ± 0% 691MB/s ± 1% +0.76% BM_ZFlat/15 [xls (41.23 %) ] 636MB/s ± 0% 633MB/s ± 0% -0.52% BM_ZFlat/16 [xls_200 (78.00 %)] 523MB/s ± 2% 520MB/s ± 2% -0.56% BM_ZFlat/17 [bin (18.11 %) ] 1.01GB/s ± 0% 1.01GB/s ± 0% +0.50% BM_ZFlat/18 [bin_200 (7.50 %) ] 2.45GB/s ± 1% 2.44GB/s ± 1% -0.54% BM_ZFlat/19 [sum (48.96 %) ] 487MB/s ± 0% 478MB/s ± 0% -1.89% BM_ZFlat/20 [man (59.21 %) ] 567MB/s ± 1% 566MB/s ± 1% ~ The BM_UFlat/13 and BM_UFlat/14 results showed high variance, so I reran them: name old speed new speed delta BM_UFlat/13 [c ] 1.53GB/s ± 0% 1.53GB/s ± 1% ~ BM_UFlat/14 [lsp] 1.61GB/s ± 1% 1.61GB/s ± 1% +0.25%	2018-08-07 23:41:17 -07:00
costan	4ffb0e62c5	Update Travis CI configuration.	2018-08-07 21:33:14 -07:00
atdt	be490ef9ec	Test for SSE3 support before using pshufb.	2018-08-04 18:51:13 -07:00
atdt	8f469d97e2	Avoid store-forwarding stalls in Zippy's IncrementalCopy NEW: Annotate `pattern` as initialized, for MSan. Snappy's IncrementalCopy routine optimizes for speed by reading and writing memory in blocks of eight or sixteen bytes. If the gap between the source and destination pointers is smaller than eight bytes, snappy's strategy is to expand the gap by issuing a series of partly-overlapping eight-byte loads+stores. Because the range of each load partly overlaps that of the store which preceded it, the store buffer cannot be forwarded to the load, and the load stalls while it waits for the store to retire. This is called a store-forwarding stall. We can use fewer loads and avoid most of the stalls by loading the first eight bytes into an 128-bit XMM register, then using PSHUFB to permute the register's contents in-place into the desired repeating sequence of bytes. When falling back to IncrementalCopySlow, use memset if the pattern size == 1. This eliminates around 60% of the stalls. 
name old time/op new time/op delta BM_UFlat/0 [html] 48.6µs ± 0% 48.2µs ± 0% -0.92% (p=0.000 n=19+18) BM_UFlat/1 [urls] 589µs ± 0% 576µs ± 0% -2.17% (p=0.000 n=19+18) BM_UFlat/2 [jpg] 7.12µs ± 0% 7.10µs ± 0% ~ (p=0.071 n=19+18) BM_UFlat/3 [jpg_200] 162ns ± 0% 151ns ± 0% -7.06% (p=0.000 n=19+18) BM_UFlat/4 [pdf] 8.25µs ± 0% 8.19µs ± 0% -0.74% (p=0.000 n=19+18) BM_UFlat/5 [html4] 218µs ± 0% 218µs ± 0% +0.09% (p=0.000 n=17+18) BM_UFlat/6 [txt1] 191µs ± 0% 189µs ± 0% -1.12% (p=0.000 n=19+18) BM_UFlat/7 [txt2] 168µs ± 0% 167µs ± 0% -1.01% (p=0.000 n=19+18) BM_UFlat/8 [txt3] 502µs ± 0% 499µs ± 0% -0.52% (p=0.000 n=19+18) BM_UFlat/9 [txt4] 704µs ± 0% 695µs ± 0% -1.26% (p=0.000 n=19+18) BM_UFlat/10 [pb] 45.6µs ± 0% 44.2µs ± 0% -3.13% (p=0.000 n=19+15) BM_UFlat/11 [gaviota] 188µs ± 0% 194µs ± 0% +3.06% (p=0.000 n=15+18) BM_UFlat/12 [cp] 15.1µs ± 2% 14.7µs ± 1% -2.09% (p=0.000 n=18+18) BM_UFlat/13 [c] 7.38µs ± 0% 7.36µs ± 0% -0.28% (p=0.000 n=16+18) BM_UFlat/14 [lsp] 2.31µs ± 0% 2.37µs ± 0% +2.64% (p=0.000 n=19+18) BM_UFlat/15 [xls] 984µs ± 0% 909µs ± 0% -7.59% (p=0.000 n=19+18) BM_UFlat/16 [xls_200] 215ns ± 0% 217ns ± 0% +0.71% (p=0.000 n=19+15) BM_UFlat/17 [bin] 289µs ± 0% 287µs ± 0% -0.71% (p=0.000 n=19+18) BM_UFlat/18 [bin_200] 161ns ± 0% 116ns ± 0% -28.09% (p=0.000 n=19+16) BM_UFlat/19 [sum] 31.9µs ± 0% 29.2µs ± 0% -8.37% (p=0.000 n=19+18) BM_UFlat/20 [man] 3.13µs ± 1% 3.07µs ± 0% -1.79% (p=0.000 n=19+18) name old allocs/op new allocs/op delta BM_UFlat/0 [html] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/1 [urls] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/2 [jpg] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/3 [jpg_200] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/4 [pdf] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/5 [html4] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/6 [txt1] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/7 [txt2] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) 
BM_UFlat/8 [txt3] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/9 [txt4] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/10 [pb] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/11 [gaviota] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/12 [cp] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/13 [c] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/14 [lsp] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/15 [xls] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/16 [xls_200] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/17 [bin] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/18 [bin_200] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/19 [sum] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) BM_UFlat/20 [man] 0.00 ±NaN% 0.00 ±NaN% ~ (all samples are equal) name old speed new speed delta BM_UFlat/0 [html] 2.11GB/s ± 0% 2.13GB/s ± 0% +0.92% (p=0.000 n=19+18) BM_UFlat/1 [urls] 1.19GB/s ± 0% 1.22GB/s ± 0% +2.22% (p=0.000 n=16+17) BM_UFlat/2 [jpg] 17.3GB/s ± 0% 17.3GB/s ± 0% ~ (p=0.074 n=19+18) BM_UFlat/3 [jpg_200] 1.23GB/s ± 0% 1.33GB/s ± 0% +7.58% (p=0.000 n=19+18) BM_UFlat/4 [pdf] 12.4GB/s ± 0% 12.5GB/s ± 0% +0.74% (p=0.000 n=19+18) BM_UFlat/5 [html4] 1.88GB/s ± 0% 1.88GB/s ± 0% -0.09% (p=0.000 n=18+18) BM_UFlat/6 [txt1] 798MB/s ± 0% 807MB/s ± 0% +1.13% (p=0.000 n=19+18) BM_UFlat/7 [txt2] 743MB/s ± 0% 751MB/s ± 0% +1.02% (p=0.000 n=19+18) BM_UFlat/8 [txt3] 850MB/s ± 0% 855MB/s ± 0% +0.52% (p=0.000 n=19+18) BM_UFlat/9 [txt4] 684MB/s ± 0% 693MB/s ± 0% +1.28% (p=0.000 n=19+18) BM_UFlat/10 [pb] 2.60GB/s ± 0% 2.69GB/s ± 0% +3.25% (p=0.000 n=19+16) BM_UFlat/11 [gaviota] 979MB/s ± 0% 950MB/s ± 0% -2.97% (p=0.000 n=15+18) BM_UFlat/12 [cp] 1.63GB/s ± 2% 1.67GB/s ± 1% +2.13% (p=0.000 n=18+18) BM_UFlat/13 [c] 1.51GB/s ± 0% 1.52GB/s ± 0% +0.29% (p=0.000 n=16+18) BM_UFlat/14 [lsp] 1.61GB/s ± 1% 1.57GB/s ± 0% -2.57% (p=0.000 n=19+18) BM_UFlat/15 [xls] 1.05GB/s ± 0% 1.13GB/s ± 0% 
+8.22% (p=0.000 n=19+18) BM_UFlat/16 [xls_200] 928MB/s ± 0% 921MB/s ± 0% -0.81% (p=0.000 n=19+17) BM_UFlat/17 [bin] 1.78GB/s ± 0% 1.79GB/s ± 0% +0.71% (p=0.000 n=19+18) BM_UFlat/18 [bin_200] 1.24GB/s ± 0% 1.72GB/s ± 0% +38.92% (p=0.000 n=19+18) BM_UFlat/19 [sum] 1.20GB/s ± 0% 1.31GB/s ± 0% +9.15% (p=0.000 n=19+18) BM_UFlat/20 [man] 1.35GB/s ± 1% 1.38GB/s ± 0% +1.84% (p=0.000 n=19+18)	2018-08-04 18:51:07 -07:00
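The slow-path idea in the commit above (repeat a short pattern, and use memset when the pattern is a single byte) can be sketched in portable C++. `ExtendPattern` is a hypothetical helper written for illustration; it is not snappy's IncrementalCopySlow and does not use PSHUFB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Extends a repeating pattern in place. The first `pattern_size` bytes at
// `dst` are already written; the bytes up to `dst + total` are filled by
// repeating them.
inline void ExtendPattern(char* dst, std::size_t pattern_size,
                          std::size_t total) {
  assert(pattern_size >= 1 && pattern_size <= total);
  if (pattern_size == 1) {
    // A one-byte pattern is a run of a single value, so memset suffices.
    std::memset(dst + 1, dst[0], total - 1);
    return;
  }
  // General case: each byte copies the byte one pattern length behind it.
  for (std::size_t i = pattern_size; i < total; ++i) {
    dst[i] = dst[i - pattern_size];
  }
}
```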
costan	4f7bd2dbfd	Update CI configurations. Bump GCC and Clang on Travis and remove Visual Studio 2015 from AppVeyor.	2018-03-09 09:02:34 -08:00
jgorbe	ca37ab7fb9	Ensure DecompressAllTags starts on a 32-byte boundary + 16 bytes. First of all, I'm sorry about this ugly hack. I hope the following long explanation is enough to justify it. We have observed that, in some conditions, the results for dataset number 10 (pb) in the zippy benchmark can show a >20% regression on Skylake CPUs. In order to diagnose this, we profiled the benchmark looking at hot functions (99% of the time is spent on DecompressAllTags), then looked at the generated code to see if there was any difference. In order to rule out a minor difference we observed in register allocation, we replaced zippy.cc with a pre-built assembly file so it was the same in both variants, and we were still able to reproduce the regression. After ruling out a regression caused by the compiler, we dug a bit further and noticed that the alignment of the function in the final binary was different. Both were aligned to a 16-byte boundary, but the slower one was also (by chance) aligned to a 32-byte boundary. A regression caused by alignment differences would explain why I could reproduce it consistently on the same CitC client, but not others: slight differences in the sources can cause the resulting binary to have a different layout. Here are some detailed benchmark results before/after the fix.
Note how fixing the alignment makes the difference between baseline and experiment go away, but regular 32-byte alignment puts both variants in the same ballpark as the original regression:

Original (note BM_UCord_10 and BM_UDataBuffer_10 around the -24% line):

BASELINE
BM_UCord/10        2938  2932  24194  3.767GB/s  pb
BM_UDataBuffer/10  3008  3004  23316  3.677GB/s  pb

EXPERIMENT
BM_UCord/10        3797  3789  18512  2.915GB/s  pb
BM_UDataBuffer/10  4024  4016  17543  2.750GB/s  pb

Aligning DecompressAllTags to a 32-byte boundary:

BASELINE
BM_UCord/10        3872  3862  18035  2.860GB/s  pb
BM_UDataBuffer/10  4010  3998  17591  2.763GB/s  pb

EXPERIMENT
BM_UCord/10        3884  3876  18126  2.850GB/s  pb
BM_UDataBuffer/10  4037  4027  17199  2.743GB/s  pb

Aligning DecompressAllTags to a 32-byte boundary + 16 bytes (this patch):

BASELINE
BM_UCord/10        3103  3095  22642  3.569GB/s  pb
BM_UDataBuffer/10  3186  3177  21947  3.476GB/s  pb

EXPERIMENT
BM_UCord/10        3104  3095  22632  3.569GB/s  pb
BM_UDataBuffer/10  3167  3159  22076  3.496GB/s  pb

This change forces the "good" alignment for DecompressAllTags which, if anything, should make benchmark results more stable (and maybe we'll improve some unlucky application!).	2018-02-17 00:47:18 -08:00
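To make "a 32-byte boundary + 16 bytes" concrete: such an address is a multiple of 16 but not of 32. The helper below is a hypothetical illustration of that arithmetic only; it does not show how the commit forces the alignment.

```cpp
#include <cstdint>

// True when `addr` lies 16 bytes past a 32-byte boundary, i.e. it is aligned
// to 16 bytes but not to 32.
constexpr bool IsAt32Plus16(std::uintptr_t addr) { return addr % 32 == 16; }
```

For example, `0x1010` satisfies the predicate, while `0x1020` (a 32-byte boundary) does not.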
scrubbed	15a2804cd2	Fix an incorrect analysis / comment in the "pattern doubling" code. This should have a minuscule positive effect on performance; the main idea of the CL is just to fix the incorrect comment.	2018-02-17 00:46:31 -08:00
costan	e69d9f8806	Fix Travis CI configuration for OSX.	2018-01-04 15:27:36 -08:00
chandlerc	4aba5426d4	Rework a very hot, very sensitive part of snappy to reduce the number of instructions, the number of dynamic branches, and avoid a particular loop structure that LLVM has a very hard time optimizing for this particular case. The code being changed is part of the hottest path for snappy decompression. In the benchmarks for decompressing protocol buffers, this has proven to be amazingly sensitive to the slightest changes in code layout. For example, previously we added '.p2align 5' assembly directive to the code. This essentially padded the loop out from the function. Merely by doing this we saw significant performance improvements. As a consequence, several of the compiler's typically reasonable optimizations can have surprisingly bad impacts. Loop unrolling is a primary culprit, but in the next LLVM release we are seeing an issue due to loop rotation. While some of the problems caused by the newly triggered loop rotation in LLVM can be mitigated with ongoing work on LLVM's code layout optimizations (specifically, loop header cloning), that is a fairly long term project. And even minor fluctuations in how that subsequent optimization is performed may prevent gaining the performance back. For now, we need some way to unblock the next LLVM release which contains a generic improvement to the LLVM loop optimizer that enables loop rotation in more places, but uncovers this sensitivity and weakness in a particular case. This CL restructures the loop to have a simpler structure. Specifically, we eagerly test what the terminal condition will be and provide two versions of the copy loop that use a single loop predicate. The comments in the source code and benchmarks indicate that only one of these two cases is actually hot: we expect to generally have enough slop in the buffer. That in turn allows us to generate a much simpler branch and loop structure for the hot path (especially for the protocol buffer decompression benchmark).
However, structuring even this simple loop in a way that doesn't trigger some other performance bubble (often a more severe one) is quite challenging. We have to carefully manage the variables used in the loop and the addressing pattern. We should teach LLVM how to do this reliably, but that too is a much more significant undertaking and is extremely rare to have this degree of importance. The desired structure of the loop, as shown with IACA's analysis for the broadwell micro-architecture (HSW and SKX are similar):

| Num Of |                  Ports pressure in cycles                  |    |
|  Uops  |  0 - DV  |  1  |  2 - D  |  3 - D  |  4  |  5  |  6  |  7  |    |
----------------------------------------------------------------------------
|   1    |          |     | 1.0 1.0 |         |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1-0x8]
|   2^   |          |     |         |   0.4   | 1.0 |     |     | 0.6 |    | mov qword ptr [rdi], rcx
|   1    |          |     |         | 1.0 1.0 |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1]
|   2^   |          |     |   0.3   |         | 1.0 |     |     | 0.7 |    | mov qword ptr [rdi+0x8], rcx
|   1    |   0.5    |     |         |         |     | 0.5 |     |     |    | add rdi, 0x10
|   1    |   0.2    |     |         |         |     |     | 0.8 |     |    | cmp rdi, rax
|   0F   |          |     |         |         |     |     |     |     |    | jb 0xffffffffffffffe9

Specifically, the arrangement of addressing modes for the stores such that micro-op fusion (indicated by the `^` on the `2` micro-op count) is important to achieve good throughput for this loop. The other thing necessary to make this change effective is to remove our previous hack using `.p2align 5` to pad out the main decompression loop, and to forcibly disable loop unrolling for critical loops. Because this change simplifies the loop structure, more unrolling opportunities show up. Also, the next LLVM release's generic loop optimization improvements allow unrolling in more places, requiring still more disabling of unrolling in this change. Perhaps most surprising of these is that we must disable loop unrolling in the slow path. While unrolling there seems pointless, it should also be harmless.
This cold code is laid out very far away from all of the hot code. All the samples shown in a profile of the benchmark occur before this loop in the function. And yet, if the loop gets unrolled (which seems to only happen reliably with the next LLVM release) we see a nearly 20% regression in decompressing protocol buffers! With the current release of LLVM, we still observe some regression from this source change, but it is fairly small (5% on decompressing protocol buffers, less elsewhere). And with the next LLVM release it drops to under 1% even in that case. Meanwhile, without this change, the next release of LLVM will regress decompressing protocol buffers by more than 10%.	2018-01-04 15:27:15 -08:00
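The restructuring described in the commit above (test the terminal condition once, then run one of two loops that each have a single predicate, with unrolling disabled) can be sketched as follows. `CopyWithSlop` and its parameters are hypothetical and simplified; this is not the code from snappy.cc, and the 16-byte slop threshold is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies bytes from `src` to `op` until `op` reaches `op_limit`, where `src`
// trails `op` in the same buffer. Requires op - src >= 8 so that each 8-byte
// memcpy has non-overlapping source and destination ranges.
inline char* CopyWithSlop(const char* src, char* op, char* op_limit,
                          char* buf_limit) {
  assert(op - src >= 8 && op <= op_limit && op_limit <= buf_limit);
  if (buf_limit - op_limit >= 16) {
    // Hot path: there is enough slop after op_limit to overshoot by up to
    // 15 bytes, so the loop needs only one predicate.
#if defined(__clang__)
#pragma clang loop unroll(disable)
#endif
    while (op < op_limit) {
      std::memcpy(op, src, 8);
      std::memcpy(op + 8, src + 8, 8);
      src += 16;
      op += 16;
    }
    return op_limit;
  }
  // Cold path: not enough slop, so copy exactly one byte at a time.
  while (op < op_limit) *op++ = *src++;
  return op_limit;
}
```

With a source/destination gap of 8, the hot path repeats an 8-byte pattern; the caller has to tolerate the bytes written between `op_limit` and `op_limit + 15`.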
costan	26102a0c66	Fix generated version number in open source release. Lands GitHub PR #61. The patch was also independently contributed by Martin Gieseking <martin.gieseking@uos.de>.	2017-12-20 14:32:54 -08:00
costan	b02bfa754e	Tag open source release 1.1.7.	2017-08-24 16:54:23 -07:00
wmi	824e6718b5	Add a loop alignment directive to work around a performance regression. We found that the upstream LLVM change rL310792 degraded the zippy benchmark by ~3%. Performance analysis showed the regression was caused by a side effect: the incidental loop alignment change (from 32 bytes to 16 bytes) increased branch mispredictions. The regression was reproducible on several Intel micro-architectures, such as Sandy Bridge, Haswell and Skylake. Sadly, we still don't have a good understanding of the internals of the Intel branch predictor and cannot explain how branch mispredictions increase when the loop alignment changes, so we cannot make a real fix here. The workaround in the patch is to add a directive that aligns the hot loop to 32 bytes, which restores the performance. This is in order to unblock the flip of the default compiler to LLVM.	2017-08-24 16:54:12 -07:00
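A sketch of what a loop alignment directive of this kind can look like in source form. The function `SumBytes` and the exact preprocessor guard are assumptions for illustration; the compiler remains free to place the loop head elsewhere, so the effect has to be confirmed in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sums `n` bytes. On GCC/Clang x86-64 builds, an assembler directive is
// emitted just before the loop.
inline std::uint64_t SumBytes(const unsigned char* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  std::uint64_t sum = 0;
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
  // Pads the instruction stream with no-ops so that the following
  // instruction starts on a 2^5 = 32-byte boundary.
  asm volatile(".p2align 5");
#endif
  for (std::size_t i = 0; i < n; ++i) sum += data[i];
  return sum;
}
```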
costan	55924d1109	Add GNUInstallDirs to CMake configuration. This is modeled after https://github.com/google/googletest/pull/1160. The immediate benefit is fixing the library install paths on 64-bit Linux distributions, which tend to support running 32-bit and 64-bit code side by side by installing 32-bit libraries in /usr/lib and 64-bit libraries in /usr/lib64.	2017-08-16 19:19:31 -07:00
costan	632cd0f128	Use 64-bit optimized code path for ARM64. This is inspired by https://github.com/google/snappy/pull/22. Benchmark results with the change, Pixel C with Android N2G48B Benchmark Time(ns) CPU(ns) Iterations --------------------------------------------------- BM_UFlat/0 119544 119253 1501 818.9MB/s html BM_UFlat/1 1223950 1208588 163 554.0MB/s urls BM_UFlat/2 16081 15962 11527 7.2GB/s jpg BM_UFlat/3 356 352 416666 540.6MB/s jpg_200 BM_UFlat/4 25010 24860 7683 3.8GB/s pdf BM_UFlat/5 484832 481572 407 811.1MB/s html4 BM_UFlat/6 408410 408713 482 354.9MB/s txt1 BM_UFlat/7 361714 361663 553 330.1MB/s txt2 BM_UFlat/8 1090582 1087912 182 374.1MB/s txt3 BM_UFlat/9 1503127 1503759 133 305.6MB/s txt4 BM_UFlat/10 114183 114285 1715 989.6MB/s pb BM_UFlat/11 406714 407331 491 431.5MB/s gaviota BM_UIOVec/0 370397 369888 538 264.0MB/s html BM_UIOVec/1 3207510 3190000 100 209.9MB/s urls BM_UIOVec/2 16589 16573 11223 6.9GB/s jpg BM_UIOVec/3 1052 1052 165289 181.2MB/s jpg_200 BM_UIOVec/4 49151 49184 3985 1.9GB/s pdf BM_UValidate/0 68115 68095 2893 1.4GB/s html BM_UValidate/1 792652 792000 250 845.4MB/s urls BM_UValidate/2 334 334 487804 343.1GB/s jpg BM_UValidate/3 235 235 666666 809.9MB/s jpg_200 BM_UValidate/4 6126 6130 32626 15.6GB/s pdf BM_ZFlat/0 292697 290560 678 336.1MB/s html (22.31 %) BM_ZFlat/1 4062080 4050000 100 165.3MB/s urls (47.78 %) BM_ZFlat/2 29225 29274 6422 3.9GB/s jpg (99.95 %) BM_ZFlat/3 1099 1098 163934 173.7MB/s jpg_200 (73.00 %) BM_ZFlat/4 44117 44233 4205 2.2GB/s pdf (83.30 %) BM_ZFlat/5 1158058 1157894 171 337.4MB/s html4 (22.52 %) BM_ZFlat/6 1102983 1093922 181 132.6MB/s txt1 (57.88 %) BM_ZFlat/7 974142 975490 204 122.4MB/s txt2 (61.91 %) BM_ZFlat/8 2984670 2990000 100 136.1MB/s txt3 (54.99 %) BM_ZFlat/9 4100130 4090000 100 112.4MB/s txt4 (66.26 %) BM_ZFlat/10 276236 275139 716 411.0MB/s pb (19.68 %) BM_ZFlat/11 760091 759541 262 231.4MB/s gaviota (37.72 %) Baseline benchmark results, Pixel C with Android N2G48B Benchmark Time(ns) CPU(ns) 
Iterations --------------------------------------------------- BM_UFlat/0 148957 147565 1335 661.8MB/s html BM_UFlat/1 1527257 1500000 132 446.4MB/s urls BM_UFlat/2 19589 19397 8764 5.9GB/s jpg BM_UFlat/3 425 418 408163 455.3MB/s jpg_200 BM_UFlat/4 30096 29552 6497 3.2GB/s pdf BM_UFlat/5 595933 594594 333 657.0MB/s html4 BM_UFlat/6 516315 514360 383 282.0MB/s txt1 BM_UFlat/7 454653 453514 441 263.2MB/s txt2 BM_UFlat/8 1382687 1361111 144 299.0MB/s txt3 BM_UFlat/9 1967590 1904761 105 241.3MB/s txt4 BM_UFlat/10 148271 144560 1342 782.3MB/s pb BM_UFlat/11 523997 510471 382 344.4MB/s gaviota BM_UIOVec/0 478443 465227 417 209.9MB/s html BM_UIOVec/1 4172860 4060000 100 164.9MB/s urls BM_UIOVec/2 21470 20975 7342 5.5GB/s jpg BM_UIOVec/3 1357 1330 75187 143.4MB/s jpg_200 BM_UIOVec/4 63143 61365 3031 1.6GB/s pdf BM_UValidate/0 86910 85125 2279 1.1GB/s html BM_UValidate/1 1022256 1000000 195 669.6MB/s urls BM_UValidate/2 420 417 400000 274.6GB/s jpg BM_UValidate/3 311 302 571428 630.0MB/s jpg_200 BM_UValidate/4 7778 7584 25445 12.6GB/s pdf BM_ZFlat/0 469209 457547 424 213.4MB/s html (22.31 %) BM_ZFlat/1 5633510 5460000 100 122.6MB/s urls (47.78 %) BM_ZFlat/2 37896 36693 4524 3.1GB/s jpg (99.95 %) BM_ZFlat/3 1485 1441 123456 132.3MB/s jpg_200 (73.00 %) BM_ZFlat/4 74870 72775 2652 1.3GB/s pdf (83.30 %) BM_ZFlat/5 1857321 1785714 112 218.8MB/s html4 (22.52 %) BM_ZFlat/6 1538723 1492307 130 97.2MB/s txt1 (57.88 %) BM_ZFlat/7 1338236 1310810 148 91.1MB/s txt2 (61.91 %) BM_ZFlat/8 4050820 4040000 100 100.7MB/s txt3 (54.99 %) BM_ZFlat/9 5234940 5230000 100 87.9MB/s txt4 (66.26 %) BM_ZFlat/10 400309 400000 495 282.7MB/s pb (19.68 %) BM_ZFlat/11 1063042 1058510 188 166.1MB/s gaviota (37.72 %)	2017-08-16 19:18:22 -07:00
costan	77c12adc19	Add unistd.h checks back to the CMake build. getpagesize(), as well as its POSIX.2001 replacement sysconf(_SC_PAGESIZE), is defined in <unistd.h>. On Linux and OS X, including <sys/mman.h> is sufficient to get a definition for getpagesize(). However, this is not true for the Android NDK. This CL brings back the HAVE_UNISTD_H definition and its associated header check. This also adds a HAVE_FUNC_SYSCONF definition, which checks for the presence of sysconf(). The definition can be used later to replace getpagesize() with sysconf().	2017-08-02 10:56:06 -07:00
costan	c8049c5827	Replace getpagesize() with sysconf(_SC_PAGESIZE). getpagesize() has been removed from POSIX.1-2001. Its recommended replacement is sysconf(_SC_PAGESIZE).	2017-08-01 14:38:57 -07:00
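A minimal POSIX-only sketch of the replacement call named in the commit above. The wrapper name `PageSize` is hypothetical.

```cpp
#include <cassert>
#include <unistd.h>

// Returns the system page size in bytes using sysconf(_SC_PAGESIZE), the
// documented replacement for getpagesize().
inline long PageSize() {
  const long size = sysconf(_SC_PAGESIZE);
  assert(size > 0);  // sysconf returns -1 on error.
  return size;
}
```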
costan	18e2f220d8	Add guidelines for opensource contributions. The guidelines follow the instructions at https://opensource.google.com/docs/releasing/preparing/#CONTRIBUTING	2017-08-01 14:38:24 -07:00
costan	f0d3237c32	Use _BitScanForward and _BitScanReverse on MSVC. Based on https://github.com/google/snappy/pull/30	2017-08-01 14:38:02 -07:00
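A sketch of how bit-scan helpers can select `_BitScanForward` / `_BitScanReverse` on MSVC. The function names are hypothetical, and the non-MSVC branch assumes the GCC/Clang builtins `__builtin_ctz` and `__builtin_clz`; the actual snappy helpers may be organized differently.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index (0-based) of the lowest set bit of a non-zero 32-bit value.
inline int FindLowestSetBit(std::uint32_t n) {
  assert(n != 0);
#if defined(_MSC_VER)
  unsigned long index;
  _BitScanForward(&index, n);
  return static_cast<int>(index);
#else
  return __builtin_ctz(n);
#endif
}

// Index (0-based) of the highest set bit of a non-zero 32-bit value.
inline int FindHighestSetBit(std::uint32_t n) {
  assert(n != 0);
#if defined(_MSC_VER)
  unsigned long index;
  _BitScanReverse(&index, n);
  return static_cast<int>(index);
#else
  return 31 - __builtin_clz(n);
#endif
}
```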
jueminyang	71b8f86887	Add SNAPPY_ prefix to PREDICT_{TRUE,FALSE} macros.	2017-08-01 14:36:26 -07:00
costan	be6dc3db83	Redo CMake configuration. The style was changed to match the official manual [1], the install configuration was simplified and now matches the official packaging guide [2], and the config files use the CMake-specific variable syntax ${VAR} instead of the autoconf-compatible syntax @VAR@, as documented in [3]. The public header files are declared as such (for CMake 3.3+), and the generated headers are included in the library target definition. The tests are only built if SNAPPY_BUILD_TESTS (default ON) is true, so zippy can be easily used in projects that add_subdirectory() its source code directly, instead of using find_package(). [1] https://cmake.org/cmake/help/git-master/manual/cmake-language.7.html [2] https://cmake.org/cmake/help/git-master/manual/cmake-packages.7.html [3] https://cmake.org/cmake/help/git-master/command/configure_file.html	2017-07-28 10:14:21 -07:00
costan	e4de6ce087	Small improvements to open source CI configuration. This CL fixes 64-bit Windows testing (), makes it possible to view the test output in the Travis / AppVeyor CI console while the test is running, and takes advantage of the new support for the .appveyor.yml file name to make the CI configuration less obtrusive.	2017-07-27 16:46:54 -07:00
costan	c756f7f5d9	Support both static and shared library CMake builds. This can be used to fix https://github.com/Homebrew/homebrew-core/issues/15722.	2017-07-27 16:46:54 -07:00
costan	038a3329b1	Inline DISALLOW_COPY_AND_ASSIGN. snappy-stubs-public.h defined the DISALLOW_COPY_AND_ASSIGN macro, so the definition propagated to all translation units that included the open source headers. The macro is now inlined, thus avoiding polluting the macro environment of snappy users.	2017-07-27 16:46:42 -07:00
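A sketch of what writing the copy restrictions out by hand, instead of through a macro in a public header, can look like. `ExampleSink` is a hypothetical class, and the C++11 `= delete` form is an assumption of this sketch; the headers may spell the declarations differently.

```cpp
#include <type_traits>

// The copy constructor and copy assignment operator are declared directly in
// the class, so no helper macro leaks into code that includes the header.
class ExampleSink {
 public:
  ExampleSink() = default;
  ExampleSink(const ExampleSink&) = delete;
  ExampleSink& operator=(const ExampleSink&) = delete;
};

static_assert(!std::is_copy_constructible<ExampleSink>::value,
              "copy construction is disabled");
static_assert(!std::is_copy_assignable<ExampleSink>::value,
              "copy assignment is disabled");
```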
costan	a8b239c3de	snappy: Remove autoconf build configuration.	2017-07-25 18:20:38 -07:00
costan	27671c6aec	Clean up CMake header and type checks. Unused macros: HAVE_DLFCN_H, HAVE_INTTYPES_H, HAVE_MEMORY_H, HAVE_STDLIB_H, HAVE_STRINGS_H, HAVE_STRING_H, HAVE_SYS_BYTESWAP_H, HAVE_SYS_STAT_H, HAVE_SYS_TYPES_H, HAVE_UNISTD_H. Used but never set macros: HAVE_LIBLZF, HAVE_LIBQUICKLZ. These only gate conditional includes. The code that takes advantage of them was removed. Unused types: ssize_t. The testing code uses HAVE_FUNC_MMAP, which was not wired in the CMake build, causing a whole test to be skipped.	2017-07-25 18:17:35 -07:00
costan	548501c988	zippy: Re-release snappy 1.1.5 as 1.1.6. The migration from autotools to CMake in 1.1.5 wasn't as smooth as intended. The SONAME / SOVERSION were broken in both build systems, causing breakages in systems that upgraded from snappy 1.1.4 to 1.1.5, as reported in https://github.com/Homebrew/homebrew-core/issues/15274 and https://github.com/google/snappy/pull/45.	2017-07-13 03:56:49 -07:00
costan	513df5fb5a	Tag open source release 1.1.5.	2017-06-28 18:37:30 -07:00
costan	5bc9c82ae3	Set minimum CMake version to 3.1. The project only needs CMake 3.1 features, and some Travis CI bots have CMake 3.2.2. Therefore, requiring CMake 3.4 is inconvenient.	2017-06-28 18:37:08 -07:00
costan	e9720a001d	Update Travis CI config, add AppVeyor for Windows CI coverage.	2017-06-28 18:36:37 -07:00
tmsriram	f24f9d2d97	Explicitly copy internal::wordmask to the stack array to work around a compiler optimization with LLVM that converts const stack arrays to global arrays. This is a temporary change and should be reverted when https://reviews.llvm.org/D30759 is fixed. With PIE, accessing stack arrays is more efficient than global arrays and wordmask was moved to the stack due to that. However, the LLVM compiler automatically converts stack arrays, detected as constant, to global arrays and this transformation hurts PIE performance with LLVM. We are working to fix this in the LLVM compiler, via https://reviews.llvm.org/D30759, to not do this conversion in PIE mode. Until this patch is finished, please consider this source change as a temporary work around to keep this array on the stack. This source change is important to allow some projects to flip the default compiler from GCC to LLVM for optimized builds. This change works for the following reason. The LLVM compiler does not convert non-const stack arrays to global arrays and explicitly copying the elements is enough to make the compiler assume that this is a non-const array. With GCC, this change does not affect code-gen in any significant way. The array initialization code is slightly different as it copies the constants directly to the stack. With LLVM, this keeps the array on the stack. No change in performance with GCC (within noise range). With LLVM, ~0.7% improvement in optimized mode (no FDO) and ~1.75% improvement in FDO mode.	2017-06-28 18:34:54 -07:00
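The workaround described in the commit above, copying a constant table element by element into a non-const local array, can be sketched like this. The namespace, table, and function names are hypothetical, and whether a given compiler actually keeps the array on the stack has to be checked in its output.

```cpp
#include <cassert>
#include <cstdint>

namespace example_internal {
// Masks that keep the low 0..4 bytes of a 32-bit word.
static const std::uint32_t kWordmask[] = {0u, 0xffu, 0xffffu, 0xffffffu,
                                          0xffffffffu};
}  // namespace example_internal

// Keeps the low `num_bytes` bytes of `value`.
inline std::uint32_t KeepLowBytes(std::uint32_t value, int num_bytes) {
  assert(num_bytes >= 0 && num_bytes <= 4);
  // Non-const local array filled by explicit copies, so the compiler does not
  // see it as a constant array.
  std::uint32_t wordmask[5];
  wordmask[0] = example_internal::kWordmask[0];
  wordmask[1] = example_internal::kWordmask[1];
  wordmask[2] = example_internal::kWordmask[2];
  wordmask[3] = example_internal::kWordmask[3];
  wordmask[4] = example_internal::kWordmask[4];
  return value & wordmask[num_bytes];
}
```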
ysaed	82deffcde7	Remove benchmarking support for fastlz.	2017-06-28 18:33:55 -07:00
alkis	18488d6212	Use 64 bit little endian on ppc64le. This has tangible performance benefits. This lands https://github.com/google/snappy/pull/27	2017-06-28 18:33:13 -07:00
alkis	7b9532b878	Improve the SSE2 macro check on Windows. This lands https://github.com/google/snappy/pull/37	2017-06-05 13:54:17 -07:00
alkis	7dadceea52	Check for the existence of sys/uio.h in autoconf build. This lands https://github.com/google/snappy/pull/32	2017-06-05 13:54:17 -07:00
jyrki	83179dd8be	Remove quicklz and lzf support in benchmarks.	2017-06-05 13:54:10 -07:00
vrabaud	c8131680d0	Provide a CMakeLists.txt. This lands https://github.com/google/snappy/pull/29	2017-06-05 13:53:29 -07:00
costan	ed3b7b242b	Clean up unused function warnings in snappy.	2017-03-17 13:59:03 -07:00
costan	8b60aac4fd	Remove "using namespace std;" from zippy-stubs-internal.h. This makes it easier to build zippy, as some compilers require a warning suppression to accept "using namespace std".	2017-03-13 13:03:01 -07:00
costan	7d7a8ec805	Add Travis CI configuration to snappy and fix the make build. The make build in the open source version uses autoconf, which is set up to expect a project that follows the GNU standard.	2017-03-10 12:40:15 -08:00
alkis	1cd3ab02e9	Rename README to README.md. It is already in Markdown, so we might as well let GitHub know so that it renders nicely.	2017-03-08 12:05:05 -08:00
alkis	597fa795de	Delete UnalignedCopy64 from snappy-stubs since the version in snappy.cc is more robust and possibly faster (assuming the compiler knows how to best copy 8 bytes between locations in memory the fastest way possible - a rather safe bet).	2017-03-08 11:42:30 -08:00
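One way to keep an 8-byte copy well-defined even when source and destination overlap is to go through a temporary buffer with memcpy, which compilers typically reduce to a single load and store. This is a sketch under that assumption; the name `UnalignedCopy64Sketch` is hypothetical and the version in snappy.cc may differ.

```cpp
#include <cassert>
#include <cstring>

// Copies 8 bytes from `src` to `dst`. Each memcpy has non-overlapping
// arguments because the data passes through `tmp`, so overlapping `src` and
// `dst` ranges are fine.
inline void UnalignedCopy64Sketch(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
  char tmp[8];
  std::memcpy(tmp, src, 8);
  std::memcpy(dst, tmp, 8);
}
```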
scrubbed	039b3a7ace	Add std:: prefix to STL non-type names. In order to disable global using declarations, this CL qualifies stl names with the std namespace.	2017-03-08 11:42:30 -08:00
alkis	3c706d2230	Make UnalignedCopy64 not exhibit undefined behavior when src and dst overlap.
name  old speed  new speed  delta
BM_UFlat/0  3.09GB/s ± 3%  3.07GB/s ± 2%  -0.78%  (p=0.009 n=19+19)
BM_UFlat/1  1.63GB/s ± 2%  1.62GB/s ± 2%  ~  (p=0.099 n=19+20)
BM_UFlat/2  19.7GB/s ±19%  20.7GB/s ±11%  ~  (p=0.054 n=20+19)
BM_UFlat/3  1.61GB/s ± 2%  1.60GB/s ± 1%  -0.48%  (p=0.049 n=20+17)
BM_UFlat/4  15.8GB/s ± 7%  15.6GB/s ±10%  ~  (p=0.234 n=20+20)
BM_UFlat/5  2.47GB/s ± 1%  2.46GB/s ± 2%  ~  (p=0.608 n=19+19)
BM_UFlat/6  1.07GB/s ± 2%  1.07GB/s ± 1%  ~  (p=0.128 n=20+19)
BM_UFlat/7  1.01GB/s ± 1%  1.00GB/s ± 2%  ~  (p=0.656 n=15+19)
BM_UFlat/8  1.13GB/s ± 1%  1.13GB/s ± 1%  ~  (p=0.532 n=18+19)
BM_UFlat/9  918MB/s ± 1%  916MB/s ± 1%  ~  (p=0.443 n=19+18)
BM_UFlat/10  3.90GB/s ± 1%  3.90GB/s ± 1%  ~  (p=0.895 n=20+19)
BM_UFlat/11  1.30GB/s ± 1%  1.29GB/s ± 2%  ~  (p=0.156 n=19+19)
BM_UFlat/12  2.35GB/s ± 2%  2.34GB/s ± 1%  ~  (p=0.349 n=19+17)
BM_UFlat/13  2.07GB/s ± 1%  2.06GB/s ± 2%  ~  (p=0.475 n=18+19)
BM_UFlat/14  2.23GB/s ± 1%  2.23GB/s ± 1%  ~  (p=0.983 n=19+19)
BM_UFlat/15  1.55GB/s ± 1%  1.55GB/s ± 1%  ~  (p=0.314 n=19+19)
BM_UFlat/16  1.26GB/s ± 1%  1.26GB/s ± 1%  ~  (p=0.907 n=15+18)
BM_UFlat/17  2.32GB/s ± 1%  2.32GB/s ± 1%  ~  (p=0.604 n=18+19)
BM_UFlat/18  1.61GB/s ± 1%  1.61GB/s ± 1%  ~  (p=0.212 n=18+19)
BM_UFlat/19  1.78GB/s ± 1%  1.78GB/s ± 2%  ~  (p=0.350 n=19+19)
BM_UFlat/20  1.89GB/s ± 1%  1.90GB/s ± 2%  ~  (p=0.092 n=19+19)
Also tested the current version against UNALIGNED_STORE64(dst, UNALIGNED_LOAD64(src)), there is no difference (old is memcpy, new is UNALIGNED*):
name  old speed  new speed  delta
BM_UFlat/0  3.14GB/s ± 1%  3.16GB/s ± 2%  ~  (p=0.156 n=19+19)
BM_UFlat/1  1.62GB/s ± 1%  1.61GB/s ± 2%  ~  (p=0.102 n=19+20)
BM_UFlat/2  18.8GB/s ±17%  19.1GB/s ±11%  ~  (p=0.390 n=20+16)
BM_UFlat/3  1.59GB/s ± 1%  1.58GB/s ± 1%  -1.06%  (p=0.000 n=18+18)
BM_UFlat/4  15.8GB/s ± 6%  15.6GB/s ± 7%  ~  (p=0.184 n=19+20)
BM_UFlat/5  2.46GB/s ± 1%  2.44GB/s ± 1%  -0.95%  (p=0.000 n=19+18)
BM_UFlat/6  1.08GB/s ± 1%  1.06GB/s ± 1%  -1.17%  (p=0.000 n=19+18)
BM_UFlat/7  1.00GB/s ± 1%  0.99GB/s ± 1%  -1.16%  (p=0.000 n=19+18)
BM_UFlat/8  1.14GB/s ± 2%  1.12GB/s ± 1%  -1.12%  (p=0.000 n=19+18)
BM_UFlat/9  921MB/s ± 1%  914MB/s ± 1%  -0.84%  (p=0.000 n=20+17)
BM_UFlat/10  3.94GB/s ± 2%  3.92GB/s ± 1%  ~  (p=0.058 n=19+17)
BM_UFlat/11  1.29GB/s ± 1%  1.28GB/s ± 1%  -0.77%  (p=0.001 n=19+17)
BM_UFlat/12  2.34GB/s ± 1%  2.31GB/s ± 1%  -1.10%  (p=0.000 n=18+18)
BM_UFlat/13  2.06GB/s ± 1%  2.05GB/s ± 1%  -0.73%  (p=0.001 n=19+18)
BM_UFlat/14  2.22GB/s ± 1%  2.20GB/s ± 1%  -0.73%  (p=0.000 n=18+18)
BM_UFlat/15  1.55GB/s ± 1%  1.53GB/s ± 1%  -1.07%  (p=0.000 n=19+18)
BM_UFlat/16  1.26GB/s ± 1%  1.25GB/s ± 1%  -0.79%  (p=0.000 n=18+18)
BM_UFlat/17  2.31GB/s ± 1%  2.29GB/s ± 1%  -0.98%  (p=0.000 n=20+18)
BM_UFlat/18  1.61GB/s ± 1%  1.60GB/s ± 2%  -0.71%  (p=0.001 n=20+19)
BM_UFlat/19  1.77GB/s ± 1%  1.76GB/s ± 1%  -0.61%  (p=0.007 n=19+18)
BM_UFlat/20  1.89GB/s ± 1%  1.88GB/s ± 1%  -0.75%  (p=0.000 n=20+18)	2017-03-08 11:42:30 -08:00
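One way an 8-byte copy can avoid undefined behavior when the source and destination overlap is to stage the bytes in a temporary buffer, since calling `memcpy` directly on overlapping ranges is undefined. The sketch below illustrates that idea only. The name `UnalignedCopy64Sketch` is hypothetical, and the code is not claimed to match the change that was committed.

```cpp
#include <cassert>
#include <cstring>

// Copies 8 bytes from src to dst. The two 8-byte ranges may overlap.
inline void UnalignedCopy64Sketch(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
  char tmp[8];
  // Each memcpy call has non-overlapping operands, because tmp is a
  // distinct local buffer. Compilers typically lower this to one 8-byte
  // load and one 8-byte store, but that is not guaranteed.
  std::memcpy(tmp, src, sizeof(tmp));
  std::memcpy(dst, tmp, sizeof(tmp));
}
```

With a buffer holding "abcdefghijklmno", copying from offset 0 to offset 4 leaves "abcdabcdefghmno": the 8 source bytes are read in full before any byte of the destination is written.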
skanev	d3c6d20d0a	Add compression size reporting hooks. Also, force inlining util::compression::Sample(). The inlining change is necessary. Without it even with FDO+LIPO the call doesn't get inlined and uses 4 registers to construct parameters (which won't be used in the common case). In some of the more compute-bound tests that causes extra spills and significant overhead (even if call is sufficiently long). For example,
with inlining:
BM_UFlat/0  32.7µs ± 1%  33.1µs ± 1%  +1.41%
without:
BM_UFlat/0  32.7µs ± 1%  37.7µs ± 1%  +15.29%	2017-03-08 11:42:21 -08:00
alkis	626e1b9faa	Use #ifdef __SSE2__ for the emmintrin.h include, otherwise snappy.cc does not compile with -march=prescott.	2017-03-07 18:09:49 -08:00
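A guarded include of this kind might look like the sketch below, which also gates the code that uses the intrinsics so that builds without SSE2 still compile. The function name `UnalignedCopy128Sketch` and the `memcpy` fallback are assumptions made for illustration. Only the `__SSE2__` check and the `emmintrin.h` header come from the commit message.

```cpp
#include <cassert>
#include <cstring>

// Include the SSE2 intrinsics header only when the compiler targets SSE2.
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// Copies 16 bytes from src to dst without alignment requirements.
inline void UnalignedCopy128Sketch(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
#ifdef __SSE2__
  // One unaligned 128-bit load followed by one unaligned 128-bit store.
  __m128i x = _mm_loadu_si128(static_cast<const __m128i*>(src));
  _mm_storeu_si128(static_cast<__m128i*>(dst), x);
#else
  // Portable fallback for targets without SSE2.
  char tmp[16];
  std::memcpy(tmp, src, sizeof(tmp));
  std::memcpy(dst, tmp, sizeof(tmp));
#endif
}
```

Both branches read all 16 source bytes before writing, so the result is the same whether or not SSE2 is available.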
Alkis Evlogimenos	2d99bd14d4	1.1.4 release.	2017-01-27 09:12:04 +01:00
Alkis Evlogimenos	8bfb028b61	Improve zippy decompression speed. The CL contains the following optimizations:
1) rewrite IncrementalCopy routine: single routine that splits the code into sections based on typical probabilities observed across a variety of inputs and helps reduce branch mispredictions both for FDO and non-FDO builds. IncrementalCopy is an adaptive routine that selects the best strategy based on input.
2) introduce UnalignedCopy128 that copies 128 bits per cycle using SSE2.
3) add branch hint for the main decoding loop. The non-literal case is taken more often in benchmarks. I expect this to be a noop in production with FDO. Note that this became apparent after step 1 above.
4) use the new IncrementalCopy in ZippyScatteredWriter.
I test two archs: x86_haswell and ppc_power8. For x86_haswell I use FDO. For ppc_power8 I do not use FDO.
x86_haswell + FDO
name  old speed  new speed  delta
BM_UCord/0  1.97GB/s ± 1%  3.19GB/s ± 1%  +62.08%  (p=0.000 n=19+18)
BM_UCord/1  1.28GB/s ± 1%  1.51GB/s ± 1%  +18.14%  (p=0.000 n=19+18)
BM_UCord/2  15.6GB/s ± 9%  15.5GB/s ± 7%  ~  (p=0.620 n=20+20)
BM_UCord/3  811MB/s ± 1%  808MB/s ± 1%  -0.38%  (p=0.009 n=17+18)
BM_UCord/4  12.4GB/s ± 4%  12.7GB/s ± 8%  +2.70%  (p=0.002 n=17+20)
BM_UCord/5  1.77GB/s ± 0%  2.33GB/s ± 1%  +31.37%  (p=0.000 n=18+18)
BM_UCord/6  900MB/s ± 1%  1006MB/s ± 1%  +11.71%  (p=0.000 n=18+17)
BM_UCord/7  858MB/s ± 1%  938MB/s ± 2%  +9.36%  (p=0.000 n=19+16)
BM_UCord/8  921MB/s ± 1%  985MB/s ±21%  +6.94%  (p=0.028 n=19+20)
BM_UCord/9  824MB/s ± 1%  800MB/s ±20%  ~  (p=0.113 n=19+20)
BM_UCord/10  2.60GB/s ± 1%  3.67GB/s ±21%  +41.31%  (p=0.000 n=19+20)
BM_UCord/11  1.07GB/s ± 1%  1.21GB/s ± 1%  +13.17%  (p=0.000 n=16+16)
BM_UCord/12  1.84GB/s ± 8%  2.18GB/s ± 1%  +18.44%  (p=0.000 n=16+19)
BM_UCord/13  1.83GB/s ±18%  1.89GB/s ± 1%  +3.14%  (p=0.000 n=17+19)
BM_UCord/14  1.96GB/s ± 2%  1.97GB/s ± 1%  +0.55%  (p=0.000 n=16+17)
BM_UCord/15  1.30GB/s ±20%  1.43GB/s ± 1%  +9.85%  (p=0.000 n=20+20)
BM_UCord/16  658MB/s ±20%  705MB/s ± 1%  +7.22%  (p=0.000 n=20+19)
BM_UCord/17  1.96GB/s ± 2%  2.15GB/s ± 1%  +9.73%  (p=0.000 n=16+19)
BM_UCord/18  555MB/s ± 1%  833MB/s ± 1%  +50.11%  (p=0.000 n=18+19)
BM_UCord/19  1.57GB/s ± 1%  1.75GB/s ± 1%  +11.34%  (p=0.000 n=20+20)
BM_UCord/20  1.72GB/s ± 2%  1.70GB/s ± 2%  -1.01%  (p=0.001 n=20+20)
BM_UCordStringSink/0  2.88GB/s ± 1%  3.15GB/s ± 1%  +9.56%  (p=0.000 n=17+20)
BM_UCordStringSink/1  1.50GB/s ± 1%  1.52GB/s ± 1%  +1.96%  (p=0.000 n=19+20)
BM_UCordStringSink/2  14.5GB/s ±10%  14.6GB/s ±10%  ~  (p=0.542 n=20+20)
BM_UCordStringSink/3  1.06GB/s ± 1%  1.08GB/s ± 1%  +1.77%  (p=0.000 n=18+20)
BM_UCordStringSink/4  12.6GB/s ± 7%  13.2GB/s ± 4%  +4.63%  (p=0.000 n=20+20)
BM_UCordStringSink/5  2.29GB/s ± 1%  2.36GB/s ± 1%  +3.05%  (p=0.000 n=19+20)
BM_UCordStringSink/6  1.01GB/s ± 2%  1.01GB/s ± 0%  ~  (p=0.055 n=20+18)
BM_UCordStringSink/7  945MB/s ± 1%  939MB/s ± 1%  -0.60%  (p=0.000 n=19+20)
BM_UCordStringSink/8  1.06GB/s ± 1%  1.07GB/s ± 1%  +0.62%  (p=0.000 n=18+20)
BM_UCordStringSink/9  866MB/s ± 1%  864MB/s ± 1%  ~  (p=0.107 n=19+20)
BM_UCordStringSink/10  3.64GB/s ± 2%  3.98GB/s ± 1%  +9.32%  (p=0.000 n=19+20)
BM_UCordStringSink/11  1.22GB/s ± 1%  1.22GB/s ± 1%  +0.61%  (p=0.001 n=19+20)
BM_UCordStringSink/12  2.23GB/s ± 1%  2.23GB/s ± 1%  ~  (p=0.692 n=19+20)
BM_UCordStringSink/13  1.96GB/s ± 1%  1.94GB/s ± 1%  -0.82%  (p=0.000 n=17+18)
BM_UCordStringSink/14  2.09GB/s ± 2%  2.08GB/s ± 1%  ~  (p=0.147 n=20+18)
BM_UCordStringSink/15  1.47GB/s ± 1%  1.45GB/s ± 1%  -0.88%  (p=0.000 n=20+19)
BM_UCordStringSink/16  908MB/s ± 1%  917MB/s ± 1%  +0.97%  (p=0.000 n=19+19)
BM_UCordStringSink/17  2.11GB/s ± 1%  2.20GB/s ± 1%  +4.35%  (p=0.000 n=18+20)
BM_UCordStringSink/18  804MB/s ± 2%  1106MB/s ± 1%  +37.52%  (p=0.000 n=20+20)
BM_UCordStringSink/19  1.67GB/s ± 1%  1.72GB/s ± 0%  +2.81%  (p=0.000 n=18+20)
BM_UCordStringSink/20  1.77GB/s ± 3%  1.77GB/s ± 3%  ~  (p=0.815 n=20+20)
ppc_power8
name  old speed  new speed  delta
BM_UCord/0  918MB/s ± 6%  1262MB/s ± 0%  +37.56%  (p=0.000 n=17+16)
BM_UCord/1  671MB/s ±13%  879MB/s ± 2%  +30.99%  (p=0.000 n=18+16)
BM_UCord/2  12.6GB/s ± 8%  12.6GB/s ± 5%  ~  (p=0.452 n=17+19)
BM_UCord/3  285MB/s ±10%  284MB/s ± 4%  -0.50%  (p=0.021 n=19+17)
BM_UCord/4  5.21GB/s ±12%  6.59GB/s ± 1%  +26.37%  (p=0.000 n=17+16)
BM_UCord/5  913MB/s ± 4%  1253MB/s ± 1%  +37.27%  (p=0.000 n=16+17)
BM_UCord/6  461MB/s ±13%  547MB/s ± 1%  +18.67%  (p=0.000 n=18+16)
BM_UCord/7  455MB/s ± 2%  524MB/s ± 3%  +15.28%  (p=0.000 n=16+18)
BM_UCord/8  489MB/s ± 2%  584MB/s ± 2%  +19.47%  (p=0.000 n=17+17)
BM_UCord/9  410MB/s ±33%  490MB/s ± 1%  +19.64%  (p=0.000 n=17+18)
BM_UCord/10  1.10GB/s ± 3%  1.55GB/s ± 2%  +41.21%  (p=0.000 n=16+16)
BM_UCord/11  494MB/s ± 1%  558MB/s ± 1%  +12.92%  (p=0.000 n=17+18)
BM_UCord/12  608MB/s ± 3%  793MB/s ± 1%  +30.45%  (p=0.000 n=17+16)
BM_UCord/13  545MB/s ±18%  721MB/s ± 2%  +32.22%  (p=0.000 n=19+17)
BM_UCord/14  594MB/s ± 4%  748MB/s ± 3%  +25.99%  (p=0.000 n=17+17)
BM_UCord/15  628MB/s ± 1%  822MB/s ± 3%  +30.94%  (p=0.000 n=18+16)
BM_UCord/16  277MB/s ± 2%  280MB/s ±15%  +0.86%  (p=0.001 n=17+17)
BM_UCord/17  864MB/s ± 1%  1001MB/s ± 3%  +15.96%  (p=0.000 n=17+17)
BM_UCord/18  121MB/s ± 2%  284MB/s ± 4%  +134.08%  (p=0.000 n=17+18)
BM_UCord/19  594MB/s ± 0%  713MB/s ± 2%  +19.93%  (p=0.000 n=16+17)
BM_UCord/20  553MB/s ±10%  662MB/s ± 5%  +19.74%  (p=0.000 n=16+18)
BM_UCordStringSink/0  1.37GB/s ± 4%  1.48GB/s ± 2%  +8.51%  (p=0.000 n=16+16)
BM_UCordStringSink/1  969MB/s ± 1%  990MB/s ± 1%  +2.16%  (p=0.000 n=16+18)
BM_UCordStringSink/2  13.1GB/s ±11%  13.0GB/s ±14%  ~  (p=0.858 n=17+18)
BM_UCordStringSink/3  411MB/s ± 1%  415MB/s ± 1%  +0.93%  (p=0.000 n=16+17)
BM_UCordStringSink/4  6.81GB/s ± 8%  7.29GB/s ± 5%  +7.12%  (p=0.000 n=16+19)
BM_UCordStringSink/5  1.35GB/s ± 5%  1.45GB/s ±13%  +8.00%  (p=0.000 n=16+17)
BM_UCordStringSink/6  653MB/s ± 8%  653MB/s ± 3%  -0.12%  (p=0.007 n=17+19)
BM_UCordStringSink/7  618MB/s ±13%  597MB/s ±18%  -3.45%  (p=0.001 n=18+18)
BM_UCordStringSink/8  702MB/s ± 5%  702MB/s ± 1%  -0.10%  (p=0.012 n=17+16)
BM_UCordStringSink/9  590MB/s ± 2%  564MB/s ±13%  -4.46%  (p=0.000 n=16+17)
BM_UCordStringSink/10  1.63GB/s ± 2%  1.76GB/s ± 4%  +8.28%  (p=0.000 n=17+16)
BM_UCordStringSink/11  630MB/s ±14%  684MB/s ±15%  +8.51%  (p=0.000 n=19+17)
BM_UCordStringSink/12  858MB/s ±12%  903MB/s ± 9%  +5.17%  (p=0.000 n=19+17)
BM_UCordStringSink/13  806MB/s ±22%  879MB/s ± 1%  +8.98%  (p=0.000 n=19+19)
BM_UCordStringSink/14  854MB/s ±13%  901MB/s ± 5%  +5.60%  (p=0.000 n=19+17)
BM_UCordStringSink/15  930MB/s ± 2%  964MB/s ± 3%  +3.59%  (p=0.000 n=16+16)
BM_UCordStringSink/16  363MB/s ±10%  356MB/s ± 6%  ~  (p=0.050 n=20+19)
BM_UCordStringSink/17  976MB/s ±12%  1078MB/s ± 1%  +10.52%  (p=0.000 n=20+17)
BM_UCordStringSink/18  227MB/s ± 1%  355MB/s ± 3%  +56.45%  (p=0.000 n=16+17)
BM_UCordStringSink/19  751MB/s ± 4%  808MB/s ± 4%  +7.70%  (p=0.000 n=18+17)
BM_UCordStringSink/20  761MB/s ± 8%  786MB/s ± 4%  +3.23%  (p=0.000 n=18+17)	2017-01-27 09:10:36 +01:00
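The idea of an incremental copy that picks a strategy from its input can be sketched as below. This is a simplified illustration under stated assumptions, not the routine from the commit: the name `IncrementalCopySketch`, the 8-byte threshold, and the two-strategy split are all hypothetical. The function fills `[op, op_end)` from bytes starting at `src`, where `src` lies before `op` in the same buffer, so a short distance between them repeats a pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Fills [op, op_end) with bytes read from src onward. src must precede op
// in the same buffer; the ranges may overlap. Returns op_end.
inline char* IncrementalCopySketch(const char* src, char* op,
                                   char* const op_end) {
  assert(src < op && op <= op_end);
  const std::ptrdiff_t offset = op - src;
  // Strategy 1: when source and destination are at least 8 bytes apart,
  // each 8-byte chunk reads only bytes that are already final.
  if (offset >= 8) {
    while (op_end - op >= 8) {
      char tmp[8];
      std::memcpy(tmp, src, sizeof(tmp));
      std::memcpy(op, tmp, sizeof(tmp));
      src += 8;
      op += 8;
    }
  }
  // Strategy 2: byte-by-byte copy. It repeats the pattern when the offset
  // is short and finishes any tail left by strategy 1.
  while (op < op_end) {
    *op++ = *src++;
  }
  return op_end;
}
```

For instance, starting from a buffer that begins with "ab" and copying 8 bytes to offset 2 produces "ababababab". A tuned routine would likely handle short offsets with wider stores as well, which this sketch deliberately leaves out.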
Behzad Nouri	818b583387	adds std:: to stl types (#061 )	2017-01-26 21:43:13 +01:00
Geoff Pike	27c5d86527	Re-work fast path for handling copies in zippy decompression. This is a performance-tuning change that shouldn't change the behavior of the library. This adds some complexity but the performance gain might make that worthwhile: With FDO on perflab/haswell, a 4.0% gain (geometric mean).
SAMPLE (before)
Benchmark  Time(ns)  CPU(ns)  Iterations
------------------------------------------------
BM_UFlat/0  36638  36552  100000  2.6GB/s  html
BM_UFlat/1  457153  455895  9173  1.4GB/s  urls
BM_UFlat/2  5850  5837  685481  19.6GB/s  jpg
BM_UFlat/3  122  122  34551988  1.5GB/s  jpg_200
BM_UFlat/4  6797  6781  620811  14.1GB/s  pdf
BM_UFlat/5  179485  179037  23471  2.1GB/s  html4
BM_UFlat/6  142734  142384  29525  1018.7MB/s  txt1
BM_UFlat/7  125233  124924  33709  955.6MB/s  txt2
BM_UFlat/8  382548  381533  10000  1066.7MB/s  txt3
BM_UFlat/9  525614  524297  8018  876.5MB/s  txt4
BM_UFlat/10  34946  34868  100000  3.2GB/s  pb
BM_UFlat/11  149548  149208  28063  1.2GB/s  gaviota
BM_UFlat/12  10684  10663  392580  2.1GB/s  cp
BM_UFlat/13  5494  5484  766584  1.9GB/s  c
BM_UFlat/14  1691  1688  2488784  2.1GB/s  lsp
BM_UFlat/15  676443  674726  6129  1.4GB/s  xls
BM_UFlat/16  156  156  26656909  1.2GB/s  xls_200
BM_UFlat/17  239911  239297  17558  2.0GB/s  bin
BM_UFlat/18  182  182  23072932  1047.9MB/s  bin_200
BM_UFlat/19  21544  21499  194484  1.7GB/s  sum
BM_UFlat/20  2236  2232  1877810  1.8GB/s  man
BM_UFlatSink/0  42266  42179  99732  2.3GB/s  html
BM_UFlatSink/1  461810  460633  9055  1.4GB/s  urls
BM_UFlatSink/2  5816  5804  632829  19.8GB/s  jpg
BM_UFlatSink/3  124  123  34351698  1.5GB/s  jpg_200
BM_UFlatSink/4  7173  7157  609929  13.3GB/s  pdf
BM_UFlatSink/5  184795  184302  22660  2.1GB/s  html4
BM_UFlatSink/6  143552  143223  29272  1012.7MB/s  txt1
BM_UFlatSink/7  127160  126890  33178  940.8MB/s  txt2
BM_UFlatSink/8  382219  381313  10000  1067.3MB/s  txt3
BM_UFlatSink/9  528042  526713  7988  872.5MB/s  txt4
BM_UFlatSink/10  41389  41305  100000  2.7GB/s  pb
BM_UFlatSink/11  147215  146877  28854  1.2GB/s  gaviota
BM_UFlatSink/12  12008  11984  348139  1.9GB/s  cp
BM_UFlatSink/13  5444  5433  775084  1.9GB/s  c
BM_UFlatSink/14  1647  1644  2552119  2.1GB/s  lsp
BM_UFlatSink/15  665011  663424  6320  1.4GB/s  xls
BM_UFlatSink/16  153  153  27571837  1.2GB/s  xls_200
BM_UFlatSink/17  239735  239169  17411  2.0GB/s  bin
BM_UFlatSink/18  183  182  23005573  1046.8MB/s  bin_200
BM_UFlatSink/19  22544  22498  187705  1.6GB/s  sum
BM_UFlatSink/20  2190  2186  1917894  1.8GB/s  man
SAMPLE (after)
Benchmark  Time(ns)  CPU(ns)  Iterations
------------------------------------------------
BM_UFlat/0  33940  33889  100000  2.8GB/s  html
BM_UFlat/1  440728  439944  9586  1.5GB/s  urls
BM_UFlat/2  5652  5641  744776  20.3GB/s  jpg
BM_UFlat/3  123  123  34647884  1.5GB/s  jpg_200
BM_UFlat/4  6628  6615  631892  14.4GB/s  pdf
BM_UFlat/5  169523  169227  24197  2.3GB/s  html4
BM_UFlat/6  144139  143892  29232  1008.0MB/s  txt1
BM_UFlat/7  127148  126915  33144  940.6MB/s  txt2
BM_UFlat/8  380267  379233  10000  1073.2MB/s  txt3
BM_UFlat/9  529495  528194  7957  870.0MB/s  txt4
BM_UFlat/10  31844  31784  100000  3.5GB/s  pb
BM_UFlat/11  146822  146476  28737  1.2GB/s  gaviota
BM_UFlat/12  10784  10762  392176  2.1GB/s  cp
BM_UFlat/13  5528  5518  760934  1.9GB/s  c
BM_UFlat/14  1721  1719  2449291  2.0GB/s  lsp
BM_UFlat/15  673304  671774  6255  1.4GB/s  xls
BM_UFlat/16  155  155  27092003  1.2GB/s  xls_200
BM_UFlat/17  230424  229902  18285  2.1GB/s  bin
BM_UFlat/18  185  184  22818199  1033.9MB/s  bin_200
BM_UFlat/19  21035  20996  200765  1.7GB/s  sum
BM_UFlat/20  2242  2238  1864380  1.8GB/s  man
BM_UFlatSink/0  33487  33405  100000  2.9GB/s  html
BM_UFlatSink/1  431108  430226  9764  1.5GB/s  urls
BM_UFlatSink/2  5927  5916  648112  19.4GB/s  jpg
BM_UFlatSink/3  123  122  34704423  1.5GB/s  jpg_200
BM_UFlatSink/4  6472  6461  653462  14.8GB/s  pdf
BM_UFlatSink/5  164309  163988  25567  2.3GB/s  html4
BM_UFlatSink/6  138274  138020  30311  1050.9MB/s  txt1
BM_UFlatSink/7  120844  120637  34708  989.6MB/s  txt2
BM_UFlatSink/8  371046  370366  10000  1098.9MB/s  txt3
BM_UFlatSink/9  510021  508982  8269  902.9MB/s  txt4
BM_UFlatSink/10  30889  30844  100000  3.6GB/s  pb
BM_UFlatSink/11  140752  140521  29903  1.2GB/s  gaviota
BM_UFlatSink/12  10162  10146  413600  2.3GB/s  cp
BM_UFlatSink/13  5264  5256  762398  2.0GB/s  c
BM_UFlatSink/14  1622  1619  2606069  2.1GB/s  lsp
BM_UFlatSink/15  646897  645756  6512  1.5GB/s  xls
BM_UFlatSink/16  150  150  28223595  1.2GB/s  xls_200
BM_UFlatSink/17  226096  225650  18629  2.1GB/s  bin
BM_UFlatSink/18  185  184  22907935  1035.3MB/s  bin_200
BM_UFlatSink/19  21369  21335  198881  1.7GB/s  sum
BM_UFlatSink/20  2139  2136  1953637  1.8GB/s  man	2017-01-26 21:42:26 +01:00


361 commits