140 Commits

Author SHA1 Message Date
costan e4de6ce087 Small improvements to open source CI configuration.
This CL fixes 64-bit Windows testing, makes it possible to view the
test output in the Travis / AppVeyor CI console while the test is
running, and takes advantage of the new support for the .appveyor.yml
file name to make the CI configuration less obtrusive.
2017-07-27 16:46:54 -07:00
costan c756f7f5d9 Support both static and shared library CMake builds.
This can be used to fix https://github.com/Homebrew/homebrew-core/issues/15722.
2017-07-27 16:46:54 -07:00
costan 038a3329b1 Inline DISALLOW_COPY_AND_ASSIGN.
snappy-stubs-public.h defined the DISALLOW_COPY_AND_ASSIGN macro, so the
definition propagated to all translation units that included the open
source headers. The macro is now inlined, thus avoiding polluting the
macro environment of snappy users.
2017-07-27 16:46:42 -07:00
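As an illustration of what inlining DISALLOW_COPY_AND_ASSIGN (038a3329b1) can look like, here is a minimal sketch. The class `ByteCounter` is hypothetical and not taken from the snappy sources, and the use of C++11 deleted functions is an assumption; the commit message does not say which idiom was used. The point is only that the copy operations are disabled directly in the class, so no macro has to leak out of a public header.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical class, used only to illustrate the idea.
class ByteCounter {
 public:
  ByteCounter() : count_(0) {}

  // Written out by hand instead of DISALLOW_COPY_AND_ASSIGN(ByteCounter).
  // Assumption: C++11 "= delete"; a pre-C++11 build would declare these
  // private and leave them undefined.
  ByteCounter(const ByteCounter&) = delete;
  ByteCounter& operator=(const ByteCounter&) = delete;

  void Add(std::size_t n) {
    assert(count_ + n >= count_);  // guard against wrap-around
    count_ += n;
  }
  std::size_t count() const { return count_; }

 private:
  std::size_t count_;
};

// The class is non-copyable without any macro being visible to users.
static_assert(!std::is_copy_constructible<ByteCounter>::value,
              "ByteCounter must not be copy-constructible");
static_assert(!std::is_copy_assignable<ByteCounter>::value,
              "ByteCounter must not be copy-assignable");
```

Code that includes such a header no longer sees a DISALLOW_COPY_AND_ASSIGN definition, so it cannot clash with a macro of the same name defined elsewhere.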
costan a8b239c3de snappy: Remove autoconf build configuration. 2017-07-25 18:20:38 -07:00
costan 27671c6aec Clean up CMake header and type checks.
Unused macros: HAVE_DLFCN_H, HAVE_INTTYPES_H, HAVE_MEMORY_H,
HAVE_STDLIB_H, HAVE_STRINGS_H, HAVE_STRING_H, HAVE_SYS_BYTESWAP_H,
HAVE_SYS_STAT_H, HAVE_SYS_TYPES_H, HAVE_UNISTD_H.

Used but never set macros: HAVE_LIBLZF, HAVE_LIBQUICKLZ. These only gate
conditional includes. The code that takes advantage of them was removed.

Unused types: ssize_t.

The testing code uses HAVE_FUNC_MMAP, which was not wired in the CMake
build, causing a whole test to be skipped.
2017-07-25 18:17:35 -07:00
costan 548501c988 zippy: Re-release snappy 1.1.5 as 1.1.6.
The migration from autotools to CMake in 1.1.5 wasn't as smooth as
intended. The SONAME / SOVERSION were broken in both build systems,
causing breakages in systems that upgraded from snappy 1.1.4 to 1.1.5,
as reported in https://github.com/Homebrew/homebrew-core/issues/15274
and https://github.com/google/snappy/pull/45.
2017-07-13 03:56:49 -07:00
costan 513df5fb5a Tag open source release 1.1.5. 2017-06-28 18:37:30 -07:00
costan 5bc9c82ae3 Set minimum CMake version to 3.1.
The project only needs CMake 3.1 features, and some Travis CI bots have
CMake 3.2.2. Therefore, requiring CMake 3.4 is inconvenient.
2017-06-28 18:37:08 -07:00
costan e9720a001d Update Travis CI config, add AppVeyor for Windows CI coverage. 2017-06-28 18:36:37 -07:00
tmsriram f24f9d2d97 Explicitly copy internal::wordmask to the stack array to work around a compiler
optimization with LLVM that converts const stack arrays to global arrays.  This
is a temporary change and should be reverted when https://reviews.llvm.org/D30759
is fixed.

With PIE, accessing stack arrays is more efficient than global arrays and
wordmask was moved to the stack due to that.  However, the LLVM compiler
automatically converts stack arrays, detected as constant, to global arrays
and this transformation hurts PIE performance with LLVM.

We are working to fix this in the LLVM compiler, via
https://reviews.llvm.org/D30759, to not do this conversion in PIE mode.  Until
this patch is finished, please consider this source change as a temporary
workaround to keep this array on the stack.  This source change is important
to allow some projects to flip the default compiler from GCC to LLVM for
optimized builds.

This change works for the following reason.  The LLVM compiler does not convert
non-const stack arrays to global arrays and explicitly copying the elements is
enough to make the compiler assume that this is a non-const array.

With GCC, this change does not affect code-gen in any significant way.  The
array initialization code is slightly different as it copies the constants
directly to the stack.

With LLVM, this keeps the array on the stack.

No change in performance with GCC (within noise range). With LLVM, ~0.7%
improvement in optimized mode (no FDO) and ~1.75% improvement in FDO
mode.
2017-06-28 18:34:54 -07:00
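The workaround described in f24f9d2d97 can be sketched roughly as follows. This is not the code from the commit: the helper `KeepLowBytes` is hypothetical, and the mask values are an assumption (a table that keeps the low n bytes of a 32-bit word). What it shows is the pattern the message describes, a non-const local array filled by explicit element copies from the global constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace internal {
// Assumed contents: mask n keeps the low n bytes of a 32-bit word.
static const std::uint32_t wordmask[] = {0u, 0xffu, 0xffffu, 0xffffffu,
                                         0xffffffffu};
}  // namespace internal

// Hypothetical helper: returns the low n bytes of word, for n in [0, 4].
inline std::uint32_t KeepLowBytes(std::uint32_t word, std::size_t n) {
  const std::size_t kCount =
      sizeof(internal::wordmask) / sizeof(internal::wordmask[0]);
  assert(n < kCount);

  // Non-const local array, filled element by element. Per the commit
  // message, this is what keeps LLVM from turning it back into a global.
  std::uint32_t wordmask[kCount];
  for (std::size_t i = 0; i < kCount; ++i) {
    wordmask[i] = internal::wordmask[i];
  }
  return word & wordmask[n];
}
```

Whether a given compiler version really keeps the array on the stack depends on its optimizer, which is why the message calls the change temporary.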
ysaed 82deffcde7 Remove benchmarking support for fastlz. 2017-06-28 18:33:55 -07:00
alkis 18488d6212 Use 64 bit little endian on ppc64le.
This has tangible performance benefits.

This lands https://github.com/google/snappy/pull/27
2017-06-28 18:33:13 -07:00
alkis 7b9532b878 Improve the SSE2 macro check on Windows.
This lands https://github.com/google/snappy/pull/37
2017-06-05 13:54:17 -07:00
alkis 7dadceea52 Check for the existence of sys/uio.h in autoconf build.
This lands https://github.com/google/snappy/pull/32
2017-06-05 13:54:17 -07:00
jyrki 83179dd8be Remove quicklz and lzf support in benchmarks. 2017-06-05 13:54:10 -07:00
vrabaud c8131680d0 Provide a CMakeLists.txt.
This lands https://github.com/google/snappy/pull/29
2017-06-05 13:53:29 -07:00
costan ed3b7b242b Clean up unused function warnings in snappy. 2017-03-17 13:59:03 -07:00
costan 8b60aac4fd Remove "using namespace std;" from zippy-stubs-internal.h.
This makes it easier to build zippy, as some compilers require a warning
suppression to accept "using namespace std".
2017-03-13 13:03:01 -07:00
costan 7d7a8ec805 Add Travis CI configuration to snappy and fix the make build.
The make build in the open source version uses autoconf, which is set up
to expect a project that follows the GNU standard.
2017-03-10 12:40:15 -08:00
alkis 1cd3ab02e9 Rename README to README.md. It is already in Markdown, so we might as well let GitHub know so that it renders nicely. 2017-03-08 12:05:05 -08:00
alkis 597fa795de Delete UnalignedCopy64 from snappy-stubs since the version in snappy.cc is more robust and possibly faster (assuming the compiler knows the fastest way to copy 8 bytes between memory locations, a rather safe bet). 2017-03-08 11:42:30 -08:00
scrubbed 039b3a7ace Add std:: prefix to STL non-type names.
In order to disable global using declarations, this CL qualifies
stl names with the std namespace.
2017-03-08 11:42:30 -08:00
alkis 3c706d2230 Make UnalignedCopy64 not exhibit undefined behavior when src and dst overlap.
name         old speed      new speed      delta
BM_UFlat/0   3.09GB/s ± 3%  3.07GB/s ± 2%  -0.78%  (p=0.009 n=19+19)
BM_UFlat/1   1.63GB/s ± 2%  1.62GB/s ± 2%    ~     (p=0.099 n=19+20)
BM_UFlat/2   19.7GB/s ±19%  20.7GB/s ±11%    ~     (p=0.054 n=20+19)
BM_UFlat/3   1.61GB/s ± 2%  1.60GB/s ± 1%  -0.48%  (p=0.049 n=20+17)
BM_UFlat/4   15.8GB/s ± 7%  15.6GB/s ±10%    ~     (p=0.234 n=20+20)
BM_UFlat/5   2.47GB/s ± 1%  2.46GB/s ± 2%    ~     (p=0.608 n=19+19)
BM_UFlat/6   1.07GB/s ± 2%  1.07GB/s ± 1%    ~     (p=0.128 n=20+19)
BM_UFlat/7   1.01GB/s ± 1%  1.00GB/s ± 2%    ~     (p=0.656 n=15+19)
BM_UFlat/8   1.13GB/s ± 1%  1.13GB/s ± 1%    ~     (p=0.532 n=18+19)
BM_UFlat/9    918MB/s ± 1%   916MB/s ± 1%    ~     (p=0.443 n=19+18)
BM_UFlat/10  3.90GB/s ± 1%  3.90GB/s ± 1%    ~     (p=0.895 n=20+19)
BM_UFlat/11  1.30GB/s ± 1%  1.29GB/s ± 2%    ~     (p=0.156 n=19+19)
BM_UFlat/12  2.35GB/s ± 2%  2.34GB/s ± 1%    ~     (p=0.349 n=19+17)
BM_UFlat/13  2.07GB/s ± 1%  2.06GB/s ± 2%    ~     (p=0.475 n=18+19)
BM_UFlat/14  2.23GB/s ± 1%  2.23GB/s ± 1%    ~     (p=0.983 n=19+19)
BM_UFlat/15  1.55GB/s ± 1%  1.55GB/s ± 1%    ~     (p=0.314 n=19+19)
BM_UFlat/16  1.26GB/s ± 1%  1.26GB/s ± 1%    ~     (p=0.907 n=15+18)
BM_UFlat/17  2.32GB/s ± 1%  2.32GB/s ± 1%    ~     (p=0.604 n=18+19)
BM_UFlat/18  1.61GB/s ± 1%  1.61GB/s ± 1%    ~     (p=0.212 n=18+19)
BM_UFlat/19  1.78GB/s ± 1%  1.78GB/s ± 2%    ~     (p=0.350 n=19+19)
BM_UFlat/20  1.89GB/s ± 1%  1.90GB/s ± 2%    ~     (p=0.092 n=19+19)

Also tested the current version against UNALIGNED_STORE64(dst, UNALIGNED_LOAD64(src)), there is no difference (old is memcpy, new is UNALIGNED*):

name         old speed      new speed      delta
BM_UFlat/0   3.14GB/s ± 1%  3.16GB/s ± 2%    ~     (p=0.156 n=19+19)
BM_UFlat/1   1.62GB/s ± 1%  1.61GB/s ± 2%    ~     (p=0.102 n=19+20)
BM_UFlat/2   18.8GB/s ±17%  19.1GB/s ±11%    ~     (p=0.390 n=20+16)
BM_UFlat/3   1.59GB/s ± 1%  1.58GB/s ± 1%  -1.06%  (p=0.000 n=18+18)
BM_UFlat/4   15.8GB/s ± 6%  15.6GB/s ± 7%    ~     (p=0.184 n=19+20)
BM_UFlat/5   2.46GB/s ± 1%  2.44GB/s ± 1%  -0.95%  (p=0.000 n=19+18)
BM_UFlat/6   1.08GB/s ± 1%  1.06GB/s ± 1%  -1.17%  (p=0.000 n=19+18)
BM_UFlat/7   1.00GB/s ± 1%  0.99GB/s ± 1%  -1.16%  (p=0.000 n=19+18)
BM_UFlat/8   1.14GB/s ± 2%  1.12GB/s ± 1%  -1.12%  (p=0.000 n=19+18)
BM_UFlat/9    921MB/s ± 1%   914MB/s ± 1%  -0.84%  (p=0.000 n=20+17)
BM_UFlat/10  3.94GB/s ± 2%  3.92GB/s ± 1%    ~     (p=0.058 n=19+17)
BM_UFlat/11  1.29GB/s ± 1%  1.28GB/s ± 1%  -0.77%  (p=0.001 n=19+17)
BM_UFlat/12  2.34GB/s ± 1%  2.31GB/s ± 1%  -1.10%  (p=0.000 n=18+18)
BM_UFlat/13  2.06GB/s ± 1%  2.05GB/s ± 1%  -0.73%  (p=0.001 n=19+18)
BM_UFlat/14  2.22GB/s ± 1%  2.20GB/s ± 1%  -0.73%  (p=0.000 n=18+18)
BM_UFlat/15  1.55GB/s ± 1%  1.53GB/s ± 1%  -1.07%  (p=0.000 n=19+18)
BM_UFlat/16  1.26GB/s ± 1%  1.25GB/s ± 1%  -0.79%  (p=0.000 n=18+18)
BM_UFlat/17  2.31GB/s ± 1%  2.29GB/s ± 1%  -0.98%  (p=0.000 n=20+18)
BM_UFlat/18  1.61GB/s ± 1%  1.60GB/s ± 2%  -0.71%  (p=0.001 n=20+19)
BM_UFlat/19  1.77GB/s ± 1%  1.76GB/s ± 1%  -0.61%  (p=0.007 n=19+18)
BM_UFlat/20  1.89GB/s ± 1%  1.88GB/s ± 1%  -0.75%  (p=0.000 n=20+18)
2017-03-08 11:42:30 -08:00
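For 3c706d2230, one way to make an 8-byte copy well defined when the source and destination overlap is to go through a temporary buffer. The message says the current version is memcpy-based, but the exact function body below is an assumption, offered as a sketch rather than the committed code.

```cpp
#include <cassert>
#include <cstring>

// Copies 8 bytes from src to dst. The two ranges may overlap: all of the
// source is read into tmp before anything is written, so neither memcpy
// call ever sees overlapping arguments.
inline void UnalignedCopy64(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
  char tmp[8];
  std::memcpy(tmp, src, 8);
  std::memcpy(dst, tmp, 8);
}
```

Compilers commonly reduce the two fixed-size memcpy calls to a single 8-byte load and store, which would be consistent with the mostly flat benchmark numbers in the first table above.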
skanev d3c6d20d0a Add compression size reporting hooks.
Also, force inlining util::compression::Sample().

The inlining change is necessary. Without it even with FDO+LIPO the call
doesn't get inlined and uses 4 registers to construct parameters (which
won't be used in the common case). In some of the more compute-bound
tests that causes extra spills and significant overhead (even if
call is sufficiently long).
For example, with inlining:
BM_UFlat/0         32.7µs ± 1%    33.1µs ± 1%  +1.41%
without:
BM_UFlat/0         32.7µs ± 1%    37.7µs ± 1%  +15.29%
2017-03-08 11:42:21 -08:00
alkis 626e1b9faa Use #ifdef __SSE2__ for the emmintrin.h include, otherwise snappy.cc does not compile with -march=prescott. 2017-03-07 18:09:49 -08:00
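The guard mentioned in 626e1b9faa follows a common pattern: include emmintrin.h only when the compiler defines __SSE2__, and provide a portable fallback otherwise. The sketch below assumes a hypothetical helper named `Copy16`; it is not the snappy function, only an instance of the pattern.

```cpp
#include <cassert>
#include <cstring>

#ifdef __SSE2__
#include <emmintrin.h>  // SSE2 intrinsics; only included when available.
#endif

// Hypothetical helper: copies 16 bytes from src to dst.
inline void Copy16(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
#ifdef __SSE2__
  // Unaligned 128-bit load followed by an unaligned 128-bit store.
  __m128i v = _mm_loadu_si128(static_cast<const __m128i*>(src));
  _mm_storeu_si128(static_cast<__m128i*>(dst), v);
#else
  // Portable fallback for targets without SSE2.
  char tmp[16];
  std::memcpy(tmp, src, 16);
  std::memcpy(dst, tmp, 16);
#endif
}
```

Because both the include and the intrinsic calls sit behind the same macro, the file still compiles on targets where the header or the instructions are unavailable.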
Alkis Evlogimenos 2d99bd14d4 1.1.4 release. 2017-01-27 09:12:04 +01:00
Alkis Evlogimenos 8bfb028b61 Improve zippy decompression speed.
The CL contains the following optimizations:

1) rewrite IncrementalCopy routine: single routine that splits the code into sections based on typical probabilities observed across a variety of inputs and helps reduce branch mispredictions both for FDO and non-FDO builds. IncrementalCopy is an adaptive routine that selects the best strategy based on input.
2) introduce UnalignedCopy128 that copies 128 bits per cycle using SSE2.
3) add branch hint for the main decoding loop. The non-literal case is taken more often in benchmarks. I expect this to be a noop in production with FDO. Note that this became apparent after step 1 above.
4) use the new IncrementalCopy in ZippyScatteredWriter.

I test two archs: x86_haswell and ppc_power8.

For x86_haswell I use FDO. For ppc_power8 I do not use FDO.

x86_haswell + FDO

name                   old speed      new speed      delta
BM_UCord/0             1.97GB/s ± 1%  3.19GB/s ± 1%  +62.08%  (p=0.000 n=19+18)
BM_UCord/1             1.28GB/s ± 1%  1.51GB/s ± 1%  +18.14%  (p=0.000 n=19+18)
BM_UCord/2             15.6GB/s ± 9%  15.5GB/s ± 7%     ~     (p=0.620 n=20+20)
BM_UCord/3              811MB/s ± 1%   808MB/s ± 1%   -0.38%  (p=0.009 n=17+18)
BM_UCord/4             12.4GB/s ± 4%  12.7GB/s ± 8%   +2.70%  (p=0.002 n=17+20)
BM_UCord/5             1.77GB/s ± 0%  2.33GB/s ± 1%  +31.37%  (p=0.000 n=18+18)
BM_UCord/6              900MB/s ± 1%  1006MB/s ± 1%  +11.71%  (p=0.000 n=18+17)
BM_UCord/7              858MB/s ± 1%   938MB/s ± 2%   +9.36%  (p=0.000 n=19+16)
BM_UCord/8              921MB/s ± 1%   985MB/s ±21%   +6.94%  (p=0.028 n=19+20)
BM_UCord/9              824MB/s ± 1%   800MB/s ±20%     ~     (p=0.113 n=19+20)
BM_UCord/10            2.60GB/s ± 1%  3.67GB/s ±21%  +41.31%  (p=0.000 n=19+20)
BM_UCord/11            1.07GB/s ± 1%  1.21GB/s ± 1%  +13.17%  (p=0.000 n=16+16)
BM_UCord/12            1.84GB/s ± 8%  2.18GB/s ± 1%  +18.44%  (p=0.000 n=16+19)
BM_UCord/13            1.83GB/s ±18%  1.89GB/s ± 1%   +3.14%  (p=0.000 n=17+19)
BM_UCord/14            1.96GB/s ± 2%  1.97GB/s ± 1%   +0.55%  (p=0.000 n=16+17)
BM_UCord/15            1.30GB/s ±20%  1.43GB/s ± 1%   +9.85%  (p=0.000 n=20+20)
BM_UCord/16             658MB/s ±20%   705MB/s ± 1%   +7.22%  (p=0.000 n=20+19)
BM_UCord/17            1.96GB/s ± 2%  2.15GB/s ± 1%   +9.73%  (p=0.000 n=16+19)
BM_UCord/18             555MB/s ± 1%   833MB/s ± 1%  +50.11%  (p=0.000 n=18+19)
BM_UCord/19            1.57GB/s ± 1%  1.75GB/s ± 1%  +11.34%  (p=0.000 n=20+20)
BM_UCord/20            1.72GB/s ± 2%  1.70GB/s ± 2%   -1.01%  (p=0.001 n=20+20)
BM_UCordStringSink/0   2.88GB/s ± 1%  3.15GB/s ± 1%   +9.56%  (p=0.000 n=17+20)
BM_UCordStringSink/1   1.50GB/s ± 1%  1.52GB/s ± 1%   +1.96%  (p=0.000 n=19+20)
BM_UCordStringSink/2   14.5GB/s ±10%  14.6GB/s ±10%     ~     (p=0.542 n=20+20)
BM_UCordStringSink/3   1.06GB/s ± 1%  1.08GB/s ± 1%   +1.77%  (p=0.000 n=18+20)
BM_UCordStringSink/4   12.6GB/s ± 7%  13.2GB/s ± 4%   +4.63%  (p=0.000 n=20+20)
BM_UCordStringSink/5   2.29GB/s ± 1%  2.36GB/s ± 1%   +3.05%  (p=0.000 n=19+20)
BM_UCordStringSink/6   1.01GB/s ± 2%  1.01GB/s ± 0%     ~     (p=0.055 n=20+18)
BM_UCordStringSink/7    945MB/s ± 1%   939MB/s ± 1%   -0.60%  (p=0.000 n=19+20)
BM_UCordStringSink/8   1.06GB/s ± 1%  1.07GB/s ± 1%   +0.62%  (p=0.000 n=18+20)
BM_UCordStringSink/9    866MB/s ± 1%   864MB/s ± 1%     ~     (p=0.107 n=19+20)
BM_UCordStringSink/10  3.64GB/s ± 2%  3.98GB/s ± 1%   +9.32%  (p=0.000 n=19+20)
BM_UCordStringSink/11  1.22GB/s ± 1%  1.22GB/s ± 1%   +0.61%  (p=0.001 n=19+20)
BM_UCordStringSink/12  2.23GB/s ± 1%  2.23GB/s ± 1%     ~     (p=0.692 n=19+20)
BM_UCordStringSink/13  1.96GB/s ± 1%  1.94GB/s ± 1%   -0.82%  (p=0.000 n=17+18)
BM_UCordStringSink/14  2.09GB/s ± 2%  2.08GB/s ± 1%     ~     (p=0.147 n=20+18)
BM_UCordStringSink/15  1.47GB/s ± 1%  1.45GB/s ± 1%   -0.88%  (p=0.000 n=20+19)
BM_UCordStringSink/16   908MB/s ± 1%   917MB/s ± 1%   +0.97%  (p=0.000 n=19+19)
BM_UCordStringSink/17  2.11GB/s ± 1%  2.20GB/s ± 1%   +4.35%  (p=0.000 n=18+20)
BM_UCordStringSink/18   804MB/s ± 2%  1106MB/s ± 1%  +37.52%  (p=0.000 n=20+20)
BM_UCordStringSink/19  1.67GB/s ± 1%  1.72GB/s ± 0%   +2.81%  (p=0.000 n=18+20)
BM_UCordStringSink/20  1.77GB/s ± 3%  1.77GB/s ± 3%     ~     (p=0.815 n=20+20)

ppc_power8

name                   old speed      new speed      delta
BM_UCord/0              918MB/s ± 6%  1262MB/s ± 0%   +37.56%  (p=0.000 n=17+16)
BM_UCord/1              671MB/s ±13%   879MB/s ± 2%   +30.99%  (p=0.000 n=18+16)
BM_UCord/2             12.6GB/s ± 8%  12.6GB/s ± 5%      ~     (p=0.452 n=17+19)
BM_UCord/3              285MB/s ±10%   284MB/s ± 4%    -0.50%  (p=0.021 n=19+17)
BM_UCord/4             5.21GB/s ±12%  6.59GB/s ± 1%   +26.37%  (p=0.000 n=17+16)
BM_UCord/5              913MB/s ± 4%  1253MB/s ± 1%   +37.27%  (p=0.000 n=16+17)
BM_UCord/6              461MB/s ±13%   547MB/s ± 1%   +18.67%  (p=0.000 n=18+16)
BM_UCord/7              455MB/s ± 2%   524MB/s ± 3%   +15.28%  (p=0.000 n=16+18)
BM_UCord/8              489MB/s ± 2%   584MB/s ± 2%   +19.47%  (p=0.000 n=17+17)
BM_UCord/9              410MB/s ±33%   490MB/s ± 1%   +19.64%  (p=0.000 n=17+18)
BM_UCord/10            1.10GB/s ± 3%  1.55GB/s ± 2%   +41.21%  (p=0.000 n=16+16)
BM_UCord/11             494MB/s ± 1%   558MB/s ± 1%   +12.92%  (p=0.000 n=17+18)
BM_UCord/12             608MB/s ± 3%   793MB/s ± 1%   +30.45%  (p=0.000 n=17+16)
BM_UCord/13             545MB/s ±18%   721MB/s ± 2%   +32.22%  (p=0.000 n=19+17)
BM_UCord/14             594MB/s ± 4%   748MB/s ± 3%   +25.99%  (p=0.000 n=17+17)
BM_UCord/15             628MB/s ± 1%   822MB/s ± 3%   +30.94%  (p=0.000 n=18+16)
BM_UCord/16             277MB/s ± 2%   280MB/s ±15%    +0.86%  (p=0.001 n=17+17)
BM_UCord/17             864MB/s ± 1%  1001MB/s ± 3%   +15.96%  (p=0.000 n=17+17)
BM_UCord/18             121MB/s ± 2%   284MB/s ± 4%  +134.08%  (p=0.000 n=17+18)
BM_UCord/19             594MB/s ± 0%   713MB/s ± 2%   +19.93%  (p=0.000 n=16+17)
BM_UCord/20             553MB/s ±10%   662MB/s ± 5%   +19.74%  (p=0.000 n=16+18)
BM_UCordStringSink/0   1.37GB/s ± 4%  1.48GB/s ± 2%    +8.51%  (p=0.000 n=16+16)
BM_UCordStringSink/1    969MB/s ± 1%   990MB/s ± 1%    +2.16%  (p=0.000 n=16+18)
BM_UCordStringSink/2   13.1GB/s ±11%  13.0GB/s ±14%      ~     (p=0.858 n=17+18)
BM_UCordStringSink/3    411MB/s ± 1%   415MB/s ± 1%    +0.93%  (p=0.000 n=16+17)
BM_UCordStringSink/4   6.81GB/s ± 8%  7.29GB/s ± 5%    +7.12%  (p=0.000 n=16+19)
BM_UCordStringSink/5   1.35GB/s ± 5%  1.45GB/s ±13%    +8.00%  (p=0.000 n=16+17)
BM_UCordStringSink/6    653MB/s ± 8%   653MB/s ± 3%    -0.12%  (p=0.007 n=17+19)
BM_UCordStringSink/7    618MB/s ±13%   597MB/s ±18%    -3.45%  (p=0.001 n=18+18)
BM_UCordStringSink/8    702MB/s ± 5%   702MB/s ± 1%    -0.10%  (p=0.012 n=17+16)
BM_UCordStringSink/9    590MB/s ± 2%   564MB/s ±13%    -4.46%  (p=0.000 n=16+17)
BM_UCordStringSink/10  1.63GB/s ± 2%  1.76GB/s ± 4%    +8.28%  (p=0.000 n=17+16)
BM_UCordStringSink/11   630MB/s ±14%   684MB/s ±15%    +8.51%  (p=0.000 n=19+17)
BM_UCordStringSink/12   858MB/s ±12%   903MB/s ± 9%    +5.17%  (p=0.000 n=19+17)
BM_UCordStringSink/13   806MB/s ±22%   879MB/s ± 1%    +8.98%  (p=0.000 n=19+19)
BM_UCordStringSink/14   854MB/s ±13%   901MB/s ± 5%    +5.60%  (p=0.000 n=19+17)
BM_UCordStringSink/15   930MB/s ± 2%   964MB/s ± 3%    +3.59%  (p=0.000 n=16+16)
BM_UCordStringSink/16   363MB/s ±10%   356MB/s ± 6%      ~     (p=0.050 n=20+19)
BM_UCordStringSink/17   976MB/s ±12%  1078MB/s ± 1%   +10.52%  (p=0.000 n=20+17)
BM_UCordStringSink/18   227MB/s ± 1%   355MB/s ± 3%   +56.45%  (p=0.000 n=16+17)
BM_UCordStringSink/19   751MB/s ± 4%   808MB/s ± 4%    +7.70%  (p=0.000 n=18+17)
BM_UCordStringSink/20   761MB/s ± 8%   786MB/s ± 4%    +3.23%  (p=0.000 n=18+17)
2017-01-27 09:10:36 +01:00
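To make clear what the IncrementalCopy routine in 8bfb028b61 has to compute, here is a deliberately slow reference version. It is not the optimized, adaptive routine the message describes; the name `IncrementalCopySlow` and its signature are assumptions for illustration. The key property is that the copy runs front to back, so when the destination starts inside the source range, bytes written earlier are read again and a short pattern is repeated.

```cpp
#include <cassert>
#include <cstddef>

// Reference semantics only: copies len bytes from src to op one byte at a
// time, in increasing address order. op must lie after src in the same
// buffer; the ranges are allowed to overlap.
inline char* IncrementalCopySlow(const char* src, char* op, std::size_t len) {
  assert(src < op);
  for (std::size_t i = 0; i < len; ++i) {
    op[i] = src[i];  // may read a byte written earlier in this loop
  }
  return op + len;
}
```

With a buffer that starts with "ab", copying 6 bytes from offset 0 to offset 2 extends it to "abababab". Any faster strategy, such as the 128-bit copies mentioned in the message, has to produce the same bytes.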
Behzad Nouri 818b583387 adds std:: to stl types (#061) 2017-01-26 21:43:13 +01:00
Geoff Pike 27c5d86527 Re-work fast path for handling copies in zippy decompression.
This is a performance-tuning change that shouldn't change the behavior
of the library.

This adds some complexity but the performance gain might make that
worthwhile: With FDO on perflab/haswell, a 4.0% gain (geometric mean).

SAMPLE (before)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           36638      36552     100000 2.6GB/s  html
BM_UFlat/1          457153     455895       9173 1.4GB/s  urls
BM_UFlat/2            5850       5837     685481 19.6GB/s  jpg
BM_UFlat/3             122        122   34551988 1.5GB/s  jpg_200
BM_UFlat/4            6797       6781     620811 14.1GB/s  pdf
BM_UFlat/5          179485     179037      23471 2.1GB/s  html4
BM_UFlat/6          142734     142384      29525 1018.7MB/s  txt1
BM_UFlat/7          125233     124924      33709 955.6MB/s  txt2
BM_UFlat/8          382548     381533      10000 1066.7MB/s  txt3
BM_UFlat/9          525614     524297       8018 876.5MB/s  txt4
BM_UFlat/10          34946      34868     100000 3.2GB/s  pb
BM_UFlat/11         149548     149208      28063 1.2GB/s  gaviota
BM_UFlat/12          10684      10663     392580 2.1GB/s  cp
BM_UFlat/13           5494       5484     766584 1.9GB/s  c
BM_UFlat/14           1691       1688    2488784 2.1GB/s  lsp
BM_UFlat/15         676443     674726       6129 1.4GB/s  xls
BM_UFlat/16            156        156   26656909 1.2GB/s  xls_200
BM_UFlat/17         239911     239297      17558 2.0GB/s  bin
BM_UFlat/18            182        182   23072932 1047.9MB/s  bin_200
BM_UFlat/19          21544      21499     194484 1.7GB/s  sum
BM_UFlat/20           2236       2232    1877810 1.8GB/s  man
BM_UFlatSink/0       42266      42179      99732 2.3GB/s  html
BM_UFlatSink/1      461810     460633       9055 1.4GB/s  urls
BM_UFlatSink/2        5816       5804     632829 19.8GB/s  jpg
BM_UFlatSink/3         124        123   34351698 1.5GB/s  jpg_200
BM_UFlatSink/4        7173       7157     609929 13.3GB/s  pdf
BM_UFlatSink/5      184795     184302      22660 2.1GB/s  html4
BM_UFlatSink/6      143552     143223      29272 1012.7MB/s  txt1
BM_UFlatSink/7      127160     126890      33178 940.8MB/s  txt2
BM_UFlatSink/8      382219     381313      10000 1067.3MB/s  txt3
BM_UFlatSink/9      528042     526713       7988 872.5MB/s  txt4
BM_UFlatSink/10      41389      41305     100000 2.7GB/s  pb
BM_UFlatSink/11     147215     146877      28854 1.2GB/s  gaviota
BM_UFlatSink/12      12008      11984     348139 1.9GB/s  cp
BM_UFlatSink/13       5444       5433     775084 1.9GB/s  c
BM_UFlatSink/14       1647       1644    2552119 2.1GB/s  lsp
BM_UFlatSink/15     665011     663424       6320 1.4GB/s  xls
BM_UFlatSink/16        153        153   27571837 1.2GB/s  xls_200
BM_UFlatSink/17     239735     239169      17411 2.0GB/s  bin
BM_UFlatSink/18        183        182   23005573 1046.8MB/s  bin_200
BM_UFlatSink/19      22544      22498     187705 1.6GB/s  sum
BM_UFlatSink/20       2190       2186    1917894 1.8GB/s  man

SAMPLE (after)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           33940      33889     100000 2.8GB/s  html
BM_UFlat/1          440728     439944       9586 1.5GB/s  urls
BM_UFlat/2            5652       5641     744776 20.3GB/s  jpg
BM_UFlat/3             123        123   34647884 1.5GB/s  jpg_200
BM_UFlat/4            6628       6615     631892 14.4GB/s  pdf
BM_UFlat/5          169523     169227      24197 2.3GB/s  html4
BM_UFlat/6          144139     143892      29232 1008.0MB/s  txt1
BM_UFlat/7          127148     126915      33144 940.6MB/s  txt2
BM_UFlat/8          380267     379233      10000 1073.2MB/s  txt3
BM_UFlat/9          529495     528194       7957 870.0MB/s  txt4
BM_UFlat/10          31844      31784     100000 3.5GB/s  pb
BM_UFlat/11         146822     146476      28737 1.2GB/s  gaviota
BM_UFlat/12          10784      10762     392176 2.1GB/s  cp
BM_UFlat/13           5528       5518     760934 1.9GB/s  c
BM_UFlat/14           1721       1719    2449291 2.0GB/s  lsp
BM_UFlat/15         673304     671774       6255 1.4GB/s  xls
BM_UFlat/16            155        155   27092003 1.2GB/s  xls_200
BM_UFlat/17         230424     229902      18285 2.1GB/s  bin
BM_UFlat/18            185        184   22818199 1033.9MB/s  bin_200
BM_UFlat/19          21035      20996     200765 1.7GB/s  sum
BM_UFlat/20           2242       2238    1864380 1.8GB/s  man
BM_UFlatSink/0       33487      33405     100000 2.9GB/s  html
BM_UFlatSink/1      431108     430226       9764 1.5GB/s  urls
BM_UFlatSink/2        5927       5916     648112 19.4GB/s  jpg
BM_UFlatSink/3         123        122   34704423 1.5GB/s  jpg_200
BM_UFlatSink/4        6472       6461     653462 14.8GB/s  pdf
BM_UFlatSink/5      164309     163988      25567 2.3GB/s  html4
BM_UFlatSink/6      138274     138020      30311 1050.9MB/s  txt1
BM_UFlatSink/7      120844     120637      34708 989.6MB/s  txt2
BM_UFlatSink/8      371046     370366      10000 1098.9MB/s  txt3
BM_UFlatSink/9      510021     508982       8269 902.9MB/s  txt4
BM_UFlatSink/10      30889      30844     100000 3.6GB/s  pb
BM_UFlatSink/11     140752     140521      29903 1.2GB/s  gaviota
BM_UFlatSink/12      10162      10146     413600 2.3GB/s  cp
BM_UFlatSink/13       5264       5256     762398 2.0GB/s  c
BM_UFlatSink/14       1622       1619    2606069 2.1GB/s  lsp
BM_UFlatSink/15     646897     645756       6512 1.5GB/s  xls
BM_UFlatSink/16        150        150   28223595 1.2GB/s  xls_200
BM_UFlatSink/17     226096     225650      18629 2.1GB/s  bin
BM_UFlatSink/18        185        184   22907935 1035.3MB/s  bin_200
BM_UFlatSink/19      21369      21335     198881 1.7GB/s  sum
BM_UFlatSink/20       2139       2136    1953637 1.8GB/s  man
2017-01-26 21:42:26 +01:00
Sriraman Tallam 4a74094080 Speed up Zippy decompression in PIE mode by removing the penalty for
global array access.

With PIE, accessing global arrays needs two instructions whereas it can be
done with a single instruction without PIE.  See []
For example, without PIE the access looks like:
mov    0x400780(,%rdi,4),%eax  // One instruction to access arr[i]

and with PIE the access looks like:
lea    0x149(%rip),%rax        # 400780 <_ZL3arr>
mov    (%rax,%rdi,4),%eax

This causes a slow down in zippy as it has two global arrays, wordmask and
char_table.  There is no equivalent PC-relative insn. with PIE to do this in
one instruction.

The slow down can be seen as an increase in dynamic instruction count and
cycles with a similar IPC.  We have seen this affect REDACTED recently and this
is causing a ~1% perf. slow down.

One of the mitigation techniques for small arrays is to move them onto the stack
and use the stack pointer to make the access a single instruction.  The downside to
this is the extra instructions at function call to move the array onto the stack,
which is why we want to do this only for small arrays.  I tried moving
wordmask onto the stack since it is a small array. The performance numbers look
good overall. There is an improvement in the dynamic instruction count for
almost all BM_UFlat benchmarks.  BM_UFlat/2 and BM_UFlat/3 are pretty noisy.
The only case where there is a regression is BM_UFlat/10.  Here, the instruction
count does go down but the IPC also goes down affecting performance. This also
looks noisy but I do see a small IPC drop with this change.  Otherwise, the
numbers look good and consistent.  I measured this on a perflab ivybridge
machine multiple times.  Numbers are given below.  For Improv. (improvements),
positive is good.

Binaries built as: blaze build -c opt --dynamic_mode=off

Benchmark	Base CPU(ns)	Opt CPU(ns)	Improv.	Base Cycles	Opt Cycles	Improv.	Base Insns	Opt Insns	Improv.

BM_UFlat/1	541711		537052		0.86%	46068129918	45442732684	1.36%	85113352848	83917656016	1.40%
BM_UFlat/2	6228		6388		-2.57%	582789808	583267855	-0.08%	1261517746	1261116553	0.03%
BM_UFlat/3	159		120		24.53%	61538641	58783800	4.48%	90008672	90980060	-1.08%
BM_UFlat/4	7878		7787		1.16%	710491888	703718556	0.95%	1914898283	1525060250	20.36%
BM_UFlat/5	208854		207673		0.57%	17640846255	17609530720	0.18%	36546983483	36008920788	1.47%
BM_UFlat/6	172595		167225		3.11%	14642082831	14232371166	2.80%	33647820489	33056659600	1.76%
BM_UFlat/7	152364		147901		2.93%	12904338645	12635220582	2.09%	28958390984	28457982504	1.73%
BM_UFlat/8	463764		448244		3.35%	39423576973	37917435891	3.82%	88350964483	86800265943	1.76%
BM_UFlat/9	639517		621811		2.77%	54275945823	52555988926	3.17%	119503172410	117432599704	1.73%
BM_UFlat/10	41929		42358		-1.02%	3593125535	3647231492	-1.51%	8559206066	8446526639	1.32%
BM_UFlat/11	174754		173936		0.47%	14885371426	14749410955	0.91%	36693421142	35987215897	1.92%
BM_UFlat/12	13388		13257		0.98%	1192648670	1179645044	1.09%	3506482177	3454962579	1.47%
BM_UFlat/13	6801		6588		3.13%	627960003	608367286	3.12%	1847877894	1818368400	1.60%
BM_UFlat/14	2057		1989		3.31%	229005588	217393157	5.07%	609686274	599419511	1.68%
BM_UFlat/15	831618		799881		3.82%	70440388955	67911853013	3.59%	167178603105	164653652416	1.51%
BM_UFlat/16	199		199		0.00%	70109081	68747579	1.94%	106263639	105569531	0.65%
BM_UFlat/17	279031		273890		1.84%	23361373312	23294246637	0.29%	40474834585	39981682217	1.22%
BM_UFlat/18	233		199		14.59%	74530664	67841101	8.98%	94305848	92271053	2.16%
BM_UFlat/19	26743		25309		5.36%	2327215133	2206712016	5.18%	6024314357	5935228694	1.48%
BM_UFlat/20	2731		2625		3.88%	282018757	276772813	1.86%	768382519	758277029	1.32%

Is this a reasonable work-around for the problem?  Do you need more performance
measurements?  haih@ is evaluating this change for [] and I will update those
numbers once we have it.

Tested:
   Performance with zippy_unittest.
2017-01-26 21:42:11 +01:00
Geoff Pike 38a5ec5fca Re-work fast path that emits copies in zippy compression.
The primary motivation for the change is that FindMatchLength is
likely to discover a difference in the first 8 bytes it compares.
If that occurs then we know the length of the match is less than 12,
because FindMatchLength is invoked after a 4-byte match is found.
When emitting a copy, it is useful to know that the length is less
than 12 because the two-byte variant of an emitted copy requires that.

This is a performance-tuning change that should not affect the
library's behavior.

With FDO on perflab/Haswell the geometric mean for ZFlat/* went from
47,290ns to 45,741ns, an improvement of 3.4%.

SAMPLE (before)

BM_ZFlat/0      102824     102650      40691 951.4MB/s  html (22.31 %)
BM_ZFlat/1     1293512    1290442       3225 518.9MB/s  urls (47.78 %)
BM_ZFlat/2       10373      10353     417959 11.1GB/s  jpg (99.95 %)
BM_ZFlat/3         268        268   15745324 712.4MB/s  jpg_200 (73.00 %)
BM_ZFlat/4       12137      12113     342462 7.9GB/s  pdf (83.30 %)
BM_ZFlat/5      430672     429720       9724 909.0MB/s  html4 (22.52 %)
BM_ZFlat/6      420541     419636       9833 345.6MB/s  txt1 (57.88 %)
BM_ZFlat/7      373829     373158      10000 319.9MB/s  txt2 (61.91 %)
BM_ZFlat/8     1119014    1116604       3755 364.5MB/s  txt3 (54.99 %)
BM_ZFlat/9     1544203    1540657       2748 298.3MB/s  txt4 (66.26 %)
BM_ZFlat/10      91041      90866      46002 1.2GB/s  pb (19.68 %)
BM_ZFlat/11     332766     331990      10000 529.5MB/s  gaviota (37.72 %)
BM_ZFlat/12      39960      39886     100000 588.3MB/s  cp (48.12 %)
BM_ZFlat/13      14493      14465     287181 735.1MB/s  c (42.47 %)
BM_ZFlat/14       4447       4440     947927 799.3MB/s  lsp (48.37 %)
BM_ZFlat/15    1316362    1313350       3196 747.7MB/s  xls (41.23 %)
BM_ZFlat/16        312        311   10000000 613.0MB/s  xls_200 (78.00 %)
BM_ZFlat/17     388471     387502      10000 1.2GB/s  bin (18.11 %)
BM_ZFlat/18         65         64   64838208 2.9GB/s  bin_200 (7.50 %)
BM_ZFlat/19      65900      65787      63099 554.3MB/s  sum (48.96 %)
BM_ZFlat/20       6188       6177     681951 652.6MB/s  man (59.21 %)

SAMPLE (after)

Benchmark     Time(ns)    CPU(ns) Iterations
--------------------------------------------
BM_ZFlat/0       99259      99044      42428 986.0MB/s  html (22.31 %)
BM_ZFlat/1     1257039    1255276       3341 533.4MB/s  urls (47.78 %)
BM_ZFlat/2       10044      10030     405781 11.4GB/s  jpg (99.95 %)
BM_ZFlat/3         268        267   15732282 713.3MB/s  jpg_200 (73.00 %)
BM_ZFlat/4       11675      11657     358629 8.2GB/s  pdf (83.30 %)
BM_ZFlat/5      420951     419818       9739 930.5MB/s  html4 (22.52 %)
BM_ZFlat/6      415460     414632      10000 349.8MB/s  txt1 (57.88 %)
BM_ZFlat/7      367191     366436      10000 325.8MB/s  txt2 (61.91 %)
BM_ZFlat/8     1098345    1096036       3819 371.3MB/s  txt3 (54.99 %)
BM_ZFlat/9     1508701    1505306       2758 305.3MB/s  txt4 (66.26 %)
BM_ZFlat/10      87195      87031      47289 1.3GB/s  pb (19.68 %)
BM_ZFlat/11     322338     321637      10000 546.5MB/s  gaviota (37.72 %)
BM_ZFlat/12      36739      36668     100000 639.9MB/s  cp (48.12 %)
BM_ZFlat/13      13646      13618     304009 780.9MB/s  c (42.47 %)
BM_ZFlat/14       4249       4240     992456 837.0MB/s  lsp (48.37 %)
BM_ZFlat/15    1262925    1260012       3314 779.4MB/s  xls (41.23 %)
BM_ZFlat/16        308        308   10000000 619.8MB/s  xls_200 (78.00 %)
BM_ZFlat/17     379750     378944      10000 1.3GB/s  bin (18.11 %)
BM_ZFlat/18         62         62   67443280 3.0GB/s  bin_200 (7.50 %)
BM_ZFlat/19      61706      61587      67645 592.1MB/s  sum (48.96 %)
BM_ZFlat/20       5968       5958     698974 676.6MB/s  man (59.21 %)
2017-01-26 21:39:39 +01:00
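The "less than 12" bound in 38a5ec5fca matters because of how the two-byte copy element is laid out in the Snappy format: the tag byte has 3 bits for the length minus 4, so it covers lengths 4 to 11, plus the top 3 bits of an 11-bit offset. The sketch below encodes such an element. The function name `EmitShortCopy` is hypothetical and this is not the code from the commit; the bit layout follows the public Snappy format description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Writes a "copy with 1-byte offset" element (tag type 01) at op and
// returns the position just past it. Requires 4 <= len <= 11 and
// offset < 2048.
inline std::uint8_t* EmitShortCopy(std::uint8_t* op, std::size_t offset,
                                   std::size_t len) {
  assert(len >= 4 && len <= 11);
  assert(offset < 2048);
  // Tag byte: bits 0..1 = 01, bits 2..4 = len - 4, bits 5..7 = offset >> 8.
  *op++ = static_cast<std::uint8_t>(1u | ((len - 4) << 2) |
                                    ((offset >> 8) << 5));
  // Second byte: low 8 bits of the offset.
  *op++ = static_cast<std::uint8_t>(offset & 0xff);
  return op;
}
```

A match found after the initial 4 bytes that differs somewhere in the next 8 has a length between 4 and 11, which is exactly the range this element can represent.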
ckennelly 094c67de88 Speed up the EmitLiteral fast path, +1.62% for ZFlat benchmarks.
This is inspired by the Go version in
//third_party/golang/snappy/encode_amd64.s (emitLiteralFastPath)

        Benchmark         Base:Reference   (1)
--------------------------------------------------
(BM_ZFlat_0 1/cputime_ns)        9.669e-06  +1.65%
(BM_ZFlat_1 1/cputime_ns)        7.643e-07  +2.53%
(BM_ZFlat_10 1/cputime_ns)       1.107e-05  -0.97%
(BM_ZFlat_11 1/cputime_ns)       3.002e-06  +0.71%
(BM_ZFlat_12 1/cputime_ns)       2.338e-05  +7.22%
(BM_ZFlat_13 1/cputime_ns)       6.386e-05  +9.18%
(BM_ZFlat_14 1/cputime_ns)       0.0002256  -0.05%
(BM_ZFlat_15 1/cputime_ns)       7.608e-07  -1.29%
(BM_ZFlat_16 1/cputime_ns)        0.003236  -1.28%
(BM_ZFlat_17 1/cputime_ns)        2.58e-06  +0.52%
(BM_ZFlat_18 1/cputime_ns)         0.01538  +0.00%
(BM_ZFlat_19 1/cputime_ns)       1.436e-05  +6.21%
(BM_ZFlat_2 1/cputime_ns)        0.0001044  +4.99%
(BM_ZFlat_20 1/cputime_ns)       0.0001608  -0.18%
(BM_ZFlat_3 1/cputime_ns)         0.003745  +0.38%
(BM_ZFlat_4 1/cputime_ns)        8.144e-05  +6.21%
(BM_ZFlat_5 1/cputime_ns)        2.328e-06  -1.60%
(BM_ZFlat_6 1/cputime_ns)        2.391e-06  +0.06%
(BM_ZFlat_7 1/cputime_ns)         2.68e-06  -0.61%
(BM_ZFlat_8 1/cputime_ns)        8.852e-07  +0.19%
(BM_ZFlat_9 1/cputime_ns)        6.441e-07  +1.06%

geometric mean                              +1.62%
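
The summary line can be reproduced from the per-benchmark deltas. A minimal sketch, assuming the comparison tool averages the ratios (1 + delta) geometrically; `GeometricMeanPercent` and `kZFlatDeltas` are hypothetical names, not part of snappy or its benchmark tooling:

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: geometric mean of relative changes given in percent.
// Each delta d becomes the ratio (1 + d/100); the mean is taken in log space
// and converted back to a percentage.
double GeometricMeanPercent(const std::vector<double>& deltas_percent) {
  if (deltas_percent.empty()) return 0.0;
  double log_sum = 0.0;
  for (double d : deltas_percent) {
    log_sum += std::log1p(d / 100.0);
  }
  const double n = static_cast<double>(deltas_percent.size());
  return std::expm1(log_sum / n) * 100.0;
}

// The 21 BM_ZFlat deltas from the table above, in the order listed.
const std::vector<double> kZFlatDeltas = {
    1.65, 2.53, -0.97, 0.71, 7.22, 9.18, -0.05, -1.29, -1.28, 0.52, 0.00,
    6.21, 4.99, -0.18, 0.38, 6.21, -1.60, 0.06, -0.61, 0.19, 1.06};
```

Fed the rounded deltas from the table, `GeometricMeanPercent(kZFlatDeltas)` comes out at roughly +1.62%, matching the reported figure.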
2017-01-26 21:38:49 +01:00
Geoff Pike fce661fa8c Speed up zippy decompression by removing some zero-extensions.
This is a performance tuning change that should not affect
correctness.  On perflab with FDO on Haswell the performance gain is
21,776ns before vs 21,255ns after, about 2.4%.  (Using geometric means.)

SAMPLE PERFORMANCE with FDO on HASWELL (NEW)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           37366      37279     100000 2.6GB/s  html
BM_UFlat/1          471153     470204       8975 1.4GB/s  urls
BM_UFlat/2            6116       6105     639496 18.8GB/s  jpg
BM_UFlat/3             123        123   34709908 1.5GB/s  jpg_200
BM_UFlat/4            6724       6714     623318 14.2GB/s  pdf
BM_UFlat/5          183122     182722      23138 2.1GB/s  html4
BM_UFlat/6          144981     144689      29384 1002.5MB/s  txt1
BM_UFlat/7          125939     125691      33423 949.8MB/s  txt2
BM_UFlat/8          383101     382241      10000 1064.7MB/s  txt3
BM_UFlat/9          527824     526606       7958 872.6MB/s  txt4
BM_UFlat/10          34849      34790     100000 3.2GB/s  pb
BM_UFlat/11         150213     149937      28131 1.1GB/s  gaviota
BM_UFlat/12          10850      10830     393231 2.1GB/s  cp
BM_UFlat/13           5532       5523     735739 1.9GB/s  c
BM_UFlat/14           1698       1695    2478035 2.0GB/s  lsp
BM_UFlat/15         678396     676917       6200 1.4GB/s  xls
BM_UFlat/16            155        155   26909789 1.2GB/s  xls_200
BM_UFlat/17         241235     240698      17416 2.0GB/s  bin
BM_UFlat/18            183        183   23000841 1043.5MB/s  bin_200
BM_UFlat/19          21461      21424     193275 1.7GB/s  sum
BM_UFlat/20           2232       2228    1887191 1.8GB/s  man
BM_UFlatSink/0       42272      42199      98528 2.3GB/s  html
BM_UFlatSink/1      460814     459898       9092 1.4GB/s  urls
BM_UFlatSink/2        5558       5547     768629 20.7GB/s  jpg
BM_UFlatSink/3         124        123   33629141 1.5GB/s  jpg_200
BM_UFlatSink/4        6634       6621     629989 14.4GB/s  pdf
BM_UFlatSink/5      182883     182491      23030 2.1GB/s  html4
BM_UFlatSink/6      143269     142964      29410 1014.5MB/s  txt1
BM_UFlatSink/7      127041     126809      33136 941.4MB/s  txt2
BM_UFlatSink/8      384367     383577      10000 1061.0MB/s  txt3
BM_UFlatSink/9      529979     528890       7898 868.9MB/s  txt4
BM_UFlatSink/10      41154      41075     100000 2.7GB/s  pb
BM_UFlatSink/11     146446     146155      28742 1.2GB/s  gaviota
BM_UFlatSink/12      11939      11918     352663 1.9GB/s  cp
BM_UFlatSink/13       5430       5421     770451 1.9GB/s  c
BM_UFlatSink/14       1665       1662    2538921 2.1GB/s  lsp
BM_UFlatSink/15     666840     665617       6309 1.4GB/s  xls
BM_UFlatSink/16        152        152   27639460 1.2GB/s  xls_200
BM_UFlatSink/17     240076     239573      17643 2.0GB/s  bin
BM_UFlatSink/18        183        182   23128210 1046.0MB/s  bin_200
BM_UFlatSink/19      22570      22528     185839 1.6GB/s  sum
BM_UFlatSink/20       2183       2180    1899526 1.8GB/s  man

SAMPLE PERFORMANCE with FDO on HASWELL (OLD)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           37041      36990     100000 2.6GB/s  html
BM_UFlat/1          471384     470574       8930 1.4GB/s  urls
BM_UFlat/2            5997       5986     722354 19.2GB/s  jpg
BM_UFlat/3             124        123   34964717 1.5GB/s  jpg_200
BM_UFlat/4            6850       6838     621414 13.9GB/s  pdf
BM_UFlat/5          182578     182271      23001 2.1GB/s  html4
BM_UFlat/6          148338     147989      28132 980.1MB/s  txt1
BM_UFlat/7          130682     130471      32347 915.0MB/s  txt2
BM_UFlat/8          397420     396553      10000 1026.3MB/s  txt3
BM_UFlat/9          550126     548872       7736 837.2MB/s  txt4
BM_UFlat/10          35013      34958     100000 3.2GB/s  pb
BM_UFlat/11         152270     151889      27508 1.1GB/s  gaviota
BM_UFlat/12          11117      11096     379059 2.1GB/s  cp
BM_UFlat/13           5812       5801     725240 1.8GB/s  c
BM_UFlat/14           1780       1777    2383982 2.0GB/s  lsp
BM_UFlat/15         707871     706139       5946 1.4GB/s  xls
BM_UFlat/16            157        157   26889747 1.2GB/s  xls_200
BM_UFlat/17         239160     238556      17512 2.0GB/s  bin
BM_UFlat/18            181        180   23326040 1057.5MB/s  bin_200
BM_UFlat/19          22706      22656     186285 1.6GB/s  sum
BM_UFlat/20           2319       2315    1813186 1.7GB/s  man
BM_UFlatSink/0       42657      42574      99000 2.2GB/s  html
BM_UFlatSink/1      466316     465262       9036 1.4GB/s  urls
BM_UFlatSink/2        6873       6859     648525 16.7GB/s  jpg
BM_UFlatSink/3         124        124   34434643 1.5GB/s  jpg_200
BM_UFlatSink/4        6804       6790     624282 14.0GB/s  pdf
BM_UFlatSink/5      185468     185062      22746 2.1GB/s  html4
BM_UFlatSink/6      148511     148209      28284 978.6MB/s  txt1
BM_UFlatSink/7      130865     130607      32144 914.0MB/s  txt2
BM_UFlatSink/8      393931     392983      10000 1035.6MB/s  txt3
BM_UFlatSink/9      545548     544275       7740 844.3MB/s  txt4
BM_UFlatSink/10      41659      41584     100000 2.7GB/s  pb
BM_UFlatSink/11     152062     151721      27854 1.1GB/s  gaviota
BM_UFlatSink/12      11987      11968     350909 1.9GB/s  cp
BM_UFlatSink/13       5652       5641     743280 1.8GB/s  c
BM_UFlatSink/14       1728       1725    2446140 2.0GB/s  lsp
BM_UFlatSink/15     687879     686231       6138 1.4GB/s  xls
BM_UFlatSink/16        155        155   27254484 1.2GB/s  xls_200
BM_UFlatSink/17     240689     240083      17450 2.0GB/s  bin
BM_UFlatSink/18        183        182   22932858 1046.8MB/s  bin_200
BM_UFlatSink/19      22718      22674     185207 1.6GB/s  sum
BM_UFlatSink/20       2272       2268    1851664 1.7GB/s  man
2017-01-26 21:38:36 +01:00
ckennelly e788e527d3 Avoid calling memset when resizing the buffer.
This buffer will be initialized and then trimmed down to size during the
compression phase.
2017-01-26 21:35:55 +01:00
Steinar H. Gunderson 32d6d7d8a2 Merge pull request #6 from deviance/provide-pkg-config-data
Provide pkg-config data
2016-05-23 11:16:01 +02:00
Peter Kasting 971613510f Add #ifdef to guard against macro redefinition if this is included in another
Google project that also defines this.
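
A guard of this kind usually looks like the sketch below. The macro name `EXAMPLE_ARRAYSIZE` is a placeholder, since the message does not say which macro was guarded:

```cpp
// Simulates an embedding project that already defined the macro.
#define EXAMPLE_ARRAYSIZE(a) (sizeof(a) / sizeof((a)[0]))

// Guarded definition: skipped when an earlier header already provided the
// macro, so the compiler never sees a conflicting redefinition.
#ifndef EXAMPLE_ARRAYSIZE
#define EXAMPLE_ARRAYSIZE(a) (sizeof(a) / sizeof(*(a)))
#endif

// Uses whichever definition was seen first.
inline int CountExample() {
  static const int kValues[] = {1, 2, 3, 4};
  return static_cast<int>(EXAMPLE_ARRAYSIZE(kValues));
}
```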
2016-05-20 11:35:25 +02:00
Steinar H. Gunderson 0000f997dd Merge pull request #13 from huachaohuang/patch-1
Allow to compile in nested packages.
2016-05-20 11:28:45 +02:00
Steinar H. Gunderson d53de18799 Make heuristic match skipping more aggressive.
This causes compression to be much faster on incompressible inputs
(such as the jpeg and pdf tests), and is neutral or even positive on the other
tests. The test set shows only microscopic density regressions; I attempted to
construct a worst-case test set containing ~1500 different cases of mixed
plaintext + /dev/urandom, and even those seemed to be only 0.38 percentage
points less dense on average (the single worst case was 87.8% -> 89.0%), which
we can live with given that this is already an edge case.

The original idea is by Klaus Post; I only tweaked the implementation.
Ironically, the new implementation is almost more in line with the
comment that was there, so I've left that largely alone, albeit
with a small modification.
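
The effect of a more aggressive skip can be illustrated with a small simulation. This is a sketch under an assumption the message does not spell out: that the step between hash lookups used to grow by one byte per 32 failed probes, and now grows by the current step after every failed probe. `CountProbes` is a hypothetical function, not snappy code:

```cpp
#include <cstddef>
#include <cstdint>

// Counts hash-table probes made while scanning `input_size` bytes in which
// no match is ever found (the incompressible case). Illustrative only.
//
// aggressive == false: the skip counter grows by 1 per failed probe, so the
//                      step grows by one byte every 32 probes.
// aggressive == true:  the skip counter grows by the current step, so the
//                      step grows roughly geometrically.
std::size_t CountProbes(std::size_t input_size, bool aggressive) {
  std::uint32_t skip = 32;
  std::size_t pos = 0;
  std::size_t probes = 0;
  while (pos < input_size) {
    const std::uint32_t step = skip >> 5;  // Bytes to advance before probing.
    skip += aggressive ? step : 1;
    pos += step;
    ++probes;
  }
  return probes;
}
```

On a 64 KiB run without matches, the aggressive variant needs far fewer probes than the linear one, while the first 32 bytes are scanned identically by both. That is consistent with the large gains on jpg and pdf and the near-neutral results on compressible inputs.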

Microbenchmark results (opt mode, 64-bit, static linking):

Ivy Bridge:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   120284    115480  847.0MB/s  html (22.31 %)        +4.2%
BM_ZFlat/1                  1527911   1522242  440.7MB/s  urls (47.78 %)        +0.4%
BM_ZFlat/2                    17591     10582  10.9GB/s  jpg (99.95 %)         +66.2%
BM_ZFlat/3                      323       322  593.3MB/s  jpg_200 (73.00 %)     +0.3%
BM_ZFlat/4                    53691     14063  6.8GB/s  pdf (83.30 %)         +281.8%
BM_ZFlat/5                   495442    492347  794.8MB/s  html4 (22.52 %)       +0.6%
BM_ZFlat/6                   473523    473622  306.7MB/s  txt1 (57.88 %)        -0.0%
BM_ZFlat/7                   421406    420120  284.5MB/s  txt2 (61.91 %)        +0.3%
BM_ZFlat/8                  1265632   1270538  320.8MB/s  txt3 (54.99 %)        -0.4%
BM_ZFlat/9                  1742688   1737894  264.8MB/s  txt4 (66.26 %)        +0.3%
BM_ZFlat/10                  107950    103404  1095.1MB/s  pb (19.68 %)         +4.4%
BM_ZFlat/11                  372660    371818  473.5MB/s  gaviota (37.72 %)     +0.2%
BM_ZFlat/12                   53239     49528  474.4MB/s  cp (48.12 %)          +7.5%
BM_ZFlat/13                   18940     17349  613.9MB/s  c (42.47 %)           +9.2%
BM_ZFlat/14                    5155      5075  700.3MB/s  lsp (48.37 %)         +1.6%
BM_ZFlat/15                 1474757   1474471  667.2MB/s  xls (41.23 %)         +0.0%
BM_ZFlat/16                     363       362  528.0MB/s  xls_200 (78.00 %)     +0.3%
BM_ZFlat/17                  453849    456931  1073.2MB/s  bin (18.11 %)        -0.7%
BM_ZFlat/18                      90        87  2.1GB/s  bin_200 (7.50 %)        +3.4%
BM_ZFlat/19                   82163     80498  453.7MB/s  sum (48.96 %)         +2.1%
BM_ZFlat/20                    7174      7124  566.7MB/s  man (59.21 %)         +0.7%
Sum of all benchmarks       8694831   8623857                                   +0.8%

Sandy Bridge:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   117426    112649  868.2MB/s  html (22.31 %)        +4.2%
BM_ZFlat/1                  1517095   1498522  447.5MB/s  urls (47.78 %)        +1.2%
BM_ZFlat/2                    18601     10649  10.8GB/s  jpg (99.95 %)         +74.7%
BM_ZFlat/3                      359       356  536.0MB/s  jpg_200 (73.00 %)     +0.8%
BM_ZFlat/4                    60249     13832  6.9GB/s  pdf (83.30 %)         +335.6%
BM_ZFlat/5                   481246    475571  822.7MB/s  html4 (22.52 %)       +1.2%
BM_ZFlat/6                   460541    455693  318.8MB/s  txt1 (57.88 %)        +1.1%
BM_ZFlat/7                   407751    404147  295.8MB/s  txt2 (61.91 %)        +0.9%
BM_ZFlat/8                  1228255   1222519  333.4MB/s  txt3 (54.99 %)        +0.5%
BM_ZFlat/9                  1678299   1666379  276.2MB/s  txt4 (66.26 %)        +0.7%
BM_ZFlat/10                  106499    101715  1113.4MB/s  pb (19.68 %)         +4.7%
BM_ZFlat/11                  361913    360222  488.7MB/s  gaviota (37.72 %)     +0.5%
BM_ZFlat/12                   53137     49618  473.6MB/s  cp (48.12 %)          +7.1%
BM_ZFlat/13                   18801     17812  597.8MB/s  c (42.47 %)           +5.6%
BM_ZFlat/14                    5394      5383  660.2MB/s  lsp (48.37 %)         +0.2%
BM_ZFlat/15                 1435411   1432870  686.4MB/s  xls (41.23 %)         +0.2%
BM_ZFlat/16                     389       395  483.3MB/s  xls_200 (78.00 %)     -1.5%
BM_ZFlat/17                  447255    445510  1100.4MB/s  bin (18.11 %)        +0.4%
BM_ZFlat/18                      86        86  2.2GB/s  bin_200 (7.50 %)        +0.0%
BM_ZFlat/19                   82555     79512  459.3MB/s  sum (48.96 %)         +3.8%
BM_ZFlat/20                    7527      7553  534.5MB/s  man (59.21 %)         -0.3%
Sum of all benchmarks       8488789   8360993                                   +1.5%

Haswell:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   107512    105621  925.6MB/s  html (22.31 %)        +1.8%
BM_ZFlat/1                  1344306   1332479  503.1MB/s  urls (47.78 %)        +0.9%
BM_ZFlat/2                    14752      9471  12.1GB/s  jpg (99.95 %)         +55.8%
BM_ZFlat/3                      287       275  694.0MB/s  jpg_200 (73.00 %)     +4.4%
BM_ZFlat/4                    48810     12263  7.8GB/s  pdf (83.30 %)         +298.0%
BM_ZFlat/5                   443013    442064  884.6MB/s  html4 (22.52 %)       +0.2%
BM_ZFlat/6                   429239    432124  336.0MB/s  txt1 (57.88 %)        -0.7%
BM_ZFlat/7                   381765    383681  311.5MB/s  txt2 (61.91 %)        -0.5%
BM_ZFlat/8                  1136667   1154304  353.0MB/s  txt3 (54.99 %)        -1.5%
BM_ZFlat/9                  1579925   1592431  288.9MB/s  txt4 (66.26 %)        -0.8%
BM_ZFlat/10                   98345     92411  1.2GB/s  pb (19.68 %)            +6.4%
BM_ZFlat/11                  340397    340466  516.8MB/s  gaviota (37.72 %)     -0.0%
BM_ZFlat/12                   47076     43536  539.5MB/s  cp (48.12 %)          +8.1%
BM_ZFlat/13                   16680     15637  680.8MB/s  c (42.47 %)           +6.7%
BM_ZFlat/14                    4616      4539  782.6MB/s  lsp (48.37 %)         +1.7%
BM_ZFlat/15                 1331231   1334094  736.9MB/s  xls (41.23 %)         -0.2%
BM_ZFlat/16                     326       322  593.5MB/s  xls_200 (78.00 %)     +1.2%
BM_ZFlat/17                  404383    400326  1.2GB/s  bin (18.11 %)           +1.0%
BM_ZFlat/18                      69        69  2.7GB/s  bin_200 (7.50 %)        +0.0%
BM_ZFlat/19                   74771     71348  511.7MB/s  sum (48.96 %)         +4.8%
BM_ZFlat/20                    6461      6383  632.2MB/s  man (59.21 %)         +1.2%
Sum of all benchmarks       7810631   7773844                                   +0.5%
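
The Improvement column in the three tables is consistent with base time divided by new time, minus one. A minimal sketch of that reading; `ImprovementPercent` is a hypothetical helper:

```cpp
// Hypothetical helper: relative speedup in percent, computed from the base
// and new running times in nanoseconds. Positive means the new code is
// faster; +300% means it takes a quarter of the time.
double ImprovementPercent(double base_ns, double new_ns) {
  return (base_ns / new_ns - 1.0) * 100.0;
}
```

For the Haswell rows, 14752 ns to 9471 ns gives about +55.8% (jpg), 48810 ns to 12263 ns gives about +298.0% (pdf), and the summed times 7810631 ns to 7773844 ns give about +0.5%.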

I've done a quick test that there are no performance regressions on external
GCC (4.9.2, Debian, Haswell, 64-bit), too.
2016-04-05 11:50:26 +02:00
Steinar H. Gunderson 2b9152d9c5 Default to glibtoolize instead of libtoolize if it exists,
and also make it customizable through the environment variable
$LIBTOOLIZE.

Fixes autogen.sh issues on OS X, which ships its own
(incompatible) libtoolize.

R=jeff
2016-03-10 18:37:05 +01:00
Steinar H. Gunderson 0800b1e4c7 Work around an issue where some compilers interpret the <: in <:: as a digraph.
Also correct the namespace name.
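
The sequence at issue is `<:`, which C++ accepts as an alternative spelling of `[`. Compilers that predate the C++11 tokenization rule can therefore misread a template argument that starts with `::`. A sketch of the usual workaround, using a hypothetical namespace and type:

```cpp
#include <cstddef>
#include <vector>

namespace example {  // Hypothetical namespace, for illustration only.
struct Item {
  int id;
};
}  // namespace example

// Writing the argument list without a space after '<' puts "<:" directly
// before ":example", which older compilers may tokenize as "[".
// A space after '<' keeps the tokens apart on every compiler.
inline std::size_t CountItems(const std::vector< ::example::Item>& items) {
  return items.size();
}
```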
2016-01-08 15:05:44 +01:00
Steinar H. Gunderson e7d2818d1e Unbreak the open-source build for ARM due to missing ATTRIBUTE_PACKED
declaration.
2016-01-08 11:40:06 +01:00
Steinar H. Gunderson 7525a1600d Fix an issue where the ByteSource path (used for parsing std::string)
would incorrectly accept some invalid varints that the other path would not,
causing potential CHECK-failures if the unit test were run with
--write_uncompressed and a corrupted input file.

Found by the afl fuzzer.
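
The class of bug described here is a varint decoder that accepts encodings carrying more than 32 bits. The following is an illustrative decoder with the overflow check in place; it is not the snappy implementation, and `ParseVarint32` is a hypothetical name:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative varint32 decoder. Returns true and stores the value in
// *result on success. Returns false if the input is truncated, needs more
// than five bytes, or encodes more than 32 bits.
bool ParseVarint32(const unsigned char* data, std::size_t size,
                   std::uint32_t* result) {
  std::uint32_t value = 0;
  for (std::size_t i = 0; i < size && i < 5; ++i) {
    const std::uint32_t byte = data[i];
    const std::uint32_t payload = byte & 0x7f;
    const unsigned shift = 7 * static_cast<unsigned>(i);
    // Reject payload bits that would be shifted out of a 32-bit value.
    // In practice this only triggers on the fifth byte (shift == 28).
    if (((payload << shift) >> shift) != payload) return false;
    value |= payload << shift;
    if ((byte & 0x80) == 0) {  // No continuation bit: last byte.
      *result = value;
      return true;
    }
  }
  return false;  // Truncated input, or continuation bit on the fifth byte.
}
```

With this check, the five-byte input `ff ff ff ff 0f` decodes to 0xffffffff, while `ff ff ff ff 1f` is rejected because its last byte carries a bit that does not fit.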
2016-01-04 12:52:15 +01:00
Steinar H. Gunderson ef5598aa0e Make UNALIGNED_LOAD16/32 on ARMv7 go through an explicitly unaligned struct,
to avoid the compiler coalescing multiple loads into a single load instruction
(which only works for aligned accesses).

A typical example where GCC would coalesce:

  uint8* p = ...;
  uint32 a = UNALIGNED_LOAD32(p);
  uint32 b = UNALIGNED_LOAD32(p + 4);
  uint32 c = a | b;
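
One way to express "an explicitly unaligned struct" is a packed wrapper type. The sketch below assumes GCC/Clang attribute syntax; the type and function names are hypothetical and do not reproduce the actual macros:

```cpp
#include <cstdint>
#include <cstring>

// A packed struct has alignment 1, so the compiler may not assume that the
// address is 4-byte aligned when it reads `value`. may_alias allows the
// struct to be used to read bytes that belong to another object.
struct __attribute__((__packed__, __may_alias__)) Unaligned32 {
  std::uint32_t value;
};

inline std::uint32_t UnalignedLoad32(const void* p) {
  return static_cast<const Unaligned32*>(p)->value;
}

// Portable alternative: copy the bytes into a local variable.
inline std::uint32_t UnalignedLoad32Portable(const void* p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

Both functions return the same value for any address. They differ only in how the alignment requirement is communicated to the compiler.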
2016-01-04 12:51:31 +01:00
Huachao Huang b8cd908a86 Allow to compile in nested packages. 2015-10-28 17:18:23 +08:00
Steinar H. Gunderson 96a2e340f3 Update URLs in the Snappy README to reflect the move to GitHub.
A=sesse
R=sanjay
2015-08-26 17:50:48 +02:00
Steinar H. Gunderson 0852af7606 Move the logic from ComputeTable into the unit test, which means it's run
automatically together with the other tests, and also removes the stray
function ComputeTable() (which was never referenced by anything else
in the open-source version, causing compiler warnings for some)
out of the core library.

Fixes public issue 96.

A=sesse
R=sanjay
2015-08-19 11:37:51 +02:00
Steinar H. Gunderson d80342922c Fix signed-vs.-unsigned comparison warnings.
These were found by compiling Chromium's external copy of this code with MSVC
with warning C4018 enabled.
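
The pattern behind this warning, and one common fix, can be sketched as follows. `FitsWithin` is a hypothetical function, not code from the change:

```cpp
#include <cstddef>
#include <vector>

// Comparing `limit` (signed) with v.size() (unsigned) directly would convert
// a negative `limit` to a very large unsigned value before the comparison.
// Handling the negative case first and then casting makes the comparison
// well defined and avoids the signed/unsigned mismatch warning.
inline bool FitsWithin(const std::vector<char>& v, int limit) {
  if (limit < 0) return false;
  return v.size() <= static_cast<std::size_t>(limit);
}
```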

A=pkasting
R=sanjay
2015-08-03 13:17:04 +02:00
Aleksandr Makarov d2cb73b6ac Provide pkg-config data 2015-07-31 16:25:17 +03:00
Steinar H. Gunderson efb39e81b8 Release Snappy 1.1.3; getting the new Uncompress variant in a release is nice,
and it's also good to finally get an official release out after the migration
to GitHub.

The GitHub releases are basically done by tagging a commit and then uploading
the .tar.gz file generated by make dist as a binary asset; GitHub will add
all files on the tagged commit on top of the tarball and recompress, but since
we don't have any nodist_* files in configure.ac, this works fine for us.
(As far as I can see, this behavior of GitHub--uncompressing the .tar.gz,
and the behavior of silently ignoring files in it that are also in the git
repository--is undocumented, but also seems to be used in some official
screenshots, so I guess we can rely on it.)

A=sesse
R=jeff
2015-07-07 10:45:04 +02:00
Steinar H. Gunderson eb66d8176b Initialized members of SnappyArrayWriter and SnappyDecompressionValidator.
These members were almost surely initialized before use by other member
functions, but Coverity was warning about this. Eliminating these warnings
minimizes clutter in that report and the likelihood of overlooking a real bug.
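
A sketch of the kind of change that silences such a report: every member receives a value in the constructor's initializer list, even one that a later call always overwrites. The class and member names below are illustrative and are not taken from snappy:

```cpp
#include <cstddef>

// Hypothetical writer class, for illustration only.
class ArrayWriterExample {
 public:
  // All three members are initialized here, including limit_, which
  // SetExpectedLength() assigns again before it is meaningfully used.
  explicit ArrayWriterExample(char* dst)
      : base_(dst), op_(dst), limit_(dst) {}

  void SetExpectedLength(std::size_t len) { limit_ = base_ + len; }

  std::size_t Capacity() const {
    return static_cast<std::size_t>(limit_ - base_);
  }
  std::size_t Produced() const {
    return static_cast<std::size_t>(op_ - base_);
  }

 private:
  char* base_;   // Start of the output buffer.
  char* op_;     // Next write position.
  char* limit_;  // One past the last writable byte.
};
```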

A=cmumford
R=jeff
2015-07-06 14:21:16 +02:00