costan
26102a0c66
Fix generated version number in open source release.
...
Lands GitHub PR #61 . The patch was also independently contributed by
Martin Gieseking <martin.gieseking@uos.de>.
2017-12-20 14:32:54 -08:00
costan
b02bfa754e
Tag open source release 1.1.7.
2017-08-24 16:54:23 -07:00
wmi
824e6718b5
Add a loop alignment directive to work around a performance regression.
...
We found that the upstream LLVM change rL310792 degraded the zippy benchmark
by ~3%. Performance analysis showed that the regression was caused by a side
effect: an incidental loop alignment change (from 32 bytes to 16 bytes)
increased branch mispredictions. The regression was reproducible on several
Intel microarchitectures, such as Sandy Bridge, Haswell and Skylake. Sadly, we
still don't have a good understanding of the internals of the Intel branch
predictor and cannot explain how the branch mispredictions increase when the
loop alignment changes, so we cannot make a real fix here. The workaround in
this patch is to add a directive that aligns the hot loop to 32 bytes, which
restores the performance. This is in order to unblock the flip of the default
compiler to LLVM.
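As a rough illustration of what such an alignment directive can look like, here
is a minimal sketch for GCC/Clang on x86. SumBytes is a hypothetical helper and
is not the loop touched by this patch; the only point is the placement of the
.p2align directive right before a hot loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not snappy code): sums a byte buffer in a hot loop.
inline uint64_t SumBytes(const uint8_t* data, size_t n) {
  assert(data != nullptr || n == 0);
  uint64_t sum = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  // Ask the assembler to pad to a 32-byte boundary (2^5) at this point.
  // The compiler may still move or restructure the loop, so this is a hint
  // to confirm in the disassembly, not a guarantee.
  asm volatile(".p2align 5");
#endif
  for (size_t i = 0; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}
```

Whether the padding helps depends on the compiler and the microarchitecture, so
any such directive should be checked against benchmarks.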
2017-08-24 16:54:12 -07:00
costan
55924d1109
Add GNUInstallDirs to CMake configuration.
...
This is modeled after https://github.com/google/googletest/pull/1160 .
The immediate benefit is fixing the library install paths on 64-bit
Linux distributions, which tend to support running 32-bit and 64-bit
code side by side by installing 32-bit libraries in /usr/lib and 64-bit
libraries in /usr/lib64.
2017-08-16 19:19:31 -07:00
costan
632cd0f128
Use 64-bit optimized code path for ARM64.
...
This is inspired by https://github.com/google/snappy/pull/22 .
Benchmark results with the change, Pixel C with Android N2G48B
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0 119544 119253 1501 818.9MB/s html
BM_UFlat/1 1223950 1208588 163 554.0MB/s urls
BM_UFlat/2 16081 15962 11527 7.2GB/s jpg
BM_UFlat/3 356 352 416666 540.6MB/s jpg_200
BM_UFlat/4 25010 24860 7683 3.8GB/s pdf
BM_UFlat/5 484832 481572 407 811.1MB/s html4
BM_UFlat/6 408410 408713 482 354.9MB/s txt1
BM_UFlat/7 361714 361663 553 330.1MB/s txt2
BM_UFlat/8 1090582 1087912 182 374.1MB/s txt3
BM_UFlat/9 1503127 1503759 133 305.6MB/s txt4
BM_UFlat/10 114183 114285 1715 989.6MB/s pb
BM_UFlat/11 406714 407331 491 431.5MB/s gaviota
BM_UIOVec/0 370397 369888 538 264.0MB/s html
BM_UIOVec/1 3207510 3190000 100 209.9MB/s urls
BM_UIOVec/2 16589 16573 11223 6.9GB/s jpg
BM_UIOVec/3 1052 1052 165289 181.2MB/s jpg_200
BM_UIOVec/4 49151 49184 3985 1.9GB/s pdf
BM_UValidate/0 68115 68095 2893 1.4GB/s html
BM_UValidate/1 792652 792000 250 845.4MB/s urls
BM_UValidate/2 334 334 487804 343.1GB/s jpg
BM_UValidate/3 235 235 666666 809.9MB/s jpg_200
BM_UValidate/4 6126 6130 32626 15.6GB/s pdf
BM_ZFlat/0 292697 290560 678 336.1MB/s html (22.31 %)
BM_ZFlat/1 4062080 4050000 100 165.3MB/s urls (47.78 %)
BM_ZFlat/2 29225 29274 6422 3.9GB/s jpg (99.95 %)
BM_ZFlat/3 1099 1098 163934 173.7MB/s jpg_200 (73.00 %)
BM_ZFlat/4 44117 44233 4205 2.2GB/s pdf (83.30 %)
BM_ZFlat/5 1158058 1157894 171 337.4MB/s html4 (22.52 %)
BM_ZFlat/6 1102983 1093922 181 132.6MB/s txt1 (57.88 %)
BM_ZFlat/7 974142 975490 204 122.4MB/s txt2 (61.91 %)
BM_ZFlat/8 2984670 2990000 100 136.1MB/s txt3 (54.99 %)
BM_ZFlat/9 4100130 4090000 100 112.4MB/s txt4 (66.26 %)
BM_ZFlat/10 276236 275139 716 411.0MB/s pb (19.68 %)
BM_ZFlat/11 760091 759541 262 231.4MB/s gaviota (37.72 %)
Baseline benchmark results, Pixel C with Android N2G48B
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0 148957 147565 1335 661.8MB/s html
BM_UFlat/1 1527257 1500000 132 446.4MB/s urls
BM_UFlat/2 19589 19397 8764 5.9GB/s jpg
BM_UFlat/3 425 418 408163 455.3MB/s jpg_200
BM_UFlat/4 30096 29552 6497 3.2GB/s pdf
BM_UFlat/5 595933 594594 333 657.0MB/s html4
BM_UFlat/6 516315 514360 383 282.0MB/s txt1
BM_UFlat/7 454653 453514 441 263.2MB/s txt2
BM_UFlat/8 1382687 1361111 144 299.0MB/s txt3
BM_UFlat/9 1967590 1904761 105 241.3MB/s txt4
BM_UFlat/10 148271 144560 1342 782.3MB/s pb
BM_UFlat/11 523997 510471 382 344.4MB/s gaviota
BM_UIOVec/0 478443 465227 417 209.9MB/s html
BM_UIOVec/1 4172860 4060000 100 164.9MB/s urls
BM_UIOVec/2 21470 20975 7342 5.5GB/s jpg
BM_UIOVec/3 1357 1330 75187 143.4MB/s jpg_200
BM_UIOVec/4 63143 61365 3031 1.6GB/s pdf
BM_UValidate/0 86910 85125 2279 1.1GB/s html
BM_UValidate/1 1022256 1000000 195 669.6MB/s urls
BM_UValidate/2 420 417 400000 274.6GB/s jpg
BM_UValidate/3 311 302 571428 630.0MB/s jpg_200
BM_UValidate/4 7778 7584 25445 12.6GB/s pdf
BM_ZFlat/0 469209 457547 424 213.4MB/s html (22.31 %)
BM_ZFlat/1 5633510 5460000 100 122.6MB/s urls (47.78 %)
BM_ZFlat/2 37896 36693 4524 3.1GB/s jpg (99.95 %)
BM_ZFlat/3 1485 1441 123456 132.3MB/s jpg_200 (73.00 %)
BM_ZFlat/4 74870 72775 2652 1.3GB/s pdf (83.30 %)
BM_ZFlat/5 1857321 1785714 112 218.8MB/s html4 (22.52 %)
BM_ZFlat/6 1538723 1492307 130 97.2MB/s txt1 (57.88 %)
BM_ZFlat/7 1338236 1310810 148 91.1MB/s txt2 (61.91 %)
BM_ZFlat/8 4050820 4040000 100 100.7MB/s txt3 (54.99 %)
BM_ZFlat/9 5234940 5230000 100 87.9MB/s txt4 (66.26 %)
BM_ZFlat/10 400309 400000 495 282.7MB/s pb (19.68 %)
BM_ZFlat/11 1063042 1058510 188 166.1MB/s gaviota (37.72 %)
2017-08-16 19:18:22 -07:00
costan
77c12adc19
Add unistd.h checks back to the CMake build.
...
getpagesize(), as well as its POSIX.1-2001 replacement
sysconf(_SC_PAGESIZE), is defined in <unistd.h>. On Linux and OS X,
including <sys/mman.h> is sufficient to get a definition for
getpagesize(). However, this is not true for the Android NDK. This CL
brings back the HAVE_UNISTD_H definition and its associated header
check.
This also adds a HAVE_FUNC_SYSCONF definition, which checks for the
presence of sysconf(). The definition can be used later to replace
getpagesize() with sysconf().
2017-08-02 10:56:06 -07:00
costan
c8049c5827
Replace getpagesize() with sysconf(_SC_PAGESIZE).
...
getpagesize() has been removed from POSIX.1-2001. Its recommended
replacement is sysconf(_SC_PAGESIZE).
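A minimal sketch of the replacement on a POSIX system follows. PageSize is a
hypothetical wrapper name used only for this illustration; the assertion
reflects the assumption that the query does not fail.

```cpp
#include <unistd.h>

#include <cassert>
#include <cstddef>

// Hypothetical wrapper: returns the memory page size in bytes.
inline size_t PageSize() {
  // sysconf() returns -1 on error; this sketch assumes that
  // _SC_PAGESIZE is always available.
  const long result = sysconf(_SC_PAGESIZE);
  assert(result > 0);
  return static_cast<size_t>(result);
}
```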
2017-08-01 14:38:57 -07:00
costan
18e2f220d8
Add guidelines for opensource contributions.
...
The guidelines follow the instructions at
https://opensource.google.com/docs/releasing/preparing/#CONTRIBUTING
2017-08-01 14:38:24 -07:00
costan
f0d3237c32
Use _BitScanForward and _BitScanReverse on MSVC.
...
Based on https://github.com/google/snappy/pull/30
2017-08-01 14:38:02 -07:00
jueminyang
71b8f86887
Add SNAPPY_ prefix to PREDICT_{TRUE,FALSE} macros.
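A sketch of how such prefixed branch-hint macros are commonly defined and used.
The definitions below are an assumption based on the usual __builtin_expect
idiom, and CountNonZero is a hypothetical caller.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__)
#define SNAPPY_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0))
#define SNAPPY_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
#else
// Without the builtin, the macros only pass the condition through.
#define SNAPPY_PREDICT_FALSE(x) (x)
#define SNAPPY_PREDICT_TRUE(x) (x)
#endif

// Hypothetical caller: hints that most bytes are expected to be non-zero.
inline size_t CountNonZero(const unsigned char* data, size_t n) {
  assert(data != nullptr || n == 0);
  size_t count = 0;
  for (size_t i = 0; i < n; ++i) {
    if (SNAPPY_PREDICT_TRUE(data[i] != 0)) {
      ++count;
    }
  }
  return count;
}
```

The prefix keeps the macro names from colliding with identically named macros
in code that includes the headers.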
2017-08-01 14:36:26 -07:00
costan
be6dc3db83
Redo CMake configuration.
...
The style was changed to match the official manual [1], the install
configuration was simplified and now matches the official packaging
guide [2], and the config files use the CMake-specific variable syntax
${VAR} instead of the autoconf-compatible syntax @VAR@, as documented in
[3]. The public header files are declared as such (for CMake 3.3+), and
the generated headers are included in the library target definition.
The tests are only built if SNAPPY_BUILD_TESTS (default ON) is true, so
zippy can be easily used in projects that add_subdirectory() its source
code directly, instead of using find_package().
[1] https://cmake.org/cmake/help/git-master/manual/cmake-language.7.html
[2] https://cmake.org/cmake/help/git-master/manual/cmake-packages.7.html
[3] https://cmake.org/cmake/help/git-master/command/configure_file.html
2017-07-28 10:14:21 -07:00
costan
e4de6ce087
Small improvements to open source CI configuration.
...
This CL fixes 64-bit Windows testing, makes it possible to view the
test output in the Travis / AppVeyor CI console while the test is
running, and takes advantage of the new support for the .appveyor.yml
file name to make the CI configuration less obtrusive.
2017-07-27 16:46:54 -07:00
costan
c756f7f5d9
Support both static and shared library CMake builds.
...
This can be used to fix https://github.com/Homebrew/homebrew-core/issues/15722 .
2017-07-27 16:46:54 -07:00
costan
038a3329b1
Inline DISALLOW_COPY_AND_ASSIGN.
...
snappy-stubs-public.h defined the DISALLOW_COPY_AND_ASSIGN macro, so the
definition propagated to all translation units that included the open
source headers. The macro is now inlined, thus avoiding polluting the
macro environment of snappy users.
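Inlining the macro amounts to writing the copy operations out in each class.
The sketch below uses a hypothetical class name and the C++11 "= delete"
spelling; a pre-C++11 code base would instead declare both members private and
leave them undefined.

```cpp
#include <type_traits>

// Hypothetical class, used only to show the inlined form.
class Writer {
 public:
  Writer() = default;

  // Written out in place of DISALLOW_COPY_AND_ASSIGN(Writer).
  Writer(const Writer&) = delete;
  Writer& operator=(const Writer&) = delete;
};

static_assert(!std::is_copy_constructible<Writer>::value,
              "Writer must not be copy-constructible");
```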
2017-07-27 16:46:42 -07:00
costan
a8b239c3de
snappy: Remove autoconf build configuration.
2017-07-25 18:20:38 -07:00
costan
27671c6aec
Clean up CMake header and type checks.
...
Unused macros: HAVE_DLFCN_H, HAVE_INTTYPES_H, HAVE_MEMORY_H,
HAVE_STDLIB_H, HAVE_STRINGS_H, HAVE_STRING_H, HAVE_SYS_BYTESWAP_H,
HAVE_SYS_STAT_H, HAVE_SYS_TYPES_H, HAVE_UNISTD_H.
Used but never set macros: HAVE_LIBLZF, HAVE_LIBQUICKLZ. These only gate
conditional includes. The code that takes advantage of them was removed.
Unused types: ssize_t.
The testing code uses HAVE_FUNC_MMAP, which was not wired in the CMake
build, causing a whole test to be skipped.
2017-07-25 18:17:35 -07:00
costan
548501c988
zippy: Re-release snappy 1.1.5 as 1.1.6.
...
The migration from autotools to CMake in 1.1.5 wasn't as smooth as
intended. The SONAME / SOVERSION were broken in both build systems,
causing breakages in systems that upgraded from snappy 1.1.4 to 1.1.5,
as reported in https://github.com/Homebrew/homebrew-core/issues/15274
and https://github.com/google/snappy/pull/45 .
2017-07-13 03:56:49 -07:00
costan
513df5fb5a
Tag open source release 1.1.5.
2017-06-28 18:37:30 -07:00
costan
5bc9c82ae3
Set minimum CMake version to 3.1.
...
The project only needs CMake 3.1 features, and some Travis CI bots have
CMake 3.2.2. Therefore, requiring CMake 3.4 is inconvenient.
2017-06-28 18:37:08 -07:00
costan
e9720a001d
Update Travis CI config, add AppVeyor for Windows CI coverage.
2017-06-28 18:36:37 -07:00
tmsriram
f24f9d2d97
Explicitly copy internal::wordmask to the stack array to work around a compiler
...
optimization with LLVM that converts const stack arrays to global arrays. This
is a temporary change and should be reverted when https://reviews.llvm.org/D30759
is fixed.
With PIE, accessing stack arrays is more efficient than global arrays and
wordmask was moved to the stack due to that. However, the LLVM compiler
automatically converts stack arrays, detected as constant, to global arrays
and this transformation hurts PIE performance with LLVM.
We are working to fix this in the LLVM compiler, via
https://reviews.llvm.org/D30759 , to not do this conversion in PIE mode. Until
this patch is finished, please consider this source change as a temporary
workaround to keep this array on the stack. This source change is important
to allow some projects to flip the default compiler from GCC to LLVM for
optimized builds.
This change works for the following reason. The LLVM compiler does not convert
non-const stack arrays to global arrays and explicitly copying the elements is
enough to make the compiler assume that this is a non-const array.
With GCC, this change does not affect code-gen in any significant way. The
array initialization code is slightly different as it copies the constants
directly to the stack.
With LLVM, this keeps the array on the stack.
No change in performance with GCC (within noise range). With LLVM, ~0.7%
improvement in optimized mode (no FDO) and ~1.75% improvement in FDO
mode.
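A minimal sketch of the explicit copy described above. KeepLowBytes is a
hypothetical function and the mask values are an assumption; whether the local
array actually stays on the stack depends on the compiler and its version.

```cpp
#include <cassert>
#include <cstdint>

namespace internal {
// Assumed contents: masks that keep the lowest 0..4 bytes of a 32-bit value.
static const uint32_t wordmask[] = {0u, 0xffu, 0xffffu, 0xffffffu,
                                    0xffffffffu};
}  // namespace internal

// Hypothetical function: keeps the lowest num_bytes bytes of value.
inline uint32_t KeepLowBytes(uint32_t value, int num_bytes) {
  assert(num_bytes >= 0 && num_bytes <= 4);
  // Element-by-element copy into a non-const local array, so the compiler
  // does not see a constant stack array it could turn into a global.
  uint32_t wordmask[sizeof(internal::wordmask) / sizeof(uint32_t)];
  wordmask[0] = internal::wordmask[0];
  wordmask[1] = internal::wordmask[1];
  wordmask[2] = internal::wordmask[2];
  wordmask[3] = internal::wordmask[3];
  wordmask[4] = internal::wordmask[4];
  return value & wordmask[num_bytes];
}
```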
2017-06-28 18:34:54 -07:00
ysaed
82deffcde7
Remove benchmarking support for fastlz.
2017-06-28 18:33:55 -07:00
alkis
18488d6212
Use 64 bit little endian on ppc64le.
...
This has tangible performance benefits.
This lands https://github.com/google/snappy/pull/27
2017-06-28 18:33:13 -07:00
alkis
7b9532b878
Improve the SSE2 macro check on Windows.
...
This lands https://github.com/google/snappy/pull/37
2017-06-05 13:54:17 -07:00
alkis
7dadceea52
Check for the existence of sys/uio.h in autoconf build.
...
This lands https://github.com/google/snappy/pull/32
2017-06-05 13:54:17 -07:00
jyrki
83179dd8be
Remove quicklz and lzf support in benchmarks.
2017-06-05 13:54:10 -07:00
vrabaud
c8131680d0
Provide a CMakeLists.txt.
...
This lands https://github.com/google/snappy/pull/29
2017-06-05 13:53:29 -07:00
costan
ed3b7b242b
Clean up unused function warnings in snappy.
2017-03-17 13:59:03 -07:00
costan
8b60aac4fd
Remove "using namespace std;" from zippy-stubs-internal.h.
...
This makes it easier to build zippy, as some compilers require a warning
suppression to accept "using namespace std".
2017-03-13 13:03:01 -07:00
costan
7d7a8ec805
Add Travis CI configuration to snappy and fix the make build.
...
The make build in the open source version uses autoconf, which is set up
to expect a project that follows the GNU standard.
2017-03-10 12:40:15 -08:00
alkis
1cd3ab02e9
Rename README to README.md. It is already in Markdown, so we might as well let GitHub know so that it renders nicely.
2017-03-08 12:05:05 -08:00
alkis
597fa795de
Delete UnalignedCopy64 from snappy-stubs since the version in snappy.cc is more robust and possibly faster (assuming the compiler knows how to best copy 8 bytes between locations in memory the fastest way possible - a rather safe bet).
2017-03-08 11:42:30 -08:00
scrubbed
039b3a7ace
Add std:: prefix to STL non-type names.
...
In order to disable global using declarations, this CL qualifies
stl names with the std namespace.
2017-03-08 11:42:30 -08:00
alkis
3c706d2230
Make UnalignedCopy64 not exhibit undefined behavior when src and dst overlap.
...
name old speed new speed delta
BM_UFlat/0 3.09GB/s ± 3% 3.07GB/s ± 2% -0.78% (p=0.009 n=19+19)
BM_UFlat/1 1.63GB/s ± 2% 1.62GB/s ± 2% ~ (p=0.099 n=19+20)
BM_UFlat/2 19.7GB/s ±19% 20.7GB/s ±11% ~ (p=0.054 n=20+19)
BM_UFlat/3 1.61GB/s ± 2% 1.60GB/s ± 1% -0.48% (p=0.049 n=20+17)
BM_UFlat/4 15.8GB/s ± 7% 15.6GB/s ±10% ~ (p=0.234 n=20+20)
BM_UFlat/5 2.47GB/s ± 1% 2.46GB/s ± 2% ~ (p=0.608 n=19+19)
BM_UFlat/6 1.07GB/s ± 2% 1.07GB/s ± 1% ~ (p=0.128 n=20+19)
BM_UFlat/7 1.01GB/s ± 1% 1.00GB/s ± 2% ~ (p=0.656 n=15+19)
BM_UFlat/8 1.13GB/s ± 1% 1.13GB/s ± 1% ~ (p=0.532 n=18+19)
BM_UFlat/9 918MB/s ± 1% 916MB/s ± 1% ~ (p=0.443 n=19+18)
BM_UFlat/10 3.90GB/s ± 1% 3.90GB/s ± 1% ~ (p=0.895 n=20+19)
BM_UFlat/11 1.30GB/s ± 1% 1.29GB/s ± 2% ~ (p=0.156 n=19+19)
BM_UFlat/12 2.35GB/s ± 2% 2.34GB/s ± 1% ~ (p=0.349 n=19+17)
BM_UFlat/13 2.07GB/s ± 1% 2.06GB/s ± 2% ~ (p=0.475 n=18+19)
BM_UFlat/14 2.23GB/s ± 1% 2.23GB/s ± 1% ~ (p=0.983 n=19+19)
BM_UFlat/15 1.55GB/s ± 1% 1.55GB/s ± 1% ~ (p=0.314 n=19+19)
BM_UFlat/16 1.26GB/s ± 1% 1.26GB/s ± 1% ~ (p=0.907 n=15+18)
BM_UFlat/17 2.32GB/s ± 1% 2.32GB/s ± 1% ~ (p=0.604 n=18+19)
BM_UFlat/18 1.61GB/s ± 1% 1.61GB/s ± 1% ~ (p=0.212 n=18+19)
BM_UFlat/19 1.78GB/s ± 1% 1.78GB/s ± 2% ~ (p=0.350 n=19+19)
BM_UFlat/20 1.89GB/s ± 1% 1.90GB/s ± 2% ~ (p=0.092 n=19+19)
Also tested the current version against UNALIGNED_STORE64(dst, UNALIGNED_LOAD64(src)), there is no difference (old is memcpy, new is UNALIGNED*):
name old speed new speed delta
BM_UFlat/0 3.14GB/s ± 1% 3.16GB/s ± 2% ~ (p=0.156 n=19+19)
BM_UFlat/1 1.62GB/s ± 1% 1.61GB/s ± 2% ~ (p=0.102 n=19+20)
BM_UFlat/2 18.8GB/s ±17% 19.1GB/s ±11% ~ (p=0.390 n=20+16)
BM_UFlat/3 1.59GB/s ± 1% 1.58GB/s ± 1% -1.06% (p=0.000 n=18+18)
BM_UFlat/4 15.8GB/s ± 6% 15.6GB/s ± 7% ~ (p=0.184 n=19+20)
BM_UFlat/5 2.46GB/s ± 1% 2.44GB/s ± 1% -0.95% (p=0.000 n=19+18)
BM_UFlat/6 1.08GB/s ± 1% 1.06GB/s ± 1% -1.17% (p=0.000 n=19+18)
BM_UFlat/7 1.00GB/s ± 1% 0.99GB/s ± 1% -1.16% (p=0.000 n=19+18)
BM_UFlat/8 1.14GB/s ± 2% 1.12GB/s ± 1% -1.12% (p=0.000 n=19+18)
BM_UFlat/9 921MB/s ± 1% 914MB/s ± 1% -0.84% (p=0.000 n=20+17)
BM_UFlat/10 3.94GB/s ± 2% 3.92GB/s ± 1% ~ (p=0.058 n=19+17)
BM_UFlat/11 1.29GB/s ± 1% 1.28GB/s ± 1% -0.77% (p=0.001 n=19+17)
BM_UFlat/12 2.34GB/s ± 1% 2.31GB/s ± 1% -1.10% (p=0.000 n=18+18)
BM_UFlat/13 2.06GB/s ± 1% 2.05GB/s ± 1% -0.73% (p=0.001 n=19+18)
BM_UFlat/14 2.22GB/s ± 1% 2.20GB/s ± 1% -0.73% (p=0.000 n=18+18)
BM_UFlat/15 1.55GB/s ± 1% 1.53GB/s ± 1% -1.07% (p=0.000 n=19+18)
BM_UFlat/16 1.26GB/s ± 1% 1.25GB/s ± 1% -0.79% (p=0.000 n=18+18)
BM_UFlat/17 2.31GB/s ± 1% 2.29GB/s ± 1% -0.98% (p=0.000 n=20+18)
BM_UFlat/18 1.61GB/s ± 1% 1.60GB/s ± 2% -0.71% (p=0.001 n=20+19)
BM_UFlat/19 1.77GB/s ± 1% 1.76GB/s ± 1% -0.61% (p=0.007 n=19+18)
BM_UFlat/20 1.89GB/s ± 1% 1.88GB/s ± 1% -0.75% (p=0.000 n=20+18)
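One way to make an 8-byte copy well defined when the ranges overlap is to go
through a temporary buffer, as in the sketch below. This illustrates the
memcpy-based variant that the benchmark note refers to and is not necessarily
the exact code in the patch.

```cpp
#include <cassert>
#include <cstring>

// Copies 8 bytes from src to dst; the two ranges may overlap.
inline void UnalignedCopy64(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
  // memcpy() between overlapping ranges is undefined behavior, so the
  // bytes are staged in a local buffer first. Compilers typically turn
  // this into a single 8-byte load followed by a single 8-byte store.
  char tmp[8];
  std::memcpy(tmp, src, 8);
  std::memcpy(dst, tmp, 8);
}
```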
2017-03-08 11:42:30 -08:00
skanev
d3c6d20d0a
Add compression size reporting hooks.
...
Also, force inlining util::compression::Sample().
The inlining change is necessary. Without it, even with FDO+LIPO, the call
doesn't get inlined and uses 4 registers to construct parameters (which
won't be used in the common case). In some of the more compute-bound
tests, that causes extra spills and significant overhead (even if the
call is sufficiently long).
For example, with inlining:
BM_UFlat/0 32.7µs ± 1% 33.1µs ± 1% +1.41%
without:
BM_UFlat/0 32.7µs ± 1% 37.7µs ± 1% +15.29%
2017-03-08 11:42:21 -08:00
alkis
626e1b9faa
Use #ifdef __SSE2__ for the emmintrin.h include, otherwise snappy.cc does not compile with -march=prescott.
2017-03-07 18:09:49 -08:00
Alkis Evlogimenos
2d99bd14d4
1.1.4 release.
2017-01-27 09:12:04 +01:00
Alkis Evlogimenos
8bfb028b61
Improve zippy decompression speed.
...
The CL contains the following optimizations:
1) rewrite IncrementalCopy routine: single routine that splits the code into sections based on typical probabilities observed across a variety of inputs and helps reduce branch mispredictions both for FDO and non-FDO builds. IncrementalCopy is an adaptive routine that selects the best strategy based on input.
2) introduce UnalignedCopy128 that copies 128 bits per cycle using SSE2.
3) add branch hint for the main decoding loop. The non-literal case is taken more often in benchmarks. I expect this to be a noop in production with FDO. Note that this became apparent after step 1 above.
4) use the new IncrementalCopy in ZippyScatteredWriter.
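As an illustration of item 2, a 16-byte copy can be written with SSE2
unaligned load/store intrinsics, with a memcpy fallback where SSE2 is not
available. This is a sketch under those assumptions, not necessarily the code
in the CL.

```cpp
#include <cassert>
#include <cstring>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// Copies 16 bytes from src to dst with no alignment requirement.
inline void UnalignedCopy128(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
#ifdef __SSE2__
  // One unaligned 128-bit load followed by one unaligned 128-bit store.
  const __m128i x = _mm_loadu_si128(static_cast<const __m128i*>(src));
  _mm_storeu_si128(static_cast<__m128i*>(dst), x);
#else
  // Portable fallback through a temporary buffer.
  char tmp[16];
  std::memcpy(tmp, src, 16);
  std::memcpy(dst, tmp, 16);
#endif
}
```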
I tested two architectures: x86_haswell and ppc_power8.
For x86_haswell I used FDO. For ppc_power8 I did not use FDO.
x86_haswell + FDO
name old speed new speed delta
BM_UCord/0 1.97GB/s ± 1% 3.19GB/s ± 1% +62.08% (p=0.000 n=19+18)
BM_UCord/1 1.28GB/s ± 1% 1.51GB/s ± 1% +18.14% (p=0.000 n=19+18)
BM_UCord/2 15.6GB/s ± 9% 15.5GB/s ± 7% ~ (p=0.620 n=20+20)
BM_UCord/3 811MB/s ± 1% 808MB/s ± 1% -0.38% (p=0.009 n=17+18)
BM_UCord/4 12.4GB/s ± 4% 12.7GB/s ± 8% +2.70% (p=0.002 n=17+20)
BM_UCord/5 1.77GB/s ± 0% 2.33GB/s ± 1% +31.37% (p=0.000 n=18+18)
BM_UCord/6 900MB/s ± 1% 1006MB/s ± 1% +11.71% (p=0.000 n=18+17)
BM_UCord/7 858MB/s ± 1% 938MB/s ± 2% +9.36% (p=0.000 n=19+16)
BM_UCord/8 921MB/s ± 1% 985MB/s ±21% +6.94% (p=0.028 n=19+20)
BM_UCord/9 824MB/s ± 1% 800MB/s ±20% ~ (p=0.113 n=19+20)
BM_UCord/10 2.60GB/s ± 1% 3.67GB/s ±21% +41.31% (p=0.000 n=19+20)
BM_UCord/11 1.07GB/s ± 1% 1.21GB/s ± 1% +13.17% (p=0.000 n=16+16)
BM_UCord/12 1.84GB/s ± 8% 2.18GB/s ± 1% +18.44% (p=0.000 n=16+19)
BM_UCord/13 1.83GB/s ±18% 1.89GB/s ± 1% +3.14% (p=0.000 n=17+19)
BM_UCord/14 1.96GB/s ± 2% 1.97GB/s ± 1% +0.55% (p=0.000 n=16+17)
BM_UCord/15 1.30GB/s ±20% 1.43GB/s ± 1% +9.85% (p=0.000 n=20+20)
BM_UCord/16 658MB/s ±20% 705MB/s ± 1% +7.22% (p=0.000 n=20+19)
BM_UCord/17 1.96GB/s ± 2% 2.15GB/s ± 1% +9.73% (p=0.000 n=16+19)
BM_UCord/18 555MB/s ± 1% 833MB/s ± 1% +50.11% (p=0.000 n=18+19)
BM_UCord/19 1.57GB/s ± 1% 1.75GB/s ± 1% +11.34% (p=0.000 n=20+20)
BM_UCord/20 1.72GB/s ± 2% 1.70GB/s ± 2% -1.01% (p=0.001 n=20+20)
BM_UCordStringSink/0 2.88GB/s ± 1% 3.15GB/s ± 1% +9.56% (p=0.000 n=17+20)
BM_UCordStringSink/1 1.50GB/s ± 1% 1.52GB/s ± 1% +1.96% (p=0.000 n=19+20)
BM_UCordStringSink/2 14.5GB/s ±10% 14.6GB/s ±10% ~ (p=0.542 n=20+20)
BM_UCordStringSink/3 1.06GB/s ± 1% 1.08GB/s ± 1% +1.77% (p=0.000 n=18+20)
BM_UCordStringSink/4 12.6GB/s ± 7% 13.2GB/s ± 4% +4.63% (p=0.000 n=20+20)
BM_UCordStringSink/5 2.29GB/s ± 1% 2.36GB/s ± 1% +3.05% (p=0.000 n=19+20)
BM_UCordStringSink/6 1.01GB/s ± 2% 1.01GB/s ± 0% ~ (p=0.055 n=20+18)
BM_UCordStringSink/7 945MB/s ± 1% 939MB/s ± 1% -0.60% (p=0.000 n=19+20)
BM_UCordStringSink/8 1.06GB/s ± 1% 1.07GB/s ± 1% +0.62% (p=0.000 n=18+20)
BM_UCordStringSink/9 866MB/s ± 1% 864MB/s ± 1% ~ (p=0.107 n=19+20)
BM_UCordStringSink/10 3.64GB/s ± 2% 3.98GB/s ± 1% +9.32% (p=0.000 n=19+20)
BM_UCordStringSink/11 1.22GB/s ± 1% 1.22GB/s ± 1% +0.61% (p=0.001 n=19+20)
BM_UCordStringSink/12 2.23GB/s ± 1% 2.23GB/s ± 1% ~ (p=0.692 n=19+20)
BM_UCordStringSink/13 1.96GB/s ± 1% 1.94GB/s ± 1% -0.82% (p=0.000 n=17+18)
BM_UCordStringSink/14 2.09GB/s ± 2% 2.08GB/s ± 1% ~ (p=0.147 n=20+18)
BM_UCordStringSink/15 1.47GB/s ± 1% 1.45GB/s ± 1% -0.88% (p=0.000 n=20+19)
BM_UCordStringSink/16 908MB/s ± 1% 917MB/s ± 1% +0.97% (p=0.000 n=19+19)
BM_UCordStringSink/17 2.11GB/s ± 1% 2.20GB/s ± 1% +4.35% (p=0.000 n=18+20)
BM_UCordStringSink/18 804MB/s ± 2% 1106MB/s ± 1% +37.52% (p=0.000 n=20+20)
BM_UCordStringSink/19 1.67GB/s ± 1% 1.72GB/s ± 0% +2.81% (p=0.000 n=18+20)
BM_UCordStringSink/20 1.77GB/s ± 3% 1.77GB/s ± 3% ~ (p=0.815 n=20+20)
ppc_power8
name old speed new speed delta
BM_UCord/0 918MB/s ± 6% 1262MB/s ± 0% +37.56% (p=0.000 n=17+16)
BM_UCord/1 671MB/s ±13% 879MB/s ± 2% +30.99% (p=0.000 n=18+16)
BM_UCord/2 12.6GB/s ± 8% 12.6GB/s ± 5% ~ (p=0.452 n=17+19)
BM_UCord/3 285MB/s ±10% 284MB/s ± 4% -0.50% (p=0.021 n=19+17)
BM_UCord/4 5.21GB/s ±12% 6.59GB/s ± 1% +26.37% (p=0.000 n=17+16)
BM_UCord/5 913MB/s ± 4% 1253MB/s ± 1% +37.27% (p=0.000 n=16+17)
BM_UCord/6 461MB/s ±13% 547MB/s ± 1% +18.67% (p=0.000 n=18+16)
BM_UCord/7 455MB/s ± 2% 524MB/s ± 3% +15.28% (p=0.000 n=16+18)
BM_UCord/8 489MB/s ± 2% 584MB/s ± 2% +19.47% (p=0.000 n=17+17)
BM_UCord/9 410MB/s ±33% 490MB/s ± 1% +19.64% (p=0.000 n=17+18)
BM_UCord/10 1.10GB/s ± 3% 1.55GB/s ± 2% +41.21% (p=0.000 n=16+16)
BM_UCord/11 494MB/s ± 1% 558MB/s ± 1% +12.92% (p=0.000 n=17+18)
BM_UCord/12 608MB/s ± 3% 793MB/s ± 1% +30.45% (p=0.000 n=17+16)
BM_UCord/13 545MB/s ±18% 721MB/s ± 2% +32.22% (p=0.000 n=19+17)
BM_UCord/14 594MB/s ± 4% 748MB/s ± 3% +25.99% (p=0.000 n=17+17)
BM_UCord/15 628MB/s ± 1% 822MB/s ± 3% +30.94% (p=0.000 n=18+16)
BM_UCord/16 277MB/s ± 2% 280MB/s ±15% +0.86% (p=0.001 n=17+17)
BM_UCord/17 864MB/s ± 1% 1001MB/s ± 3% +15.96% (p=0.000 n=17+17)
BM_UCord/18 121MB/s ± 2% 284MB/s ± 4% +134.08% (p=0.000 n=17+18)
BM_UCord/19 594MB/s ± 0% 713MB/s ± 2% +19.93% (p=0.000 n=16+17)
BM_UCord/20 553MB/s ±10% 662MB/s ± 5% +19.74% (p=0.000 n=16+18)
BM_UCordStringSink/0 1.37GB/s ± 4% 1.48GB/s ± 2% +8.51% (p=0.000 n=16+16)
BM_UCordStringSink/1 969MB/s ± 1% 990MB/s ± 1% +2.16% (p=0.000 n=16+18)
BM_UCordStringSink/2 13.1GB/s ±11% 13.0GB/s ±14% ~ (p=0.858 n=17+18)
BM_UCordStringSink/3 411MB/s ± 1% 415MB/s ± 1% +0.93% (p=0.000 n=16+17)
BM_UCordStringSink/4 6.81GB/s ± 8% 7.29GB/s ± 5% +7.12% (p=0.000 n=16+19)
BM_UCordStringSink/5 1.35GB/s ± 5% 1.45GB/s ±13% +8.00% (p=0.000 n=16+17)
BM_UCordStringSink/6 653MB/s ± 8% 653MB/s ± 3% -0.12% (p=0.007 n=17+19)
BM_UCordStringSink/7 618MB/s ±13% 597MB/s ±18% -3.45% (p=0.001 n=18+18)
BM_UCordStringSink/8 702MB/s ± 5% 702MB/s ± 1% -0.10% (p=0.012 n=17+16)
BM_UCordStringSink/9 590MB/s ± 2% 564MB/s ±13% -4.46% (p=0.000 n=16+17)
BM_UCordStringSink/10 1.63GB/s ± 2% 1.76GB/s ± 4% +8.28% (p=0.000 n=17+16)
BM_UCordStringSink/11 630MB/s ±14% 684MB/s ±15% +8.51% (p=0.000 n=19+17)
BM_UCordStringSink/12 858MB/s ±12% 903MB/s ± 9% +5.17% (p=0.000 n=19+17)
BM_UCordStringSink/13 806MB/s ±22% 879MB/s ± 1% +8.98% (p=0.000 n=19+19)
BM_UCordStringSink/14 854MB/s ±13% 901MB/s ± 5% +5.60% (p=0.000 n=19+17)
BM_UCordStringSink/15 930MB/s ± 2% 964MB/s ± 3% +3.59% (p=0.000 n=16+16)
BM_UCordStringSink/16 363MB/s ±10% 356MB/s ± 6% ~ (p=0.050 n=20+19)
BM_UCordStringSink/17 976MB/s ±12% 1078MB/s ± 1% +10.52% (p=0.000 n=20+17)
BM_UCordStringSink/18 227MB/s ± 1% 355MB/s ± 3% +56.45% (p=0.000 n=16+17)
BM_UCordStringSink/19 751MB/s ± 4% 808MB/s ± 4% +7.70% (p=0.000 n=18+17)
BM_UCordStringSink/20 761MB/s ± 8% 786MB/s ± 4% +3.23% (p=0.000 n=18+17)
2017-01-27 09:10:36 +01:00
Behzad Nouri
818b583387
adds std:: to stl types ( #061 )
2017-01-26 21:43:13 +01:00
Geoff Pike
27c5d86527
Re-work fast path for handling copies in zippy decompression.
...
This is a performance-tuning change that shouldn't change the behavior
of the library.
This adds some complexity but the performance gain might make that
worthwhile: With FDO on perflab/haswell, a 4.0% gain (geometric mean).
SAMPLE (before)
Benchmark Time(ns) CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0 36638 36552 100000 2.6GB/s html
BM_UFlat/1 457153 455895 9173 1.4GB/s urls
BM_UFlat/2 5850 5837 685481 19.6GB/s jpg
BM_UFlat/3 122 122 34551988 1.5GB/s jpg_200
BM_UFlat/4 6797 6781 620811 14.1GB/s pdf
BM_UFlat/5 179485 179037 23471 2.1GB/s html4
BM_UFlat/6 142734 142384 29525 1018.7MB/s txt1
BM_UFlat/7 125233 124924 33709 955.6MB/s txt2
BM_UFlat/8 382548 381533 10000 1066.7MB/s txt3
BM_UFlat/9 525614 524297 8018 876.5MB/s txt4
BM_UFlat/10 34946 34868 100000 3.2GB/s pb
BM_UFlat/11 149548 149208 28063 1.2GB/s gaviota
BM_UFlat/12 10684 10663 392580 2.1GB/s cp
BM_UFlat/13 5494 5484 766584 1.9GB/s c
BM_UFlat/14 1691 1688 2488784 2.1GB/s lsp
BM_UFlat/15 676443 674726 6129 1.4GB/s xls
BM_UFlat/16 156 156 26656909 1.2GB/s xls_200
BM_UFlat/17 239911 239297 17558 2.0GB/s bin
BM_UFlat/18 182 182 23072932 1047.9MB/s bin_200
BM_UFlat/19 21544 21499 194484 1.7GB/s sum
BM_UFlat/20 2236 2232 1877810 1.8GB/s man
BM_UFlatSink/0 42266 42179 99732 2.3GB/s html
BM_UFlatSink/1 461810 460633 9055 1.4GB/s urls
BM_UFlatSink/2 5816 5804 632829 19.8GB/s jpg
BM_UFlatSink/3 124 123 34351698 1.5GB/s jpg_200
BM_UFlatSink/4 7173 7157 609929 13.3GB/s pdf
BM_UFlatSink/5 184795 184302 22660 2.1GB/s html4
BM_UFlatSink/6 143552 143223 29272 1012.7MB/s txt1
BM_UFlatSink/7 127160 126890 33178 940.8MB/s txt2
BM_UFlatSink/8 382219 381313 10000 1067.3MB/s txt3
BM_UFlatSink/9 528042 526713 7988 872.5MB/s txt4
BM_UFlatSink/10 41389 41305 100000 2.7GB/s pb
BM_UFlatSink/11 147215 146877 28854 1.2GB/s gaviota
BM_UFlatSink/12 12008 11984 348139 1.9GB/s cp
BM_UFlatSink/13 5444 5433 775084 1.9GB/s c
BM_UFlatSink/14 1647 1644 2552119 2.1GB/s lsp
BM_UFlatSink/15 665011 663424 6320 1.4GB/s xls
BM_UFlatSink/16 153 153 27571837 1.2GB/s xls_200
BM_UFlatSink/17 239735 239169 17411 2.0GB/s bin
BM_UFlatSink/18 183 182 23005573 1046.8MB/s bin_200
BM_UFlatSink/19 22544 22498 187705 1.6GB/s sum
BM_UFlatSink/20 2190 2186 1917894 1.8GB/s man
SAMPLE (after)
Benchmark Time(ns) CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0 33940 33889 100000 2.8GB/s html
BM_UFlat/1 440728 439944 9586 1.5GB/s urls
BM_UFlat/2 5652 5641 744776 20.3GB/s jpg
BM_UFlat/3 123 123 34647884 1.5GB/s jpg_200
BM_UFlat/4 6628 6615 631892 14.4GB/s pdf
BM_UFlat/5 169523 169227 24197 2.3GB/s html4
BM_UFlat/6 144139 143892 29232 1008.0MB/s txt1
BM_UFlat/7 127148 126915 33144 940.6MB/s txt2
BM_UFlat/8 380267 379233 10000 1073.2MB/s txt3
BM_UFlat/9 529495 528194 7957 870.0MB/s txt4
BM_UFlat/10 31844 31784 100000 3.5GB/s pb
BM_UFlat/11 146822 146476 28737 1.2GB/s gaviota
BM_UFlat/12 10784 10762 392176 2.1GB/s cp
BM_UFlat/13 5528 5518 760934 1.9GB/s c
BM_UFlat/14 1721 1719 2449291 2.0GB/s lsp
BM_UFlat/15 673304 671774 6255 1.4GB/s xls
BM_UFlat/16 155 155 27092003 1.2GB/s xls_200
BM_UFlat/17 230424 229902 18285 2.1GB/s bin
BM_UFlat/18 185 184 22818199 1033.9MB/s bin_200
BM_UFlat/19 21035 20996 200765 1.7GB/s sum
BM_UFlat/20 2242 2238 1864380 1.8GB/s man
BM_UFlatSink/0 33487 33405 100000 2.9GB/s html
BM_UFlatSink/1 431108 430226 9764 1.5GB/s urls
BM_UFlatSink/2 5927 5916 648112 19.4GB/s jpg
BM_UFlatSink/3 123 122 34704423 1.5GB/s jpg_200
BM_UFlatSink/4 6472 6461 653462 14.8GB/s pdf
BM_UFlatSink/5 164309 163988 25567 2.3GB/s html4
BM_UFlatSink/6 138274 138020 30311 1050.9MB/s txt1
BM_UFlatSink/7 120844 120637 34708 989.6MB/s txt2
BM_UFlatSink/8 371046 370366 10000 1098.9MB/s txt3
BM_UFlatSink/9 510021 508982 8269 902.9MB/s txt4
BM_UFlatSink/10 30889 30844 100000 3.6GB/s pb
BM_UFlatSink/11 140752 140521 29903 1.2GB/s gaviota
BM_UFlatSink/12 10162 10146 413600 2.3GB/s cp
BM_UFlatSink/13 5264 5256 762398 2.0GB/s c
BM_UFlatSink/14 1622 1619 2606069 2.1GB/s lsp
BM_UFlatSink/15 646897 645756 6512 1.5GB/s xls
BM_UFlatSink/16 150 150 28223595 1.2GB/s xls_200
BM_UFlatSink/17 226096 225650 18629 2.1GB/s bin
BM_UFlatSink/18 185 184 22907935 1035.3MB/s bin_200
BM_UFlatSink/19 21369 21335 198881 1.7GB/s sum
BM_UFlatSink/20 2139 2136 1953637 1.8GB/s man
2017-01-26 21:42:26 +01:00
Sriraman Tallam
4a74094080
Speed up Zippy decompression in PIE mode by removing the penalty for
...
global array access.
With PIE, accessing global arrays needs two instructions, whereas it can be
done with a single instruction without PIE.
For example, without PIE the access looks like:
mov 0x400780(,%rdi,4),%eax // One instruction to access arr[i]
and with PIE the access looks like:
lea 0x149(%rip),%rax # 400780 <_ZL3arr>
mov (%rax,%rdi,4),%eax
This causes a slowdown in zippy, as it has two global arrays, wordmask and
char_table. There is no equivalent PC-relative instruction with PIE to do this
in one instruction.
The slowdown can be seen as an increase in dynamic instruction count and
cycles with a similar IPC. We have seen this affect REDACTED recently, and this
is causing a ~1% performance slowdown.
One of the mitigation techniques for small arrays is to move them onto the stack
and use the stack pointer to make the access a single instruction. The downside to
this is the extra instructions at function call to move the array onto the stack,
which is why we want to do this only for small arrays. I tried moving
wordmask onto the stack since it is a small array. The performance numbers look
good overall. There is an improvement in the dynamic instruction count for
almost all BM_UFlat benchmarks. BM_UFlat/2 and BM_UFlat/3 are pretty noisy.
The only case where there is a regression is BM_UFlat/10. Here, the instruction
count does go down but the IPC also goes down affecting performance. This also
looks noisy but I do see a small IPC drop with this change. Otherwise, the
numbers look good and consistent. I measured this on a perflab ivybridge
machine multiple times. Numbers are given below. For Improv. (improvements),
positive is good.
Binaries built as: blaze build -c opt --dynamic_mode=off
Benchmark Base CPU(ns) Opt CPU(ns) Improv. Base Cycles Opt Cycles Improv. Base Insns Opt Insns Improv.
BM_UFlat/1 541711 537052 0.86% 46068129918 45442732684 1.36% 85113352848 83917656016 1.40%
BM_UFlat/2 6228 6388 -2.57% 582789808 583267855 -0.08% 1261517746 1261116553 0.03%
BM_UFlat/3 159 120 24.53% 61538641 58783800 4.48% 90008672 90980060 -1.08%
BM_UFlat/4 7878 7787 1.16% 710491888 703718556 0.95% 1914898283 1525060250 20.36%
BM_UFlat/5 208854 207673 0.57% 17640846255 17609530720 0.18% 36546983483 36008920788 1.47%
BM_UFlat/6 172595 167225 3.11% 14642082831 14232371166 2.80% 33647820489 33056659600 1.76%
BM_UFlat/7 152364 147901 2.93% 12904338645 12635220582 2.09% 28958390984 28457982504 1.73%
BM_UFlat/8 463764 448244 3.35% 39423576973 37917435891 3.82% 88350964483 86800265943 1.76%
BM_UFlat/9 639517 621811 2.77% 54275945823 52555988926 3.17% 119503172410 117432599704 1.73%
BM_UFlat/10 41929 42358 -1.02% 3593125535 3647231492 -1.51% 8559206066 8446526639 1.32%
BM_UFlat/11 174754 173936 0.47% 14885371426 14749410955 0.91% 36693421142 35987215897 1.92%
BM_UFlat/12 13388 13257 0.98% 1192648670 1179645044 1.09% 3506482177 3454962579 1.47%
BM_UFlat/13 6801 6588 3.13% 627960003 608367286 3.12% 1847877894 1818368400 1.60%
BM_UFlat/14 2057 1989 3.31% 229005588 217393157 5.07% 609686274 599419511 1.68%
BM_UFlat/15 831618 799881 3.82% 70440388955 67911853013 3.59% 167178603105 164653652416 1.51%
BM_UFlat/16 199 199 0.00% 70109081 68747579 1.94% 106263639 105569531 0.65%
BM_UFlat/17 279031 273890 1.84% 23361373312 23294246637 0.29% 40474834585 39981682217 1.22%
BM_UFlat/18 233 199 14.59% 74530664 67841101 8.98% 94305848 92271053 2.16%
BM_UFlat/19 26743 25309 5.36% 2327215133 2206712016 5.18% 6024314357 5935228694 1.48%
BM_UFlat/20 2731 2625 3.88% 282018757 276772813 1.86% 768382519 758277029 1.32%
Is this a reasonable work-around for the problem? Do you need more performance
measurements? haih@ is evaluating this change for [] and I will update those
numbers once we have them.
Tested:
Performance with zippy_unittest.
2017-01-26 21:42:11 +01:00
Geoff Pike
38a5ec5fca
Re-work fast path that emits copies in zippy compression.
...
The primary motivation for the change is that FindMatchLength is
likely to discover a difference in the first 8 bytes it compares.
If that occurs then we know the length of the match is less than 12,
because FindMatchLength is invoked after a 4-byte match is found.
When emitting a copy, it is useful to know that the length is less
than 12 because the two-byte variant of an emitted copy requires that.
This is a performance-tuning change that should not affect the
library's behavior.
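The length bound described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the library's code:
FindMatchLengthSketch and FitsTwoByteCopy are hypothetical names, and the
byte-by-byte loop stands in for whatever wider comparison the real function
uses. The offset limit of 2048 is taken from the snappy format description
for copies with a 1-byte offset.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not the library's signature): counts how many bytes
// of s1 and s2 match, stopping at s2_limit. The second member of the result
// is true when the match ended within the first 8 bytes compared.
std::pair<std::size_t, bool> FindMatchLengthSketch(const char* s1,
                                                   const char* s2,
                                                   const char* s2_limit) {
  assert(s2 <= s2_limit);
  std::size_t matched = 0;
  while (s2 + matched < s2_limit && s1[matched] == s2[matched]) {
    ++matched;
  }
  return {matched, matched < 8};
}

// The caller has already verified a 4-byte match, so the total copy length
// is 4 + matched. When the flag above is set, that total is at most
// 4 + 7 = 11, i.e. less than 12, which the two-byte copy encoding requires.
bool FitsTwoByteCopy(std::size_t total_len, std::size_t offset) {
  return total_len < 12 && offset < 2048;
}
```

With the flag in hand, the emitter can pick the two-byte encoding without
re-checking the length, as long as the offset is also small enough.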
With FDO on perflab/Haswell the geometric mean for ZFlat/* went from
47,290ns to 45,741ns, an improvement of 3.4%.
SAMPLE (before)
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------
BM_ZFlat/0 102824 102650 40691 951.4MB/s html (22.31 %)
BM_ZFlat/1 1293512 1290442 3225 518.9MB/s urls (47.78 %)
BM_ZFlat/2 10373 10353 417959 11.1GB/s jpg (99.95 %)
BM_ZFlat/3 268 268 15745324 712.4MB/s jpg_200 (73.00 %)
BM_ZFlat/4 12137 12113 342462 7.9GB/s pdf (83.30 %)
BM_ZFlat/5 430672 429720 9724 909.0MB/s html4 (22.52 %)
BM_ZFlat/6 420541 419636 9833 345.6MB/s txt1 (57.88 %)
BM_ZFlat/7 373829 373158 10000 319.9MB/s txt2 (61.91 %)
BM_ZFlat/8 1119014 1116604 3755 364.5MB/s txt3 (54.99 %)
BM_ZFlat/9 1544203 1540657 2748 298.3MB/s txt4 (66.26 %)
BM_ZFlat/10 91041 90866 46002 1.2GB/s pb (19.68 %)
BM_ZFlat/11 332766 331990 10000 529.5MB/s gaviota (37.72 %)
BM_ZFlat/12 39960 39886 100000 588.3MB/s cp (48.12 %)
BM_ZFlat/13 14493 14465 287181 735.1MB/s c (42.47 %)
BM_ZFlat/14 4447 4440 947927 799.3MB/s lsp (48.37 %)
BM_ZFlat/15 1316362 1313350 3196 747.7MB/s xls (41.23 %)
BM_ZFlat/16 312 311 10000000 613.0MB/s xls_200 (78.00 %)
BM_ZFlat/17 388471 387502 10000 1.2GB/s bin (18.11 %)
BM_ZFlat/18 65 64 64838208 2.9GB/s bin_200 (7.50 %)
BM_ZFlat/19 65900 65787 63099 554.3MB/s sum (48.96 %)
BM_ZFlat/20 6188 6177 681951 652.6MB/s man (59.21 %)
SAMPLE (after)
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------
BM_ZFlat/0 99259 99044 42428 986.0MB/s html (22.31 %)
BM_ZFlat/1 1257039 1255276 3341 533.4MB/s urls (47.78 %)
BM_ZFlat/2 10044 10030 405781 11.4GB/s jpg (99.95 %)
BM_ZFlat/3 268 267 15732282 713.3MB/s jpg_200 (73.00 %)
BM_ZFlat/4 11675 11657 358629 8.2GB/s pdf (83.30 %)
BM_ZFlat/5 420951 419818 9739 930.5MB/s html4 (22.52 %)
BM_ZFlat/6 415460 414632 10000 349.8MB/s txt1 (57.88 %)
BM_ZFlat/7 367191 366436 10000 325.8MB/s txt2 (61.91 %)
BM_ZFlat/8 1098345 1096036 3819 371.3MB/s txt3 (54.99 %)
BM_ZFlat/9 1508701 1505306 2758 305.3MB/s txt4 (66.26 %)
BM_ZFlat/10 87195 87031 47289 1.3GB/s pb (19.68 %)
BM_ZFlat/11 322338 321637 10000 546.5MB/s gaviota (37.72 %)
BM_ZFlat/12 36739 36668 100000 639.9MB/s cp (48.12 %)
BM_ZFlat/13 13646 13618 304009 780.9MB/s c (42.47 %)
BM_ZFlat/14 4249 4240 992456 837.0MB/s lsp (48.37 %)
BM_ZFlat/15 1262925 1260012 3314 779.4MB/s xls (41.23 %)
BM_ZFlat/16 308 308 10000000 619.8MB/s xls_200 (78.00 %)
BM_ZFlat/17 379750 378944 10000 1.3GB/s bin (18.11 %)
BM_ZFlat/18 62 62 67443280 3.0GB/s bin_200 (7.50 %)
BM_ZFlat/19 61706 61587 67645 592.1MB/s sum (48.96 %)
BM_ZFlat/20 5968 5958 698974 676.6MB/s man (59.21 %)
2017-01-26 21:39:39 +01:00
ckennelly
094c67de88
Speed up the EmitLiteral fast path, +1.62% for ZFlat benchmarks.
...
This is inspired by the Go version in
//third_party/golang/snappy/encode_amd64.s (emitLiteralFastPath)
Benchmark Base:Reference (1)
--------------------------------------------------
(BM_ZFlat_0 1/cputime_ns) 9.669e-06 +1.65%
(BM_ZFlat_1 1/cputime_ns) 7.643e-07 +2.53%
(BM_ZFlat_10 1/cputime_ns) 1.107e-05 -0.97%
(BM_ZFlat_11 1/cputime_ns) 3.002e-06 +0.71%
(BM_ZFlat_12 1/cputime_ns) 2.338e-05 +7.22%
(BM_ZFlat_13 1/cputime_ns) 6.386e-05 +9.18%
(BM_ZFlat_14 1/cputime_ns) 0.0002256 -0.05%
(BM_ZFlat_15 1/cputime_ns) 7.608e-07 -1.29%
(BM_ZFlat_16 1/cputime_ns) 0.003236 -1.28%
(BM_ZFlat_17 1/cputime_ns) 2.58e-06 +0.52%
(BM_ZFlat_18 1/cputime_ns) 0.01538 +0.00%
(BM_ZFlat_19 1/cputime_ns) 1.436e-05 +6.21%
(BM_ZFlat_2 1/cputime_ns) 0.0001044 +4.99%
(BM_ZFlat_20 1/cputime_ns) 0.0001608 -0.18%
(BM_ZFlat_3 1/cputime_ns) 0.003745 +0.38%
(BM_ZFlat_4 1/cputime_ns) 8.144e-05 +6.21%
(BM_ZFlat_5 1/cputime_ns) 2.328e-06 -1.60%
(BM_ZFlat_6 1/cputime_ns) 2.391e-06 +0.06%
(BM_ZFlat_7 1/cputime_ns) 2.68e-06 -0.61%
(BM_ZFlat_8 1/cputime_ns) 8.852e-07 +0.19%
(BM_ZFlat_9 1/cputime_ns) 6.441e-07 +1.06%
geometric mean +1.62%
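For readers who want to check the summary line, here is a small sketch of
how a geometric mean over per-benchmark changes can be computed. The
function name GeometricMeanPercent is hypothetical, and the assumption is
that each percentage in the table is a relative change in 1/cputime_ns.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of relative changes given in percent (e.g. +1.65 means the
// new value is 1.0165 times the old one). The result is also in percent.
double GeometricMeanPercent(const std::vector<double>& percent_changes) {
  assert(!percent_changes.empty());
  double log_sum = 0.0;
  for (double pct : percent_changes) {
    log_sum += std::log1p(pct / 100.0);  // log of the ratio new/old
  }
  const double mean_log = log_sum / static_cast<double>(percent_changes.size());
  return std::expm1(mean_log) * 100.0;
}
```

Feeding in the 21 percentages listed above gives roughly +1.6%, which is
consistent with the reported figure.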
2017-01-26 21:38:49 +01:00
Geoff Pike
fce661fa8c
Speed up zippy decompression by removing some zero-extensions.
...
This is a performance tuning change that should not affect
correctness. On perflab with FDO on Haswell the performance gain is
21,776ns before vs 21,255ns after, about 2.4%. (Using geometric means.)
SAMPLE PERFORMANCE with FDO on HASWELL (NEW)
Benchmark Time(ns) CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0 37366 37279 100000 2.6GB/s html
BM_UFlat/1 471153 470204 8975 1.4GB/s urls
BM_UFlat/2 6116 6105 639496 18.8GB/s jpg
BM_UFlat/3 123 123 34709908 1.5GB/s jpg_200
BM_UFlat/4 6724 6714 623318 14.2GB/s pdf
BM_UFlat/5 183122 182722 23138 2.1GB/s html4
BM_UFlat/6 144981 144689 29384 1002.5MB/s txt1
BM_UFlat/7 125939 125691 33423 949.8MB/s txt2
BM_UFlat/8 383101 382241 10000 1064.7MB/s txt3
BM_UFlat/9 527824 526606 7958 872.6MB/s txt4
BM_UFlat/10 34849 34790 100000 3.2GB/s pb
BM_UFlat/11 150213 149937 28131 1.1GB/s gaviota
BM_UFlat/12 10850 10830 393231 2.1GB/s cp
BM_UFlat/13 5532 5523 735739 1.9GB/s c
BM_UFlat/14 1698 1695 2478035 2.0GB/s lsp
BM_UFlat/15 678396 676917 6200 1.4GB/s xls
BM_UFlat/16 155 155 26909789 1.2GB/s xls_200
BM_UFlat/17 241235 240698 17416 2.0GB/s bin
BM_UFlat/18 183 183 23000841 1043.5MB/s bin_200
BM_UFlat/19 21461 21424 193275 1.7GB/s sum
BM_UFlat/20 2232 2228 1887191 1.8GB/s man
BM_UFlatSink/0 42272 42199 98528 2.3GB/s html
BM_UFlatSink/1 460814 459898 9092 1.4GB/s urls
BM_UFlatSink/2 5558 5547 768629 20.7GB/s jpg
BM_UFlatSink/3 124 123 33629141 1.5GB/s jpg_200
BM_UFlatSink/4 6634 6621 629989 14.4GB/s pdf
BM_UFlatSink/5 182883 182491 23030 2.1GB/s html4
BM_UFlatSink/6 143269 142964 29410 1014.5MB/s txt1
BM_UFlatSink/7 127041 126809 33136 941.4MB/s txt2
BM_UFlatSink/8 384367 383577 10000 1061.0MB/s txt3
BM_UFlatSink/9 529979 528890 7898 868.9MB/s txt4
BM_UFlatSink/10 41154 41075 100000 2.7GB/s pb
BM_UFlatSink/11 146446 146155 28742 1.2GB/s gaviota
BM_UFlatSink/12 11939 11918 352663 1.9GB/s cp
BM_UFlatSink/13 5430 5421 770451 1.9GB/s c
BM_UFlatSink/14 1665 1662 2538921 2.1GB/s lsp
BM_UFlatSink/15 666840 665617 6309 1.4GB/s xls
BM_UFlatSink/16 152 152 27639460 1.2GB/s xls_200
BM_UFlatSink/17 240076 239573 17643 2.0GB/s bin
BM_UFlatSink/18 183 182 23128210 1046.0MB/s bin_200
BM_UFlatSink/19 22570 22528 185839 1.6GB/s sum
BM_UFlatSink/20 2183 2180 1899526 1.8GB/s man
SAMPLE PERFORMANCE with FDO on HASWELL (OLD)
Benchmark Time(ns) CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0 37041 36990 100000 2.6GB/s html
BM_UFlat/1 471384 470574 8930 1.4GB/s urls
BM_UFlat/2 5997 5986 722354 19.2GB/s jpg
BM_UFlat/3 124 123 34964717 1.5GB/s jpg_200
BM_UFlat/4 6850 6838 621414 13.9GB/s pdf
BM_UFlat/5 182578 182271 23001 2.1GB/s html4
BM_UFlat/6 148338 147989 28132 980.1MB/s txt1
BM_UFlat/7 130682 130471 32347 915.0MB/s txt2
BM_UFlat/8 397420 396553 10000 1026.3MB/s txt3
BM_UFlat/9 550126 548872 7736 837.2MB/s txt4
BM_UFlat/10 35013 34958 100000 3.2GB/s pb
BM_UFlat/11 152270 151889 27508 1.1GB/s gaviota
BM_UFlat/12 11117 11096 379059 2.1GB/s cp
BM_UFlat/13 5812 5801 725240 1.8GB/s c
BM_UFlat/14 1780 1777 2383982 2.0GB/s lsp
BM_UFlat/15 707871 706139 5946 1.4GB/s xls
BM_UFlat/16 157 157 26889747 1.2GB/s xls_200
BM_UFlat/17 239160 238556 17512 2.0GB/s bin
BM_UFlat/18 181 180 23326040 1057.5MB/s bin_200
BM_UFlat/19 22706 22656 186285 1.6GB/s sum
BM_UFlat/20 2319 2315 1813186 1.7GB/s man
BM_UFlatSink/0 42657 42574 99000 2.2GB/s html
BM_UFlatSink/1 466316 465262 9036 1.4GB/s urls
BM_UFlatSink/2 6873 6859 648525 16.7GB/s jpg
BM_UFlatSink/3 124 124 34434643 1.5GB/s jpg_200
BM_UFlatSink/4 6804 6790 624282 14.0GB/s pdf
BM_UFlatSink/5 185468 185062 22746 2.1GB/s html4
BM_UFlatSink/6 148511 148209 28284 978.6MB/s txt1
BM_UFlatSink/7 130865 130607 32144 914.0MB/s txt2
BM_UFlatSink/8 393931 392983 10000 1035.6MB/s txt3
BM_UFlatSink/9 545548 544275 7740 844.3MB/s txt4
BM_UFlatSink/10 41659 41584 100000 2.7GB/s pb
BM_UFlatSink/11 152062 151721 27854 1.1GB/s gaviota
BM_UFlatSink/12 11987 11968 350909 1.9GB/s cp
BM_UFlatSink/13 5652 5641 743280 1.8GB/s c
BM_UFlatSink/14 1728 1725 2446140 2.0GB/s lsp
BM_UFlatSink/15 687879 686231 6138 1.4GB/s xls
BM_UFlatSink/16 155 155 27254484 1.2GB/s xls_200
BM_UFlatSink/17 240689 240083 17450 2.0GB/s bin
BM_UFlatSink/18 183 182 22932858 1046.8MB/s bin_200
BM_UFlatSink/19 22718 22674 185207 1.6GB/s sum
BM_UFlatSink/20 2272 2268 1851664 1.7GB/s man
2017-01-26 21:38:36 +01:00
ckennelly
e788e527d3
Avoid calling memset when resizing the buffer.
...
This buffer will be initialized and then trimmed down to size during the
compression phase.
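The idea can be illustrated with a short sketch. Everything here is an
assumption made for illustration: MaxOutputLengthSketch, FakeCompress and
CompressToString are hypothetical names, the size bound is only an example,
and the "compressor" just copies its input. The library presumably resizes
its output string in place; this sketch uses a raw array instead because
that is a portable way to get storage that is not zero-filled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical worst-case output size for an input of n bytes.
std::size_t MaxOutputLengthSketch(std::size_t n) { return 32 + n + n / 6; }

// Hypothetical stand-in for a compressor: copies the input unchanged and
// returns the number of bytes written to out.
std::size_t FakeCompress(const char* in, std::size_t n, char* out) {
  std::memcpy(out, in, n);
  return n;
}

std::string CompressToString(const std::string& input) {
  const std::size_t cap = MaxOutputLengthSketch(input.size());
  // new char[cap] leaves the bytes uninitialized, so no memset-style
  // zero-fill happens here, unlike std::string::resize(cap).
  std::unique_ptr<char[]> scratch(new char[cap]);
  const std::size_t used =
      FakeCompress(input.data(), input.size(), scratch.get());
  assert(used <= cap);
  // Only the bytes actually written are kept; the rest is trimmed away.
  return std::string(scratch.get(), used);
}
```

Zero-filling the worst-case buffer is wasted work because every byte that
survives the final trim has been overwritten by the compressor anyway.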
2017-01-26 21:35:55 +01:00
Steinar H. Gunderson
32d6d7d8a2
Merge pull request #6 from deviance/provide-pkg-config-data
...
Provide pkg-config data
2016-05-23 11:16:01 +02:00
Peter Kasting
971613510f
Add #ifdef to guard against macro redefinition if this is included in another
...
Google project that also defines this.
2016-05-20 11:35:25 +02:00
Steinar H. Gunderson
0000f997dd
Merge pull request #13 from huachaohuang/patch-1
...
Allow compiling in nested packages.
2016-05-20 11:28:45 +02:00
Steinar H. Gunderson
d53de18799
Make heuristic match skipping more aggressive.
...
This causes compression to be much faster on incompressible inputs
(such as the jpeg and pdf tests), and is neutral or even positive on the other
tests. The test set shows only microscopic density regressions; I attempted to
construct a worst-case test set containing ~1500 different cases of mixed
plaintext + /dev/urandom, and even those seemed to be only 0.38 percentage
points less dense on average (the single worst case was 87.8% -> 89.0%), which
we can live with given that this is already an edge case.
The original idea is by Klaus Post; I only tweaked the implementation.
Ironically, the new implementation is almost more in line with the
comment that was there, so I've left that largely alone, albeit
with a small modification.
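The message does not spell out the update rule, so the following is only a
sketch under an assumption: that the old heuristic grew its skip counter by
one per failed hash lookup, while the new one grows it by the current step.
Both function names are hypothetical. The functions count how many hash
lookups are made while scanning n bytes in which no match is ever found.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed old policy: step = skip >> 5, and skip grows by 1 per lookup,
// so the step only increases after every 32 lookups.
std::size_t ProbesOldPolicy(std::size_t n) {
  std::uint32_t skip = 32;
  std::size_t pos = 0;
  std::size_t probes = 0;
  while (pos < n) {
    ++probes;
    pos += skip >> 5;
    ++skip;
  }
  return probes;
}

// Assumed new policy: skip grows by the step itself, so the step increases
// after every 32 bytes scanned rather than after every 32 lookups.
std::size_t ProbesNewPolicy(std::size_t n) {
  std::uint32_t skip = 32;
  std::size_t pos = 0;
  std::size_t probes = 0;
  while (pos < n) {
    ++probes;
    const std::uint32_t step = skip >> 5;
    skip += step;
    pos += step;
  }
  return probes;
}
```

Under this assumption the two policies behave identically for the first 32
bytes, but on long unmatched runs the new one performs far fewer lookups,
which fits the large speedups reported for the jpg and pdf inputs.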
Microbenchmark results (opt mode, 64-bit, static linking):
Ivy Bridge:
Benchmark Base (ns) New (ns) Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0 120284 115480 847.0MB/s html (22.31 %) +4.2%
BM_ZFlat/1 1527911 1522242 440.7MB/s urls (47.78 %) +0.4%
BM_ZFlat/2 17591 10582 10.9GB/s jpg (99.95 %) +66.2%
BM_ZFlat/3 323 322 593.3MB/s jpg_200 (73.00 %) +0.3%
BM_ZFlat/4 53691 14063 6.8GB/s pdf (83.30 %) +281.8%
BM_ZFlat/5 495442 492347 794.8MB/s html4 (22.52 %) +0.6%
BM_ZFlat/6 473523 473622 306.7MB/s txt1 (57.88 %) -0.0%
BM_ZFlat/7 421406 420120 284.5MB/s txt2 (61.91 %) +0.3%
BM_ZFlat/8 1265632 1270538 320.8MB/s txt3 (54.99 %) -0.4%
BM_ZFlat/9 1742688 1737894 264.8MB/s txt4 (66.26 %) +0.3%
BM_ZFlat/10 107950 103404 1095.1MB/s pb (19.68 %) +4.4%
BM_ZFlat/11 372660 371818 473.5MB/s gaviota (37.72 %) +0.2%
BM_ZFlat/12 53239 49528 474.4MB/s cp (48.12 %) +7.5%
BM_ZFlat/13 18940 17349 613.9MB/s c (42.47 %) +9.2%
BM_ZFlat/14 5155 5075 700.3MB/s lsp (48.37 %) +1.6%
BM_ZFlat/15 1474757 1474471 667.2MB/s xls (41.23 %) +0.0%
BM_ZFlat/16 363 362 528.0MB/s xls_200 (78.00 %) +0.3%
BM_ZFlat/17 453849 456931 1073.2MB/s bin (18.11 %) -0.7%
BM_ZFlat/18 90 87 2.1GB/s bin_200 (7.50 %) +3.4%
BM_ZFlat/19 82163 80498 453.7MB/s sum (48.96 %) +2.1%
BM_ZFlat/20 7174 7124 566.7MB/s man (59.21 %) +0.7%
Sum of all benchmarks 8694831 8623857 +0.8%
Sandy Bridge:
Benchmark Base (ns) New (ns) Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0 117426 112649 868.2MB/s html (22.31 %) +4.2%
BM_ZFlat/1 1517095 1498522 447.5MB/s urls (47.78 %) +1.2%
BM_ZFlat/2 18601 10649 10.8GB/s jpg (99.95 %) +74.7%
BM_ZFlat/3 359 356 536.0MB/s jpg_200 (73.00 %) +0.8%
BM_ZFlat/4 60249 13832 6.9GB/s pdf (83.30 %) +335.6%
BM_ZFlat/5 481246 475571 822.7MB/s html4 (22.52 %) +1.2%
BM_ZFlat/6 460541 455693 318.8MB/s txt1 (57.88 %) +1.1%
BM_ZFlat/7 407751 404147 295.8MB/s txt2 (61.91 %) +0.9%
BM_ZFlat/8 1228255 1222519 333.4MB/s txt3 (54.99 %) +0.5%
BM_ZFlat/9 1678299 1666379 276.2MB/s txt4 (66.26 %) +0.7%
BM_ZFlat/10 106499 101715 1113.4MB/s pb (19.68 %) +4.7%
BM_ZFlat/11 361913 360222 488.7MB/s gaviota (37.72 %) +0.5%
BM_ZFlat/12 53137 49618 473.6MB/s cp (48.12 %) +7.1%
BM_ZFlat/13 18801 17812 597.8MB/s c (42.47 %) +5.6%
BM_ZFlat/14 5394 5383 660.2MB/s lsp (48.37 %) +0.2%
BM_ZFlat/15 1435411 1432870 686.4MB/s xls (41.23 %) +0.2%
BM_ZFlat/16 389 395 483.3MB/s xls_200 (78.00 %) -1.5%
BM_ZFlat/17 447255 445510 1100.4MB/s bin (18.11 %) +0.4%
BM_ZFlat/18 86 86 2.2GB/s bin_200 (7.50 %) +0.0%
BM_ZFlat/19 82555 79512 459.3MB/s sum (48.96 %) +3.8%
BM_ZFlat/20 7527 7553 534.5MB/s man (59.21 %) -0.3%
Sum of all benchmarks 8488789 8360993 +1.5%
Haswell:
Benchmark Base (ns) New (ns) Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0 107512 105621 925.6MB/s html (22.31 %) +1.8%
BM_ZFlat/1 1344306 1332479 503.1MB/s urls (47.78 %) +0.9%
BM_ZFlat/2 14752 9471 12.1GB/s jpg (99.95 %) +55.8%
BM_ZFlat/3 287 275 694.0MB/s jpg_200 (73.00 %) +4.4%
BM_ZFlat/4 48810 12263 7.8GB/s pdf (83.30 %) +298.0%
BM_ZFlat/5 443013 442064 884.6MB/s html4 (22.52 %) +0.2%
BM_ZFlat/6 429239 432124 336.0MB/s txt1 (57.88 %) -0.7%
BM_ZFlat/7 381765 383681 311.5MB/s txt2 (61.91 %) -0.5%
BM_ZFlat/8 1136667 1154304 353.0MB/s txt3 (54.99 %) -1.5%
BM_ZFlat/9 1579925 1592431 288.9MB/s txt4 (66.26 %) -0.8%
BM_ZFlat/10 98345 92411 1.2GB/s pb (19.68 %) +6.4%
BM_ZFlat/11 340397 340466 516.8MB/s gaviota (37.72 %) -0.0%
BM_ZFlat/12 47076 43536 539.5MB/s cp (48.12 %) +8.1%
BM_ZFlat/13 16680 15637 680.8MB/s c (42.47 %) +6.7%
BM_ZFlat/14 4616 4539 782.6MB/s lsp (48.37 %) +1.7%
BM_ZFlat/15 1331231 1334094 736.9MB/s xls (41.23 %) -0.2%
BM_ZFlat/16 326 322 593.5MB/s xls_200 (78.00 %) +1.2%
BM_ZFlat/17 404383 400326 1.2GB/s bin (18.11 %) +1.0%
BM_ZFlat/18 69 69 2.7GB/s bin_200 (7.50 %) +0.0%
BM_ZFlat/19 74771 71348 511.7MB/s sum (48.96 %) +4.8%
BM_ZFlat/20 6461 6383 632.2MB/s man (59.21 %) +1.2%
Sum of all benchmarks 7810631 7773844 +0.5%
I've done a quick test that there are no performance regressions on external
GCC (4.9.2, Debian, Haswell, 64-bit), too.
2016-04-05 11:50:26 +02:00
Steinar H. Gunderson
2b9152d9c5
Default to glibtoolize instead of libtoolize if it exists,
...
and also make it customizable through the environment variable
$LIBTOOLIZE.
Fixes autogen.sh issues on OS X, which ships its own
(incompatible) libtoolize.
R=jeff
2016-03-10 18:37:05 +01:00