Commit Graph

161 Commits

Author SHA1 Message Date
costan 73c31e824c Fix Visual Studio build.
Commit 8f469d97e2 introduced SSSE3 fast
paths that are gated by __SSE3__ macro checks and the <x86intrin.h>
header, neither of which exists in Visual Studio. This commit adds logic
for detecting SSSE3 compiler support that works for all compilers
supported by the open source release.

The commit also replaces the header with <tmmintrin.h>, which only
defines intrinsics supported by SSSE3 and below. This should help flag
any use of SIMD instructions that require more advanced SSE support, so
the uses can be gated by checks that also work in the open source
release.

Last, this commit requires C++11 support for the open source build. This is
needed by the alignas specifier, which was also introduced in commit
8f469d97e2.
2018-08-08 22:25:14 -07:00
jefflim 27ff0af12a Improve performance of zippy decompression to IOVecs by up to almost 50%
1) Simplify loop condition for small pattern IncrementalCopy
2) Use pointers rather than indices to track current iovec.
3) Use fast IncrementalCopy
4) Bypass Append check from within AppendFromSelf

While this code greatly improves the performance of ZippyIOVecWriter, a
bigger question is whether IOVec writing should be improved, or removed.

Perf tests:

name                                 old speed      new speed      delta
BM_UFlat/0      [html             ]  2.13GB/s ± 0%  2.14GB/s ± 1%     ~
BM_UFlat/1      [urls             ]  1.22GB/s ± 0%  1.24GB/s ± 0%   +1.87%
BM_UFlat/2      [jpg              ]  17.2GB/s ± 1%  17.1GB/s ± 0%     ~
BM_UFlat/3      [jpg_200          ]  1.55GB/s ± 0%  1.53GB/s ± 2%     ~
BM_UFlat/4      [pdf              ]  12.8GB/s ± 1%  12.7GB/s ± 2%   -0.36%
BM_UFlat/5      [html4            ]  1.89GB/s ± 0%  1.90GB/s ± 1%     ~
BM_UFlat/6      [txt1             ]   811MB/s ± 0%   829MB/s ± 1%   +2.24%
BM_UFlat/7      [txt2             ]   756MB/s ± 0%   774MB/s ± 1%   +2.41%
BM_UFlat/8      [txt3             ]   860MB/s ± 0%   879MB/s ± 1%   +2.16%
BM_UFlat/9      [txt4             ]   699MB/s ± 0%   715MB/s ± 1%   +2.31%
BM_UFlat/10     [pb               ]  2.64GB/s ± 0%  2.65GB/s ± 1%     ~
BM_UFlat/11     [gaviota          ]  1.00GB/s ± 0%  0.99GB/s ± 2%     ~
BM_UFlat/12     [cp               ]  1.66GB/s ± 1%  1.66GB/s ± 2%     ~
BM_UFlat/13     [c                ]  1.53GB/s ± 0%  1.47GB/s ± 5%   -3.97%
BM_UFlat/14     [lsp              ]  1.60GB/s ± 1%  1.55GB/s ± 5%   -3.41%
BM_UFlat/15     [xls              ]  1.12GB/s ± 0%  1.15GB/s ± 0%   +1.93%
BM_UFlat/16     [xls_200          ]   918MB/s ± 2%   929MB/s ± 1%   +1.15%
BM_UFlat/17     [bin              ]  1.86GB/s ± 0%  1.89GB/s ± 1%   +1.61%
BM_UFlat/18     [bin_200          ]  1.90GB/s ± 1%  1.97GB/s ± 1%   +3.67%
BM_UFlat/19     [sum              ]  1.32GB/s ± 0%  1.33GB/s ± 1%     ~
BM_UFlat/20     [man              ]  1.39GB/s ± 0%  1.36GB/s ± 3%     ~
BM_UValidate/0  [html             ]  2.85GB/s ± 3%  2.90GB/s ± 0%     ~
BM_UValidate/1  [urls             ]  1.57GB/s ± 0%  1.56GB/s ± 0%   -0.20%
BM_UValidate/2  [jpg              ]   824GB/s ± 0%   825GB/s ± 0%   +0.11%
BM_UValidate/3  [jpg_200          ]  2.01GB/s ± 0%  2.02GB/s ± 0%   +0.10%
BM_UValidate/4  [pdf              ]  30.4GB/s ±11%  33.5GB/s ± 0%     ~
BM_UIOVec/0     [html             ]   604MB/s ± 0%   856MB/s ± 0%  +41.70%
BM_UIOVec/1     [urls             ]   440MB/s ± 0%   660MB/s ± 0%  +49.91%
BM_UIOVec/2     [jpg              ]  15.1GB/s ± 1%  15.3GB/s ± 1%   +1.22%
BM_UIOVec/3     [jpg_200          ]   567MB/s ± 1%   629MB/s ± 0%  +10.89%
BM_UIOVec/4     [pdf              ]  7.16GB/s ± 2%  8.56GB/s ± 1%  +19.64%
BM_UFlatSink/0  [html             ]  2.13GB/s ± 0%  2.16GB/s ± 0%   +1.47%
BM_UFlatSink/1  [urls             ]  1.22GB/s ± 0%  1.25GB/s ± 0%   +2.18%
BM_UFlatSink/2  [jpg              ]  17.1GB/s ± 2%  17.1GB/s ± 2%     ~
BM_UFlatSink/3  [jpg_200          ]  1.51GB/s ± 1%  1.53GB/s ± 2%   +1.11%
BM_UFlatSink/4  [pdf              ]  12.7GB/s ± 2%  12.8GB/s ± 1%   +0.67%
BM_UFlatSink/5  [html4            ]  1.90GB/s ± 0%  1.92GB/s ± 0%   +1.31%
BM_UFlatSink/6  [txt1             ]   810MB/s ± 0%   835MB/s ± 0%   +3.04%
BM_UFlatSink/7  [txt2             ]   755MB/s ± 0%   779MB/s ± 0%   +3.19%
BM_UFlatSink/8  [txt3             ]   859MB/s ± 0%   884MB/s ± 0%   +2.86%
BM_UFlatSink/9  [txt4             ]   698MB/s ± 0%   718MB/s ± 0%   +2.96%
BM_UFlatSink/10 [pb               ]  2.64GB/s ± 0%  2.67GB/s ± 0%   +1.16%
BM_UFlatSink/11 [gaviota          ]  1.00GB/s ± 0%  1.01GB/s ± 0%   +1.04%
BM_UFlatSink/12 [cp               ]  1.66GB/s ± 1%  1.68GB/s ± 1%   +0.83%
BM_UFlatSink/13 [c                ]  1.52GB/s ± 1%  1.53GB/s ± 0%   +0.38%
BM_UFlatSink/14 [lsp              ]  1.60GB/s ± 1%  1.61GB/s ± 0%   +0.91%
BM_UFlatSink/15 [xls              ]  1.12GB/s ± 0%  1.15GB/s ± 0%   +1.96%
BM_UFlatSink/16 [xls_200          ]   906MB/s ± 3%   920MB/s ± 1%   +1.55%
BM_UFlatSink/17 [bin              ]  1.86GB/s ± 0%  1.90GB/s ± 0%   +2.15%
BM_UFlatSink/18 [bin_200          ]  1.85GB/s ± 2%  1.92GB/s ± 2%   +4.01%
BM_UFlatSink/19 [sum              ]  1.32GB/s ± 1%  1.35GB/s ± 0%   +2.23%
BM_UFlatSink/20 [man              ]  1.39GB/s ± 1%  1.40GB/s ± 0%   +1.12%
BM_ZFlat/0      [html (22.31 %)   ]   800MB/s ± 0%   793MB/s ± 0%   -0.95%
BM_ZFlat/1      [urls (47.78 %)   ]   423MB/s ± 0%   424MB/s ± 0%   +0.11%
BM_ZFlat/2      [jpg (99.95 %)    ]  12.0GB/s ± 2%  12.0GB/s ± 4%     ~
BM_ZFlat/3      [jpg_200 (73.00 %)]   592MB/s ± 3%   594MB/s ± 2%     ~
BM_ZFlat/4      [pdf (83.30 %)    ]  7.26GB/s ± 1%  7.23GB/s ± 2%   -0.49%
BM_ZFlat/5      [html4 (22.52 %)  ]   738MB/s ± 0%   739MB/s ± 0%   +0.17%
BM_ZFlat/6      [txt1 (57.88 %)   ]   286MB/s ± 0%   285MB/s ± 0%   -0.09%
BM_ZFlat/7      [txt2 (61.91 %)   ]   264MB/s ± 0%   264MB/s ± 0%   +0.08%
BM_ZFlat/8      [txt3 (54.99 %)   ]   300MB/s ± 0%   300MB/s ± 0%     ~
BM_ZFlat/9      [txt4 (66.26 %)   ]   248MB/s ± 0%   247MB/s ± 0%   -0.20%
BM_ZFlat/10     [pb (19.68 %)     ]  1.04GB/s ± 0%  1.03GB/s ± 0%   -1.17%
BM_ZFlat/11     [gaviota (37.72 %)]   451MB/s ± 0%   450MB/s ± 0%   -0.35%
BM_ZFlat/12     [cp (48.12 %)     ]   543MB/s ± 0%   538MB/s ± 0%   -1.04%
BM_ZFlat/13     [c (42.47 %)      ]   638MB/s ± 1%   643MB/s ± 0%   +0.68%
BM_ZFlat/14     [lsp (48.37 %)    ]   686MB/s ± 0%   691MB/s ± 1%   +0.76%
BM_ZFlat/15     [xls (41.23 %)    ]   636MB/s ± 0%   633MB/s ± 0%   -0.52%
BM_ZFlat/16     [xls_200 (78.00 %)]   523MB/s ± 2%   520MB/s ± 2%   -0.56%
BM_ZFlat/17     [bin (18.11 %)    ]  1.01GB/s ± 0%  1.01GB/s ± 0%   +0.50%
BM_ZFlat/18     [bin_200 (7.50 %) ]  2.45GB/s ± 1%  2.44GB/s ± 1%   -0.54%
BM_ZFlat/19     [sum (48.96 %)    ]   487MB/s ± 0%   478MB/s ± 0%   -1.89%
BM_ZFlat/20     [man (59.21 %)    ]   567MB/s ± 1%   566MB/s ± 1%     ~

The BM_UFlat/13 and BM_UFlat/14 results showed high variance, so I reran them:

name               old speed      new speed      delta
BM_UFlat/13 [c  ]  1.53GB/s ± 0%  1.53GB/s ± 1%    ~
BM_UFlat/14 [lsp]  1.61GB/s ± 1%  1.61GB/s ± 1%  +0.25%
2018-08-07 23:41:17 -07:00
costan 4ffb0e62c5 Update Travis CI configuration. 2018-08-07 21:33:14 -07:00
atdt be490ef9ec Test for SSE3 suppport before using pshufb. 2018-08-04 18:51:13 -07:00
atdt 8f469d97e2 Avoid store-forwarding stalls in Zippy's IncrementalCopy
NEW: Annotate `pattern` as initialized, for MSan.

Snappy's IncrementalCopy routine optimizes for speed by reading and writing
memory in blocks of eight or sixteen bytes. If the gap between the source
and destination pointers is smaller than eight bytes, snappy's strategy is
to expand the gap by issuing a series of partly-overlapping eight-byte
loads+stores. Because the range of each load partly overlaps that of the
store which preceded it, the store buffer cannot be forwarded to the load,
and the load stalls while it waits for the store to retire. This is called a
store-forwarding stall.

We can use fewer loads and avoid most of the stalls by loading the first
eight bytes into an 128-bit XMM register, then using PSHUFB to permute the
register's contents in-place into the desired repeating sequence of bytes.
When falling back to IncrementalCopySlow, use memset if the pattern size == 1.
This eliminates around 60% of the stalls.

name                       old time/op    new time/op    delta
BM_UFlat/0 [html]        48.6µs ± 0%    48.2µs ± 0%   -0.92%        (p=0.000 n=19+18)
BM_UFlat/1 [urls]         589µs ± 0%     576µs ± 0%   -2.17%        (p=0.000 n=19+18)
BM_UFlat/2 [jpg]         7.12µs ± 0%    7.10µs ± 0%     ~           (p=0.071 n=19+18)
BM_UFlat/3 [jpg_200]      162ns ± 0%     151ns ± 0%   -7.06%        (p=0.000 n=19+18)
BM_UFlat/4 [pdf]         8.25µs ± 0%    8.19µs ± 0%   -0.74%        (p=0.000 n=19+18)
BM_UFlat/5 [html4]        218µs ± 0%     218µs ± 0%   +0.09%        (p=0.000 n=17+18)
BM_UFlat/6 [txt1]         191µs ± 0%     189µs ± 0%   -1.12%        (p=0.000 n=19+18)
BM_UFlat/7 [txt2]         168µs ± 0%     167µs ± 0%   -1.01%        (p=0.000 n=19+18)
BM_UFlat/8 [txt3]         502µs ± 0%     499µs ± 0%   -0.52%        (p=0.000 n=19+18)
BM_UFlat/9 [txt4]         704µs ± 0%     695µs ± 0%   -1.26%        (p=0.000 n=19+18)
BM_UFlat/10 [pb]         45.6µs ± 0%    44.2µs ± 0%   -3.13%        (p=0.000 n=19+15)
BM_UFlat/11 [gaviota]     188µs ± 0%     194µs ± 0%   +3.06%        (p=0.000 n=15+18)
BM_UFlat/12 [cp]         15.1µs ± 2%    14.7µs ± 1%   -2.09%        (p=0.000 n=18+18)
BM_UFlat/13 [c]          7.38µs ± 0%    7.36µs ± 0%   -0.28%        (p=0.000 n=16+18)
BM_UFlat/14 [lsp]        2.31µs ± 0%    2.37µs ± 0%   +2.64%        (p=0.000 n=19+18)
BM_UFlat/15 [xls]         984µs ± 0%     909µs ± 0%   -7.59%        (p=0.000 n=19+18)
BM_UFlat/16 [xls_200]     215ns ± 0%     217ns ± 0%   +0.71%        (p=0.000 n=19+15)
BM_UFlat/17 [bin]         289µs ± 0%     287µs ± 0%   -0.71%        (p=0.000 n=19+18)
BM_UFlat/18 [bin_200]     161ns ± 0%     116ns ± 0%  -28.09%        (p=0.000 n=19+16)
BM_UFlat/19 [sum]        31.9µs ± 0%    29.2µs ± 0%   -8.37%        (p=0.000 n=19+18)
BM_UFlat/20 [man]        3.13µs ± 1%    3.07µs ± 0%   -1.79%        (p=0.000 n=19+18)

name                       old allocs/op  new allocs/op  delta
BM_UFlat/0 [html]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/1 [urls]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/2 [jpg]          0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/3 [jpg_200]      0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/4 [pdf]          0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/5 [html4]        0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/6 [txt1]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/7 [txt2]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/8 [txt3]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/9 [txt4]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/10 [pb]          0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/11 [gaviota]     0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/12 [cp]          0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/13 [c]           0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/14 [lsp]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/15 [xls]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/16 [xls_200]     0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/17 [bin]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/18 [bin_200]     0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/19 [sum]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)
BM_UFlat/20 [man]         0.00 ±NaN%     0.00 ±NaN%     ~     (all samples are equal)

name                       old speed      new speed      delta
BM_UFlat/0 [html]      2.11GB/s ± 0%  2.13GB/s ± 0%   +0.92%        (p=0.000 n=19+18)
BM_UFlat/1 [urls]      1.19GB/s ± 0%  1.22GB/s ± 0%   +2.22%        (p=0.000 n=16+17)
BM_UFlat/2 [jpg]       17.3GB/s ± 0%  17.3GB/s ± 0%     ~           (p=0.074 n=19+18)
BM_UFlat/3 [jpg_200]   1.23GB/s ± 0%  1.33GB/s ± 0%   +7.58%        (p=0.000 n=19+18)
BM_UFlat/4 [pdf]       12.4GB/s ± 0%  12.5GB/s ± 0%   +0.74%        (p=0.000 n=19+18)
BM_UFlat/5 [html4]     1.88GB/s ± 0%  1.88GB/s ± 0%   -0.09%        (p=0.000 n=18+18)
BM_UFlat/6 [txt1]       798MB/s ± 0%   807MB/s ± 0%   +1.13%        (p=0.000 n=19+18)
BM_UFlat/7 [txt2]       743MB/s ± 0%   751MB/s ± 0%   +1.02%        (p=0.000 n=19+18)
BM_UFlat/8 [txt3]       850MB/s ± 0%   855MB/s ± 0%   +0.52%        (p=0.000 n=19+18)
BM_UFlat/9 [txt4]       684MB/s ± 0%   693MB/s ± 0%   +1.28%        (p=0.000 n=19+18)
BM_UFlat/10 [pb]       2.60GB/s ± 0%  2.69GB/s ± 0%   +3.25%        (p=0.000 n=19+16)
BM_UFlat/11 [gaviota]   979MB/s ± 0%   950MB/s ± 0%   -2.97%        (p=0.000 n=15+18)
BM_UFlat/12 [cp]       1.63GB/s ± 2%  1.67GB/s ± 1%   +2.13%        (p=0.000 n=18+18)
BM_UFlat/13 [c]        1.51GB/s ± 0%  1.52GB/s ± 0%   +0.29%        (p=0.000 n=16+18)
BM_UFlat/14 [lsp]      1.61GB/s ± 1%  1.57GB/s ± 0%   -2.57%        (p=0.000 n=19+18)
BM_UFlat/15 [xls]      1.05GB/s ± 0%  1.13GB/s ± 0%   +8.22%        (p=0.000 n=19+18)
BM_UFlat/16 [xls_200]   928MB/s ± 0%   921MB/s ± 0%   -0.81%        (p=0.000 n=19+17)
BM_UFlat/17 [bin]      1.78GB/s ± 0%  1.79GB/s ± 0%   +0.71%        (p=0.000 n=19+18)
BM_UFlat/18 [bin_200]  1.24GB/s ± 0%  1.72GB/s ± 0%  +38.92%        (p=0.000 n=19+18)
BM_UFlat/19 [sum]      1.20GB/s ± 0%  1.31GB/s ± 0%   +9.15%        (p=0.000 n=19+18)
BM_UFlat/20 [man]      1.35GB/s ± 1%  1.38GB/s ± 0%   +1.84%        (p=0.000 n=19+18)
2018-08-04 18:51:07 -07:00
costan 4f7bd2dbfd Update CI configurations.
Bump GCC and Clang on Travis and remove Visual Studio 2015 from AppVeyor.
2018-03-09 09:02:34 -08:00
jgorbe ca37ab7fb9 Ensure DecompressAllTags starts on a 32-byte boundary + 16 bytes.
First of all, I'm sorry about this ugly hack. I hope the following long
explanation is enough to justify it.

We have observed that, in some conditions, the results for dataset number 10
(pb) in the zippy benchmark can show a >20% regression on Skylake CPUs.

In order to diagnose this, we profiled the benchmark looking at hot functions
(99% of the time is spent on DecompressAllTags), then looked at the generated
code to see if there was any difference. In order to discard a minor difference
we observed in register allocation we replaced zippy.cc with a pre-built assembly
file so it was the same in both variants, and we still were able to reproduce the
regression.

After discarding a regression caused by the compiler, we digged a bit further
and noticed that the alignment of the function in the final binary was
different. Both were aligned to a 16-byte boundary, but the slower one was also
(by chance) aligned to a 32-byte boundary. A regression caused by alignment
differences would explain why I could reproduce it consistently on the same CitC
client, but not others: slight differences in the sources can cause the resulting
binary to have different layout.

Here are some detailed benchmark results before/after the fix. Note how fixing
the alignment makes the difference between baseline and experiment go away, but
regular 32-byte alignment puts both variants in the same ballpark as the
original regression:

Original (note BM_UCord_10 and BM_UDataBuffer_10 around the -24% line):

  BASELINE
  BM_UCord/10                    2938           2932          24194 3.767GB/s  pb
  BM_UDataBuffer/10              3008           3004          23316 3.677GB/s  pb

  EXPERIMENT
  BM_UCord/10                    3797           3789          18512 2.915GB/s  pb
  BM_UDataBuffer/10              4024           4016          17543 2.750GB/s  pb

Aligning DecompressAllTags to a 32-byte boundary:

  BASELINE
  BM_UCord/10                    3872           3862          18035 2.860GB/s  pb
  BM_UDataBuffer/10              4010           3998          17591 2.763GB/s  pb

  EXPERIMENT
  BM_UCord/10                    3884           3876          18126 2.850GB/s  pb
  BM_UDataBuffer/10              4037           4027          17199 2.743GB/s  pb

Aligning DecompressAllTags to a 32-byte boundary + 16 bytes (this patch):

  BASELINE
  BM_UCord/10                    3103           3095          22642 3.569GB/s  pb
  BM_UDataBuffer/10              3186           3177          21947 3.476GB/s  pb

  EXPERIMENT
  BM_UCord/10                    3104           3095          22632 3.569GB/s  pb
  BM_UDataBuffer/10              3167           3159          22076 3.496GB/s  pb

This change forces the "good" alignment for DecompressAllTags which, if
anything, should make benchmark results more stable (and maybe we'll improve
some unlucky application!).
2018-02-17 00:47:18 -08:00
scrubbed 15a2804cd2 Fix an incorrect analysis / comment in the "pattern doubling" code.
This should have a miniscule positive effect on performance; the
main idea of the CL is just to fix the incorrect comment.
2018-02-17 00:46:31 -08:00
costan e69d9f8806 Fix Travis CI configuration for OSX. 2018-01-04 15:27:36 -08:00
chandlerc 4aba5426d4 Rework a very hot, very sensitive part of snappy to reduce the number of
instructions, the number of dynamic branches, and avoid a particular
loop structure than LLVM has a very hard time optimizing for this
particular case.

The code being changed is part of the hottest path for snappy
decompression. In the benchmarks for decompressing protocol buffers,
this has proven to be amazingly sensitive to the slightest changes in
code layout. For example, previously we added '.p2align 5' assembly
directive to the code. This essentially padded the loop out from the
function. Merely by doing this we saw significant performance
improvements.

As a consequence, several of the compiler's typically reasonable
optimizations can have surprising bad impacts. Loop unrolling is a
primary culprit, but in the next LLVM release we are seeing an issue due
to loop rotation. While some of the problems caused by the newly
triggered loop rotation in LLVM can be mitigated with ongoing work on
LLVM's code layout optimizations (specifically, loop header cloning),
that is a fairly long term project. And even minor fluctuations in how
that subsequent optimization is performed may prevent gaining the
performance back.

For now, we need some way to unblock the next LLVM release which
contains a generic improvement to the LLVM loop optimizer that enables
loop rotation in more places, but uncovers this sensitivity and weakness
in a particular case.

This CL restructures the loop to have a simpler structure. Specifically,
we eagerly test what the terminal condition will be and provide two
versions of the copy loop that use a single loop predicate.

The comments in the source code and benchmarks indicate that only one of
these two cases is actually hot: we expect to generally have enough slop
in the buffer. That in turn allows us to generate a much simpler branch
and loop structure for the hot path (especially for the protocol buffer
decompression benchmark).

However, structuring even this simple loop in a way that doesn't trigger
some other performance bubble (often a more severe one) is quite
challenging. We have to carefully manage the variables used in the loop
and the addressing pattern. We should teach LLVM how to do this
reliably, but that too is a *much* more significant undertaking and is
extremely rare to have this degree of importance. The desired structure
of the loop, as shown with IACA's analysis for the broadwell
micro-architecture (HSW and SKX are similar):

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1-0x8]
|   2^   |           |     |           | 0.4       | 1.0 |     |     | 0.6 |    | mov qword ptr [rdi], rcx
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | mov rcx, qword ptr [rdi+rdx*1]
|   2^   |           |     | 0.3       |           | 1.0 |     |     | 0.7 |    | mov qword ptr [rdi+0x8], rcx
|   1    | 0.5       |     |           |           |     | 0.5 |     |     |    | add rdi, 0x10
|   1    | 0.2       |     |           |           |     |     | 0.8 |     |    | cmp rdi, rax
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xffffffffffffffe9

Specifically, the arrangement of addressing modes for the stores such
that micro-op fusion (indicated by the `^` on the `2` micro-op count) is
important to achieve good throughput for this loop.

The other thing necessary to make this change effective is to remove our
previous hack using `.p2align 5` to pad out the main decompression loop,
and to forcibly disable loop unrolling for critical loops. Because this
change simplifies the loop structure, more unrolling opportunities show
up. Also, the next LLVM release's generic loop optimization improvements
allow unrolling in more places, requiring still more disabling of
unrolling in this change.  Perhaps most surprising of these is that we
must disable loop unrolling in the *slow* path. While unrolling there
seems pointless, it should also be harmless.  This cold code is laid out
very far away from all of the hot code. All the samples shown in a
profile of the benchmark occur before this loop in the function. And
yet, if the loop gets unrolled (which seems to only happen reliably with
the next LLVM release) we see a nearly 20% regression in decompressing
protocol buffers!

With the current release of LLVM, we still observe some regression from
this source change, but it is fairly small (5% on decompressing protocol
buffers, less elsewhere). And with the next LLVM release it drops to
under 1% even in that case. Meanwhile, without this change, the next
release of LLVM will regress decompressing protocol buffers by more than
10%.
2018-01-04 15:27:15 -08:00
costan 26102a0c66 Fix generated version number in open source release.
Lands GitHub PR #61. The patch was also independently contributed by
Martin Gieseking <martin.gieseking@uos.de>.
2017-12-20 14:32:54 -08:00
costan b02bfa754e Tag open source release 1.1.7. 2017-08-24 16:54:23 -07:00
wmi 824e6718b5 Add a loop alignment directive to work around a performance regression.
We found LLVM upstream change at rL310792 degraded zippy benchmark by
~3%. Performance analysis showed the regression was caused by some
side-effect. The incidental loop alignment change (from 32 bytes to 16
bytes) led to increase of branch miss prediction and caused the
regression. The regression was reproducible on several intel
micro-architectures, like sandybridge, haswell and skylake. Sadly we
still don't have good understanding about the internal of intel branch
predictor and cannot explain how the branch miss prediction increases
when the loop alignment changes, so we cannot make a real fix here. The
workaround solution in the patch is to add a directive, align the hot
loop to 32 bytes, which can restore the performance. This is in order to
unblock the flip of default compiler to LLVM.
2017-08-24 16:54:12 -07:00
costan 55924d1109 Add GNUInstallDirs to CMake configuration.
This is modeled after https://github.com/google/googletest/pull/1160.
The immediate benefit is fixing the library install paths on 64-bit
Linux distributions, which tend to support running 32-bit and 64-bit
code side by side by installing 32-bit libraries in /usr/lib and  64-bit
libraries in /usr/lib64.
2017-08-16 19:19:31 -07:00
costan 632cd0f128 Use 64-bit optimized code path for ARM64.
This is inspired by https://github.com/google/snappy/pull/22.

Benchmark results with the change, Pixel C with Android N2G48B

Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0             119544     119253       1501 818.9MB/s  html
BM_UFlat/1            1223950    1208588        163 554.0MB/s  urls
BM_UFlat/2              16081      15962      11527 7.2GB/s  jpg
BM_UFlat/3                356        352     416666 540.6MB/s  jpg_200
BM_UFlat/4              25010      24860       7683 3.8GB/s  pdf
BM_UFlat/5             484832     481572        407 811.1MB/s  html4
BM_UFlat/6             408410     408713        482 354.9MB/s  txt1
BM_UFlat/7             361714     361663        553 330.1MB/s  txt2
BM_UFlat/8            1090582    1087912        182 374.1MB/s  txt3
BM_UFlat/9            1503127    1503759        133 305.6MB/s  txt4
BM_UFlat/10            114183     114285       1715 989.6MB/s  pb
BM_UFlat/11            406714     407331        491 431.5MB/s  gaviota
BM_UIOVec/0            370397     369888        538 264.0MB/s  html
BM_UIOVec/1           3207510    3190000        100 209.9MB/s  urls
BM_UIOVec/2             16589      16573      11223 6.9GB/s  jpg
BM_UIOVec/3              1052       1052     165289 181.2MB/s  jpg_200
BM_UIOVec/4             49151      49184       3985 1.9GB/s  pdf
BM_UValidate/0          68115      68095       2893 1.4GB/s  html
BM_UValidate/1         792652     792000        250 845.4MB/s  urls
BM_UValidate/2            334        334     487804 343.1GB/s  jpg
BM_UValidate/3            235        235     666666 809.9MB/s  jpg_200
BM_UValidate/4           6126       6130      32626 15.6GB/s  pdf
BM_ZFlat/0             292697     290560        678 336.1MB/s  html (22.31 %)
BM_ZFlat/1            4062080    4050000        100 165.3MB/s  urls (47.78 %)
BM_ZFlat/2              29225      29274       6422 3.9GB/s  jpg (99.95 %)
BM_ZFlat/3               1099       1098     163934 173.7MB/s  jpg_200 (73.00 %)
BM_ZFlat/4              44117      44233       4205 2.2GB/s  pdf (83.30 %)
BM_ZFlat/5            1158058    1157894        171 337.4MB/s  html4 (22.52 %)
BM_ZFlat/6            1102983    1093922        181 132.6MB/s  txt1 (57.88 %)
BM_ZFlat/7             974142     975490        204 122.4MB/s  txt2 (61.91 %)
BM_ZFlat/8            2984670    2990000        100 136.1MB/s  txt3 (54.99 %)
BM_ZFlat/9            4100130    4090000        100 112.4MB/s  txt4 (66.26 %)
BM_ZFlat/10            276236     275139        716 411.0MB/s  pb (19.68 %)
BM_ZFlat/11            760091     759541        262 231.4MB/s  gaviota (37.72 %)

Baseline benchmark results, Pixel C with Android N2G48B

Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0             148957     147565       1335 661.8MB/s  html
BM_UFlat/1            1527257    1500000        132 446.4MB/s  urls
BM_UFlat/2              19589      19397       8764 5.9GB/s  jpg
BM_UFlat/3                425        418     408163 455.3MB/s  jpg_200
BM_UFlat/4              30096      29552       6497 3.2GB/s  pdf
BM_UFlat/5             595933     594594        333 657.0MB/s  html4
BM_UFlat/6             516315     514360        383 282.0MB/s  txt1
BM_UFlat/7             454653     453514        441 263.2MB/s  txt2
BM_UFlat/8            1382687    1361111        144 299.0MB/s  txt3
BM_UFlat/9            1967590    1904761        105 241.3MB/s  txt4
BM_UFlat/10            148271     144560       1342 782.3MB/s  pb
BM_UFlat/11            523997     510471        382 344.4MB/s  gaviota
BM_UIOVec/0            478443     465227        417 209.9MB/s  html
BM_UIOVec/1           4172860    4060000        100 164.9MB/s  urls
BM_UIOVec/2             21470      20975       7342 5.5GB/s  jpg
BM_UIOVec/3              1357       1330      75187 143.4MB/s  jpg_200
BM_UIOVec/4             63143      61365       3031 1.6GB/s  pdf
BM_UValidate/0          86910      85125       2279 1.1GB/s  html
BM_UValidate/1        1022256    1000000        195 669.6MB/s  urls
BM_UValidate/2            420        417     400000 274.6GB/s  jpg
BM_UValidate/3            311        302     571428 630.0MB/s  jpg_200
BM_UValidate/4           7778       7584      25445 12.6GB/s  pdf
BM_ZFlat/0             469209     457547        424 213.4MB/s  html (22.31 %)
BM_ZFlat/1            5633510    5460000        100 122.6MB/s  urls (47.78 %)
BM_ZFlat/2              37896      36693       4524 3.1GB/s  jpg (99.95 %)
BM_ZFlat/3               1485       1441     123456 132.3MB/s  jpg_200 (73.00 %)
BM_ZFlat/4              74870      72775       2652 1.3GB/s  pdf (83.30 %)
BM_ZFlat/5            1857321    1785714        112 218.8MB/s  html4 (22.52 %)
BM_ZFlat/6            1538723    1492307        130 97.2MB/s  txt1 (57.88 %)
BM_ZFlat/7            1338236    1310810        148 91.1MB/s  txt2 (61.91 %)
BM_ZFlat/8            4050820    4040000        100 100.7MB/s  txt3 (54.99 %)
BM_ZFlat/9            5234940    5230000        100 87.9MB/s  txt4 (66.26 %)
BM_ZFlat/10            400309     400000        495 282.7MB/s  pb (19.68 %)
BM_ZFlat/11           1063042    1058510        188 166.1MB/s  gaviota (37.72 %)
2017-08-16 19:18:22 -07:00
costan 77c12adc19 Add unistd.h checks back to the CMake build.
getpagesize(), as well as its POSIX.2001 replacement
sysconf(_SC_PAGESIZE), is defined in <unistd.h>. On Linux and OS X,
including <sys/mman.h> is sufficient to get a definition for
getpagesize(). However, this is not true for the Android NDK. This CL
brings back the HAVE_UNISTD_H definition and its associated header
check.

This also adds a HAVE_FUNC_SYSCONF definition, which checks for the
presence of sysconf(). The definition can be used later to replace
getpagesize() with sysconf().
2017-08-02 10:56:06 -07:00
costan c8049c5827 Replace getpagesize() with sysconf(_SC_PAGESIZE).
getpagesize() has been removed from POSIX.1-2001. Its recommended
replacement is sysconf(_SC_PAGESIZE).
2017-08-01 14:38:57 -07:00
costan 18e2f220d8 Add guidelines for opensource contributions.
The guidelines follow the instructions at
https://opensource.google.com/docs/releasing/preparing/#CONTRIBUTING
2017-08-01 14:38:24 -07:00
costan f0d3237c32 Use _BitScanForward and _BitScanReverse on MSVC.
Based on https://github.com/google/snappy/pull/30
2017-08-01 14:38:02 -07:00
jueminyang 71b8f86887 Add SNAPPY_ prefix to PREDICT_{TRUE,FALSE} macros. 2017-08-01 14:36:26 -07:00
costan be6dc3db83 Redo CMake configuration.
The style was changed to match the official manual [1], the install
configuration was simplified and now matches the official packaging
guide [2], and the config files use the CMake-specific variable syntax
${VAR} instead of the autoconf-compatible syntax @VAR@, as documented in
[3]. The public header files are declared as such (for CMake 3.3+), and
the generated headers are included in the library target definition.

The tests are only built if SNAPPY_BUILD_TESTS (default ON) is true, so
zippy can be easily used in projects that add_subdirectory() its source
code directly, instead of using find_package().

[1] https://cmake.org/cmake/help/git-master/manual/cmake-language.7.html
[2] https://cmake.org/cmake/help/git-master/manual/cmake-packages.7.html
[3] https://cmake.org/cmake/help/git-master/command/configure_file.html
2017-07-28 10:14:21 -07:00
costan e4de6ce087 Small improvements to open source CI configuration.
This CL fixes 64-bit Windows testing (), makes it possible to view the
test output in the Travis / AppVeyor CI console while the test is
running, and takes advantage of the new support for the .appveyor.yml
file name to make the CI configuration less obtrusive.
2017-07-27 16:46:54 -07:00
costan c756f7f5d9 Support both static and shared library CMake builds.
This can be used to fix https://github.com/Homebrew/homebrew-core/issues/15722.
2017-07-27 16:46:54 -07:00
costan 038a3329b1 Inline DISALLOW_COPY_AND_ASSIGN.
snappy-stubs-public.h defined the DISALLOW_COPY_AND_ASSIGN macro, so the
definition propagated to all translation units that included the open
source headers. The macro is now inlined, thus avoiding polluting the
macro environment of snappy users.
2017-07-27 16:46:42 -07:00
costan a8b239c3de snappy: Remove autoconf build configuration. 2017-07-25 18:20:38 -07:00
costan 27671c6aec Clean up CMake header and type checks.
Unused macros: HAVE_DLFCN_H, HAVE_INTTYPES_H, HAVE_MEMORY_H,
HAVE_STDLIB_H, HAVE_STRINGS_H, HAVE_STRING_H, HAVE_SYS_BYTESWAP_H,
HAVE_SYS_STAT_H, HAVE_SYS_TYPES_H, HAVE_UNISTD_H.

Used but never set macros: HAVE_LIBLZF, HAVE_LIBQUICKLZ. These only gate
conditional includes. The code that takes advantage of them was removed.

Unused types: ssize_t.

The testing code uses HAVE_FUNC_MMAP, which was not wired in the CMake
build, causing a whole test to be skipped.
2017-07-25 18:17:35 -07:00
costan 548501c988 zippy: Re-release snappy 1.1.5 as 1.1.6.
The migration from autotools to CMake in 1.1.5 wasn't as smooth as
intended. The SONAME / SOVERSION were broken in both build systems,
causing breakages in systems that upgraded from snappy 1.1.4 to 1.1.5,
as reported in https://github.com/Homebrew/homebrew-core/issues/15274
and https://github.com/google/snappy/pull/45.
2017-07-13 03:56:49 -07:00
costan 513df5fb5a Tag open source release 1.1.5. 2017-06-28 18:37:30 -07:00
costan 5bc9c82ae3 Set minimum CMake version to 3.1.
The project only needs CMake 3.1 features, and some Travis CI bots have
CMake 3.2.2. Therefore, requiring CMake 3.4 is inconvenient.
2017-06-28 18:37:08 -07:00
costan e9720a001d Update Travis CI config, add AppVeyor for Windows CI coverage. 2017-06-28 18:36:37 -07:00
tmsriram f24f9d2d97 Explicitly copy internal::wordmask to the stack array to work around a compiler
optimization with LLVM that converts const stack arrays to global arrays.  This
is a temporary change and should be reverted when https://reviews.llvm.org/D30759
is fixed.

With PIE, accessing stack arrays is more efficient than global arrays and
wordmask was moved to the stack due to that.  However, the LLVM compiler
automatically converts stack arrays, detected as constant, to global arrays
and this transformation hurts PIE performance with LLVM.

We are working to fix this in the LLVM compiler, via
https://reviews.llvm.org/D30759, to not do this conversion in PIE mode.  Until
this patch is finished, please consider this source change as a temporary
work around to keep this array on the stack.  This source change is important
to allow some projects to flip the default compiler from GCC to LLVM for
optimized builds.

This change works for the following reason.  The LLVM compiler does not convert
non-const stack arrays to global arrays and explicitly copying the elements is
enough to make the compiler assume that this is a non-const array.

With GCC, this change does not affect code-gen in any significant way.  The
array initialization code is slightly different as it copies the constants
directly to the stack.

With LLVM, this keeps the array on the stack.

No change in performance with GCC (within noise range). With LLVM, ~0.7%
improvement in optimized mode (no FDO) and ~1.75% improvement in FDO
mode.
2017-06-28 18:34:54 -07:00
ysaed 82deffcde7 Remove benchmarking support for fastlz. 2017-06-28 18:33:55 -07:00
alkis 18488d6212 Use 64 bit little endian on ppc64le.
This has tangible performance benefits.

This lands https://github.com/google/snappy/pull/27
2017-06-28 18:33:13 -07:00
alkis 7b9532b878 Improve the SSE2 macro check on Windows.
This lands https://github.com/google/snappy/pull/37
2017-06-05 13:54:17 -07:00
alkis 7dadceea52 Check for the existence of sys/uio.h in autoconf build.
This lands https://github.com/google/snappy/pull/32
2017-06-05 13:54:17 -07:00
jyrki 83179dd8be Remove quicklz and lzf support in benchmarks. 2017-06-05 13:54:10 -07:00
vrabaud c8131680d0 Provide a CMakeLists.txt.
This lands https://github.com/google/snappy/pull/29
2017-06-05 13:53:29 -07:00
costan ed3b7b242b Clean up unused function warnings in snappy. 2017-03-17 13:59:03 -07:00
costan 8b60aac4fd Remove "using namespace std;" from zippy-stubs-internal.h.
This makes it easier to build zippy, as some compiles require a warning
suppression to accept "using namespace std".
2017-03-13 13:03:01 -07:00
costan 7d7a8ec805 Add Travis CI configuration to snappy and fix the make build.
The make build in the open source version uses autoconf, which is set up
to expect a project that follows the gnu standard.
2017-03-10 12:40:15 -08:00
alkis 1cd3ab02e9 Rename README to README.md. It already in markdown, we might as well let github know so that it renders nicely. 2017-03-08 12:05:05 -08:00
alkis 597fa795de Delete UnalignedCopy64 from snappy-stubs since the version in snappy.cc is more robust and possibly faster (assuming the compiler knows how to best copy 8 bytes between locations in memory the fastest way possible - a rather safe bet). 2017-03-08 11:42:30 -08:00
scrubbed 039b3a7ace Add std:: prefix to STL non-type names.
In order to disable global using declarations, this CL qualifies
stl names with the std namespace.
2017-03-08 11:42:30 -08:00
alkis 3c706d2230 Make UnalignedCopy64 not exhibit undefined behavior when src and dst overlap.
name         old speed      new speed      delta
BM_UFlat/0   3.09GB/s ± 3%  3.07GB/s ± 2%  -0.78%  (p=0.009 n=19+19)
BM_UFlat/1   1.63GB/s ± 2%  1.62GB/s ± 2%    ~     (p=0.099 n=19+20)
BM_UFlat/2   19.7GB/s ±19%  20.7GB/s ±11%    ~     (p=0.054 n=20+19)
BM_UFlat/3   1.61GB/s ± 2%  1.60GB/s ± 1%  -0.48%  (p=0.049 n=20+17)
BM_UFlat/4   15.8GB/s ± 7%  15.6GB/s ±10%    ~     (p=0.234 n=20+20)
BM_UFlat/5   2.47GB/s ± 1%  2.46GB/s ± 2%    ~     (p=0.608 n=19+19)
BM_UFlat/6   1.07GB/s ± 2%  1.07GB/s ± 1%    ~     (p=0.128 n=20+19)
BM_UFlat/7   1.01GB/s ± 1%  1.00GB/s ± 2%    ~     (p=0.656 n=15+19)
BM_UFlat/8   1.13GB/s ± 1%  1.13GB/s ± 1%    ~     (p=0.532 n=18+19)
BM_UFlat/9    918MB/s ± 1%   916MB/s ± 1%    ~     (p=0.443 n=19+18)
BM_UFlat/10  3.90GB/s ± 1%  3.90GB/s ± 1%    ~     (p=0.895 n=20+19)
BM_UFlat/11  1.30GB/s ± 1%  1.29GB/s ± 2%    ~     (p=0.156 n=19+19)
BM_UFlat/12  2.35GB/s ± 2%  2.34GB/s ± 1%    ~     (p=0.349 n=19+17)
BM_UFlat/13  2.07GB/s ± 1%  2.06GB/s ± 2%    ~     (p=0.475 n=18+19)
BM_UFlat/14  2.23GB/s ± 1%  2.23GB/s ± 1%    ~     (p=0.983 n=19+19)
BM_UFlat/15  1.55GB/s ± 1%  1.55GB/s ± 1%    ~     (p=0.314 n=19+19)
BM_UFlat/16  1.26GB/s ± 1%  1.26GB/s ± 1%    ~     (p=0.907 n=15+18)
BM_UFlat/17  2.32GB/s ± 1%  2.32GB/s ± 1%    ~     (p=0.604 n=18+19)
BM_UFlat/18  1.61GB/s ± 1%  1.61GB/s ± 1%    ~     (p=0.212 n=18+19)
BM_UFlat/19  1.78GB/s ± 1%  1.78GB/s ± 2%    ~     (p=0.350 n=19+19)
BM_UFlat/20  1.89GB/s ± 1%  1.90GB/s ± 2%    ~     (p=0.092 n=19+19)

Also tested the current version against UNALIGNED_STORE64(dst, UNALIGNED_LOAD64(src)), there is no difference (old is memcpy, new is UNALIGNED*):

name         old speed      new speed      delta
BM_UFlat/0   3.14GB/s ± 1%  3.16GB/s ± 2%    ~     (p=0.156 n=19+19)
BM_UFlat/1   1.62GB/s ± 1%  1.61GB/s ± 2%    ~     (p=0.102 n=19+20)
BM_UFlat/2   18.8GB/s ±17%  19.1GB/s ±11%    ~     (p=0.390 n=20+16)
BM_UFlat/3   1.59GB/s ± 1%  1.58GB/s ± 1%  -1.06%  (p=0.000 n=18+18)
BM_UFlat/4   15.8GB/s ± 6%  15.6GB/s ± 7%    ~     (p=0.184 n=19+20)
BM_UFlat/5   2.46GB/s ± 1%  2.44GB/s ± 1%  -0.95%  (p=0.000 n=19+18)
BM_UFlat/6   1.08GB/s ± 1%  1.06GB/s ± 1%  -1.17%  (p=0.000 n=19+18)
BM_UFlat/7   1.00GB/s ± 1%  0.99GB/s ± 1%  -1.16%  (p=0.000 n=19+18)
BM_UFlat/8   1.14GB/s ± 2%  1.12GB/s ± 1%  -1.12%  (p=0.000 n=19+18)
BM_UFlat/9    921MB/s ± 1%   914MB/s ± 1%  -0.84%  (p=0.000 n=20+17)
BM_UFlat/10  3.94GB/s ± 2%  3.92GB/s ± 1%    ~     (p=0.058 n=19+17)
BM_UFlat/11  1.29GB/s ± 1%  1.28GB/s ± 1%  -0.77%  (p=0.001 n=19+17)
BM_UFlat/12  2.34GB/s ± 1%  2.31GB/s ± 1%  -1.10%  (p=0.000 n=18+18)
BM_UFlat/13  2.06GB/s ± 1%  2.05GB/s ± 1%  -0.73%  (p=0.001 n=19+18)
BM_UFlat/14  2.22GB/s ± 1%  2.20GB/s ± 1%  -0.73%  (p=0.000 n=18+18)
BM_UFlat/15  1.55GB/s ± 1%  1.53GB/s ± 1%  -1.07%  (p=0.000 n=19+18)
BM_UFlat/16  1.26GB/s ± 1%  1.25GB/s ± 1%  -0.79%  (p=0.000 n=18+18)
BM_UFlat/17  2.31GB/s ± 1%  2.29GB/s ± 1%  -0.98%  (p=0.000 n=20+18)
BM_UFlat/18  1.61GB/s ± 1%  1.60GB/s ± 2%  -0.71%  (p=0.001 n=20+19)
BM_UFlat/19  1.77GB/s ± 1%  1.76GB/s ± 1%  -0.61%  (p=0.007 n=19+18)
BM_UFlat/20  1.89GB/s ± 1%  1.88GB/s ± 1%  -0.75%  (p=0.000 n=20+18)
2017-03-08 11:42:30 -08:00
skanev d3c6d20d0a Add compression size reporting hooks.
Also, force inlining util::compression::Sample().

The inlining change is necessary. Without it even with FDO+LIPO the call
doesn't get inlined and uses 4 registers to construct parameters (which
won't be used in the common case). In some of the more compute-bound
tests that causes extra spills and significant overhead (even if
call is sufficiently long).
For example, with inlining:
BM_UFlat/0         32.7µs ± 1%    33.1µs ± 1%  +1.41%
without:
BM_UFlat/0         32.7µs ± 1%    37.7µs ± 1%  +15.29%
2017-03-08 11:42:21 -08:00
alkis 626e1b9faa Use #ifdef __SSE2__ for the emmintrin.h include, otherwise snappy.cc does not compile with -march=prescott. 2017-03-07 18:09:49 -08:00
Alkis Evlogimenos 2d99bd14d4 1.1.4 release. 2017-01-27 09:12:04 +01:00
Alkis Evlogimenos 8bfb028b61 Improve zippy decompression speed.
The CL contains the following optimizations:

1) rewrite IncrementalCopy routine: single routine that splits the code into sections based on typical probabilities observed across a variety of inputs and helps reduce branch mispredictions both for FDO and non-FDO builds. IncrementalCopy is an adaptive routine that selects the best strategy based on input.
2) introduce UnalignedCopy128 that copies 128 bits per cycle using SSE2.
3) add branch hint for the main decoding loop. The non-literal case is taken more often in benchmarks. I expect this to be a noop in production with FDO. Note that this became apparent after step 1 above.
4) use the new IncrementalCopy in ZippyScatteredWriter.

I test two archs: x86_haswell and ppc_power8.

For x86_haswell I use FDO. For ppc_power8 I do not use FDO.

x86_haswell + FDO

name                   old speed      new speed      delta
BM_UCord/0             1.97GB/s ± 1%  3.19GB/s ± 1%  +62.08%  (p=0.000 n=19+18)
BM_UCord/1             1.28GB/s ± 1%  1.51GB/s ± 1%  +18.14%  (p=0.000 n=19+18)
BM_UCord/2             15.6GB/s ± 9%  15.5GB/s ± 7%     ~     (p=0.620 n=20+20)
BM_UCord/3              811MB/s ± 1%   808MB/s ± 1%   -0.38%  (p=0.009 n=17+18)
BM_UCord/4             12.4GB/s ± 4%  12.7GB/s ± 8%   +2.70%  (p=0.002 n=17+20)
BM_UCord/5             1.77GB/s ± 0%  2.33GB/s ± 1%  +31.37%  (p=0.000 n=18+18)
BM_UCord/6              900MB/s ± 1%  1006MB/s ± 1%  +11.71%  (p=0.000 n=18+17)
BM_UCord/7              858MB/s ± 1%   938MB/s ± 2%   +9.36%  (p=0.000 n=19+16)
BM_UCord/8              921MB/s ± 1%   985MB/s ±21%   +6.94%  (p=0.028 n=19+20)
BM_UCord/9              824MB/s ± 1%   800MB/s ±20%     ~     (p=0.113 n=19+20)
BM_UCord/10            2.60GB/s ± 1%  3.67GB/s ±21%  +41.31%  (p=0.000 n=19+20)
BM_UCord/11            1.07GB/s ± 1%  1.21GB/s ± 1%  +13.17%  (p=0.000 n=16+16)
BM_UCord/12            1.84GB/s ± 8%  2.18GB/s ± 1%  +18.44%  (p=0.000 n=16+19)
BM_UCord/13            1.83GB/s ±18%  1.89GB/s ± 1%   +3.14%  (p=0.000 n=17+19)
BM_UCord/14            1.96GB/s ± 2%  1.97GB/s ± 1%   +0.55%  (p=0.000 n=16+17)
BM_UCord/15            1.30GB/s ±20%  1.43GB/s ± 1%   +9.85%  (p=0.000 n=20+20)
BM_UCord/16             658MB/s ±20%   705MB/s ± 1%   +7.22%  (p=0.000 n=20+19)
BM_UCord/17            1.96GB/s ± 2%  2.15GB/s ± 1%   +9.73%  (p=0.000 n=16+19)
BM_UCord/18             555MB/s ± 1%   833MB/s ± 1%  +50.11%  (p=0.000 n=18+19)
BM_UCord/19            1.57GB/s ± 1%  1.75GB/s ± 1%  +11.34%  (p=0.000 n=20+20)
BM_UCord/20            1.72GB/s ± 2%  1.70GB/s ± 2%   -1.01%  (p=0.001 n=20+20)
BM_UCordStringSink/0   2.88GB/s ± 1%  3.15GB/s ± 1%   +9.56%  (p=0.000 n=17+20)
BM_UCordStringSink/1   1.50GB/s ± 1%  1.52GB/s ± 1%   +1.96%  (p=0.000 n=19+20)
BM_UCordStringSink/2   14.5GB/s ±10%  14.6GB/s ±10%     ~     (p=0.542 n=20+20)
BM_UCordStringSink/3   1.06GB/s ± 1%  1.08GB/s ± 1%   +1.77%  (p=0.000 n=18+20)
BM_UCordStringSink/4   12.6GB/s ± 7%  13.2GB/s ± 4%   +4.63%  (p=0.000 n=20+20)
BM_UCordStringSink/5   2.29GB/s ± 1%  2.36GB/s ± 1%   +3.05%  (p=0.000 n=19+20)
BM_UCordStringSink/6   1.01GB/s ± 2%  1.01GB/s ± 0%     ~     (p=0.055 n=20+18)
BM_UCordStringSink/7    945MB/s ± 1%   939MB/s ± 1%   -0.60%  (p=0.000 n=19+20)
BM_UCordStringSink/8   1.06GB/s ± 1%  1.07GB/s ± 1%   +0.62%  (p=0.000 n=18+20)
BM_UCordStringSink/9    866MB/s ± 1%   864MB/s ± 1%     ~     (p=0.107 n=19+20)
BM_UCordStringSink/10  3.64GB/s ± 2%  3.98GB/s ± 1%   +9.32%  (p=0.000 n=19+20)
BM_UCordStringSink/11  1.22GB/s ± 1%  1.22GB/s ± 1%   +0.61%  (p=0.001 n=19+20)
BM_UCordStringSink/12  2.23GB/s ± 1%  2.23GB/s ± 1%     ~     (p=0.692 n=19+20)
BM_UCordStringSink/13  1.96GB/s ± 1%  1.94GB/s ± 1%   -0.82%  (p=0.000 n=17+18)
BM_UCordStringSink/14  2.09GB/s ± 2%  2.08GB/s ± 1%     ~     (p=0.147 n=20+18)
BM_UCordStringSink/15  1.47GB/s ± 1%  1.45GB/s ± 1%   -0.88%  (p=0.000 n=20+19)
BM_UCordStringSink/16   908MB/s ± 1%   917MB/s ± 1%   +0.97%  (p=0.000 n=19+19)
BM_UCordStringSink/17  2.11GB/s ± 1%  2.20GB/s ± 1%   +4.35%  (p=0.000 n=18+20)
BM_UCordStringSink/18   804MB/s ± 2%  1106MB/s ± 1%  +37.52%  (p=0.000 n=20+20)
BM_UCordStringSink/19  1.67GB/s ± 1%  1.72GB/s ± 0%   +2.81%  (p=0.000 n=18+20)
BM_UCordStringSink/20  1.77GB/s ± 3%  1.77GB/s ± 3%     ~     (p=0.815 n=20+20)

ppc_power8

name                   old speed      new speed      delta
BM_UCord/0              918MB/s ± 6%  1262MB/s ± 0%   +37.56%  (p=0.000 n=17+16)
BM_UCord/1              671MB/s ±13%   879MB/s ± 2%   +30.99%  (p=0.000 n=18+16)
BM_UCord/2             12.6GB/s ± 8%  12.6GB/s ± 5%      ~     (p=0.452 n=17+19)
BM_UCord/3              285MB/s ±10%   284MB/s ± 4%    -0.50%  (p=0.021 n=19+17)
BM_UCord/4             5.21GB/s ±12%  6.59GB/s ± 1%   +26.37%  (p=0.000 n=17+16)
BM_UCord/5              913MB/s ± 4%  1253MB/s ± 1%   +37.27%  (p=0.000 n=16+17)
BM_UCord/6              461MB/s ±13%   547MB/s ± 1%   +18.67%  (p=0.000 n=18+16)
BM_UCord/7              455MB/s ± 2%   524MB/s ± 3%   +15.28%  (p=0.000 n=16+18)
BM_UCord/8              489MB/s ± 2%   584MB/s ± 2%   +19.47%  (p=0.000 n=17+17)
BM_UCord/9              410MB/s ±33%   490MB/s ± 1%   +19.64%  (p=0.000 n=17+18)
BM_UCord/10            1.10GB/s ± 3%  1.55GB/s ± 2%   +41.21%  (p=0.000 n=16+16)
BM_UCord/11             494MB/s ± 1%   558MB/s ± 1%   +12.92%  (p=0.000 n=17+18)
BM_UCord/12             608MB/s ± 3%   793MB/s ± 1%   +30.45%  (p=0.000 n=17+16)
BM_UCord/13             545MB/s ±18%   721MB/s ± 2%   +32.22%  (p=0.000 n=19+17)
BM_UCord/14             594MB/s ± 4%   748MB/s ± 3%   +25.99%  (p=0.000 n=17+17)
BM_UCord/15             628MB/s ± 1%   822MB/s ± 3%   +30.94%  (p=0.000 n=18+16)
BM_UCord/16             277MB/s ± 2%   280MB/s ±15%    +0.86%  (p=0.001 n=17+17)
BM_UCord/17             864MB/s ± 1%  1001MB/s ± 3%   +15.96%  (p=0.000 n=17+17)
BM_UCord/18             121MB/s ± 2%   284MB/s ± 4%  +134.08%  (p=0.000 n=17+18)
BM_UCord/19             594MB/s ± 0%   713MB/s ± 2%   +19.93%  (p=0.000 n=16+17)
BM_UCord/20             553MB/s ±10%   662MB/s ± 5%   +19.74%  (p=0.000 n=16+18)
BM_UCordStringSink/0   1.37GB/s ± 4%  1.48GB/s ± 2%    +8.51%  (p=0.000 n=16+16)
BM_UCordStringSink/1    969MB/s ± 1%   990MB/s ± 1%    +2.16%  (p=0.000 n=16+18)
BM_UCordStringSink/2   13.1GB/s ±11%  13.0GB/s ±14%      ~     (p=0.858 n=17+18)
BM_UCordStringSink/3    411MB/s ± 1%   415MB/s ± 1%    +0.93%  (p=0.000 n=16+17)
BM_UCordStringSink/4   6.81GB/s ± 8%  7.29GB/s ± 5%    +7.12%  (p=0.000 n=16+19)
BM_UCordStringSink/5   1.35GB/s ± 5%  1.45GB/s ±13%    +8.00%  (p=0.000 n=16+17)
BM_UCordStringSink/6    653MB/s ± 8%   653MB/s ± 3%    -0.12%  (p=0.007 n=17+19)
BM_UCordStringSink/7    618MB/s ±13%   597MB/s ±18%    -3.45%  (p=0.001 n=18+18)
BM_UCordStringSink/8    702MB/s ± 5%   702MB/s ± 1%    -0.10%  (p=0.012 n=17+16)
BM_UCordStringSink/9    590MB/s ± 2%   564MB/s ±13%    -4.46%  (p=0.000 n=16+17)
BM_UCordStringSink/10  1.63GB/s ± 2%  1.76GB/s ± 4%    +8.28%  (p=0.000 n=17+16)
BM_UCordStringSink/11   630MB/s ±14%   684MB/s ±15%    +8.51%  (p=0.000 n=19+17)
BM_UCordStringSink/12   858MB/s ±12%   903MB/s ± 9%    +5.17%  (p=0.000 n=19+17)
BM_UCordStringSink/13   806MB/s ±22%   879MB/s ± 1%    +8.98%  (p=0.000 n=19+19)
BM_UCordStringSink/14   854MB/s ±13%   901MB/s ± 5%    +5.60%  (p=0.000 n=19+17)
BM_UCordStringSink/15   930MB/s ± 2%   964MB/s ± 3%    +3.59%  (p=0.000 n=16+16)
BM_UCordStringSink/16   363MB/s ±10%   356MB/s ± 6%      ~     (p=0.050 n=20+19)
BM_UCordStringSink/17   976MB/s ±12%  1078MB/s ± 1%   +10.52%  (p=0.000 n=20+17)
BM_UCordStringSink/18   227MB/s ± 1%   355MB/s ± 3%   +56.45%  (p=0.000 n=16+17)
BM_UCordStringSink/19   751MB/s ± 4%   808MB/s ± 4%    +7.70%  (p=0.000 n=18+17)
BM_UCordStringSink/20   761MB/s ± 8%   786MB/s ± 4%    +3.23%  (p=0.000 n=18+17)
2017-01-27 09:10:36 +01:00
Behzad Nouri 818b583387 adds std:: to stl types (#061) 2017-01-26 21:43:13 +01:00
Geoff Pike 27c5d86527 Re-work fast path for handling copies in zippy decompression.
This is a performance-tuning change that shouldn't change the behavior
of the library.

This adds some complexity but the performance gain might make that
worthwhile: With FDO on perflab/haswell, a 4.0% gain (geometric mean).

SAMPLE (before)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           36638      36552     100000 2.6GB/s  html
BM_UFlat/1          457153     455895       9173 1.4GB/s  urls
BM_UFlat/2            5850       5837     685481 19.6GB/s  jpg
BM_UFlat/3             122        122   34551988 1.5GB/s  jpg_200
BM_UFlat/4            6797       6781     620811 14.1GB/s  pdf
BM_UFlat/5          179485     179037      23471 2.1GB/s  html4
BM_UFlat/6          142734     142384      29525 1018.7MB/s  txt1
BM_UFlat/7          125233     124924      33709 955.6MB/s  txt2
BM_UFlat/8          382548     381533      10000 1066.7MB/s  txt3
BM_UFlat/9          525614     524297       8018 876.5MB/s  txt4
BM_UFlat/10          34946      34868     100000 3.2GB/s  pb
BM_UFlat/11         149548     149208      28063 1.2GB/s  gaviota
BM_UFlat/12          10684      10663     392580 2.1GB/s  cp
BM_UFlat/13           5494       5484     766584 1.9GB/s  c
BM_UFlat/14           1691       1688    2488784 2.1GB/s  lsp
BM_UFlat/15         676443     674726       6129 1.4GB/s  xls
BM_UFlat/16            156        156   26656909 1.2GB/s  xls_200
BM_UFlat/17         239911     239297      17558 2.0GB/s  bin
BM_UFlat/18            182        182   23072932 1047.9MB/s  bin_200
BM_UFlat/19          21544      21499     194484 1.7GB/s  sum
BM_UFlat/20           2236       2232    1877810 1.8GB/s  man
BM_UFlatSink/0       42266      42179      99732 2.3GB/s  html
BM_UFlatSink/1      461810     460633       9055 1.4GB/s  urls
BM_UFlatSink/2        5816       5804     632829 19.8GB/s  jpg
BM_UFlatSink/3         124        123   34351698 1.5GB/s  jpg_200
BM_UFlatSink/4        7173       7157     609929 13.3GB/s  pdf
BM_UFlatSink/5      184795     184302      22660 2.1GB/s  html4
BM_UFlatSink/6      143552     143223      29272 1012.7MB/s  txt1
BM_UFlatSink/7      127160     126890      33178 940.8MB/s  txt2
BM_UFlatSink/8      382219     381313      10000 1067.3MB/s  txt3
BM_UFlatSink/9      528042     526713       7988 872.5MB/s  txt4
BM_UFlatSink/10      41389      41305     100000 2.7GB/s  pb
BM_UFlatSink/11     147215     146877      28854 1.2GB/s  gaviota
BM_UFlatSink/12      12008      11984     348139 1.9GB/s  cp
BM_UFlatSink/13       5444       5433     775084 1.9GB/s  c
BM_UFlatSink/14       1647       1644    2552119 2.1GB/s  lsp
BM_UFlatSink/15     665011     663424       6320 1.4GB/s  xls
BM_UFlatSink/16        153        153   27571837 1.2GB/s  xls_200
BM_UFlatSink/17     239735     239169      17411 2.0GB/s  bin
BM_UFlatSink/18        183        182   23005573 1046.8MB/s  bin_200
BM_UFlatSink/19      22544      22498     187705 1.6GB/s  sum
BM_UFlatSink/20       2190       2186    1917894 1.8GB/s  man

SAMPLE (after)

Benchmark         Time(ns)    CPU(ns) Iterations
------------------------------------------------
BM_UFlat/0           33940      33889     100000 2.8GB/s  html
BM_UFlat/1          440728     439944       9586 1.5GB/s  urls
BM_UFlat/2            5652       5641     744776 20.3GB/s  jpg
BM_UFlat/3             123        123   34647884 1.5GB/s  jpg_200
BM_UFlat/4            6628       6615     631892 14.4GB/s  pdf
BM_UFlat/5          169523     169227      24197 2.3GB/s  html4
BM_UFlat/6          144139     143892      29232 1008.0MB/s  txt1
BM_UFlat/7          127148     126915      33144 940.6MB/s  txt2
BM_UFlat/8          380267     379233      10000 1073.2MB/s  txt3
BM_UFlat/9          529495     528194       7957 870.0MB/s  txt4
BM_UFlat/10          31844      31784     100000 3.5GB/s  pb
BM_UFlat/11         146822     146476      28737 1.2GB/s  gaviota
BM_UFlat/12          10784      10762     392176 2.1GB/s  cp
BM_UFlat/13           5528       5518     760934 1.9GB/s  c
BM_UFlat/14           1721       1719    2449291 2.0GB/s  lsp
BM_UFlat/15         673304     671774       6255 1.4GB/s  xls
BM_UFlat/16            155        155   27092003 1.2GB/s  xls_200
BM_UFlat/17         230424     229902      18285 2.1GB/s  bin
BM_UFlat/18            185        184   22818199 1033.9MB/s  bin_200
BM_UFlat/19          21035      20996     200765 1.7GB/s  sum
BM_UFlat/20           2242       2238    1864380 1.8GB/s  man
BM_UFlatSink/0       33487      33405     100000 2.9GB/s  html
BM_UFlatSink/1      431108     430226       9764 1.5GB/s  urls
BM_UFlatSink/2        5927       5916     648112 19.4GB/s  jpg
BM_UFlatSink/3         123        122   34704423 1.5GB/s  jpg_200
BM_UFlatSink/4        6472       6461     653462 14.8GB/s  pdf
BM_UFlatSink/5      164309     163988      25567 2.3GB/s  html4
BM_UFlatSink/6      138274     138020      30311 1050.9MB/s  txt1
BM_UFlatSink/7      120844     120637      34708 989.6MB/s  txt2
BM_UFlatSink/8      371046     370366      10000 1098.9MB/s  txt3
BM_UFlatSink/9      510021     508982       8269 902.9MB/s  txt4
BM_UFlatSink/10      30889      30844     100000 3.6GB/s  pb
BM_UFlatSink/11     140752     140521      29903 1.2GB/s  gaviota
BM_UFlatSink/12      10162      10146     413600 2.3GB/s  cp
BM_UFlatSink/13       5264       5256     762398 2.0GB/s  c
BM_UFlatSink/14       1622       1619    2606069 2.1GB/s  lsp
BM_UFlatSink/15     646897     645756       6512 1.5GB/s  xls
BM_UFlatSink/16        150        150   28223595 1.2GB/s  xls_200
BM_UFlatSink/17     226096     225650      18629 2.1GB/s  bin
BM_UFlatSink/18        185        184   22907935 1035.3MB/s  bin_200
BM_UFlatSink/19      21369      21335     198881 1.7GB/s  sum
BM_UFlatSink/20       2139       2136    1953637 1.8GB/s  man
2017-01-26 21:42:26 +01:00