Make heuristic match skipping more aggressive.

This causes compression to be much faster on incompressible inputs
(such as the jpeg and pdf tests), and is neutral or even positive on the other
tests. The test set shows only microscopic density regressions; I attempted to
construct a worst-case test set containing ~1500 different cases of mixed
plaintext + /dev/urandom, and even those seemed to be only 0.38 percentage
points less dense on average (the single worst case was 87.8% -> 89.0%), which
we can live with given that this is already an edge case.

The original idea is by Klaus Post; I only tweaked the implementation.
Ironically, the new implementation is almost more in line with the
comment that was there, so I've left that largely alone, albeit
with a small modification.

Microbenchmark results (opt mode, 64-bit, static linking):

Ivy Bridge:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   120284    115480  847.0MB/s  html (22.31 %)        +4.2%
BM_ZFlat/1                  1527911   1522242  440.7MB/s  urls (47.78 %)        +0.4%
BM_ZFlat/2                    17591     10582  10.9GB/s  jpg (99.95 %)         +66.2%
BM_ZFlat/3                      323       322  593.3MB/s  jpg_200 (73.00 %)     +0.3%
BM_ZFlat/4                    53691     14063  6.8GB/s  pdf (83.30 %)         +281.8%
BM_ZFlat/5                   495442    492347  794.8MB/s  html4 (22.52 %)       +0.6%
BM_ZFlat/6                   473523    473622  306.7MB/s  txt1 (57.88 %)        -0.0%
BM_ZFlat/7                   421406    420120  284.5MB/s  txt2 (61.91 %)        +0.3%
BM_ZFlat/8                  1265632   1270538  320.8MB/s  txt3 (54.99 %)        -0.4%
BM_ZFlat/9                  1742688   1737894  264.8MB/s  txt4 (66.26 %)        +0.3%
BM_ZFlat/10                  107950    103404  1095.1MB/s  pb (19.68 %)         +4.4%
BM_ZFlat/11                  372660    371818  473.5MB/s  gaviota (37.72 %)     +0.2%
BM_ZFlat/12                   53239     49528  474.4MB/s  cp (48.12 %)          +7.5%
BM_ZFlat/13                   18940     17349  613.9MB/s  c (42.47 %)           +9.2%
BM_ZFlat/14                    5155      5075  700.3MB/s  lsp (48.37 %)         +1.6%
BM_ZFlat/15                 1474757   1474471  667.2MB/s  xls (41.23 %)         +0.0%
BM_ZFlat/16                     363       362  528.0MB/s  xls_200 (78.00 %)     +0.3%
BM_ZFlat/17                  453849    456931  1073.2MB/s  bin (18.11 %)        -0.7%
BM_ZFlat/18                      90        87  2.1GB/s  bin_200 (7.50 %)        +3.4%
BM_ZFlat/19                   82163     80498  453.7MB/s  sum (48.96 %)         +2.1%
BM_ZFlat/20                    7174      7124  566.7MB/s  man (59.21 %)         +0.7%
Sum of all benchmarks       8694831   8623857                                   +0.8%

Sandy Bridge:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   117426    112649  868.2MB/s  html (22.31 %)        +4.2%
BM_ZFlat/1                  1517095   1498522  447.5MB/s  urls (47.78 %)        +1.2%
BM_ZFlat/2                    18601     10649  10.8GB/s  jpg (99.95 %)         +74.7%
BM_ZFlat/3                      359       356  536.0MB/s  jpg_200 (73.00 %)     +0.8%
BM_ZFlat/4                    60249     13832  6.9GB/s  pdf (83.30 %)         +335.6%
BM_ZFlat/5                   481246    475571  822.7MB/s  html4 (22.52 %)       +1.2%
BM_ZFlat/6                   460541    455693  318.8MB/s  txt1 (57.88 %)        +1.1%
BM_ZFlat/7                   407751    404147  295.8MB/s  txt2 (61.91 %)        +0.9%
BM_ZFlat/8                  1228255   1222519  333.4MB/s  txt3 (54.99 %)        +0.5%
BM_ZFlat/9                  1678299   1666379  276.2MB/s  txt4 (66.26 %)        +0.7%
BM_ZFlat/10                  106499    101715  1113.4MB/s  pb (19.68 %)         +4.7%
BM_ZFlat/11                  361913    360222  488.7MB/s  gaviota (37.72 %)     +0.5%
BM_ZFlat/12                   53137     49618  473.6MB/s  cp (48.12 %)          +7.1%
BM_ZFlat/13                   18801     17812  597.8MB/s  c (42.47 %)           +5.6%
BM_ZFlat/14                    5394      5383  660.2MB/s  lsp (48.37 %)         +0.2%
BM_ZFlat/15                 1435411   1432870  686.4MB/s  xls (41.23 %)         +0.2%
BM_ZFlat/16                     389       395  483.3MB/s  xls_200 (78.00 %)     -1.5%
BM_ZFlat/17                  447255    445510  1100.4MB/s  bin (18.11 %)        +0.4%
BM_ZFlat/18                      86        86  2.2GB/s  bin_200 (7.50 %)        +0.0%
BM_ZFlat/19                   82555     79512  459.3MB/s  sum (48.96 %)         +3.8%
BM_ZFlat/20                    7527      7553  534.5MB/s  man (59.21 %)         -0.3%
Sum of all benchmarks       8488789   8360993                                   +1.5%

Haswell:

Benchmark                 Base (ns)  New (ns)                                Improvement
----------------------------------------------------------------------------------------
BM_ZFlat/0                   107512    105621  925.6MB/s  html (22.31 %)        +1.8%
BM_ZFlat/1                  1344306   1332479  503.1MB/s  urls (47.78 %)        +0.9%
BM_ZFlat/2                    14752      9471  12.1GB/s  jpg (99.95 %)         +55.8%
BM_ZFlat/3                      287       275  694.0MB/s  jpg_200 (73.00 %)     +4.4%
BM_ZFlat/4                    48810     12263  7.8GB/s  pdf (83.30 %)         +298.0%
BM_ZFlat/5                   443013    442064  884.6MB/s  html4 (22.52 %)       +0.2%
BM_ZFlat/6                   429239    432124  336.0MB/s  txt1 (57.88 %)        -0.7%
BM_ZFlat/7                   381765    383681  311.5MB/s  txt2 (61.91 %)        -0.5%
BM_ZFlat/8                  1136667   1154304  353.0MB/s  txt3 (54.99 %)        -1.5%
BM_ZFlat/9                  1579925   1592431  288.9MB/s  txt4 (66.26 %)        -0.8%
BM_ZFlat/10                   98345     92411  1.2GB/s  pb (19.68 %)            +6.4%
BM_ZFlat/11                  340397    340466  516.8MB/s  gaviota (37.72 %)     -0.0%
BM_ZFlat/12                   47076     43536  539.5MB/s  cp (48.12 %)          +8.1%
BM_ZFlat/13                   16680     15637  680.8MB/s  c (42.47 %)           +6.7%
BM_ZFlat/14                    4616      4539  782.6MB/s  lsp (48.37 %)         +1.7%
BM_ZFlat/15                 1331231   1334094  736.9MB/s  xls (41.23 %)         -0.2%
BM_ZFlat/16                     326       322  593.5MB/s  xls_200 (78.00 %)     +1.2%
BM_ZFlat/17                  404383    400326  1.2GB/s  bin (18.11 %)           +1.0%
BM_ZFlat/18                      69        69  2.7GB/s  bin_200 (7.50 %)        +0.0%
BM_ZFlat/19                   74771     71348  511.7MB/s  sum (48.96 %)         +4.8%
BM_ZFlat/20                    6461      6383  632.2MB/s  man (59.21 %)         +1.2%
Sum of all benchmarks       7810631   7773844                                   +0.5%

I've done a quick test that there are no performance regressions on external
GCC (4.9.2, Debian, Haswell, 64-bit), too.
This commit is contained in:
Steinar H. Gunderson 2016-04-05 11:50:26 +02:00
parent 2b9152d9c5
commit d53de18799
1 changed files with 5 additions and 4 deletions

View File

@ -364,9 +364,9 @@ char* CompressFragment(const char* input,
//
// Heuristic match skipping: If 32 bytes are scanned with no matches
// found, start looking only at every other byte. If 32 more bytes are
// scanned, look at every third byte, etc.. When a match is found,
// immediately go back to looking at every byte. This is a small loss
// (~5% performance, ~0.1% density) for compressible data due to more
// scanned (or skipped), look at every third byte, etc.. When a match is
// found, immediately go back to looking at every byte. This is a small
// loss (~5% performance, ~0.1% density) for compressible data due to more
// bookkeeping, but for non-compressible data (such as JPEG) it's a huge
// win since the compressor quickly "realizes" the data is incompressible
// and doesn't bother looking for matches everywhere.
@ -382,7 +382,8 @@ char* CompressFragment(const char* input,
ip = next_ip;
uint32 hash = next_hash;
assert(hash == Hash(ip, shift));
uint32 bytes_between_hash_lookups = skip++ >> 5;
uint32 bytes_between_hash_lookups = skip >> 5;
skip += bytes_between_hash_lookups;
next_ip = ip + bytes_between_hash_lookups;
if (PREDICT_FALSE(next_ip > ip_limit)) {
goto emit_remainder;