Commit graph

302 commits

Author SHA1 Message Date
Jun He f52721b2b4 decompression: optimize ExtractOffset for Arm
Inspired by kExtractMasksCombined, this patch uses shifts to replace the
table lookup. On Arm the codegen is two shift ops (lsl+lsr); compared to
the previous ldr, which has a 4-cycle latency, the lsl+lsr pair needs only
2 cycles.
A slight (~0.3%) uplift was observed on N1, and ~3% on A72.

Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b
2021-08-06 15:44:27 +08:00
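A minimal sketch of the shift-based extraction described in the entry above, assuming the four 16-bit extract masks (one per tag type) are packed into a single 64-bit constant named kExtractMasksCombined; selecting a lane then compiles to two shifts instead of a table load:

```
#include <cstdint>

// Masks per tag type: 0x0000 (literal), 0x00FF (1-byte-offset copy),
// 0xFFFF (2-byte-offset copy), 0x0000 (type 3, special-cased to zero),
// packed into one 64-bit constant, 16 bits per lane.
constexpr uint64_t kExtractMasksCombined = 0x0000FFFF00FF0000ULL;

inline uint32_t ExtractMask(uint32_t tag_type) {
  // tag_type is in [0, 3]; tag_type * 16 selects the matching 16-bit lane.
  return static_cast<uint32_t>(kExtractMasksCombined >> (tag_type * 16)) &
         0xFFFF;
}
```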
Snappy Team f2db8f77ce Move the extract masks variable out in zippy. I see a consistent 1.5-2% improvement on ARM, probably because ARM has more relaxed address computation than x86 (https://www.godbolt.org/z/bfM1ezx41). I don't think this is a compiler bug, or something the compiler can do anything about.
PiperOrigin-RevId: 387569896
2021-08-02 14:50:16 +00:00
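A hypothetical before/after sketch of the hoist above, with illustrative names: the mask table is read through a local fixed before the hot loop, so the address computation is not redone on every iteration:

```
#include <cstddef>
#include <cstdint>

static constexpr uint32_t kExtractMasks[4] = {0, 0xFF, 0xFFFF, 0};

void ExtractAll(const uint8_t* tags, size_t n, uint32_t* out) {
  const uint32_t* extract_masks = kExtractMasks;  // hoisted out of the loop
  for (size_t i = 0; i < n; ++i) {
    out[i] = extract_masks[tags[i] & 3];  // plain indexed load per iteration
  }
}
```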
Snappy Team c8f7641646 Remove inline assembly as the bug in clang was fixed
PiperOrigin-RevId: 387356237
2021-08-02 14:50:09 +00:00
Snappy Team 9cc3689b21 Optimize memset to pure SIMD, because compilers generate consistently bad code for it: clang for ARM and gcc for x86 (https://gcc.godbolt.org/z/oxeGG7aEx)
PiperOrigin-RevId: 383467656
2021-08-02 14:49:57 +00:00
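A minimal sketch of the idea, assuming an x86 target and an illustrative 32-byte fill: explicit vector stores keep the compiler from lowering a small fixed-size memset into poorly scheduled scalar code:

```
#include <emmintrin.h>  // SSE2

inline void Fill32(char* dst, char value) {
  const __m128i v = _mm_set1_epi8(value);  // broadcast value into 16 lanes
  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);       // bytes 0..15
  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 16), v);  // bytes 16..31
}
```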
Snappy Team b4888f7616 Optimize tag extraction for ARM by getting the compiler to generate the conditional increment instruction (csinc). For the codegen, see https://gcc.godbolt.org/z/a8z9j95Pv
PiperOrigin-RevId: 382688740
2021-07-05 01:05:54 +00:00
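A hypothetical illustration of the codegen pattern rather than the exact change: on AArch64, adding a boolean to an integer compiles to cmp + csinc instead of a compare-and-branch sequence:

```
#include <cstdint>

inline uint32_t AdvanceBy(uint32_t base, uint32_t tag_type) {
  return base + (tag_type != 0);  // branchless: cmp + csinc on AArch64
}
```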
atdt b3fb0b5b4b Enable vector byte shuffle optimizations on ARM NEON
The SSSE3 intrinsics we use have their direct analogues in NEON, so making this optimization portable requires a very thin translation layer.

PiperOrigin-RevId: 381280165
2021-07-05 01:05:44 +00:00
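A sketch of such a translation layer, assuming this wrapper shape: NEON's vqtbl1q_u8 is the direct analogue of SSSE3's PSHUFB (_mm_shuffle_epi8), since each output byte is input[mask[i]] and out-of-range indices yield zero, matching PSHUFB's high-bit behavior for the mask values used here:

```
#if defined(__aarch64__)
#include <arm_neon.h>
using V128 = uint8x16_t;

inline V128 V128_Shuffle(V128 input, V128 shuffle_mask) {
  return vqtbl1q_u8(input, shuffle_mask);  // NEON single-register table lookup
}
#else
#include <tmmintrin.h>  // SSSE3
using V128 = __m128i;

inline V128 V128_Shuffle(V128 input, V128 shuffle_mask) {
  return _mm_shuffle_epi8(input, shuffle_mask);  // PSHUFB
}
#endif
```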
Victor Costan b638ebe5d9 Update Travis CI config.
Xcode (drives macOS image) : 12.2 => 12.5
Clang                      : 10 => 12
GCC                        : 10 => 11
PiperOrigin-RevId: 375610083
2021-05-25 02:20:52 +00:00
Snappy Team d8f5dd8eca Clarify, in a comment, that offset/256 fits in 3 bits. It has to in this context, because the other 5 bits in the byte are used for len-4 and the tag.
PiperOrigin-RevId: 374926553
2021-05-25 02:20:42 +00:00
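A sketch of the byte layout behind that comment, following the Snappy format description: a copy-with-1-byte-offset tag packs 2 bits of tag type, 3 bits of len - 4, and 3 bits of offset/256 into a single byte, which is why offset/256 must fit in 3 bits:

```
#include <cstdint>

constexpr uint8_t kCopy1ByteOffset = 1;  // tag type in the low 2 bits

inline uint8_t MakeCopy1Tag(uint32_t len, uint32_t offset) {
  // Preconditions: 4 <= len <= 11 and offset < 2048 (so offset >> 8 < 8).
  return static_cast<uint8_t>(kCopy1ByteOffset | ((len - 4) << 2) |
                              ((offset >> 8) << 5));
}
```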
Victor Costan 2b63814b15 Tag open source release 1.1.9.
PiperOrigin-RevId: 372007801
2021-05-04 22:53:34 +00:00
atdt 9c1be17938 'size' remains unused if none of ZLIB, LZO and LZ4 are available.
While we're here, take care of a couple of lint warnings by converting CHECK(a != b) to CHECK_NE(a, b).

PiperOrigin-RevId: 369132446
2021-04-22 04:27:48 +00:00
Chris Mumford 78650d126a Add project goals to CONTRIBUTING.md.
PiperOrigin-RevId: 362386747
2021-03-12 06:41:07 +00:00
Victor Costan 5e7c14bd05 Add stubs for abseil flags.
This CL also removes support for using the gflags library to modify the
flags.

PiperOrigin-RevId: 361583626
2021-03-08 17:26:48 +00:00
Victor Costan 80a2a10c8c Remove unused run_microbenchmarks flag.
PiperOrigin-RevId: 361582956
2021-03-08 17:26:39 +00:00
Snappy Team 453942b38f Add absl::GetFlag and absl::SetFlag to uses of flags.
PiperOrigin-RevId: 357807059
2021-02-17 04:41:41 +00:00
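An illustrative example of the accessor style this change adopts; the flag itself is hypothetical, not one defined by Snappy:

```
#include "absl/flags/flag.h"

ABSL_FLAG(bool, verbose, false, "Print extra diagnostics.");

bool VerboseEnabled() {
  return absl::GetFlag(FLAGS_verbose);   // instead of reading FLAGS_verbose
}

void SetVerbose(bool value) {
  absl::SetFlag(&FLAGS_verbose, value);  // instead of assigning to it
}
```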
Victor Costan ea368c2f07 Add AppVeyor status badge.
PiperOrigin-RevId: 347861379
2020-12-16 19:28:23 +00:00
Victor Costan d1d1f48604 Remove unused include in snappy_benchmark.cc.
PiperOrigin-RevId: 347861229
2020-12-16 19:28:12 +00:00
Victor Costan 4ebd8b2f23 Split benchmarks and test tools into separate targets.
This lets us remove main() from snappy_bench.cc and snappy_unittest.cc,
which simplifies integrating these tests and benchmarks with other
suites.

PiperOrigin-RevId: 347857427
2020-12-16 19:09:56 +00:00
Victor Costan 0793e2ae2d Merge pull request #117 from cmumford:disable-osx-fuzzer
PiperOrigin-RevId: 347736844
2020-12-16 03:02:51 +00:00
Victor Costan ac55f842f7 Test stub improvements.
PiperOrigin-RevId: 347736380
2020-12-16 02:58:39 +00:00
Chris Mumford 6e9ae72423 Disable fuzzing on OSX.
LibFuzzer does not ship with the Mac OSX Command Line Tools.

```
ld: file not found: /Applications/Xcode-12.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0/lib/darwin/libclang_rt.fuzzer_osx.a

clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
2020-12-15 17:10:14 -08:00
Victor Costan 402d88812c Fixup for adding the third_party/{benchmark, googletest} submodules. (#115) 2020-12-15 12:01:28 -08:00
Victor Costan 6badb0a261 Merge pull request #114 from cmumford:werror-only-clang
PiperOrigin-RevId: 347660305
2020-12-15 19:49:13 +00:00
Chris Mumford bc53daa7be Fixed endif clause. 2020-12-15 11:19:53 -08:00
Chris Mumford e9a6a08439 Matching clang. 2020-12-15 11:17:28 -08:00
Chris Mumford 955a5dd1b3 Building with -Werror only with clang.
gcc was unable to inline a function call, which caused a build
failure due to `-Wall -Werror`.

The build error was:

```
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
  292 | static inline std::pair<__m128i /* pattern */, __m128i /* reshuffle_mask */>
      |                                                                            ^
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
cc1plus: all warnings being treated as errors
```
2020-12-15 11:02:17 -08:00
Chris Mumford 42d1dd7ea3 Fix CHECK_EQ to call ok() instead of CheckSuccess().
CheckSuccess was removed in e1e91ee464.

PiperOrigin-RevId: 347625874
2020-12-15 09:16:39 -08:00
Victor Costan eaaa0ed0ca Fixup for adding the third_party/{benchmark, googletest} submodules. (#111) 2020-12-15 08:49:01 -08:00
Victor Costan e1e91ee464 Rework file:: stubs.
PiperOrigin-RevId: 347541488
2020-12-15 06:21:47 +00:00
Victor Costan 6aa79cb471 Wrap snappy_unittest in an anonymous namespace and remove static from functions.
PiperOrigin-RevId: 347541028
2020-12-15 06:18:35 +00:00
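A minimal sketch of the pattern applied in that CL: internal linkage comes from the anonymous namespace, so per-function static qualifiers become redundant:

```
namespace {  // everything inside has internal linkage

int HelperUsedOnlyHere() {  // was: static int HelperUsedOnlyHere()
  return 42;
}

}  // namespace
```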
Victor Costan bae9f9bef8 Fixup for adding the third_party/{benchmark, googletest} submodules. (#110) 2020-12-14 20:27:33 -08:00
Victor Costan 5f913be04e Fix unused local variable warnings.
This will not change the compilation output.

PiperOrigin-RevId: 347525836
2020-12-15 04:14:46 +00:00
Victor Costan 549685a598 Remove custom testing and benchmarking code.
Snappy includes a testing framework, which implements a subset of the
Google Test API, and can be used when Google Test is not available.
Snappy also includes a micro-benchmark framework, which implements an
old version of the Google Benchmark API.

This CL replaces the custom test and micro-benchmark frameworks with
google/googletest and google/benchmark. The code is vendored in
third_party/ via git submodules. The setup is similar to google/crc32c
and google/leveldb.

This CL also updates the benchmarking code to the modern Google
Benchmark API.

Benchmark results are expected to be more precise, as the old framework
ran each benchmark with a fixed number of iterations, whereas Google
Benchmark keeps iterating until the noise is low.

PiperOrigin-RevId: 347456142
2020-12-14 21:27:31 +00:00
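A minimal sketch of the modern Google Benchmark API that the CL migrates to; the benchmark body is an illustrative stand-in for the real workloads:

```
#include "benchmark/benchmark.h"

void BM_Example(benchmark::State& state) {
  for (auto _ : state) {              // iterates until timing noise is low
    benchmark::DoNotOptimize(1 + 1);  // stand-in for the code under test
  }
}
BENCHMARK(BM_Example);

BENCHMARK_MAIN();  // supplied by the framework; no hand-written main()
```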
Chris Mumford 11f9a77a2f Add Travis-CI build status badge to README.md.
PiperOrigin-RevId: 347402877
2020-12-14 09:40:22 -08:00
Victor Costan 49540965a3 Update Travis CI config.
PiperOrigin-RevId: 347397797
2020-12-14 09:11:46 -08:00
Victor Costan 8995ffabb9 Replace #pragma nounroll with equivalent used elsewhere.
PiperOrigin-RevId: 347341130
2020-12-14 09:59:34 +00:00
Victor Costan d1daa83044 Remove inline qualifier from static variables.
This feature requires C++17. inline is only needed for declarations in headers, which may be included in multiple compilation units; the declarations modified by this CL occur in a single compilation unit, so the qualifier is unnecessary.

PiperOrigin-RevId: 347338760
2020-12-14 09:59:23 +00:00
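A minimal illustration of the distinction drawn in that CL; the variable is hypothetical:

```
// In a header shared by several compilation units, C++17's inline merges
// the multiple definitions into one:
//
//   inline constexpr int kTableSize = 256;  // header, requires C++17
//
// In a single .cc file there is nothing to merge, so a plain definition
// suffices and avoids the C++17 requirement:
constexpr int kTableSize = 256;
```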
Snappy Team 3b571656fa 1) Improve the lookup table data to require fewer instructions to extract the necessary fields. We now store len - offset as a signed int16; this happens to remove the offset masking from the calculations, and the calculations that must be done anyway produce precisely the flags we need for the correctness tests (see the sketch after this entry).
2) Replace offset extraction with a lookup mask. This takes fewer uops, and is needed because type 3 must be special-cased to always return 0 so that the fallback is triggered properly.
3) Unroll the loop twice. This removes some loop-condition checks AND improves the generated assembly: the loop variables tend to end up in different registers, requiring movs, and having two consecutive copies of the body allows those movs to be elided.

PiperOrigin-RevId: 346663328
2020-12-14 02:48:03 +00:00
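A hypothetical sketch of the sign trick in item 1, with illustrative names: storing len - offset as a signed 16-bit table entry lets one subtraction both feed the copy logic and set the flag that detects overlapping copies (offset < len), which must take the careful fallback path:

```
#include <cstdint>

inline bool NeedsFallback(int16_t len_minus_offset, int32_t extra_offset) {
  // len - (table_offset + extra_offset) > 0  <=>  offset < len.
  return len_minus_offset - extra_offset > 0;
}
```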
Shahriar Rouf a9730ed505 Optimize zippy decompression by making IncrementalCopy faster.
When SSSE3 is available:
- Use PSHUFB (_mm_shuffle_epi8) to handle pattern sizes 1 to 15 (previously it handled sizes 1 to 7).
- This enables us to do 16-byte copies instead of 8-byte copies, because after expansion we know that the pattern size is >= 16.
- Use a shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16), which would allow FDO to lay out the code with respect to the actual probabilities of each length.
- The PSHUFB masks are now generated programmatically at compile-time.

When SSSE3 is unavailable:
- No change.

In both cases:
- assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this.

The 'bin' case is notably >20% faster, because it has many repeated-character patterns (i.e., pattern_size = 1). A sketch of the shuffle follows this entry.

PiperOrigin-RevId: 346454471
2020-12-14 02:47:49 +00:00
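A sketch of the PSHUFB expansion described above, assuming SSSE3; the real CL generates the index masks programmatically at compile time, whereas this sketch builds one at runtime for clarity:

```
#include <tmmintrin.h>  // SSSE3
#include <cstddef>

inline __m128i ExpandPattern(const char* src, size_t pattern_size) {
  // pattern_size is in [1, 15]; assumes 16 readable bytes at src.
  __m128i pattern = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
  alignas(16) char idx[16];
  for (int i = 0; i < 16; ++i) {
    idx[i] = static_cast<char>(i % pattern_size);  // repeat indices mod size
  }
  __m128i mask = _mm_load_si128(reinterpret_cast<const __m128i*>(idx));
  return _mm_shuffle_epi8(pattern, mask);  // 16 bytes of repeated pattern
}
```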
Snappy Team 56c2c247d0 Internal change
PiperOrigin-RevId: 345360683
2020-12-03 22:52:52 +00:00
Shahriar Rouf a94be58e65 Optimize zippy decompression by making IncrementalCopy faster.
When SSSE3 is available:
- Use PSHUFB (_mm_shuffle_epi8) to handle pattern sizes 1 to 15 (previously it handled sizes 1 to 7).
- This enables us to do 16-byte copies instead of 8-byte copies, because after expansion we know that the pattern size is >= 16.
- Use a shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16), which would allow FDO to lay out the code with respect to the actual probabilities of each length.
- The PSHUFB masks are now generated programmatically at compile-time.

When SSSE3 is unavailable:
- No change.

In both cases:
- assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this.

The 'bin' case is notably >20% faster, because it has many repeated-character patterns (i.e., pattern_size = 1).

PiperOrigin-RevId: 345340892
2020-12-03 22:52:41 +00:00
Snappy Team 01a566f825 Fix open source version
PiperOrigin-RevId: 343272548
2020-11-19 17:06:26 +00:00
Snappy Team 616b8229b6 Add LZ4 as a benchmark option. Snappy is starting to look really good compared to LZ4, which many on the internet consider the fastest solution. We now see that Snappy is actually becoming very competitive, with compression a little faster and decompression slower, but certainly not terribly so.
PiperOrigin-RevId: 343140860
2020-11-18 23:22:04 +00:00
Snappy Team e4a6e97b91 Extend validate benchmarks over all types and also add a medley for validation.
I also made the compression happen only once per benchmark. This way we get a cleaner measurement of #branch-misses using "perf stat". Compression naturally suffers from a large number of branch misses, which was polluting the measurements.

This showed that with the new decompression, the branch-miss rate is actually much lower than initially reported, only 0.2%, and very stable, i.e. it doesn't really fluctuate with how you execute the benchmarks.

PiperOrigin-RevId: 342628576
2020-11-18 23:21:55 +00:00
Snappy Team 719bed0ae2 Bug fix: error out on copies with offset 0.
PiperOrigin-RevId: 342447553
2020-11-18 23:21:47 +00:00
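A minimal sketch of the check implied by that fix, under the assumption that it rejects such copies during decompression; the function shape is illustrative:

```
#include <cstdint>

inline bool IsValidCopyOffset(uint32_t offset) {
  // A copy with offset == 0 would read the byte currently being written,
  // so a conforming decompressor must treat it as corrupt input.
  return offset != 0;
}
```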
Snappy Team 289c8a3c0a Make zippy decompression branchless
PiperOrigin-RevId: 342423961
2020-11-18 23:21:38 +00:00
Snappy Team 3bfa265a04 Revert zippy optimization that causes heap buffer overflows.
PiperOrigin-RevId: 342283314
2020-11-18 23:21:30 +00:00
Shahriar Rouf 4d2dc9dcbb Optimize zippy unzipping by up to >10% by making IncrementalCopy faster.
When SSSE3 is available:
- Use PSHUFB (_mm_shuffle_epi8) to handle pattern sizes 1 to 15 (previously it handled sizes 1 to 7).
- This enables us to do 16-byte copies instead of 8-byte copies, because after expansion we know that the pattern size is >= 16.
- Use a shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16), which would allow FDO to lay out the code with respect to the actual probabilities of each length.
- The PSHUFB masks are now generated programmatically at compile-time.

When SSSE3 is unavailable:
- No change.

In both cases:
- assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this.

PiperOrigin-RevId: 342267037
2020-11-18 23:21:21 +00:00
Snappy Team 11e5165b98 Add a benchmark that decreases branch-prediction memorization by increasing the number of independent branches executed per benchmark iteration.
PiperOrigin-RevId: 342242843
2020-11-18 23:21:12 +00:00
Luca Versari 6835abd953 Change hash function for Compress.
((a*b)>>18) & mask has higher throughput than (a*b)>>shift, and produces the
same results when the hash table size is 2**14. In other cases, the hash
function is still good, but it matters less, as the input is small anyway.
This speeds up encoding, especially in cases where hashing is a significant
part of the encoding critical path (small or incompressible files).

PiperOrigin-RevId: 341498741
2020-11-18 23:20:58 +00:00
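A sketch of the hash shape described above; the multiplicative constant is illustrative. With a 2**14-entry table, mask = (1 << 14) - 1, so ((a * kMul) >> 18) & mask keeps the same bits as (a * kMul) >> (32 - 14), but the fixed shift count is cheaper than a variable one:

```
#include <cstdint>

inline uint32_t HashBytes(uint32_t bytes, uint32_t mask) {
  constexpr uint32_t kMul = 0x1e35a7bd;  // an illustrative odd constant
  return ((bytes * kMul) >> 18) & mask;  // mask = table_size - 1
}
```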
Victor Costan 368b01c8dd Merge pull request #107 from jsteemann:bug-fix/fix-compile-warning
PiperOrigin-RevId: 340505526
2020-11-03 20:51:55 +00:00