I also made the compression happen only once per benchmark. This way we get a cleaner measurement of #branch-misses using "perf stat". Compression naturally suffers from a large number of branch misses, which was polluting the measurements.
This showed that with the new decompression, branch misses are actually much lower than initially reported: only 0.2% and very stable, i.e. the rate doesn't really fluctuate with how you execute the benchmarks.
PiperOrigin-RevId: 342628576
When SSSE3 is available:
- Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7).
- This enables us to do 16-byte copies instead of 8-byte copies because we know that the pattern size >= 16.
- Use shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16) which would allow FDO to lay out the code with respect to actual probabilities of each length.
- The PSHUFB masks are now generated programmatically at compile-time.
When SSSE3 is unavailable:
- No change.
In both cases:
- assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this.
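
The pattern expansion for sizes 1 to 15 can be illustrated with a portable scalar sketch. This is not the Snappy implementation: `MakePatternMask` and `Shuffle` are hypothetical names, the mask is computed at runtime rather than at compile time, and `Shuffle` only emulates what PSHUFB does when no mask byte has its high bit set.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Builds a shuffle mask that repeats the first `pattern_size` source bytes
// across all 16 lanes: mask[i] = i % pattern_size.
inline Bytes16 MakePatternMask(std::size_t pattern_size) {
  assert(pattern_size >= 1 && pattern_size <= 15);
  Bytes16 mask{};
  for (std::size_t i = 0; i < mask.size(); ++i) {
    mask[i] = static_cast<uint8_t>(i % pattern_size);
  }
  return mask;
}

// Scalar stand-in for PSHUFB (assumes every mask byte is < 0x80):
// out[i] = src[mask[i] & 0x0F].
inline Bytes16 Shuffle(const Bytes16& src, const Bytes16& mask) {
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = src[mask[i] & 0x0F];
  }
  return out;
}
```

With a 3-byte pattern "abc" in the low bytes of `src`, `Shuffle(src, MakePatternMask(3))` yields "abcabcabcabcabca", which can then be stored with a single 16-byte write.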
PiperOrigin-RevId: 342267037
((a*b)>>18) & mask has higher throughput than (a*b)>>shift, and produces the
same results when the hash table size is 2**14. In other cases, the hash
function is still good, but it's not as necessary for that to be the case as
the input is small anyway. This speeds up encoding, especially in cases
where hashing is a significant part of the encoding critical path (small or
incompressible files).
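
A minimal sketch of why the two forms agree for a 2**14-entry table: a 32-bit product shifted right by 18 has only 14 significant bits, so masking with 2**14 - 1 changes nothing. The multiplier value and the function names below are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kHashMul = 0x1e35a7bd;  // assumed multiplier, illustration only
constexpr int kMaxTableBits = 14;
constexpr uint32_t kMaxTableMask = (1u << kMaxTableBits) - 1;

// Old form: the shift amount depends on the table size.
inline uint32_t HashShiftOnly(uint32_t bytes, int table_bits) {
  assert(table_bits >= 1 && table_bits <= kMaxTableBits);
  return (bytes * kHashMul) >> (32 - table_bits);
}

// New form: constant shift by 18, then a mask selected by the table size.
inline uint32_t HashShiftThenMask(uint32_t bytes, uint32_t mask) {
  return ((bytes * kHashMul) >> 18) & mask;
}
```

For smaller tables the two functions pick different bits of the product, so the results differ, but the masked value still stays within the table.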
PiperOrigin-RevId: 341498741
See https://reviews.llvm.org/D67122 for some discussion of why this can matter.
I don't think this should have any noticeable effect on performance.
PiperOrigin-RevId: 340255083
When building Snappy with compiler option `-Wsuggest-override` set
via the CMAKE_CXX_FLAGS, compilation produces warnings in
`snappy-sinksource.h`:
```
In file included from ./snappy-fork/snappy-sinksource.cc:32:
./snappy-fork/snappy-sinksource.h:150:18: error: ‘virtual size_t snappy::ByteArraySource::Available() const’ can be marked override [-Werror=suggest-override]
150 | virtual size_t Available() const;
| ^~~~~~~~~
./snappy-fork/snappy-sinksource.h:151:23: error: ‘virtual const char* snappy::ByteArraySource::Peek(size_t*)’ can be marked override [-Werror=suggest-override]
151 | virtual const char* Peek(size_t* len);
| ^~~~
./snappy-fork/snappy-sinksource.h:152:16: error: ‘virtual void snappy::ByteArraySource::Skip(size_t)’ can be marked override [-Werror=suggest-override]
152 | virtual void Skip(size_t n);
| ^~~~
./snappy-fork/snappy-sinksource.h:163:16: error: ‘virtual void snappy::UncheckedByteArraySink::Append(const char*, size_t)’ can be marked override [-Werror=suggest-override]
163 | virtual void Append(const char* data, size_t n);
| ^~~~~~
./snappy-fork/snappy-sinksource.h:164:17: error: ‘virtual char* snappy::UncheckedByteArraySink::GetAppendBuffer(size_t, char*)’ can be marked override [-Werror=suggest-override]
164 | virtual char* GetAppendBuffer(size_t len, char* scratch);
| ^~~~~~~~~~~~~~~
./snappy-fork/snappy-sinksource.h:165:17: error: ‘virtual char* snappy::UncheckedByteArraySink::GetAppendBufferVariable(size_t, size_t, char*, size_t, size_t*)’ can be marked override [-Werror=suggest-override]
165 | virtual char* GetAppendBufferVariable(
| ^~~~~~~~~~~~~~~~~~~~~~~
./snappy-fork/snappy-sinksource.h:168:16: error: ‘virtual void snappy::UncheckedByteArraySink::AppendAndTakeOwnership(char*, size_t, void (*)(void*, const char*, size_t), void*)’ can be marked override [-Werror=suggest-override]
168 | virtual void AppendAndTakeOwnership(
| ^~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
```
This PR adds the missing override specifiers to the sink
implementations, so compilation works fine again.
Tested it with g++-9.3 and g++-10.2.
Compatibility note:
Override specifiers were introduced with C++11, which Snappy seems
to effectively require, at least according to its CMakeLists.txt file
and due to the usage of some C++11-only STL types in its tests.
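
As an illustration of the kind of change involved, the sketch below marks overriding methods with `override`. The class names are hypothetical stand-ins, not the real declarations in `snappy-sinksource.h`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base interface, loosely modelled on a byte source.
class ExampleSource {
 public:
  virtual ~ExampleSource() {}
  virtual std::size_t Available() const = 0;
  virtual void Skip(std::size_t n) = 0;
};

// With `override`, the compiler rejects the declaration if the signature
// stops matching a virtual method of the base class.
class ExampleByteArraySource : public ExampleSource {
 public:
  explicit ExampleByteArraySource(std::size_t n) : left_(n) {}
  std::size_t Available() const override { return left_; }
  void Skip(std::size_t n) override {
    assert(n <= left_);
    left_ -= n;
  }

 private:
  std::size_t left_;
};
```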
Bits::FindLSBSetNonZero64() is now available unconditionally, and it
should be easier to reason about the code included in each build
configuration.
This reduces the amount of conditional compiling going on, which makes
it easier to reason about what stubs are used in each build
configuration.
The guards were added to work around the fact that MSVC has a
_BitScanForward64() intrinsic, but the intrinsic is only implemented on
64-bit targets, and causes a compilation error on 32-bit targets.
By contrast, Clang and GCC support __builtin_ctzll() everywhere, and
implement it with varying efficiency.
This CL reworks the conditional compilation directives so that
Bits::FindLSBSetNonZero64() uses the _BitScanForward64() intrinsic on
MSVC when available, and the portable implementation otherwise.
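
A rough sketch of the dispatch described above. The preprocessor conditions are assumptions chosen for illustration and may not match the ones used in the Snappy stubs; `FindLSBSetNonZero64Portable` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
#include <intrin.h>
#endif

// Portable fallback: counts trailing zero bits one at a time.
inline int FindLSBSetNonZero64Portable(uint64_t n) {
  assert(n != 0);
  int rc = 0;
  while ((n & 1) == 0) {
    n >>= 1;
    ++rc;
  }
  return rc;
}

// Returns the index of the least significant set bit of a non-zero value.
inline int FindLSBSetNonZero64(uint64_t n) {
  assert(n != 0);
#if defined(__GNUC__) || defined(__clang__)
  return __builtin_ctzll(n);
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
  unsigned long where;
  _BitScanForward64(&where, n);
  return static_cast<int>(where);
#else
  return FindLSBSetNonZero64Portable(n);
#endif
}
```

The function is always defined, so callers no longer need their own guards; only the body varies by toolchain.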
PiperOrigin-RevId: 310007748
* Replace post-increment with pre-increment in for loops.
* Replace unsigned int counters with precise types, like uint8_t.
* Switch to C++11 iterating loops when possible.
PiperOrigin-RevId: 309724233
This CL makes the following substitutions.
* assert.h -> cassert
* math.h -> cmath
* stdarg.h -> cstdarg
* stdio.h -> cstdio
* stdlib.h -> cstdlib
* string.h -> cstring
stddef.h and stdint.h are not migrated to C++ headers.
PiperOrigin-RevId: 309074805
Snappy issues multi-byte (16/32/64-bit) loads and stores that are not
aligned, meaning the addresses are not necessarily multiples of the access
size (16/32/64 bits). This is accomplished using two methods:
1) The portable method allocates a uint{16,32,64}_t on the stack, and
std::memcpy()s the bytes into/from the integer. This method relies on
well-defined behavior (std::memcpy() works on all valid pointers,
fixed-width unsigned integer types use a pure binary representation and
therefore have no invalid values), and should compile to valid code on
all platforms.
2) The fast method reinterpret_casts the address to a pointer to a
uint{16,32,64}_t and dereferences the pointer. This is expected to
compile to one hardware instruction (mov on x86, ldr/str on arm). The
caveat is that the reinterpret_cast is undefined behavior (UB) unless the
address happened to be a valid uint{16,32,64}_t pointer. The UB shows up
as follows.
* On architectures that don't have hardware instructions for unaligned
loads / stores, the pointer access can trigger a hardware exception.
This is mitigated by #ifdef blocks that attempt to restrict the fast
method to platforms that support it.
* On architectures that have separate instructions for aligned and
unaligned access, the compiler may need an explicit hint to emit the
hardware instruction for unaligned access. This is accomplished on
Clang and GCC by wrapping the pointers into structs tagged with
__attribute__((__packed__)).
This CL removes the fast method. Fortunately, compilers have advanced
enough that the portable method gets compiled down to the same
instructions as the fast method, without the need for the caveats
explained above. Specifically, modern Clang, GCC and MSVC optimize
std::memcpy() to a single instruction (mov / ldr / str). A test case
proving this can be seen at https://godbolt.org/z/gZg2Fk
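
For reference, the portable method looks roughly like the sketch below. The function names are illustrative and not necessarily the ones used in the Snappy sources.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads 4 bytes from an arbitrarily aligned address. std::memcpy() has
// defined behavior for any valid pointer, so no alignment is required.
inline uint32_t UnalignedLoad32(const void* p) {
  assert(p != nullptr);
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Writes 4 bytes to an arbitrarily aligned address.
inline void UnalignedStore32(void* p, uint32_t v) {
  assert(p != nullptr);
  std::memcpy(p, &v, sizeof(v));
}
```

The value is stored in the host byte order, so a load after a store at the same address round-trips regardless of endianness.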
PiperOrigin-RevId: 306342728
The platform-independent code that breaks down the loads and stores into
byte-level operations is optimized into single instructions (mov or
ldr/str) and instruction pairs (mov+bswap or ldr/str+rev) by recent
versions of Clang and GCC. Tested at https://godbolt.org/z/2BQP-o
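
A sketch of the byte-level approach for 32-bit little-endian values, with illustrative function names. Whether a given compiler collapses this into a single instruction depends on its version and optimization flags.

```cpp
#include <cassert>
#include <cstdint>

// Assembles a little-endian 32-bit value one byte at a time. The result is
// the same on little-endian and big-endian hosts.
inline uint32_t LoadLE32(const void* ptr) {
  assert(ptr != nullptr);
  const uint8_t* b = static_cast<const uint8_t*>(ptr);
  return static_cast<uint32_t>(b[0]) |
         (static_cast<uint32_t>(b[1]) << 8) |
         (static_cast<uint32_t>(b[2]) << 16) |
         (static_cast<uint32_t>(b[3]) << 24);
}

// Writes a 32-bit value as four little-endian bytes.
inline void StoreLE32(void* ptr, uint32_t v) {
  assert(ptr != nullptr);
  uint8_t* b = static_cast<uint8_t*>(ptr);
  b[0] = static_cast<uint8_t>(v);
  b[1] = static_cast<uint8_t>(v >> 8);
  b[2] = static_cast<uint8_t>(v >> 16);
  b[3] = static_cast<uint8_t>(v >> 24);
}
```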
PiperOrigin-RevId: 306321608
An internal CL started using ABSL_ATTRIBUTE_ALWAYS_INLINE
from Abseil. This CL introduces equivalent functionality as
SNAPPY_ALWAYS_INLINE.
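
One common way to define such a macro is sketched below. `EXAMPLE_ALWAYS_INLINE` and the compiler checks are assumptions for illustration; the actual SNAPPY_ALWAYS_INLINE definition may be selected differently, for example through a build-system feature test.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#define EXAMPLE_ALWAYS_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_ALWAYS_INLINE inline __attribute__((always_inline))
#else
// Fall back to a plain inline hint on unknown compilers.
#define EXAMPLE_ALWAYS_INLINE inline
#endif

// Hypothetical hot-path helper that should never pay for a call.
EXAMPLE_ALWAYS_INLINE int HalfOfEven(int x) {
  assert(x % 2 == 0);
  return x / 2;
}
```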
PiperOrigin-RevId: 306289650
The following changes are done via find/replace.
* int8 -> int8_t
* int16 -> int16_t
* int32 -> int32_t
* int64 -> int64_t
The aliases were removed from snappy-stubs-public.h.
PiperOrigin-RevId: 306141557
This CL replaces memcpy() with std::memcpy()
and memmove() with std::memmove(), and #includes
<cstring> in files that use either function.
PiperOrigin-RevId: 306067788
BM_ZCord/0 [html ] 1.26GB/s ± 0% 1.35GB/s ± 0% +7.90% (p=0.008 n=5+5)
BM_ZCord/1 [urls ] 535MB/s ± 0% 562MB/s ± 0% +5.05% (p=0.008 n=5+5)
BM_ZCord/2 [jpg ] 10.2GB/s ± 1% 10.2GB/s ± 0% ~ (p=0.310 n=5+5)
BM_ZCord/3 [jpg_200] 841MB/s ± 1% 846MB/s ± 1% ~ (p=0.421 n=5+5)
BM_ZCord/4 [pdf ] 6.77GB/s ± 1% 7.06GB/s ± 1% +4.28% (p=0.008 n=5+5)
BM_ZCord/5 [html4 ] 1.00GB/s ± 0% 1.08GB/s ± 0% +7.94% (p=0.008 n=5+5)
BM_ZCord/6 [txt1 ] 391MB/s ± 0% 417MB/s ± 0% +6.71% (p=0.008 n=5+5)
BM_ZCord/7 [txt2 ] 363MB/s ± 0% 388MB/s ± 0% +6.73% (p=0.016 n=5+4)
BM_ZCord/8 [txt3 ] 400MB/s ± 0% 426MB/s ± 0% +6.55% (p=0.008 n=5+5)
BM_ZCord/9 [txt4 ] 328MB/s ± 0% 350MB/s ± 0% +6.66% (p=0.008 n=5+5)
BM_ZCord/10 [pb ] 1.67GB/s ± 1% 1.80GB/s ± 0% +7.52% (p=0.008 n=5+5)
1) A key bottleneck in the data dependency chain is figuring out how many bytes are matched and loading the data for the next hash value. The load-to-use latency is 5 cycles; in the previous cl/303353110 we replaced the load with "shrd" to align previous loads. Unfortunately "shrd" itself has a latency of 4 cycles; we'd prefer "shrx", which takes 1 cycle for variable shifts.
2) Maximally use data already computed. The above trick calculates 5 bytes of useful data, so if we need to search for a new match we can use this for the first search (which is one byte further).
PiperOrigin-RevId: 303875535
1) It shaves off a few cycles from the data dependency chain by using "shrd" instead of a load.
2) The important loop finds small copies (4-12 bytes), which are emitted as either "copy 1" or "copy 2" depending on whether the offset is < 2048. It turns out that this branch is mispredicted often. Due to the long dependency chain the CPU is running with IPC~1 anyway, so we can freely add instructions to emit copies branch-free instead. This reduces branch mispredicts from 15% to 11% (for BM_ZFlat/6 txt1) and from 5.6% to 4% (for BM_ZFlat/10 or pb).
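
A sketch of a branch-free encoder for short copies, written against the published Snappy format (tag 01 for a copy with a 1-byte offset, tag 10 for a copy with a 2-byte offset). The function name and layout are illustrative, and whether the compiler turns the two selects into conditional moves is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t kCopy1ByteOffset = 1;  // tag bits 01
constexpr uint32_t kCopy2ByteOffset = 2;  // tag bits 10

// Encodes a copy with 4 <= len < 12 and offset < 65536. Always writes 4
// bytes at `op` (the caller must provide that slop) and returns how many of
// them belong to the element: 2 for "copy 1", 3 for "copy 2".
inline std::size_t EmitShortCopyBranchFree(uint8_t* op, uint32_t offset,
                                           uint32_t len) {
  assert(len >= 4 && len < 12);
  assert(offset < 65536);
  // Common part: length in bits 2.. of the tag, offset in the next bytes.
  uint32_t u = (len << 2) + (offset << 8);
  // "copy 1" stores len - 4 and puts offset bits 8..10 into tag bits 5..7.
  const uint32_t copy1 =
      kCopy1ByteOffset - (4u << 2) + ((offset >> 3) & 0xe0);
  // "copy 2" stores len - 1 and a 16-bit little-endian offset.
  const uint32_t copy2 = kCopy2ByteOffset - (1u << 2);
  u += offset < 2048 ? copy1 : copy2;
  // Little-endian store of all 4 bytes; surplus bytes are overwritten later.
  op[0] = static_cast<uint8_t>(u);
  op[1] = static_cast<uint8_t>(u >> 8);
  op[2] = static_cast<uint8_t>(u >> 16);
  op[3] = static_cast<uint8_t>(u >> 24);
  return offset < 2048 ? 2 : 3;
}
```

Both encodings are computed with plain arithmetic and the choice between them is a data select rather than a jump, which is what removes the hard-to-predict branch.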
PiperOrigin-RevId: 303328967
Two ideas
1) The code uses "heuristic match skipping", which has a quadratic interpolation. However, for the first 32 bytes it's just every byte. Special-case 16 bytes. This removes a lot of code.
2) Load 64-bit integers and shift instead of reloading. The hashing loop has a very long chain: data = Load32(ip) -> hash = Hash(data) -> offset = table[hash] -> copy_data = Load32(base_ip + offset), followed by a compare between data and copy_data. This chain is around 20 cycles. It's unreasonable to expect the branch predictor to predict when there is a match (that is completely driven by the content of the data). So when it's a miss, this chain is on the critical path. By loading 64 bits and shifting we can effectively remove the first load.
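
The load-and-shift idea can be sketched as follows. The helpers use byte-wise little-endian loads so the example behaves the same on any host; the names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Byte-wise little-endian loads (host-independent).
inline uint32_t LoadLE32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

inline uint64_t LoadLE64(const uint8_t* p) {
  return static_cast<uint64_t>(LoadLE32(p)) |
         (static_cast<uint64_t>(LoadLE32(p + 4)) << 32);
}

// Given the 8 bytes loaded at ip, returns what Load32(ip + k) would return
// for 0 <= k <= 4, using a shift instead of another memory access.
inline uint32_t Window32(uint64_t window, int k) {
  assert(k >= 0 && k <= 4);
  return static_cast<uint32_t>(window >> (8 * k));
}
```

One 64-bit load therefore covers five consecutive 4-byte hash inputs, so the next position's data is available without waiting for a new load.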
PiperOrigin-RevId: 302893821
2) Reduce the input pointer loop carried dependency chain from 7 cycles to 4 cycles by using pre-loading. This is a very subtle point.
3) Just brutally copy 64 bytes, which removes a difficult-to-predict branch from the innermost loop. There is enough bandwidth to do so in the intrinsic cycles of the loop.
4) Implement limit pointers that include the slop region. This removes unnecessary instructions from the hot path.
5) It seems the removal of the difficult-to-predict branch has removed the code's sensitivity to alignment, so remove the asm nops.
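
The unconditional 64-byte copy can be sketched as below. `kSlopBytes` and `CopyAtMost64` are hypothetical names, and the sketch assumes that both buffers have at least 64 readable or writable bytes at the given positions and do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kSlopBytes = 64;

// Copies `len` <= 64 bytes by always moving 64 bytes, so there is no
// length-dependent branch. Returns the new output position; bytes written
// past it are scratch and get overwritten by later output.
inline char* CopyAtMost64(const char* src, char* dst, std::size_t len) {
  assert(len <= kSlopBytes);
  std::memcpy(dst, src, kSlopBytes);
  return dst + len;
}
```

Limit pointers that already account for the slop region let the caller check once that 64 bytes are available, instead of checking inside the copy.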
PiperOrigin-RevId: 294692928
On Skylake, this improves protobuf and html decompression speed by 15% and 9% respectively, and the rest by ~2%.
On Haswell, this improves protobuf and html decompression speed by 23% and 16% respectively, and the rest by ~3%.
PiperOrigin-RevId: 289090401
* Fix bash conditionals: [ a == b ] should be [ a = b ].
* Upgrade to LLVM 9 on Travis.
* Upgrade fuzzer build arguments for LLVM 9.
PiperOrigin-RevId: 271898655