snappy

Commit Graph

Author	SHA1	Message	Date
Victor Costan	8995ffabb9	Replace #pragma nounroll with equivalent used elsewhere. PiperOrigin-RevId: 347341130	2020-12-14 09:59:34 +00:00
Victor Costan	d1daa83044	Remove inline qualifier from static variables. This feature requires C++17. Fortunately, inline is useful for header declarations, which may be included in multiple compilation units. The declarations modified by this CL occur in a single compilation unit. PiperOrigin-RevId: 347338760	2020-12-14 09:59:23 +00:00
Snappy Team	3b571656fa	1) Improve the lookup table data to require fewer instructions to extract the necessary data. We now store len - offset in a signed int16; this removes the need to mask offset in the calculations, and the calculations that need to be done give precisely the flags that we need for testing correctness. 2) Replace offset extraction with a lookup mask. This is fewer uops and is needed because we need to special case type 3 to always return 0 so as to properly trigger the fallback. 3) Unroll the loop twice; this removes some loop-condition checks and improves the generated assembly. The loop variables tend to end up in a different register, requiring movs; having two consecutive copies allows the movs to be elided. PiperOrigin-RevId: 346663328	2020-12-14 02:48:03 +00:00
Shahriar Rouf	a9730ed505	Optimize zippy decompression by making IncrementalCopy faster. When SSSE3 is available: - Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7). - This enables us to do 16 byte copies instead of 8 bytes copies because we know that the pattern size >= 16. - Use shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16) which would allow FDO to layout the code with respect to actual probabilities of each length. - The PSHUFB masks are now generated programmatically at compile-time. When SSSE3 is unavailable: - No change. In both cases: - assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this. 'bin' case is notably >20% faster because it has many repeated character patterns (i.e. pattern_size = 1). PiperOrigin-RevId: 346454471	2020-12-14 02:47:49 +00:00
Snappy Team	56c2c247d0	Internal change PiperOrigin-RevId: 345360683	2020-12-03 22:52:52 +00:00
Shahriar Rouf	a94be58e65	Optimize zippy decompression by making IncrementalCopy faster. When SSSE3 is available: - Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7). - This enables us to do 16 byte copies instead of 8 bytes copies because we know that the pattern size >= 16. - Use shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16) which would allow FDO to layout the code with respect to actual probabilities of each length. - The PSHUFB masks are now generated programmatically at compile-time. When SSSE3 is unavailable: - No change. In both cases: - assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this. 'bin' case is notably >20% faster because it has many repeated character patterns (i.e. pattern_size = 1). PiperOrigin-RevId: 345340892	2020-12-03 22:52:41 +00:00
Snappy Team	01a566f825	Fix opensource version PiperOrigin-RevId: 343272548	2020-11-19 17:06:26 +00:00
Snappy Team	616b8229b6	Add LZ4 as a benchmark option. Snappy is starting to look really good compared to LZ4. LZ4 is considered the fastest solution by many on the internet. We now see that Snappy is actually becoming very competitive, with compression a little faster and decompression slower but certainly not terribly slower. PiperOrigin-RevId: 343140860	2020-11-18 23:22:04 +00:00
Snappy Team	e4a6e97b91	Extend validate benchmarks over all types and also add a medley for validation. I also made the compression happen only once per benchmark. This way we get a cleaner measurement of #branch-misses using "perf stat". Compression suffers naturally from a large number of branch misses which was polluting the measurements. This showed that with the new decompression the branch miss rate is actually much lower than initially reported, only 0.2% and very stable, i.e. it doesn't really fluctuate with how you execute the benchmarks. PiperOrigin-RevId: 342628576	2020-11-18 23:21:55 +00:00
Snappy Team	719bed0ae2	Bug fix. Error on 0 offset copies. PiperOrigin-RevId: 342447553	2020-11-18 23:21:47 +00:00
Snappy Team	289c8a3c0a	Make zippy decompression branchless PiperOrigin-RevId: 342423961	2020-11-18 23:21:38 +00:00
Snappy Team	3bfa265a04	Revert zippy optimization that causes heap buffer overflows. PiperOrigin-RevId: 342283314	2020-11-18 23:21:30 +00:00
Shahriar Rouf	4d2dc9dcbb	Optimize zippy unzipping by up to >10% by making IncrementalCopy faster. When SSSE3 is available: - Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7). - This enables us to do 16 byte copies instead of 8 bytes copies because we know that the pattern size >= 16. - Use shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16) which would allow FDO to layout the code with respect to actual probabilities of each length. - The PSHUFB masks are now generated programmatically at compile-time. When SSSE3 is unavailable: - No change. In both cases: - assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this. PiperOrigin-RevId: 342267037	2020-11-18 23:21:21 +00:00
Snappy Team	11e5165b98	Add a benchmark that decreased the branch prediction memorization by increasing the amount of independent branches executed per benchmark iteration. PiperOrigin-RevId: 342242843	2020-11-18 23:21:12 +00:00
Luca Versari	6835abd953	Change hash function for Compress. ((a*b)>>18) & mask has higher throughput than (a*b)>>shift, and produces the same results when the hash table size is 2**14. In other cases, the hash function is still good, but it's not as necessary for that to be the case as the input is small anyway. This speeds up encoding, especially in cases where hashing is a significant part of the encoding critical path (small or uncompressible files). PiperOrigin-RevId: 341498741	2020-11-18 23:20:58 +00:00
Victor Costan	368b01c8dd	Merge pull request #107 from jsteemann:bug-fix/fix-compile-warning PiperOrigin-RevId: 340505526	2020-11-03 20:51:55 +00:00
Snappy Team	1ce58af28e	Fix the use of op + len when op is nullptr and len is non-zero. See https://reviews.llvm.org/D67122 for some discussion of why this can matter. I don't think this should have any noticeable effect on performance. PiperOrigin-RevId: 340255083	2020-11-03 20:30:24 +00:00
Luca Versari	0b990db2b8	Run clang-format PiperOrigin-RevId: 339897712	2020-11-03 20:30:11 +00:00
jsteemann	cb2b3c7ec6	fix compile warnings due to missing override specifiers When building Snappy with compiler option `-Wsuggest-override` set via the CMAKE_CXX_FLAGS, compilation produces warnings in `snappy-sinksource.h`: ``` In file included from ./snappy-fork/snappy-sinksource.cc:32: ./snappy-fork/snappy-sinksource.h:150:18: error: ‘virtual size_t snappy::ByteArraySource::Available() const’ can be marked override [-Werror=suggest-override] 150 \| virtual size_t Available() const; \| ^~~~~~~~~ ./snappy-fork/snappy-sinksource.h:151:23: error: ‘virtual const char* snappy::ByteArraySource::Peek(size_t)’ can be marked override [-Werror=suggest-override] 151 \| virtual const char Peek(size_t* len); \| ^~~~ ./snappy-fork/snappy-sinksource.h:152:16: error: ‘virtual void snappy::ByteArraySource::Skip(size_t)’ can be marked override [-Werror=suggest-override] 152 \| virtual void Skip(size_t n); \| ^~~~ ./snappy-fork/snappy-sinksource.h:163:16: error: ‘virtual void snappy::UncheckedByteArraySink::Append(const char, size_t)’ can be marked override [-Werror=suggest-override] 163 \| virtual void Append(const char data, size_t n); \| ^~~~~~ ./snappy-fork/snappy-sinksource.h:164:17: error: ‘virtual char* snappy::UncheckedByteArraySink::GetAppendBuffer(size_t, char)’ can be marked override [-Werror=suggest-override] 164 \| virtual char GetAppendBuffer(size_t len, char* scratch); \| ^~~~~~~~~~~~~~~ ./snappy-fork/snappy-sinksource.h:165:17: error: ‘virtual char* snappy::UncheckedByteArraySink::GetAppendBufferVariable(size_t, size_t, char, size_t, size_t)’ can be marked override [-Werror=suggest-override] 165 \| virtual char* GetAppendBufferVariable( \| ^~~~~~~~~~~~~~~~~~~~~~~ ./snappy-fork/snappy-sinksource.h:168:16: error: ‘virtual void snappy::UncheckedByteArraySink::AppendAndTakeOwnership(char, size_t, void ()(void, const char, size_t), void*)’ can be marked override [-Werror=suggest-override] 168 \| virtual void AppendAndTakeOwnership( \| ^~~~~~~~~~~~~~~~~~~~~~ cc1plus: all warnings being treated as errors ``` This PR adds the missing override specifiers to the sink implementations, so compilation works fine again. Tested it with g++-9.3 and g++-10.2. Compatibility note: Override specifiers were introduced with C++11, which Snappy seems to effectively require, at least according to its CMakeLists.txt file and due to the usage of some C++11-only STL types in its tests.	2020-10-29 21:50:50 +01:00
Chris Kennelly	7ffaf77cf4	Replace ARCH_K8 with __x86_64__. PiperOrigin-RevId: 321389098	2020-10-07 21:12:27 +00:00
Snappy Team	4dd277fed4	Replace the division with a constant table in IncrementalCopy PiperOrigin-RevId: 320686580	2020-07-11 01:54:52 +00:00
Snappy Team	f16eda3466	Correct uninitialized variable. PiperOrigin-RevId: 312741918	2020-06-01 23:46:44 +00:00
Victor Costan	837f38b3e0	Revise stubs for ARCH_{K8,PPC,ARM}. * ARCH_K8 and ARCH_ARM now work correctly on MSVC. * ARCH_PPC now uses the same macro as tcmalloc. Microsoft documentation: https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=vs-2019 PowerPC documentation: http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655243_75216.html PiperOrigin-RevId: 310160787	2020-05-06 16:07:47 +00:00
Victor Costan	e1353b9fa8	Remove ARCH_* guards around Bits::FindLSBSetNonZero64(). Bits::FindLSBSetNonZero64() is now available unconditionally, and it should be easier to reason about the code included in each build configuration. This reduces the amount of conditional compiling going on, which makes it easier to reason about what stubs are used in each build configuration. The guards were added to work around the fact that MSVC has a _BitScanForward64() intrinsic, but the intrinsic is only implemented on 64-bit targets, and causes a compilation error on 32-bit targets. By contrast, Clang and GCC support __builtin_ctzll() everywhere, and implement it with varying efficiency. This CL reworks the conditional compilation directives so that Bits::FindLSBSetNonZero64() uses the _BitScanForward64() intrinsic on MSVC when available, and the portable implementation otherwise. PiperOrigin-RevId: 310007748	2020-05-05 20:18:12 +00:00
Victor Costan	c98344f626	Fix Clang/GCC compilation warnings. This makes it easier to adopt snappy in other projects. PiperOrigin-RevId: 309958249	2020-05-05 16:15:02 +00:00
Victor Costan	113cd97ab3	Tighten types on a few for loops. * Replace post-increment with pre-increment in for loops. * Replace unsigned int counters with precise types, like uint8_t. * Switch to C++11 iterating loops when possible. PiperOrigin-RevId: 309724233	2020-05-04 12:32:00 +00:00
Victor Costan	abde3abb1f	Fix Travis CI build. PiperOrigin-RevId: 309143110	2020-04-30 02:09:07 +00:00
Victor Costan	e6506681fa	Fix accidental double std:: qualifiers. PiperOrigin-RevId: 309136120	2020-04-30 01:19:26 +00:00
Victor Costan	63620c06d2	Add some std:: qualifiers to types and functions. PiperOrigin-RevId: 309110343	2020-04-29 22:31:55 +00:00
Victor Costan	5417da69b7	Switch from C headers to C++ headers. This CL makes the following substitutions. * assert.h -> cassert * math.h -> cmath * stdarg.h -> cstdarg * stdio.h -> cstdio * stdlib.h -> cstdlib * string.h -> cstring stddef.h and stdint.h are not migrated to C++ headers. PiperOrigin-RevId: 309074805	2020-04-29 19:38:03 +00:00
Victor Costan	251d935d50	Remove #include <string> from snappy-stubs-public.h. The header hasn't been needed since the removal of the snappy::string alias to std::string. PiperOrigin-RevId: 306446542	2020-04-14 16:50:30 +00:00
Victor Costan	4f195aee43	Remove mismatched #endif. PiperOrigin-RevId: 306345559	2020-04-14 00:38:04 +00:00
Victor Costan	041c608086	Remove platform-dependent code for unaligned loads/stores. Snappy issues multi-byte (16/32/64-bit) loads and stores that are not aligned, meaning the addresses are not necessarily 16/32/64-bit multiples. This is accomplished using two methods: 1) The portable method allocates a uint{16,32,64}_t on the stack, and std::memcpy()s the bytes into/from the integer. This method relies on well-defined behavior (std::memcpy() works on all valid pointers, fixed-width unsigned integer types use a pure binary representation and therefore have no invalid values), and should compile to valid code on all platforms. 2) The fast method reinterpret_casts the address to a pointer to a uint{16,32,64}_t and dereferences the pointer. This is expected to compile to one hardware instruction (mov on x86, ldr/str on arm). The caveat is that the reinterpret_cast is undefined behavior (UB) unless the address happened to be a valid uint{16,32,64}_t pointer. The UB shows up as follows. * On architectures that don't have hardware instructions for unaligned loads / stores, the pointer access can trigger a hardware exception. This is mitigated by #ifdef blocks that attempt to restrict the fast method to platforms that support it. * On architectures that have separate instructions for aligned and unaligned access, the compiler may need an explicit hint to emit the hardware instruction for unaligned access. This is accomplished on Clang and GCC by wrapping the pointers into structs tagged with __attribute__((__packed__)). This CL removes the fast method. Fortunately, compilers have advanced enough that the portable method gets compiled down to the same instructions as the fast method, without the need for the caveats explained above. Specifically, modern Clang, GCC and MSVC optimize std::memcpy() to a single instruction (mov / ldr / str). A test case proving this can be seen at https://godbolt.org/z/gZg2Fk PiperOrigin-RevId: 306342728	2020-04-14 00:22:20 +00:00
Victor Costan	27ff130ff9	Remove platform-dependent code for little-endian loads and stores. The platform-independent code that breaks down the loads and stores into byte-level operations is optimized into single instructions (mov or ldr/str) and instruction pairs (mov+bswap or ldr/str+rev) by recent versions of Clang and GCC. Tested at https://godbolt.org/z/2BQP-o PiperOrigin-RevId: 306321608	2020-04-13 22:30:59 +00:00
Victor Costan	a4cdb5d133	Introduce SNAPPY_ATTRIBUTE_ALWAYS_INLINE. An internal CL started using ABSL_ATTRIBUTE_ALWAYS_INLINE from Abseil. This CL introduces equivalent functionality as SNAPPY_ATTRIBUTE_ALWAYS_INLINE. PiperOrigin-RevId: 306289650	2020-04-13 19:51:05 +00:00
Victor Costan	231b8be076	Migrate to standard integral types. The following changes are done via find/replace. * int8 -> int8_t * int16 -> int16_t * int32 -> int32_t * int64 -> int64_t The aliases were removed from snappy-stubs-public.h. PiperOrigin-RevId: 306141557	2020-04-12 20:10:03 +00:00
Victor Costan	14bef66290	Modernize memcpy() and memmove() usage. This CL replaces memcpy() with std::memcpy() and memmove() with std::memmove(), and #includes <cstring> in files that use either function. PiperOrigin-RevId: 306067788	2020-04-12 00:06:15 +00:00
Snappy Team	d674348a0c	Improve zippy with 5-10%. BM_ZCord/0 [html ] 1.26GB/s ± 0% 1.35GB/s ± 0% +7.90% (p=0.008 n=5+5) BM_ZCord/1 [urls ] 535MB/s ± 0% 562MB/s ± 0% +5.05% (p=0.008 n=5+5) BM_ZCord/2 [jpg ] 10.2GB/s ± 1% 10.2GB/s ± 0% ~ (p=0.310 n=5+5) BM_ZCord/3 [jpg_200] 841MB/s ± 1% 846MB/s ± 1% ~ (p=0.421 n=5+5) BM_ZCord/4 [pdf ] 6.77GB/s ± 1% 7.06GB/s ± 1% +4.28% (p=0.008 n=5+5) BM_ZCord/5 [html4 ] 1.00GB/s ± 0% 1.08GB/s ± 0% +7.94% (p=0.008 n=5+5) BM_ZCord/6 [txt1 ] 391MB/s ± 0% 417MB/s ± 0% +6.71% (p=0.008 n=5+5) BM_ZCord/7 [txt2 ] 363MB/s ± 0% 388MB/s ± 0% +6.73% (p=0.016 n=5+4) BM_ZCord/8 [txt3 ] 400MB/s ± 0% 426MB/s ± 0% +6.55% (p=0.008 n=5+5) BM_ZCord/9 [txt4 ] 328MB/s ± 0% 350MB/s ± 0% +6.66% (p=0.008 n=5+5) BM_ZCord/10 [pb ] 1.67GB/s ± 1% 1.80GB/s ± 0% +7.52% (p=0.008 n=5+5) 1) A key bottleneck in the data dependency chain is figuring out how many bytes are matched and loading the data for next hash value. The load-to-use latency is 5 cycles, in previous cl/303353110 we removed the load in lieu of "shrd" to align previous loads. Unfortunately "shrd" itself has a latency of 4 cycles, we'd prefer "shrx" which takes 1 cycle for variable shifts. 2)Maximally use data already computed. The above trick calculates 5 bytes of useful data. So in case we need to search for new match we can use this for the first search (which is one byte further). PiperOrigin-RevId: 303875535	2020-04-11 04:41:15 +00:00
Snappy Team	4dfcad9f4e	assertion failure on darwin_x86_64, have to investigate PiperOrigin-RevId: 303428229	2020-04-11 04:41:07 +00:00
Snappy Team	e19178748f	assertion failure on darwin_x86_64, have to investigate PiperOrigin-RevId: 303346402	2020-04-11 04:40:57 +00:00
Snappy Team	0faf56378e	This CL does two things: 1) It shaves off a few cycles from the data dependency chain by using "shrd" instead of a load. 2) The important loop is finding small copies (4-12) which are either "copy 1" or "copy 2", depending on whether the offset fits <2048. It turns out that this is a branch that is mispredicted often. Due to the long dependency chain the CPU is running with IPC~1 anyway, so we can freely add instructions to instead emit copies branch-free. This reduces the branch mispredicts from 15% to 11% (for BM_ZFlat/6 txt1) and from 5.6% to 4% (for BM_ZFlat/10 or pb). PiperOrigin-RevId: 303328967	2020-04-11 04:40:48 +00:00
Snappy Team	0c7ed08a25	The improvement on the protobuf benchmark is around 19%. Results vary by their propensity for compression, as the frequency of finding matches influences the amount of branch mispredicts and the amount of hashing. Two ideas: 1) The code uses "heuristic match skipping", which has a quadratic interpolation. However for the first 32 bytes it's just every byte. Special case 16 bytes. This removes a lot of code. 2) Load 64 bit integers and shift instead of reload. The hashing loop has a very long chain data = Load32(ip) -> hash = Hash(data) -> offset = table[hash] -> copy_data = Load32(base_ip + offset) followed by a compare between data and copy_data. This chain is around 20 cycles. It's unreasonable for the branch predictor to be able to predict when it's a match (that is completely driven by the content of the data). So when it's a miss this chain is on the critical path. By loading 64 bits and shifting we can effectively remove the first load. PiperOrigin-RevId: 302893821	2020-04-11 04:40:39 +00:00
Snappy Team	3c77e01459	1) Make the output pointer a local variable such that it doesn't need a load add store on its loop carried dependency chain. 2) Reduce the input pointer loop carried dependency chain from 7 cycles to 4 cycles by using pre-loading. This is a very subtle point. 3) Just brutally copy 64 bytes, which removes a difficult to predict branch from the innermost loop. There is enough bandwidth to do so in the intrinsic cycles of the loop. 4) Implement limit pointers that include the slop region. This removes unnecessary instructions from the hot path. 5) It seems the removal of the difficult to predict branch has removed the code sensitivity to alignment, so remove the asm nops. PiperOrigin-RevId: 294692928	2020-04-11 04:40:29 +00:00
Snappy Team	9eabb7baba	Cut a load from the critical dependency chain of the input pointer by speculating the uncommon case of COPY_4 is not happening. PiperOrigin-RevId: 293803653	2020-04-11 04:40:20 +00:00
Snappy Team	cddd9c0875	Improve comments in IncrementalCopy, add an assert. PiperOrigin-RevId: 292506754	2020-04-11 04:40:09 +00:00
Victor Costan	537f4ad624	Tag open source release 1.1.8. PiperOrigin-RevId: 289675084	2020-01-14 10:58:53 -08:00
Snappy Team	b5477a8457	Optimize IncrementalCopy: There are between 1 and 4 copy iterations. Allow FDO to work with full knowledge of the probabilities for each branch. On skylake, this improves protobuf and html decompression speed by 15% and 9% respectively, and the rest by ~2%. On haswell, this improves protobuf and html decompression speed by 23% and 16% respectively, and the rest by ~3%. PiperOrigin-RevId: 289090401	2020-01-14 10:58:42 -08:00
Victor Costan	f5acee902c	Move CI to Visual Studio 2019. PiperOrigin-RevId: 279785698	2019-11-11 12:05:59 -08:00
Victor Costan	26410cc4f8	Merge pull request #85 from bitomaxsp:patch-1 PiperOrigin-RevId: 279633518	2019-11-10 14:10:50 -08:00
Victor Costan	0eec45ed16	Align CMake configuration with related projects. PiperOrigin-RevId: 279237837	2019-11-07 22:39:04 -08:00
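
Commits 041c608086 and 27ff130ff9 above describe replacing platform-dependent loads and stores with portable code that compilers reduce to single instructions. A minimal sketch of that portable style might look like the following. The helper names are hypothetical and chosen for illustration; they are not taken from the Snappy sources.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper names for illustration; Snappy's own helpers may differ.

// Loads a 32-bit value in native byte order from an address that may be
// unaligned. std::memcpy() is well defined for any valid pointer, unlike
// dereferencing a reinterpret_cast'ed pointer.
inline std::uint32_t UnalignedLoad32(const void* p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Stores a 32-bit value in native byte order to a possibly unaligned address.
inline void UnalignedStore32(void* p, std::uint32_t v) {
  std::memcpy(p, &v, sizeof(v));
}

// Loads a 32-bit little-endian value by assembling it from individual bytes,
// so the result does not depend on the host's endianness.
inline std::uint32_t LoadLittleEndian32(const void* p) {
  const unsigned char* b = static_cast<const unsigned char*>(p);
  return static_cast<std::uint32_t>(b[0]) |
         (static_cast<std::uint32_t>(b[1]) << 8) |
         (static_cast<std::uint32_t>(b[2]) << 16) |
         (static_cast<std::uint32_t>(b[3]) << 24);
}
```

Whether a given compiler actually folds these into one instruction depends on the compiler version and optimization level; the Compiler Explorer links in the commit messages are the authors' own evidence for that claim.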

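
Commit 6835abd953 above states that the shift-then-mask form of the multiplicative hash gives the same result as the plain shift form when the hash table has 2**14 entries. The sketch below illustrates why: with a 32-bit product, shifting right by 18 already leaves only 14 bits, so masking with 2**14 - 1 changes nothing. The multiplier value and function names are assumptions made for this example.

```cpp
#include <cstdint>

// Assumed values for illustration only.
constexpr std::uint32_t kMul = 0x1e35a7bd;  // assumed odd multiplier
constexpr int kTableBits = 14;              // table size 2**14
constexpr std::uint32_t kMask = (1u << kTableBits) - 1;

// Older form: shift by an amount derived from the table size.
inline std::uint32_t HashWithShift(std::uint32_t bytes) {
  return (bytes * kMul) >> (32 - kTableBits);
}

// Newer form: fixed shift of 18 followed by a mask.
inline std::uint32_t HashWithMask(std::uint32_t bytes) {
  return ((bytes * kMul) >> 18) & kMask;
}
```

For smaller tables the mask form keeps lower bits of the shifted product instead of its top bits, which is the case the commit message describes as still good but less critical.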
368 Commits

All Branches