snappy

Commit Graph

Author	SHA1	Message	Date
Mathias Stearn	5ec5d16bd6	Perf tuning for gcc + aarch64	2024-01-18 15:07:21 +00:00
Richard O'Grady	27f34a580b	Fix -Wsign-compare warning PiperOrigin-RevId: 547529709	2023-07-12 11:12:48 -07:00
Ilya Tokar	92f18e66fd	Add prefetch to zippy compress PiperOrigin-RevId: 518358512	2023-03-29 17:31:17 -07:00
Snappy Team	9c42b71b19	Optimize check for uncommon decompression for ARM, saving two instructions and three cycles. PiperOrigin-RevId: 517141646	2023-03-29 17:30:58 -07:00
Snappy Team	7b82423c59	The output buffer in DecompressBranchless is never read from and the source buffers are never written. This allows us to defer any writes to the output buffer for an arbitrary amount of time as long as the writes all occur in the proper order. When a MemCopy64 would have normally occurred we save away the source address and length. Once we reach the location of the next write to the output buffer first perform the deferred copy. This gives time for the source address calculation and length to finish before the deferred copy. This change gives 1.84% on CLX and 0.97% Milan. PiperOrigin-RevId: 504012310	2023-03-07 06:35:00 -08:00
Snappy Team	74960e8bd6	Allow some buffer overwrite on literal emitting Calls to memcpy seem to be quite expensive ``` BM_ZFlat/0 [html (22.24 %) ] 114µs ± 6% 110µs ± 6% -3.97% (p=0.000 n=118+115) BM_ZFlat/1 [urls (47.84 %) ] 1.63ms ± 5% 1.58ms ± 5% -3.39% (p=0.000 n=117+115) BM_ZFlat/2 [jpg (99.95 %) ] 7.84µs ± 6% 7.70µs ± 6% -1.66% (p=0.000 n=119+117) BM_ZFlat/3 [jpg_200 (73.00 %)] 265ns ± 6% 255ns ± 6% -3.48% (p=0.000 n=101+98) BM_ZFlat/4 [pdf (83.31 %) ] 11.8µs ± 6% 11.6µs ± 6% -2.14% (p=0.000 n=118+116) BM_ZFlat/5 [html4 (22.52 %) ] 525µs ± 6% 513µs ± 6% -2.36% (p=0.000 n=117+116) BM_ZFlat/6 [txt1 (57.87 %) ] 494µs ± 5% 480µs ± 6% -2.84% (p=0.000 n=118+116) BM_ZFlat/7 [txt2 (62.02 %) ] 444µs ± 4% 428µs ± 7% -3.51% (p=0.000 n=119+117) BM_ZFlat/8 [txt3 (55.17 %) ] 1.34ms ± 5% 1.30ms ± 5% -2.40% (p=0.000 n=120+116) BM_ZFlat/9 [txt4 (66.41 %) ] 1.84ms ± 5% 1.78ms ± 5% -3.55% (p=0.000 n=110+111) BM_ZFlat/10 [pb (19.61 %) ] 101µs ± 5% 97µs ± 5% -4.67% (p=0.000 n=118+118) BM_ZFlat/11 [gaviota (37.73 %)] 368µs ± 5% 360µs ± 6% -2.13% (p=0.000 n=91+90) BM_ZFlat/12 [cp (48.25 %) ] 38.9µs ± 6% 36.8µs ± 6% -5.36% (p=0.000 n=88+87) BM_ZFlat/13 [c (42.52 %) ] 13.4µs ± 6% 13.1µs ± 8% -2.38% (p=0.000 n=115+116) BM_ZFlat/14 [lsp (48.94 %) ] 4.05µs ± 4% 3.94µs ± 4% -2.58% (p=0.000 n=91+85) BM_ZFlat/15 [xls (41.10 %) ] 1.42ms ± 5% 1.39ms ± 7% -2.49% (p=0.000 n=116+117) BM_ZFlat/16 [xls_200 (78.00 %)] 313ns ± 6% 307ns ± 5% -1.89% (p=0.000 n=89+84) BM_ZFlat/17 [bin (18.12 %) ] 518µs ± 5% 506µs ± 5% -2.42% (p=0.000 n=118+116) BM_ZFlat/18 [bin_200 (7.50 %) ] 86.8ns ± 6% 85.3ns ± 6% -1.76% (p=0.000 n=118+114) BM_ZFlat/19 [sum (48.99 %) ] 67.9µs ± 4% 61.1µs ± 6% -9.96% (p=0.000 n=114+117) BM_ZFlat/20 [man (59.45 %) ] 5.64µs ± 6% 5.47µs ± 7% -3.06% (p=0.000 n=117+115) BM_ZFlatAll [21 kTestDataFiles] 9.23ms ± 4% 9.01ms ± 5% -2.44% (p=0.000 n=80+83) BM_ZFlatIncreasingTableSize [7 tables ] 30.4µs ± 5% 29.3µs ± 7% -3.45% (p=0.000 n=96+96) ``` PiperOrigin-RevId: 490184133	2023-01-12 13:33:17 +00:00
Ilya Tokar	37f375ddeb	Add prefetch to zippy decompress, PiperOrigin-RevId: 489554313	2023-01-12 13:33:10 +00:00
Snappy Team	15e2a0e13d	Add "cc" clobbers to inline asm that modifies flags. As far as we know, the lack of "cc" in the clobbers hasn't caused problems yet, but it could. This change is to improve correctness, and is also almost certainly performance neutral. PiperOrigin-RevId: 487133620	2023-01-12 13:33:01 +00:00
Snappy Team	8881ba172a	Improve the speed of hashing in zippy compression. This change replaces the hashing function used during compression with one that is roughly as good but faster. This speeds up compression by two to a few percent on the Intel-, AMD-, and Arm-based machines we tested. The amount of compression is roughly unchanged. PiperOrigin-RevId: 485960303	2023-01-12 13:32:54 +00:00
Snappy Team	a2d219a8a8	Modify MemCopy64 to use AVX 32 byte copies instead of SSE2 16 byte copies on capable x86 platforms. This gives an average speedup of 6.87% on Milan and 1.90% on Skylake. PiperOrigin-RevId: 480370725	2023-01-12 13:32:43 +00:00
Matt Callanan	974fcc49e8	Fix compilation errors under C++11. `std::string::data()` is const-only until C++17. PiperOrigin-RevId: 479708109	2022-10-08 08:41:35 +02:00
Marcin Kowalczyk	d644ca8770	Fix warnings due to use of `__attribute__(always_inline)` without `inline`. PiperOrigin-RevId: 478984028	2022-10-05 10:38:16 +02:00
Matt Callanan	9758c9dfd7	Add `snappy::CompressFromIOVec`. This reads from an `iovec` array rather than from a `char` array as in `snappy::Compress`. PiperOrigin-RevId: 476930623	2022-09-29 09:32:28 -07:00
Victor Costan	af720f9a3b	Merge pull request #148 from pitrou:ubsan-ptr-add-overflow PiperOrigin-RevId: 463090354	2022-07-27 15:28:16 +00:00
Marcin Kowalczyk	44caf79086	Move the comment about non-overlap requirement from the implementation to the contract of `MemCopy64()`, and clarify that it applies to `size`, not to 64. PiperOrigin-RevId: 453920284	2022-07-27 15:28:08 +00:00
Snappy Team	d261d2766f	Optimize zippy MemCpy / MemMove during decompression By default MemCpy() / MemMove() always copies 64 bytes in DecompressBranchless(). Profiling shows that the vast majority of the time we need to copy many fewer bytes (typically <= 16 bytes). It is safe to copy fewer bytes as long as we exceed len. This change improves throughput by ~12% on ARM, ~35% on AMD Milan, and ~7% on Intel Cascade Lake. PiperOrigin-RevId: 453917840	2022-07-27 15:27:58 +00:00
Snappy Team	8dd58a519f	Fix compilation for older GCC and Clang versions. Not everything defining __GNUC__ supports flag outputs from asm statements; in particular, some Clang versions on macOS do not. The correct test per the GCC documentation is __GCC_ASM_FLAG_OUTPUTS__, so use that instead. PiperOrigin-RevId: 423749308	2022-02-20 18:19:45 +00:00
Antoine Pitrou	64df9f28c8	Fix UBSan error (ptr + offset overflow) As `i + offset` is promoted to a "negative" size_t, UBSan would complain when adding the resulting offset to `dst`: ``` /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43: runtime error: addition of unsigned offset to 0x6120003c5ec1 overflowed to 0x6120003c5ec0 #0 0x7f9ebd21769c in snappy::(anonymous namespace)::Copy64BytesWithPatternExtension(char, unsigned long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43 #1 0x7f9ebd21769c in std::__1::pair<unsigned char const, long> snappy::DecompressBranchless<char>(unsigned char const, unsigned char const, long, char, long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:1160:15 ```	2021-11-30 19:46:18 +01:00
Snappy Team	65dc7b3839	Pass by reference the first argument of ExtractLowBytes to avoid UB of passing uninitialized argument by value. PiperOrigin-RevId: 406052814	2021-11-14 22:09:42 +00:00
Jun He	aeb5de55a9	decompress: refine data dependency The final ip advance value doesn't have to wait for the result of offset to load *tag. It can be computed along with the offset, so the codegen will use one csinc in parallel with ldrb. This will improve the throughput. With this change it is observed ~4.2% uplift in UFlat/10 and ~3.7% in UFlatMedley Signed-off-by: Jun He <jun.he@arm.com> Change-Id: I20ab211235bbf578c6c978f2bbd9160a49e920da	2021-08-30 09:51:37 +08:00
Victor Costan	b9c9a989b2	Merge pull request #135 from JunHe77:remove_extra PiperOrigin-RevId: 390767998	2021-08-14 08:15:44 +00:00
Victor Costan	5c87bc61b6	Merge pull request #136 from JunHe77:ext_arm PiperOrigin-RevId: 390715690	2021-08-13 23:24:49 +00:00
Jun He	d643b9a988	decompress: add hint to remove extra AND Clang doesn't realize the load with free zero-extension, and emits another extra 'and xn, xm, 0xff' to calc offset. With this change, this extra op is removed, and consistent 1.7% performance uplift is observed. Signed-off-by: Jun He <jun.he@arm.com> Change-Id: Ica4617852c4b93eadc6c5c551dc3961ffbadb8f0	2021-08-12 15:19:53 +08:00
Jun He	f52721b2b4	decompression: optimize ExtractOffset for Arm Inspired by kExtractMasksCombined, this patch uses shift to replace table lookup. On Arm the codegen is 2 shift ops (lsl+lsr). Comparing to previous ldr which requires 4 cycles latency, the lsl+lsr only need 2 cycles. Slight (~0.3%) uplift observed on N1, and ~3% on A72. Signed-off-by: Jun He <jun.he@arm.com> Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b	2021-08-06 15:44:27 +08:00
Snappy Team	f2db8f77ce	Move the extract masks variable out in zippy. I see a consistent 1.5-2% improvement for ARM. Probably because ARM has more relaxed address computation than x86 https://www.godbolt.org/z/bfM1ezx41 . I don't think this is a compiler bug or it can do something about it PiperOrigin-RevId: 387569896	2021-08-02 14:50:16 +00:00
Snappy Team	c8f7641646	Remove inline assembly as the bug in clang was fixed PiperOrigin-RevId: 387356237	2021-08-02 14:50:09 +00:00
Snappy Team	9cc3689b21	Optimize memset to pure SIMD because compilers generate consistently bad code. clang for ARM and gcc for x86 https://gcc.godbolt.org/z/oxeGG7aEx PiperOrigin-RevId: 383467656	2021-08-02 14:49:57 +00:00
Snappy Team	b4888f7616	Optimize tag extraction for ARM with conditional increment instruction generation (csinc). For codegen see https://gcc.godbolt.org/z/a8z9j95Pv PiperOrigin-RevId: 382688740	2021-07-05 01:05:54 +00:00
atdt	b3fb0b5b4b	Enable vector byte shuffle optimizations on ARM NEON The SSSE3 intrinsics we use have their direct analogues in NEON, so making this optimization portable requires a very thin translation layer. PiperOrigin-RevId: 381280165	2021-07-05 01:05:44 +00:00
Victor Costan	5f913be04e	Fix unused local variable warnings. This will not change the compilation output. PiperOrigin-RevId: 347525836	2020-12-15 04:14:46 +00:00
Victor Costan	8995ffabb9	Replace #pragma nounroll with equivalent used elsewhere. PiperOrigin-RevId: 347341130	2020-12-14 09:59:34 +00:00
Victor Costan	d1daa83044	Remove inline qualifier from static variables. This feature requires C++17. Fortunately, inline is useful for header declarations, which may be included in multiple compilation units. The declarations modified by this CL occur in a single compilation unit. PiperOrigin-RevId: 347338760	2020-12-14 09:59:23 +00:00
Snappy Team	3b571656fa	1) Improve the lookup table data to require less instructions to extract the necessary data. We now store len - offset in a signed int16, this happens to remove masking offset in the calculations and the calculations that need to be done precisely give the flags that we need for testing correctness. 2) Replace offset extraction with a lookup mask. This is less uops and is needed because we need to special case type 3 to always return 0 as to properly trigger the fallback. 3) Unroll the loop twice, this removes some loop-condition checks AND it improves the generated assembly. The loop variables tend to end up in a different register requiring mov's having two consecutive copies allows the elision of the mov's. PiperOrigin-RevId: 346663328	2020-12-14 02:48:03 +00:00
Shahriar Rouf	a9730ed505	Optimize zippy decompression by making IncrementalCopy faster. When SSSE3 is available: - Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7). - This enables us to do 16 byte copies instead of 8 bytes copies because we know that the pattern size >= 16. - Use shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16) which would allow FDO to layout the code with respect to actual probabilities of each length. - The PSHUFB masks are now generated programmatically at compile-time. When SSSE3 is unavailable: - No change. In both cases: - assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this. 'bin' case is notably >20% faster because it has many repeated character patterns (i.e. pattern_size = 1). PiperOrigin-RevId: 346454471	2020-12-14 02:47:49 +00:00
Snappy Team	56c2c247d0	Internal change PiperOrigin-RevId: 345360683	2020-12-03 22:52:52 +00:00
Shahriar Rouf	a94be58e65	Optimize zippy decompression by making IncrementalCopy faster. When SSSE3 is available: - Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7). - This enables us to do 16 byte copies instead of 8 bytes copies because we know that the pattern size >= 16. - Use shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16) which would allow FDO to layout the code with respect to actual probabilities of each length. - The PSHUFB masks are now generated programmatically at compile-time. When SSSE3 is unavailable: - No change. In both cases: - assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this. 'bin' case is notably >20% faster because it has many repeated character patterns (i.e. pattern_size = 1). PiperOrigin-RevId: 345340892	2020-12-03 22:52:41 +00:00
Snappy Team	01a566f825	Fix opensource version PiperOrigin-RevId: 343272548	2020-11-19 17:06:26 +00:00
Snappy Team	719bed0ae2	Bug fix. Error on 0 offset copies. PiperOrigin-RevId: 342447553	2020-11-18 23:21:47 +00:00
Snappy Team	289c8a3c0a	Make zippy decompression branchless PiperOrigin-RevId: 342423961	2020-11-18 23:21:38 +00:00
Snappy Team	3bfa265a04	Revert zippy optimization that causes heap buffer overflows. PiperOrigin-RevId: 342283314	2020-11-18 23:21:30 +00:00
Shahriar Rouf	4d2dc9dcbb	Optimize zippy unzipping by up to >10% by making IncrementalCopy faster. When SSSE3 is available: - Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7). - This enables us to do 16 byte copies instead of 8 bytes copies because we know that the pattern size >= 16. - Use shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16) which would allow FDO to layout the code with respect to actual probabilities of each length. - The PSHUFB masks are now generated programmatically at compile-time. When SSSE3 is unavailable: - No change. In both cases: - assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this. PiperOrigin-RevId: 342267037	2020-11-18 23:21:21 +00:00
Luca Versari	6835abd953	Change hash function for Compress. ((a*b)>>18) & mask has higher throughput than (a*b)>>shift, and produces the same results when the hash table size is 2**14. In other cases, the hash function is still good, but it's not as necessary for that to be the case as the input is small anyway. This speeds up in encoding, especially in cases where hashing is a significant part of the encoding critical path (small or uncompressible files). PiperOrigin-RevId: 341498741	2020-11-18 23:20:58 +00:00
Snappy Team	1ce58af28e	Fix the use of op + len when op is nullptr and len is non-zero. See https://reviews.llvm.org/D67122 for some discussion of why this can matter. I don't think this should have any noticeable effect on performance. PiperOrigin-RevId: 340255083	2020-11-03 20:30:24 +00:00
Luca Versari	0b990db2b8	Run clang-format PiperOrigin-RevId: 339897712	2020-11-03 20:30:11 +00:00
Snappy Team	4dd277fed4	Replace the division with a constant table in IncrementalCopy PiperOrigin-RevId: 320686580	2020-07-11 01:54:52 +00:00
Snappy Team	f16eda3466	Correct uninitialized variable. PiperOrigin-RevId: 312741918	2020-06-01 23:46:44 +00:00
Victor Costan	c98344f626	Fix Clang/GCC compilation warnings. This makes it easier to adopt snappy in other projects. PiperOrigin-RevId: 309958249	2020-05-05 16:15:02 +00:00
Victor Costan	113cd97ab3	Tighten types on a few for loops. * Replace post-increment with pre-increment in for loops. * Replace unsigned int counters with precise types, like uint8_t. * Switch to C++11 iterating loops when possible. PiperOrigin-RevId: 309724233	2020-05-04 12:32:00 +00:00
Victor Costan	5417da69b7	Switch from C headers to C++ headers. This CL makes the following substitutions. * assert.h -> cassert * math.h -> cmath * stdarg.h -> cstdarg * stdio.h -> cstdio * stdlib.h -> cstdlib * string.h -> cstring stddef.h and stdint.h are not migrated to C++ headers. PiperOrigin-RevId: 309074805	2020-04-29 19:38:03 +00:00
Victor Costan	a4cdb5d133	Introduce SNAPPY_ATTRIBUTE_ALWAYS_INLINE. An internal CL started using ABSL_ATTRIBUTE_ALWAYS_INLINE from Abseil. This CL introduces equivalent functionality as SNAPPY_ALWAYS_INLINE. PiperOrigin-RevId: 306289650	2020-04-13 19:51:05 +00:00
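
The claim in commit 6835abd953, that a fixed shift by 18 followed by a mask gives the same hash as a variable shift when the table has 2**14 entries, can be checked with a small sketch. This is not the snappy source: the function names, the 32-bit product, and the multiplier constant are assumptions made for illustration. The argument is only that a 32-bit value shifted right by 18 already fits in 14 bits, so masking with 2**14 - 1 changes nothing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants for illustration; not taken from the commit log.
constexpr uint32_t kMul = 0x1e35a7bd;               // assumed odd multiplier
constexpr int kTableBits = 14;                      // table size 2**14
constexpr uint32_t kMask = (1u << kTableBits) - 1;  // 0x3fff

// Older form: the shift depends on the table size (shift = 32 - log2(size)).
inline uint32_t HashWithVariableShift(uint32_t bytes, int shift) {
  return (bytes * kMul) >> shift;  // unsigned multiply wraps mod 2**32
}

// Newer form: constant shift by 18, then mask down to the table size.
inline uint32_t HashWithFixedShift(uint32_t bytes, uint32_t mask) {
  return ((bytes * kMul) >> 18) & mask;
}

// True when both forms agree for a table of 2**14 entries.
inline bool HashesAgreeFor14BitTable(uint32_t bytes) {
  const int shift = 32 - kTableBits;  // 18
  return HashWithVariableShift(bytes, shift) ==
         HashWithFixedShift(bytes, kMask);
}
```

For smaller tables the two forms select different bits of the product (the low bits of the 14-bit value rather than the top bits of the product), so the results generally differ, but the masked value still stays inside the table, which matches the commit's remark that the hash remains usable in those cases.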


134 Commits