The #if predicate evaluates to false if the macro is undefined, or
defined to 0. #ifdef (and its synonym #if defined) evaluates to false
only if the macro is undefined.
The new setup allows differentiating between setting a macro to 0 (to
express that the capability definitely does not exist / should not be
used) and leaving a macro undefined (to express not knowing whether a
capability exists / not caring if a capability is used).
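A minimal sketch of the three states this makes expressible. The macro names DEMO_HAVE_FOO, DEMO_HAVE_BAR and DEMO_HAVE_BAZ are hypothetical and are not macros used by this project.

```cpp
// Hypothetical feature macros, for illustration only.
#define DEMO_HAVE_FOO 1  // capability explicitly enabled
#define DEMO_HAVE_BAR 0  // capability explicitly disabled
// DEMO_HAVE_BAZ is intentionally left undefined ("unknown / don't care").

enum class Setting { kEnabled, kDisabled, kUnknown };

// defined() separates "undefined" from "defined to 0"; plain #if cannot.
constexpr Setting FooSetting() {
#if !defined(DEMO_HAVE_FOO)
  return Setting::kUnknown;
#elif DEMO_HAVE_FOO
  return Setting::kEnabled;
#else
  return Setting::kDisabled;
#endif
}

constexpr Setting BarSetting() {
#if !defined(DEMO_HAVE_BAR)
  return Setting::kUnknown;
#elif DEMO_HAVE_BAR
  return Setting::kEnabled;
#else
  return Setting::kDisabled;
#endif
}

constexpr Setting BazSetting() {
#if !defined(DEMO_HAVE_BAZ)
  return Setting::kUnknown;
#elif DEMO_HAVE_BAZ
  return Setting::kEnabled;
#else
  return Setting::kDisabled;
#endif
}

// #ifdef is true for DEMO_HAVE_BAR even though its value is 0.
#ifdef DEMO_HAVE_BAR
constexpr bool kBarPassesIfdef = true;
#else
constexpr bool kBarPassesIfdef = false;
#endif
```

With these definitions, FooSetting() is kEnabled, BarSetting() is kDisabled, and BazSetting() is kUnknown.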
PiperOrigin-RevId: 391094241
After the SHUFFLE code blocks were refactored, the "tmmintrin.h"
include went missing, so the BMI2 code path fails to build
because of type conflicts.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I7800cd7e050f4d349e5a227206b14b9c566e547f
Clang does not recognize that the load zero-extends for free,
and emits an extra 'and xn, xm, 0xff' to compute the offset.
This change removes the extra op, and a consistent
1.7% performance uplift is observed.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ica4617852c4b93eadc6c5c551dc3961ffbadb8f0
Inspired by kExtractMasksCombined, this patch replaces a table
lookup with shifts. On Arm the codegen is 2 shift ops
(lsl+lsr). Compared to the previous ldr, which has a 4-cycle
latency, the lsl+lsr pair needs only 2 cycles.
A slight (~0.3%) uplift is observed on N1, and ~3% on A72.
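The general idea can be sketched as follows. This is an illustration of packing a small table into a constant and indexing it with shifts, not the code from this patch. The table contents and the names kDemoTable, kDemoPacked, LookupViaTable and LookupViaShift are assumptions.

```cpp
#include <cstdint>

// Hypothetical 4-entry table of small values, for illustration only.
constexpr uint8_t kDemoTable[4] = {1, 2, 3, 5};

// The same four entries packed into one 32-bit constant,
// with entry 0 in the lowest byte.
constexpr uint32_t kDemoPacked = 0x05030201u;

// Table version: needs a memory load.
inline uint32_t LookupViaTable(uint32_t index) {
  return kDemoTable[index & 3];
}

// Shift version: selects the byte with register-only operations.
inline uint32_t LookupViaShift(uint32_t index) {
  return (kDemoPacked >> (8 * (index & 3))) & 0xFF;
}
```

Both functions return the same value for every index; the shift version trades the load for arithmetic on a constant.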
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b
The SSSE3 intrinsics we use have their direct analogues in NEON, so making this optimization portable requires a very thin translation layer.
PiperOrigin-RevId: 381280165
This lets us remove main() from snappy_bench.cc and snappy_unittest.cc,
which simplifies integrating these tests and benchmarks with other
suites.
PiperOrigin-RevId: 347857427
LibFuzzer does not ship with the Mac OSX Command Line Tools.
```
ld: file not found: /Applications/Xcode-12.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0/lib/darwin/libclang_rt.fuzzer_osx.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
gcc was unable to inline a function call, which caused a build
failure due to `-Wall -Werror`.
The build error was:
```
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
292 | static inline std::pair<__m128i /* pattern */, __m128i /* reshuffle_mask */>
| ^
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
cc1plus: all warnings being treated as errors
```
Snappy includes a testing framework, which implements a subset of the
Google Test API, and can be used when Google Test is not available.
Snappy also includes a micro-benchmark framework, which implements an
old version of the Google Benchmark API.
This CL replaces the custom test and micro-benchmark frameworks with
google/googletest and google/benchmark. The code is vendored in
third_party/ via git submodules. The setup is similar to google/crc32c
and google/leveldb.
This CL also updates the benchmarking code to the modern Google
Benchmark API.
Benchmark results are expected to be more precise, as the old framework
ran each benchmark with a fixed number of iterations, whereas Google
Benchmark keeps iterating until the noise is low.
PiperOrigin-RevId: 347456142
This feature requires C++17. Fortunately, inline is useful for header declarations, which may be included in multiple compilation units. The declarations modified by this CL occur in a single compilation unit.
PiperOrigin-RevId: 347338760
2) Replace offset extraction with a lookup mask. This takes fewer uops, and it is needed because type 3 must be special-cased to always return 0 in order to properly trigger the fallback.
3) Unroll the loop twice. This removes some loop-condition checks and improves the generated assembly: the loop variables tend to end up in different registers, which requires movs, and having two consecutive copies lets the movs be elided.
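A rough sketch of the lookup-mask idea from item 2. The names kDemoOffsetMasks and DemoExtractOffset are hypothetical. The mask widths for types 1 and 2, and the zero mask for type 0, are assumptions; only the "type 3 returns 0" behavior comes from the description above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mask table indexed by the 2-bit tag type.
// Assumption: type 1 keeps one trailing byte, type 2 keeps two,
// and types 0 and 3 always yield 0.
constexpr uint32_t kDemoOffsetMasks[4] = {0x0, 0xFF, 0xFFFF, 0x0};

// `trailing_bytes` holds the bytes that follow the tag, loaded
// little-endian. One AND replaces a per-type branch.
inline uint32_t DemoExtractOffset(uint32_t trailing_bytes, size_t tag_type) {
  return trailing_bytes & kDemoOffsetMasks[tag_type & 3];
}
```

Because type 3 maps to a zero mask, a caller can treat an offset of 0 as the signal to take the slow path.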
PiperOrigin-RevId: 346663328
When SSSE3 is available:
- Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7).
- This enables us to do 16-byte copies instead of 8-byte copies because we know that the pattern size >= 16.
- Use a shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16), which allows FDO to lay out the code according to the actual probabilities of each length.
- The PSHUFB masks are now generated programmatically at compile-time.
When SSSE3 is unavailable:
- No change.
In both cases:
- assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this.
'bin' case is notably >20% faster because it has many repeated character patterns (i.e. pattern_size = 1).
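A portable sketch of how such masks could be generated at compile time, assuming C++14 constexpr. DemoMask, MakeDemoPatternMask and DemoShuffle are hypothetical names, and the scalar loop only stands in for the byte shuffle; it is not the intrinsic itself.

```cpp
#include <cstddef>
#include <cstdint>

struct DemoMask {
  uint8_t bytes[16];
};

// Builds, at compile time, a mask that repeats the first `pattern_size`
// bytes across a 16-byte block: output byte i reads input byte
// i % pattern_size. Intended for 1 <= pattern_size <= 15.
constexpr DemoMask MakeDemoPatternMask(size_t pattern_size) {
  DemoMask mask{};
  for (size_t i = 0; i < 16; ++i) {
    mask.bytes[i] = static_cast<uint8_t>(i % pattern_size);
  }
  return mask;
}

// Scalar stand-in for a byte shuffle: out[i] = in[mask[i]].
// The zeroing behavior PSHUFB has for indices with the high bit set
// is not modelled, because these masks never set it.
inline void DemoShuffle(const uint8_t in[16], const DemoMask& mask,
                        uint8_t out[16]) {
  for (size_t i = 0; i < 16; ++i) {
    out[i] = in[mask.bytes[i] & 0x0F];
  }
}

// The mask is a compile-time constant.
static_assert(MakeDemoPatternMask(3).bytes[4] == 1, "4 % 3 == 1");
```

For pattern_size = 1 every mask byte is 0, so a single shuffle broadcasts one byte across the whole block, which matches the repeated-character case mentioned above.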
PiperOrigin-RevId: 346454471
When SSSE3 is available:
- Use PSHUFB (_mm_shuffle_epi8) to handle pattern size 1 to 15 (previously it handled size 1 to 7).
- This enables us to do 16-byte copies instead of 8-byte copies because we know that the pattern size >= 16.
- Use a shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to when pattern size >= 16), which allows FDO to lay out the code according to the actual probabilities of each length.
- The PSHUFB masks are now generated programmatically at compile-time.
When SSSE3 is unavailable:
- No change.
In both cases:
- assert(op < op_limit) in IncrementalCopy so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this.
'bin' case is notably >20% faster because it has many repeated character patterns (i.e. pattern_size = 1).
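One way the reshuffle step could work, sketched in portable scalar code. The derivation of the reshuffle mask as (i + 16) % pattern_size is an assumption made for illustration, and DemoReshuffleMask, MakeDemoReshuffleMask and DemoFillWithPattern are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct DemoReshuffleMask {
  uint8_t bytes[16];
};

// If a 16-byte block holds the pattern at positions k..k+15, then the
// block for positions k+16..k+31 is obtained by reading byte
// (i + 16) % pattern_size of the current block into output byte i.
// Intended for 1 <= pattern_size <= 15.
constexpr DemoReshuffleMask MakeDemoReshuffleMask(size_t pattern_size) {
  DemoReshuffleMask mask{};
  for (size_t i = 0; i < 16; ++i) {
    mask.bytes[i] = static_cast<uint8_t>((i + 16) % pattern_size);
  }
  return mask;
}

// Writes `blocks` consecutive 16-byte blocks of the repeating pattern
// into `out`, deriving each block from the previous one by a reshuffle.
inline void DemoFillWithPattern(const uint8_t* pattern, size_t pattern_size,
                                uint8_t* out, size_t blocks) {
  uint8_t current[16];
  for (size_t i = 0; i < 16; ++i) {
    current[i] = pattern[i % pattern_size];  // initial block
  }
  const DemoReshuffleMask mask = MakeDemoReshuffleMask(pattern_size);
  for (size_t b = 0; b < blocks; ++b) {
    uint8_t next[16];
    for (size_t i = 0; i < 16; ++i) {
      out[16 * b + i] = current[i];        // 16-byte store
      next[i] = current[mask.bytes[i]];    // scalar stand-in for the shuffle
    }
    for (size_t i = 0; i < 16; ++i) {
      current[i] = next[i];
    }
  }
}
```

Each iteration performs one full-width store and one shuffle, with no per-byte dependence on the pattern size inside the loop.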
PiperOrigin-RevId: 345340892