By default MemCpy() / MemMove() always copies 64 bytes in DecompressBranchless(). Profiling shows that the vast majority of the time we need to copy many fewer bytes (typically <= 16 bytes). It is safe to copy fewer bytes as long as we exceed len.
This change improves throughput by ~12% on ARM, ~35% on AMD Milan, and ~7% on Intel Cascade Lake.
PiperOrigin-RevId: 453917840
Not everything defining __GNUC__ supports flag outputs
from asm statements; in particular, some Clang versions
on macOS does not. The correct test per the GCC documentation
is __GCC_ASM_FLAG_OUTPUTS__, so use that instead.
PiperOrigin-RevId: 423749308
As `i + offset` is promoted to a "negative" size_t,
UBSan would complain when adding the resulting offset to `dst`:
```
/tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43: runtime error: addition of unsigned offset to 0x6120003c5ec1 overflowed to 0x6120003c5ec0
#0 0x7f9ebd21769c in snappy::(anonymous namespace)::Copy64BytesWithPatternExtension(char*, unsigned long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43
#1 0x7f9ebd21769c in std::__1::pair<unsigned char const*, long> snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*, long, char*, long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:1160:15
```
The final ip advance value doesn't have to wait for
the result of offset to load *tag. It can be computed
along with the offset, so the codegen will use one
csinc in parallel with ldrb. This will improve the
throughput.
With this change it is observed ~4.2% uplift in UFlat/10
and ~3.7% in UFlatMedley
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I20ab211235bbf578c6c978f2bbd9160a49e920da
The #if predicate evaluates to false if the macro is undefined, or
defined to 0. #ifdef (and its synonym #if defined) evaluates to false
only if the macro is undefined.
The new setup allows differentiating between setting a macro to 0 (to
express that the capability definitely does not exist / should not be
used) and leaving a macro undefined (to express not knowing whether a
capability exists / not caring if a capability is used).
PiperOrigin-RevId: 391094241
After SHUFFLE code blocks are refactored, "tmmintrin.h"
is missed, and bmi2 code part will have build failure
as type conflicts.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I7800cd7e050f4d349e5a227206b14b9c566e547f
Clang doesn't realize the load with free zero-extension,
and emits another extra 'and xn, xm, 0xff' to calc offset.
With this change ,this extra op is removed, and consistent
1.7% performance uplift is observed.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ica4617852c4b93eadc6c5c551dc3961ffbadb8f0
Inspired by kExtractMasksCombined, this patch uses shift
to replace table lookup. On Arm the codegen is 2 shift ops
(lsl+lsr). Comparing to previous ldr which requires 4 cycles
latency, the lsl+lsr only need 2 cycles.
Slight (~0.3%) uplift observed on N1, and ~3% on A72.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b
The SSSE3 intrinsics we use have their direct analogues in NEON, so making this optimization portable requires a very thin translation layer.
PiperOrigin-RevId: 381280165
This lets us remove main() from snappy_bench.cc and snappy_unittest.cc,
which simplifies integrating these tests and benchmarks with other
suites.
PiperOrigin-RevId: 347857427
LibFuzzer does not ship with the Mac OSX Command Line Tools.
```
ld: file not found: /Applications/Xcode-12.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0/lib/darwin/libclang_rt.fuzzer_osx.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
gcc was unable to inline a function call, which caused a build
failure due to `-Wall -Werror`.
The build error was:
```
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
292 | static inline std::pair<__m128i /* pattern */, __m128i /* reshuffle_mask */>
| ^
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
cc1plus: all warnings being treated as errors
```