Not everything defining __GNUC__ supports flag outputs
from asm statements; in particular, some Clang versions
on macOS do not. The correct test per the GCC documentation
is __GCC_ASM_FLAG_OUTPUTS__, so use that instead.
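A minimal sketch of the intended check, assuming a hypothetical
SNAPPY_HAVE_ASM_FLAG_OUTPUTS macro (the project's actual configuration
symbol may differ):
```
// Hypothetical macro; illustrative of the pattern only.
#if defined(__GCC_ASM_FLAG_OUTPUTS__)
#define SNAPPY_HAVE_ASM_FLAG_OUTPUTS 1  // e.g. asm("..." : "=@ccz"(zero) : ...)
#else
#define SNAPPY_HAVE_ASM_FLAG_OUTPUTS 0
#endif
```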
PiperOrigin-RevId: 423749308
The existing code uses a series of 8-bit loads with shifts and ors to
emulate an (unaligned) load of a larger type. These are then expected to
become single loads in the compiler, producing optimal assembly. Whilst
this is true, it happens very late in the compiler, meaning that
throughout most of the pipeline it is treated (and cost-modelled) as
multiple loads, shifts and ors. This can cause the compiler to make poor
decisions (such as not unrolling loops that should be unrolled), or to
break up the pattern before it is turned into a single load.
For example the loops in CompressFragment do not get unrolled as
expected due to a higher cost than the unroll threshold in clang.
Instead this patch uses a more conventional method of loading unaligned
data: a direct memcpy, which the compiler can handle much more
straightforwardly, modelling it as a single unaligned load. The old code
is left as-is for big-endian systems.
This helps improve the performance of the BM_ZFlat benchmarks by up to
10-15% on an Arm Neoverse N1.
Change-Id: I986f845ebd0a0806d052d2be3e4dbcbee91713d7
As `i + offset` is promoted to a "negative" size_t,
UBSan would complain when adding the resulting offset to `dst`:
```
/tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43: runtime error: addition of unsigned offset to 0x6120003c5ec1 overflowed to 0x6120003c5ec0
#0 0x7f9ebd21769c in snappy::(anonymous namespace)::Copy64BytesWithPatternExtension(char*, unsigned long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43
#1 0x7f9ebd21769c in std::__1::pair<unsigned char const*, long> snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*, long, char*, long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:1160:15
```
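An illustrative reduction of the flagged pattern, together with a
signed-arithmetic variant that avoids the wraparound (not necessarily the
exact fix applied here):
```
#include <cstddef>

// Illustrative only; not snappy's code.
char* Advance(char* dst, std::size_t i, std::ptrdiff_t offset) {
  // If offset is negative, i + offset wraps to a huge size_t, and adding it
  // to dst trips UBSan's unsigned pointer-overflow check.
  return dst + (i + offset);
  // Keeping the arithmetic signed avoids the wraparound:
  //   return dst + (static_cast<std::ptrdiff_t>(i) + offset);
}
```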
The final ip advance value doesn't have to wait for the load of *tag
that produces offset. It can be computed alongside the offset, so the
codegen will issue one csinc in parallel with the ldrb. This improves
the throughput.
With this change a ~4.2% uplift is observed in UFlat/10
and ~3.7% in UFlatMedley.
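A heavily simplified sketch of the dependency-breaking, with hypothetical
helper names standing in for the decompressor's internals:
```
#include <cstddef>
#include <cstdint>

// Hypothetical helper; the kind of select the compiler lowers to csinc.
inline std::size_t AdvanceFromTagType(std::uint32_t tag_type) {
  return tag_type ? 2 : 1;
}

const std::uint8_t* Step(const std::uint8_t* ip, std::uint32_t tag_type,
                         std::uint32_t shift, std::size_t* offset) {
  // The advance depends only on values already in registers, so it can be
  // computed (csinc) in parallel with the ldrb that loads *ip below.
  std::size_t advance = AdvanceFromTagType(tag_type);
  *offset = static_cast<std::size_t>(*ip) << shift;  // ldrb + lsl
  return ip + advance;  // no longer waits on the load result
}
```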
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I20ab211235bbf578c6c978f2bbd9160a49e920da
The #if predicate evaluates to false if the macro is undefined, or
defined to 0. #ifdef (and its synonym #if defined) evaluates to false
only if the macro is undefined.
The new setup allows differentiating between setting a macro to 0 (to
express that the capability definitely does not exist / should not be
used) and leaving a macro undefined (to express not knowing whether a
capability exists / not caring if a capability is used).
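A small illustration of the difference, using a hypothetical SNAPPY_HAVE_X
capability macro:
```
// Hypothetical macro; illustrative only.

// With #if, "undefined" and "defined to 0" are indistinguishable:
#if SNAPPY_HAVE_X
// reached only when SNAPPY_HAVE_X is defined to a nonzero value
#endif

// With defined(), an explicit 0 still counts as defined, so the two states
// can be told apart:
#if !defined(SNAPPY_HAVE_X)
// capability unknown: probe for it
#elif SNAPPY_HAVE_X
// capability explicitly enabled
#else
// capability explicitly disabled (defined to 0)
#endif
```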
PiperOrigin-RevId: 391094241
After the SHUFFLE code blocks were refactored, "tmmintrin.h" is no
longer included, and the BMI2 code fails to build due to type conflicts.
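A minimal sketch of the guarded include; the guard shown is the conventional
compiler macro, while snappy's own configuration macro may differ:
```
#if defined(__SSSE3__)
#include <tmmintrin.h>  // SSSE3 intrinsics such as _mm_shuffle_epi8
#endif
```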
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I7800cd7e050f4d349e5a227206b14b9c566e547f
Clang doesn't realize the load already comes with free zero-extension,
and emits an extra 'and xn, xm, 0xff' to calculate the offset.
With this change, the extra op is removed, and a consistent
1.7% performance uplift is observed.
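An illustrative before/after of the kind of change involved (the names are
not snappy's):
```
#include <cstdint>

// Masking a wider value after the fact leaves an explicit
// "and xn, xm, #0xff" in the generated code:
inline std::uint64_t OffsetFromWord(std::uint64_t word) {
  return word & 0xff;
}

// Loading through an unsigned 8-bit type lets ldrb's implicit zero-extension
// do the work, so no extra "and" is needed:
inline std::uint64_t OffsetFromByte(const std::uint8_t* p) {
  return *p;
}
```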
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ica4617852c4b93eadc6c5c551dc3961ffbadb8f0
Inspired by kExtractMasksCombined, this patch uses shifts to replace
the table lookup. On Arm the codegen is 2 shift ops (lsl+lsr). Compared
to the previous ldr, which requires 4 cycles of latency, the lsl+lsr
only need 2 cycles.
A slight (~0.3%) uplift is observed on N1, and ~3% on A72.
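An illustrative sketch of the substitution (the mask computation shown is
generic, not snappy's exact expression):
```
#include <cstdint>

// Extract the low n bits (n assumed to be in [1, 63]) with two shifts instead
// of and-ing with a mask loaded from a table: the ldr costs ~4 cycles of
// load-to-use latency on N1, while lsl+lsr complete in ~2 cycles.
inline std::uint64_t LowBitsViaShift(std::uint64_t v, unsigned n) {
  return (v << (64u - n)) >> (64u - n);
}
```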
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b
The SSSE3 intrinsics we use have their direct analogues in NEON, so making this optimization portable requires a very thin translation layer.
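A sketch of what such a layer can look like for the byte-shuffle case, using
illustrative names rather than snappy's actual wrappers:
```
#if defined(__SSSE3__)
#include <tmmintrin.h>
using V128 = __m128i;
inline V128 V128_Shuffle(V128 input, V128 shuffle_mask) {
  return _mm_shuffle_epi8(input, shuffle_mask);
}
#elif defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
using V128 = uint8x16_t;
inline V128 V128_Shuffle(V128 input, V128 shuffle_mask) {
  // NEON analogue: _mm_shuffle_epi8 zeroes a lane when the mask byte's high
  // bit is set, vqtbl1q_u8 when the index is out of range; the two coincide
  // for the usual 0x80 sentinel.
  return vqtbl1q_u8(input, shuffle_mask);
}
#endif
```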
PiperOrigin-RevId: 381280165
This lets us remove main() from snappy_bench.cc and snappy_unittest.cc,
which simplifies integrating these tests and benchmarks with other
suites.
PiperOrigin-RevId: 347857427
LibFuzzer does not ship with the macOS Command Line Tools.
```
ld: file not found: /Applications/Xcode-12.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0/lib/darwin/libclang_rt.fuzzer_osx.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
gcc was unable to inline a function call, which caused a build
failure due to `-Wall -Werror`.
The build error was:
```
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
292 | static inline std::pair<__m128i /* pattern */, __m128i /* reshuffle_mask */>
| ^
../snappy.cc:292:76: error: ignoring attributes on template argument ‘__m128i’ [-Werror=ignored-attributes]
cc1plus: all warnings being treated as errors
```
Snappy includes a testing framework, which implements a subset of the
Google Test API, and can be used when Google Test is not available.
Snappy also includes a micro-benchmark framework, which implements an
old version of the Google Benchmark API.
This CL replaces the custom test and micro-benchmark frameworks with
google/googletest and google/benchmark. The code is vendored in
third_party/ via git submodules. The setup is similar to google/crc32c
and google/leveldb.
This CL also updates the benchmarking code to the modern Google
Benchmark API.
Benchmark results are expected to be more precise, as the old framework
ran each benchmark with a fixed number of iterations, whereas Google
Benchmark keeps iterating until the noise is low.
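For reference, a minimal example of the modern API in question (the
benchmark shown is illustrative, not one of snappy's):
```
#include "benchmark/benchmark.h"

static void BM_Example(benchmark::State& state) {
  for (auto _ : state) {           // Google Benchmark picks the iteration count.
    benchmark::DoNotOptimize(42);  // measured work goes here
  }
}
BENCHMARK(BM_Example);
BENCHMARK_MAIN();
```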
PiperOrigin-RevId: 347456142