We also increased the hashtable size by 1 bit as it significantly degraded the ratio. Thus even level 1 might slightly improve.
PiperOrigin-RevId: 621456036
Don't pass -Wno-sign-compare on Windows.
Add a #define HAVE_WINDOWS_H if _WIN32 is defined.
Don't assume sys/uio.h is available on Windows.
PiperOrigin-RevId: 524416809
snappy-internal.h uses std::pair, which is defined in the <utility>
header. Typically, this works because existing C++ standard library
implementations provide <utility> via other transitive includes;
however, these transitive includes are not guaranteed to exist, and
don't exist in certain contexts (e.g. compiling against LLVM's libc++
with Clang modules.)
PiperOrigin-RevId: 517213822
As far as we know, the lack of "cc" in the clobbers hasn't caused
problems yet, but it could. This change is to improve correctness,
and is also almost certainly performance neutral.
PiperOrigin-RevId: 487133620
This change replaces the hashing function used during compression with
one that is roughly as good but faster. This speeds up compression by
two to a few percent on the Intel-, AMD-, and Arm-based machines we
tested. The amount of compression is roughly unchanged.
PiperOrigin-RevId: 485960303
By default MemCpy() / MemMove() always copies 64 bytes in DecompressBranchless(). Profiling shows that the vast majority of the time we need to copy many fewer bytes (typically <= 16 bytes). It is safe to copy fewer bytes as long as we exceed len.
This change improves throughput by ~12% on ARM, ~35% on AMD Milan, and ~7% on Intel Cascade Lake.
PiperOrigin-RevId: 453917840
Not everything defining __GNUC__ supports flag outputs
from asm statements; in particular, some Clang versions
on macOS does not. The correct test per the GCC documentation
is __GCC_ASM_FLAG_OUTPUTS__, so use that instead.
PiperOrigin-RevId: 423749308
The existing code uses a series of 8bit loads with shifts and ors to
emulate an (unaligned) load of a larger type. These are then expected to
become single loads in the compiler, producing optimal assembly. Whilst
this is true it happens very late in the compiler, meaning that
throughout most of the pipeline it is treated (and cost-modelled) as
multiple loads, shifts and ors. This can make the compiler make poor
decisions (such as not unrolling loops that should be), or to break up
the pattern before it is turned into a single load.
For example the loops in CompressFragment do not get unrolled as
expected due to a higher cost than the unroll threshold in clang.
Instead this patch uses a more conventional methods of loading unaligned
data, using a memcpy directly which the compiler will be able to deal
with much more straight forwardly, modelling it as a single unaligned
load. The old code is left as-is for big-endian systems.
This helps improve the performance of the BM_ZFlat benchmarks by up to
10-15% on an Arm Neoverse N1.
Change-Id: I986f845ebd0a0806d052d2be3e4dbcbee91713d7
As `i + offset` is promoted to a "negative" size_t,
UBSan would complain when adding the resulting offset to `dst`:
```
/tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43: runtime error: addition of unsigned offset to 0x6120003c5ec1 overflowed to 0x6120003c5ec0
#0 0x7f9ebd21769c in snappy::(anonymous namespace)::Copy64BytesWithPatternExtension(char*, unsigned long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43
#1 0x7f9ebd21769c in std::__1::pair<unsigned char const*, long> snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*, long, char*, long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:1160:15
```
The final ip advance value doesn't have to wait for
the result of offset to load *tag. It can be computed
along with the offset, so the codegen will use one
csinc in parallel with ldrb. This will improve the
throughput.
With this change it is observed ~4.2% uplift in UFlat/10
and ~3.7% in UFlatMedley
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I20ab211235bbf578c6c978f2bbd9160a49e920da