The existing code uses a series of 8-bit loads combined with shifts and
ors to emulate an (unaligned) load of a larger type. These are expected
to become single loads in the compiler, producing optimal assembly.
Whilst this is true, it happens very late in the compiler, meaning that
throughout most of the pipeline the pattern is treated (and
cost-modelled) as multiple loads, shifts and ors. This can lead the
compiler to make poor decisions (such as not unrolling loops that
should be), or to break up the pattern before it is turned into a
single load.
For example, the loops in CompressFragment do not get unrolled as
expected because their modelled cost exceeds clang's unroll threshold.
Instead, this patch uses a more conventional method of loading
unaligned data: a direct memcpy, which the compiler can handle much
more straightforwardly, modelling it as a single unaligned load. The
old code is left as-is for big-endian systems.
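As a minimal sketch (the helper name is illustrative, not the exact
code in this patch), the conventional form looks like this:

  #include <cstdint>
  #include <cstring>

  // Sketch only: a memcpy-based unaligned load that the compiler models
  // as a single unaligned load throughout the optimization pipeline.
  inline std::uint64_t UnalignedLoad64(const void* p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
  }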
This helps improve the performance of the BM_ZFlat benchmarks by up to
10-15% on an Arm Neoverse N1.
Change-Id: I986f845ebd0a0806d052d2be3e4dbcbee91713d7
The #if predicate evaluates to false if the macro is undefined, or
defined to 0. #ifdef (and its synonym #if defined) evaluates to false
only if the macro is undefined.
The new setup allows differentiating between setting a macro to 0 (to
express that the capability definitely does not exist / should not be
used) and leaving a macro undefined (to express not knowing whether a
capability exists / not caring if a capability is used).
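For illustration (HAVE_SOME_FEATURE is a hypothetical macro name, not
one used by this project):

  #include <cstdio>

  // Uncomment exactly one of these, or leave both commented out:
  // #define HAVE_SOME_FEATURE 0  // capability definitely absent / unwanted
  // #define HAVE_SOME_FEATURE 1  // capability definitely present

  int main() {
  #if HAVE_SOME_FEATURE          // false when undefined *or* defined to 0
    std::printf("use the capability\n");
  #else
    std::printf("capability off or unknown\n");
  #endif

  #ifdef HAVE_SOME_FEATURE       // false only when undefined
    std::printf("capability was configured explicitly\n");
  #endif
    return 0;
  }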
PiperOrigin-RevId: 391094241
Bits::FindLSBSetNonZero64() is now available unconditionally, and it
should be easier to reason about the code included in each build
configuration.
This reduces the amount of conditional compilation going on, which
makes it easier to reason about which stubs are used in each build
configuration.
The guards were added to work around the fact that MSVC has a
_BitScanForward64() intrinsic, but the intrinsic is only implemented on
64-bit targets, and causes a compilation error on 32-bit targets.
By contrast, Clang and GCC support __builtin_ctzll() everywhere, and
implement it with varying efficiency.
This CL reworks the conditional compilation directives so that
Bits::FindLSBSetNonZero64() uses the _BitScanForward64() intrinsic on
MSVC when available, and the portable implementation otherwise.
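A minimal sketch of the resulting shape (the guards and the portable
fallback here are simplified and illustrative, not the exact code):

  #include <cstdint>

  #if defined(_MSC_VER) && defined(_M_X64)
  #include <intrin.h>
  inline int FindLSBSetNonZero64(std::uint64_t n) {
    unsigned long index;
    _BitScanForward64(&index, n);      // MSVC intrinsic, 64-bit targets only
    return static_cast<int>(index);
  }
  #elif defined(__GNUC__) || defined(__clang__)
  inline int FindLSBSetNonZero64(std::uint64_t n) {
    return __builtin_ctzll(n);         // available everywhere on GCC/Clang
  }
  #else
  inline int FindLSBSetNonZero64(std::uint64_t n) {
    int rc = 0;                        // portable fallback; n must be non-zero
    while ((n & 1) == 0) {
      n >>= 1;
      ++rc;
    }
    return rc;
  }
  #endif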
PiperOrigin-RevId: 310007748
This CL makes the following substitutions.
* assert.h -> cassert
* math.h -> cmath
* stdarg.h -> cstdarg
* stdio.h -> cstdio
* stdlib.h -> cstdlib
* string.h -> cstring
stddef.h and stdint.h are not migrated to C++ headers.
PiperOrigin-RevId: 309074805
Snappy issues multi-byte (16/32/64-bit) loads and stores that are not
aligned, meaning the addresses are not necessarily multiples of the
access size (2/4/8 bytes). This is accomplished using two methods:
1) The portable method allocates a uint{16,32,64}_t on the stack, and
std::memcpy()s the bytes into/from the integer. This method relies on
well-defined behavior (std::memcpy() works on all valid pointers, and
fixed-width unsigned integer types use a pure binary representation and
therefore have no invalid values), and should compile to valid code on
all platforms.
2) The fast method reinterpret_casts the address to a pointer to a
uint{16,32,64}_t and dereferences the pointer. This is expected to
compile to one hardware instruction (mov on x86, ldr/str on arm). The
caveat is that the reinterpret_cast is undefined behavior (UB) unless the
address happened to be a valid uint{16,32,64}_t pointer. The UB shows up
as follows.
* On architectures that don't have hardware instructions for unaligned
loads / stores, the pointer access can trigger a hardware exception.
This is mitigated by #ifdef blocks that attempt to restrict the fast
method to platforms that support it.
* On architectures that have separate instructions for aligned and
unaligned access, the compiler may need an explicit hint to emit the
hardware instruction for unaligned access. This is accomplished on
Clang and GCC by wrapping the pointers into structs tagged with
__attribute__((__packed__)).
This CL removes the fast method. Fortunately, compilers have advanced
enough that the portable method gets compiled down to the same
instructions as the fast method, without the need for the caveats
explained above. Specifically, modern Clang, GCC and MSVC optimize
std::memcpy() to a single instruction (mov / ldr / str). A test case
proving this can be seen at https://godbolt.org/z/gZg2Fk
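A sketch of the two shapes discussed above (names and the packed-struct
wrapper are illustrative; the removed path is shown in Clang/GCC syntax):

  #include <cstdint>
  #include <cstring>

  // Portable method (kept): copy the bytes into a stack integer.
  inline std::uint32_t LoadPortable32(const void* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
  }

  // Fast method (removed): reinterpret_cast and dereference. This is UB
  // unless p really points to a suitably aligned uint32_t; on Clang/GCC a
  // packed struct was used as the hint to emit an unaligned access.
  struct Unaligned32 { std::uint32_t value; } __attribute__((__packed__));
  inline std::uint32_t LoadFast32(const void* p) {
    return reinterpret_cast<const Unaligned32*>(p)->value;
  }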
PiperOrigin-RevId: 306342728
The platform-independent code that breaks down the loads and stores into
byte-level operations is optimized into single instructions (mov or
ldr/str) and instruction pairs (mov+bswap or ldr/str+rev) by recent
versions of Clang and GCC. Tested at https://godbolt.org/z/2BQP-o
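For example, a byte-level little-endian load of roughly this shape (a
sketch, not the exact stubs code) folds into a single mov / ldr, plus
bswap / rev where the target is big-endian:

  #include <cstdint>

  inline std::uint32_t LoadLE32ByteWise(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
  }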
PiperOrigin-RevId: 306321608
An internal CL started using ABSL_ATTRIBUTE_ALWAYS_INLINE
from Abseil. This CL introduces equivalent functionality as
SNAPPY_ALWAYS_INLINE.
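A plausible shape for the macro, as a sketch (the actual guards and
definition may differ):

  #if defined(__GNUC__) || defined(__clang__)
  #define SNAPPY_ALWAYS_INLINE inline __attribute__((always_inline))
  #elif defined(_MSC_VER)
  #define SNAPPY_ALWAYS_INLINE __forceinline
  #else
  #define SNAPPY_ALWAYS_INLINE inline
  #endif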
PiperOrigin-RevId: 306289650
The following changes are done via find/replace.
* int8 -> int8_t
* int16 -> int16_t
* int32 -> int32_t
* int64 -> int64_t
The aliases were removed from snappy-stubs-public.h.
PiperOrigin-RevId: 306141557
This CL replaces memcpy() with std::memcpy()
and memmove() with std::memmove(), and #includes
<cstring> in files that use either function.
PiperOrigin-RevId: 306067788
Two ideas
1) The "heuristic match skipping" code uses a quadratic interpolation. However, for the first 32 bytes it effectively tries every byte. Special-casing the first 16 bytes removes a lot of code.
2) Load 64-bit integers and shift instead of reloading. The hashing loop has a very long dependency chain: data = Load32(ip) -> hash = Hash(data) -> offset = table[hash] -> copy_data = Load32(base_ip + offset), followed by a compare between data and copy_data. This chain is around 20 cycles. It is unreasonable to expect the branch predictor to predict when there is a match (that is driven entirely by the content of the data), so on a miss this chain is on the critical path. By loading 64 bits and shifting we can effectively remove the first load (see the sketch below).
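A minimal sketch of the load-once-and-shift idea (names are
illustrative; assumes a little-endian target):

  #include <cstdint>
  #include <cstring>

  inline std::uint64_t Load64(const char* p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
  }

  void HashThreePositions(const char* ip) {
    std::uint64_t data = Load64(ip);                   // covers ip[0..7]
    std::uint32_t d0 = static_cast<std::uint32_t>(data);        // ip[0..3]
    std::uint32_t d1 = static_cast<std::uint32_t>(data >> 8);   // ip[1..4]
    std::uint32_t d2 = static_cast<std::uint32_t>(data >> 16);  // ip[2..5]
    // d0/d1/d2 can feed Hash() without a fresh Load32 at each position.
    (void)d0; (void)d1; (void)d2;
  }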
PiperOrigin-RevId: 302893821
Copybara transforms code slightly differently than MOE. One example is
the TODO username stripping, where Copybara produces different results
than MOE did. This change
moves the Copybara versions of comments to the public
repository.
Note: These changes didn't originate in cl/247950252.
PiperOrigin-RevId: 247950252
A previous CL introduced __builtin_clz in zippy.cc. This is a GCC / Clang
intrinsic, and is not supported in Visual Studio. The rest of the
project uses bit manipulation intrinsics via the functions in Bits::,
which are stubbed out for the open source build in
zippy-stubs-internal.h.
This CL extracts Bits::Log2FloorNonZero() out of Bits::Log2Floor() in
the stubbed version of Bits, adds assertions to the Bits::*NonZero()
functions in the stubs, and converts __builtin_clz to a
Bits::Log2FloorNonZero() call.
The latter part is not obvious. A mathematical proof of correctness is
outlined in added comments. An empirical proof is available at
https://godbolt.org/z/mPKWmh -- CalculateTableSizeOld(), which is the
current code, compiles to the same assembly on Clang as
CalculateTableSizeNew1(), which is the bigger jump in the proof.
CalculateTableSizeNew2() is a fairly obvious transformation from
CalculateTableSizeNew1(), and results in slightly better assembly on all
supported compilers.
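The underlying identity, as a minimal sketch (32-bit argument shown,
using the GCC/Clang builtin):

  #include <cassert>
  #include <cstdint>

  // For n != 0, floor(log2(n)) equals 31 minus the number of leading zeros.
  inline int Log2FloorNonZero(std::uint32_t n) {
    assert(n != 0);
    return 31 - __builtin_clz(n);
  }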
Two benchmark runs with the same arguments as the original CL only
showed differences in completely disjoint tests, suggesting that the
differences are pure noise.
getpagesize(), as well as its POSIX.1-2001 replacement
sysconf(_SC_PAGESIZE), is defined in <unistd.h>. On Linux and OS X,
including <sys/mman.h> is sufficient to get a definition for
getpagesize(). However, this is not true for the Android NDK. This CL
brings back the HAVE_UNISTD_H definition and its associated header
check.
This also adds a HAVE_FUNC_SYSCONF definition, which checks for the
presence of sysconf(). The definition can be used later to replace
getpagesize() with sysconf().
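A sketch of how the two definitions could be combined at a call site
(the helper name is hypothetical):

  #include <cstddef>

  #if HAVE_UNISTD_H
  #include <unistd.h>  // declares getpagesize() and sysconf()
  #endif

  inline std::size_t SystemPageSize() {
  #if HAVE_FUNC_SYSCONF && HAVE_UNISTD_H
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  #elif HAVE_UNISTD_H
    return static_cast<std::size_t>(getpagesize());
  #else
    return 4096;  // conservative fallback when neither is available
  #endif
  }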
snappy-stubs-public.h defined the DISALLOW_COPY_AND_ASSIGN macro, so the
definition propagated to all translation units that included the open
source headers. The macro is now inlined, thus avoiding polluting the
macro environment of snappy users.
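A sketch of the inlined form (class name hypothetical; shown with C++11
deleted members rather than the classic private declarations):

  class Buffer {  // hypothetical class name
   public:
    Buffer() = default;

   private:
    // The inlined equivalent of DISALLOW_COPY_AND_ASSIGN(Buffer).
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;
  };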
to avoid the compiler coalescing multiple loads into a single load instruction
(which only works for aligned accesses).
A typical example where GCC would coalesce:
uint8* p = ...;
uint32 a = UNALIGNED_LOAD32(p);
uint32 b = UNALIGNED_LOAD32(p + 4);
uint32 c = a | b;
apparently Debian still targets these by default, giving us segfaults on
armel.
R=sanjay
git-svn-id: https://snappy.googlecode.com/svn/trunk@64 03e5f5b5-db94-4691-08a0-1a8bf15f6143
Achieved by moving logging macro definitions to a test-only
header file, and by changing non-test code to use assert,
fprintf, and abort instead of LOG/CHECK macros.
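For example, a fatal-error path in non-test code now looks roughly like
this (sketch; the function is hypothetical):

  #include <cassert>
  #include <cstdio>
  #include <cstdlib>

  std::FILE* OpenOrDie(const char* filename) {
    assert(filename != nullptr);           // invariant check instead of CHECK
    std::FILE* f = std::fopen(filename, "rb");
    if (f == nullptr) {
      std::fprintf(stderr, "Cannot open %s\n", filename);
      std::abort();                        // instead of LOG(FATAL)
    }
    return f;
  }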
R=sesse
git-svn-id: https://snappy.googlecode.com/svn/trunk@62 03e5f5b5-db94-4691-08a0-1a8bf15f6143
warnings. There are still some in the unit test, but the main .cc file should
be clean. We haven't enabled -Wall for the default build, since the unit test
is still not clean.
This also fixes a real bug in the open-source implementation of
ReadFileToStringOrDie(); it would not detect errors correctly.
I had to go through some pains to avoid performance loss as the types
were changed; I think there might still be some loss with 32-bit if and
only if LFS is enabled (i.e., size_t is 64-bit), but for regular 32-bit
and 64-bit I can't see any losses, and I've diffed the generated GCC
assembler between the old and new code without seeing any significant
differences. If anything, it's ever so slightly faster.
This may or may not enable compression of very large blocks (>2^32 bytes)
when size_t is 64-bit, but I haven't checked, and it is still not a supported
case.
git-svn-id: https://snappy.googlecode.com/svn/trunk@56 03e5f5b5-db94-4691-08a0-1a8bf15f6143
among others, Windows support. For Windows specifically, we could have used
CreateFileMapping/MapViewOfFile, but this should at least get us a bit closer
to compiling, and is of course also relevant for embedded systems with no MMU.
(Part 1/2)
R=csilvers
DELTA=9 (8 added, 0 deleted, 1 changed)
Revision created by MOE tool push_codebase.
MOE_MIGRATION=1031
git-svn-id: https://snappy.googlecode.com/svn/trunk@15 03e5f5b5-db94-4691-08a0-1a8bf15f6143