mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-30 04:41:49 +00:00
459969e993
Summary: **Background** - runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, where performance is dramatically affected by the availability of certain CPU instructions and code using intrinsics for those instructions. And Java builds with native library try to be broadly compatible but performant. What has changed is that CRC32c is no longer the most efficient cheecksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar on an older Skylake system without AVX512): PORTABLE=1 empty USE_SSE : xxh3->8 GB/s crc32c->0.8 GB/s (no SSE4.2 nor AVX2 instructions) PORTABLE=1 USE_SSE=1 : xxh3->19 GB/s crc32c->16 GB/s (with SSE4.2 and AVX2) PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s crc32c->16 GB/s (also some AVX512) Testing a ~10 year old system, with SSE4.2 but without AVX2, crc32c is a similar speed to the new systems but xxh3 is only about half that speed, also 8GB/s like the non-AVX2 compile above. Given that xxh3 has specific optimization for AVX2, I think we can infer that that crc32c is only fastest for that ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not like >10x slower for unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases. x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO. **This change** does two related things: * Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2 but on the other hand we have to ensure nothing else does.) * Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detectiong in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection on all three, and we have suffered from missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components like xxhash that do source code detection anyway. **In more detail:** * `HAVE_SSE42`, `HAVE_AVX2`, and `HAVE_PCLMUL` replaced by standard macros `__SSE4_2__`, `__AVX2__`, and `__PCLMUL__`. * MSVC does not provide high fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from `__AVX__` or `__AVX2__` in a compatibility header. In rare cases of false negative or false positive feature detection, a build engineer should be able to set defines to work around the issue. * `__POPCNT__` is another standard define, but we happen to only need it on MSVC, where it is set by that compatibility header, or can be set by the build engineer. * `PORTABLE` can be set to a CPU type, e.g. "haswell", to compile for that CPU type. * `USE_SSE` is deprecated, now equivalent to PORTABLE=haswell, which roughly approximates its old behavior. Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11419 Test Plan: existing tests, CI Manual performance tests after the change match the before above (none expected with make build). We also see AVX2 optimized Bloom filter code enabled when expected, by injecting a compiler error. (Performance difference is not big on my current CPU.) Reviewed By: ajkr Differential Revision: D45489041 Pulled By: pdillinger fbshipit-source-id: 60ceb0dd2aa3b365c99ed08a8b2a087a9abb6a70
300 lines
10 KiB
C++
300 lines
10 KiB
C++
// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
#pragma once
|
|
|
|
#include <assert.h>
|
|
#ifdef _MSC_VER
|
|
#include <intrin.h>
|
|
#endif
|
|
|
|
#include <cstdint>
|
|
#include <type_traits>
|
|
|
|
#include "port/lang.h"
|
|
#include "rocksdb/rocksdb_namespace.h"
|
|
|
|
ASSERT_FEATURE_COMPAT_HEADER();
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
// Fast implementation of floor(log2(v)). Undefined for 0 or negative
|
|
// numbers (in case of signed type).
|
|
template <typename T>
|
|
inline int FloorLog2(T v) {
|
|
static_assert(std::is_integral<T>::value, "non-integral type");
|
|
assert(v > 0);
|
|
#ifdef _MSC_VER
|
|
static_assert(sizeof(T) <= sizeof(uint64_t), "type too big");
|
|
unsigned long idx = 0;
|
|
if (sizeof(T) <= sizeof(uint32_t)) {
|
|
_BitScanReverse(&idx, static_cast<uint32_t>(v));
|
|
} else {
|
|
#if defined(_M_X64) || defined(_M_ARM64)
|
|
_BitScanReverse64(&idx, static_cast<uint64_t>(v));
|
|
#else
|
|
const auto vh = static_cast<uint32_t>(static_cast<uint64_t>(v) >> 32);
|
|
if (vh != 0) {
|
|
_BitScanReverse(&idx, static_cast<uint32_t>(vh));
|
|
idx += 32;
|
|
} else {
|
|
_BitScanReverse(&idx, static_cast<uint32_t>(v));
|
|
}
|
|
#endif
|
|
}
|
|
return idx;
|
|
#else
|
|
static_assert(sizeof(T) <= sizeof(unsigned long long), "type too big");
|
|
if (sizeof(T) <= sizeof(unsigned int)) {
|
|
int lz = __builtin_clz(static_cast<unsigned int>(v));
|
|
return int{sizeof(unsigned int)} * 8 - 1 - lz;
|
|
} else if (sizeof(T) <= sizeof(unsigned long)) {
|
|
int lz = __builtin_clzl(static_cast<unsigned long>(v));
|
|
return int{sizeof(unsigned long)} * 8 - 1 - lz;
|
|
} else {
|
|
int lz = __builtin_clzll(static_cast<unsigned long long>(v));
|
|
return int{sizeof(unsigned long long)} * 8 - 1 - lz;
|
|
}
|
|
#endif
|
|
}
|
|
|
|
// Constexpr version of FloorLog2
|
|
template <typename T>
|
|
constexpr int ConstexprFloorLog2(T v) {
|
|
int rv = 0;
|
|
while (v > T{1}) {
|
|
++rv;
|
|
v >>= 1;
|
|
}
|
|
return rv;
|
|
}
|
|
|
|
// Number of low-order zero bits before the first 1 bit. Undefined for 0.
|
|
template <typename T>
|
|
inline int CountTrailingZeroBits(T v) {
|
|
static_assert(std::is_integral<T>::value, "non-integral type");
|
|
assert(v != 0);
|
|
#ifdef _MSC_VER
|
|
static_assert(sizeof(T) <= sizeof(uint64_t), "type too big");
|
|
unsigned long tz = 0;
|
|
if (sizeof(T) <= sizeof(uint32_t)) {
|
|
_BitScanForward(&tz, static_cast<uint32_t>(v));
|
|
} else {
|
|
#if defined(_M_X64) || defined(_M_ARM64)
|
|
_BitScanForward64(&tz, static_cast<uint64_t>(v));
|
|
#else
|
|
_BitScanForward(&tz, static_cast<uint32_t>(v));
|
|
if (tz == 0) {
|
|
_BitScanForward(&tz,
|
|
static_cast<uint32_t>(static_cast<uint64_t>(v) >> 32));
|
|
tz += 32;
|
|
}
|
|
#endif
|
|
}
|
|
return static_cast<int>(tz);
|
|
#else
|
|
static_assert(sizeof(T) <= sizeof(unsigned long long), "type too big");
|
|
if (sizeof(T) <= sizeof(unsigned int)) {
|
|
return __builtin_ctz(static_cast<unsigned int>(v));
|
|
} else if (sizeof(T) <= sizeof(unsigned long)) {
|
|
return __builtin_ctzl(static_cast<unsigned long>(v));
|
|
} else {
|
|
return __builtin_ctzll(static_cast<unsigned long long>(v));
|
|
}
|
|
#endif
|
|
}
|
|
|
|
// Not all MSVC compile settings will use `BitsSetToOneFallback()`. We include
|
|
// the following code at coarse granularity for simpler macros. It's important
|
|
// to exclude at least so our non-MSVC unit test coverage tool doesn't see it.
|
|
#ifdef _MSC_VER
|
|
|
|
namespace detail {
|
|
|
|
template <typename T>
|
|
int BitsSetToOneFallback(T v) {
|
|
const int kBits = static_cast<int>(sizeof(T)) * 8;
|
|
static_assert((kBits & (kBits - 1)) == 0, "must be power of two bits");
|
|
// we static_cast these bit patterns in order to truncate them to the correct
|
|
// size. Warning C4309 dislikes this technique, so disable it here.
|
|
#pragma warning(disable : 4309)
|
|
v = static_cast<T>(v - ((v >> 1) & static_cast<T>(0x5555555555555555ull)));
|
|
v = static_cast<T>((v & static_cast<T>(0x3333333333333333ull)) +
|
|
((v >> 2) & static_cast<T>(0x3333333333333333ull)));
|
|
v = static_cast<T>((v + (v >> 4)) & static_cast<T>(0x0F0F0F0F0F0F0F0Full));
|
|
#pragma warning(default : 4309)
|
|
for (int shift_bits = 8; shift_bits < kBits; shift_bits <<= 1) {
|
|
v += static_cast<T>(v >> shift_bits);
|
|
}
|
|
// we want the bottom "slot" that's big enough to represent a value up to
|
|
// (and including) kBits.
|
|
return static_cast<int>(v & static_cast<T>(kBits | (kBits - 1)));
|
|
}
|
|
|
|
} // namespace detail
|
|
|
|
#endif // _MSC_VER
|
|
|
|
// Number of bits set to 1. Also known as "population count".
|
|
template <typename T>
|
|
inline int BitsSetToOne(T v) {
|
|
static_assert(std::is_integral<T>::value, "non-integral type");
|
|
#ifdef _MSC_VER
|
|
static_assert(sizeof(T) <= sizeof(uint64_t), "type too big");
|
|
if (sizeof(T) < sizeof(uint32_t)) {
|
|
// This bit mask is to avoid a compiler warning on unused path
|
|
constexpr auto mm = 8 * sizeof(uint32_t) - 1;
|
|
// The bit mask is to neutralize sign extension on small signed types
|
|
constexpr uint32_t m = (uint32_t{1} << ((8 * sizeof(T)) & mm)) - 1;
|
|
#if __POPCNT__
|
|
return static_cast<int>(__popcnt(static_cast<uint32_t>(v) & m));
|
|
#else
|
|
return static_cast<int>(detail::BitsSetToOneFallback(v) & m);
|
|
#endif // __POPCNT__
|
|
} else if (sizeof(T) == sizeof(uint32_t)) {
|
|
#if __POPCNT__
|
|
return static_cast<int>(__popcnt(static_cast<uint32_t>(v)));
|
|
#else
|
|
return detail::BitsSetToOneFallback(static_cast<uint32_t>(v));
|
|
#endif // __POPCNT__
|
|
} else {
|
|
#if __POPCNT__
|
|
#ifdef _M_X64
|
|
return static_cast<int>(__popcnt64(static_cast<uint64_t>(v)));
|
|
#else
|
|
return static_cast<int>(
|
|
__popcnt(static_cast<uint32_t>(static_cast<uint64_t>(v) >> 32) +
|
|
__popcnt(static_cast<uint32_t>(v))));
|
|
#endif // _M_X64
|
|
#else
|
|
return detail::BitsSetToOneFallback(static_cast<uint64_t>(v));
|
|
#endif // __POPCNT__
|
|
}
|
|
#else
|
|
static_assert(sizeof(T) <= sizeof(unsigned long long), "type too big");
|
|
if (sizeof(T) < sizeof(unsigned int)) {
|
|
// This bit mask is to avoid a compiler warning on unused path
|
|
constexpr auto mm = 8 * sizeof(unsigned int) - 1;
|
|
// This bit mask is to neutralize sign extension on small signed types
|
|
constexpr unsigned int m = (1U << ((8 * sizeof(T)) & mm)) - 1;
|
|
return __builtin_popcount(static_cast<unsigned int>(v) & m);
|
|
} else if (sizeof(T) == sizeof(unsigned int)) {
|
|
return __builtin_popcount(static_cast<unsigned int>(v));
|
|
} else if (sizeof(T) <= sizeof(unsigned long)) {
|
|
return __builtin_popcountl(static_cast<unsigned long>(v));
|
|
} else {
|
|
return __builtin_popcountll(static_cast<unsigned long long>(v));
|
|
}
|
|
#endif
|
|
}
|
|
|
|
template <typename T>
|
|
inline int BitParity(T v) {
|
|
static_assert(std::is_integral<T>::value, "non-integral type");
|
|
#ifdef _MSC_VER
|
|
// bit parity == oddness of popcount
|
|
return BitsSetToOne(v) & 1;
|
|
#else
|
|
static_assert(sizeof(T) <= sizeof(unsigned long long), "type too big");
|
|
if (sizeof(T) <= sizeof(unsigned int)) {
|
|
// On any sane systen, potential sign extension here won't change parity
|
|
return __builtin_parity(static_cast<unsigned int>(v));
|
|
} else if (sizeof(T) <= sizeof(unsigned long)) {
|
|
return __builtin_parityl(static_cast<unsigned long>(v));
|
|
} else {
|
|
return __builtin_parityll(static_cast<unsigned long long>(v));
|
|
}
|
|
#endif
|
|
}
|
|
|
|
// Swaps between big and little endian. Can be used in combination with the
|
|
// little-endian encoding/decoding functions in coding_lean.h and coding.h to
|
|
// encode/decode big endian.
|
|
template <typename T>
|
|
inline T EndianSwapValue(T v) {
|
|
static_assert(std::is_integral<T>::value, "non-integral type");
|
|
|
|
#ifdef _MSC_VER
|
|
if (sizeof(T) == 2) {
|
|
return static_cast<T>(_byteswap_ushort(static_cast<uint16_t>(v)));
|
|
} else if (sizeof(T) == 4) {
|
|
return static_cast<T>(_byteswap_ulong(static_cast<uint32_t>(v)));
|
|
} else if (sizeof(T) == 8) {
|
|
return static_cast<T>(_byteswap_uint64(static_cast<uint64_t>(v)));
|
|
}
|
|
#else
|
|
if (sizeof(T) == 2) {
|
|
return static_cast<T>(__builtin_bswap16(static_cast<uint16_t>(v)));
|
|
} else if (sizeof(T) == 4) {
|
|
return static_cast<T>(__builtin_bswap32(static_cast<uint32_t>(v)));
|
|
} else if (sizeof(T) == 8) {
|
|
return static_cast<T>(__builtin_bswap64(static_cast<uint64_t>(v)));
|
|
}
|
|
#endif
|
|
// Recognized by clang as bswap, but not by gcc :(
|
|
T ret_val = 0;
|
|
for (std::size_t i = 0; i < sizeof(T); ++i) {
|
|
ret_val |= ((v >> (8 * i)) & 0xff) << (8 * (sizeof(T) - 1 - i));
|
|
}
|
|
return ret_val;
|
|
}
|
|
|
|
// Reverses the order of bits in an integral value
|
|
template <typename T>
|
|
inline T ReverseBits(T v) {
|
|
T r = EndianSwapValue(v);
|
|
const T kHighestByte = T{1} << ((sizeof(T) - 1) * 8);
|
|
const T kEveryByte = kHighestByte | (kHighestByte / 255);
|
|
|
|
r = ((r & (kEveryByte * 0x0f)) << 4) | ((r >> 4) & (kEveryByte * 0x0f));
|
|
r = ((r & (kEveryByte * 0x33)) << 2) | ((r >> 2) & (kEveryByte * 0x33));
|
|
r = ((r & (kEveryByte * 0x55)) << 1) | ((r >> 1) & (kEveryByte * 0x55));
|
|
|
|
return r;
|
|
}
|
|
|
|
// Every output bit depends on many input bits in the same and higher
|
|
// positions, but not lower positions. Specifically, this function
|
|
// * Output highest bit set to 1 is same as input (same FloorLog2, or
|
|
// equivalently, same number of leading zeros)
|
|
// * Is its own inverse (an involution)
|
|
// * Guarantees that b bottom bits of v and c bottom bits of
|
|
// DownwardInvolution(v) uniquely identify b + c bottom bits of v
|
|
// (which is all of v if v < 2**(b + c)).
|
|
// ** A notable special case is that modifying c adjacent bits at
|
|
// some chosen position in the input is bijective with the bottom c
|
|
// output bits.
|
|
// * Distributes over xor, as in DI(a ^ b) == DI(a) ^ DI(b)
|
|
//
|
|
// This transformation is equivalent to a matrix*vector multiplication in
|
|
// GF(2) where the matrix is recursively defined by the pattern matrix
|
|
// P = | 1 1 |
|
|
// | 0 1 |
|
|
// and replacing 1's with P and 0's with 2x2 zero matices to some depth,
|
|
// e.g. depth of 6 for 64-bit T. An essential feature of this matrix
|
|
// is that all square sub-matrices that include the top row are invertible.
|
|
template <typename T>
|
|
inline T DownwardInvolution(T v) {
|
|
static_assert(std::is_integral<T>::value, "non-integral type");
|
|
static_assert(sizeof(T) <= 8, "only supported up to 64 bits");
|
|
|
|
uint64_t r = static_cast<uint64_t>(v);
|
|
if constexpr (sizeof(T) > 4) {
|
|
r ^= r >> 32;
|
|
}
|
|
if constexpr (sizeof(T) > 2) {
|
|
r ^= (r & 0xffff0000ffff0000U) >> 16;
|
|
}
|
|
if constexpr (sizeof(T) > 1) {
|
|
r ^= (r & 0xff00ff00ff00ff00U) >> 8;
|
|
}
|
|
r ^= (r & 0xf0f0f0f0f0f0f0f0U) >> 4;
|
|
r ^= (r & 0xccccccccccccccccU) >> 2;
|
|
r ^= (r & 0xaaaaaaaaaaaaaaaaU) >> 1;
|
|
return static_cast<T>(r);
|
|
}
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|