We also increased the hash table size by 1 bit, as the smaller table significantly degraded the compression ratio. Thus even level 1 might improve slightly.
PiperOrigin-RevId: 621456036
As far as we know, the lack of "cc" in the clobbers hasn't caused
problems yet, but it could. This change improves correctness,
and is almost certainly performance-neutral.
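A minimal sketch of the convention in question, with illustrative code rather than Snappy's actual asm: any inline assembly that modifies the condition flags should list "cc" in its clobbers, otherwise the compiler may assume the flags survive the statement.
```
#include <cstdint>

// Illustrative only (hypothetical helper): addq modifies EFLAGS, so
// "cc" belongs in the clobber list.
inline uint64_t IncrementWithAsm(uint64_t v) {
#if defined(__x86_64__)
  asm("addq $1, %0" : "+r"(v) : : "cc");
#else
  ++v;  // portable fallback
#endif
  return v;
}
```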
PiperOrigin-RevId: 487133620
This change replaces the hashing function used during compression with
one that is roughly as good but faster. This speeds up compression by
two to a few percent on the Intel-, AMD-, and Arm-based machines we
tested. The amount of compression is roughly unchanged.
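For context, a multiply-shift hash of the general shape described above looks like the sketch below; the constant and helper name are illustrative, not necessarily what this change landed.
```
#include <cstdint>
#include <cstring>

// Sketch of a multiplicative hash over 4 input bytes; kMul is an
// example odd multiplier, and `shift` selects how many high bits
// index the hash table.
inline uint32_t HashBytes(const char* p, int shift) {
  uint32_t bytes;
  std::memcpy(&bytes, p, sizeof(bytes));  // load 4 bytes, alias-safe
  constexpr uint32_t kMul = 0x1e35a7bd;
  return (bytes * kMul) >> shift;
}
```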
PiperOrigin-RevId: 485960303
By default, MemCpy() / MemMove() always copies 64 bytes in DecompressBranchless(). Profiling shows that in the vast majority of cases we need to copy many fewer bytes (typically <= 16). It is safe to copy fewer than 64 bytes as long as we still copy at least `len` bytes.
This change improves throughput by ~12% on ARM, ~35% on AMD Milan, and ~7% on Intel Cascade Lake.
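A hedged sketch of the idea; the function name and chunk sizes are assumed for illustration, not the exact patch:
```
#include <cassert>
#include <cstddef>
#include <cstring>

// Copy at least `len` (<= 64) bytes from src to dst. The destination
// is assumed to have 64 bytes of slack, so over-copying within that
// bound is safe; the common len <= 16 case touches only 16 bytes.
inline void MemCopy64(char* dst, const void* src, size_t len) {
  assert(len <= 64);
  std::memcpy(dst, src, 16);  // covers the typical short copy
  if (len > 16) {
    const char* s = static_cast<const char*>(src);
    std::memcpy(dst + 16, s + 16, 48);  // rare: finish all 64 bytes
  }
}
```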
PiperOrigin-RevId: 453917840
Not everything defining __GNUC__ supports flag outputs
from asm statements; in particular, some Clang versions
on macOS do not. The correct test per the GCC documentation
is __GCC_ASM_FLAG_OUTPUTS__, so use that instead.
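A sketch of the corrected guard, using the x86 carry-flag output constraint as an example (hypothetical helper, not Snappy's code):
```
#include <cstdint>

// Detect flag-output support via the documented feature macro rather
// than __GNUC__.
inline bool AddOverflows(uint64_t a, uint64_t b, uint64_t* sum) {
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(__x86_64__)
  bool carry;
  asm("addq %2, %0" : "+r"(a), "=@ccc"(carry) : "r"(b));
  *sum = a;
  return carry;
#else
  *sum = a + b;     // portable fallback
  return *sum < a;  // wrapped means the addition carried
#endif
}
```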
PiperOrigin-RevId: 423749308
As `i + offset` is promoted to a "negative" size_t,
UBSan would complain when adding the resulting offset to `dst`:
```
/tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43: runtime error: addition of unsigned offset to 0x6120003c5ec1 overflowed to 0x6120003c5ec0
#0 0x7f9ebd21769c in snappy::(anonymous namespace)::Copy64BytesWithPatternExtension(char*, unsigned long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43
#1 0x7f9ebd21769c in std::__1::pair<unsigned char const*, long> snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*, long, char*, long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:1160:15
```
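A minimal illustration of the trap with hypothetical values, not the actual call site: when a negative `offset` is folded into a `size_t` index, the intermediate sum wraps, and UBSan flags the pointer arithmetic even though the final address is in bounds.
```
#include <cstddef>

void Demo(char* dst, std::size_t i, std::ptrdiff_t offset) {
  // offset may be negative (e.g. -1). Folded into a size_t index,
  // it wraps to a huge value:
  //   char* bad = dst + (i + offset);  // i + offset computed as size_t
  // Applying the signed step to the pointer avoids the wraparound:
  char* ok = dst + i + offset;          // (dst + i), then signed adjust
  (void)ok;
}
```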
The final ip advance value doesn't have to wait for the
*tag load that produces the offset. It can be computed
along with the offset, so the codegen uses one csinc in
parallel with the ldrb, improving throughput.
With this change, a ~4.2% uplift is observed in UFlat/10
and ~3.7% in UFlatMedley.
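A generic sketch of the dependency break, not Snappy's actual decode loop: the advance is derived from state already in registers, so it can issue alongside the byte load instead of after it.
```
#include <cstddef>
#include <cstdint>

// `advance` depends only on previously computed state, so on AArch64
// the csinc that forms it can issue in parallel with the ldrb that
// loads the next tag, shortening the critical path through ip.
inline const uint8_t* NextIp(const uint8_t* ip, bool prior_cond,
                             uint8_t* tag_out) {
  std::ptrdiff_t advance = prior_cond ? 2 : 1;  // csinc-shaped select
  *tag_out = *ip;                               // ldrb, off the ip path
  return ip + advance;
}
```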
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I20ab211235bbf578c6c978f2bbd9160a49e920da
Clang doesn't realize the load comes with a free zero-extension,
and emits an extra 'and xn, xm, #0xff' to compute the offset.
With this change, the extra op is removed, and a consistent
1.7% performance uplift is observed.
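An assumed illustration of the pattern, not the literal patch: on AArch64, loading through an unsigned byte type already zero-extends (ldrb writes the whole register), so a following `& 0xff` is redundant.
```
#include <cstdint>

inline uint64_t LoadOffsetByte(const uint8_t* p) {
  return *p;  // single ldrb; high bits are already zero
  // return uint64_t{*p} & 0xff;  // the redundant mask costs an extra and
}
```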
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ica4617852c4b93eadc6c5c551dc3961ffbadb8f0
Inspired by kExtractMasksCombined, this patch uses shifts
to replace the table lookup. On Arm the codegen is 2 shift ops
(lsl+lsr). Compared to the previous ldr, which has a 4-cycle
load latency, the lsl+lsr pair needs only 2 cycles.
A slight (~0.3%) uplift is observed on N1, and ~3% on A72.
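A hedged sketch of the shift-based extraction; bounds and the helper name are assumed:
```
#include <cassert>
#include <cstdint>

// Keep the low n bytes of v. lsl+lsr (2 cycles total) replaces an ldr
// from a mask table (4-cycle load latency).
inline uint64_t ExtractLowBytes(uint64_t v, int n) {
  assert(n >= 1 && n <= 8);  // a shift by 64 would be undefined
  const int k = 64 - 8 * n;
  return (v << k) >> k;
}
```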
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b
The SSSE3 intrinsics we use have their direct analogues in NEON, so making this optimization portable requires a very thin translation layer.
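One possible shape for such a layer; the alias names are assumed, not necessarily Snappy's. _mm_shuffle_epi8 and NEON's vqtbl1q_u8 both perform a byte table lookup, so a one-line twin suffices for in-range indices.
```
#if defined(__SSSE3__)
#include <tmmintrin.h>
using V128 = __m128i;
inline V128 V128_Shuffle(V128 input, V128 mask) {
  return _mm_shuffle_epi8(input, mask);
}
#elif defined(__aarch64__)
#include <arm_neon.h>
using V128 = uint8x16_t;
inline V128 V128_Shuffle(V128 input, V128 mask) {
  return vqtbl1q_u8(input, mask);
}
#endif
```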
PiperOrigin-RevId: 381280165
Inline variables are a C++17 feature. Fortunately, inline is only useful for header declarations, which may be included in multiple compilation units; the declarations modified by this CL occur in a single compilation unit.
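A small illustration of the distinction, with hypothetical names:
```
// In a header included by many TUs, C++17 `inline` gives the variable
// a single program-wide definition:
inline constexpr int kHeaderConstant = 256;  // requires C++17

// In a single .cc file, plain constexpr suffices and predates C++17:
constexpr int kFileLocalConstant = 256;
```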
PiperOrigin-RevId: 347338760
2) Replace offset extraction with a lookup mask. This takes fewer uops, and is needed because we must special-case type 3 to always return 0 so as to properly trigger the fallback.
3) Unroll the loop twice. This removes some loop-condition checks and improves the generated assembly: the loop variables tend to end up in different registers, requiring movs, and having two consecutive copies of the body allows those movs to be elided (see the sketch below).
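A generic sketch of the 2x unroll in (3), not Snappy's actual loop:
```
#include <cstddef>
#include <cstdint>

// Two consecutive copies of the body halve the loop-condition checks
// and let the compiler keep iteration state in stable registers, so
// the movs between copies can be elided.
uint64_t SumUnrolled2(const uint8_t* p, size_t n) {
  uint64_t acc = 0;
  size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    acc += p[i];      // first copy of the body
    acc += p[i + 1];  // second copy of the body
  }
  if (i < n) acc += p[i];  // tail element
  return acc;
}
```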
PiperOrigin-RevId: 346663328
When SSSE3 is available:
- Use PSHUFB (_mm_shuffle_epi8) to handle pattern sizes 1 to 15 (previously it handled sizes 1 to 7).
- This enables us to do 16-byte copies instead of 8-byte copies, because the extended pattern is at least 16 bytes long.
- Use a shuffle-reshuffle strategy to generate the next pattern after loading the initial pattern. This enables us to write 4 conditionals (similar to the pattern size >= 16 case), which allows FDO to lay out the code with respect to the actual probabilities of each length.
- The PSHUFB masks are now generated programmatically at compile time.
When SSSE3 is unavailable:
- No change.
In both cases:
- assert(op < op_limit) in IncrementalCopy, so that we can check 'op_limit <= buf_limit - 15' instead of 'op_limit <= buf_limit - 16'. All existing call sites of IncrementalCopy guarantee this.
The 'bin' case is notably >20% faster because it has many repeated character patterns (i.e. pattern_size = 1).
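A hedged sketch of the PSHUFB-based pattern extension; the change generates the masks at compile time, while a runtime loop is shown here for brevity.
```
#include <tmmintrin.h>
#include <cassert>
#include <cstring>

// Extend a pattern of size 1..15 to fill a 16-byte register using
// mask[i] = i % pattern_size, so later copies can move 16 bytes at
// a time.
inline __m128i LoadPattern(const char* src, int pattern_size) {
  assert(pattern_size >= 1 && pattern_size <= 15);
  char mask_bytes[16];
  for (int i = 0; i < 16; ++i) {
    mask_bytes[i] = static_cast<char>(i % pattern_size);
  }
  __m128i mask, data;
  std::memcpy(&mask, mask_bytes, sizeof(mask));
  std::memcpy(&data, src, sizeof(data));  // assumes 16 readable bytes
  return _mm_shuffle_epi8(data, mask);    // PSHUFB replicates the pattern
}
```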
PiperOrigin-RevId: 346454471