added optimization for 64-bit shift and clip to 16-bits, nets 5-6% improvement #9
Conversation
@bitbank2 hi please rebase to get the travis CI fixes :)
This makes sense, though I have a few comments.
Testing performed: built (but didn't run) circuitpython for circuitplayground express, no compile or link errors encountered.
@@ -142,7 +142,8 @@ void PolyphaseMono(short *pcm, int *vbuf, const int *coefBase)
MC0M(6)
MC0M(7)

*(pcm + 0) = ClipToShort((int)SAR64(sum1L, (32-CSHIFT)), DEF_NFRACBITS);
// *(pcm + 0) = ClipToShort((int)SAR64(sum1L, (32-CSHIFT)), DEF_NFRACBITS);
Prefer to delete outdated code, rather than commenting it out.
src/assembly.h (Outdated)
//sign = x >> 31;
//if (sign != (x >> 15))
//    x = sign ^ ((1 << 15) - 1);
__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x, int n)
This block gets used for building circuitpython's mp3 player, so we're covered.
src/assembly.h (Outdated)
shrd eax, edx, cl
sar edx, cl
}
} else if (n < 64) {
In the ARM versions you note that n < 32, but you cover all cases here. This could trip someone up who later tests on x86 and tries to deploy on ARM.
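As a minimal C sketch (my illustration, not code from this PR), assuming the usual low/high word split and 0 < n < 64, the two cases look like this:

#include <stdint.h>

/* Illustration only: the ARM asm forms (xLo >> n) | (xHi << (32-n)), which
 * is valid only when 0 < n < 32.  For n in [32, 63] the result has to come
 * from the high word instead. */
static inline int32_t sar64_low_word(int64_t x, int n)
{
    uint32_t xLo = (uint32_t)x;
    int32_t  xHi = (int32_t)(x >> 32);

    if (n < 32)
        return (int32_t)((xLo >> n) | ((uint32_t)xHi << (32 - n)));
    else
        return xHi >> (n - 32);
}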
src/assembly.h (Outdated)
__asm__ __volatile__( "lsl %2, %0, %3\n\t" // tmp <- xHi<<(32-n)
                      "lsr %1, %1, %4\n\t" // xLo <- xLo>>n
                      "orr %1, %2\n\t"     // xLo <= xLo || tmp
I checked what happens if I try to just code SAR64_clip in direct C code and compile it for ARM (gcc 7.mumble). This code:
#include <stdint.h>

short SAR64_clip(int64_t x, int n) {
    if (n >= 32) __builtin_unreachable();
    return (x >> n);
}

short SAR64_clip_26(int64_t x) {
    return SAR64_clip(x, 26);
}
gets me the sequence
00000040 <SAR64_clip_26>:
40: 0e80 lsrs r0, r0, #26
42: ea40 1081 orr.w r0, r0, r1, lsl #6
46: b200 sxth r0, r0
48: 4770 bx lr
Is this sequence (lsrs + orr.w-with-lsl) right too? Is it preferable? It looks like the SAR64_clip argument is always a compile-time constant. (I assume that sxth is an operation that can be omitted, but I'm not 100% sure.)
The sign extension (sxth) is not needed, and it doesn't help if the value overflows; but since the value doesn't overflow, we don't have to worry about it. I wasn't sure if you would mind hard-coding the shift-right value of 26, but it does produce slightly faster code. I just pushed a new version with the simpler code and it's a tiny bit faster.
In general, I think the state of the art of compilers has advanced a lot since src/assembly.h was written, and it doesn't hurt to check whether these fancy wrappers are still needed. Assuming gcc, or a compiler with optimization parity with gcc, doesn't feel that outlandish. MULSHIFT32 and MADD64 get sensible results when just coded in C, and __builtin_clz uses the ARM clz instruction directly, but __builtin_abs creates a branching form. Using a Hacker's Delight C implementation for FASTABS gives just 2 instructions, though they're both 32-bit in Thumb mode.
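A minimal sketch of the usual sign-mask form of that trick (my reconstruction under that assumption, not the snippet from the original comment):

#include <stdint.h>

/* Branchless abs via the sign mask (Hacker's Delight style).  On ARM, gcc
 * folds the ">> 31" into the flexible second operand, so this typically
 * compiles to an eor + sub pair, both 32-bit encodings in Thumb-2. */
static inline int32_t fast_abs(int32_t x)
{
    int32_t mask = x >> 31;      /* 0 if x >= 0, all ones if x < 0 */
    return (x ^ mask) - mask;
}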
__asm__ __volatile__(
    "lsr %1, %1, #26\n\t"         // xLo <- xLo>>n
    "orr %1, %1, %0, lsl #6\n\t"  // xLo <= xLo || (xHi << 6)
    : "+&r" (xHi), "+r" (xLo) );
If written like this, the code will automatically adapt if n is constant and use the more efficient sequence:
__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x, int n) {
    unsigned int xLo = (unsigned int) x;
    int xHi = (int) (x >> 32);
    if (__builtin_constant_p(n)) {
        __asm__ __volatile__(
            "lsr %1, %1, %2\n\t"          // xLo <- xLo>>n
            "orr %1, %1, %0, lsl %3\n\t"  // xLo <= xLo || (xHi << (32-n))
            : "+&r" (xHi), "+r" (xLo)
            : "M" (n), "M" (32-n) );
    } else {
        int nComp = 32-n;
        int tmp;
        __asm__ __volatile__(
            "lsl %2, %0, %3\n\t"  // tmp <- xHi<<(32-n)
            "lsr %1, %1, %4\n\t"  // xLo <- xLo>>n
            "orr %1, %2\n\t"      // xLo <= xLo || tmp
            : "+&r" (xHi), "+r" (xLo), "=&r" (tmp)
            : "r" (nComp), "r" (n) );
    }
    return( (short)xLo );
}
Otherwise, it might be good to document (e.g., by putting it in the function name?) that this is a constant shift of 26 bits.
Is there a missing implementation (or two?) of this function for the other CPU and compiler choices?
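For what it's worth, a plain-C fallback along the lines of the snippet above would presumably cover the remaining targets. This is only a sketch, not code from this PR, and it assumes (as discussed) that the shifted value never overflows 16 bits:

static __inline short SAR64_Clip(Word64 x, int n)
{
    /* generic-C sketch: 64-bit arithmetic shift right, then narrow to 16
     * bits; no explicit saturation, relying on the decoder's guarantee
     * that the result already fits in a short */
    return (short)(x >> n);
}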
Good idea about the constant vs variable shift amount. I've written a lot of assembly language over the years, but avoided inline assembly because it would get mangled by the compiler into something different. External asm files were safe from compiler changes.
I wrote the x86 asm version of the function, but perhaps I missed something. If you're happy, then please do the merge.
In the version that exists now, it looks like you have incompatible SAR64_clip between x86 and arm.
+static __inline short SAR64_Clip(Word64 x, int n)
and
+__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x)
I don't know how to build the x86 version, but this seems fishy to me. Please let me know what's up.
…s and fixed the x86 target
Thank you! Some additional discussion occurred in discord and we agreed on the points of contention, with the last minor change.
Additionally, I tested and mp3s still play back on my pygamer with these changes. Audio sounds identical.
Most of the decode time is spent in the PolyphaseStereo() and PolyphaseMono() functions doing 64-bit integer math. The SIMD instructions of the Cortex-M4 take care of most of that, but the 64-bit shift right followed by clip to 16-bits had room for improvement. I added an inline asm function to shave off a few cycles.
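As a loose, self-contained illustration (the helper names below are made up for the example; the real ClipToShort/SAR64 live in the Helix sources and have different signatures), the change collapses a shift-then-saturate pair into a single shift-and-narrow, using the 26-bit total shift discussed above:

#include <stdint.h>

static inline int16_t clip_to_short(int32_t x)
{
    /* saturate to the signed 16-bit range */
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* before: full 64-bit arithmetic shift, truncate to 32 bits, then saturate */
static inline int16_t shift_then_clip(int64_t sum)
{
    return clip_to_short((int32_t)(sum >> 26));
}

/* after: because the decoder guarantees the value fits in 16 bits, the
 * saturation can be dropped and the shift + narrow fused, which is what
 * the SAR64_Clip inline asm does in two instructions on the Cortex-M4 */
static inline int16_t shift_and_narrow(int64_t sum)
{
    return (int16_t)(sum >> 26);
}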