added optimization for 64-bit shift and clip to 16-bits, nets 5-6% improvement #9
Conversation
@bitbank2 hi please rebase to get the travis CI fixes :)
This makes sense, though I have a few comments.
Testing performed: built (but didn't run) circuitpython for circuitplayground express, no compile or link errors encountered.
@@ -142,7 +142,8 @@ void PolyphaseMono(short *pcm, int *vbuf, const int *coefBase)
MC0M(6)
MC0M(7)

*(pcm + 0) = ClipToShort((int)SAR64(sum1L, (32-CSHIFT)), DEF_NFRACBITS);
// *(pcm + 0) = ClipToShort((int)SAR64(sum1L, (32-CSHIFT)), DEF_NFRACBITS);
Prefer to delete outdated code, rather than commenting it out.
src/assembly.h (Outdated)
//sign = x >> 31;
//if (sign != (x >> 15))
//    x = sign ^ ((1 << 15) - 1);
__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x, int n)
This block gets used for building circuitpython's mp3 player, so we're covered.
src/assembly.h (Outdated)
shrd eax, edx, cl
sar edx, cl
}
} else if (n < 64) {
In the ARM versions you note that n < 32, but you cover all cases here. This could trip someone up who later tests on x86 and tries to deploy on ARM.
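As a minimal C sketch (my illustration, not code from this PR), assuming the usual low/high word split and 0 < n < 64, the two cases look like this:

#include <stdint.h>

/* Illustration only: the ARM asm forms (xLo >> n) | (xHi << (32-n)), which
 * is valid only when 0 < n < 32.  For n in [32, 63] the result has to come
 * from the high word instead. */
static inline int32_t sar64_low_word(int64_t x, int n)
{
    uint32_t xLo = (uint32_t)x;
    int32_t  xHi = (int32_t)(x >> 32);

    if (n < 32)
        return (int32_t)((xLo >> n) | ((uint32_t)xHi << (32 - n)));
    else
        return xHi >> (n - 32);
}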
src/assembly.h (Outdated)
__asm__ __volatile__( "lsl %2, %0, %3\n\t" // tmp <- xHi<<(32-n)
                      "lsr %1, %1, %4\n\t" // xLo <- xLo>>n
                      "orr %1, %2\n\t"     // xLo <= xLo || tmp
I checked what happens if I try to just code SAR64_clip in direct C code and compile it for ARM (gcc 7.mumble). This code:
#include <stdint.h>

short SAR64_clip(int64_t x, int n) {
    if (n >= 32) __builtin_unreachable();
    return (x >> n);
}

short SAR64_clip_26(int64_t x) {
    return SAR64_clip(x, 26);
}
gets me the sequence
00000040 <SAR64_clip_26>:
40: 0e80 lsrs r0, r0, #26
42: ea40 1081 orr.w r0, r0, r1, lsl #6
46: b200 sxth r0, r0
48: 4770 bx lr
Is this sequence (lsrs + orr.w-with-lsl) right too? Is it preferable? It looks like the SAR64_clip argument is always a compile-time constant. (I assume that sxth is an operation that can be omitted, but I'm not 100% sure.)
The sign extension (sxth) is not needed, and it doesn't help if the value overflows; but since the value doesn't overflow, we don't have to worry about it. I wasn't sure if you would mind hard-coding the shift-right value of 26, but it does produce slightly faster code. I just pushed a new version with the simpler code and it's a tiny bit faster.
In general, I think the state of the art of compilers has advanced a lot since src/assembly.h was written, and it doesn't hurt to check whether these fancy wrappers are still needed. Assuming gcc, or a compiler with optimization parity with gcc, doesn't feel that outlandish. MULSHIFT32 and MADD64 get sensible results when just coded in C, and __builtin_clz uses the ARM clz instruction directly, but __builtin_abs creates a branching form. Using a Hacker's Delight C implementation for FASTABS gives just 2 instructions, though they're both 32-bit in Thumb mode.
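A minimal sketch of the usual sign-mask form of that trick (my reconstruction under that assumption, not the snippet from the original comment):

#include <stdint.h>

/* Branchless abs via the sign mask (Hacker's Delight style).  On ARM, gcc
 * folds the ">> 31" into the flexible second operand, so this typically
 * compiles to an eor + sub pair, both 32-bit encodings in Thumb-2. */
static inline int32_t fast_abs(int32_t x)
{
    int32_t mask = x >> 31;      /* 0 if x >= 0, all ones if x < 0 */
    return (x ^ mask) - mask;
}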
__asm__ __volatile__(
    "lsr %1, %1, #26\n\t"         // xLo <- xLo>>n
    "orr %1, %1, %0, lsl #6\n\t"  // xLo <= xLo || (xHi << 6)
    : "+&r" (xHi), "+r" (xLo) );
If written like this, the code will automatically adapt if n is constant and use the more efficient sequence:
__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x, int n) {
    unsigned int xLo = (unsigned int) x;
    int xHi = (int) (x >> 32);
    if (__builtin_constant_p(n)) {
        __asm__ __volatile__(
            "lsr %1, %1, %2\n\t"          // xLo <- xLo>>n
            "orr %1, %1, %0, lsl %3\n\t"  // xLo <= xLo || (xHi << (32-n))
            : "+&r" (xHi), "+r" (xLo)
            : "M" (n), "M" (32-n) );
    } else {
        int nComp = 32-n;
        int tmp;
        __asm__ __volatile__(
            "lsl %2, %0, %3\n\t"  // tmp <- xHi<<(32-n)
            "lsr %1, %1, %4\n\t"  // xLo <- xLo>>n
            "orr %1, %2\n\t"      // xLo <= xLo || tmp
            : "+&r" (xHi), "+r" (xLo), "=&r" (tmp)
            : "r" (nComp), "r" (n) );
    }
    return( (short)xLo );
}
Otherwise, it might be good to document (e.g., by putting it in the function name?) that this is a constant shift of 26 bits.
Is there a missing implementation (or two?) of this function for the other CPU and compiler choices?
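For what it's worth, a plain-C fallback along the lines of the snippet above would presumably cover the remaining targets. This is only a sketch, not code from this PR, and it assumes (as discussed) that the shifted value never overflows 16 bits:

static __inline short SAR64_Clip(Word64 x, int n)
{
    /* generic-C sketch: 64-bit arithmetic shift right, then narrow to 16
     * bits; no explicit saturation, relying on the decoder's guarantee
     * that the result already fits in a short */
    return (short)(x >> n);
}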
Good idea about the constant vs variable shift amount. I've written a lot of assembly language over the years, but avoided inline assembly because it would get mangled by the compiler into something different. External asm files were safe from compiler changes.
I wrote the x86 asm version of the function, but perhaps I missed something. If you're happy, then please do the merge.
In the version that exists now, it looks like you have incompatible SAR64_clip between x86 and arm.
+static __inline short SAR64_Clip(Word64 x, int n)
and
+__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x)
I don't know how to build the x86 version, but this seems fishy to me. Please let me know what's up.
…s and fixed the x86 target
Thank you! Some additional discussion occurred in discord and we agreed on the points of contention, with the last minor change.
Additionally, I tested and mp3s still play back on my pygamer with these changes. Audio sounds identical.
Most of the decode time is spent in the PolyphaseStereo() and PolyphaseMono() functions doing 64-bit integer math. The SIMD instructions of the Cortex-M4 take care of most of that, but the 64-bit shift right followed by clip to 16-bits had room for improvement. I added an inline asm function to shave off a few cycles.
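As a loose, self-contained illustration (the helper names below are made up for the example; the real ClipToShort/SAR64 live in the Helix sources and have different signatures), the change collapses a shift-then-saturate pair into a single shift-and-narrow, using the 26-bit total shift discussed above:

#include <stdint.h>

static inline int16_t clip_to_short(int32_t x)
{
    /* saturate to the signed 16-bit range */
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* before: full 64-bit arithmetic shift, truncate to 32 bits, then saturate */
static inline int16_t shift_then_clip(int64_t sum)
{
    return clip_to_short((int32_t)(sum >> 26));
}

/* after: because the decoder guarantees the value fits in 16 bits, the
 * saturation can be dropped and the shift + narrow fused, which is what
 * the SAR64_Clip inline asm does in two instructions on the Cortex-M4 */
static inline int16_t shift_and_narrow(int64_t sum)
{
    return (int16_t)(sum >> 26);
}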