8000 added optimization for 64-bit shift and clip to 16-bits, nets 5-6% improvement by bitbank2 · Pull Request #9 · adafruit/Adafruit_MP3 · GitHub


Merged
merged 4 commits on Dec 17, 2019

Conversation

bitbank2 (Contributor)

Most of the decode time is spent in the PolyphaseStereo() and PolyphaseMono() functions doing 64-bit integer math. The SIMD instructions of the Cortex-M4 take care of most of that, but the 64-bit shift right followed by a clip to 16 bits had room for improvement. I added an inline asm function to shave off a few cycles.

@tannewt tannewt requested a review from jepler December 13, 2019 18:22
ladyada (Member) commented Dec 15, 2019

@bitbank2 hi, please rebase to get the Travis CI fixes :)

jepler (Contributor) left a comment

This makes sense, though I have a few comments.

Testing performed: built (but didn't run) circuitpython for circuitplayground express, no compile or link errors encountered.

@@ -142,7 +142,8 @@ void PolyphaseMono(short *pcm, int *vbuf, const int *coefBase)
MC0M(6)
MC0M(7)

*(pcm + 0) = ClipToShort((int)SAR64(sum1L, (32-CSHIFT)), DEF_NFRACBITS);
// *(pcm + 0) = ClipToShort((int)SAR64(sum1L, (32-CSHIFT)), DEF_NFRACBITS);
jepler (Contributor) commented Dec 15, 2019

Prefer to delete outdated code, rather than commenting it out.

src/assembly.h Outdated
//sign = x >> 31;
//if (sign != (x >> 15))
// x = sign ^ ((1 << 15) - 1);
__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x, int n)
Contributor

This block gets used for building circuitpython's mp3 player, so we're covered.

src/assembly.h Outdated
shrd eax, edx, cl
sar edx, cl
}
} else if (n < 64) {
Contributor

In the ARM versions you note that n < 32, but you cover all cases here. This could trip up someone who later tests on x86 and then deploys on ARM.

src/assembly.h Outdated
Comment on lines 397 to 399
__asm__ __volatile__( "lsl %2, %0, %3\n\t" // tmp <- xHi<<(32-n)
"lsr %1, %1, %4\n\t" // xLo <- xLo>>n
"orr %1, %2\n\t" // xLo <= xLo || tmp
Contributor

I checked what happens if I try to just code SAR64_clip in direct C code and compile it for ARM (gcc 7.mumble). This code:

short SAR64_clip(int64_t x, int n) {
    if(n >= 32) __builtin_unreachable();
    return (x >> n);
}

short SAR64_clip_26(int64_t x) {
    return SAR64_clip(x, 26);
}

gets me the sequence

00000040 <SAR64_clip_26>:
  40:	0e80      	lsrs	r0, r0, #26
  42:	ea40 1081 	orr.w	r0, r0, r1, lsl #6
  46:	b200      	sxth	r0, r0
  48:	4770      	bx	lr

Is this sequence (lsrs + orr.w-with-lsl) right too? Is it preferable? It looks like the SAR64_clip argument is always a compile-time constant. (I assume the sxth is an operation that can be omitted, but I'm not 100% sure.)

bitbank2 (Author)

The sign extension (sxth) is not needed and doesn't help if the value overflows, but since the value doesn't overflow, we don't have to worry about it. I wasn't sure if you would mind hard coding the shift right value of 26, but it does produce slightly faster code. I just pushed a new version with the simpler code and it's a tiny bit faster.

jepler (Contributor) commented Dec 15, 2019

In general, I think that the state of the art of compilers has advanced a lot since src/assembly.h was written, and it doesn't hurt to check whether these fancy wrappers are still needed. It feels like assuming gcc or a compiler with optimization parity with gcc is not that outlandish.

MULSHIFT32 and MADD64 get sensible results when just coded in C, __builtin_clz uses the ARM clz instruction directly, but __builtin_abs creates a branching form.

Using a Hacker's Delight C implementation for FASTABS gives just 2 instructions, though both are 32-bit encodings in Thumb mode:

int FASTABS1(int x) {
    int y = (x >> 31);
    return (x ^ y) - y;
}

gives

ea80 73e0 	eor.w	r3, r0, r0, asr #31
eba3 70e0 	sub.w	r0, r3, r0, asr #31

Comment on lines +391 to +394
__asm__ __volatile__(
"lsr %1, %1, #26\n\t" // xLo <- xLo>>n
"orr %1, %1, %0, lsl #6\n\t" // xLo <= xLo || (xHi << 6)
: "+&r" (xHi), "+r" (xLo) );
Contributor

If written like this, the code will automatically adapt if n is constant and use the more efficient sequence:

__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x, int n) {
    unsigned int xLo = (unsigned int) x;
    int xHi = (int) (x >> 32);
    if(__builtin_constant_p(n)) {
        __asm__ __volatile__(
            "lsr %1, %1, %2\n\t"         // xLo <- xLo>>(32-n)
            "orr %1, %1, %0, lsl %3\n\t" // xLo <= xLo || (xHi << n)
            : "+&r" (xHi), "+r" (xLo)
            : "M" (n), "M" (32-n) );
    } else {
        int nComp = 32-n; 
        int tmp;
        __asm__ __volatile__(
            "lsl %2, %0, %3\n\t"         // tmp <- xHi<<(32-n)
            "lsr %1, %1, %4\n\t"         // xLo <- xLo>>n
            "orr %1, %2\n\t"             // xLo <= xLo || tmp
            : "+&r" (xHi), "+r" (xLo), "=&r" (tmp)
            : "r" (nComp), "r" (n) );
    }
    return( (short)xLo );
}

Otherwise, it might be good to document it (e.g., by putting it in the function name?) that this is a constant shift of 26 bits.

Is there a missing implementation (or two?) of this function for the other CPU and compiler choices?

bitbank2 (Author)

Good idea about the constant vs variable shift amount. I've written a lot of assembly language over the years, but avoided inline assembly because it would get mangled by the compiler into something different. External asm files were safe from compiler changes.

I wrote the x86 asm version of the function, but perhaps I missed something. If you're happy, then please do the merge.

jepler (Contributor) commented Dec 17, 2019

In the version that exists now, it looks like you have incompatible SAR64_clip between x86 and arm.

+static __inline short SAR64_Clip(Word64 x, int n)

and

+__attribute__((__always_inline__)) static __inline short SAR64_Clip(Word64 x)

I don't know how to build the x86 version, but this seems fishy to me. Please let me know what's up.

jepler (Contributor) left a comment

Thank you! Some additional discussion occurred on Discord and we agreed on the points of contention, which led to one last minor change.

Additionally, I tested and mp3s still play back on my pygamer with these changes. Audio sounds identical.
