8000 py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. by agatti · Pull Request #16731 · micropython/micropython · GitHub
[go: up one dir, main page]

Skip to content

py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. #16731

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 29, 2025

Conversation

agatti
Copy link
Contributor
@agatti agatti commented Feb 10, 2025

Summary

This PR expands the Xtensa inline assembler to support most if not all opcodes available on the ESP8266 and LX3/LX106 Xtensa cores.

This is meant as a stepping stone to add inline assembler support for the ESP32 and its LX6 core, along to windowed-specific opcodes and additional opcodes that are present only on the LX7 core (ESP32-S3 and later). Rather than having a single monolithic PR with all sorts of assembler changes I chose to split this into ESP8266-only support first and then use this to add ESP32 support (see #16594).

New opcodes being added are covered by tests, and the provided tests were expanded to also include opcodes available in the existing implementation. Given that the ESP8266 space requirements are tighter than ESP32's, certain opcodes that won't be co 8000 mmonly used are put behind a define (MICROPY_EMIT_INLINE_XTENSA_UNCOMMON_OPCODES, disabled by default) to save some space in the general use case.

Testing

The new inline assembler additions were tested on an ESP8266 with ./run-tests.py -t <device> -d inlineasm/xtensa.

Trade-offs and Alternatives

This increases the footprint of the ESP8266 firmware, as it has inline assembler enabled by default. I've tried to keep the size increase within reason but there's only so much you can do after all.

Also, some extmod tests would be emitted incorrectly when ran through mpy-cross to generate Xtensa/Xtensawin native code (some jumps are too far away to be encoded with a simple opcode) but now they fail at compile time rather than at runtime. This will be fixed in a later PR.

Copy link
codecov bot commented Feb 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (5cfafb7) to head (f5d10c3).
Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #16731   +/-   ##
=======================================
  Coverage   98.54%   98.54%           
=======================================
  Files         169      169           
  Lines       21898    21898           
=======================================
  Hits        21579    21579           
  Misses        319      319           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Feb 12, 2025
@rkompass
Copy link

Hello @agatti,
I started to try out this PR.

You noted in the tests:

# SLLI shift argument is actually (32 - shift)

I suggest the following correction for this in emitinlinextensa.c (line 472ff):

        } else if (op == MP_QSTR_slli) {
            mp_uint_t r0 = get_arg_reg(emit, op_str, pn_args[0]);
            mp_uint_t r1 = get_arg_reg(emit, op_str, pn_args[1]);
            mp_uint_t bits = 32 - get_arg_i(emit, op_str, pn_args[2], 1, 31);  // correct role of i
            asm_xtensa_op24(&emit->as, ASM_XTENSA_ENCODE_RRR(0, 1, 0 | ((bits >> 4) & 0x01), r0, r1, bits & 0x0F));

@agatti
Copy link
Contributor Author
agatti commented Feb 14, 2025

Right, turns out it's also what GAS does when assembling the SLLI opcode. Thanks for pointing it out, I'll fix it right away.

@rkompass
Copy link

Thanks. Another one which is purely cosmetic:

@micropython.asm_xtensa
def lsr31(a2):
    srli(a2, a2, 15)

print(hex(lsr31(0x80000000)))

in the tests should be titled lsr15, because you do the shift by 15 bits, right?

@agatti
Copy link
Contributor Author
agatti 8000 commented Feb 14, 2025

Yep. SRLI only accepts from 1 to 15 as its shift parameter - good catch. I'll fix this after the current test run passes. Thanks!

@rkompass
Copy link
rkompass commented Feb 14, 2025

I had in my Preliminary asm_xtensa docs
noted that:

#   slli(Ax, Ay, imm4)  AX = Ay << imm4  Shift left imm.     data(3, 0x110000 | x<<12 | y<<8 | (16-imm4)<<4)
#      slli(Ax, Ay, imm5) with imm5 > 16 may be coded as     data(3, 0x010000 | x<<12 | y<<8 | (32-imm5)<<4)
#   srli(Ax, Ay, imm4)  AX = Ay >> imm4  L. sh. right imm.   data(3, 0x410000 | x<<12 | y<<4 | imm4<<8)
#   srai(Ax, Ay, imm4)  AX = Ay >> imm4  A. sh. right imm.   data(3, 0x210000 | x<<12 | y<<4 | imm4<<8)
#      srai(Ax, Ay, imm5) with imm5 > 15 may be coded as     data(3, 0x310000 | x<<12 | y<<4 | (imm5-8)<<8)

which includes that srli is only shiftable by 0..15 bits (i.e. imm4).
But the other two have a more complicated coding where the case i<16 has to be handled differently than i>=16.
This is not reflected in the PR yet.

Consequently now the first test in xtensa_asmshift.py fails and produces -0x80000000 instead of 0x246:

@micropython.asm_xtensa
def lsl1(a2):
    slli(a2, a2, 31)


print(hex(lsl1(0x123)), hex(0x123<<1))

I ask myself why below "All checks have passed". Wouldn't that give an error?

@agatti
Copy link
Contributor Author
agatti commented Feb 14, 2025

I've implemented SRAI SLLI according to https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/silicon-solutions/compute-ip/isa-summary.pdf pages 613 610 and 614 611:

        } else if (op == MP_QSTR_slli) {
            mp_uint_t r0 = get_arg_reg(emit, op_str, pn_args[0]);
            mp_uint_t r1 = get_arg_reg(emit, op_str, pn_args[1]);
            mp_uint_t bits = 32 - get_arg_i(emit, op_str, pn_args[2], 1, 31);
            asm_xtensa_op24(&emit->as, ASM_XTENSA_ENCODE_RRR(0, 1, 0 | ((bits >> 4) & 0x01), r0, r1, bits & 0x0F));

The case for shifts larger than 15 is handled, the third argument of ASM_XTENSA_ENCODE_RRR (2 | ((bits >> 4) & 0x01)) (0 | ((bits >> 4) & 0x01)) covers bits 20 to 23. How should that work instead?

(Sorry, I misread SLLI for SRAI, not enough coffee I guess)

@rkompass
Copy link

I added the snippet:

@micropython.asm_xtensa
def asr24(a2):
    srai(a2, a2, 24)

print(hex(asr1(0x12345678)), hex(0x12345678>>24))

to the test file xtensa_asmshift.py and it also reports an error with the current code:
0x91a2b3c is returned instead of 0x12.

So the results seem to show that there is something wrong yet. Can you confirm that?
I got my info from Xtensa® Instruction Set Architecture (ISA) Reference Manual
pages 525 (slli) and 527 (srai).

@agatti
Copy link
Contributor Author
agatti commented Feb 14, 2025

Both documents' instruction descriptions match indeed. Still, thanks for the snippet, I'll look into it later today.

@agatti
Copy link
Contributor Author
agatti commented Feb 14, 2025

(Just a quick test, I'll update this comment later.)

The opcode generated for srai(a2, a2, 24) is correct:

Disassembly of section .text:

00000000 <start>:
   0:   312820          srai    a2, a2, 24

And that opcode appears at 0x710a28d539f1:

./build/mpy-cross -X emit=native -march=xtensa test.py

[...]

XTENSA ASM:
0x710a28d539e0: 0600 0000 12c1 e009 01c9 11d9 21e9 31f9
0x710a28d539f0: 4120 2831 f841 e831 d821 c811 0801 12c1
0x710a28d53a00: 200d f000 0000 0000 0000 0000 0000 0000

@agatti
Copy link
Contributor Author
agatti commented Feb 14, 2025
@micropython.asm_xtensa
def asr24(a2):
    srai(a2, a2, 24)

print(hex(asr1(0x12345678)), hex(0x12345678>>24))

You're calling asr1(...) whilst the test is called asr24(...)... :)

@rkompass
Copy link

Both documents' instruction descriptions match indeed.

Yeah. And I also checked the code. It is correct now. And the tests pass. I had the mistake that you found in it.

It might be useful to add the test for asr for 24 bits anyway, because of the complications.

@micropython.asm_xtensa
def asr24(a2):
    srai(a2, a2, 24)

My translation of the encoding is correct but made things more complicated.

I replaced the data statements by the slli and srli in my CRC code and the tests for correctness pass.

@agatti
Copy link
Contributor Author
agatti commented Feb 16, 2025

It might be useful to add the test for asr for 24 bits anyway, because of the complications.

No worries, just pushed a new version with an additional ASR test.

@agatti agatti changed the title py/emitinlinextensa: Add all supported LX6 opcodes to the inline assembler. py/emitinlinextensa: Add all supported ESP8266 LX6 opcodes to the inline assembler. Feb 17, 2025
@agatti
Copy link
Contributor Author
agatti commented Feb 17, 2025

I left over some code that was used by opcodes that don't work on the 8266, now removed. Sorry!

@dpgeorge
Copy link
Member

Thanks for this, it's a very nice enhancement.

I tested it on an esp8266 and the new tests all pass.

The code size increase for this PR is +2200 bytes on ESP8266_GENERIC.

@agatti
Copy link
Contributor Author
agatti commented Feb 27, 2025

The code size increase for this PR is +2200 bytes on ESP8266_GENERIC.

2.2KiB... I thought it took less space. I wonder if using a similar approach as what the RV32 assembler does but with one table per argument count would reduce the footprint - even by just a bit, as there's a lot of small duplicated code fragments that can be streamlined into fewer instances. The compiler used for the 8266 port is not that recent so maybe on ESP32 this is less of an issue, but the ESP32 has more space available. Upgrading the 8266 port to use a more recent compiler isn't something I'm looking forward to :)

Given that the plan is to eventually submit a PR to make this feature available on ESP32 as well, would it make sense to add a PR to add a QEMU Xtensa port in order to be able to put the inline assembler (and natmods too) under CI? Regular QEMU seems to support a few Xtensa targets but I need to see which instruction sets they support, though.

@dpgeorge
Copy link
Member
dpgeorge commented Mar 3, 2025

2.2KiB... I thought it took less space.

This is not a show stopper for this PR. If we are short on space in the future, we can look into optimisations then.

Given that the plan is to eventually submit a PR to make this feature available on ESP32 as well, would it make sense to add a PR to add a QEMU Xtensa port in order to be able to put the inline assembler (and natmods too) under CI?

If that's not too much work/effort, then yes that would be a good idea.

@agatti agatti changed the title py/emitinlinextensa: Add all supported ESP8266 LX6 opcodes to the inline assembler. py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. Mar 3, 2025
@agatti agatti force-pushed the xtensa-inline branch 2 times, most recently from 525cb18 to 13a659d Compare March 4, 2025 13:52
@agatti
Copy link
Contributor Author
agatti commented Mar 4, 2025

I believe this should be in its final form. Changes since last review round:

  • J and JX opcodes have their own tests
  • NSA and NSAU opcodes have been added, each with its own test
  • Xtensa prologue will skip dealing with the constants table if it is empty - extremely unlikely for native code, but a common occurrence for inline assembler
  • BEQZ.N and BNEZ.N have been folded into BEQZ and BNEZ, respectively
  • L32R address checks also take alignment into account.

@rkompass If you want to start to work on documentation when you've got some time, that'd be great :)

Anyway, once I'm done with Picolibc stuff for natmods I'll start working on ESP32/ESP32S3 (LX6+LX7) opcodes. I'll skip floating point opcodes as I don't think any other inline assembler implementation supports them anyway.

@rkompass
Copy link
rkompass commented Mar 5, 2025

@agatti I will start in a few days, hopefully.

Copy link
Member
@dpgeorge dpgeorge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9E88

Apart from the comments above, this is looking very good now.

agatti added 4 commits May 29, 2025 12:12
This commit removes old raw printf calls happening inside certain branch
opcode emitters, indicating the target label is out of range for the
opcode.  They have been replaced with a RuntimeError being raised in
these cases, using a parameterised qstr instead.

Whilst this technically breaks runtime behaviour expectations, the
generated code would not have worked anyway so it's better to catch
those cases early.  This should be updated to always emit long jumps
unless jumps are backwards and short enough, following the other ports,
but that's something coming later.

This is actually needed because there are test files that do not work
when processed through mpy-cross and entirely converted to native code.
The original implementation would still generate mostly-valid code that
was bound to crash on the device, whilst this change would prevent
invalid code to even be emitted in the first place.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit fixes compilation errors occurring when enabling the Xtensa
code dumper inside mpy-cross.

The original code was meant to dump the code from an Xtensa device
itself, but for debugging the inline assembler this functionality was
also needed off-line.  The changes involve solving a signed/unsigned
mismatch that was not much of a problem for the 8266's gcc version but
made modern compilers complain, and using the printf formatter for
pointers when it comes to printing code addresses.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit expands the Xtensa inline assembler to support most if not
all opcodes available on the ESP8266 and LX3 Xtensa cores.

This is meant as a stepping stone to add inline assembler support for
the ESP32 and its LX6 core, along to windowed-specific opcodes and
additional opcodes that are present only on the LX7 core (ESP32-S3 and
later).

New opcodes being added are covered by tests, and the provided tests
were expanded to also include opcodes available in the existing
implementation.  Given that the ESP8266 space requirements are tighter
than ESP32's, certain opcodes that won't be commonly used have been put
behind a define to save some space in the general use case.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit simplifies native functions' prologue code by not emitting a
jump opcode that goes over the function's constants pool if the pool is
empty.

The original code assumed the constants pool is never empty as large
32-bits constants are commonly used, but for inline assembler functions
that may not be the case.  This meant that inline assembler functions
may start with an unneeded jump (along with its alignment byte), using
four bytes more than necessary.

This commit is limited to the "xtensa" target, as "xtensawin" doesn't
support inline assembler functions yet, so native functions' constant
pools are almost always guaranteed to hold one or more values.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
Copy link
Member
@dpgeorge dpgeorge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with ESP32_GENERIC and ESP8266_GENERIC firmware. Works well.

Thank you!

@dpgeorge dpgeorge merged commit f5d10c3 into micropython:master May 29, 2025
65 checks passed
@agatti agatti deleted the xtensa-inline branch May 29, 2025 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0