py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. #16731

agatti · 2025-02-10T17:24:03Z

Summary

This PR expands the Xtensa inline assembler to support most if not all opcodes available on the ESP8266 and LX3/LX106 Xtensa cores.

This is meant as a stepping stone to add inline assembler support for the ESP32 and its LX6 core, along to windowed-specific opcodes and additional opcodes that are present only on the LX7 core (ESP32-S3 and later). Rather than having a single monolithic PR with all sorts of assembler changes I chose to split this into ESP8266-only support first and then use this to add ESP32 support (see #16594).

New opcodes being added are covered by tests, and the provided tests were expanded to also include opcodes available in the existing implementation. 8000 Given that the ESP8266 space requirements are tighter than ESP32's, certain opcodes that won't be commonly used are put behind a define (MICROPY_EMIT_INLINE_XTENSA_UNCOMMON_OPCODES, disabled by default) to save some space in the general use case.

Testing

The new inline assembler additions were tested on an ESP8266 with ./run-tests.py -t <device> -d inlineasm/xtensa.

Trade-offs and Alternatives

This increases the footprint of the ESP8266 firmware, as it has inline assembler enabled by default. I've tried to keep the size increase within reason but there's only so much you can do after all.

Also, some extmod tests would be emitted incorrectly when ran through mpy-cross to generate Xtensa/Xtensawin native code (some jumps are too far away to be encoded with a simple opcode) but now they fail at compile time rather than at runtime. This will be fixed in a later PR.

codecov · 2025-02-10T17:27:41Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (5cfafb7) to head (f5d10c3).
Report is 4 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #16731   +/-   ##
=======================================
  Coverage   98.54%   98.54%           
=======================================
  Files         169      169           
  Lines       21898    21898           
=======================================
  Hits        21579    21579           
  Misses        319      319

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2025-02-10T17:33:31Z

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

rkompass · 2025-02-14T09:43:30Z

Hello @agatti,
I started to try out this PR.

You noted in the tests:

# SLLI shift argument is actually (32 - shift)

I suggest the following correction for this in emitinlinextensa.c (line 472ff):

        } else if (op == MP_QSTR_slli) {
            mp_uint_t r0 = get_arg_reg(emit, op_str, pn_args[0]);
            mp_uint_t r1 = get_arg_reg(emit, op_str, pn_args[1]);
            mp_uint_t bits = 32 - get_arg_i(emit, op_str, pn_args[2], 1, 31);  // correct role of i
            asm_xtensa_op24(&emit->as, ASM_XTENSA_ENCODE_RRR(0, 1, 0 | ((bits >> 4) & 0x01), r0, r1, bits & 0x0F));

agatti · 2025-02-14T10:48:36Z

Right, turns out it's also what GAS does when assembling the SLLI opcode. Thanks for pointing it out, I'll fix it right away.

rkompass · 2025-02-14T11:12:11Z

Thanks. Another one which is purely cosmetic:

@micropython.asm_xtensa
def lsr31(a2):
    srli(a2, a2, 15)

print(hex(lsr31(0x80000000)))

in the tests should be titled lsr15, because you do the shift by 15 bits, right?

agatti · 2025-02-14T11:13:30Z

Yep. SRLI only accepts from 1 to 15 as its shift parameter - good catch. I'll fix this after the current test run passes. Thanks!

rkompass · 2025-02-14T12:35:53Z

I had in my Preliminary asm_xtensa docs
noted that:

#   slli(Ax, Ay, imm4)  AX = Ay << imm4  Shift left imm.     data(3, 0x110000 | x<<12 | y<<8 | (16-imm4)<<4)
#      slli(Ax, Ay, imm5) with imm5 > 16 may be coded as     data(3, 0x010000 | x<<12 | y<<8 | (32-imm5)<<4)
#   srli(Ax, Ay, imm4)  AX = Ay >> imm4  L. sh. right imm.   data(3, 0x410000 | x<<12 | y<<4 | imm4<<8)
#   srai(Ax, Ay, imm4)  AX = Ay >> imm4  A. sh. right imm.   data(3, 0x210000 | x<<12 | y<<4 | imm4<<8)
#      srai(Ax, Ay, imm5) with imm5 > 15 may be coded as     data(3, 0x310000 | x<<12 | y<<4 | (imm5-8)<<8)

which includes that srli is only shiftable by 0..15 bits (i.e. imm4).
But the other two have a more complicated coding where the case i<16 has to be handled differently than i>=16.
This is not reflected in the PR yet.

Consequently now the first test in xtensa_asmshift.py fails and produces -0x80000000 instead of 0x246:

@micropython.asm_xtensa
def lsl1(a2):
    slli(a2, a2, 31)


print(hex(lsl1(0x123)), hex(0x123<<1))

I ask myself why below "All checks have passed". Wouldn't that give an error?

agatti · 2025-02-14T12:42:09Z

I've implemented ~~SRAI~~ SLLI according to https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/silicon-solutions/compute-ip/isa-summary.pdf pages ~~613~~ 610 and ~~614~~ 611:

        } else if (op == MP_QSTR_slli) {
            mp_uint_t r0 = get_arg_reg(emit, op_str, pn_args[0]);
            mp_uint_t r1 = get_arg_reg(emit, op_str, pn_args[1]);
            mp_uint_t bits = 32 - get_arg_i(emit, op_str, pn_args[2], 1, 31);
            asm_xtensa_op24(&emit->as, ASM_XTENSA_ENCODE_RRR(0, 1, 0 | ((bits >> 4) & 0x01), r0, r1, bits & 0x0F));

The case for shifts larger than 15 is handled, the third argument of ASM_XTENSA_ENCODE_RRR ~~(2 | ((bits >> 4) & 0x01))~~ (0 | ((bits >> 4) & 0x01)) covers bits 20 to 23. How should that work instead?

(Sorry, I misread SLLI for SRAI, not enough coffee I guess)

rkompass · 2025-02-14T12:51:29Z

I added the snippet:

@micropython.asm_xtensa
def asr24(a2):
    srai(a2, a2, 24)

print(hex(asr1(0x12345678)), hex(0x12345678>>24))

to the test file xtensa_asmshift.py and it also reports an error with the current code:
0x91a2b3c is returned instead of 0x12.

So the results seem to show that there is something wrong yet. Can you confirm that?
I got my info from Xtensa® Instruction Set Architecture (ISA) Reference Manual
pages 525 (slli) and 527 (srai).

agatti · 2025-02-14T12:57:30Z

Both documents' instruction descriptions match indeed. Still, thanks for the snippet, I'll look into it later today.

agatti · 2025-02-14T13:02:29Z

(Just a quick test, I'll update this comment later.)

The opcode generated for srai(a2, a2, 24) is correct:

Disassembly of section .text:

00000000 <start>:
   0:   312820          srai    a2, a2, 24

And that opcode appears at 0x710a28d539f1:

./build/mpy-cross -X emit=native -march=xtensa test.py

[...]

XTENSA ASM:
0x710a28d539e0: 0600 0000 12c1 e009 01c9 11d9 21e9 31f9
0x710a28d539f0: 4120 2831 f841 e831 d821 c811 0801 12c1
0x710a28d53a00: 200d f000 0000 0000 0000 0000 0000 0000

agatti · 2025-02-14T13:07:47Z

agatti commented

Feb 14, 2025

@micropython.asm_xtensa
def asr24(a2):
    srai(a2, a2, 24)

print(hex(asr1(0x12345678)), hex(0x12345678>>24))

You're calling asr1(...) whilst the test is called asr24(...)... :)

rkompass · 2025-02-14T13:19:31Z

Both documents' instruction descriptions match indeed.

Yeah. And I also checked the code. It is correct now. And the tests pass. I had the mistake that you found in it.

It might be useful to add the test for asr for 24 bits anyway, because of the complications.

@micropython.asm_xtensa
def asr24(a2):
    srai(a2, a2, 24)

My translation of the encoding is correct but made things more complicated.

I replaced the data statements by the slli and srli in my CRC code and the tests for correctness pass.

agatti · 2025-02-16T13:59:59Z

It might be useful to add the test for asr for 24 bits anyway, because of the complications.

No worries, just pushed a new version with an additional ASR test.

agatti · 2025-02-17T22:30:52Z

I left over some code that was used by opcodes that don't work on the 8266, now removed. Sorry!

dpgeorge · 2025-02-25T13:12:21Z

Thanks for this, it's a very nice enhancement.

I tested it on an esp8266 and the new tests all pass.

The code size increase for this PR is +2200 bytes on ESP8266_GENERIC.

py/asmxtensa.c

py/emitinlinextensa.c

agatti · 2025-02-27T18:19:12Z

The code size increase for this PR is +2200 bytes on ESP8266_GENERIC.

2.2KiB... I thought it took less space. I wonder if using a similar approach as what the RV32 assembler does but with one table per argument count would reduce the footprint - even by just a bit, as there's a lot of small duplicated code fragments that can be streamlined into fewer instances. The compiler used for the 8266 port is not that recent so maybe on ESP32 this is less of an issue, but the ESP32 has more space available. Upgrading the 8266 port to use a more recent compiler isn't something I'm looking forward to :)

Given that the plan is to eventually submit a PR to make this feature available on ESP32 as well, would it make sense to add a PR to add a QEMU Xtensa port in order to be able to put the inline assembler (and natmods too) under CI? Regular QEMU seems to support a few Xtensa targets but I need to see which instruction sets they support, though.

dpgeorge · 2025-03-03T01:39:05Z

2.2KiB... I thought it took less space.

This is not a show stopper for this PR. If we are short on space in the future, we can look into optimisations then.

Given that the plan is to eventually submit a PR to make this feature available on ESP32 as well, would it make sense to add a PR to add a QEMU Xtensa port in order to be able to put the inline assembler (and natmods too) under CI?

If that's not too much work/effort, then yes that would be a good idea.

agatti · 2025-03-04T14:02:59Z

I believe this should be in its final form. Changes since last review round:

J and JX opcodes have their own tests
NSA and NSAU opcodes have been added, each with its own test
Xtensa prologue will skip dealing with the constants table if it is empty - extremely unlikely for native code, but a common occurrence for inline assembler
BEQZ.N and BNEZ.N have been folded into BEQZ and BNEZ, respectively
L32R address checks also take alignment into account.

@rkompass If you want to start to work on documentation when you've got some time, that'd be great :)

Anyway, once I'm done with Picolibc stuff for natmods I'll start working on ESP32/ESP32S3 (LX6+LX7) opcodes. I'll skip floating point opcodes as I don't think any other inline assembler implementation supports them anyway.

rkompass · 2025-03-05T18:14:50Z

@agatti I will start in a few days, hopefully.

dpgeorge

Apart from the comments above, this is looking very good now.

py/asmxtensa.c

tests/inlineasm/xtensa/asmloadstore.py

py/asmxtensa.c

This commit removes old raw printf calls happening inside certain branch opcode emitters, indicating the target label is out of range for the opcode. They have been replaced with a RuntimeError being raised in these cases, using a parameterised qstr instead. Whilst this technically breaks runtime behaviour expectations, the generated code would not have worked anyway so it's better to catch those cases early. This should be updated to always emit long jumps unless jumps are backwards and short enough, following the other ports, but that's something coming later. This is actually needed because there are test files that do not work when processed through mpy-cross and entirely converted to native code. The original implementation would still generate mostly-valid code that was bound to crash on the device, whilst this change would prevent invalid code to even be emitted in the first place. Signed-off-by: Alessandro Gatti <a.gatti@frob.it>

This commit fixes compilation errors occurring when enabling the Xtensa code dumper inside mpy-cross. The original code was meant to dump the code from an Xtensa device itself, but for debugging the inline assembler this functionality was also needed off-line. The changes involve solving a signed/unsigned mismatch that was not much of a problem for the 8266's gcc version but made modern compilers complain, and using the printf formatter for pointers when it comes to printing code addresses. Signed-off-by: Alessandro Gatti <a.gatti@frob.it>

This commit expands the Xtensa inline assembler to support most if not all opcodes available on the ESP8266 and LX3 Xtensa cores. This is meant as a stepping stone to add inline assembler support for the ESP32 and its LX6 core, along to windowed-specific opcodes and additional opcodes that are present only on the LX7 core (ESP32-S3 and later). New opcodes being added are covered by tests, and the provided tests were expanded to also include opcodes available in the existing implementation. Given that the ESP8266 space requirements are tighter than ESP32's, certain opcodes that won't be commonly used have been put behind a define to save some space in the general use case. Signed-off-by: Alessandro Gatti <a.gatti@frob.it>

This commit simplifies native functions' prologue code by not emitting a jump opcode that goes over the function's constants pool if the pool is empty. The original code assumed the constants pool is never empty as large 32-bits constants are commonly used, but for inline assembler functions that may not be the case. This meant that inline assembler functions may start with an unneeded jump (along with its alignment byte), using four bytes more than necessary. This commit is limited to the "xtensa" target, as "xtensawin" doesn't support inline assembler functions yet, so native functions' constant pools are almost always guaranteed to hold one or more values. Signed-off-by: Alessandro Gatti <a.gatti@frob.it>

dpgeorge

Tested with ESP32_GENERIC and ESP8266_GENERIC firmware. Works well.

Thank you!

agatti mentioned this pull request Feb 11, 2025

py/emitinlinextensa: Simplify register name lookup. #16736

Merged

dpgeorge added the py-core Relates to py/ directory in source label Feb 12, 2025

agatti force-pushed the xtensa-inline branch from cf2e027 to ad02cf3 Compare February 14, 2025 10:55

agatti force-pushed the xtensa-inline branch from ad02cf3 to 1a49084 Compare February 14, 2025 11:16

agatti force-pushed the xtensa-inline branch from 1a49084 to 9b6ed4c Compare February 16, 2025 13:56

agatti force-pushed the xtensa-inline branch from 9b6ed4c to a8274c8 Compare February 17, 2025 22:29

agatti changed the title ~~py/emitinlinextensa: Add all supported LX6 opcodes to the inline assembler.~~ py/emitinlinextensa: Add all supported ESP8266 LX6 opcodes to the inline assembler. Feb 17, 2025

dpgeorge reviewed Feb 25, 2025

View reviewed changes

py/asmxtensa.c Outdated Show resolved Hide resolved

py/asmxtensa.c Show resolved Hide resolved

py/emitinlinextensa.c Show resolved Hide resolved

py/emitinlinextensa.c Outdated Show resolved Hide resolved

agatti changed the title ~~py/emitinlinextensa: Add all supported ESP8266 LX6 opcodes to the inline assembler.~~ py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. Mar 3, 2025

agatti force-pushed the xtensa-inline branch 2 times, most recently from 525cb18 to 13a659d Compare 67E6 March 4, 2025 13:52

dpgeorge reviewed May 13, 2025

View reviewed changes

py/asmxtensa.c Outdated Show resolved Hide resolved

tests/inlineasm/xtensa/asmloadstore.py Show resolved Hide resolved

agatti force-pushed the xtensa-inline branch from 13a659d to 9c3d94f Compare May 15, 2025 07:28

dpgeorge reviewed May 15, 2025

View reviewed changes

py/asmxtensa.c Outdated Show resolved Hide resolved

agatti force-pushed the xtensa-inline branch from 9c3d94f to 947afb7 Compare May 23, 2025 12:08

agatti added 4 commits May 29, 2025 12:12

dpgeorge approved these changes May 29, 2025

View reviewed changes

dpgeorge force-pushed the xtensa-inline branch from 947afb7 to f5d10c3 Compare May 29, 2025 02:51

dpgeorge merged commit f5d10c3 into micropython:master May 29, 2025
65 checks passed

agatti deleted the xtensa-inline branch May 29, 2025 04:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. #16731

py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. #16731

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. #16731

py/emitinlinextensa: Add all supported LX3/LX106 opcodes to the inline assembler. #16731

Uh oh!

Conversation

Uh oh!

Summary

Testing

Trade-offs and Alternatives

Uh oh!

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!