Improve documentation & handling of spurious watchdog reset after power up in ESP32 revision 0 devices, often perceived as WiFi failure on every other power on. · Issue #4863 · espressif/arduino-esp32 · GitHub
Closed
@gsuberland

Description

Espressif published an ECO documenting errata in the ESP32 revision 0 silicon. The first issue, 3.1, is "A spurious watchdog reset occurs when ESP32 is powered up or wakes up from Deep-sleep". This issue has been fixed in a newer silicon release, but a large proportion of the ESP32 development boards on the market suffer from this issue. I have a number of ESP-WROOM-32 modules that are affected.

ESP-IDF v1.0 and later apparently has a workaround for the deep sleep side of things (although judging by #796 a complete fix may not be integrated here yet?). On the first boot after power-on (e.g. after programming), however, affected ESP32 devices initially appear to (mostly) work, then reset after a short time due to the spurious WDT reset. A side effect is that some of the built-in peripherals fail to behave properly during this first boot while serial output still works, which leads users to misidentify the root cause when they run into the problem.

I believe that a reasonable portion of users reportedly experiencing issue #2501 are actually running into this silicon bug. Most users' code attempts to call WiFi.begin() in setup. During first boot after power on, access to the WiFi peripheral is flaky. The begin() call can hang, or it can appear to succeed but the status will never reach the connected state. After a while the spurious WDT reset will kick in and the device will boot as normal. At this point the device will function correctly. This can produce the impression that the WiFi works every other time. Users who write something like WiFi.begin(...); /* ... */ WiFi.begin(...); may be tricked into thinking that this works because they see two sets of output, which they attribute to their code, when in reality the board is hanging on the first power on and running normally after the WDT reset. WiFi almost certainly has nothing to do with it; it just happens to be something that makes setup take long enough to coincide with the spurious WDT reset.

Put simply, the behaviour of this silicon bug is consistent with user issues that follow this pattern:

  • Device is powered on, reset via the button, or flashed
  • Serial output comes through
  • An attempt is made to use some functionality (e.g. WiFi)
  • That attempt hangs or otherwise misbehaves
  • The device resets roughly 10 seconds after boot
  • Things start working again on the second run

Users with debug output enabled at the same baud rate as the Serial peripheral will see the WDT reset information come through at the point of rebooting. Users with debug output enabled, but with the baud rate configured to a different value, will likely see junk on the serial output just before the device starts working again. Users with debug output disabled will just see the odd behaviour, the hang, and normal behaviour afterwards.

On affected ESP32 devices this issue reliably occurs on the first power on and after flashing, because both of these events trigger a full power on. The reset button will also cause this behaviour because pulling the reset pin to ground causes a POWERON_RESET event.
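To check which kind of reset a board has just gone through, it may help to print the reset reason at the very top of setup(). The following is a minimal sketch, not code from the core: reset_reason_name() and check_reset_reason_names() are hypothetical helpers, the numeric values are assumed to mirror the RESET_REASON enum in rom/rtc.h (only a few entries are listed), and the hardware-specific part is only compiled in an Arduino build.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: maps a raw reset reason code to a readable name.
// The values are assumed to match the RESET_REASON enum in rom/rtc.h;
// only a handful of entries are listed, everything else maps to "OTHER".
inline const char *reset_reason_name(int reason) {
  switch (reason) {
    case 1:  return "POWERON_RESET";     // power-on reset
    case 5:  return "DEEPSLEEP_RESET";   // wake from deep sleep
    case 12: return "SW_CPU_RESET";      // software reset of the CPU
    case 16: return "RTCWDT_RTC_RESET";  // RTC watchdog reset
    default: return "OTHER";
  }
}

// Host-side sanity check of the mapping (no hardware needed).
inline void check_reset_reason_names() {
  assert(std::strcmp(reset_reason_name(1), "POWERON_RESET") == 0);
  assert(std::strcmp(reset_reason_name(16), "RTCWDT_RTC_RESET") == 0);
  assert(std::strcmp(reset_reason_name(2), "OTHER") == 0);
}

#ifdef ARDUINO
#include <Arduino.h>
#include <rom/rtc.h>

void setup() {
  Serial.begin(115200);
  // Print the reason for the reset that led to this boot (core 0).
  Serial.printf("CPU0 reset reason: %s\n",
                reset_reason_name(rtc_get_reset_reason(0)));
  /* ... */
}

void loop() {}
#endif
```

If the pattern described above applies, the first boot after power-on should report POWERON_RESET and the second boot should report some other reason.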

A simple workaround for most users affected by this issue is as follows:

#include <rom/rtc.h>

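void setup() {
  // A power-on reset is the one followed by the spurious WDT reset,
  // so restart immediately instead of waiting for it.
  if (rtc_get_reset_reason(0) == POWERON_RESET) {
    ESP.restart();
  }
  /* ... */
}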

This detects the power-on reset event and automatically restarts the ESP device, skipping waiting for the spurious WDT reset.

The problem I see with integrating this fix directly into the arduino-esp32 core is that it breaks user code that reads the reset reason and expects to see a power-on reset. In most cases such user code would be broken by the spurious WDT reset anyway, since it would still see a different reset reason after the spurious WDT reset. Even so, I'm hesitant to suggest any solution that masks the POWERON_RESET completely, since there are edge cases (e.g. user code that initialises EEPROM contents when the reset reason is POWERON_RESET) that still work despite the spurious WDT issue but would be broken by a fix that hides the power-on reset from user code.

As far as I know there's no generic way to store state past the reset (short of using EEPROM, which is a terrible idea) to indicate that a "workaround reset" was performed, so there's also no way to detect that this just happened and fake the result of rtc_get_reset_reason accordingly.

Bits [2] and [11:9] of the EFUSE_BLK0_RDATA3_REG register store pkg_version, which may be of use in detecting revision 0 devices. See pages 518-520 of the ESP32 technical reference manual for details. The package version can be detected as follows:

#include <soc/soc.h>  // REG_READ

#ifndef EFUSE_BLK0_RDATA3_REG
#define EFUSE_BLK0_RDATA3_REG 0x3FF5A00C
#endif
// pkg_version[2:0] = EFUSE_BLK0_RDATA3_REG[11:9]
#define PKG_VERSION_REG_LOW_BITMASK (0b111<<9)
#define PKG_VERSION_REG_LOW_SHIFT 9
// pkg_version[3] = EFUSE_BLK0_RDATA3_REG[2], so bit 2 moves up one place (left shift)
#define PKG_VERSION_REG_HIGH_BITMASK (1<<2)
#define PKG_VERSION_REG_HIGH_SHIFT 1

uint8_t get_esp32_package_revision()
{
  uint32_t blk0_rdata3 = REG_READ(EFUSE_BLK0_RDATA3_REG);
  uint8_t pkg_version =
    ((blk0_rdata3 & PKG_VERSION_REG_LOW_BITMASK) >> PKG_VERSION_REG_LOW_SHIFT) |
    ((blk0_rdata3 & PKG_VERSION_REG_HIGH_BITMASK) << PKG_VERSION_REG_HIGH_SHIFT);
  return pkg_version;
}

I have tested this on my affected devices and it does indeed return 0; further testing would be required on newer boards to ensure that this is functioning as expected.
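Until the function has been run on other hardware, the bit manipulation itself can at least be checked off-device. The sketch below is a hedged illustration: pkg_version_from_rdata3() and check_pkg_version_examples() are hypothetical names, the function applies the bit layout described above to a raw register value passed in as an argument, and the input values are made up for the check rather than read from real chips.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pure helper: assembles the 4-bit pkg_version from a raw
// EFUSE_BLK0_RDATA3_REG value, so it can be checked without hardware.
constexpr uint8_t pkg_version_from_rdata3(uint32_t rdata3) {
  return static_cast<uint8_t>(
      ((rdata3 >> 9) & 0x7u) |         // pkg_version[2:0] = RDATA3[11:9]
      (((rdata3 >> 2) & 0x1u) << 3));  // pkg_version[3]   = RDATA3[2]
}

// Worked examples using made-up register values (not read from real chips).
inline void check_pkg_version_examples() {
  assert(pkg_version_from_rdata3(0x00000000u) == 0);   // all fields clear
  assert(pkg_version_from_rdata3(0x00000200u) == 1);   // only bit 9 set
  assert(pkg_version_from_rdata3(0x00000E00u) == 7);   // bits 11:9 set
  assert(pkg_version_from_rdata3(0x00000004u) == 8);   // only bit 2 set
  assert(pkg_version_from_rdata3(0x00000A04u) == 13);  // bits 11, 9 and 2 set
  assert(pkg_version_from_rdata3(0xFFFFF1FBu) == 0);   // every unrelated bit set
}
```

The last case is the useful one: it confirms that bits outside [11:9] and [2] do not leak into the result.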

Being able to detect revision 0 devices is at least a step in the right direction. At minimum there should be a debug message at boot warning users of this issue, perhaps with a link to a page describing the problem and some fixes. I am willing to help write this documentation.

In summary, I believe the following steps would be helpful going forward:

  • Validate that the get_esp32_package_revision() code above behaves as expected on revision 0 and revision 1 devices (and others if they exist) in the hopes of detecting the likelihood of this issue at runtime.
  • Validate that the reported package version value holds a strong correlation with the behaviour described in the ESP32 errata.
  • Discuss potential strategies for automatic mitigation of the issue, in the hopes of coming up with something that doesn't break user code on either revision 0 or 1 devices.
  • Discuss alternative compile-time mitigation strategies (e.g. a workaround enabled by a preprocessor directive, perhaps set at the board level for known-affected development boards) if a sufficiently safe automatic mitigation cannot be devised.
  • Determine which open issues on GitHub are likely to be partially or wholly due to this silicon bug, and communicate the results of the aforementioned discussions in reply to those issues.
  • Add debug output at boot that warns users when they are running hardware that is known to be affected by this bug. Include sufficient information (ideally both a unique search term and URL to documentation) for a user to be able to find out more about the issue and work around it.
  • Write up documentation that covers the specifics of the issue, available workarounds and their side-effects/caveats, and a summary of the decisions that were made regarding a fix.
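As a starting point for the discussion, the compile-time mitigation and the boot warning from the list above could be combined roughly as follows. This is only a sketch under stated assumptions: ESP32_REV0_WDT_WORKAROUND, should_force_restart() and kPowerOnReset are hypothetical names, the value 1 is assumed to mirror POWERON_RESET in rom/rtc.h, and the revision is taken from ESP.getChipRevision() in the Arduino core rather than from the package version. Whether either value reliably tracks the erratum is one of the open validation items.

```cpp
#include <cassert>

// Assumed to mirror POWERON_RESET in rom/rtc.h.
constexpr int kPowerOnReset = 1;

// Hypothetical opt-in switch; a board definition or build flag could set it to 1.
#ifndef ESP32_REV0_WDT_WORKAROUND
#define ESP32_REV0_WDT_WORKAROUND 0
#endif

// Hypothetical decision helper: restart early only when the workaround is
// enabled, the boot follows a power-on reset, and the silicon is revision 0.
constexpr bool should_force_restart(bool workaround_enabled,
                                    int reset_reason,
                                    int silicon_revision) {
  return workaround_enabled &&
         reset_reason == kPowerOnReset &&
         silicon_revision == 0;
}

#ifdef ARDUINO
#include <Arduino.h>
#include <rom/rtc.h>

void setup() {
  Serial.begin(115200);
  const int reason = rtc_get_reset_reason(0);
  const int revision = ESP.getChipRevision();
  if (revision == 0) {
    // Warn regardless of whether the workaround is compiled in.
    Serial.println("Warning: revision 0 silicon; a spurious watchdog reset may follow power-on.");
  }
  if (should_force_restart(ESP32_REV0_WDT_WORKAROUND != 0, reason, revision)) {
    ESP.restart();
  }
  /* ... */
}

void loop() {}
#endif
```

Keeping the switch off by default means existing sketches that inspect POWERON_RESET see no change unless a board or user opts in.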

As noted before, I'm happy to help with as much of this as I reasonably can. I would also appreciate as much input as possible from more seasoned ESP32 developers and project maintainers, since this issue clearly affects a large number of users and I am hesitant to make potentially breaking changes in the code.
