Add Bootstrapping Logic to `Application.run_*` by Bibo-Joshi · Pull Request #4673 · python-telegram-bot/python-telegram-bot · GitHub

Merged
merged 6 commits into master from feature/app-bootstrapping on Mar 1, 2025

Conversation

@Bibo-Joshi Bibo-Joshi commented Feb 7, 2025

When ready, closes #4657.

This also changes the default value for bootstrap_retries for start/run_polling to 0. Previously the logic was roughly

async def run_main(bootstrap_retries="indefinite"):
    delete_webhook_and_drop_pending_updates(retries=bootstrap_retries)
    poll_for_updates(retries="indefinite")

I now changed this to

async def run_main(bootstrap_retries=None):
    delete_webhook_and_drop_pending_updates(retries=bootstrap_retries)
    poll_for_updates(retries="indefinite")

I think this is saner, and I'm not even sure if indefinite retries during the bootstrapping phase were ever actually intended …
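As an aside, the integer convention behind a parameter like `bootstrap_retries` can be made concrete with a small sketch. This is a hypothetical helper for illustration only (not part of the library); it assumes the convention visible in the retry loop below, where a negative value means indefinite retries:

```python
def describe_bootstrap_retries(bootstrap_retries: int) -> str:
    """Illustrative only: how an integer retry parameter is interpreted
    (negative = retry indefinitely, 0 = no retries, n > 0 = up to n retries)."""
    if bootstrap_retries < 0:
        return "retry indefinitely"
    if bootstrap_retries == 0:
        return "no retries"
    return f"up to {bootstrap_retries} retries"


print(describe_bootstrap_retries(-1))  # retry indefinitely
print(describe_bootstrap_retries(0))   # no retries
print(describe_bootstrap_retries(3))   # up to 3 retries
```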

ToDo

  • Update existing tests to new logs
  • Add new tests for bootstrapping in Application

Also changes the default value for `bootstrap_retries` on polling
@Bibo-Joshi Bibo-Joshi added 🛠 refactor change type: refactor 🔌 enhancement pr description: enhancement labels Feb 7, 2025
Member
@Poolitzer Poolitzer left a comment

I'm just asking questions at this point

@@ -409,12 +414,14 @@ def default_error_callback(exc: TelegramError) -> None:
             # updates from Telegram and inserts them in the update queue of the
             # Application.
             self.__polling_task = asyncio.create_task(
-                self._network_loop_retry(
+                network_retry_loop(
+                    is_running=lambda: self.running,
Member

why is that a lambda?

Member Author

while effective_is_running():

this line expects a callable. If we simply pass is_running=self.running and edit the line above to

while is_running:

then we have an infinite loop, because setting self.running = False within Updater won't change the value seen by the network retry loop :) There are probably a billion other ways to do this (pass an object with a running attribute to check, pass a dict with a corresponding entry, pass an event …). This one seemed rather straightforward to me, but I'm open to changing it. Accepting an instance of

from typing import Protocol

class HasRunning(Protocol):
    running: bool

would sound like the next best thing to me at first glance
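The late-binding behavior being discussed can be demonstrated in isolation. This is a toy class, not the real Updater: a bare attribute read snapshots the value once, while a lambda re-reads it on every check.

```python
class ToyUpdater:
    """Toy stand-in for Updater, just holding a running flag."""

    def __init__(self) -> None:
        self.running = True


updater = ToyUpdater()

# Passing the bare value freezes it at the moment it is read:
is_running_value = updater.running
# Passing a callable defers the read until each loop iteration:
is_running_callable = lambda: updater.running

updater.running = False  # e.g. what happens when the updater is stopped

print(is_running_value)       # True  -> a `while is_running:` loop would never exit
print(is_running_callable())  # False -> a `while is_running():` loop exits as expected
```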

@tsnoam
Member
tsnoam commented Feb 9, 2025

@Eldinnie
here's my vague memory for the original changes:

  1. infinite retries were really intended.
  2. the idea was to find some common ground (i.e. defaults) for most of the user base (we understood that not all can be pleased)
  3. there is a bootstrapping phase in which certain actions had to be made (are they still needed? I don't know... I haven't been following the bot API in the past years).
  4. what happens if these operations fail? do we resume with normal execution of the bot? do we abort (and let the user run the bot again)? or do we just retry until a success?
  5. we chose the option of indefinite retries (and afair, with appropriate logging). that way we make sure that the bot is properly started and will be started eventually. aborting seemed like too much, after all, the user will just need to restart the bot and come to the same "problematic" part of the initialization. so what's the point in that?

I hope that helps...

@tsnoam
Member
tsnoam commented Feb 9, 2025

and why did i tag Eldinnie instead of @Bibo-Joshi ?

it's my memory playing tricks on me. lol.

@Bibo-Joshi
Member Author

Thanks very much for your insights @tsnoam !

The bootstrapping phase is indeed still needed. It deletes the previous webhook if you run polling, sets a new webhook if you run in webhook mode, and drops pending updates if requested.

Do you recall why you chose indefinite retries over a limited number of retries (say, 3)? Even though we have logging, indefinite retries can get you into a situation where the process is running in a healthy way (eagerly retrying to set the webhook) but the bot doesn't react at all. That sounds undesirable to me. A limited number of retries OTOH would be a sane effort at getting things going while still preventing the process from getting stuck in an unusable state without explicit indication to the user.

Moreover I noticed that for webhook mode, the retries were kept at "none". I suspect that this was an oversight: The docstring was updated, but not the signature: https://github.com/python-telegram-bot/python-telegram-bot/pull/1018/files#r1948035215

@tsnoam
Member
tsnoam commented Feb 9, 2025

@Bibo-Joshi

Do you recall why you chose indefinite retries over a limited number of retries (say, 3)?

as stated in bullet 5:

we chose the option of indefinite retries (and afair, with appropriate logging). that way we make sure that the bot is properly started and will be started eventually. aborting seemed like too much, after all, the user will just need to restart the bot and come to the same "problematic" part of the initialization. so what's the point in that?

A limited number of retries OTOH would be a sane effort at getting things going while still preventing the process from getting stuck in an unusable state without explicit indication to the user.

I'm not sure about it. The initialization phase is there for a reason. Skipping it might lead to a limbo state in which you don't know exactly what happens.

Moreover I noticed that for webhook mode, the retries were kept at "none". I suspect that this was an oversight

Could be an oversight. That much I don't remember.

@Bibo-Joshi
Member Author

Skipping it might lead to a limbo state in which you don't know exactly what happens.

I would not skip it, but rather abort if the bootstrapping phase fails. Meaning that after n < ∞ retries you either have a bot that's able to handle updates, or the process has shut down.

But okay, I get your reasoning and now have to decide what to make of it 😅 Thanks for the input!

@tsnoam
Member
tsnoam commented Feb 9, 2025

as long as you can get to an expected state, any solution is good.

@septatrix
Contributor

I think a finite number of retries is more than enough, as long as the delay between them is large enough to rule out intermittent connectivity problems as the source of failures. E.g., in my issue report the problem was that the service was brought up before network/DNS was online; this state is usually resolved within the first 10-20 seconds after boot.

Having an infinite number of retries by default sounds dangerous and may instead hide problems which the user wants to be notified about (by the service exiting).

@Bibo-Joshi
Member Author

as long as the delay between them is large enough to eliminate intermittent connectivity problems as the source of potential problems

So far I have set the initial retry interval to 0 seconds for Application.initialize, but I see that the bootstrapping within Updater uses an initial interval of 1 second - I can use that for Application as well:

bootstrap_interval: float = 1,

Note that the retry-loop retries immediately on timeout errors. On other errors, the interval is increased step by step up to 30 seconds:

    try:
        if not await do_action():
            break
    except RetryAfter as exc:
        slack_time = 0.5
        _LOGGER.info(
            "%s %s. Adding %s seconds to the specified time.", log_prefix, exc, slack_time
        )
        cur_interval = slack_time + exc.retry_after
    except TimedOut as toe:
        _LOGGER.debug("%s Timed out: %s. Retrying immediately.", log_prefix, toe)
        # If failure is due to timeout, we should retry asap.
        cur_interval = 0
    except InvalidToken:
        _LOGGER.exception("%s Invalid token. Aborting retry loop.", log_prefix)
        raise
    except TelegramError as telegram_exc:
        if on_err_cb:
            on_err_cb(telegram_exc)
        if max_retries < 0 or retries < max_retries:
            _LOGGER.debug(
                "%s Failed run number %s of %s. Retrying.", log_prefix, retries, max_retries
            )
        else:
            _LOGGER.exception(
                "%s Failed run number %s of %s. Aborting.", log_prefix, retries, max_retries
            )
            raise
        # increase waiting times on subsequent errors up to 30 secs
        cur_interval = 1 if cur_interval == 0 else min(30, 1.5 * cur_interval)
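For reference, the interval growth produced by that last line can be sketched in isolation (assuming the 1.5× factor and the 30-second cap quoted above): the interval jumps to 1 second after the first non-timeout error, then grows geometrically until it saturates at 30 seconds.

```python
def next_interval(cur_interval: float) -> float:
    # Mirrors the quoted line: jump to 1s after the first error,
    # then grow by a factor of 1.5 per subsequent error, capped at 30s.
    return 1 if cur_interval == 0 else min(30, 1.5 * cur_interval)


cur = 0.0
intervals = []
for _ in range(10):
    cur = next_interval(cur)
    intervals.append(cur)

# Sequence starts 1, 1.5, 2.25, 3.375, 5.0625, ... and caps at 30.
print(intervals)
```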

Specifying the retry-interval is currently not exposed to the user in the Updater.start_* methods and for starters that would seem like a bit of overkill, TBH.

@septatrix would this + (customizable) finite number of retries be enough for you?

@septatrix
Contributor

as long as the delay between them is large enough to eliminate intermittent connectivity problems as the source of potential problems

So far I have set the initial retry interval to 0 seconds for Application.initialize [...] but I see that the bootstrapping within Updater uses an initial interval of 1 second - I can use that for Application as well:

I think having a default 1s interval is better than 0s - or at least something non-zero. Depending on how far along the network stack is, the connection attempts would fire instantly instead of after a timeout, so adding a small delay (such as the 1s) on our side is better.

Note that the retry-loop retries immediately on timeout errors. On other errors, the interval is increased step by step up to 30 seconds:

On timeout errors this should not be a problem because the timeout itself will result in some delay (though it also wouldn't hurt to wait, it's just not necessary).

Specifying the retry-interval is currently not exposed to the user in the Updater.start_* methods and for starters that would seem like a bit of overkill, TBH.

I agree. As long as the default interval is non-zero it should be fine for most people.

@septatrix would this + (customizable) finite number of retries be enough for you?

Yes, as far as I understand how this is hooked up now this would be enough for me.


Small aside: while peeking at the code I found a small curiosity, nothing urgent. I noticed that the slack added to RetryAfter exceptions (introduced in a68cf8d) differs from the one in AIORateLimiter. Since you already extract it into a variable, maybe put it in some shared location? However, this is completely independent from the rest of this PR, so feel free to disregard it.

codecov bot commented Feb 17, 2025

❌ 2 Tests Failed:

Tests completed: 6531 · Failed: 2 · Passed: 6529 · Skipped: 746
View the top 2 failed test(s) by shortest run time
tests.test_bot.TestBotWithRequest::test_send_close_date_default_tz[Europe/Berlin-ZoneInfo]
Stack Traces | 7.35s run time
self = <tests.test_bot.TestBotWithRequest object at 0x000001CE6EFE70D0>
tz_bot = PytestExtBot[token=690091347:AAFLmR5pAB5Ycpe_mOh7zM4JFBOh0z3T0To]
super_group_id = '-1001279600026'

    async def test_send_close_date_default_tz(self, tz_bot, super_group_id):
        question = "Is this a test?"
        answers = ["Yes", "No", "Maybe"]
        reply_markup = InlineKeyboardMarkup.from_button(
            InlineKeyboardButton(text="text", callback_data="data")
        )
    
        aware_close_date = dtm.datetime.now(tz=tz_bot.defaults.tzinfo) + dtm.timedelta(seconds=5)
        close_date = aware_close_date.replace(tzinfo=None)
    
        msg = await tz_bot.send_poll(  # The timezone returned from this is always converted to UTC
            chat_id=super_group_id,
            question=question,
            options=answers,
            close_date=close_date,
            read_timeout=60,
        )
        msg.poll._unfreeze()
        # Sometimes there can be a few seconds delay, so don't let the test fail due to that-
>       msg.poll.close_date = msg.poll.close_date.astimezone(aware_close_date.tzinfo)
E       AttributeError: 'NoneType' object has no attribute 'astimezone'

tests\test_bot.py:2843: AttributeError
tests.test_bot.TestBotWithRequest::test_send_close_date_default_tz[Asia/Singapore-timezone]
Stack Traces | 7.37s run time
self = <tests.test_bot.TestBotWithRequest object at 0x00000223F0A048D0>
tz_bot = PytestExtBot[token=690091347:AAFLmR5pAB5Ycpe_mOh7zM4JFBOh0z3T0To]
super_group_id = '-1001279600026'

    async def test_send_close_date_default_tz(self, tz_bot, super_group_id):
        question = "Is this a test?"
        answers = ["Yes", "No", "Maybe"]
        reply_markup = InlineKeyboardMarkup.from_button(
            InlineKeyboardButton(text="text", callback_data="data")
        )
    
        aware_close_date = dtm.datetime.now(tz=tz_bot.defaults.tzinfo) + dtm.timedelta(seconds=5)
        close_date = aware_close_date.replace(tzinfo=None)
    
        msg = await tz_bot.send_poll(  # The timezone returned from this is always converted to UTC
            chat_id=super_group_id,
            question=question,
            options=answers,
            close_date=close_date,
            read_timeout=60,
        )
        msg.poll._unfreeze()
        # Sometimes there can be a few seconds delay, so don't let the test fail due to that-
>       msg.poll.close_date = msg.poll.close_date.astimezone(aware_close_date.tzinfo)
E       AttributeError: 'NoneType' object has no attribute 'astimezone'

tests\test_bot.py:2843: AttributeError


@Bibo-Joshi Bibo-Joshi marked this pull request as ready for review February 17, 2025 17:20
@Bibo-Joshi
Member Author

updated the tests, ready for review :)

@harshil21 harshil21 requested a review from Copilot February 26, 2025 18:18
@Copilot Copilot AI left a comment

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

@Bibo-Joshi Bibo-Joshi merged commit b0d22ac into master Mar 1, 2025
25 of 26 checks passed
@Bibo-Joshi Bibo-Joshi deleted the feature/app-bootstrapping branch March 1, 2025 10:13
@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2025
Labels
🔌 enhancement pr description: enhancement 🛠 refactor change type: refactor
Development

Successfully merging this pull request may close these issues.

Add Bootstrapping Logic to Application.initialize when Called within Application.run_*
4 participants