[FEATURE] tool for translating unicode offsets to utf-16 offsets of MessageEntity · Issue #4319 · python-telegram-bot/python-telegram-bot · GitHub

[FEATURE] tool for translating unicode offsets to utf-16 offsets of MessageEntity #4319


Closed
Antares0982 opened this issue Jun 22, 2024 · 8 comments · Fixed by #4323

Comments

@Antares0982
Contributor
Antares0982 commented Jun 22, 2024

What kind of feature are you missing? Where do you notice a shortcoming of PTB?

I want to send a message with custom MessageEntity objects. At first I assumed PTB already handled the unicode -> UTF-16 offset translation. It works for most texts, but it turns out to fail when the message contains emoji.
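
To illustrate the mismatch: Python's `len` and string indexing count code points, while Telegram counts UTF-16 code units, so every character outside the Basic Multilingual Plane (most emoji) shifts all later offsets by one. A minimal sketch using only the standard library; `utf16_len` is a hypothetical helper name, not part of PTB:

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode ``text``."""
    # utf-16-le emits exactly 2 bytes per code unit and no byte-order mark
    return len(text.encode("utf-16-le")) // 2


text = "😀 bold"
# Python indexes by code point: "bold" starts at index 2 ...
python_offset = text.index("bold")
# ... but the emoji occupies two UTF-16 code units, so Telegram expects 3
telegram_offset = utf16_len(text[:python_offset])
```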

Describe the solution you'd like

A new class UnicodeMessageEntity is needed. Currently I have a simple solution:

    def fix_entities_offset(self):
        for text, entities in zip(self.texts, self.entities):
            cur_index = 0
            accumulated_len = 0
            for i, entity in enumerate(entities):
                cur_text = text[cur_index:entity.offset]
                accumulated_len += len(cur_text.encode('utf-16-le'))
                cur_off = accumulated_len // 2
                cur_text = text[entity.offset:entity.offset+entity.length]
                accumulated_len += len(cur_text.encode('utf-16-le'))
                cur_len = accumulated_len // 2 - cur_off
                entities[i] = MessageEntity(offset=cur_off, length=cur_len, type=entity.type, language=entity.language)
                cur_index = entity.offset + entity.length

It would be nice if this could be applied automatically when sending messages, whenever an entity object is an instance of UnicodeMessageEntity.

Describe alternatives you've considered

No response

Additional context

No response

@Bibo-Joshi
Member

Hi, thanks for reaching out. I can't quite follow. I understand that you have trouble specifying the offset and length arguments of MessageEntity if you have emoji in your text, but I don't quite get how your snippet is expected to solve that. Can you give an example of what you'd like the user interface to look like?

@Antares0982
Contributor Author
Antares0982 commented Jun 23, 2024

I wrote a custom Markdown parser that cuts very long text (exceeding the length limit of a Telegram message, usually long code) into several parts and marks all code blocks manually:

def longtext_markdown_split(txt: str) -> tuple[list[str], list[list[UnicodeMessageEntity]]]:
    ...

What I mean is whether we could support this:

texts, entities_list = longtext_markdown_split(original_text)
for text, entities in zip(texts, entities_list):
    # the type of `entities` is list of `UnicodeMessageEntity | MessageEntity`, and the `UnicodeMessageEntity` objects will be translated into `MessageEntity` internally in ptb
    await bot.send_message(text, entities=entities)

Sorry for the confusion:

  • the fix_entities_offset above is my current workaround, for a list of sorted UnicodeMessageEntity.
  • Edit: utf-8 -> unicode

@Antares0982 Antares0982 changed the title [FEATURE] tool for translating utf-8 offsets to utf-16 offsets of MessageEntity [FEATURE] tool for translating unicode offsets to utf-16 offsets of MessageEntity Jun 23, 2024
@Bibo-Joshi
Member

Mh, I still don't get it. longtext_markdown_split(original_text) does not accept any entities, only text. Does it receive MD-formatted text and then parse that into plain text + entities? If so, you could probably utf-16-encode the text before you start parsing?

Note that MessageEntity is a pure implementation of the TG definition. As it does not have a direct connection to the text covered by the entity, TG does not adjust the offset or length in any way.
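
Because offset and length are plain numbers in UTF-16 code units, recovering the text an entity covers means slicing in UTF-16 space rather than with ordinary `str` indexing. A small sketch of that idea, standard library only; `slice_utf16` is a hypothetical name used for illustration:

```python
def slice_utf16(text: str, offset: int, length: int) -> str:
    """Return the part of ``text`` covered by a UTF-16 offset and length."""
    encoded = text.encode("utf-16-le")
    # each UTF-16 code unit is 2 bytes, so unit indices are doubled
    return encoded[offset * 2 : (offset + length) * 2].decode("utf-16-le")


text = "𝄢 italic"
# "𝄢" takes two code units, so "italic" starts at UTF-16 offset 3
covered = slice_utf16(text, offset=3, length=6)
# the same numbers applied to the str directly are off by one
naive = text[3:9]
```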

@Antares0982
Contributor Author
Antares0982 commented Jun 23, 2024

> If so, you could probably utf-16-encode the text before you start parsing?

Yes, but if the code of parsing entities is hard to refactor (for example, using a third-party library, which assumes str type everywhere), this would not be a good solution.

> Note that MessageEntity is a pure implementation of the TG definition. As it does not have a direct connection to the text covered by the entity, TG does not adjust the offset or length in any way.

In my opinion, since str is the most common type for processing strings, a utility class for unicode strings would ease the burden on developers, and it is actually what a wrapper should do.

@Bibo-Joshi
Member

Fair points. I agree that PTB can provide more utility functionality for working with MessageEntity, it just took me a bit to understand what exactly you're asking for : ) If I understand correctly now, you'd like to have conversion functionality like

def adjust_message_entities_for_utf_16(text: str, entities: Sequence[MessageEntity]) -> Sequence[MessageEntity]:
    """Utility functionality for converting the offset and length of entities from unicode to UTF-16.

    Tip:
       Useful if you want to express formatting of text containing arbitrary unicode characters with `MessageEntity` objects
       but do not want to take care of calculating in UTF-16 space. Instead you can just count characters with the cursor
       in your editor.

    Args:
        text: The text that the entities belong to
        entities: Sequence of entities with offset and length counted in unicode characters

    Returns:
        Sequence[MessageEntity]: The entities with offset and length adjusted to UTF-16
    """

Based on your snippet I've built an MWE. Using utf_16_entities gives the same result as using adjust_message_entities_for_utf_16(text, unicode_entities).

import asyncio
from collections.abc import Sequence
from pprint import pprint

from telegram import Bot, MessageEntity


def adjust_message_entities_for_utf_16(
    text: str, entities: Sequence[MessageEntity]
) -> Sequence[MessageEntity]:
    # Assumes `entities` are sorted by offset and do not overlap.
    cur_index = 0  # position in `text`, counted in unicode characters
    accumulated_len = 0  # UTF-16-LE bytes consumed so far (2 per code unit)
    out = []
    for entity in entities:
        # Text between the end of the previous entity and the start of this one
        filler_text = text[cur_index : entity.offset]
        accumulated_len += len(filler_text.encode("utf-16-le"))
        cur_off = accumulated_len // 2

        # Text covered by the entity itself
        entity_text = text[entity.offset : entity.offset + entity.length]
        accumulated_len += len(entity_text.encode("utf-16-le"))
        cur_len = accumulated_len // 2 - cur_off
        out.append(
            MessageEntity(
                offset=cur_off, length=cur_len, type=entity.type, language=entity.language
            )
        )
        cur_index = entity.offset + entity.length

    return out


async def main():
    text = "𠌕 bold 𝄢 italic\nunderlined: 𝛙𝌢𑁍"
    unicode_entities = [
        MessageEntity(offset=2, length=4, type=MessageEntity.BOLD),
        MessageEntity(offset=9, length=6, type=MessageEntity.ITALIC),
        MessageEntity(offset=28, length=3, type=MessageEntity.UNDERLINE),
    ]
    utf_16_entities = [
        MessageEntity(offset=3, length=4, type=MessageEntity.BOLD),
        MessageEntity(offset=11, length=6, type=MessageEntity.ITALIC),
        MessageEntity(offset=30, length=6, type=MessageEntity.UNDERLINE),
    ]
    async with Bot(token="token") as bot:
        await bot.send_message(
            chat_id=123,
            text=text,
            entities=adjust_message_entities_for_utf_16(text, unicode_entities),
            # entities=utf_16_entities,
        )


if __name__ == "__main__":
    asyncio.run(main())
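
The conversion function in this MWE walks the entities in order, so it relies on them being sorted by offset and not overlapping. One possible variant, sketched here under the assumption that arbitrary order and nesting should be supported, precomputes the UTF-16 position of every character once and then looks each entity up independently. The `Entity` dataclass and the name `to_utf16_entities` are hypothetical stand-ins so the example runs without PTB installed; they are not part of the library.

```python
from dataclasses import dataclass, replace
from itertools import accumulate


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for telegram.MessageEntity (illustration only)."""

    type: str
    offset: int
    length: int


def to_utf16_entities(text: str, entities: list[Entity]) -> list[Entity]:
    """Convert code-point offsets and lengths to UTF-16 code units, in any order."""
    # units[i] = number of UTF-16 code units before code point i;
    # characters above U+FFFF need a surrogate pair, i.e. two units
    units = [0, *accumulate(2 if ord(ch) > 0xFFFF else 1 for ch in text)]
    out = []
    for entity in entities:
        start = units[entity.offset]
        end = units[entity.offset + entity.length]
        out.append(replace(entity, offset=start, length=end - start))
    return out


text = "𠌕 bold 𝄢 italic\nunderlined: 𝛙𝌢𑁍"
converted = to_utf16_entities(
    text,
    [
        Entity("underline", offset=28, length=3),  # deliberately out of order
        Entity("bold", offset=2, length=4),
        Entity("italic", offset=9, length=6),
    ],
)
```

For the sample text this yields the same offsets and lengths as the hand-written utf_16_entities list in the MWE, only in the order the entities were passed in.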

However, I see no need for a new class UnicodeMessageEntity. This would be a rather tight integration of this conversion into the bot methods, plus a deviation from the Bot API, which we generally avoid. For starters, I think adding a helper function should be enough.
I'm now wondering where such a function would fit best: in telegram.helpers as a simple function, or as a static method of MessageEntity? The latter may be a tad easier to find, but binding this helper function to the class doesn't give any inherent benefit 🤔 I guess I'd lean towards helpers. Maybe you can give an opinion on where you'd be looking for this?

@Antares0982
Contributor Author

I'd prefer a static method of MessageEntity. It's easier to find, and it also makes developers aware of the offset translation problem the first time they use the class.

@Bibo-Joshi
Member

Okay, fair point, let's go ahead with that then :)
Would you like to send a PR? We can probably come up with a better function name and docstring (an example would be nice). In the code snippets, I would like to improve the variable names somewhat to make them more self-explanatory :)

@Antares0982
Contributor Author

I've made a PR #4323

@Poolitzer Poolitzer linked a pull request Jul 3, 2024 that will close this issue
@github-actions github-actions bot locked and limited conversation to collaborators Jul 14, 2024