[FEATURE] tool for translating unicode offsets to utf-16 offsets of MessageEntity #4319
Comments
Hi, thanks for reaching out. I can't quite follow. I understood that you have troubles specifying the |
I wrote a custom Markdown parser, which cuts very long text (exceeding the length limit of a Telegram message, usually long code) into many parts and marks all code blocks manually:
def longtext_markdown_split(txt: str) -> tuple[list[str], list[list[UnicodeMessageEntity]]]:
    ...
I mean if we can support this:
texts, entities_list = longtext_markdown_split(original_text)
for text, entities in zip(texts, entities_list):
    # the type of `entities` is list of `UnicodeMessageEntity | MessageEntity`, and the
    # `UnicodeMessageEntity` objects will be translated into `MessageEntity` internally in ptb
    await bot.send_message(chat_id, text, entities=entities)
Sorry for misleading:
|
Mh, I still don't get it. Note that |
Yes, but if the code of parsing entities is hard to refactor (for example, using a third-party library, which assumes
In my opinion, since |
Fair points. I agree that PTB can provide more utility functionality for working with MessageEntity, I just took a bit to understand what exactly you're asking for : ) If I understand correctly now, you'd like to have a conversion functionality like
def adjust_message_entities_for_utf_16(text: str, entities: Sequence[MessageEntity]) -> Sequence[MessageEntity]:
    """Utility functionality for converting the offset and length of entities from unicode to UTF-16.

    Tip:
        Useful if you want to express formatting of text containing arbitrary unicode characters with `MessageEntity` objects
        but do not want to take care of calculating in UTF-16 space. Instead you can just count characters with the cursor
        in your editor.

    Args:
        text: The text that the entities belong to
        entities: Sequence of entities with offset and length calculated in Unicode code points

    Returns:
        Sequence[MessageEntity]: The entities with offset and length converted to UTF-16 code units
    """
Based on your snippet I've built an MWE. Using
import asyncio
from collections.abc import Sequence
from pprint import pprint

from telegram import Bot, MessageEntity


def adjust_message_entities_for_utf_16(
    text: str, entities: Sequence[MessageEntity]
) -> Sequence[MessageEntity]:
    # Assumes `entities` are sorted by offset and do not overlap.
    cur_index = 0  # position in `text`, counted in code points
    accumulated_len = 0  # bytes of UTF-16-LE consumed so far (2 bytes per code unit)
    out = []
    for entity in entities:
        # Text between the previous entity and this one
        filler_text = text[cur_index : entity.offset]
        accumulated_len += len(filler_text.encode("utf-16-le"))
        cur_off = accumulated_len // 2
        entity_text = text[entity.offset : entity.offset + entity.length]
        accumulated_len += len(entity_text.encode("utf-16-le"))
        cur_len = accumulated_len // 2 - cur_off
        out.append(
            MessageEntity(
                offset=cur_off, length=cur_len, type=entity.type, language=entity.language
            )
        )
        cur_index = entity.offset + entity.length
    return out


async def main():
    text = "𠌕 bold 𝄢 italic\nunderlined: 𝛙𝌢𑁍"
    # Offsets and lengths counted in code points
    unicode_entities = [
        MessageEntity(offset=2, length=4, type=MessageEntity.BOLD),
        MessageEntity(offset=9, length=6, type=MessageEntity.ITALIC),
        MessageEntity(offset=28, length=3, type=MessageEntity.UNDERLINE),
    ]
    # The same entities counted in UTF-16 code units
    utf_16_entities = [
        MessageEntity(offset=3, length=4, type=MessageEntity.BOLD),
        MessageEntity(offset=11, length=6, type=MessageEntity.ITALIC),
        MessageEntity(offset=30, length=6, type=MessageEntity.UNDERLINE),
    ]
    async with Bot(token="token") as bot:
        await bot.send_message(
            chat_id=123,
            text=text,
            entities=adjust_message_entities_for_utf_16(text, unicode_entities),
            # entities=utf_16_entities,
        )


if __name__ == "__main__":
    asyncio.run(main())
However, I see no need for a new class |
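The loop in the MWE above walks the text entity by entity, which seems to rely on the entities being sorted by offset and not overlapping. Here is a minimal sketch of an alternative that drops that assumption. It works on plain `(offset, length)` tuples rather than `MessageEntity` objects so that it runs without PTB installed, and the names `utf16_prefix_table` and `spans_to_utf16` are hypothetical, not part of PTB. It builds a table of UTF-16 positions once and then looks up each span independently.

```python
from collections.abc import Sequence


def utf16_prefix_table(text: str) -> list[int]:
    """table[i] is the number of UTF-16 code units before code point i."""
    table = [0]
    for ch in text:
        # Code points above U+FFFF need a surrogate pair, i.e. two code units.
        table.append(table[-1] + (2 if ord(ch) > 0xFFFF else 1))
    return table


def spans_to_utf16(
    text: str, spans: Sequence[tuple[int, int]]
) -> list[tuple[int, int]]:
    """Convert (offset, length) pairs from code points to UTF-16 code units."""
    table = utf16_prefix_table(text)
    out = []
    for offset, length in spans:
        start = table[offset]
        end = table[offset + length]
        out.append((start, end - start))
    return out


text = "𠌕 bold 𝄢 italic\nunderlined: 𝛙𝌢𑁍"
# Nested and unsorted spans are fine because each lookup is independent.
print(spans_to_utf16(text, [(28, 3), (2, 4), (9, 6), (0, 6)]))
```

For the three spans used in the MWE this gives the same values as the hand-written `utf_16_entities` list. The trade-off is one extra list of integers per message, in exchange for not having to care about the order or nesting of the entities.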
I'd prefer static method of |
Okay, fair point, let's go ahead with those then :) |
I've made a PR #4323 |
What kind of feature are you missing? Where do you notice a shortcoming of PTB?
I want to send a message with a customized MessageEntity. At first I assumed PTB already handled the unicode -> utf-16 translation. It generally works for most texts, but it turns out it fails if there are emojis in the message.
Describe the solution you'd like
A new class UnicodeMessageEntity is needed. Currently I have a simple solution:
It would be nice if this could be applied automatically when sending messages, whenever the entity object is an instance of UnicodeMessageEntity.
Describe alternatives you've considered
No response
Additional context
No response
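The failure with emojis described in this issue comes from the gap between Python's `len`, which counts code points, and the UTF-16 code units in which Telegram measures entity offsets and lengths. A small illustration using only the standard library (the helper name `utf16_len` is made up for this sketch):

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units needed to encode `s`."""
    # "utf-16-le" emits no byte order mark, and every code unit is 2 bytes.
    return len(s.encode("utf-16-le")) // 2


# Print each sample with its code point count and its UTF-16 code unit count.
for sample in ["bold", "€", "😀", "𝄢 clef"]:
    print(repr(sample), len(sample), utf16_len(sample))
```

The two counts agree for characters in the Basic Multilingual Plane and differ by one for every character above U+FFFF, which is why offsets only drift once such a character appears before or inside an entity.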