8000 gh-131507: Clean up tests and type checking for `_pyrepl` by ambv · Pull Request #131509 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

gh-131507: Clean up tests and type checking for _pyrepl #131509

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 21, 2025
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next Next commit
Add _pyrepl.utils.unbracket
  • Loading branch information
ambv committed Mar 20, 2025
commit b83d282616ad8549328d8b043979236ec25edf14
45 changes: 9 additions & 36 deletions Lib/_pyrepl/reader.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@


from . import commands, console, input
from .utils import ANSI_ESCAPE_SEQUENCE, wlen, str_width
from .utils import wlen, unbracket, str_width
from .trace import trace


Expand Down Expand Up @@ -421,42 +421,15 @@ def calc_screen(self) -> list[str]:

@staticmethod
def process_prompt(prompt: str) -> tuple[str, int]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something but looks like the old function was tracking nested ANSI escape seas outside \x01 and \x02 brackets but this one drops this no? What is the rationale?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three things the old function was trying to do (badly):

  • ignore anything between \001 and \002 for length calculation; and
  • remove \001 and \002 characters from the emitted prompt; and
  • also ignore any ANSI escape sequences from length calculation.

Since wlen() is already doing 3., a contributor here needed to recreate this function to not do that, because we were doing that later anyway. But it's simpler to just use the main wlen() with a string stripped of \001 .. \002 bracketed content.

I find the new form much easier on the eyes.

"""Process the prompt.
r"""Return a tuple with the prompt string and its visible length.

This means calculate the length of the prompt. The character \x01
and \x02 are used to bracket ANSI control sequences and need to be
excluded from the length calculation. So also a copy of the prompt
is returned with these control characters removed."""

# The logic below also ignores the length of common escape
# sequences if they were not explicitly within \x01...\x02.
# They are CSI (or ANSI) sequences ( ESC [ ... LETTER )

# wlen from utils already excludes ANSI_ESCAPE_SEQUENCE chars,
# which breaks the logic below so we redefine it here.
def wlen(s: str) -> int:
return sum(str_width(i) for i in s)

out_prompt = ""
l = wlen(prompt)
pos = 0
while True:
s = prompt.find("\x01", pos)
if s == -1:
break
e = prompt.find("\x02", s)
if e == -1:
break
# Found start and end brackets, subtract from string length
l = l - (e - s + 1)
keep = prompt[pos:s]
l -= sum(map(wlen, ANSI_ESCAPE_SEQUENCE.findall(keep)))
out_prompt += keep + prompt[s + 1 : e]
pos = e + 1
keep = prompt[pos:]
l -= sum(map(wlen, ANSI_ESCAPE_SEQUENCE.findall(keep)))
out_prompt += keep
return out_prompt, l
The prompt string has the zero-width brackets recognized by shells
(\x01 and \x02) removed. The length ignores anything between those
brackets as well as any ANSI escape sequences.
Comment on lines +426 to +428
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The prompt string has the zero-width brackets recognized by shells
(\x01 and \x02) removed. The length ignores anything between those
brackets as well as any ANSI escape sequences.
The prompt string has the zero-width brackets (\x01 and \x02)
recognized by shells removed. The length ignores anything between
those brackets as well as any ANSI escape sequences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an improvement? There are many kinds of zero-width brackets, the ones I am interested in are those recognized by shells (\x01 and \x02). The original docstring uses that ordering then.

"""
out_prompt = unbracket(prompt, including_content=False)
visible_prompt = unbracket(prompt, including_content=True)
return out_prompt, wlen(visible_prompt)

def bow(self, p: int | None = None) -> int:
"""Return the 0-based index of the word break preceding p most
Expand Down
19 changes: 16 additions & 3 deletions Lib/_pyrepl/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,23 +3,36 @@
import functools

ANSI_ESCAPE_SEQUENCE = re.compile(r"\x1b\[[ -@]*[A-~]")
ZERO_WIDTH_BRACKET = re.compile(r"\x01.*?\x02")
ZERO_WIDTH_TRANS = str.maketrans({"\x01": "", "\x02": ""})


@functools.cache
def str_width(c: str) -> int:
if ord(c) < 128:
return 1
w = unicodedata.east_asian_width(c)
if w in ('N', 'Na', 'H', 'A'):
if w in ("N", "Na", "H", "A"):
return 1
return 2


def wlen(s: str) -> int:
if len(s) == 1 and s != '\x1a':
if len(s) == 1 and s != "\x1a":
return str_width(s)
length = sum(str_width(i) for i in s)
# remove lengths of any escape sequences
sequence = ANSI_ESCAPE_SEQUENCE.findall(s)
ctrl_z_cnt = s.count('\x1a')
ctrl_z_cnt = s.count("\x1a")
return length - sum(len(i) for i in sequence) + ctrl_z_cnt


def unbracket(s: str, including_content: bool = False) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I like this approach. I was concerned about the regular expressions performance but couldn't find anything too bad when testing

r"""Return `s` with \001 and \002 characters removed.

If `including_content` is True, content between \001 and \002 is also
stripped.
"""
if including_content:
return ZERO_WIDTH_BRACKET.sub("", s)
return s.translate(ZERO_WIDTH_TRANS)
0