8000 gh-115077: Argument Clinic: generate better error messages when parsing function declaration by erlend-aasland · Pull Request #115555 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

gh-115077: Argument Clinic: generate better error messages when parsing function declaration #115555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

erlend-aasland
Copy link
Contributor
@erlend-aasland erlend-aasland commented Feb 16, 2024

@erlend-aasland
Copy link
Contributor Author
erlend-aasland commented Feb 16, 2024

With this experiment, we can in the future make use of shlex's character position, and thus easily provide the position in the line where parsing failed. For example, by providing error messages that look more similar to the familiar Python tracebacks.

@erlend-aasland
Copy link
Contributor Author
erlend-aasland commented Feb 16, 2024

Some examples of improved cases:

  • foo = bar ->
    • main: Illegal function name foo = bar ->
    • This PR: No return annotation provided ...
  • foo as
    • main: Illegal function name foo as
    • This PR: No C base name provided ...
  • a b c d:
    • main: Illegal function name a b c d
    • This PR: Invalid syntax ...
  • foo = bar baz:
    • main: Illegal function mame foo = bar baz
    • This PR: Invalid syntax ...

UPDATE: after 9b93771, the latter two cases are no longer improved.

@erlend-aasland
Copy link
Contributor Author

Another positive side effect: previously, the parsing fail()s (for function declarations) were scattered around in various places; now they are collected in one place. IMO, that helps readability and maintainability.

@erlend-aasland erlend-aasland changed the title gh-115077: Argument Clinic: use a lexer to generate better error message gh-115077: Argument Clinic: use a lexer to generate better error messages Feb 16, 2024
Copy link
Member
@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use the shell tokenizer to parse Argument Clinic syntax? Isn't it closer to Python syntax?

@erlend-aasland
Copy link
Contributor Author

Why use the shell tokenizer to parse Argument Clinic syntax? Isn't it closer to Python syntax?

Because it was the short route to a proof-of-concept PR. We can of course rewrite it to use the Python tokeniser instead.

@erlend-aasland
Copy link
Contributor Author
erlend-aasland commented Feb 16, 2024

We can of course rewrite it to use the Python tokeniser instead.

Possible gotcha: the Python tokeniser will probably split up the full name (e.g. mod.cls.fn will be returned as ["mod", ".", "cls", ".", "fn"], IIRC). Currently, the shell tokeniser is easily configured to give us a single token for the full name: ["mod.cls.fn"]. This means we'd have to do extra post-processing for full names.

@serhiy-storchaka
Copy link
Member

The shell tokenizer has much more gotchas.

@erlend-aasland
Copy link
Contributor Author

The shell tokenizer has much more gotchas.

We already use the shell tokenizer for parsing the checksum line. Should we also stop using it there?

Let's rewrite it using the Python tokenizer then. If it introduces too much complexity, let's just forget about this experiment and leave the error messages like they are today.

@erlend-aasland
Copy link
Contributor Author

The shell tokenizer has much more gotchas.

Could you point to some, so I can add tests for those?

@serhiy-storchaka
Copy link
Member

I expect some surprises in handling quotes and escapes.

But for such simple case both look overkill to me. It can be done with regexpes or string methods. What are the problems in the current code?

@serhiy-storchaka
Copy link
Member
serhiy-storchaka commented Feb 16, 2024

For example:

        m = re.match(r'\s*([\w.]+)\s*', line)
        assert m
        full_name = m[1]
        if not libclinic.is_legal_py_identifier(full_name):
            fail(f"Illegal function name: {full_name!r}")
        pos = m.end()

        m = re.compile(r'\bas\b\s*(?:([^-=\s]+)\s*)?').match(line, pos)
        if m:
            if not m[1]:
                fail(f"No C basename provided for {full_name!r} after 'as' keyword")
            c_basename = m[1]
            if not libclinic.is_legal_c_identifier(c_basename):
                fail(f"Illegal C basename: {c_basename!r}")
            pos = m.end()
        else:
            c_basename = self.generate_c_basename(full_name)

        m = re.compile(r'=\s*(?:([^-=\s]+)\s*)?').match(line, pos)
        if m:
            if not m[1]:
                fail(f"No source function provided for {full_name!r} after '=' keyword")
            cloned = m[1]
            if not libclinic.is_legal_py_identifier(cloned):
                fail(f"Illegal source function name: {cloned!r}")
            pos = m.end()

        m = re.compile(r'->\s*(.*)').match(line, pos)
        if m:
            if not m[1]:
                fail(f"No return annotation provided for {full_name!r} after '->' keyword")
            returns = m[1].strip()

@erlend-aasland
Copy link
Contributor Author

I expect some surprises in handling quotes and escapes.

That should be easy to check; I don't expect it to be a problem with our simple syntax; as you can see, the test suite completes without error, and all clinic code in our repo is parsed without problems. No surprises (yet).

But for such simple case both look overkill to me. It can be done with regexpes or string methods. What are the problems in the current code?

It generates very bad error messages in many cases (reflected in the PR title). Also, the parsing failures are scattered around the code, instead of collected in one place as in this PR. See my earlier comments:

IMO, it is worth it to generate better error messages.

@erlend-aasland
Copy link
Contributor Author

For example:

It misses some corner cases, but it is a good alternative; thanks.

@erlend-aasland
Copy link
Contributor Author
erlend-aasland commented Feb 16, 2024

@serhiy-storchaka, I adapted it to fit in commits 9b93771 and 1cc7248. I removed some edge cases1; perhaps it is extreme to check for such cases of invalid syntax anyway 🤷 It is a handful of lines shorter, which is nice. IMO, the shlex approach is more readable, but we don't have to weight that too heavy.

What do you think?

Footnotes

  1. see https://github.com/python/cpython/pull/115555#issuecomment-1948104973

@erlend-aasland erlend-aasland changed the title gh-115077: Argument Clinic: use a lexer to generate better error messages gh-115077: Argument Clinic: generate better error messages when parsing function declaration Feb 16, 2024
Comment on lines +5 to +6
RE_C_BASENAME = re.compile(r"\bas\b\s*(?:([^-=\s]+)\s*)?")
RE_CLONE = re.compile(r"=\s*(?:([^-=\s]+)\s*)?")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote it pass most of your tests, but perhaps \w+ or [\w.]+ is better than [^-=\s]+. It will produce different error message for foo.bar as '', but it may be for good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my test case might also be too contrived.

@erlend-aasland
Copy link
Contributor Author

I don't have the bandwidth to follow this up now; closing the PR but keeping the local branch. Feel free to pick it up.

@erlend-aasland erlend-aasland deleted the clinic/tokenizer branch March 27, 2024 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0