gh-115077: Argument Clinic: generate better error messages when parsing function declaration #115555

erlend-aasland · 2024-02-16T09:56:20Z

Issue: Argument Clinic: make error messages more helpful to developers #115077

erlend-aasland · 2024-02-16T10:01:20Z

With this experiment, we can in the future make use of shlex's character position and thus easily report where in the line parsing failed, for example with error messages that look more like the familiar Python tracebacks.

erlend-aasland · 2024-02-16T10:17:33Z

Some examples of improved cases:

foo = bar ->
- main: Illegal function name foo = bar ->
- This PR: No return annotation provided ...
foo as
- main: Illegal function name foo as
- This PR: No C base name provided ...
a b c d:
- main: Illegal function name a b c d
- This PR: Invalid syntax ...
foo = bar baz:
- main: Illegal function name foo = bar baz
- This PR: Invalid syntax ...

UPDATE: after 9b93771, the latter two cases are no longer improved.

erlend-aasland · 2024-02-16T10:25:12Z

Another positive side effect: previously, the parsing fail()s (for function declarations) were scattered around in various places; now they are collected in one place. IMO, that helps readability and maintainability.

serhiy-storchaka

Why use the shell tokenizer to parse Argument Clinic syntax? Isn't it closer to Python syntax?

erlend-aasland · 2024-02-16T11:21:11Z

> Why use the shell tokenizer to parse Argument Clinic syntax? Isn't it closer to Python syntax?

Because it was the short route to a proof-of-concept PR. We can of course rewrite it to use the Python tokeniser instead.

erlend-aasland · 2024-02-16T11:23:37Z

> We can of course rewrite it to use the Python tokeniser instead.

Possible gotcha: the Python tokeniser will probably split up the full name (e.g. mod.cls.fn will be returned as ["mod", ".", "cls", ".", "fn"], IIRC). Currently, the shell tokeniser is easily configured to give us a single token for the full name: ["mod.cls.fn"]. This means we'd have to do extra post-processing for full names.
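A quick way to see the difference described above is to tokenise the same dotted name both ways. This is a minimal sketch using only the standard library; the helper names `py_tokens` and `shell_tokens` are made up for illustration, and extending `wordchars` with a dot is one assumed way of configuring shlex, not necessarily what this PR does.

```python
import io
import shlex
import tokenize


def py_tokens(text):
    """Return the NAME and OP token strings from the Python tokenizer."""
    readline = io.StringIO(text).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type in (tokenize.NAME, tokenize.OP)
    ]


def shell_tokens(text):
    """Return shlex tokens, treating '.' as part of a word (assumed config)."""
    lexer = shlex.shlex(text)
    lexer.wordchars += "."
    return list(lexer)


print(py_tokens("mod.cls.fn"))     # ['mod', '.', 'cls', '.', 'fn']
print(shell_tokens("mod.cls.fn"))  # ['mod.cls.fn']
```

With the Python tokenizer, the pieces would have to be joined back into a dotted name in a post-processing step.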

serhiy-storchaka · 2024-02-16T11:42:06Z

The shell tokenizer has many more gotchas.

erlend-aasland · 2024-02-16T12:03:16Z

> The shell tokenizer has many more gotchas.

We already use the shell tokenizer for parsing the checksum line. Should we also stop using it there?

Let's rewrite it using the Python tokenizer then. If it introduces too much complexity, let's just forget about this experiment and leave the error messages like they are today.
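For context, splitting a line of `key=value` fields with the shell tokenizer, as mentioned above for the checksum line, can be sketched like this. The function name `parse_fields` and the field values are invented for illustration; this is not a copy of the Clinic code.

```python
import shlex


def parse_fields(arguments):
    """Split 'key=value key=value' text into a dict (illustrative helper)."""
    fields = {}
    for field in shlex.split(arguments):
        name, equals, value = field.partition("=")
        if not equals:
            raise ValueError(f"Expected key=value, got {field!r}")
        fields[name] = value
    return fields


# Made-up digests, only to show the shape of the result.
print(parse_fields("output=0123abcd input=89ef4567"))
```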

erlend-aasland · 2024-02-16T12:15:35Z

> The shell tokenizer has many more gotchas.

Could you point to some, so I can add tests for those?

serhiy-storchaka · 2024-02-16T12:31:49Z

I expect some surprises in handling quotes and escapes.

But for such a simple case, both look like overkill to me. It can be done with regexes or string methods. What are the problems in the current code?
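To illustrate the kind of quote and escape handling that could surprise here, this is a small sketch using `shlex.split` in its default POSIX mode. Whether these are the exact cases meant above is an assumption; `try_split` is a made-up helper.

```python
import shlex


def try_split(line):
    """Return the shlex tokens for line, or the error text if splitting fails."""
    try:
        return shlex.split(line)
    except ValueError as exc:
        return f"ValueError: {exc}"


# Quotes are stripped and may hide whitespace inside a single token.
print(try_split("foo as 'c name'"))  # ['foo', 'as', 'c name']
# A backslash escapes the following character instead of being kept.
print(try_split(r"foo\ bar"))        # ['foo bar']
# An unbalanced quote is an error raised by shlex itself.
print(try_split("foo as 'bar"))
```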

serhiy-storchaka · 2024-02-16T13:02:03Z

For example:

        m = re.match(r'\s*([\w.]+)\s*', line)
        assert m
        full_name = m[1]
        if not libclinic.is_legal_py_identifier(full_name):
            fail(f"Illegal function name: {full_name!r}")
        pos = m.end()

        m = re.compile(r'\bas\b\s*(?:([^-=\s]+)\s*)?').match(line, pos)
        if m:
            if not m[1]:
                fail(f"No C basename provided for {full_name!r} after 'as' keyword")
            c_basename = m[1]
            if not libclinic.is_legal_c_identifier(c_basename):
                fail(f"Illegal C basename: {c_basename!r}")
            pos = m.end()
        else:
            c_basename = self.generate_c_basename(full_name)

        m = re.compile(r'=\s*(?:([^-=\s]+)\s*)?').match(line, pos)
        if m:
            if not m[1]:
                fail(f"No source function provided for {full_name!r} after '=' keyword")
            cloned = m[1]
            if not libclinic.is_legal_py_identifier(cloned):
                fail(f"Illegal source function name: {cloned!r}")
            pos = m.end()

        m = re.compile(r'->\s*(.*)').match(line, pos)
        if m:
            if not m[1]:
                fail(f"No return annotation provided for {full_name!r} after '->' keyword")
            returns = m[1].strip()
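The snippet above depends on Clinic internals (`self`, `libclinic`, `fail()`). To experiment with the same regex sequence outside Clinic, a self-contained sketch could look like the following. `ClinicError`, `parse_declaration` and `describe` are hypothetical names, the identifier validation is left out, and input left over after the last match is not inspected.

```python
import re


class ClinicError(Exception):
    """Hypothetical stand-in for Argument Clinic's fail() helper."""


RE_FULL_NAME = re.compile(r"\s*([\w.]+)\s*")
RE_C_BASENAME = re.compile(r"\bas\b\s*(?:([^-=\s]+)\s*)?")
RE_CLONE = re.compile(r"=\s*(?:([^-=\s]+)\s*)?")
RE_RETURNS = re.compile(r"->\s*(.*)")


def parse_declaration(line):
    """Return (full_name, c_basename, cloned, returns); absent parts are None."""
    m = RE_FULL_NAME.match(line)
    if not m:
        raise ClinicError(f"Illegal function name: {line!r}")
    full_name = m[1]
    pos = m.end()
    c_basename = cloned = returns = None

    # Optional "as <c_basename>" part.
    m = RE_C_BASENAME.match(line, pos)
    if m:
        if not m[1]:
            raise ClinicError(
                f"No C basename provided for {full_name!r} after 'as' keyword")
        c_basename = m[1]
        pos = m.end()

    # Optional "= <source function>" part.
    m = RE_CLONE.match(line, pos)
    if m:
        if not m[1]:
            raise ClinicError(
                f"No source function provided for {full_name!r} after '=' keyword")
        cloned = m[1]
        pos = m.end()

    # Optional "-> <return annotation>" part.
    m = RE_RETURNS.match(line, pos)
    if m:
        if not m[1]:
            raise ClinicError(
                f"No return annotation provided for {full_name!r} after '->' keyword")
        returns = m[1].strip()

    return full_name, c_basename, cloned, returns


def describe(line):
    """Return the parsed tuple, or the error message as a string."""
    try:
        return parse_declaration(line)
    except ClinicError as exc:
        return f"error: {exc}"
```

For instance, `describe("foo as")` reports the missing C basename and `describe("foo = bar ->")` reports the missing return annotation, which corresponds to the first two improved cases listed earlier in the thread.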

erlend-aasland · 2024-02-16T13:08:42Z

> I expect some surprises in handling quotes and escapes.

That should be easy to check, and I don't expect it to be a problem with our simple syntax. As you can see, the test suite completes without error, and all clinic code in our repo is parsed without problems. No surprises (yet).

> But for such a simple case, both look like overkill to me. It can be done with regexes or string methods. What are the problems in the current code?

It generates very bad error messages in many cases (reflected in the PR title). Also, the parsing failures are scattered around the code, instead of collected in one place as in this PR. See my earlier comments.

IMO, it is worth it to generate better error messages.

erlend-aasland · 2024-02-16T13:40:12Z

> For example:

It misses some corner cases, but it is a good alternative; thanks.

erlend-aasland · 2024-02-16T14:05:23Z

@serhiy-storchaka, I adapted it to fit in commits 9b93771 and 1cc7248. I removed some edge cases¹; perhaps it is extreme to check for such cases of invalid syntax anyway 🤷 It is a handful of lines shorter, which is nice. IMO, the shlex approach is more readable, but we don't have to weigh that too heavily.

What do you think?

¹ See https://github.com/python/cpython/pull/115555#issuecomment-1948104973

serhiy-storchaka · 2024-02-16T14:42:07Z

Tools/clinic/libclinic/parser.py

+RE_C_BASENAME = re.compile(r"\bas\b\s*(?:([^-=\s]+)\s*)?")
+RE_CLONE = re.compile(r"=\s*(?:([^-=\s]+)\s*)?")


I wrote it to pass most of your tests, but perhaps \w+ or [\w.]+ is better than [^-=\s]+. It will produce a different error message for foo.bar as '', but that may be for the better.
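The difference between the two character classes for the `foo.bar as ''` case can be checked with a short sketch. `RE_STRICT` and the helper `c_basename_group` are names invented here; `RE_PERMISSIVE` mirrors `RE_C_BASENAME` from the diff above.

```python
import re

RE_NAME = re.compile(r"\s*([\w.]+)\s*")
RE_PERMISSIVE = re.compile(r"\bas\b\s*(?:([^-=\s]+)\s*)?")  # as in the diff
RE_STRICT = re.compile(r"\bas\b\s*(?:([\w.]+)\s*)?")        # suggested variant


def c_basename_group(pattern, line):
    """Return what the pattern captures as the C basename, or None."""
    pos = RE_NAME.match(line).end()  # assumes the line starts with a name
    m = pattern.match(line, pos)
    return m[1] if m else None


line = "foo.bar as ''"
print(repr(c_basename_group(RE_PERMISSIVE, line)))  # "''"
print(repr(c_basename_group(RE_STRICT, line)))      # None
```

With the permissive class, the quotes are captured and would later be rejected as an illegal C basename; with `[\w.]+`, nothing is captured, so the "no C basename provided" branch would be taken instead.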

Well, my test case might also be too contrived.

erlend-aasland · 2024-03-27T23:40:55Z

I don't have the bandwidth to follow this up now; closing the PR but keeping the local branch. Feel free to pick it up.

erlend-aasland added 2 commits February 16, 2024 10:55

Use a lexer to generate better error messages for invalid syntax

fe3b8ca

Extend test suite

c926abc

erlend-aasland requested review from sobolevn and serhiy-storchaka February 16, 2024 09:56

bedevere-app bot mentioned this pull request Feb 16, 2024

Argument Clinic: make error messages more helpful to developers #115077

Open

erlend-aasland added the skip news label Feb 16, 2024

erlend-aasland changed the title ~~gh-115077: Argument Clinic: use a lexer to generate better error message~~ gh-115077: Argument Clinic: use a lexer to generate better error messages Feb 16, 2024

erlend-aasland added 2 commits February 16, 2024 11:32

Validate cloned name post parsing

33c21e7

Remove now obsoleted comment

60553b7

serhiy-storchaka reviewed Feb 16, 2024

Use regex instead; compromise by not detecting some edge cases

9b93771

erlend-aasland changed the title ~~gh-115077: Argument Clinic: use a lexer to generate better error messages~~ gh-115077: Argument Clinic: generate better error messages when parsing function declaration Feb 16, 2024

Add parser.py

1cc7248

serhiy-storchaka reviewed Feb 16, 2024

Tools/clinic/clinic.py (resolved)

serhiy-storchaka reviewed Feb 16, 2024

erlend-aasland added 2 commits February 16, 2024 23:52

Pull in main

e8c1de3

Detect more cases of invalid syntax

3c83c5d

erlend-aasland closed this Mar 27, 2024

erlend-aasland deleted the clinic/tokenizer branch March 27, 2024 23:41
