Comparing changes

Per request in #116, parse display name syntax also, but don't allow it unless a new allow_display_name option is set. Parsing according to the MIME specification probably isn't what's generally wanted since the use case is probably parsing inputs in email composition-like user interfaces. So it's in the spirit of a MIME message but not the letter. If display name syntax is permitted, return the unquoted/unescaped display name in the returned object.

…lies to both the local part and the domain part

* Fix error message text for input addresses without @-signs. The incorrect message was "There must be something after the @-sign.". This was broken by the changes to parse display names. Prior to that, the message was "The email address is not valid. It must have exactly one @-sign.". * Move the allow_display_name check to the end of the syntax checks. The optional checks should be the last to occur so that fatal syntax errors are raised first. * Check that display name email addresses have a closing angle bracket and nothing after. * Don't treat < + U+0338 (Combining Long Solidus Overlay) as the start of a bracketed email address. This would already be rejected because the combining character would be reported as an unsafe character at the start of the address, but it may be confusing since the caller won't see the address that way. When splitting the address into parts, skip the other special characters (@, quote, backslash) that have meaningful combining characters after them (i.e. they change under NFC normalization), although I don't think there are any such cases.

…C normalization s + U+0323 + U+0307 normalizes under NFC to U+1E69 (Latin Small Letter S With Dot Below And Dot Above) (https://www.unicode.org/reports/tr15/). We normalize when creating the returned email address info.

… prevent injection of invalid characters We encourage callers to use the normalized email address returned by validate_email (in the `normalized` attribute). This form has had Unicode NFC normalization applied to the local part. However, all of the syntactic validation on the local part was performed before the normalization. Consequently, the normalization could change the local part to become invalid by the replacement of valid characters with invalid characters or by changing the length of the local part to exceed the maximum length. Callers who use the normalized form may then unexpectedly be using an invalid address. To ensure that callers do not get an invalid address, local part syntax checks are now repeated after Unicode normalization has been applied. A user submitted one case where NFC normalization changes a local part from valid to invalid: U+037E (Greek Question Mark)'s NFC normalization is the ASCII semicolon. The former is otherwise a permitted character, but ASCII semicolons are not permitted in local parts. The user noted that the semicolon could cause the address to be reinterpreted as a list and change the recipient of a message. No other Unicode character on its own is valid (in a local part) before normalization and invalid after --- I checked every character. I am not sure if there are character sequences that are valid before but not after normalization, but I can't yet find any: I checked that no Unicode character's NFD decomposition, when valid in a local part, normalizes under NFC to a sequence that is not valid. I also could not find any examples where NFC normalization changes something to or from a period, which could also change the validity of a local part. (The string '<' or '>' plus U+0338 (Combining Long Solidus Overlay) normalizes under NFC to ≮ U+226E (Not Less-Than) and ≯ U+226F (Not Greater-Than). The two-character sequences are not valid in a local part because < and > are not valid, although they are valid after NFC normalization. These addresses were rejected before and continue to be rejected. Although < could be the start of a bracketed email address if display names are permitted, the two-character sequence is now (in an earlier commit) is ignored for the purposes of parsing display names.) There are a small number of characters whose NFC normalization increases the string length, including U+FB2C (Hebrew Letter Shin With Dagesh And Shin Dot). This could also cause the local part to become invalid after normalization where it is valid before. This is now also caught by performing the syntax check again after normalization. (The whole-address length check is similarly fixed in a later commit.) Some checks that were previously only applied after normalization, for checking safe Unicode characters, are now also applied to the un-normalized form, which also may protect callers that ignore the normalized form and use the original email address string. However, I could not find an example where normalization turns an unsafe string into a safe string. See #142.

…s string since callers may continue to use that string Previously, we checked that the ASCII email address (with IDNA ASCII) and the normalized email address satisfied the whole-address length limit. However, callers may use the original input string. Since Unicode NFC normalization typically reduces string length (if it changes the string), this can cause the post-normalization check to pass when the pre-normalization length is not valid. So we should additionally check that the original input also meets the maximum length requirement. Callers might also construct an address that has an internationalized local part and ASCII domain, maybe? So that's now checked too. The whole-address length test is revised to test each possible address format, first the original email address string (with any display name removed) so that exception messages correspond to the input string where possible. Then the normalized address is checked, since we encourage callers to use it. Then the ASCII address is checked since callers who send email without a SMTPUTF8-enabled stack will use this, or the normalized internationalized local part (there won't be an ASCII local part in this case) combined with the ASCII domain. Some length tests are added with a Unicode character whose NFC normalization is actually a decomposition: U+FB2C (Hebrew Letter Shin With Dagesh And Shin Dot) is unusual in that its NFC normalization actually expands it to multiple code points (https://www.unicode.org/faq/normalization.html). In these cases, the address will be valid before normalization but not valid after. See #142.

… the length check ourselves rather than in idna.encode

…rs as a precaution Out of caution that normalization of the domain part to internationalized characters could turn a valid domain string into an invalid one, it is re-parsed at the end to ensure that it still is validated by the idna package. I could not find any examples where that was not already caught, however, since it seems like the existing IDNA calls already prevent it. Some tests are added for invalid characters in the domain part which become invalid after Unicode NFC normalization. These were already handled. (The new code never raises an exception.) See #142.

…r Unicode NFC normalization These cases were previously handled by the call to idna.encode or idna.alabel, but the error message wasn't consistent with similar checks we do for the local part. See #142.

Commits on Jun 20, 2024

Version 2.2.0

JoshData committed Jun 20, 2024

Configuration menu

View commit details

Copy full SHA for 6589b1e

Browse repository at this point

Copy the full SHA

6589b1e View commit details

Browse the repository at this point in the history

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Comparing changes

Open a pull request

Commits on Apr 19, 2024

Commits on May 9, 2024

Commits on May 10, 2024

Commits on Jun 17, 2024

Commits on Jun 19, 2024

Commits on Jun 20, 2024

This comparison is taking too long to generate.

Uh oh!