The purpose of the Match module is to get the offsets (as well as the string between those offsets, for debugging) of a cleaned-up, tokenized string from its original, untokenized source. “Big deal,” you might say, but this is actually a pretty difficult task if the original text is sufficiently messy, not to mention rife with Unicode characters.

Consider some text, stored in a variable original_text, like:
I am writing a letter ! Sometimes,I forget to put spaces (and do weird stuff with punctuation) ? J'aurai une pomme, s'il vous plâit !
This will/should/might be properly tokenized as:
[['I', 'am', 'writing', 'a', 'letter', '!'],
['Sometimes', ',', 'I', 'forget', 'to', 'put', 'spaces', '-LRB-', 'and', 'do', 'weird', 'stuff', 'with', 'punctuation', '-RRB-', '?'],
["J'aurai", 'une', 'pomme', ',', "s'il", 'vous', 'plâit', '!']]
Now:
In [2]: import match
In [3]: match.match(original_text, ['-LRB-', 'and', 'do', 'weird', 'stuff', 'with', 'punctuation', '-RRB-'])
Out[3]: [(60, 97, '(and do weird stuff with punctuation)')]
In [4]: match.match(original_text, ['I', 'am', 'writing', 'a', 'letter', '!'])
Out[4]: [(0, 25, 'I am writing a letter !')]
In [5]: match.match(original_text, ["s'il", 'vous', 'plâit', '!'])
Out[5]: [(121, 138, "s'il vous plâit !")]
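Match's implementation isn't shown here, but a minimal sketch of the underlying idea, assuming the only differences between the tokens and the source are whitespace and PTB bracket escapes, is to join the escaped tokens with \s* and scan with a regular expression. The names below are hypothetical, and the offsets this sketch returns won't necessarily agree with Match's output above:

import re

# PTB-style escape tokens mapped back to the literal characters found in
# the original text (assumption: the real module may handle more cases).
PTB_ESCAPES = {'-LRB-': '(', '-RRB-': ')', '-LSB-': '[', '-RSB-': ']',
               '-LCB-': '{', '-RCB-': '}'}

def naive_match(original_text, tokens):
    """Return (start, end, matched_text) for each occurrence of the token
    sequence in original_text, tolerating any amount of whitespace
    (including none) between consecutive tokens."""
    literals = [PTB_ESCAPES.get(token, token) for token in tokens]
    pattern = r'\s*'.join(re.escape(literal) for literal in literals)
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, original_text)]

Joining the tokens with \s* rather than a literal space is what lets a sequence like ['Sometimes', ',', 'I'] still match the run-together "Sometimes,I" in the source.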