- From: Henry Story <henry.story@bblfish.net>
- Date: Wed, 29 Feb 2012 22:20:14 +0100
- To: public-rdf-comments@w3.org
- Cc: Alexandre Bertails <bertails@w3.org>
Thanks all for your answers to the questions I put recently on this list. They helped me finish the Scala parser: it passes all the official W3C tests (bar one).

For those of you interested, the main code for the parser is here:

  https://github.com/betehess/pimp-my-rdf/blob/d64ae11514f4bd8402c0857cb29c203ec821bd67/n3/src/main/scala/Turtle.scala

It is written following the spec very closely; indeed it might seem to be nearly a statement-for-statement transposition of the spec's EBNF (just upside down). It is asynchronous, and should use only as much memory as needed. I am sure there is a lot more to do on optimising efficiency, but this is good enough for me right now.

1. EBNF change for '.'
----------------------

There is one change to the spec I would like to argue for. The current EBNF has the following rules for prefixed names such as foaf:knows

  PrefixedName ::= PNAME_LN | PNAME_NS
  <PNAME_NS>   ::= (PN_PREFIX)? ":"
  <PNAME_LN>   ::= PNAME_NS PN_LOCAL
  <PN_PREFIX>  ::= PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?
  <PN_LOCAL>   ::= ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )?

My issue is with the definitions of PN_PREFIX and PN_LOCAL. Both are really nasty, and I don't think they give much value. They are nasty because each rule has a run of ( PN_CHARS | "." )* followed by a PN_CHARS, that is, the same character set minus the dot. This is aimed at allowing people to write prefixed names such as foaf.duck:quack but without allowing foaf.duck:quack. (with the trailing dot as part of the name). That last dot is reserved for end of sentences.

I spent a lot of time trying to implement this. Alex Hall wrote that he had trouble with this too:

> FWIW, I had trouble implementing the same PN_PREFIX rule that you cite above using Antlr, and had to use Antlr's predicated production feature to work around the greediness. So I rewrote the rule as:
>
>   fragment PN_LOCAL_CHARS : '.' | PN_CHARS ;
>   fragment PN_CHARS_SEQ :
>     ( ('.' PN_LOCAL_CHARS)=> '.' // '.' is not allowed at the end -- only match them if they're followed by another valid char
>     | PN_CHARS )* ;
>   fragment PN_PREFIX : PN_CHARS_BASE PN_CHARS_SEQ ;
>

Currently I just disallowed dots in the names, which gave me the very simple rule

  lazy val PN_PREFIX = (PN_CHARS_BASE ++ PN_CHARS.many)

I could try to spend time implementing the dotted names, but I'd rather argue against them. I really doubt that people make much use of dotted names when writing RDF by hand. I think they can make the Turtle less readable, and they also clash with the '.' notation in N3 (though that may have its own problems). So I propose dropping the dot from both rules, i.e. we just have

  <PN_PREFIX> ::= PN_CHARS_BASE PN_CHARS*
  <PN_LOCAL>  ::= ( PN_CHARS_U | [0-9] ) PN_CHARS*

2. Fixes to test suite
----------------------

I found a few bugs in the test suites. The diffs can be found here:

  https://github.com/betehess/pimp-my-rdf/commits/master/n3-test-suite/src/main/resources/www.w3.org/TR/turtle/tests

I added a test for <> as that caught me out.

3. TODO
-------

The code is open source. I tested it against Jena and Sesame using the framework

  https://github.com/betehess/pimp-my-rdf/blob/master/n3-test-suite/src/main/scala/TurtleParserTest.scala

(When testing against Jena there seem to be more bugs, perhaps something related to bnode creation.)

I am sure this can be optimised a lot further still, but it should be good enough for me at present. I welcome anyone to try it out, do some speed tests on it, and see what optimisations can be made.

All the best,

  Henry

Social Web Architect
http://bblfish.net/
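
For anyone who wants to see the trailing-dot problem from section 1 in isolation, here is a minimal sketch. It is not the code from Turtle.scala: it uses the backtracking of scala.util.matching.Regex rather than the streaming parser, and the names DottedNames, dotted, simple and prefixOf are made up for illustration. PN_CHARS_BASE and PN_CHARS are also cut down to ASCII, which is an assumption made purely for brevity.

```scala
import scala.util.matching.Regex

object DottedNames {
  // Assumption: ASCII-only stand-ins for PN_CHARS_BASE and PN_CHARS.
  // The real Turtle productions also admit many Unicode ranges.
  val base  = "[A-Za-z]"
  val chars = "[A-Za-z0-9_-]"

  // Spec-style rule: PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?
  // The greedy inner loop swallows a trailing '.', so the engine has to
  // back up one character before the final PN_CHARS can match.
  val dotted: Regex = s"$base(?:(?:$chars|\\.)*$chars)?".r

  // Simplified rule argued for in the mail: PN_CHARS_BASE PN_CHARS*
  // No dots at all, so a single forward pass is enough.
  val simple: Regex = s"$base$chars*".r

  // Longest match of `rule` anchored at the start of `input`, if any.
  def prefixOf(rule: Regex, input: String): Option[String] =
    rule.findPrefixOf(input)
}
```

With this sketch, prefixOf(dotted, "foaf.duck.") gives Some("foaf.duck") only because the regex engine backtracks over the final dot, whereas prefixOf(simple, "foaf.duck.") stops at Some("foaf"). A parser that consumes its input incrementally cannot back up so cheaply, which is where the extra lookahead in Alex's Antlr rule comes from.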
Received on Wednesday, 29 February 2012 21:20:51 UTC