The trouble with triples
Sat, 26 Mar 2016 16:29:15 +0000
The other day I had occasion to write
(defn triples-to-map [triples]
  (reduce (fn [m row]
            (update-in m
                       (butlast row)
                       (fn [old new] (if old (conj old new) [new]))
                       (last row)))
          {}
          triples))
and be surprised and delighted that it ran first time with the expected result. As witness:
foo.search=> (clojure.pprint/pprint triples_)
([:bnb:016691109 :published "2014"]
 [:bnb:016691109 :title "The Seven Streets of Liverpool"]
 [:bnb:016691109 :publisher "Orion"]
 [:bnb:016691109 :schema :shlv:Book]
 [:bnb:016691109 :author "Lee, Maureen"]
 [:bnb:016594932 :published "2013"]
 [:bnb:016594932 :title "Stephen Guy's forgotten Liverpool"]
 [:bnb:016594932 :publisher "Trinity Mirror"]
 [:bnb:016594932 :schema :shlv:Book]
 [:bnb:016594932 :author "Guy, Stephen"]
 [:bnb:016242841 :published "2012"]
 [:bnb:016242841 :title "Robbed : my Liverpool life : the Rob Jones story"]
 [:bnb:016242841 :publisher "Kids Academy Publishing"]
 [:bnb:016242841 :schema :shlv:Book]
 [:bnb:016242841 :author "Jones, Rob, 1971-"]
 [:bnb:016744037 :published "2012"]
 [:bnb:016744037 :title "Steven Gerrard : my Liverpool story"]
 [:bnb:016744037 :publisher "Headline"]
 [:bnb:016744037 :schema :shlv:Book]
 [:bnb:016744037 :author "Gerrard, Steven, 1980-"])
foo.search=> (clojure.pprint/pprint (triples-to-map triples_))
{:bnb:016691109
 {:published ["2014"],
  :title ["The Seven Streets of Liverpool"],
  :publisher ["Orion"],
  :schema [:shlv:Book],
  :author ["Lee, Maureen"]},
 :bnb:016594932
 {:published ["2013"],
  :title ["Stephen Guy's forgotten Liverpool"],
  :publisher ["Trinity Mirror"],
  :schema [:shlv:Book],
  :author ["Guy, Stephen"]},
 :bnb:016242841
 {:published ["2012"],
  :title ["Robbed : my Liverpool life : the Rob Jones story"],
  :publisher ["Kids Academy Publishing"],
  :schema [:shlv:Book],
  :author ["Jones, Rob, 1971-"]},
 :bnb:016744037
 {:published ["2012"],
  :title ["Steven Gerrard : my Liverpool story"],
  :publisher ["Headline"],
  :schema [:shlv:Book],
  :author ["Gerrard, Steven, 1980-"]}}
nil
(Now I write that code down for the second time I wonder whether using update-in is slightly overkill when I know the map will only ever be two levels deep. But that's not something I'm interested in right now.)
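For what it's worth, here is a rough sketch of what the two-level version might look like. It assumes every row is exactly a [subject predicate object] triple and a Clojure recent enough to have update (1.7 or later); the name triples-to-map-flat is made up for this illustration and is not from the original code.

```clojure
;; Hypothetical two-level variant of triples-to-map (not the author's code).
;; Assumes each row is exactly [subject predicate object].
(defn triples-to-map-flat [triples]
  (reduce (fn [m [s p o]]
            ;; (fnil conj []) starts a fresh vector the first time a
            ;; subject/predicate pair is seen, then appends after that.
            (update m s update p (fnil conj []) o))
          {}
          triples))

(triples-to-map-flat [[:a :title "x"]
                      [:a :title "y"]
                      [:b :title "z"]])
;; => {:a {:title ["x" "y"]}, :b {:title ["z"]}}
```

Whether this is actually clearer than update-in is a matter of taste; the destructuring does at least make the "exactly three elements" assumption explicit.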
What I'm interested in right now is that the input list for this function is itself the output of some other code which - mostly thanks to Instaparse - was unexpectedly easy to write. I've been playing around lately with RDF and the Semantic Web, and needed a way of parsing N-Triples - which looks superficially simple enough that Awk could do it, until you start thinking about comments and strings with spaces in them and escaped special characters and ...
Anyway, Instaparse steps in to save the day again. I believe I have written previously to give my opinion that Instaparse is awesome and I will go on record to say that this fresh experience merely serves to cement my first impression.
N-Triples has a published EBNF grammar. I had to monkey with this a bit to get it into Instaparse:
- Instaparse doesn't understand "character classes", so productions of the form [0-9] had to be rewritten in a regex form, #"[0-9]"
- there was nothing in there about whitespace or comments. I added the WS production which allows both, and scattered it into appropriate-looking places
- to simplify the code that walks the parse tree, I added a production for IRI so I could make Instaparse strip off the < and > around IRI references for me. STRING_LITERAL is likewise my creation
- #xnn notation for hex-encoded characters wasn't understood, so I swapped for \u or \x{nn}
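On the JVM the regex fragments in the grammar end up as ordinary java.util.regex patterns, so the rewritten forms can be tried out in plain Clojure without Instaparse at all. A small sketch under that assumption; the names digit, astral-char and code-point->str are invented for the example.

```clojure
;; Build a one-code-point string, including code points above U+FFFF,
;; which need two Java chars (a surrogate pair).
(defn code-point->str [cp]
  (String. (Character/toChars (int cp))))

;; The EBNF character class [0-9], rewritten as a regex.
(def digit #"[0-9]")

;; The EBNF range [#x10000-#xEFFFF], rewritten with \x{...} notation.
(def astral-char #"[\x{10000}-\x{EFFFF}]")

(re-matches digit "7")                              ;; => "7"
(re-matches astral-char "a")                        ;; => nil
(some? (re-matches astral-char
                   (code-point->str 0x1F600)))      ;; => true
```

The \x{...} form matters for the last range in PN_CHARS_BASE, because \u only takes four hex digits and so cannot name anything above U+FFFF directly.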
Here's the final result
ntriplesDoc ::= line*
line ::= WS* triple? EOL
triple ::= subject WS* predicate WS* object WS* '.' WS*
subject ::= IRIREF | BLANK_NODE_LABEL
predicate ::= IRIREF
object ::= IRIREF | BLANK_NODE_LABEL | literal
literal ::= STRING_LITERAL_QUOTED ('^^' IRIREF | LANGTAG)?
LANGTAG ::= '@' #"[a-zA-Z]"+ ('-' #"[a-zA-Z0-9]"+)*
EOL ::= #"[\n\r]"+
WS ::= #"[ \t]" | #"#.*"
IRIREF ::= '<' IRI '>'
IRI ::= (#"[^\u0000-\u0020<>\"{}|^`\\]" | UCHAR)*
STRING_LITERAL_QUOTED ::= '"' STRING_LITERAL '"'
STRING_LITERAL ::= ( #"[^\u0022\u005C\u000A\u000D]" | ECHAR | UCHAR)*
BLANK_NODE_LABEL ::= '_:' (PN_CHARS_U | #"[0-9]") ((PN_CHARS | '.')* PN_CHARS)?
UCHAR ::= '\\u' HEX HEX HEX HEX | '\\U' HEX HEX HEX HEX HEX HEX HEX HEX
ECHAR ::= "\\" #"[tbnrf\"\'\\]"
HEX ::= #"[0-9A-Fa-f]"
PN_CHARS_BASE ::= #"[A-Z]" | #"[a-z]" | #"[\u00C0-\u00D6]" | #"[\u00D8-\u00F6]" | #"[\u00F8-\u02FF]" | #"[\u0370-\u037D]" | #"[\u037F-\u1FFF]" | #"[\u200C-\u200D]" | #"[\u2070-\u218F]" | #"[\u2C00-\u2FEF]" | #"[\u3001-\uD7FF]" | #"[\uF900-\uFDCF]" | #"[\uFDF0-\uFFFD]" | #"[\x{10000}-\x{EFFFF}]"
PN_CHARS_U ::= PN_CHARS_BASE | ":" | "_"
PN_CHARS ::= PN_CHARS_U | "-" | #"[0-9]" | "\u00B7" | #"[\u0300-\u036F]" | #"[\u203F-\u2040]"
Calling insta/parse with this grammar on a sample line gets you something looking like
[:ntriplesDoc
 [:line
  [:triple
   [:subject
    [:IRIREF "<"
     [:IRI "h" "t" "t" "p" ":" "/" "/" "b" "n" "b" "." "d" "a" "t" "a" "." "b" "l" "." "u" "k" "/" "i" "d" "/" "r" "e" "s" "o" "u" "r" "c" "e" "/" "0" "1" "6" "7" "0" "6" "8" "5" "5"]
     ">"]]
   [:WS " "]
   [:predicate
    [:IRIREF "<"
     [:IRI "h" "t" "t" "p" ":" "/" "/" "l" "o" "c" "a" "l" "h" "o" "s" "t" ":" "3" "0" "3" "0" "/" "p" "u" "b" "l" "i" "s" "h" "e" "d"]
     ">"]]
   [:WS " "]
   [:object
    [:literal
     [:STRING_LITERAL_QUOTED "\"" [:STRING_LITERAL "2" "0" "1" "4"] "\""]]]
   [:WS " "]
   "."]
  [:EOL "\n"]]]
which clearly is going to need some more attention before it's usable. We do this in two passes: first we visit the entire tree node-by-node to do things like turn literal node values into strings and IRI nodes into URI objects.
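;; Assumes clojure.string is aliased as str and java.net.URI is imported.
;; prefixize is a helper that is not shown in this post.
(defn visit-node [branch]
  (if (vector? branch)
    (case (first branch)
      ;; drop the surrounding < and >, join the characters of the IRI
      :IRIREF
      (let [[_< [_iri_tok & letters] _>] (rest branch)
            iri (str/join letters)]
        (or (prefixize iri) (URI. iri)))
      :STRING_LITERAL (str/join (rest branch))
      ;; drop the surrounding quote marks
      :STRING_LITERAL_QUOTED
      (let [[_ string _] (rest branch)] string)
      :literal (second branch)
      :WS ""
      ;; \uXXXX or \UXXXXXXXX: hex digits to the character they name
      :UCHAR
      (let [[_ & hexs] (rest branch)]
        (String. (Character/toChars (Integer/parseInt (str/join (map second hexs)) 16))))
      ;; collect the [:subject s] [:predicate p] [:object o] children
      :triple
      (let [m (reduce (fn [m [k v]] (assoc m k v)) {} (rest branch))]
        [:triple [(:subject m) (:predicate m) (:object m)]])
      branch)
    branch))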
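To see why a bottom-up walk suits this job, here is a cut-down sketch that can be run without Instaparse. It handles only two node types, and the sample tree is written by hand in the shape the grammar appears to produce; mini-visit, decode-uchar and sample are names made up for this illustration.

```clojure
(require '[clojure.string :as str]
         '[clojure.walk :as walk])

(defn decode-uchar
  "Turn a seq of hex-digit strings into the string for that code point."
  [hex-digits]
  (String. (Character/toChars (Integer/parseInt (str/join hex-digits) 16))))

(defn mini-visit
  "Cut-down visitor: handles only :UCHAR and :STRING_LITERAL nodes."
  [node]
  (if (vector? node)
    (case (first node)
      ;; skip the tag and the \u marker, keep the digit inside each [:HEX d]
      :UCHAR (decode-uchar (map second (drop 2 node)))
      :STRING_LITERAL (str/join (rest node))
      node)
    node))

;; Hand-written stand-in for parser output: c, a, f, then \u00E9.
(def sample
  [:STRING_LITERAL "c" "a" "f"
   [:UCHAR "\\u" [:HEX "0"] [:HEX "0"] [:HEX "E"] [:HEX "9"]]])

(walk/postwalk mini-visit sample)
;; => "café"
```

Because postwalk visits children before their parent, the :UCHAR node has already collapsed to a one-character string by the time the enclosing :STRING_LITERAL is joined.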
Then we transform the tree into a seq and filter the seq to get only the :triple nodes. Putting it all together:
(defn parse-n-triples [in-string]
  (->> in-string
       (insta/parse n-triple-parser)
       (walk/postwalk visit-node)
       (tree-seq #(and (vector? %)
                       (keyword? (first %))
                       (not (= (first %) :triple)))
                 #(rest %))
       (filter #(= (first %) :triple))
       (map second)))
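The tree-seq step is easier to follow on a small hand-made tree. The sketch below assumes the tree has already been through the visiting pass, so each :triple node holds a single [subject predicate object] vector; walked-tree, triple-node? and extract-triples are hypothetical names, and the extra vector? check in the filter is a small robustness tweak rather than something the original does.

```clojure
;; Stand-in for the output of the visiting pass: two triples, one blank line.
(def walked-tree
  [:ntriplesDoc
   [:line "" [:triple ["s1" "p1" "o1"]] [:EOL "\n"]]
   [:line [:EOL "\n"]]
   [:line [:triple ["s2" "p2" "o2"]] [:EOL "\n"]]])

(defn triple-node? [node]
  (and (vector? node) (= (first node) :triple)))

(defn extract-triples [tree]
  (->> tree
       ;; descend into tagged vectors, but treat :triple nodes as leaves
       (tree-seq #(and (vector? %)
                       (keyword? (first %))
                       (not (triple-node? %)))
                 rest)
       (filter triple-node?)
       (map second)))

(extract-triples walked-tree)
;; => (["s1" "p1" "o1"] ["s2" "p2" "o2"])
```

Stopping the descent at :triple nodes means the walk never looks inside a triple, so whatever the subject, predicate and object have become (strings, URIs, keywords) is passed through untouched.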
I'm reasonably confident that the grammar is correct: I pushed all the official N-Triples Test Suite through it without error. My post-parsing massage passes, though, are possibly not correct and certainly not complete, which is one reason I'm just blogging about it instead of publishing it as a standalone library somewhere. Things I already know it doesn't do: blank node support, language tags, datatypes, escaped characters. Things I don't know it doesn't do: don't know. But it seems to work for my use case - of which, more later.
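Of the gaps listed, escaped characters look like the smallest one to close. A minimal sketch, assuming ECHAR nodes reach the visitor shaped like [:ECHAR "\\" "n"], which is what the grammar above suggests but has not been checked against the real parser. The names echar->char and decode-echar are invented here.

```clojure
;; Lookup from the letter after the backslash to the character it stands for,
;; covering the set allowed by the ECHAR production: t b n r f " ' \
(def echar->char
  {"t"  "\t"
   "b"  "\b"
   "n"  "\n"
   "r"  "\r"
   "f"  "\f"
   "\"" "\""
   "'"  "'"
   "\\" "\\"})

(defn decode-echar
  "Decode an assumed [:ECHAR backslash letter] node into a one-character string."
  [[_tag _backslash letter]]
  (get echar->char letter))

(decode-echar [:ECHAR "\\" "n"])
;; => "\n"
```

In the visitor this would presumably become one more case branch, along the lines of :ECHAR (decode-echar branch), so that the decoded character is in place before the enclosing STRING_LITERAL is joined.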